Published by Felicity Harrell. Modified over 6 years ago.
1
Framework For Exploring Interconnect Level Cache Coherency
Parvinder Pal Singh Sr. R & D Engineer, Synopsys, India. id: © Accellera Systems Initiative
2
Agenda
  Introduction to caches
  Exploration space with hardware coherency
  Current approaches
  Problem explanation
  Proposed methodology
  How SystemC helps
  Solving the problem with our methodology
  Case study
  Conclusion
3
Cache Coherency Basics
Caches provide a way to hide memory latencies.
Coherency gives every CPU the same view of the data in a multi-CPU system.
Cartoon speech bubbles: "I run too fast, but memory doesn't." "What can I do?" "Take my copy closer to you." "I am loving it." "What about me?" "Who will maintain the consistency?"
4
Coherency Mechanism
Software based:
  No extra cost
  Complex and inefficient
Hardware based:
  Fast
  Independent of software
  Extra hardware
5
Exploration space with HW coherency
Exploration space: the parameter sets you tweak to obtain the best results.
Interconnect rules: cache line size, snooping mechanism, speculative fetches, different bus widths, directory size, snooping type
Interface protocols: ACE, CHI, many more
Coherency protocols: MSI, MESI, MOESI
System-level data/analysis: utilization, read/write latencies, snoop hit/miss, etc.
Also shown: minimize snooping traffic, shareability
Speaker note: talk about what the exploration space is, then talk about each parameter.
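To make the "coherency protocols" entry concrete, here is a minimal sketch of the textbook MSI next-state table for a single cache line in one cache. The names `State`, `Event`, `next_state` and `run` are illustrative and do not come from the presentation or from any Synopsys model; MESI and MOESI add further states that this sketch omits.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"


class Event(Enum):
    LOCAL_READ = 1    # this cache's own master reads the line
    LOCAL_WRITE = 2   # this cache's own master writes the line
    SNOOP_READ = 3    # another master reads the line
    SNOOP_WRITE = 4   # another master requests exclusive ownership


# Textbook MSI transitions (assumption: write-back, write-invalidate protocol).
NEXT_STATE = {
    (State.INVALID, Event.LOCAL_READ): State.SHARED,
    (State.INVALID, Event.LOCAL_WRITE): State.MODIFIED,
    (State.INVALID, Event.SNOOP_READ): State.INVALID,
    (State.INVALID, Event.SNOOP_WRITE): State.INVALID,
    (State.SHARED, Event.LOCAL_READ): State.SHARED,
    (State.SHARED, Event.LOCAL_WRITE): State.MODIFIED,
    (State.SHARED, Event.SNOOP_READ): State.SHARED,
    (State.SHARED, Event.SNOOP_WRITE): State.INVALID,
    (State.MODIFIED, Event.LOCAL_READ): State.MODIFIED,
    (State.MODIFIED, Event.LOCAL_WRITE): State.MODIFIED,
    (State.MODIFIED, Event.SNOOP_READ): State.SHARED,    # dirty data is written back first
    (State.MODIFIED, Event.SNOOP_WRITE): State.INVALID,  # dirty data is written back first
}


def next_state(state, event):
    """Return the state of the line after one event."""
    return NEXT_STATE[(state, event)]


def run(events, start=State.INVALID):
    """Apply a sequence of events to one line and return its final state."""
    state = start
    for event in events:
        state = next_state(state, event)
    return state
```

For example, `run([Event.LOCAL_WRITE, Event.SNOOP_READ])` leaves the line in `State.SHARED`, because a peer's read downgrades the modified copy.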
6
Current Approaches
Spreadsheet:
  Accuracy issue
  Limited view / no system-level view
  Power and performance issues
Hit and trial:
  Error prone
  Wrong design configuration
  Power and performance issues
7
Problem Explanation – System
Block diagram: CPU0, CPU1, CPU2 and CPU3 (each with a cache, $), COHERENT INTERCONNECT, L2 CACHE, MEM BUS (DRAM, ROM), PERIPH BUS (TIMER, UART).
8
Problem Explanation – Objectives
Constraint: cache size and bus width are constant.
Primary objectives:
  Average read latency for each CPU should be less than 40 cycles
  Throughput for all CPUs should be more than 40 MB/s
Secondary objectives:
  Minimize snoop requests
  Reduce accesses to DRAM memory
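The two primary objectives are simple pass/fail thresholds, so they can be checked mechanically after every simulation run. The sketch below is one possible way to do that; `CpuResult`, `meets_primary_objectives` and `failing_cpus` are hypothetical names. It assumes the measured latency is expressed in the same unit as the limit (the objective is stated in cycles, while some result slides report ns).

```python
from dataclasses import dataclass


@dataclass
class CpuResult:
    name: str
    avg_read_latency: float   # assumed to be in the same unit as the limit
    throughput_mb_s: float


def _passes(result, max_latency, min_throughput):
    # Both limits are strict, matching "less than" and "more than" on the slide.
    return (result.avg_read_latency < max_latency
            and result.throughput_mb_s > min_throughput)


def meets_primary_objectives(results, max_latency=40.0, min_throughput=40.0):
    """True only if every CPU satisfies both primary objectives."""
    return all(_passes(r, max_latency, min_throughput) for r in results)


def failing_cpus(results, max_latency=40.0, min_throughput=40.0):
    """Names of the CPUs that miss at least one primary objective."""
    return [r.name for r in results
            if not _passes(r, max_latency, min_throughput)]
```

The secondary objectives (snoop requests, DRAM accesses) are minimization goals rather than thresholds, so they are better compared across runs than tested against a fixed limit.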
9
Problem Explanation – Exploration Space
Parameters explored:
  Cache line
  Directory size
  Domain
  Cache states
  Snooping mechanism
10
Proposed Methodology
1. Platform assembly and workload modeling
2. Simulation sweep
3. Root-cause analysis
4. Sensitivity analysis
5. Are we done yet?
6. Hand-off
Speaker note: introduction to the PA tool.
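Step 2, the simulation sweep, amounts to running the platform once per point in the parameter space. A minimal sketch, assuming the three parameters and value sets that the deck's final-result slide reports; `enumerate_configs`, `run_sweep` and the `simulate` callback are hypothetical names, not part of any tool mentioned in the presentation.

```python
from itertools import product

# Parameter values as listed on the final-result slide of the deck.
SWEEP = {
    "snooping_mechanism": ["broadcast", "directory"],
    "directory_size": [10000, 35000, 50000],
    "cache_line_size": [128, 64, 32],
}


def enumerate_configs(space):
    """Yield one dict per combination of parameter values (full cartesian product)."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def run_sweep(space, simulate):
    """Call simulate(config) for every configuration and collect (config, result) pairs."""
    return [(cfg, simulate(cfg)) for cfg in enumerate_configs(space)]
```

The plain product yields 18 configurations, including broadcast runs in which the directory size has no effect; a real sweep would likely prune those.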
11
Generic Coherent Interconnect
  Easy to configure for any interface protocol
  Can connect any number of initiators and targets
  Update/add/remove any functionality
  Multiple configurations
  System-level view
12
How SystemC Helps
Diagram: MASTER1, MASTER2 and MASTER3, three CACHE blocks, INTERCONNECT, MEMORY.
Master1 generates a request payload.
13
How SystemC Helps
Diagram: MASTER1, MASTER2 and MASTER3, three CACHE blocks, INTERCONNECT, MEMORY, with the payload shown at a cache.
SystemC-TLM can trace whether or not the request was completed by the cache.
14
How SystemC Helps
Diagram: MASTER1, MASTER2 and MASTER3, three CACHE blocks, INTERCONNECT, MEMORY, with several payloads leaving the interconnect.
The interconnect splits the payload into multiple payloads for snooping and pre-fetch.
15
How SystemC Helps
Diagram: MASTER1, MASTER2 and MASTER3, three CACHE blocks (one cache miss, one cache hit), INTERCONNECT with a payload waiting while responses are combined, MEMORY.
All because of the payload ("I am smart, I can track the ID for analysis"):
  Can track average completion time
  Can track throughput etc.
  Can track memory accesses
  Can track hit/miss per master
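The split-and-combine behaviour described on these slides can be sketched without SystemC. The code below is plain Python, not the TLM-2.0 API: `Payload`, `Interconnect`, `split_for_snoop` and `respond` are hypothetical names, and the ID bookkeeping stands in for whatever the real model attaches to its transactions.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

_ids = count(1)  # simple unique-ID source for this sketch


@dataclass
class Payload:
    address: int
    master: str
    parent_id: Optional[int] = None  # set on child payloads created by the interconnect
    id: int = field(default_factory=lambda: next(_ids))


class Interconnect:
    def __init__(self, masters):
        self.masters = list(masters)
        self.pending = {}    # parent id -> ids of child snoops still outstanding
        self.responses = {}  # parent id -> hit flags received so far

    def split_for_snoop(self, request):
        """Create one child snoop payload per peer master (broadcast style)."""
        children = [Payload(request.address, m, parent_id=request.id)
                    for m in self.masters if m != request.master]
        self.pending[request.id] = {c.id for c in children}
        self.responses[request.id] = []
        return children

    def respond(self, child, hit):
        """Record one snoop response.

        Returns None while the parent is still waiting, otherwise the
        combined result: True if any peer cache hit.
        """
        self.pending[child.parent_id].discard(child.id)
        self.responses[child.parent_id].append(hit)
        if self.pending[child.parent_id]:
            return None
        return any(self.responses[child.parent_id])
```

Because every child carries its parent's ID, per-request statistics such as completion time or hit/miss per master can be attributed to the originating request once the responses are combined.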
16
How SystemC Helps
Diagram: MASTER1, MASTER2 and MASTER3, three CACHE blocks (one cache miss, one cache hit), INTERCONNECT with a payload waiting, MEMORY.
And the SystemC infrastructure:
  TLM2/FT provides extensions
  TLM2/FT provides a way to add timings
  The payload can be traced at any level
17
Case Study – Configure System
Configurable parameters, in the two groups shown on the slide:
  Cache line size, bus width, snooping mechanism, cache states, directory size, snooping type, speculative fetch
  Cache line size, bus width, cache size, ways, replacement policy, states, delays
Configure the memory controller
18
Case Study – System Analysis
Average latencies
Cache performance: hit/miss
Snoop performance: per-master snoop requests, request type
Memory transactions: transaction type, transaction count
Transactions on different targets: count, average duration
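Hit/miss statistics of the kind listed above depend directly on the cache parameters being configured (size, line size, ways, replacement policy). As an illustration only, here is a small set-associative cache model with LRU replacement; `SetAssociativeCache` is a hypothetical name and LRU is an assumed policy, since the slides do not say which policies the real model supports.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Tag-only cache model that counts hits and misses (no data, no coherency)."""

    def __init__(self, size_bytes, line_size, ways):
        if size_bytes % (line_size * ways) != 0:
            raise ValueError("size must be a multiple of line_size * ways")
        self.line_size = line_size
        self.ways = ways
        self.num_sets = size_bytes // (line_size * ways)
        # One ordered dict per set; insertion order doubles as LRU order.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Look up an address; return True on a hit, False on a miss."""
        line = address // self.line_size
        index = line % self.num_sets
        tag = line // self.num_sets
        ways = self.sets[index]
        if tag in ways:
            ways.move_to_end(tag)      # mark as most recently used
            self.hits += 1
            return True
        if len(ways) >= self.ways:
            ways.popitem(last=False)   # evict the least recently used tag
        ways[tag] = True
        self.misses += 1
        return False
```

Giving each master its own instance, for example a dict keyed by CPU name, yields the per-master hit/miss counts that the analysis view reports.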
19
Broadcast vs Directory (Snoop Requests)
Broadcast: 3468 snoop requests
Speaker note: explain what a snoop request is: a request received from a peer master and the interconnect.
20
Broadcast vs Directory (Snoop Requests)
Broadcast: 3468 snoop requests
Directory: 2428 snoop requests
Reduced traffic (all hit)
Directory-based snooping results in fewer snoop requests.
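Why a directory lowers the snoop count can be shown with a toy model: broadcast snoops every peer on each request, while a directory snoops only the masters recorded as holding the line. The functions below are an illustrative sketch with hypothetical names; the trace contains reads only, and the directory is assumed to be unbounded and never to lose a sharer.

```python
def broadcast_snoops(masters, requester):
    """Broadcast: every peer master is snooped."""
    return [m for m in masters if m != requester]


def directory_snoops(directory, address, requester):
    """Directory: only masters recorded as sharers of the line are snooped."""
    return sorted(m for m in directory.get(address, set()) if m != requester)


def count_snoops(trace, masters, use_directory):
    """Count snoop requests for a trace of (master, address) read accesses."""
    directory = {}  # line address -> set of masters holding the line
    total = 0
    for master, address in trace:
        if use_directory:
            targets = directory_snoops(directory, address, master)
        else:
            targets = broadcast_snoops(masters, master)
        total += len(targets)
        directory.setdefault(address, set()).add(master)
    return total
```

With the counts reported on the slide, 3468 versus 2428, directory-based snooping issues roughly 30% fewer snoop requests.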
21
Broadcast vs Directory (Latencies)
Broadcast: CPU0 60.779 ns
Directory: CPU0 40.6 ns
22
Directory Size (Snoop Requests)
Chart: directory size (number of entries) against invalidations from the interconnect. Values shown: 8577, 460, 2.
23
Directory Size (Traffic)
Chart: directory size (number of entries) against invalidations from the interconnect. Values shown: 717, 3, 4.
Significantly reduced invalidates from the interconnect.
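The slides do not say why the interconnect issues invalidations. One common cause in directory-based designs, assumed here, is that a full directory must evict an entry and invalidate the cached copies it was tracking. The sketch below models only that effect; `Directory`, `record` and `invalidations_for` are hypothetical names, and LRU eviction is an assumption.

```python
from collections import OrderedDict


class Directory:
    """Bounded directory; evicting an entry invalidates all of its sharers."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # line address -> set of sharers, in LRU order
        self.invalidations = 0

    def record(self, address, master):
        """Note that `master` now holds the line at `address`."""
        if address in self.entries:
            self.entries.move_to_end(address)
        else:
            if len(self.entries) >= self.capacity:
                _victim, sharers = self.entries.popitem(last=False)
                self.invalidations += len(sharers)  # one invalidate per sharer
            self.entries[address] = set()
        self.entries[address].add(master)


def invalidations_for(capacity, trace):
    """Replay a trace of (master, address) accesses and count invalidations."""
    directory = Directory(capacity)
    for master, address in trace:
        directory.record(address, master)
    return directory.invalidations
```

In this model the invalidation count drops to zero once the directory is large enough to hold the whole working set, which is consistent with the trend the slides report for larger directory sizes.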
24
Directory Size (Traffic)
Directory size 35000: "No invalidation, I won."
Directory size 50000: "Hold a sec."
25
Directory Size (Latencies)
Directory size: 50000
26
Directory Size (Latencies)
Objective: minimize snoop requests
Directory size 50000: 40.8, 54.7
Directory size 35000: 40.8, 58.7
No significant improvement
27
Cache Line Size (Mem Traffic)
Cache line size 64: 486, 2369
28
Cache Line Size (Mem Traffic)
Cache line size 64: 486, 2369
Cache line size 128: 491, 2370
No improvement
29
Cache Line Size (Mem Traffic)
Objective: reduced accesses to DRAM memory
Cache line size 64: 486, 2369
Cache line size 32: 288, 934
Reduced main memory access
30
Cache Line Size (Latencies)
Cache line size 64: 40.8
31
Cache Line Size (Latencies)
Objective: average read latency for each CPU should be less than 40 cycles
Cache line size 64: 40.8
Cache line size 32: 26.5
Average latencies reduced significantly
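The size of the improvement can be quantified with a one-line calculation. The helper below is a sketch with a hypothetical name, `relative_reduction`; it uses the two latency values shown on the slide and assumes they share a unit.

```python
def relative_reduction(before, after):
    """Fractional reduction from `before` to `after` (0.25 means 25% lower)."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (before - after) / before


# Latency values taken from the slide for cache line sizes 64 and 32.
latency_line_64 = 40.8
latency_line_32 = 26.5
latency_reduction = relative_reduction(latency_line_64, latency_line_32)
```

Here `latency_reduction` comes out at about 0.35, that is, the average read latency is roughly 35% lower with the 32-byte line.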
32
Throughput(Latencies)
Objective: throughput for all CPUs should be more than 40 MB/s
Cache line size 32: objective achieved
33
Coherency Domain
Values shown: 409, 401
Good candidate for Inner Shareable
34
Final Result
Parameter            Values Taken           Final Value
Snooping Mechanism   Broadcast/Directory    Directory
Directory Size       10000, 35000, 50000    35000
Cache Line Size      128, 64, 32            32
Primary objectives:
  Average read latency for each CPU should be less than 40 cycles
  Throughput for all CPUs should be more than 40 MB/s
Secondary objectives:
  Minimize snoop requests
  Reduced accesses to DRAM memory
35
Conclusion
Architecture definition of a cache-coherent interconnect is challenging
  Many design parameters, performance difficult to predict
Use SystemC-TLM2 modeling for early quantitative analysis
  Leverage a generic, configurable performance model
  Systematically explore the impact of architecture options
  In-depth analysis to understand the root cause of issues
36
Questions