Framework For Exploring Interconnect Level Cache Coherency


1 Framework For Exploring Interconnect Level Cache Coherency
Parvinder Pal Singh, Sr. R&D Engineer, Synopsys, India. © Accellera Systems Initiative

2 Agenda
Introduction to caches
Exploration space with hardware coherency
Current approaches
Problem explanation
Proposed methodology
How SystemC helps
Solving the problem with our methodology
Case study
Conclusion

3 Cache Coherency Basics
Caches provide a way to hide memory latencies and give every CPU the same view of data in a multi-CPU system.
[Cartoon: a CPU runs too fast for memory, so a copy of the data is kept close to it — which raises the question of who maintains consistency among the copies.]

4 Coherency Mechanism
Software based: no extra hardware cost, but complex and inefficient.
Hardware based: fast and independent of software, but requires extra hardware.

5 Exploration Space with HW Coherency
The exploration space is the set of parameters you tweak to obtain the best results:
Interface protocol: ACE, CHI, and many more
Coherency protocols: MSI, MESI, MOESI
Cache line size, bus width, directory size
Snooping mechanism and snooping type; goal of minimizing snooping traffic
Speculative fetches, shareability, interconnect rules
System-level data/analysis: utilization, read/write latencies, snoop hit/miss, etc.
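The parameters above can be captured in a single configuration record that a simulation sweep iterates over. The following is a minimal sketch; the type and field names are our own illustration, not the tool's actual API, and the default values merely echo numbers from this case study.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical exploration-space record; each simulation run takes one instance.
enum class SnoopMechanism { Broadcast, Directory };
enum class CoherencyProtocol { MSI, MESI, MOESI };

struct CoherencyConfig {
    uint32_t cache_line_bytes  = 64;      // 32 / 64 / 128 are swept in the case study
    uint32_t bus_width_bits    = 128;     // illustrative value, held constant
    uint32_t directory_entries = 35000;   // only meaningful with directory snooping
    SnoopMechanism snoop       = SnoopMechanism::Broadcast;
    CoherencyProtocol protocol = CoherencyProtocol::MESI;
    bool speculative_fetch     = false;
};
```

A sweep is then just a loop over a vector of such configs, running one simulation per entry.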

6 Current Approaches
Spreadsheet: accuracy issues and a limited or absent system-level view, leading to power and performance issues.
Hit and trial: error prone, risks a wrong design configuration, and again leads to power and performance issues.

7 Problem Explanation – System
[Block diagram: CPU0–CPU3, each with its own cache ($), connect through a coherent interconnect and a shared L2 cache to a memory bus (DRAM, ROM) and a peripheral bus (TIMER, UART).]

8 Problem Explanation – Objectives
Constraint: cache size and bus width are constant.
Primary objectives:
Average read latency for each CPU should be less than 40 cycles.
Throughput for all the CPUs should be more than 40 MB/s.
Secondary objectives:
Minimize snoop requests.
Reduce accesses to DRAM memory.
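The pass/fail check these objectives imply can be sketched as a small predicate over per-CPU measurements. The `CpuMetrics` struct and function name below are our own illustration of the check, assuming such metrics are collected from each simulation run.

```cpp
#include <cassert>
#include <vector>

// Measurements assumed to be collected per CPU from one simulation run.
struct CpuMetrics {
    double avg_read_latency_cycles;
    double throughput_mbps;
};

// Primary objectives: every CPU's average read latency < 40 cycles
// AND every CPU's throughput > 40 MB/s.
bool primary_objectives_met(const std::vector<CpuMetrics>& cpus) {
    for (const auto& m : cpus) {
        if (m.avg_read_latency_cycles >= 40.0) return false;
        if (m.throughput_mbps <= 40.0) return false;
    }
    return true;
}
```

A configuration is accepted only when this holds for all CPUs; the secondary objectives (fewer snoops, fewer DRAM accesses) then break ties between passing configurations.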

9 Problem Explanation – Exploration Space
Cache line size, directory size, domain, cache states, snooping mechanism.

10 Proposed Methodology
1. Platform assembly and workload modeling
2. Simulation sweep
3. Root-cause analysis
4. Sensitivity analysis
5. Are we done yet?
6. Hand-off
(Introduction to the PA tool.)

11 Generic Coherent Interconnect
Easy to configure for any interface protocol.
Can connect any number of initiators and targets.
Any functionality can be updated, added, or removed.
Supports multiple configurations and a system-level view.

12 How SystemC Helps
[Diagram: three masters (MASTER1–MASTER3), each with its own cache, connected through an interconnect to memory.]
Master1 generates a request payload.

13 How SystemC Helps
The payload travels through Master1's cache toward the interconnect; SystemC-TLM can trace whether the request was completed by the cache or not.

14 How SystemC Helps
The interconnect splits the payload into multiple payloads for snooping and pre-fetch.
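The splitting step can be sketched as follows. This is a simplified illustration, not the real TLM2 generic-payload API: the interconnect forks one master request into a snoop payload per peer cache, plus an optional speculative pre-fetch toward memory. The `Payload` struct and function name are our own.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for a TLM payload; origin_id is kept for later analysis.
struct Payload {
    uint64_t addr;
    int      origin_id;   // id of the requesting master
    bool     is_snoop;
    bool     is_prefetch;
};

std::vector<Payload> split_for_snooping(const Payload& req, int num_masters,
                                        bool speculative_fetch) {
    std::vector<Payload> out;
    for (int m = 0; m < num_masters; ++m) {
        if (m == req.origin_id) continue;            // don't snoop the requester
        out.push_back({req.addr, req.origin_id, true, false});
    }
    if (speculative_fetch)                           // start the memory read early,
        out.push_back({req.addr, req.origin_id, false, true}); // in parallel with snoops
    return out;
}
```

Because every forked payload carries the originating master's id, responses can later be matched back to the request that produced them.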

15 How SystemC Helps
On a cache hit the snooped cache responds; on a miss the payload waits for memory. Because the interconnect tracks each payload's ID, the responses can be combined and the model can track average completion time, throughput, memory accesses, and hits/misses per master.
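The per-master bookkeeping described above can be sketched as an accumulator keyed by master id. The struct and callback below are our own illustration of what the model tracks, under the assumption that a completion callback fires per response with its hit/miss status, size, and duration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Per-master counters accumulated as responses complete.
struct MasterStats {
    uint64_t hits = 0, misses = 0, bytes = 0;
    double   total_completion_ns = 0;
    uint64_t completed = 0;

    double avg_completion_ns() const {
        return completed ? total_completion_ns / completed : 0.0;
    }
    // Throughput in MB/s over an observation window of window_ns nanoseconds.
    double throughput_mbps(double window_ns) const {
        return window_ns > 0 ? (bytes / 1e6) / (window_ns * 1e-9) : 0.0;
    }
};

std::map<int, MasterStats> stats;   // master id -> accumulated stats

// Assumed hook: called once per completed response.
void on_response(int master_id, bool cache_hit, uint64_t bytes, double ns) {
    MasterStats& s = stats[master_id];
    (cache_hit ? s.hits : s.misses)++;
    s.bytes += bytes;
    s.total_completion_ns += ns;
    s.completed++;
}
```

At the end of a run, dumping this map per master yields exactly the quantities the slide lists: hit/miss ratio, average completion time, memory traffic, and throughput.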

16 How SystemC Helps
All of this rides on the SystemC infrastructure: TLM2/FT provides extensions, TLM2/FT provides a way to add timings, and the payload can be traced at any level.

17 Case Study – Configure System
Interconnect: cache line size, bus width, snooping mechanism, snooping type, cache states, directory size, speculative fetch.
Caches: cache line size, bus width, cache size, ways, replacement policy, states, delays.
Memory: configure the memory controller.

18 Case Study – System Analysis
Cache performance: hit/miss, average latencies.
Snoop performance: snoop requests per master, request type.
Memory transactions: transaction type, transaction count.
Transactions on different targets: count, average duration.

19 Broadcast vs Directory (Snoop Requests)
A snoop request is a request a cache receives from a peer master via the interconnect. With broadcast snooping, the snoop request count is 3468.

20 Broadcast vs Directory (Snoop Requests)
Broadcast: 3468 snoop requests. Directory: 2428 (all directory hits). Directory-based snooping reduces snoop traffic.
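The mechanism behind this reduction can be shown with a toy model (our own simplification, line-granule directory, no evictions): broadcast snoops every peer on every shareable access, while a directory snoops only the masters its sharer list records for that line.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

struct Access { uint64_t line; int master; };

// Count snoop requests generated by a trace of shareable accesses.
uint64_t count_snoops(const std::vector<Access>& trace, int num_masters,
                      bool use_directory) {
    std::map<uint64_t, std::set<int>> sharers;   // directory: line -> sharer set
    uint64_t snoops = 0;
    for (const Access& a : trace) {
        if (use_directory) {
            for (int s : sharers[a.line])
                if (s != a.master) ++snoops;     // snoop recorded sharers only
        } else {
            snoops += num_masters - 1;           // broadcast to all peers
        }
        sharers[a.line].insert(a.master);        // requester now holds the line
    }
    return snoops;
}
```

For any trace the directory count is at most the broadcast count, and the gap widens the less the masters actually share, which matches the 3468-versus-2428 result above in spirit (the slide's exact numbers come from the full simulation, not this toy).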

21 Broadcast vs Directory (Latencies)
Broadcast: CPU0 average read latency 60.779 ns. Directory: CPU0 average read latency 40.6 ns.

22 Directory Size (Snoop Requests)
[Chart: snoop requests and invalidations from the interconnect versus directory size (entries); values shown: 8577 and 460.]

23 Directory Size (Traffic)
[Chart: with a larger directory, invalidations from the interconnect drop to 717 — significantly reduced.]

24 Directory Size (Traffic)
[Cartoon: 35,000 entries — "No invalidation, I won!"; 50,000 entries — "Hold a sec."]

25 Directory Size (Latencies)
[Chart: latencies at a directory size of 50,000 entries.]

26 Directory Size (Latencies)
Secondary objective: minimize snoop requests. With 50,000 entries the latencies are 40.8 and 54.7; with 35,000 entries they are 40.8 and 58.7 — no significant improvement from the larger directory.

27 Cache Line Size (Mem Traffic)
[Chart: with a 64-byte cache line, memory transaction counts of 486 and 2369.]

28 Cache Line Size (Mem Traffic)
With a 128-byte line the counts are 491 and 2370, versus 486 and 2369 at 64 bytes — no improvement.

29 Cache Line Size (Mem Traffic)
Secondary objective: reduce accesses to DRAM memory. With a 32-byte line the counts drop to 288 and 934, from 486 and 2369 at 64 bytes — main memory accesses are reduced.
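The direction of this result can be illustrated with a toy cache model (our own idealization: never-evicting, fully associative): every miss fetches one whole line, so memory traffic is misses times line size, and for scattered or strided accesses a smaller line fetches less unused data.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct Traffic { uint64_t misses; uint64_t bytes_from_memory; };

// Replay an address trace against an idealized cache with the given line size.
Traffic simulate(const std::vector<uint64_t>& addrs, uint64_t line_bytes) {
    std::unordered_set<uint64_t> lines;   // tags of lines already cached
    Traffic t{0, 0};
    for (uint64_t a : addrs) {
        uint64_t tag = a / line_bytes;
        if (!lines.count(tag)) {          // miss: fetch the whole line
            lines.insert(tag);
            ++t.misses;
            t.bytes_from_memory += line_bytes;
        }
    }
    return t;
}
```

For a 64-byte-strided trace, 32-byte lines move fewer bytes from memory than 128-byte lines even though they miss more often; whether that wins overall depends on the workload's spatial locality, which is exactly why the case study sweeps 32/64/128.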

30 Cache Line Size (Latencies)
[Chart: with a 64-byte line, average read latency 40.8.]

31 Cache Line Size (Latencies)
Primary objective: average read latency for each CPU should be less than 40 cycles. 64-byte line: 40.8; 32-byte line: 26.5 — average latencies reduced significantly.

32 Throughput
Primary objective: throughput for all the CPUs should be more than 40 MB/s. With the 32-byte line configuration, the objective is achieved.

33 Coherency Domain
[Chart: two masters with nearly equal counts, 409 and 401 — a good candidate for the Inner Shareable domain.]

34 Final Result
Parameter            Values Taken              Final Value
Snooping mechanism   Broadcast, Directory      Directory
Directory size       10000, 35000, 50000       35000
Cache line size      128, 64, 32               32
Primary objectives: average read latency for each CPU less than 40 cycles; throughput for all the CPUs more than 40 MB/s.
Secondary objectives: minimize snoop requests; reduce accesses to DRAM memory.

35 Conclusion
Architecture definition of a cache-coherent interconnect is challenging: with many design parameters, performance is difficult to predict.
Use SystemC-TLM2 modeling for early quantitative analysis.
Leverage a generic, configurable performance model.
Systematically explore the impact of architecture options.
Use in-depth analysis to understand the root cause of issues.

36 Questions

