Framework For Exploring Interconnect Level Cache Coherency
Parvinder Pal Singh, Sr. R&D Engineer, Synopsys, India
Email: ppsingh@synopsys.com
© Accellera Systems Initiative
Agenda
- Introduction to caches
- Exploration space with hardware coherency
- Current approaches
- Problem explanation
- Proposed methodology
- How SystemC helps solve the problem with our methodology
- Case study
- Conclusion
Cache Coherency Basics
- Caches provide a way to hide memory latencies
- They must preserve the same view of data in a multi-CPU system
[Figure: cartoon of CPUs and memory — a fast CPU keeps a copy of memory data close to itself, raising the questions of who maintains consistency and what happens when other CPUs want the same data]
Coherency Mechanism
- Software based: no extra hardware cost, but complex and inefficient
- Hardware based: fast and independent of software, but requires extra hardware
Exploration Space with HW Coherency
The exploration space is the set of parameters you tweak to obtain the best results:
- Interface protocol: ACE, CHI, and many more
- Coherency protocol: MSI, MESI, MOESI
- Interconnect rules: cache line size, bus width, snooping mechanism and type, speculative fetches, directory size, shareability, minimizing snoop traffic
- System-level data/analysis: utilization, read/write latencies, snoop hit/miss, etc.
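As a sketch, the exploration space above can be captured as a single parameter record that a simulation sweep iterates over. This is plain C++, and all names and defaults are illustrative, not taken from the actual tool:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical sketch of one point in the exploration space.
enum class SnoopMechanism { Broadcast, Directory };
enum class CoherencyProtocol { MSI, MESI, MOESI };

struct ExplorationPoint {
    uint32_t cache_line_bytes  = 64;     // cache line size
    uint32_t bus_width_bits    = 128;    // interconnect bus width
    uint32_t directory_entries = 10000;  // directory size (if directory-based)
    bool     speculative_fetch = false;  // speculative fetches on/off
    SnoopMechanism    snoop    = SnoopMechanism::Broadcast;
    CoherencyProtocol protocol = CoherencyProtocol::MESI;
    std::string interface_protocol = "ACE"; // e.g. ACE or CHI
};
```

A sweep then simulates each `ExplorationPoint` and records the system-level metrics listed above for comparison.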
Current Approaches
- Spreadsheet: accuracy issues; limited or no system-level view; leads to power and performance issues
- Trial and error: error prone; risks a wrong design configuration; leads to power and performance issues
Problem Explanation – System
[Figure: four CPUs (CPU0–CPU3), each with a private cache ($), connected to a coherent interconnect alongside a shared L2 cache; a memory bus to DRAM and ROM; and a peripheral bus to a timer and UART]
Problem Explanation – Objectives
Constraint: cache size and bus width are held constant.
- Primary objectives: average read latency for each CPU below 40 cycles; throughput for all CPUs above 40 MB/s
- Secondary objectives: minimize snoop requests; reduce accesses to DRAM memory
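The primary objectives can be phrased as a simple pass/fail predicate over per-CPU measurements. The thresholds are from the slide; the struct and function names are illustrative:

```cpp
#include <cassert>

// Per-CPU measurements from one simulation run (hypothetical names).
struct CpuMetrics {
    double avg_read_latency_cycles; // primary objective: < 40
    double throughput_mbps;         // primary objective: > 40
};

// True only when both primary objectives hold for this CPU.
inline bool meets_primary_objectives(const CpuMetrics& m) {
    return m.avg_read_latency_cycles < 40.0 && m.throughput_mbps > 40.0;
}
```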
Problem Explanation – Exploration Space
- Cache line size
- Directory size
- Coherency domain
- Cache states
- Snooping mechanism
Proposed Methodology
1. Platform assembly and workload modeling
2. Simulation sweep
3. Root-cause analysis
4. Sensitivity analysis
5. Are we done yet?
6. Hand-off
Generic Coherent Interconnect
- Easy to configure for any interface protocol
- Can connect any number of initiators and targets
- Any functionality can be updated, added, or removed
- Supports multiple configurations
- Provides a system-level view
How SystemC Helps
[Figure: three masters (MASTER1–MASTER3), each with a cache, connected through an interconnect to memory]
Master1 generates a request payload, which travels through its cache toward the interconnect.
How SystemC Helps
SystemC-TLM can trace whether a request was completed by a cache or had to travel on to the interconnect and memory.
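The tracing idea can be sketched in plain C++ (standing in for the SystemC-TLM payload): each payload carries an ID, and whichever component completes it records where it was satisfied, so analysis code can query the trace after simulation. All names here are illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Where a request was ultimately satisfied.
enum class CompletedBy { None, LocalCache, PeerCache, Memory };

// Minimal stand-in for a TLM generic payload.
struct Payload {
    uint64_t id;
    uint64_t address;
    CompletedBy completed_by = CompletedBy::None;
};

// ID-keyed trace that analysis code inspects after simulation.
using CompletionTrace = std::unordered_map<uint64_t, CompletedBy>;

inline void record_completion(CompletionTrace& trace, const Payload& p) {
    trace[p.id] = p.completed_by;
}
```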
How SystemC Helps
On a miss, the interconnect splits the payload into multiple payloads: one to snoop each peer cache, and one for the speculative pre-fetch toward memory.
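The fan-out step can be sketched as follows (plain C++; the struct and function names are illustrative, and the real model would issue TLM transactions rather than return a vector):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One of the payloads produced by splitting a request.
struct SubPayload {
    uint64_t parent_id;   // ties the split payloads back to the request
    int      target;      // peer cache index, or -1 for memory
    bool     speculative; // true for the pre-fetch toward memory
};

// Fan a request out into one snoop payload per peer cache, plus an
// optional speculative fetch toward memory.
inline std::vector<SubPayload> split_for_snoop(uint64_t parent_id,
                                               int num_peers,
                                               bool speculative_fetch) {
    std::vector<SubPayload> out;
    for (int i = 0; i < num_peers; ++i)
        out.push_back({parent_id, i, false});
    if (speculative_fetch)
        out.push_back({parent_id, -1, true});
    return out;
}
```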
How SystemC Helps
The interconnect tracks each payload's ID, waits while the snoops report cache hit or cache miss, and combines the responses before replying. Because the ID is tracked for analysis, we can measure average completion time, throughput, memory accesses, and hits/misses per master.
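The ID-keyed bookkeeping can be sketched like this (plain C++; issue/complete times are in abstract simulation-time units, and all names are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Per-transaction record kept by the interconnect.
struct TxnStats {
    uint64_t start_time = 0;
    uint64_t end_time   = 0;
    uint32_t bytes      = 0;
};

class TxnTracker {
public:
    // Called when a request payload enters the interconnect.
    void on_issue(uint64_t id, uint64_t now, uint32_t bytes) {
        stats_[id] = {now, 0, bytes};
    }
    // Called when the combined response is returned to the master.
    void on_complete(uint64_t id, uint64_t now) {
        stats_[id].end_time = now;
    }
    // Average latency over all completed transactions.
    double average_latency() const {
        uint64_t total = 0, n = 0;
        for (const auto& kv : stats_) {
            if (kv.second.end_time != 0) {
                total += kv.second.end_time - kv.second.start_time;
                ++n;
            }
        }
        return n ? double(total) / double(n) : 0.0;
    }
private:
    std::unordered_map<uint64_t, TxnStats> stats_;
};
```

The same table can feed throughput (bytes over elapsed time) and per-master hit/miss counters.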
How SystemC Helps
- The SystemC infrastructure lets the payload be traced at any level
- TLM2/FT provides extensions for attaching metadata to a payload
- TLM2/FT provides a way to add timing annotations
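The extension idiom can be sketched in plain C++. The real mechanism is `tlm::tlm_extension<T>` attached to a `tlm_generic_payload`; this stand-in mimics the shape so the idea is self-contained, and every name below is illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Coherency metadata that travels with the payload through every
// level of the hierarchy (a stand-in for a TLM2 extension).
struct CoherencyExt {
    bool     is_snoop      = false; // payload produced by snoop fan-out
    bool     cache_hit     = false; // filled in by whichever cache answers
    uint64_t issue_time_ps = 0;     // timing annotation, as TLM2 allows
};

// A payload with an optional extension slot.
struct PayloadWithExt {
    uint64_t address = 0;
    std::unique_ptr<CoherencyExt> ext;
};
```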
Case Study – Configure System
- Interconnect: cache line size, bus width, snooping mechanism and type, cache states, directory size, speculative fetch
- Caches: cache line size, bus width, cache size, ways, replacement policy, states, delays
- Configure the memory controller
Case Study – System Analysis
- Average latencies
- Cache performance: hits/misses
- Snoop performance: snoop requests per master, request type
- Memory transactions: transaction type and count
- Transactions on different targets: count, average duration
Broadcast vs Directory (Snoop Requests)
A snoop request is a request received from a peer master or from the interconnect. With broadcast snooping, the system sees 3468 snoop requests.
Broadcast vs Directory (Snoop Requests)
Broadcast: 3468 snoop requests. Directory: 2428, with reduced traffic (all hits). Directory-based snooping results in fewer snoop requests.
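The reason a directory cuts snoop traffic can be sketched in a few lines: broadcast snoops every peer on every request, while a directory only snoops the caches it has recorded as sharers of the line. This is plain illustrative C++, not the model's actual code:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Directory: cache line address -> indices of caches holding the line.
using Directory = std::unordered_map<uint64_t, std::vector<int>>;

// Broadcast: every peer is snooped regardless of who holds the line.
inline int snoops_broadcast(int num_peers) { return num_peers; }

// Directory: only the recorded sharers are snooped (possibly none).
inline int snoops_directory(const Directory& dir, uint64_t line) {
    auto it = dir.find(line);
    return it == dir.end() ? 0 : int(it->second.size());
}
```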
Broadcast vs Directory (Latencies)
Broadcast: CPU0 average read latency 60.779 ns. Directory: CPU0 40.6 ns.
Directory Size (Snoop Requests)
[Chart, directory size of 10000 entries: values 8577 and 460 shown for snoop requests and invalidations from the interconnect]
Directory Size (Traffic)
[Chart, directory size of 35000 entries: value 717 shown] Invalidations from the interconnect are significantly reduced.
Directory Size (Traffic)
At 35000 entries there are no invalidations — it looks like the winner. But hold on: is a 50000-entry directory worth checking first?
Directory Size (Latencies)
[Chart: latencies with a 50000-entry directory]
Directory Size (Latencies)
Secondary objective: minimize snoop requests. Directory size 50000: 40.8 and 54.7; size 35000: 40.8 and 58.7. The larger directory brings no significant improvement, so 35000 entries is chosen.
Cache Line Size (Mem Traffic)
Line size 64: memory transaction counts of 486 and 2369.
Cache Line Size (Mem Traffic)
Line size 64: 486 and 2369; line size 128: 491 and 2370 — no improvement.
Cache Line Size (Mem Traffic)
Secondary objective: reduced accesses to DRAM memory. Line size 64: 486 and 2369; line size 32: 288 and 934 — main memory accesses are reduced.
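A back-of-envelope view of the trade-off: each main-memory access moves one full cache line, so total DRAM bytes are accesses times line size. The access counts below are from the slides; the formula itself is a simplification (it ignores write-backs and prefetches):

```cpp
#include <cassert>
#include <cstdint>

// Bytes moved to/from DRAM for a given access count and line size.
inline uint64_t dram_bytes_moved(uint64_t accesses, uint32_t line_bytes) {
    return accesses * uint64_t(line_bytes);
}
```

With 64-byte lines and 2369 accesses versus 32-byte lines and 934 accesses, the smaller line size moves far fewer bytes as well as making fewer accesses.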
Cache Line Size (Latencies)
Line size 64: average read latency 40.8.
Cache Line Size (Latencies)
Primary objective: average read latency for each CPU below 40 cycles. Line size 64: 40.8; line size 32: 26.5 — average latencies reduced significantly.
Throughput
Primary objective: throughput for all CPUs above 40 MB/s. With line size 32, the objective is achieved.
Coherency Domain
[Chart: 409 vs 401] A good candidate for the Inner Shareable domain.
Final Result

Parameter            Values Taken            Final Value
Snooping Mechanism   Broadcast/Directory     Directory
Directory Size       10000, 35000, 50000     35000
Cache Line Size      128, 64, 32             32

Primary objectives met: average read latency for each CPU below 40 cycles; throughput for all CPUs above 40 MB/s.
Secondary objectives met: minimized snoop requests; reduced accesses to DRAM memory.
Conclusion
- Architecture definition of a cache-coherent interconnect is challenging: many design parameters, and performance is difficult to predict
- Use SystemC-TLM2 modeling for early quantitative analysis
- Leverage a generic, configurable performance model
- Systematically explore the impact of architecture options
- Use in-depth analysis to understand the root cause of issues
Questions