Framework For Exploring Interconnect Level Cache Coherency


1 Framework For Exploring Interconnect Level Cache Coherency
Parvinder Pal Singh, Sr. R&D Engineer, Synopsys, India. © Accellera Systems Initiative

2 Agenda
Introduction to caches
Exploration space with hardware coherency
Current approaches
Problem explanation
Proposed methodology
How SystemC helps
Solving the problem with our methodology
Case study
Conclusion

3 Cache Coherency Basics
Caches provide a way to hide memory latencies and give every CPU the same view of data in a multi-CPU system.
[Cartoon: a CPU runs too fast for memory, so a copy of the data is kept close to it — which raises the question of who maintains consistency among the copies.]

4 Coherency Mechanism
Software based: no extra hardware cost, but complex and inefficient.
Hardware based: fast and independent of software, but requires extra hardware.

5 Exploration Space with HW Coherency
The exploration space is the set of parameters you tweak to obtain the best results:
Interface protocol: ACE, CHI, and many more
Coherency protocols: MSI, MESI, MOESI
Cache line size, bus width, directory size
Snooping mechanism and snooping type; goal of minimizing snooping traffic
Speculative fetches, shareability, interconnect rules
System-level data/analysis: utilization, read/write latencies, snoop hit/miss, etc.
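The parameters above can be captured in a single configuration record that a simulation sweep iterates over. The following is a minimal sketch; the type and field names are our own illustration, not the tool's actual API, and the default values merely echo numbers from this case study.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical exploration-space record; each simulation run takes one instance.
enum class SnoopMechanism { Broadcast, Directory };
enum class CoherencyProtocol { MSI, MESI, MOESI };

struct CoherencyConfig {
    uint32_t cache_line_bytes  = 64;      // 32 / 64 / 128 are swept in the case study
    uint32_t bus_width_bits    = 128;     // illustrative value, held constant
    uint32_t directory_entries = 35000;   // only meaningful with directory snooping
    SnoopMechanism snoop       = SnoopMechanism::Broadcast;
    CoherencyProtocol protocol = CoherencyProtocol::MESI;
    bool speculative_fetch     = false;
};
```

A sweep is then just a loop over a vector of such configs, running one simulation per entry.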

6 Current Approaches
Spreadsheet: accuracy issues and a limited or absent system-level view, leading to power and performance issues.
Hit and trial: error prone, risks a wrong design configuration, and again leads to power and performance issues.

7 Problem Explanation – System
[Block diagram: CPU0–CPU3, each with its own cache ($), connect through a coherent interconnect and a shared L2 cache to a memory bus (DRAM, ROM) and a peripheral bus (TIMER, UART).]

8 Problem Explanation – Objectives
Constraint: cache size and bus width are constant.
Primary objectives:
Average read latency for each CPU should be less than 40 cycles.
Throughput for all the CPUs should be more than 40 MB/s.
Secondary objectives:
Minimize snoop requests.
Reduce accesses to DRAM memory.
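The pass/fail check these objectives imply can be sketched as a small predicate over per-CPU measurements. The `CpuMetrics` struct and function name below are our own illustration of the check, assuming such metrics are collected from each simulation run.

```cpp
#include <cassert>
#include <vector>

// Measurements assumed to be collected per CPU from one simulation run.
struct CpuMetrics {
    double avg_read_latency_cycles;
    double throughput_mbps;
};

// Primary objectives: every CPU's average read latency < 40 cycles
// AND every CPU's throughput > 40 MB/s.
bool primary_objectives_met(const std::vector<CpuMetrics>& cpus) {
    for (const auto& m : cpus) {
        if (m.avg_read_latency_cycles >= 40.0) return false;
        if (m.throughput_mbps <= 40.0) return false;
    }
    return true;
}
```

A configuration is accepted only when this holds for all CPUs; the secondary objectives (fewer snoops, fewer DRAM accesses) then break ties between passing configurations.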

9 Problem Explanation – Exploration Space
Cache line size, directory size, domain, cache states, snooping mechanism.

10 Proposed Methodology
1. Platform assembly and workload modeling
2. Simulation sweep
3. Root-cause analysis
4. Sensitivity analysis
5. Are we done yet?
6. Hand-off
(Introduction to the PA tool.)

11 Generic Coherent Interconnect
Easy to configure for any interface protocol.
Can connect any number of initiators and targets.
Any functionality can be updated, added, or removed.
Supports multiple configurations and a system-level view.

12 How SystemC Helps
[Diagram: three masters (MASTER1–MASTER3), each with its own cache, connected through an interconnect to memory.]
Master1 generates a request payload.

13 How SystemC Helps
The payload travels through Master1's cache toward the interconnect; SystemC-TLM can trace whether the request was completed by the cache or not.

14 How SystemC Helps
The interconnect splits the payload into multiple payloads for snooping and pre-fetch.
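The splitting step can be sketched as follows. This is a simplified illustration, not the real TLM2 generic-payload API: the interconnect forks one master request into a snoop payload per peer cache, plus an optional speculative pre-fetch toward memory. The `Payload` struct and function name are our own.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for a TLM payload; origin_id is kept for later analysis.
struct Payload {
    uint64_t addr;
    int      origin_id;   // id of the requesting master
    bool     is_snoop;
    bool     is_prefetch;
};

std::vector<Payload> split_for_snooping(const Payload& req, int num_masters,
                                        bool speculative_fetch) {
    std::vector<Payload> out;
    for (int m = 0; m < num_masters; ++m) {
        if (m == req.origin_id) continue;            // don't snoop the requester
        out.push_back({req.addr, req.origin_id, true, false});
    }
    if (speculative_fetch)                           // start the memory read early,
        out.push_back({req.addr, req.origin_id, false, true}); // in parallel with snoops
    return out;
}
```

Because every forked payload carries the originating master's id, responses can later be matched back to the request that produced them.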

15 How SystemC Helps
On a cache hit the snooped cache responds; on a miss the payload waits for memory. Because the interconnect tracks each payload's ID, the responses can be combined and the model can track average completion time, throughput, memory accesses, and hits/misses per master.
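The per-master bookkeeping described above can be sketched as an accumulator keyed by master id. The struct and callback below are our own illustration of what the model tracks, under the assumption that a completion callback fires per response with its hit/miss status, size, and duration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Per-master counters accumulated as responses complete.
struct MasterStats {
    uint64_t hits = 0, misses = 0, bytes = 0;
    double   total_completion_ns = 0;
    uint64_t completed = 0;

    double avg_completion_ns() const {
        return completed ? total_completion_ns / completed : 0.0;
    }
    // Throughput in MB/s over an observation window of window_ns nanoseconds.
    double throughput_mbps(double window_ns) const {
        return window_ns > 0 ? (bytes / 1e6) / (window_ns * 1e-9) : 0.0;
    }
};

std::map<int, MasterStats> stats;   // master id -> accumulated stats

// Assumed hook: called once per completed response.
void on_response(int master_id, bool cache_hit, uint64_t bytes, double ns) {
    MasterStats& s = stats[master_id];
    (cache_hit ? s.hits : s.misses)++;
    s.bytes += bytes;
    s.total_completion_ns += ns;
    s.completed++;
}
```

At the end of a run, dumping this map per master yields exactly the quantities the slide lists: hit/miss ratio, average completion time, memory traffic, and throughput.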

16 How SystemC Helps
All of this rides on the SystemC infrastructure: TLM2/FT provides extensions, TLM2/FT provides a way to add timings, and the payload can be traced at any level.

17 Case Study – Configure System
Interconnect: cache line size, bus width, snooping mechanism, snooping type, cache states, directory size, speculative fetch.
Caches: cache line size, bus width, cache size, ways, replacement policy, states, delays.
Memory: configure the memory controller.

18 Case Study – System Analysis
Cache performance: hit/miss, average latencies.
Snoop performance: snoop requests per master, request type.
Memory transactions: transaction type, transaction count.
Transactions on different targets: count, average duration.

19 Broadcast vs Directory (Snoop Requests)
A snoop request is a request a cache receives from a peer master via the interconnect. With broadcast snooping, the snoop request count is 3468.

20 Broadcast vs Directory (Snoop Requests)
Broadcast: 3468 snoop requests. Directory: 2428 (all directory hits). Directory-based snooping reduces snoop traffic.
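The mechanism behind this reduction can be shown with a toy model (our own simplification, line-granule directory, no evictions): broadcast snoops every peer on every shareable access, while a directory snoops only the masters its sharer list records for that line.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

struct Access { uint64_t line; int master; };

// Count snoop requests generated by a trace of shareable accesses.
uint64_t count_snoops(const std::vector<Access>& trace, int num_masters,
                      bool use_directory) {
    std::map<uint64_t, std::set<int>> sharers;   // directory: line -> sharer set
    uint64_t snoops = 0;
    for (const Access& a : trace) {
        if (use_directory) {
            for (int s : sharers[a.line])
                if (s != a.master) ++snoops;     // snoop recorded sharers only
        } else {
            snoops += num_masters - 1;           // broadcast to all peers
        }
        sharers[a.line].insert(a.master);        // requester now holds the line
    }
    return snoops;
}
```

For any trace the directory count is at most the broadcast count, and the gap widens the less the masters actually share, which matches the 3468-versus-2428 result above in spirit (the slide's exact numbers come from the full simulation, not this toy).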

21 Broadcast vs Directory (Latencies)
Broadcast: CPU0 average read latency 60.779 ns. Directory: CPU0 average read latency 40.6 ns.

22 Directory Size (Snoop Requests)
[Chart: snoop requests and invalidations from the interconnect versus directory size (entries); values shown: 8577 and 460.]

23 Directory Size (Traffic)
[Chart: with a larger directory, invalidations from the interconnect drop to 717 — significantly reduced.]

24 Directory Size (Traffic)
[Cartoon: 35,000 entries — "No invalidation, I won!"; 50,000 entries — "Hold a sec."]

25 Directory Size (Latencies)
[Chart: latencies at a directory size of 50,000 entries.]

26 Directory Size (Latencies)
Secondary objective: minimize snoop requests. With 50,000 entries the latencies are 40.8 and 54.7; with 35,000 entries they are 40.8 and 58.7 — no significant improvement from the larger directory.

27 Cache Line Size (Mem Traffic)
[Chart: with a 64-byte cache line, memory transaction counts of 486 and 2369.]

28 Cache Line Size (Mem Traffic)
With a 128-byte line the counts are 491 and 2370, versus 486 and 2369 at 64 bytes — no improvement.

29 Cache Line Size (Mem Traffic)
Secondary objective: reduce accesses to DRAM memory. With a 32-byte line the counts drop to 288 and 934, from 486 and 2369 at 64 bytes — main memory accesses are reduced.
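The direction of this result can be illustrated with a toy cache model (our own idealization: never-evicting, fully associative): every miss fetches one whole line, so memory traffic is misses times line size, and for scattered or strided accesses a smaller line fetches less unused data.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct Traffic { uint64_t misses; uint64_t bytes_from_memory; };

// Replay an address trace against an idealized cache with the given line size.
Traffic simulate(const std::vector<uint64_t>& addrs, uint64_t line_bytes) {
    std::unordered_set<uint64_t> lines;   // tags of lines already cached
    Traffic t{0, 0};
    for (uint64_t a : addrs) {
        uint64_t tag = a / line_bytes;
        if (!lines.count(tag)) {          // miss: fetch the whole line
            lines.insert(tag);
            ++t.misses;
            t.bytes_from_memory += line_bytes;
        }
    }
    return t;
}
```

For a 64-byte-strided trace, 32-byte lines move fewer bytes from memory than 128-byte lines even though they miss more often; whether that wins overall depends on the workload's spatial locality, which is exactly why the case study sweeps 32/64/128.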

30 Cache Line Size (Latencies)
[Chart: with a 64-byte line, average read latency 40.8.]

31 Cache Line Size (Latencies)
Primary objective: average read latency for each CPU should be less than 40 cycles. 64-byte line: 40.8; 32-byte line: 26.5 — average latencies reduced significantly.

32 Throughput
Primary objective: throughput for all the CPUs should be more than 40 MB/s. With the 32-byte line configuration, the objective is achieved.

33 Coherency Domain
[Chart: two masters with nearly equal counts, 409 and 401 — a good candidate for the Inner Shareable domain.]

34 Final Result
Parameter            Values Taken              Final Value
Snooping mechanism   Broadcast, Directory      Directory
Directory size       10000, 35000, 50000       35000
Cache line size      128, 64, 32               32
Primary objectives: average read latency for each CPU less than 40 cycles; throughput for all the CPUs more than 40 MB/s.
Secondary objectives: minimize snoop requests; reduce accesses to DRAM memory.

35 Conclusion
Architecture definition of a cache-coherent interconnect is challenging: with many design parameters, performance is difficult to predict.
Use SystemC-TLM2 modeling for early quantitative analysis.
Leverage a generic, configurable performance model.
Systematically explore the impact of architecture options.
Use in-depth analysis to understand the root cause of issues.

36 Questions

