Framework For Exploring Interconnect Level Cache Coherency
Parvinder Pal Singh, Sr. R&D Engineer, Synopsys, India
Email: ppsingh@synopsys.com
© Accellera Systems Initiative
Agenda
- Introduction to caches
- Exploration space with hardware coherency
- Current approaches
- Problem explanation
- Proposed methodology
- How SystemC helps solve the problem with our methodology
- Case study
- Conclusion
Cache Coherency Basics
- Caches provide a way to hide memory latencies
- They must preserve the same view of data in a multi-CPU system
[Figure: cartoon of CPUs and memory — a fast CPU keeps a copy of memory data close to itself, raising the questions of who maintains consistency and what happens when other CPUs want the same data]
Coherency Mechanism
- Software based: no extra hardware cost, but complex and inefficient
- Hardware based: fast and independent of software, but requires extra hardware
Exploration Space with HW Coherency
The exploration space is the set of parameters you tweak to obtain the best results:
- Interface protocol: ACE, CHI, and many more
- Coherency protocol: MSI, MESI, MOESI
- Interconnect rules: cache line size, bus width, snooping mechanism and type, speculative fetches, directory size, shareability, minimizing snoop traffic
- System-level data/analysis: utilization, read/write latencies, snoop hit/miss, etc.
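As a sketch, the exploration space above can be captured as a single parameter record that a simulation sweep iterates over. This is plain C++, and all names and defaults are illustrative, not taken from the actual tool:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical sketch of one point in the exploration space.
enum class SnoopMechanism { Broadcast, Directory };
enum class CoherencyProtocol { MSI, MESI, MOESI };

struct ExplorationPoint {
    uint32_t cache_line_bytes  = 64;     // cache line size
    uint32_t bus_width_bits    = 128;    // interconnect bus width
    uint32_t directory_entries = 10000;  // directory size (if directory-based)
    bool     speculative_fetch = false;  // speculative fetches on/off
    SnoopMechanism    snoop    = SnoopMechanism::Broadcast;
    CoherencyProtocol protocol = CoherencyProtocol::MESI;
    std::string interface_protocol = "ACE"; // e.g. ACE or CHI
};
```

A sweep then simulates each `ExplorationPoint` and records the system-level metrics listed above for comparison.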
Current Approaches
- Spreadsheet: accuracy issues; limited or no system-level view; leads to power and performance issues
- Trial and error: error prone; risks a wrong design configuration; leads to power and performance issues
Problem Explanation – System
[Figure: four CPUs (CPU0–CPU3), each with a private cache ($), connected to a coherent interconnect alongside a shared L2 cache; a memory bus to DRAM and ROM; and a peripheral bus to a timer and UART]
Problem Explanation – Objectives
Constraint: cache size and bus width are held constant.
- Primary objectives: average read latency for each CPU below 40 cycles; throughput for all CPUs above 40 MB/s
- Secondary objectives: minimize snoop requests; reduce accesses to DRAM memory
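The primary objectives can be phrased as a simple pass/fail predicate over per-CPU measurements. The thresholds are from the slide; the struct and function names are illustrative:

```cpp
#include <cassert>

// Per-CPU measurements from one simulation run (hypothetical names).
struct CpuMetrics {
    double avg_read_latency_cycles; // primary objective: < 40
    double throughput_mbps;         // primary objective: > 40
};

// True only when both primary objectives hold for this CPU.
inline bool meets_primary_objectives(const CpuMetrics& m) {
    return m.avg_read_latency_cycles < 40.0 && m.throughput_mbps > 40.0;
}
```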
Problem Explanation – Exploration Space
- Cache line size
- Directory size
- Coherency domain
- Cache states
- Snooping mechanism
Proposed Methodology
1. Platform assembly and workload modeling
2. Simulation sweep
3. Root-cause analysis
4. Sensitivity analysis
5. Are we done yet?
6. Hand-off
Generic Coherent Interconnect
- Easy to configure for any interface protocol
- Can connect any number of initiators and targets
- Any functionality can be updated, added, or removed
- Supports multiple configurations
- Provides a system-level view
How SystemC Helps
[Figure: three masters (MASTER1–MASTER3), each with a cache, connected through an interconnect to memory]
Master1 generates a request payload, which travels through its cache toward the interconnect.
How SystemC Helps
SystemC-TLM can trace whether a request was completed by a cache or had to travel on to the interconnect and memory.
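The tracing idea can be sketched in plain C++ (standing in for the SystemC-TLM payload): each payload carries an ID, and whichever component completes it records where it was satisfied, so analysis code can query the trace after simulation. All names here are illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Where a request was ultimately satisfied.
enum class CompletedBy { None, LocalCache, PeerCache, Memory };

// Minimal stand-in for a TLM generic payload.
struct Payload {
    uint64_t id;
    uint64_t address;
    CompletedBy completed_by = CompletedBy::None;
};

// ID-keyed trace that analysis code inspects after simulation.
using CompletionTrace = std::unordered_map<uint64_t, CompletedBy>;

inline void record_completion(CompletionTrace& trace, const Payload& p) {
    trace[p.id] = p.completed_by;
}
```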
How SystemC Helps
On a miss, the interconnect splits the payload into multiple payloads: one to snoop each peer cache, and one for the speculative pre-fetch toward memory.
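The fan-out step can be sketched as follows (plain C++; the struct and function names are illustrative, and the real model would issue TLM transactions rather than return a vector):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One of the payloads produced by splitting a request.
struct SubPayload {
    uint64_t parent_id;   // ties the split payloads back to the request
    int      target;      // peer cache index, or -1 for memory
    bool     speculative; // true for the pre-fetch toward memory
};

// Fan a request out into one snoop payload per peer cache, plus an
// optional speculative fetch toward memory.
inline std::vector<SubPayload> split_for_snoop(uint64_t parent_id,
                                               int num_peers,
                                               bool speculative_fetch) {
    std::vector<SubPayload> out;
    for (int i = 0; i < num_peers; ++i)
        out.push_back({parent_id, i, false});
    if (speculative_fetch)
        out.push_back({parent_id, -1, true});
    return out;
}
```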
How SystemC Helps
The interconnect tracks each payload's ID, waits while the snoops report cache hit or cache miss, and combines the responses before replying. Because the ID is tracked for analysis, we can measure average completion time, throughput, memory accesses, and hits/misses per master.
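The ID-keyed bookkeeping can be sketched like this (plain C++; issue/complete times are in abstract simulation-time units, and all names are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Per-transaction record kept by the interconnect.
struct TxnStats {
    uint64_t start_time = 0;
    uint64_t end_time   = 0;
    uint32_t bytes      = 0;
};

class TxnTracker {
public:
    // Called when a request payload enters the interconnect.
    void on_issue(uint64_t id, uint64_t now, uint32_t bytes) {
        stats_[id] = {now, 0, bytes};
    }
    // Called when the combined response is returned to the master.
    void on_complete(uint64_t id, uint64_t now) {
        stats_[id].end_time = now;
    }
    // Average latency over all completed transactions.
    double average_latency() const {
        uint64_t total = 0, n = 0;
        for (const auto& kv : stats_) {
            if (kv.second.end_time != 0) {
                total += kv.second.end_time - kv.second.start_time;
                ++n;
            }
        }
        return n ? double(total) / double(n) : 0.0;
    }
private:
    std::unordered_map<uint64_t, TxnStats> stats_;
};
```

The same table can feed throughput (bytes over elapsed time) and per-master hit/miss counters.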
How SystemC Helps
- The SystemC infrastructure lets the payload be traced at any level
- TLM2/FT provides extensions for attaching metadata to a payload
- TLM2/FT provides a way to add timing annotations
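The extension idiom can be sketched in plain C++. The real mechanism is `tlm::tlm_extension<T>` attached to a `tlm_generic_payload`; this stand-in mimics the shape so the idea is self-contained, and every name below is illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Coherency metadata that travels with the payload through every
// level of the hierarchy (a stand-in for a TLM2 extension).
struct CoherencyExt {
    bool     is_snoop      = false; // payload produced by snoop fan-out
    bool     cache_hit     = false; // filled in by whichever cache answers
    uint64_t issue_time_ps = 0;     // timing annotation, as TLM2 allows
};

// A payload with an optional extension slot.
struct PayloadWithExt {
    uint64_t address = 0;
    std::unique_ptr<CoherencyExt> ext;
};
```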
Case Study – Configure System
- Interconnect: cache line size, bus width, snooping mechanism and type, cache states, directory size, speculative fetch
- Caches: cache line size, bus width, cache size, ways, replacement policy, states, delays
- Configure the memory controller
Case Study – System Analysis
- Average latencies
- Cache performance: hits/misses
- Snoop performance: snoop requests per master, request type
- Memory transactions: transaction type and count
- Transactions on different targets: count, average duration
Broadcast vs Directory (Snoop Requests)
A snoop request is a request received from a peer master or from the interconnect. With broadcast snooping, the system sees 3468 snoop requests.
Broadcast vs Directory (Snoop Requests)
Broadcast: 3468 snoop requests. Directory: 2428, with reduced traffic (all hits). Directory-based snooping results in fewer snoop requests.
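The reason a directory cuts snoop traffic can be sketched in a few lines: broadcast snoops every peer on every request, while a directory only snoops the caches it has recorded as sharers of the line. This is plain illustrative C++, not the model's actual code:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Directory: cache line address -> indices of caches holding the line.
using Directory = std::unordered_map<uint64_t, std::vector<int>>;

// Broadcast: every peer is snooped regardless of who holds the line.
inline int snoops_broadcast(int num_peers) { return num_peers; }

// Directory: only the recorded sharers are snooped (possibly none).
inline int snoops_directory(const Directory& dir, uint64_t line) {
    auto it = dir.find(line);
    return it == dir.end() ? 0 : int(it->second.size());
}
```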
Broadcast vs Directory (Latencies)
Broadcast: CPU0 average read latency 60.779 ns. Directory: CPU0 40.6 ns.
Directory Size (Snoop Requests)
[Chart, directory size of 10000 entries: values 8577 and 460 shown for snoop requests and invalidations from the interconnect]
Directory Size (Traffic)
[Chart, directory size of 35000 entries: value 717 shown] Invalidations from the interconnect are significantly reduced.
Directory Size (Traffic)
At 35000 entries there are no invalidations — it looks like the winner. But hold on: is a 50000-entry directory worth checking first?
Directory Size (Latencies)
[Chart: latencies with a 50000-entry directory]
Directory Size (Latencies)
Secondary objective: minimize snoop requests. Directory size 50000: 40.8 and 54.7; size 35000: 40.8 and 58.7. The larger directory brings no significant improvement, so 35000 entries is chosen.
Cache Line Size (Mem Traffic)
Line size 64: memory transaction counts of 486 and 2369.
Cache Line Size (Mem Traffic)
Line size 64: 486 and 2369; line size 128: 491 and 2370 — no improvement.
Cache Line Size (Mem Traffic)
Secondary objective: reduced accesses to DRAM memory. Line size 64: 486 and 2369; line size 32: 288 and 934 — main memory accesses are reduced.
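A back-of-envelope view of the trade-off: each main-memory access moves one full cache line, so total DRAM bytes are accesses times line size. The access counts below are from the slides; the formula itself is a simplification (it ignores write-backs and prefetches):

```cpp
#include <cassert>
#include <cstdint>

// Bytes moved to/from DRAM for a given access count and line size.
inline uint64_t dram_bytes_moved(uint64_t accesses, uint32_t line_bytes) {
    return accesses * uint64_t(line_bytes);
}
```

With 64-byte lines and 2369 accesses versus 32-byte lines and 934 accesses, the smaller line size moves far fewer bytes as well as making fewer accesses.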
Cache Line Size (Latencies)
Line size 64: average read latency 40.8.
Cache Line Size (Latencies)
Primary objective: average read latency for each CPU below 40 cycles. Line size 64: 40.8; line size 32: 26.5 — average latencies reduced significantly.
Throughput
Primary objective: throughput for all CPUs above 40 MB/s. With line size 32, the objective is achieved.
Coherency Domain
[Chart: 409 vs 401] A good candidate for the Inner Shareable domain.
Final Result

Parameter            Values Taken            Final Value
Snooping Mechanism   Broadcast/Directory     Directory
Directory Size       10000, 35000, 50000     35000
Cache Line Size      128, 64, 32             32

Primary objectives met: average read latency for each CPU below 40 cycles; throughput for all CPUs above 40 MB/s.
Secondary objectives met: minimized snoop requests; reduced accesses to DRAM memory.
Conclusion
- Architecture definition of a cache-coherent interconnect is challenging: many design parameters, and performance is difficult to predict
- Use SystemC-TLM2 modeling for early quantitative analysis
- Leverage a generic, configurable performance model
- Systematically explore the impact of architecture options
- Use in-depth analysis to understand the root cause of issues
Questions