Supporting Cache Coherence in Heterogeneous Multiprocessor Systems Taeweon Suh, Douglas M. Blough, and Hsien-Hsin S. Lee Georgia Institute of Technology.

2 Introduction
- Cache coherence: a well-known technique for data consistency among multiprocessors
- Shared memory: MEI, MSI, MESI and MOESI protocols
  - PowerPC755: MEI protocol
  - Pentium class: MESI protocol
  - UltraSPARC: MOESI protocol
  - AMD64 class: MOESI protocol
- Distributed shared memory: directory-based coherence

3 Motivation
- SoC capacity increases as lithography technology advances
- Applications demand heterogeneous multiprocessors and/or IPs on a chip
  - DiMeNsion 8650 (LSI Logic)
  - AD6525 (Analog Devices)
  - Nexperia pnx8500 (Philips)
- Snoop-based protocols fail to address coherence among heterogeneous processors

4 Contributions
- Systematic integration methods of distinct coherence protocols in heterogeneous multiprocessor SoC designs
- Performance improvements
- Possible power savings

5 Integration Methods
- Techniques to integrate coherence protocols
  - Read-to-Write conversion: S (Shared) state removal
  - Shared signal assertion / de-assertion: E (Exclusive) / S (Shared) state removal
- Integrated coherence protocol
  - Common states from distinct protocols
  - e.g., MEI and MESI integration: MEI protocol
- Snoop-hit buffer
  - Performance booster
  - Power saving
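The "common states from distinct protocols" rule can be read as a set intersection. The following is a minimal sketch under that reading, not the authors' implementation; the names `PROTOCOL_STATES`, `integrated_states`, and `states_to_remove` are hypothetical and introduced only for illustration.

```python
# Hypothetical illustration: protocol names map to their state sets.
PROTOCOL_STATES = {
    "MEI": {"M", "E", "I"},
    "MSI": {"M", "S", "I"},
    "MESI": {"M", "E", "S", "I"},
}


def integrated_states(*protocols):
    """Return the states shared by every named protocol."""
    common = set(PROTOCOL_STATES[protocols[0]])
    for name in protocols[1:]:
        common &= PROTOCOL_STATES[name]
    return common


def states_to_remove(*protocols):
    """For each protocol, list the states it must stop using."""
    common = integrated_states(*protocols)
    return {name: PROTOCOL_STATES[name] - common for name in protocols}
```

Under this sketch, `integrated_states("MEI", "MESI")` gives the MEI states, matching the example on the slide, and `states_to_remove("MSI", "MESI")` reports that the MESI processor must give up E, which is the case the shared signal assertion technique addresses.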

6 Read-to-Write Conversion
- S (Shared) state removal
- MEI-MESI integration example

Diagram: Proc 1 (MEI) behind Wrapper 1 and Proc 2 (MESI) behind Wrapper 2 share a bus with the memory controller; the labels Read/Write and Write mark the conversion.

Operations on cache line X, without our technique:

  Operation      Proc 1 (MEI)   Proc 2 (MESI)
  (1) P2 read    I              I -> E
  (2) P1 read    I -> E         E -> S
  (3) P1 write   E -> M         S (stale)
  (4) P2 read    M              S (stale)

Operations on cache line X, with our technique:

  Operation      Proc 1 (MEI)   Proc 2 (MESI)
  (1) P2 read    I              I -> E
  (2) P1 read    I -> E         E -> I
  (3) P1 write   E -> M         I
  (4) P2 read    M -> I         I -> E
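The four-operation example on this slide can be replayed with a small state-machine model. This is a simplified sketch, not the authors' Verilog: it assumes one cache line, two processors, textbook MEI and MESI transitions, and that the conversion is applied to what the MESI processor snoops. The names `simulate`, `label`, and `OPS` are hypothetical.

```python
OPS = [("P2", "read"), ("P1", "read"), ("P1", "write"), ("P2", "read")]


def label(state, value, latest):
    """Mark a valid copy as stale when it misses the latest write."""
    if state != "I" and value != latest:
        return state + " (stale)"
    return state


def simulate(ops, read_to_write):
    """Replay (processor, operation) pairs; P1 is MEI, P2 is MESI."""
    state = {"P1": "I", "P2": "I"}
    data = {"P1": None, "P2": None}
    memory = 0   # value of the line held in main memory
    latest = 0   # value produced by the most recent write
    trace = []
    for proc, op in ops:
        other = "P2" if proc == "P1" else "P1"
        # The bus is used on a miss, or on a write to a line held in S.
        on_bus = state[proc] == "I" or (op == "write" and state[proc] == "S")
        if on_bus:
            snooped = op
            if read_to_write and other == "P2":
                snooped = "write"   # assumed wrapper behavior: reads look like writes
            if state[other] == "M":
                memory = data[other]   # dirty line is written back first
            if state[other] != "I":
                if other == "P1" or snooped == "write":
                    state[other], data[other] = "I", None   # MEI has no S state
                else:
                    state[other] = "S"   # MESI keeps a shared copy on a read
            if state[proc] == "I":
                data[proc] = memory   # line fill
                state[proc] = "E"     # no shared signal is driven in this model
        if op == "write":
            latest += 1
            data[proc] = latest
            state[proc] = "M"   # from E this is a silent upgrade
        trace.append(tuple(label(state[p], data[p], latest) for p in ("P1", "P2")))
    return trace
```

Calling `simulate(OPS, read_to_write=False)` leaves P2 holding a stale S copy after P1's silent E-to-M upgrade, while `simulate(OPS, read_to_write=True)` invalidates P2 at step (2), so step (4) misses and fetches the written-back data.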

7 Shared Signal Assertion
- E (Exclusive) state removal
- MSI-MESI integration example

Diagram: Proc 1 (MSI) behind Wrapper 1 and Proc 2 (MESI) behind Wrapper 2 share a bus with the memory controller; the labels Read and Shared mark the asserted shared signal.

Operations on cache line X, without our technique:

  Operation      Proc 1 (MSI)   Proc 2 (MESI)
  (1) P1 read    I -> S         I
  (2) P2 read    S              I -> E
  (3) P2 write   S (stale)      E -> M
  (4) P1 read    S (stale)      M

Operations on cache line X, with our technique:

  Operation      Proc 1 (MSI)   Proc 2 (MESI)
  (1) P1 read    I -> S         I
  (2) P2 read    S              I -> S
  (3) P2 write   S -> I         S -> M
  (4) P1 read    I -> S         M -> S
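The MSI-MESI example on this slide can be replayed the same way. This is a simplified sketch under stated assumptions, not the authors' design: one cache line, two processors, textbook MSI and MESI transitions, and a wrapper that asserts the shared signal on every read miss of the MESI processor when the technique is enabled. The names `simulate`, `label`, and `OPS` are hypothetical.

```python
OPS = [("P1", "read"), ("P2", "read"), ("P2", "write"), ("P1", "read")]


def label(state, value, latest):
    """Mark a valid copy as stale when it misses the latest write."""
    if state != "I" and value != latest:
        return state + " (stale)"
    return state


def simulate(ops, assert_shared):
    """Replay (processor, operation) pairs; P1 is MSI, P2 is MESI."""
    state = {"P1": "I", "P2": "I"}
    data = {"P1": None, "P2": None}
    memory = 0   # value of the line held in main memory
    latest = 0   # value produced by the most recent write
    trace = []
    for proc, op in ops:
        other = "P2" if proc == "P1" else "P1"
        # The bus is used on a miss, or on a write to a line held in S.
        on_bus = state[proc] == "I" or (op == "write" and state[proc] == "S")
        if on_bus:
            if state[other] == "M":
                memory = data[other]   # dirty line is written back first
            if state[other] != "I":
                if op == "write":
                    state[other], data[other] = "I", None   # invalidate on a write
                else:
                    state[other] = "S"   # keep a shared copy on a read
            if state[proc] == "I":
                data[proc] = memory   # line fill
                if proc == "P1":
                    state[proc] = "S"   # MSI has no E state
                else:
                    # The MSI processor never drives the shared signal itself;
                    # the wrapper is assumed to drive it when enabled.
                    state[proc] = "S" if assert_shared else "E"
        if op == "write":
            latest += 1
            data[proc] = latest
            state[proc] = "M"   # from E this is a silent upgrade
        trace.append(tuple(label(state[p], data[p], latest) for p in ("P1", "P2")))
    return trace
```

With `assert_shared=False`, P2 fills in E and later upgrades to M without a bus transaction, so P1 keeps a stale S copy. With `assert_shared=True`, P2 fills in S, its write goes onto the bus, and P1 is invalidated before it can read old data.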

8 Snoop-hit Buffer
- A snoop hit on an M line requires 2 transactions intended for the same address
- Performance enhancement and power saving

Diagram: Proc 1 (MEI) behind Wrapper 1, Proc 2 (MESI) behind Wrapper 2, the memory controller, and a snoop-hit buffer (single cache line) on the bus; arrows labeled Read, Write-back, To memory, and Read.
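One plausible reading of the snoop-hit buffer is sketched below: the write-back that follows a snoop hit on an M line is captured in a single-line buffer, so the retried read for the same address is served from the buffer instead of memory. The slides do not spell out the mechanism, so this behavior is an assumption, and `Memory`, `SnoopHitBuffer`, and `read_after_snoop_hit` are hypothetical names.

```python
class Memory:
    """Main memory that counts how often it is accessed."""

    def __init__(self):
        self.lines = {}
        self.reads = 0
        self.writes = 0

    def read(self, addr):
        self.reads += 1
        return self.lines.get(addr, 0)

    def write(self, addr, value):
        self.writes += 1
        self.lines[addr] = value


class SnoopHitBuffer:
    """Holds at most one cache line (assumed organization)."""

    def __init__(self):
        self.addr = None
        self.value = None

    def capture(self, addr, value):
        self.addr, self.value = addr, value

    def take(self, addr):
        """Return the buffered line for addr and empty the buffer, else None."""
        if self.addr != addr:
            return None
        value, self.addr, self.value = self.value, None, None
        return value


def read_after_snoop_hit(memory, addr, dirty_value, buffer=None):
    """Model the two transactions that follow a snoop hit on an M line."""
    # Transaction 1: the owner writes the dirty line back to memory.
    memory.write(addr, dirty_value)
    if buffer is not None:
        buffer.capture(addr, dirty_value)
    # Transaction 2: the requester retries its read of the same address.
    if buffer is not None:
        value = buffer.take(addr)
        if value is not None:
            return value   # served by the buffer, no memory read
    return memory.read(addr)
```

In this model the requester sees the same data either way; the buffer only removes the second memory access, which is where the claimed performance and power benefit would come from.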

9 Simulation Environment
- 3 PowerPC755 (MEI) + 1 ARM920T (no coherence)
- Verilog-HDL implementation
- Simulators: Seamless CVE + VCS
- Baseline: software solution

Diagram labels: Arbiter, ASB, ARM920T (None), PowerPC755 (MEI), Wrapper, Snoop logic; signals: nFIQ, ARTRY.

10 Performance Evaluation (1/3)
- Worst-case simulation: each task accesses the same critical sections

Chart labels: 0.97%, 57%

11 Performance Evaluation (2/3)
- Best-case simulation: each task accesses different critical sections

Chart labels: 426%, 51%

12 Performance Evaluation (3/3)
- Typical-case simulation: each task randomly selects critical sections

Chart labels: 68%, 22%

13 Performance Evaluation (3/3)
- Typical-case simulation: each task randomly selects critical sections

Chart labels: 226%, 22%, 68%, 26%

14 Conclusions
- Propose an integration method of cache coherence protocols for heterogeneous processors
  - Retain common states from distinct coherence protocols
- Performance improved by up to 5.26X with a 96-cycle miss penalty, at the expense of simple hardware
- Possible power savings from the snoop-hit buffer
- Useful and effective methods for heterogeneous multiprocessor SoC designs

15 Questions? Thanks for your attention!

16 Backup Slides

17 Performance Evaluation (2/5)
Simulation environments (cont.)

  Simulators              Seamless CVE (Mentor Graphics), VCS (Synopsys)
  Operating frequencies   PowerPC755: 100 MHz; ARM920T: 50 MHz; ASB: 50 MHz
  I$ / D$                 Enabled
  Memory access time      6 cycles for the 1st word, 1 cycle for each subsequent word

- Baseline: software solution
- Lock mechanism: SoCLC [Bilge'02]
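The memory access time on this slide implies a simple burst-latency formula. The sketch below works it through; the 8-word line length and the choice to count cycles on the 50 MHz ASB clock are assumptions for illustration, and `line_access_cycles` and `cycles_to_ns` are hypothetical helper names.

```python
FIRST_WORD_CYCLES = 6   # from the slide: 6 cycles for the 1st word
NEXT_WORD_CYCLES = 1    # from the slide: 1 cycle for each subsequent word


def line_access_cycles(words):
    """Cycles needed to transfer a burst of `words` words."""
    if words < 1:
        raise ValueError("a burst transfers at least one word")
    return FIRST_WORD_CYCLES + NEXT_WORD_CYCLES * (words - 1)


def cycles_to_ns(cycles, clock_mhz):
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return cycles * 1000.0 / clock_mhz


# Assumed example: an 8-word cache line timed on a 50 MHz bus clock.
ASSUMED_LINE_WORDS = 8
ASSUMED_BUS_MHZ = 50
line_cycles = line_access_cycles(ASSUMED_LINE_WORDS)   # 6 + 7 * 1 = 13
line_ns = cycles_to_ns(line_cycles, ASSUMED_BUS_MHZ)   # 13 cycles at 20 ns each
```

Under these assumptions a full line transfer costs 13 cycles, so every memory access avoided on a snoop hit saves that much bus time.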

18 Introduction (2/2)
- Cache coherence example
- PowerPC755: MEI protocol

Diagram: PowerPC755 #1 through #4, each with a D$, share a bus with memory; labels on the bus: 32, ADDR, TT, GBL, ARTRY.

19 Implementation Examples (1/2)
- Intel486: modified MESI protocol
- PowerPC755: MEI protocol

Diagram labels: Intel486 (MESI), Wrapper, PowerPC755 (MEI), Wrapper, Arbiter, Bus; signals: INV, ARTRY, HLDA, BOFF, BREQ, BG_BAR, BR_BAR, HOLD, HITM.

20 Implementation Examples (2/2)
- PowerPC755: MEI protocol
- ARM920T: no cache coherence support
- Problem: hardware deadlock due to interrupt response time

Diagram labels: Arbiter, ASB, ARM920T (None), PowerPC755 (MEI), Wrapper, Snoop logic; signals: ARTRY, BG_BAR, BR_BAR, BGNT, BREQ, nFIQ.