Cache Coherence Support for Non-Shared Bus Architecture on Heterogeneous MPSoCs
Taeweon Suh §, Daehyun Kim †, and Hsien-Hsin S. Lee §
§ Georgia Institute of Technology, † Intel Corporation

Presentation transcript:

Cache Coherence Support for Non-Shared Bus Architecture on Heterogeneous MPSoCs
Taeweon Suh §, Daehyun Kim †, and Hsien-Hsin S. Lee §
§ Georgia Institute of Technology, † Intel Corporation
June 15, 2005

2 MPSoCs
Time-to-market
Flexibility
Low cost
  – Share memory interface to reduce pin count
  – However, shared bus arch. hinders the versatility provided by each processor
  – Non-shared bus arch.
Real-time property
  – Communication between processors
[Figure: MPSoC block diagram; labels: IP, ADC, memory controller, uP, wireless IP, memory, SDRAM, DSP]

3 Introduction
Cache coherence
  – Well-known technique for data consistency in multiprocessor systems
Protocol states
  – Modified, Owned, Exclusive, Shared, Invalid (MOESI)
Example operation sequence (P0 and P1 each have a MOESI data cache; memory initially holds 1234)
  – P0: read. P0 loads 1234 in state E.
  – P1: read. The shared signal is asserted; both caches hold 1234 in state S.
  – P1: write (abcd). P0's copy is invalidated (I); P1 holds abcd in state M.
  – P0: read. Cache-to-cache transfer; P1 moves to O and P0 holds abcd in state S.
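The example sequence on this slide can be traced with a few lines of code. The following is a minimal sketch, assuming a simplified MOESI model with two caches and a single cache line; the names `Cache`, `read`, and `write` are illustrative and do not come from the presentation.

```python
# Simplified MOESI model for one cache line held by two caches.
# All names are hypothetical; this is not the authors' code.

class Cache:
    def __init__(self, name):
        self.name = name
        self.state = "I"   # one of M, O, E, S, I
        self.data = None


def read(requester, other, memory):
    """Processor read on `requester`; `other` snoops the transaction."""
    if requester.state != "I":
        return requester.data                      # cache hit, no bus traffic
    if other.state in ("M", "O"):
        # Cache-to-cache transfer: the owner supplies the dirty line.
        other.state = "O"
        requester.state, requester.data = "S", other.data
    elif other.state in ("E", "S"):
        # Shared signal asserted: both copies end up in S.
        other.state = "S"
        requester.state, requester.data = "S", memory["line"]
    else:
        # No other copy exists: load exclusively from memory.
        requester.state, requester.data = "E", memory["line"]
    return requester.data


def write(requester, other, value):
    """Processor write on `requester`; any other copy is invalidated."""
    if other.state != "I":
        other.state = "I"
    requester.state, requester.data = "M", value


memory = {"line": "1234"}
p0, p1 = Cache("P0"), Cache("P1")

trace = []                      # (P0 state, P1 state) after each operation
read(p0, p1, memory);  trace.append((p0.state, p1.state))
read(p1, p0, memory);  trace.append((p0.state, p1.state))
write(p1, p0, "abcd"); trace.append((p0.state, p1.state))
read(p0, p1, memory);  trace.append((p0.state, p1.state))
print(trace)  # [('E', 'I'), ('S', 'S'), ('I', 'M'), ('S', 'O')]
```

Memory still holds 1234 at the end of the trace, because in MOESI the owner (state O) supplies the dirty line without writing it back.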

4 Previous Work
Integration techniques for shared-bus based platforms [1][2][3]
  – Shared-signal assertion (example platform: Proc 0 with MSI, Proc 1 with MESI)
  – Read-to-write conversion (example platform: Proc 0 with MEI, Proc 1 with MESI)
  – Snoop-hit buffer holding a single cache line (example platform: Proc 0 with MEI, Proc 1 with MESI)
[Figures: three shared-bus platforms, each with a memory controller and two processors attached to the bus through Wrapper 0 and Wrapper 1]
[1] Taeweon Suh, Douglas M. Blough, and Hsien-Hsin S. Lee, Supporting cache coherence in heterogeneous multiprocessor systems, In DATE04, Feb
[2] Taeweon Suh, Hsien-Hsin S. Lee, and Douglas M. Blough, Integrating cache coherence protocols for heterogeneous multiprocessor systems, Part 1, In IEEE Micro, July/August 2004
[3] Taeweon Suh, Hsien-Hsin S. Lee, and Douglas M. Blough, Integrating cache coherence protocols for heterogeneous multiprocessor systems, Part 2, In IEEE Micro, September/October 2004

5 Proposal
Cache Coherence-enforced Memory Controller (ccMC) for non-shared bus based MPSoCs
  – Bypass approach
  – Bookkeeping approach
Integration of invalidation-based protocols such as MEI, MSI, MESI, and MOESI
[Figure: MPSoC in which the ccMC connects Bus 0, Bus 1, and memory; Proc 0 (MESI) and Proc 1 (MEI) sit on separate buses]

6 Bypass Approach
Blindly pass bus transactions if in shared range
Very inexpensive in terms of silicon area
[Figure: MPSoC with the ccMC between Bus 0, Bus 1, and memory, with Proc 0 (MESI) and Proc 1 (MEI); ccMC internals labeled Start_addr_reg, Range_reg, comparator, mux, snoop-hit buffer, and bus request]
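The bypass decision reduces to an address-range comparison. The following is a minimal sketch, assuming that Start_addr_reg holds the first address of the shared region and that Range_reg holds its size in bytes; the slide does not define the register semantics, and the class and method names (`BypassCcMC`, `forward_to`) are hypothetical.

```python
# Hypothetical sketch of the bypass decision. The attribute names mirror the
# slide's Start_addr_reg and Range_reg, but the interface is an assumption.

class BypassCcMC:
    def __init__(self, start_addr, range_bytes):
        self.start_addr_reg = start_addr   # first address of the shared region
        self.range_reg = range_bytes       # assumed: region size in bytes

    def in_shared_range(self, addr):
        # Models the comparator in the figure.
        return self.start_addr_reg <= addr < self.start_addr_reg + self.range_reg

    def forward_to(self, source_bus, addr):
        """Return the index of the bus that must snoop this transaction, or None."""
        if self.in_shared_range(addr):
            return 1 - source_bus          # blindly pass to the other bus
        return None                        # private data: no snoop needed


ccmc = BypassCcMC(start_addr=0x1000, range_bytes=0x100)
print(ccmc.forward_to(0, 0x1040))  # 1
print(ccmc.forward_to(1, 0x2000))  # None
```

Because no per-line state is kept, every transaction in the shared range costs a snoop on the other bus, which is the trade-off the bookkeeping approach addresses.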

7 Bookkeeping Approach
Selectively pass bus transactions if in shared range
Expensive compared to bypass approach
[Figure: MPSoC with the ccMC between Bus 0, Bus 1, and memory, with Proc 0 (MESI) and Proc 1 (MEI); the ccMC holds Start_addr_reg, Range_reg, a snoop-hit buffer, and a table of cache-line states (I, S, M) for P0 and P1; a bus request is issued if the recorded state is M and the address is inside the shared range]

8 Example
Bookkeeping approach
Example operation sequence
  – P1: read
  – P1: write (abcd)
  – P0: read
[Figure: MPSoC with Proc 0 (MSI) on Bus 1 and Proc 1 (MESI) on Bus 0; the ccMC state table for P0 and P1 starts at I and passes through S and M; annotations: shared, invalidate, Breq; data values 1234 and abcd]
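The example sequence can be traced against a small state table. The following is a minimal sketch of one plausible reading of the bookkeeping approach, assuming that the ccMC records one state (I, S, or M) per cache line and per processor, that E is never granted, that a read is passed to the other bus only when the other processor's recorded state is M, and that a write is passed only when the other processor holds a copy. The class name `BookkeepingCcMC`, the `access` method, and the 32-byte line size are hypothetical.

```python
# Hypothetical sketch of a bookkeeping ccMC. Evictions and write-backs are
# not modeled; `access` represents a transaction the ccMC sees on a bus.

class BookkeepingCcMC:
    def __init__(self, start_addr, range_bytes, line_bytes=32):
        self.start_addr_reg = start_addr
        self.range_reg = range_bytes
        self.line_bytes = line_bytes       # assumed cache-line size
        self.table = {}                    # line index -> [P0 state, P1 state]

    def access(self, proc, addr, is_write):
        """Record the transaction; return True if it is passed to the other bus."""
        if not (self.start_addr_reg <= addr < self.start_addr_reg + self.range_reg):
            return False                   # outside the shared range
        states = self.table.setdefault(addr // self.line_bytes, ["I", "I"])
        other = 1 - proc
        if is_write:
            forwarded = states[other] != "I"    # invalidate an existing copy
            states[proc], states[other] = "M", "I"
        else:
            forwarded = states[other] == "M"    # bus request (Breq) if M
            if forwarded:
                states[other] = "S"             # owner keeps a shared copy
            if states[proc] == "I":
                states[proc] = "S"              # E-state is never granted
        return forwarded


ccmc = BookkeepingCcMC(start_addr=0x1000, range_bytes=0x100)
results = [
    ccmc.access(proc=1, addr=0x1000, is_write=False),  # P1: read
    ccmc.access(proc=1, addr=0x1000, is_write=True),   # P1: write (abcd)
    ccmc.access(proc=0, addr=0x1000, is_write=False),  # P0: read
]
print(results)                               # [False, False, True]
print(ccmc.table[0x1000 // 32])              # ['S', 'S']
```

Only the third transaction reaches the other bus, whereas the bypass approach would have forwarded all three.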

9 Integration with a Processor without Coherence Support
Processors without coherence support behave as if they implemented MEI without snooping: MEI-like integrated protocol
An interrupt is used to inform the processor of possible snoop hits
[Figure: MPSoC with the ccMC between Bus 0, Bus 1, and memory; Proc 0 (MESI) and Proc 1 (no hardware support), with an IRQ line to Proc 1]

10 Simulation Model
Atalanta RTOS [4]
  – Home-grown RTOS at Georgia Tech
  – Designed for heterogeneous multiprocessor SoCs
Atalanta kernel simulation
  – Task insertion/deletion
  – Tasks are managed in TCBs (Task Control Blocks)
  – TCBs are connected through a doubly-linked list
  – Each processor's TCBs are accessible by the other processor
  – When a system object is ready, update the highest-priority TCB waiting for system objects such as semaphores
[4] Di-Shi Sun, Douglas M. Blough, and Vincent J. Mooney, A New Multiprocessor RTOS Kernel for System-on-a-Chip Applications. Technical Report GIT-CC-02-09, CERCS

11 Simulation Environment
Processors
  – Platform 1: PPC755 (MEI) + ARM9 with MESI
  – Platform 2: ARM9 with MSI + ARM9 with MESI
Simulators: Seamless CVE + ModelSim
[Figure: ccMC connecting Bus 0 (Proc 1), Bus 1 (Proc 0), and memory; peripherals: DMA0, DMA1, 100 Mbps Ethernet, 320x240 LCD controller]

12 Simulation Results
Bypass approach: 2 tasks on each processor

13 Simulation Results
Bypass approach: 32 tasks on each processor

14 Simulation Results
Bookkeeping approach
  – Platform 2, miss penalty 14 cycles
  – Microbench simulation

15 Conclusions
Proposed integration techniques for cache coherence on non-shared bus based MPSoCs
  – Bypass approach, Bookkeeping approach
Bypass approach
  – Blindly pass shared memory operations
  – Very cheap in terms of silicon area
Bookkeeping approach
  – Selectively pass shared memory operations
  – Expensive compared to bypass approach
Effective solutions for communication as more and more heterogeneous processors are integrated in a single chip

16 Questions, Comments? Thanks for your attention!

17 Backup Slides

18 Motivation
Embedded systems increasingly require heterogeneous processors on a chip, according to application needs
Efficient communication is imperative to meet the real-time requirements of embedded applications
Shared-bus architectures using AMBA or CoreConnect compromise the versatility provided by each processor
Pin count restricts the use of a dedicated memory interface for each processor on SoCs
  – Commercial MPSoCs such as TI OMAP and Philips Nexperia employ a non-shared bus architecture with a shared memory interface (check Nexperia)

19 Bookkeeping Approach (cont'd)
Problem with E-state
Example operation sequence
  – P1: read
  – P1: write
  – P0: read
[Figure: MPSoC with Proc 0 (MSI) on Bus 1 and Proc 1 (MESI) on Bus 0; the ccMC state table for P0 and P1 starts at I; states shown: E and M; data values 1234 and abcd]

20 Bookkeeping Approach (cont'd)
Solution: Prohibit E-state (shared signal assertion)
Example operation sequence
  – P1: read
  – P1: write
  – P0: read
[Figure: MPSoC with Proc 0 (MSI) on Bus 1 and Proc 1 (MESI) on Bus 0; the ccMC state table for P0 and P1 starts at I and passes through S and M; annotations: shared, invalidate, Breq; data values 1234 and abcd]
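The slides do not spell out why the E-state is a problem. One plausible reading, based on standard MESI behavior, is that a write to a line held in E upgrades to M without any bus transaction, so the ccMC's record goes stale. The following is a minimal sketch under that assumption; the function names `run` and `p0_read_value` are hypothetical.

```python
# Hypothetical trace of "P1: read, P1: write, P0: read" with and without
# shared-signal assertion. Assumption: the ccMC only learns about state
# changes that appear as bus transactions.

def run(assert_shared):
    """Return (P1 cache state, ccMC record for P1, bus log) after P1's write."""
    bus_log = ["read"]                     # P1 read miss is visible on the bus
    p1_state = "S" if assert_shared else "E"
    ccmc_record = p1_state

    # P1: write (abcd)
    if p1_state == "S":
        bus_log.append("invalidate")       # S -> M upgrade appears on the bus
        ccmc_record = "M"
    # E -> M is silent: no transaction, so the ccMC record is not updated.
    p1_state = "M"
    return p1_state, ccmc_record, bus_log


def p0_read_value(ccmc_record, p1_data="abcd", memory_data="1234"):
    """P0: read. The ccMC issues a bus request to P1 only if it recorded M."""
    return p1_data if ccmc_record == "M" else memory_data


for assert_shared in (False, True):
    p1_state, record, log = run(assert_shared)
    print(assert_shared, p1_state, record, log, p0_read_value(record))
# False M E ['read'] 1234
# True M M ['read', 'invalidate'] abcd
```

With the shared signal always asserted, P1 never enters E, so its write is visible on the bus and P0 reads the up-to-date value.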

21 Previous Work (cont'd)
Snoop-hit buffer [2][3]
Region-Based Cache Coherence (RBCC) [2][3]
[Figures: a shared-bus platform with a memory controller, Proc 0 (MEI) behind Wrapper 0, Proc 1 (MESI) behind Wrapper 1, and a single-cache-line snoop-hit buffer (write-back to memory, then read); an RBCC platform with three processors behind Wrapper 0, Wrapper 1, and Wrapper 2, mixing MESI and MEI]
[2] Taeweon Suh, Hsien-Hsin S. Lee, and Douglas M. Blough, Integrating cache coherence protocols for heterogeneous multiprocessor systems, Part 1, In IEEE Micro, July/August 2004
[3] Taeweon Suh, Hsien-Hsin S. Lee, and Douglas M. Blough, Integrating cache coherence protocols for heterogeneous multiprocessor systems, Part 2, In IEEE Micro, September/October 2004