NoC for Cache Coherence
NoC Seminar, Technion
Vainbaum Yuri
Mentor: I. Keidar



Cache coherence problem in NUCA
The cache coherence problem appears when tasks running on different processors in the SoC share data stored in system memory. When a task T1 running on processor P1 modifies data shared with a task T2 running on processor P2, the copy of that data in P2's cache must be either updated or invalidated before P2 accesses it again.
[Figure: processors P1 to P4 with L2 caches holding blocks A and B, tasks T1 and T2, and an "update other L2$" message for block B.]
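The stale-copy situation described above can be reproduced with a minimal sketch. The `PrivateCache` class, its write-through behaviour and the dictionary standing in for system memory are assumptions made for illustration only; they are not taken from the slides.

```python
class PrivateCache:
    """Per-processor cache with no coherence mechanism (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory   # shared system memory, modelled as a dict
        self.lines = {}        # locally cached copies

    def read(self, addr):
        # On a miss, fetch from memory; on a hit, return the local copy.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Write-through is assumed here only to keep the example small.
        self.lines[addr] = value
        self.memory[addr] = value

    def invalidate(self, addr):
        # What a coherence protocol would trigger on the other caches.
        self.lines.pop(addr, None)


memory = {"A": 1}
p1, p2 = PrivateCache(memory), PrivateCache(memory)

p2.read("A")           # T2 on P2 caches A == 1
p1.write("A", 2)       # T1 on P1 modifies the shared data
stale = p2.read("A")   # P2 still returns its old copy: 1
p2.invalidate("A")     # the copy on P2 is invalidated
fresh = p2.read("A")   # the next access refetches the new value: 2
```

Without the invalidation, P2 keeps reading the old value indefinitely, which is exactly the case the protocol has to rule out.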

MESI: maintaining coherence in cached systems
Invalid: the line is not valid. Either the requested data is not in the cache, or the local copy is stale because another processor has updated the corresponding memory location.
Shared: unmodified and possibly shared. Another processor may hold the data in its cache, and all copies are up to date.
Exclusive: unmodified and exclusive. This cache is the only one holding the block, and its contents match main memory.
Modified: exclusive and modified. This cache holds the only correct copy in the whole system; the data in main memory is stale.
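The four states can be tied together with a small state-tracking sketch for a single cache line. It follows the textbook MESI transitions; the `CacheLine` class and its method names are hypothetical, and write-backs and data movement are not modelled.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class CacheLine:
    """Tracks the MESI state of one memory block in each of n caches."""

    def __init__(self, n_caches):
        self.states = [State.I] * n_caches

    def read(self, cpu):
        if self.states[cpu] is not State.I:
            return  # hit in M, E or S: no state change
        others = [i for i, s in enumerate(self.states)
                  if i != cpu and s is not State.I]
        for i in others:
            # A remote M copy would be written back here; M and E drop to S.
            self.states[i] = State.S
        # Exclusive if nobody else holds the block, otherwise Shared.
        self.states[cpu] = State.S if others else State.E

    def write(self, cpu):
        # All other copies are invalidated; the writer holds the only copy.
        for i in range(len(self.states)):
            if i != cpu:
                self.states[i] = State.I
        self.states[cpu] = State.M


line = CacheLine(2)
line.read(0)
after_first_read = list(line.states)    # [E, I]
line.read(1)
after_second_read = list(line.states)   # [S, S]
line.write(0)
after_write = list(line.states)         # [M, I]
line.read(1)
after_remote_read = list(line.states)   # [S, S]
```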

In-Network Cache Coherence
Proposal: implement the coherence protocol and the directories within the network, at each router node. This opens up the possibility of optimizing the protocol with in-transit actions.
Source: In-Network Cache Coherence, Noel Eisley, The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

In-Network Read optimization (directory-based MSI)
With a directory-based MSI protocol, a read takes three end-to-end messages.
[Figure: nodes H, A and B; messages: read request, forward to sharer, data.]
Source: In-Network Cache Coherence, Noel Eisley, The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

In-Network Read optimization (in-network MSI)
While the read request from node B is in transit to the home node H, it "bumps" into node A, and B obtains the data directly from A.
[Figure: nodes H, A and B; messages: read request, data.]
Source: In-Network Cache Coherence, Noel Eisley, The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

In-Network Write optimization (directory-based MSI)
[Figure: nodes H, A, B and C; messages: write request, Inv, Ack, data.]
Source: In-Network Cache Coherence, Noel Eisley, The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

In-Network Write optimization (in-network MSI)
This in-transit optimization can reduce write communication from two round trips to a single round trip from C to H and back.
[Figure: nodes H, A, B and C; messages: write request, Inv+Ack, data.]
Source: In-Network Cache Coherence, Noel Eisley, The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

In-Network cache coherence protocol
Idea: move the coherence directories from the nodes into the network fabric.
Virtual trees, one per cache line, are maintained within the network in place of coherence directories to keep track of sharers.
A virtual tree consists of a root node R (the node that first loads the cache line from off-chip memory), all nodes sharing the line, and the intermediate nodes between the root and the sharers.
The nodes of a tree are connected by virtual links.
Virtual trees are stored in virtual tree caches at each router in the network.
Reads and writes are routed towards the home node. If they encounter a virtual tree in transit, the virtual tree takes over as the routing function and steers read requests and write invalidates towards the sharers instead.
[Figure: home node H and root node R.]
Source: In-Network Cache Coherence, Noel Eisley, The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

In-Network read access example
New read request (Read1, issued by R1):
1. Travels towards the home node H.
2. Loads the line from off-chip memory.
3. Constructs the virtual tree.
Second read request to the same line (Read2, issued by R2):
4. Hits the virtual tree on the way to the home node.
5. Is steered to the nearest copy.
6. Returns the data and constructs new virtual tree links.
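The two read requests in this example can be walked through with a rough sketch. Everything here is a simplifying assumption rather than the paper's mechanism: routers sit on a 2D mesh with X-then-Y routing, a virtual tree is reduced to the set of routers it covers, the path between the first requester and the home node is taken as the initial tree, and "nearest copy" is chosen by Manhattan distance from the router where the request hits the tree. The names `xy_path` and `VirtualTrees` are hypothetical.

```python
def xy_path(src, dst):
    """Routers visited by X-then-Y routing, with src and dst included."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


class VirtualTrees:
    def __init__(self):
        self.tree_nodes = {}  # line -> routers on that line's virtual tree
        self.copies = {}      # line -> nodes currently caching the line

    def read(self, requester, home, line):
        path = xy_path(requester, home)
        nodes = self.tree_nodes.setdefault(line, set())
        for hops, router in enumerate(path):
            if router in nodes:
                # Step 4: the request hits the tree before reaching home.
                # Step 5: steer to the nearest copy (Manhattan distance).
                source = min(
                    self.copies[line],
                    key=lambda c: abs(c[0] - router[0]) + abs(c[1] - router[1]),
                )
                # Step 6: the routers traversed so far join the tree.
                nodes.update(path[:hops + 1])
                self.copies[line].add(requester)
                return ("tree hit", router, source)
        # Steps 1-3: no tree on the way, so the home node loads the line
        # from off-chip memory and the tree is built along the path.
        nodes.update(path)
        self.copies[line] = {requester}
        return ("off-chip load", home, None)


trees = VirtualTrees()
home = (3, 3)
first = trees.read((0, 0), home, line=0x40)    # Read1 from R1
second = trees.read((1, 2), home, line=0x40)   # Read2 from R2
```

In this run Read1 goes all the way to the home node, while Read2 meets the tree at router (3, 2) and is served from R1 at (0, 0) without reaching the home node.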

In-Network Router micro-architecture
The virtual tree cache steers head flits towards the appropriate output ports, pointing them towards the caches that hold the most up-to-date copy of the requested data.
The memory address contained in each packet's header is first parsed; if the tag matches, there is a hit in the tree cache, and its prescribed direction is used as the desired output port.
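The lookup described above can be sketched as a direct-mapped tag match. The address split (64-byte lines, 16 entries), the single output port per entry and the `TreeCache` class are assumptions for illustration; the slide does not give the actual field layout or entry format.

```python
LINE_OFFSET_BITS = 6   # assumed 64-byte cache lines
INDEX_BITS = 4         # assumed 16-entry, direct-mapped tree cache


class TreeCache:
    """Hypothetical per-router tree cache mapping a cache line to a port."""

    def __init__(self):
        # Each entry is None or a (tag, output_port) pair.
        self.entries = [None] * (1 << INDEX_BITS)

    @staticmethod
    def split(addr):
        """Parse an address into (tag, index); the line offset is dropped."""
        line = addr >> LINE_OFFSET_BITS
        return line >> INDEX_BITS, line & ((1 << INDEX_BITS) - 1)

    def install(self, addr, port):
        tag, index = self.split(addr)
        self.entries[index] = (tag, port)

    def route(self, addr, default_port):
        tag, index = self.split(addr)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]      # hit: follow the virtual tree
        return default_port      # miss: keep routing towards the home node


tc = TreeCache()
tc.install(0x1F40, "WEST")
hit = tc.route(0x1F40, default_port="NORTH")        # "WEST"
same_line = tc.route(0x1F7F, default_port="NORTH")  # "WEST" (same 64-byte line)
miss = tc.route(0x5F40, default_port="NORTH")       # "NORTH" (same index, other tag)
```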

In-Network Results & Summary
The paper proposes an approach to cache coherence for chip multiprocessors in which the coherence protocol and the directories are all embedded within the network routers.
The approach has a low hardware overhead which, compared to the standard directory protocol, quickly turns into hardware savings as the number of cores per chip increases.
Source: In-Network Cache Coherence, Noel Eisley, The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

DCOS: Directory Cache On a Switch
Goal: reduce cache-to-cache data transfer time.
The proposed architecture is implemented inside each switch.
4x2 2D mesh topology, MIPS R10000 core model, directory-based cache coherence with the MSI protocol.
Source: DCOS: Cache Embedded Switch Architecture for Distributed Shared Memory Multiprocessor SoCs, Daewook Kim, 2006 IEEE.

DCOS: Directory Cache On a Switch
The state entry assigned to a memory block holds the current state of the block: empty, shared, or modified/invalid.
Empty (marked "E"): no data items have been copied to caches or memories.
Source: DCOS: Cache Embedded Switch Architecture for Distributed Shared Memory Multiprocessor SoCs, Daewook Kim, 2006 IEEE.

DCOS: Directory Cache On a Switch
Data is shared with other caches and memories.
Source: DCOS: Cache Embedded Switch Architecture for Distributed Shared Memory Multiprocessor SoCs, Daewook Kim, 2006 IEEE.

DCOS: Switch architecture
All directory caches are embedded within the crossbar switch.

DCOS: Results
Source: DCOS: Cache Embedded Switch Architecture for Distributed Shared Memory Multiprocessor SoCs, Daewook Kim, 2006 IEEE.

Cache Coherency Communication Cost
How costly is cache coherency in interconnection terms? The paper sets out to answer this question.
A directory-based mechanism maintains coherence among all caches in the system.
Source: Cache Coherency Communication Cost in a NoC-based MPSoC Platform, Gustavo Girão, SBCCI'07, September 3–6, 2007, Rio de Janeiro, Brazil.

Cache Coherency Communication Cost
For almost all cache sizes, regular operations put much more data on the NoC than cache coherence maintenance does.
Increasing the cache size decreases the amount of data for regular operations, so the share of cache coherence traffic becomes more significant.
Source: Cache Coherency Communication Cost in a NoC-based MPSoC Platform, Gustavo Girão, SBCCI'07, September 3–6, 2007, Rio de Janeiro, Brazil.

Cache Coherency Communication Cost
The graph (8 CPUs, 1 directory) shows that, for small cache sizes, page replacement requests account for most of the load injected by cache coherence. This happens because the number of replacements increases as the cache size decreases.
Source: Cache Coherency Communication Cost in a NoC-based MPSoC Platform, Gustavo Girão, SBCCI'07, September 3–6, 2007, Rio de Janeiro, Brazil.

BENoC: Bus-Enhanced Network on Chip
Bus: a low-latency, low-bandwidth specialized bus, optimized for system-wide distribution of control signals (ack, invl).
Network: a high-performance distributed network that handles high-throughput data communication between pairs of modules.
Source: BENoC: A Bus-Enhanced Network on-Chip for a Power Efficient CMP, Isask'har Walter, Israel Cidon, and Avinoam Kolodny.

BENoC: Bus-Enhanced Network on Chip
β reflects the network-to-bus broadcast latency ratio.
n is the number of modules in the system.
When broadcast operations are compared, the bus is considerably more energy efficient than the network.
Source: BENoC: A Bus-Enhanced Network on-Chip for a Power Efficient CMP, Isask'har Walter, Israel Cidon, and Avinoam Kolodny.

Network topology awareness
The cache coherence protocol should be aware of the network topology: invalidation messages are sent according to the sharers' distances from the directory.
The writer has to wait for the acknowledgment of the furthest invalidation, so the Invl to P3 is sent first.
[Figure: P1, P2 and P3, each with an L2 cache, and the Invl messages.]

Network topology awareness
At each transaction, determine the furthest sharing node and send its invalidation first.
The total delay is then the round-trip time of the invalidation/acknowledgment to the furthest node.
Sending the long-delay round-trip messages first masks the short-delay messages.
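The ordering rule above can be checked with a small timing sketch. The model is an assumption, not something the slides specify: sharers sit on a 2D mesh, distance is the Manhattan hop count, the directory injects one invalidation per time unit, and a round trip costs twice the hop count. The function names are hypothetical.

```python
def hops(a, b):
    """Manhattan distance between two mesh coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def completion_time(directory, order, hop_latency=1, inject_interval=1):
    """Time until the last acknowledgment returns to the directory.

    Invalidations are injected one per inject_interval, in the given order;
    each one needs a full round trip to its sharer and back.
    """
    return max(
        i * inject_interval + 2 * hops(directory, sharer) * hop_latency
        for i, sharer in enumerate(order)
    )


def furthest_first(directory, sharers):
    """Order sharers so the longest round trip is started first."""
    return sorted(sharers, key=lambda s: hops(directory, s), reverse=True)


directory = (0, 0)
sharers = [(1, 0), (2, 1), (3, 3)]           # 1, 3 and 6 hops away

naive = completion_time(directory, sharers)  # nearest first: 14
order = furthest_first(directory, sharers)   # [(3, 3), (2, 1), (1, 0)]
aware = completion_time(directory, order)    # 12
```

With the furthest sharer served first, the total delay equals the round trip to that sharer (2 x 6 hops), and the shorter round trips complete in its shadow.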

References
In-Network Cache Coherence, Noel Eisley, The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
DCOS: Cache Embedded Switch Architecture for Distributed Shared Memory Multiprocessor SoCs, Daewook Kim, 2006 IEEE.
BENoC: A Bus-Enhanced Network on-Chip for a Power Efficient CMP, Isask'har Walter, Israel Cidon, and Avinoam Kolodny.
Cache Coherency Communication Cost in a NoC-based MPSoC Platform, Gustavo Girão, SBCCI'07, September 3–6, 2007, Rio de Janeiro, Brazil.
Teaching the Cache Memory Coherence with the MESI Protocol Simulator, F. J. Jiménez (1), J. Gómez (1), A. Mesones (1), E. Herruzo (1), J. I. Benavides (1) and F. J. Sánchez (2). (1) Dpto. Electrotecnia y Electrónica, Escuela Politécnica Superior, Universidad de Córdoba, Av. Menéndez Pidal s/n, Córdoba, Spain.
On Cache Coherency and Memory Consistency Issues in NoC Based Shared Memory Multiprocessor SoC Architectures, Proceedings of the 9th EUROMICRO Conference on Digital System Design (DSD'06).
Exploration of Distributed Shared Memory Architectures for NoC-based Multiprocessors, Matteo Monchiero, Gianluca Palermo, Cristina Silvano, Oreste Villa.