1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

Presentation transcript:

1 E. Bolotin – The Power of Priority, NoCs 2007. The Power of Priority: NoC based Distributed Cache Coherency. Evgeny Bolotin, Zvika Guz, Israel Cidon, Ran Ginosar, Avinoam Kolodny. QNoC Research Group, EE Department, Technion, Haifa, Israel.

2 Chip Multi-Processor (CMP). Dual-core: monolithic shared cache. Multi-core: large cache, shared cache, distributed cache. NoC-based: how?

3 Future Cache - Physics Perspective. Large cache → large access time. [Figure: global wire delay vs. gate delay. Source: ITRS 2003.] [Figure: fraction of chip reachable in 1 clock cycle. Source: Keckler et al., ISSCC 2003.] Distance reached in a single cycle: today ~25% of chip; in 10 years ~1% of chip. A large monolithic cache is not scalable.

4 NUCA - Non Uniform Cache Architecture. NUCA = non-uniform access times. Banked cache over NoC: smaller bank → smaller access time; multiple banks → multiple ports; closer bank → smaller access time. Cache-line placement policy: Static NUCA (SNUCA) or Dynamic NUCA (DNUCA). Sources: Kim et al., ASPLOS 2002; Beckmann et al., MICRO 2004.
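
The static placement idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the design from the talk: the 4x4 bank grid, the 64-byte line size, interleaving on low-order line-address bits, and the names `bank_of`, `bank_xy` and `hops` are all hypothetical.

```python
# Hypothetical SNUCA-style mapping: the bank is a fixed function of the address.
LINE_BYTES = 64               # assumed cache-line size
GRID_W, GRID_H = 4, 4         # assumed mesh of cache banks
NUM_BANKS = GRID_W * GRID_H

def bank_of(addr):
    """Static placement: interleave consecutive cache lines across banks."""
    return (addr // LINE_BYTES) % NUM_BANKS

def bank_xy(bank):
    """Mesh coordinates (x, y) of a bank, numbered row by row."""
    return bank % GRID_W, bank // GRID_W

def hops(src_xy, bank):
    """Manhattan distance from a requester to a bank.
    More hops means a larger access time: the 'non-uniform' part."""
    bx, by = bank_xy(bank)
    return abs(src_xy[0] - bx) + abs(src_xy[1] - by)
```

With these assumptions, a processor at mesh position (0, 0) reaches bank 5 in `hops((0, 0), 5)` = 2 hops but needs 6 hops for bank 15. A dynamic policy (DNUCA) would instead let lines migrate toward the banks closest to their users, at the price of a search step.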

5 Issues in NUCA-based CMP. NoC performance → CMP performance; cache coherency and transaction order (correctness); search (in DNUCA); different traffic types (e.g. fetch vs. prefetch); synchronization (locks). NoC services for CMP?

6 Cache Coherency over NoC. How do we maintain coherency over NoC? Snooping; central directory; distributed directory. [Figure: cache bank with distributed directory.]

7 Distributed Cache Coherency. Example: simple read transaction. Cache access → multiple NoC transactions. [Figure legend: ctrl. packet, data packet.]

8 Read Transaction of Modified Block. [Figure legend: ctrl. packet, data packet.]

9 Read Exclusive of Shared Block. [Figure legend: ctrl. packet, data packet.]
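
The three transactions above appear only as figures, so the exact message flows are not recoverable from the transcript. The sketch below assumes a generic directory protocol to show how one cache access might fan out into several ctrl. and data packets. Every hop, and the helper names `read_shared`, `read_modified`, `read_exclusive_shared` and `count`, are assumptions for illustration.

```python
# Assumed message sequences for a generic directory protocol; the slides show
# these flows only as figures, so every hop listed here is illustrative.
CTRL, DATA = "ctrl", "data"

def read_shared(requester, home):
    """Simple read: request to the home bank, data back."""
    return [(requester, home, CTRL), (home, requester, DATA)]

def read_modified(requester, home, owner):
    """Read of a block that another processor holds modified."""
    return [
        (requester, home, CTRL),   # read request
        (home, owner, CTRL),       # forward to the current owner
        (owner, requester, DATA),  # owner supplies the block
        (owner, home, DATA),       # write-back to the home bank
    ]

def read_exclusive_shared(requester, home, sharers):
    """Read-exclusive of a block that is shared by other processors."""
    msgs = [(requester, home, CTRL)]
    msgs += [(home, s, CTRL) for s in sharers]   # invalidations
    msgs += [(s, home, CTRL) for s in sharers]   # acknowledgements
    msgs.append((home, requester, DATA))         # block with ownership
    return msgs

def count(msgs):
    """Return (number of ctrl packets, number of data packets)."""
    kinds = [kind for _, _, kind in msgs]
    return kinds.count(CTRL), kinds.count(DATA)
```

Under these assumptions a read-exclusive with two sharers costs five short ctrl. packets and one long data packet, which is the kind of traffic mix the later priority slides exploit.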

10 Basic NoC to Support CMP. Off-the-shelf (vanilla) NoC: grid of wormhole routers; unicast only; ordering in network (static routing, no virtual channels). Smart interfaces. Can we do better? [Figure: vanilla NoC.]
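
The slide names static routing but not a specific algorithm. A common static scheme on a grid is dimension-ordered (XY) routing, sketched here as an assumption rather than as the router used in the talk; `xy_route` is a hypothetical helper.

```python
def xy_route(src, dst):
    """Dimension-ordered routing on a mesh: travel along X first, then Y.
    Returns the list of router coordinates visited, including src and dst."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```

Because the path depends only on the source and destination, all packets between a given pair follow the same route. With a single buffer per port (no virtual channels) they cannot overtake one another, which is one way the network itself can preserve ordering.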

11 Observations: L2 Access. A) Delay = queueing + NoC transactions. B) All NoC transactions are equally important. C) NoC transactions consist of short ctrl. packets and long data packets. Idea: differentiate between ctrl. and data. Solution: preemptive priority NoC → give priority to short ctrl. packets.

12 Preemptive Priority NoC: QNoC. QNoC service levels (SL): dedicated wormhole buffer; preemptive priority scheduling. [Figures: multiple-SL link; multiple-SL router.]
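
Preemptive priority scheduling on a single link can be illustrated with a small cycle-by-cycle model. This is a sketch under assumptions, not the QNoC router: it models one output link, one buffer per service level, one flit per cycle, and preemption at flit boundaries. The function `simulate_link` and its packet tuple format are hypothetical.

```python
from collections import deque

def simulate_link(packets, num_sl=2):
    """packets: list of (name, arrival_cycle, service_level, flits), flits >= 1.
    Service level 0 is the highest priority.
    Returns {name: cycle at which the packet's last flit has been sent}."""
    pending = sorted(packets, key=lambda p: p[1])
    queues = [deque() for _ in range(num_sl)]   # one buffer per service level
    finish, cycle, i = {}, 0, 0
    while len(finish) < len(packets):
        # Enqueue every packet that has arrived by this cycle.
        while i < len(pending) and pending[i][1] <= cycle:
            name, _, sl, flits = pending[i]
            queues[sl].append([name, flits])
            i += 1
        # Flit-level preemption: serve the highest-priority non-empty buffer.
        for q in queues:
            if q:
                q[0][1] -= 1                    # send one flit
                if q[0][1] == 0:
                    finish[q[0][0]] = cycle + 1
                    q.popleft()
                break
        cycle += 1
    return finish
```

For example, with an 8-flit data packet arriving at cycle 0 on SL 1 and a 1-flit ctrl. packet arriving at cycle 2 on SL 0, the ctrl. packet finishes at cycle 3 and the data packet at cycle 9. Putting both on a single service level (`num_sl=1`) makes the ctrl. packet wait until cycle 9.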

13 Example: Vanilla NoC. Without contention: X = delay of long packet, δ = delay of short packet. Transaction 1: long data. Transaction 2: short req., long resp. Blue delay ~ X; red delay ~ 2X+δ; average delay ~ 1.5X. [Figure: vanilla NoC example, nodes A and B.]

14 Example: Priority NoC. Without contention: X = delay of long packet, δ = delay of short packet. Transaction 1: long data. Transaction 2: short req., long resp. Vanilla NoC example: blue delay = X; red delay = 2X+δ; average delay ~ 1.5X. Priority NoC example: blue delay = X+δ; red delay = X+δ; average delay ~ X. Potential delay reduction ~ 0.5X. [Figure: nodes A and B.]
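
The arithmetic behind this example can be checked directly. The sketch assumes that blue is transaction 1 (the long data packet) and red is transaction 2 (short request followed by long response), and that the short request is stuck behind the long packet in the vanilla case. The slide gives only the resulting delays, so that reading and the function names are assumptions.

```python
def vanilla_delays(X, d):
    """No priorities: the short request waits behind the long data packet."""
    blue = X            # long data packet goes first, unobstructed
    red = X + d + X     # wait X, send request (d), then long response (X)
    return blue, red

def priority_delays(X, d):
    """Short ctrl. packets preempt long data packets."""
    blue = X + d        # long packet is interrupted for d by the request
    red = d + X         # request is not blocked; the response follows
    return blue, red

def average(delays):
    return sum(delays) / len(delays)

def reduction(X, d):
    """Average delay saved by the priority NoC in this two-transaction case."""
    return average(vanilla_delays(X, d)) - average(priority_delays(X, d))
```

The exact saving is (X - δ)/2, which approaches the slide's ~0.5X when δ is much smaller than X. For X = 10 and δ = 1 the averages are 15.5 and 11, a reduction of 4.5.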

15 Priority NoC: Different Destinations. Very important in wormhole routing, when a ctrl. packet is blocked by other worms. [Figure: short req., long data.]

16 Protocol Correctness. Need state-preserving serialization of transactions in the processor interface.

17 Numerical Evaluation. CMP simulator (SIMICS): simulate parallel benchmarks; obtain L2-cache access traces. QNoC simulator (OPNET): simulate distributed coherence protocol over NoC; measure total RD/RX L2-access delay; measure total program throughput.

18 Priority NoC: Results. Short ctrl. packet gets high priority; long data packet gets low priority. Delay reduction vs. network load. [Figures: RD delay - Apache; RD/RX delay reduction - Apache.]

19 Priority NoC: Several Benchmarks. [Figures: delay reduction; program speedup.]

20 So Far: The Power of Priority. Simplicity - almost for free. Significant CMP speed-up. Good for: coherency; traffic differentiation (e.g. fetch vs. pre-fetch); search in DNUCA; synchronization (locks).

21 Advanced Support Functions. Special broadcast for short messages: broadcast service (e.g. search in DNUCA); wormhole broadcast is slow and expensive; S&F (store-and-forward) broadcast embedded in wormhole. Virtual ring: no additional cost; for invalidation multicast; snooping or synchronization.
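
The slide does not say how the virtual ring is laid over the grid. One possible construction, assuming a 2D mesh with an even number of rows, is a Hamiltonian cycle in which each router forwards to a physical neighbour, so the ring reuses existing links. The function `virtual_ring` below is a hypothetical sketch of that idea, not the scheme from the talk.

```python
def virtual_ring(rows, cols):
    """Return a cyclic visiting order of all (row, col) routers such that
    consecutive routers, and the last and first, are mesh neighbours.
    This construction needs an even number of rows and at least two columns."""
    if rows < 2 or rows % 2 or cols < 2:
        raise ValueError("needs an even number of rows and cols >= 2")
    ring = [(0, c) for c in range(cols)]              # row 0, left to right
    for r in range(1, rows):                          # snake over columns 1..cols-1
        cs = range(cols - 1, 0, -1) if r % 2 else range(1, cols)
        ring += [(r, c) for c in cs]
    ring += [(r, 0) for r in range(rows - 1, 0, -1)]  # return up column 0
    return ring
```

A short message injected at any router and passed to its ring successor visits every router exactly once before returning, which is the property an invalidation or snoop message would need.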

22 Summary. NoC at CMP service! Shared cache over NoC. Priority is powerful. Built-in support functions.