Interaction of NoC Design and Coherence Protocol in 3D-Stacked CMPs
Pablo Abad, Pablo Prieto, Lucia G. Menezo, Adrian Colaso, Valentin Puente and Jose-Angel Gregorio
University of Cantabria
3D-NET@DSD2013
Must Increase On-Chip Storage Capacity
The Memory Wall
- Processor speed improves far faster than DRAM speed.
- Larger latencies to access data in main memory.
New Problem: the Bandwidth Wall
- Core count increases faster than off-chip I/O bandwidth (pin- and power-limited).
- Off-chip bandwidth becomes a scarce resource.
- Contention further increases latency.
Conclusion: we must increase on-chip storage capacity.
Must Increase On-Chip Storage Capacity
Two options to increase on-chip capacity:
3D-Stacking (Through-Silicon Vias)
- On-chip bandwidth improvement.
- Minimal latency in the Z dimension.
- Open issues: power? temperature?
Non-SRAM technology (PCRAM, STT-RAM, ...)
- More density in the same area.
- Minimal static power.
- Open issues: endurance?
(Figure: a planar Cores+L1+LLC die compared with a Cores+L1 layer stacked over an LLC layer.)
This work studies the interaction between three design dimensions: the Coherence Protocol, the Interconnection Network Organization, and 3D Stacking.
Outline
- Motivation
- Introduction
  - Broadcast-Based Coherence
  - 3D stacking - Coherence - Network interaction
- NoC Support
  - Class-Aware Routing
  - Congestion-Aware Missrouting
  - Deadlock Avoidance
  - Critical Flit First
- Evaluation
- Conclusions & Future Work
Broadcast-Based Coherence (Token)
Broadcast protocols:
- Efficient cache-to-cache transfers.
- Avoid indirection, but have higher bandwidth requirements.
- The on-chip environment offers high bandwidth availability.
- Examples: AMD Opteron, IBM Power7, Intel QuickPath.
TokenB: a fixed number of tickets (tokens) is associated with each block.
- One token to read.
- All tokens to write.
- Coherence is enforced by counting/exchanging tokens, as sketched below.
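To make the token rule concrete, a minimal sketch of TokenB-style permission checks follows. It only illustrates the counting rule stated above; the names (TokenLine, TOTAL_TOKENS, absorbTokens) and the fixed token count are assumptions for illustration, not the protocol's actual implementation.

```cpp
// Illustrative sketch of TokenB-style permission checks (assumed names,
// not the real protocol code). Coherence follows from token counting:
// at least one token grants read permission, all tokens grant write.
#include <cassert>
#include <cstdint>

constexpr int TOTAL_TOKENS = 16;      // assumed: one token per potential sharer

struct TokenLine {
    std::uint64_t tag    = 0;
    int           tokens = 0;         // tokens currently held for this block
    bool          owner  = false;     // the owner-token holder supplies data

    bool canRead()  const { return tokens >= 1; }              // one token to read
    bool canWrite() const { return tokens == TOTAL_TOKENS; }   // all tokens to write
};

// On a write miss the requester broadcasts and collects tokens from every
// holder; once all of them have been absorbed, the write may proceed.
void absorbTokens(TokenLine& line, int received, bool ownerToken) {
    line.tokens += received;
    line.owner   = line.owner || ownerToken;
    assert(line.tokens <= TOTAL_TOKENS);
}
```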
Broadcast-Based Coherence (example)
(Figure: a tiled layout with per-core L1s, L2 slices and a main memory controller. Core A issues a LOAD that misses in its L1, and the request is broadcast to the other L1/L2 caches and to the main memory controller.)
3D-Coherence-Network Interaction
(Figure: a two-layer 3D organization. The CORE-L1 layer contains routers R0-R8 attached to the cores and their L1 caches; the LLC layer contains routers R9-R17 attached to the L2 slices.)
3D-Coherence-Network Interaction
A request-reply transaction takes 8 link traversals at the Core-L1 layer vs. only 2 link traversals at the LLC layer.
The result is unbalanced network utilization across layers.
3D-Coherence-Network Interaction
Routing restrictions (imposed to avoid deadlock) delay some transactions:
- In the example, XYZ routing artificially delays LLC requests, routing them through congested resources.
- Dimension-Ordered Routing also affects L1-to-L1 requests, because of the congestion level at the Core-L1 layer.
Outline (recap): next section - NoC Support
Class-Aware Routing
How do we solve the LLC-request delay problem?
- If we change routing from XYZ to ZYX we fix this issue... but then we degrade reply latency.
- We must therefore avoid a single global routing strategy and move to per-message-class routing: requests are routed in ZYX order, while replies keep the original XYZ order (see the sketch below).
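A minimal sketch of what per-class dimension ordering might look like in a router's routing function. The types and names (MsgClass, Dir, route) are illustrative assumptions, not the simulator's actual interface; only the ZYX-for-requests / XYZ-for-replies split comes from the slide.

```cpp
// Sketch of class-aware dimension-ordered routing: requests go ZYX
// (dropping to the lightly used LLC layer first), replies keep XYZ.
enum class MsgClass { Request, Reply };
enum class Dir { Xpos, Xneg, Ypos, Yneg, Zpos, Zneg, Local };

struct Coord { int x, y, z; };

static Dir step(int cur, int dst, Dir pos, Dir neg) {
    return (dst > cur) ? pos : neg;
}

Dir route(MsgClass cls, Coord cur, Coord dst) {
    if (cls == MsgClass::Request) {                          // ZYX order
        if (cur.z != dst.z) return step(cur.z, dst.z, Dir::Zpos, Dir::Zneg);
        if (cur.y != dst.y) return step(cur.y, dst.y, Dir::Ypos, Dir::Yneg);
        if (cur.x != dst.x) return step(cur.x, dst.x, Dir::Xpos, Dir::Xneg);
    } else {                                                 // XYZ order
        if (cur.x != dst.x) return step(cur.x, dst.x, Dir::Xpos, Dir::Xneg);
        if (cur.y != dst.y) return step(cur.y, dst.y, Dir::Ypos, Dir::Yneg);
        if (cur.z != dst.z) return step(cur.z, dst.z, Dir::Zpos, Dir::Zneg);
    }
    return Dir::Local;                                       // already at destination
}
```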
Congestion-Aware Missrouting
What about messages with both source and destination at the Core-L1 layer?
- Requests to other L1 caches find a lot of contention when accessing shared resources, and they are on the critical path.
- If we find congested links at intermediate nodes, we can missroute these messages to the LLC layer.
- Missrouted messages can reach their destination faster thanks to the much lower contention there: a longer distance, but better latency (see the sketch below).
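The diversion decision can be expressed as a small predicate evaluated at each intermediate Core-L1 router. This is only a sketch under assumptions: the occupancy-based congestion metric and the 75% threshold are placeholders for whatever signal the router actually exposes.

```cpp
// Sketch of the missrouting decision for an L1-to-L1 message travelling
// in the Core-L1 layer: if the productive planar output port looks
// congested, divert the message down to the LLC layer, where contention
// is much lower (longer path, but better latency).
struct PortStatus {
    int bufferOccupancy;   // flits currently queued at the output port
    int bufferDepth;       // total buffer capacity of that port
};

constexpr double CONGESTION_THRESHOLD = 0.75;   // assumed value

bool shouldMissrouteDown(const PortStatus& planarPort, bool alreadyMissrouted) {
    if (alreadyMissrouted) return false;        // divert a message at most once
    double load = static_cast<double>(planarPort.bufferOccupancy) /
                  static_cast<double>(planarPort.bufferDepth);
    return load > CONGESTION_THRESHOLD;
}
```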
Deadlock Avoidance
The previous mechanisms must keep the network deadlock-free.
- Virtual channels avoid end-to-end (protocol) deadlock, so we only need to worry about routing deadlock.
- Virtual channels also help to eliminate cyclic dependencies for both proposed solutions.
Class-Aware Routing:
- Each message class employs its own buffering resources, so no cycles can be formed between requests (ZYX) and replies (XYZ).
Congestion-Aware Missrouting:
- Once a message is missrouted, it must continue through the LLC layer until it reaches its destination. This way Z+→X and Z+→Y turns are never taken and deadlock is avoided (see the sketch below).
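The constraint on missrouted messages can be captured as a routing rule rather than a list of forbidden turns. The sketch below assumes that Z- drops to the LLC layer and Z+ returns to the Core-L1 layer; the names are illustrative. The key property is that a missrouted message only takes its Z+ hop once it is directly under its destination, so a Z+ hop is never followed by an X or Y turn.

```cpp
// Sketch of the route followed by a missrouted L1-to-L1 message once it
// has been diverted to the LLC layer: all planar (X/Y) hops happen in the
// LLC layer, and the single Z+ hop back to the Core-L1 layer is taken
// only when the message is already under its destination. No planar hop
// can ever follow Z+, so the Z+ -> X and Z+ -> Y turns never occur.
struct Coord { int x, y, z; };

enum class Hop { X, Y, Zup, Deliver };

Hop nextHopMissrouted(Coord cur, Coord dst) {
    if (cur.x != dst.x) return Hop::X;      // planar hop in the LLC layer
    if (cur.y != dst.y) return Hop::Y;      // planar hop in the LLC layer
    if (cur.z != dst.z) return Hop::Zup;    // single vertical hop back up
    return Hop::Deliver;                    // eject at the destination router
}
```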
Critical Flit First
Network support for the Critical Word First technique:
- Memory blocks are larger than the words requested by the processor, so the missed word is given priority and requested from memory first.
- Network messages are usually broken into smaller pieces (flits) with a size similar to processor words.
- Block re-ordering can therefore be implemented by the network components with very low overhead, as sketched below.
Conventional packet: Header, Body (flit 1), Body (flit 2), Body (flit 3), Tail (flit 4).
CFF packet (requested word in flit 2): Header, Body (flit 2), Body (flit 3), Body (flit 1), Tail (flit 4).
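A sketch of the flit re-ordering follows. The packet layout (a header, several body flits, a tail) and the wrap-around order come from the slide's example; the function name and 0-based indexing are assumptions.

```cpp
// Sketch of Critical-Flit-First packetization: the body flit holding the
// missed word is sent right after the header, the remaining body flits
// follow in wrap-around order, and the tail closes the packet as usual.
#include <vector>

std::vector<int> bodyFlitOrderCFF(int numBodyFlits, int criticalFlit) {
    std::vector<int> order;
    for (int i = 0; i < numBodyFlits; ++i)
        order.push_back((criticalFlit + i) % numBodyFlits);  // rotate from the critical flit
    return order;
}

// Example matching the slide: 3 body flits with the requested word in
// flit 2 (0-based index 1) gives the body order 2, 3, 1, i.e.
// Header, Body(2), Body(3), Body(1), Tail.
```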
Outline (recap): next section - Evaluation
Evaluation (Simulation Framework)

Simulation infrastructure:
- Simics: full-system simulation.
- GEMS: timing infrastructure; substitutes the Simics models of some components.
- Opal: detailed processor simulation.
- Ruby: memory hierarchy implementation.
- TOPAZ: replaces the Ruby network models, with near-RTL detail level.

Simulated system (processor configuration):
- Number of cores: 16 @ 4GHz
- IWin size / issue: 128 / 4-way
- L1 cache size/assoc/block/time: 32KB, 2-way, 64B, 2 cyc
- Outstanding memory operations: 16
- L2 cache size/assoc/block size/time: 16MB / 16-way / 64B / 5 cyc
- NUCA mapping: static, interleaved across slices
- Memory capacity/access time/controllers/BW: 4GB / 250 cyc / 4 (centered) / 320GB/s
- Network topology/link latency/link width: 4x4x2 mesh / 1 cyc / 128 bits (or 64)
- Router latency/buffer size/routing: 3 cyc / 10 flits per VC / DOR

Workloads:
- Wisconsin Commercial Workload Suite: Apache (task-parallel web server), Jbb (Java middleware application), Zeus (pipelined web server), Oltp (pseudo TPC-C on-line transaction processing).
- NAS Parallel Benchmarks: FT (3-D partial differential equation solution using FFTs), IS (integer sort), SP (scalar pentadiagonal solver), MG (multi-grid on a sequence of meshes), LU (LU solver).
Evaluation (Improved Routing)
(Figures: results for Class-Aware Routing and Congestion-Aware Missrouting, broken down by layer (L1-Core, LLC) and dimension (X, Y, Z).)
Evaluation (Critical Flit First)
(Figure: per-flit-type results (HEAD, DATA, TAIL), split into base latency and spooling.)
Evaluation (All Together)
Outline (recap): next section - Conclusions & Future Work
Conclusions & Future Work
- Studying the network, the 3D organization and the traffic structure (coherence protocol) together can significantly improve CMP performance.
- Small but smart router modifications provide these improvements with minimal hardware overhead (energy & area).
- Adaptive routing policies could further improve the present results.
- Routing strategies with targets other than performance (temperature?) could also be interesting.
Thanks for your attention