18-742 Spring 2011 Parallel Computer Architecture Lecture 14: Interconnects III Prof. Onur Mutlu Carnegie Mellon University.


Announcements
No class Wednesday (Feb 23)
No class Friday (Feb 25) – Open House
Project Milestone due Thursday (Feb 24)

Reviews
Due Today (Feb 21)
  Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.
  Chappell et al., “Simultaneous Subordinate Microthreading (SSMT),” ISCA 1999.
Due Sunday (Feb 27)
  Reinhardt and Mukherjee, “Transient Fault Detection via Simultaneous Multithreading,” ISCA 2000.
  Mukherjee et al., “Detailed Design and Evaluation of Redundant Multithreading Alternatives,” ISCA 2002.

Last Lectures
Interconnection Networks
  Introduction & Terminology
  Topology
  Buffering and Flow control
  Routing
  Router design
  Network performance metrics
  On-chip vs. off-chip differences
Research on NoCs and packet scheduling
  The problem with packet scheduling
  Application-aware packet scheduling
  Aergia: Latency slack based packet scheduling

Today
Recap interconnection networks
Multithreading

Some Questions
What are the possible ways of handling contention in a router?
What is head-of-line blocking?
What is a non-minimal routing algorithm?
What is the difference between deterministic, oblivious, and adaptive routing algorithms?
What routing algorithms need to worry about deadlock?
What routing algorithms need to worry about livelock?
How to handle deadlock?
How to handle livelock?
What is zero-load latency?
What is saturation throughput?
What is an application-aware packet scheduling algorithm?

Routing Mechanism
Arithmetic
  Simple arithmetic to determine the route in regular topologies
  Dimension-order routing in meshes/tori
Source Based (see the sketch below)
  Source specifies the output port for each switch in the route
  + Simple switches, no control state: strip the output port off the header
  - Large header
Table Lookup Based
  Index into a table for the output port
  + Small header
  - More complex switches
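To make the source-based option above concrete, here is a minimal C sketch (the packet layout, field names, and port numbers are hypothetical, not from the lecture): the source writes one output port per hop into the header, and each switch simply strips the next port off.

```c
/* Minimal sketch of source-based routing (hypothetical header layout):
 * the source pre-computes the whole route as one output-port number per
 * hop; each switch just pops the next port off the header and needs no
 * routing state of its own. */
#include <stdio.h>

#define MAX_HOPS 16

typedef struct {
    int route[MAX_HOPS];  /* output port to take at each hop, filled by the source */
    int hops_left;        /* how many route entries remain */
    /* ... payload ... */
} packet_t;

/* Called at every switch: returns the output port and strips it off the header. */
int source_route_next_port(packet_t *p) {
    int port = p->route[0];
    for (int i = 1; i < p->hops_left; i++)   /* shift the remaining route up */
        p->route[i - 1] = p->route[i];
    p->hops_left--;
    return port;
}

int main(void) {
    packet_t p = { .route = {1, 2, 2, 0}, .hops_left = 4 };
    while (p.hops_left > 0)
        printf("take output port %d\n", source_route_next_port(&p));
    return 0;
}
```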

Routing Algorithm Types
  Deterministic: always choose the same path
  Oblivious: do not consider network state (e.g., random)
  Adaptive: adapt to the state of the network
How to adapt
  Local/global feedback
  Minimal or non-minimal paths

Deterministic Routing
All packets between the same (source, dest) pair take the same path
Dimension-order routing (sketched below)
  E.g., XY routing (used in the Cray T3D, and many on-chip networks)
  First traverse dimension X, then traverse dimension Y
+ Simple
+ Deadlock freedom (no cycles in resource allocation)
- Could lead to high contention
- Does not exploit path diversity
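A minimal C sketch of the XY routing decision described above (the port numbering is an assumption for illustration):

```c
/* Minimal sketch of XY dimension-order routing on a 2D mesh.
 * Port numbering is assumed for this sketch, not taken from the slides. */
#include <stdio.h>

enum { EAST, WEST, NORTH, SOUTH, LOCAL };

/* Per-hop decision: exhaust the X dimension first, then Y. */
int xy_route(int cur_x, int cur_y, int dst_x, int dst_y) {
    if (dst_x > cur_x) return EAST;
    if (dst_x < cur_x) return WEST;
    if (dst_y > cur_y) return NORTH;
    if (dst_y < cur_y) return SOUTH;
    return LOCAL;                       /* arrived: eject at the local port */
}

int main(void) {
    /* A packet at (1,1) headed to (3,0) goes East, East, then South. */
    printf("first hop: %d, hop at (3,1): %d\n",
           xy_route(1, 1, 3, 0), xy_route(3, 1, 3, 0));
    return 0;
}
```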

Deadlock
No forward progress
Caused by circular dependencies on resources
Each packet waits for a buffer occupied by another packet downstream

Handling Deadlock
Avoid cycles in routing
  Dimension-order routing: cannot build a circular dependency
  Restrict the “turns” each packet can take
Avoid deadlock by adding virtual channels
Detect and break deadlock
  Preemption of buffers

Turn Model to Avoid Deadlock
Idea
  Analyze the directions in which packets can turn in the network
  Determine the cycles that such turns can form
  Prohibit just enough turns to break possible cycles (see the west-first sketch below)
Glass and Ni, “The Turn Model for Adaptive Routing,” ISCA 1992.
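As one concrete instance of the idea, below is a minimal C sketch of the “west-first” restriction from the turn model: in a 2D mesh, prohibiting the two turns into West is enough to break every cycle. The direction encoding is an assumption for illustration.

```c
/* Minimal sketch of the west-first instance of the turn model.
 * The eight possible turns in a 2D mesh can form cycles; west-first
 * forbids the two turns *into* West (North->West and South->West),
 * so all westward hops must happen first and no cycle can close. */
enum dir { EAST, WEST, NORTH, SOUTH };

/* Returns 1 if a packet currently travelling `from` may turn to `to`. */
int west_first_turn_allowed(enum dir from, enum dir to) {
    if ((from == NORTH || from == SOUTH) && to == WEST)
        return 0;   /* prohibited turns: they could complete a cycle */
    return 1;       /* all other turns (and going straight) are allowed */
}
```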

Valiant’s Algorithm
An example of an oblivious algorithm
Goal: balance network load
Idea: randomly choose an intermediate destination, route to it first, then route from there to the destination (see the sketch below)
  Between source-intermediate and intermediate-destination, can use dimension-order routing
+ Randomizes/balances network load
- Non-minimal (packet latency can increase)
Optimizations:
  Do this only under high load
  Restrict the intermediate node to be close (in the same quadrant)
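A minimal C sketch of the per-packet decisions described above (the mesh size and the reuse of the xy_route() helper from the earlier sketch are assumptions for illustration):

```c
/* Minimal sketch of Valiant's algorithm on a 2D mesh: pick a random
 * intermediate node, route to it (e.g., with XY routing), then route
 * from it to the real destination. */
#include <stdlib.h>

#define MESH_W 8
#define MESH_H 8

typedef struct { int x, y; } node_t;

int xy_route(int cur_x, int cur_y, int dst_x, int dst_y);  /* from the XY sketch */

/* Chosen once, at injection time, and carried in the packet header. */
node_t valiant_pick_intermediate(void) {
    node_t mid = { rand() % MESH_W, rand() % MESH_H };
    return mid;
}

/* Per-hop decision: head for the intermediate node until it is reached,
 * then head for the real destination. */
int valiant_route(node_t cur, node_t mid, node_t dst, int *reached_mid) {
    if (!*reached_mid && cur.x == mid.x && cur.y == mid.y)
        *reached_mid = 1;
    node_t target = *reached_mid ? dst : mid;
    return xy_route(cur.x, cur.y, target.x, target.y);
}
```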

Adaptive Routing
Minimal adaptive (see the sketch below)
  Router uses network state (e.g., downstream buffer occupancy) to pick which “productive” output port to send a packet to
  Productive output port: port that gets the packet closer to its destination
  + Aware of local congestion
  - Minimality restricts achievable link utilization (load balance)
Non-minimal (fully) adaptive
  “Misroute” packets to non-productive output ports based on network state
  + Can achieve better network utilization and load balance
  - Need to guarantee livelock freedom
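A minimal C sketch of the minimal-adaptive choice described above, assuming credit-based flow control exposes the number of free downstream buffer slots per output port (port numbering and data layout are illustrative assumptions):

```c
/* Minimal sketch of minimal adaptive routing on a 2D mesh: among the
 * productive output ports (those that bring the packet closer to its
 * destination), pick the one whose downstream buffer currently has the
 * most free slots. free_credits[] is assumed router state maintained
 * by credit-based flow control. */
enum { EAST, WEST, NORTH, SOUTH, LOCAL, NUM_PORTS };

int minimal_adaptive_route(int cur_x, int cur_y, int dst_x, int dst_y,
                           const int free_credits[NUM_PORTS]) {
    int candidates[2], n = 0, best = -1;

    if (dst_x > cur_x) candidates[n++] = EAST;      /* productive in X */
    else if (dst_x < cur_x) candidates[n++] = WEST;
    if (dst_y > cur_y) candidates[n++] = NORTH;     /* productive in Y */
    else if (dst_y < cur_y) candidates[n++] = SOUTH;

    if (n == 0) return LOCAL;                       /* arrived */

    for (int i = 0; i < n; i++)                     /* pick the least congested */
        if (best < 0 || free_credits[candidates[i]] > free_credits[best])
            best = candidates[i];
    return best;
}
```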

More on Adaptive Routing
Can avoid faulty links/routers
Idea: route around faults
+ Deterministic routing cannot handle faulty components
- Need to change the routing table to disable faulty routes
- Assuming the faulty link/router is detected

Real On-Chip Network Designs
Tilera Tile64 and Tile100
Larrabee
Cell

On-Chip vs. Off-Chip Differences
Advantages of on-chip
  Wires are “free”: can build highly connected networks with wide buses
  Low latency: can cross the entire network in a few clock cycles
  High reliability: packets are not dropped and links rarely fail
Disadvantages of on-chip
  Sharing resources with the rest of the components on chip (area, power)
  Limited buffering available
  Not all topologies map well to a 2D plane

Tilera Networks
2D mesh
Five networks
Four packet switched
  Dimension-order routing, wormhole flow control
  TDN: cache request packets
  MDN: response packets
  IDN: I/O packets
  UDN: core-to-core messaging
One circuit switched
  STN: low-latency, high-bandwidth static network
  Streaming data

Research Topics in Interconnects
Plenty of topics in on-chip networks. Examples:
Energy/power efficient/proportional design
Reducing complexity: simplified router and protocol designs
Adaptivity: ability to adapt to different access patterns
QoS and performance isolation
  Reducing and controlling interference, admission control
Co-design of NoCs with other shared resources
  End-to-end performance, QoS, power/energy optimization
Scalable topologies to many cores
Fault tolerance
Request prioritization, priority inversion, coherence, …
New technologies (optical, 3D)

Bufferless Routing
(these slides are not covered in class; they are for your benefit)
Moscibroda and Mutlu, “A Case for Bufferless Routing in On-Chip Networks,” ISCA 2009.

On-Chip Networks (NoC)
Connect cores, caches, memory controllers, etc.
Examples: Intel 80-core Terascale chip, MIT RAW chip
Design goals in NoC design:
  High throughput, low latency
  Fairness between cores, QoS, …
  Low complexity, low cost
  Low power, low energy consumption

Energy/Power in On-Chip Networks
Power is a key constraint in the design of high-performance processors
NoCs consume a substantial portion of system power
  ~30% in Intel 80-core Terascale [IEEE Micro’07]
  ~40% in MIT RAW chip [ISCA’04]
NoCs estimated to consume 100s of Watts [Borkar, DAC’07]

Current NoC Approaches
Existing approaches differ in numerous ways:
  Network topology [Kim et al, ISCA’07; Kim et al, ISCA’08; etc.]
  Flow control [Michelogiannakis et al, HPCA’09; Kumar et al, MICRO’08; etc.]
  Virtual channels [Nicopoulos et al, MICRO’06; etc.]
  QoS & fairness mechanisms [Lee et al, ISCA’08; etc.]
  Routing algorithms [Singh et al, CAL’04]
  Router architecture [Park et al, ISCA’08]
  Broadcast, multicast [Jerger et al, ISCA’08; Rodrigo et al, MICRO’08]
Existing work assumes the existence of buffers in routers!

A Typical Router
[Figure: canonical virtual-channel router — per-input-port VC buffers (VC1…VCv), routing computation, VC arbiter, switch arbiter/scheduler, an N x N crossbar connecting input channels to output channels, and credit flow back to the upstream router]
Buffers are an integral part of existing NoC routers

Buffers in NoC Routers
Buffers are necessary for high network throughput → buffers increase the total available bandwidth in the network
[Figure: average packet latency vs. injection rate for small, medium, and large buffers]

Buffers in NoC Routers
Buffers are necessary for high network throughput → buffers increase the total available bandwidth in the network
Buffers consume significant energy/power
  Dynamic energy when read/written
  Static energy even when not occupied
Buffers add complexity and latency
  Logic for buffer management
  Virtual channel allocation
  Credit-based flow control
Buffers require significant chip area
  E.g., in the TRIPS prototype chip, input buffers occupy 75% of total on-chip network area [Gratz et al, ICCD’06]

Going Bufferless…?
How much throughput do we lose? How is latency affected?
Up to what injection rates can we use bufferless routing?
  Are there realistic scenarios in which the NoC is operated at injection rates below the threshold?
Can we achieve energy reduction? If so, how much?
Can we reduce area, complexity, etc.?
[Figure: latency vs. injection rate, with buffers and without buffers]

Overview
Introduction and Background
Bufferless Routing (BLESS)
  FLIT-BLESS
  WORM-BLESS
  BLESS with buffers
Advantages and Disadvantages
Evaluations
Conclusions

BLESS: Bufferless Routing
Always forward all incoming flits to some output port
If no productive direction is available, send in another direction → the packet is deflected
  Hot-potato routing [Baran’64, etc.]
[Figure: buffered routing vs. BLESS — the contended flit is deflected instead of being buffered]

BLESS: Bufferless Routing
[Figure: the baseline router’s VC arbiter and switch arbiter are replaced by a Flit-Ranking + Port-Prioritization arbitration policy]
Flit-Ranking: 1. Create a ranking over all incoming flits
Port-Prioritization: 2. For a given flit in this ranking, find the best free output port
Apply to each flit in order of ranking

FLIT-BLESS: Flit-Level Routing
Each flit is routed independently
Oldest-first arbitration (other policies evaluated in the paper)
Network topology:
  Can be applied to most topologies (mesh, torus, hypercube, trees, …)
  1) #output ports ≥ #input ports at every router
  2) every router is reachable from every other router
Flow control & injection policy:
  Completely local; inject whenever the local input port is free
Absence of deadlocks: every flit is always moving
Absence of livelocks: with oldest-first ranking
Flit-Ranking: 1. Oldest-first ranking
Port-Prioritization: 2. Assign the flit to a productive port, if possible; otherwise, assign it to a non-productive port (see the sketch below)
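A minimal C sketch of one FLIT-BLESS arbitration cycle as described above (the data layout and helper names are assumptions for illustration; the two steps are the flit-ranking and port-prioritization rules from the slide):

```c
/* Minimal sketch of FLIT-BLESS arbitration in one router cycle:
 * rank the incoming flits oldest-first, then, in rank order, give each
 * flit its productive output port if still free, otherwise any free
 * port (a deflection). With at least as many output ports as input
 * ports, every flit always gets some port, so nothing is buffered. */
#include <stdlib.h>

#define NUM_PORTS 4

typedef struct {
    int valid;
    unsigned long inject_time;  /* lower = older = higher priority */
    int productive_port;        /* precomputed by route computation */
    int assigned_port;
} flit_t;

static int by_age(const void *a, const void *b) {
    const flit_t *fa = a, *fb = b;
    if (fa->valid != fb->valid) return fb->valid - fa->valid;   /* valid flits first */
    return (fa->inject_time > fb->inject_time) -
           (fa->inject_time < fb->inject_time);                 /* then oldest first */
}

void bless_arbitrate(flit_t flits[NUM_PORTS]) {
    int port_taken[NUM_PORTS] = {0};
    qsort(flits, NUM_PORTS, sizeof(flit_t), by_age);   /* 1. flit-ranking */
    for (int i = 0; i < NUM_PORTS; i++) {              /* 2. port-prioritization */
        if (!flits[i].valid) continue;
        int p = flits[i].productive_port;
        if (port_taken[p])                             /* productive port busy: */
            for (p = 0; port_taken[p]; p++) ;          /* deflect to any free port */
        port_taken[p] = 1;
        flits[i].assigned_port = p;
    }
}
```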

WORM-BLESS: Wormhole Routing
Potential downsides of FLIT-BLESS
  Not energy-optimal (each flit needs header information)
  Increase in latency (different flits take different paths)
  Increase in receive buffer size
BLESS with wormhole routing [Dally, Seitz’86]…?
Problems:
  Injection problem (not known when it is safe to inject)
  Livelock problem (packets can be deflected forever)

WORM-BLESS: Wormhole Routing
Flit-Ranking: 1. Oldest-first ranking
Port-Prioritization: 2. If the flit is a head-flit:
  a) assign the flit to an unallocated, productive port
  b) assign the flit to an allocated, productive port
  c) assign the flit to an unallocated, non-productive port
  d) assign the flit to an allocated, non-productive port
  else, assign the flit to the port that is allocated to its worm
Deflect worms if necessary! Truncate worms if necessary! (the body-flit at the truncation point turns into a head-flit)
At low congestion, packets travel as worms
[Figure: a worm headed West is truncated and deflected when its allocated port is taken; see the port-priority sketch below]
See paper for details…
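A minimal C sketch of just the head-flit port-priority order (a)–(d) above (the 0/1 port-state encoding is an assumption for illustration; worm allocation, truncation bookkeeping, and body flits are omitted):

```c
/* Minimal sketch of the WORM-BLESS head-flit port-priority order.
 * productive[], allocated_to_worm[], and taken_this_cycle[] are assumed
 * 0/1 per-output-port state; picking an allocated port (cases b and d)
 * implies truncating the worm currently using it. */
#define NUM_PORTS 4

int worm_bless_pick_port(const int productive[NUM_PORTS],
                         const int allocated_to_worm[NUM_PORTS],
                         const int taken_this_cycle[NUM_PORTS]) {
    for (int pass = 0; pass < 4; pass++) {        /* priority classes a..d */
        int want_productive = (pass == 0 || pass == 1);
        int want_allocated  = (pass == 1 || pass == 3);
        for (int p = 0; p < NUM_PORTS; p++) {
            if (taken_this_cycle[p]) continue;
            if (productive[p] == want_productive &&
                allocated_to_worm[p] == want_allocated)
                return p;
        }
    }
    return -1;  /* cannot happen while #free ports >= #flits this cycle */
}
```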

BLESS with Buffers
BLESS without buffers is the extreme end of a continuum
BLESS can be integrated with buffers
  FLIT-BLESS with buffers
  WORM-BLESS with buffers
Whenever a buffer is full, its first flit becomes must-schedule
  must-schedule flits must be deflected if necessary
See paper for details…


BLESS: Advantages & Disadvantages
Advantages
  No buffers
  Purely local flow control
  Simplicity
    - no credit flows
    - no virtual channels
    - simplified router design
  No deadlocks, livelocks
  Adaptivity
    - packets are deflected around congested areas!
  Router latency reduction
  Area savings
Disadvantages
  Increased latency
  Reduced bandwidth
  Increased buffering at the receiver
  Header information at each flit
  Oldest-first arbitration complex
  QoS becomes difficult
  Impact on energy…? → extensive evaluations in the paper

Reduction of Router Latency
BLESS gets rid of input buffers and virtual channels
[Figure: router pipeline diagrams across two routers.
  Baseline router (speculative): head flit stages BW, RC, VA, SA, ST, LT; body flit stages BW, SA, ST, LT — router latency = 3, can be improved to 2 [Dally, Towles’04]
  BLESS router (standard): RC, ST, LT — router latency = 2 (used in our evaluations)
  BLESS router (optimized): RC, ST, LT with lookahead link traversal (LA LT) — router latency = 1]
Legend: BW = Buffer Write, RC = Route Computation, VA = Virtual Channel Allocation, SA = Switch Allocation, ST = Switch Traversal, LT = Link Traversal, LA LT = Link Traversal of Lookahead
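To make the payoff concrete, here is a tiny back-of-the-envelope sketch: per hop, a packet pays the router pipeline depth plus one link-traversal cycle, so shaving router cycles directly scales down zero-load latency. The hop count and the one-cycle link are illustrative assumptions, not results from the paper.

```c
/* Back-of-the-envelope zero-load latency comparison implied by the
 * pipelines above. Hop count and link latency are illustrative. */
#include <stdio.h>

int zero_load_latency(int hops, int router_cycles, int link_cycles) {
    return hops * (router_cycles + link_cycles);
}

int main(void) {
    int hops = 6;  /* e.g., an average-length path in a small 2D mesh */
    printf("baseline, 3-cycle router:  %d cycles\n", zero_load_latency(hops, 3, 1));
    printf("2-cycle router:            %d cycles\n", zero_load_latency(hops, 2, 1));
    printf("BLESS optimized, 1 cycle:  %d cycles\n", zero_load_latency(hops, 1, 1));
    return 0;
}
```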


Evaluation Methodology
2D mesh network, router latency is 2 cycles
  o 4x4, 8 cores, 8 L2 cache banks (each node is a core or an L2 bank)
  o 4x4, 16 cores, 16 L2 cache banks (each node is a core and an L2 bank)
  o 8x8, 16 cores, 64 L2 cache banks (each node is an L2 bank and may be a core)
  o 128-bit wide links, 4-flit data packets, 1-flit address packets
  o For baseline configuration: 4 VCs per physical input port, 1 packet deep
Benchmarks
  o Multiprogrammed SPEC CPU2006 and Windows Desktop applications
  o Heterogeneous and homogeneous application mixes
  o Synthetic traffic patterns: UR, Transpose, Tornado, Bit Complement
x86 processor model based on Intel Pentium M
  o 2 GHz processor, 128-entry instruction window
  o 64 Kbyte private L1 caches
  o Total 16 Mbyte shared L2 caches; 16 MSHRs per bank
  o DRAM model based on Micron DDR2-800

Evaluation Methodology
Energy model provided by the Orion simulator [MICRO’02]
  o 70nm technology, 2 GHz routers at 1.0 Vdd
For BLESS, we model
  o Additional energy to transmit header information
  o Additional buffers needed on the receiver side
  o Additional logic to reorder flits of individual packets at the receiver
We partition network energy into buffer energy, router energy, and link energy, each having static and dynamic components
Comparisons against non-adaptive and aggressive adaptive buffered routing algorithms (DO, MIN-AD, ROMM)

Evaluation – Synthetic Traces
First, the bad news
Uniform random injection
BLESS has significantly lower saturation throughput compared to the buffered baseline
[Figure: latency vs. injection rate under uniform random traffic — BLESS vs. the best buffered baseline]

Evaluation – Homogeneous Case Study
milc benchmarks (moderately intensive), perfect caches!
Very little performance degradation with BLESS (less than 4% in the dense network)
With router latency 1, BLESS can even outperform the baseline (by ~10%)
Significant energy improvements (almost 40%)
[Figure: performance and energy of Baseline vs. BLESS vs. BLESS with RL=1]

Evaluation – Homogeneous Case Study
Observations:
  1) Injection rates are not extremely high on average → self-throttling!
  2) For bursts and temporary hotspots, use network links as buffers!


Evaluation – Further Results
BLESS increases the buffer requirement at the receiver by at most 2x → overall, energy is still reduced
Impact of memory latency → with real caches, very little slowdown! (at most 1.5%)
Heterogeneous application mixes (we evaluate several mixes of intensive and non-intensive applications)
  little performance degradation
  significant energy savings in all cases
  no significant increase in unfairness across different applications
Area savings: ~60% of network area can be saved!
See paper for details…


Evaluation – Aggregate Results
Aggregate results over all 29 applications

Sparse Network          Perfect L2              Realistic L2
                        Average   Worst-Case    Average   Worst-Case
∆ Network Energy        -39.4%    -28.1%        -46.4%    -41.0%
∆ System Performance    -0.5%     -3.2%         -0.15%    -0.55%

Dense Network           Perfect L2              Realistic L2
                        Average   Worst-Case    Average   Worst-Case
∆ Network Energy        -32.8%    -14.0%        -42.5%    -33.7%
∆ System Performance    -3.6%     -17.1%        -0.7%     -1.5%

BLESS Conclusions
For a very wide range of applications and network settings, buffers are not needed in NoC
  Significant energy savings (32% even in dense networks with perfect caches)
  Area savings of ~60%
  Simplified router and network design (flow control, etc.)
  Performance slowdown is minimal (performance can even increase!)
→ A strong case for a rethinking of NoC design!
Future research: support for quality of service, different traffic classes, energy management, etc.