Intel Slide 1 A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router Shubu Mukherjee*, Federico Silla !, Peter Bannon $, Joel.

Presentation transcript:

Intel Slide 1 A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router
Shubu Mukherjee*, Federico Silla!, Peter Bannon$, Joel Emer*, Steve Lang*, & Dave Webb$ (ack: Richard Kessler)
Intel*, UPV!, & HP$
Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002

Intel Slide 2 Alpha Network Chip (including Router)
[Figure: chip floorplan labeled CORE, L2 Cache Tags, L2 Cache Data, Router, MC1, MC2; chips in the network attach to Rambus memory (M) and I/O (IO)]

Intel Slide 3 The Alpha 21364 8x7 Router
* Distributed arbitration algorithm controls the crossbar
* 8 input ports: 4 network, 2 memory, 1 cache, 1 I/O
* 7 output ports: 4 network, 2 memory/cache, 1 I/O
* Router pipeline length = 13/14 cycles
* Virtual cut-through
[Figure: crossbar connecting input ports to output ports]

Intel Slide 4 Problem: Maximize # Matches
Numbers in table cells: destination output port
Input Port 0: 1 2
Input Port 1: 1 2 3
Input Port 2: 1 2 3
Input Port 3: 1 2 3
Input Port 4: 1 6 3
Input Port 5: 0 2 3
Input Port 6: 4 2 3
Input Port 7:
older packet at input port 3
* Oldest Packet First: one match
* Smarter algorithm (shaded boxes): 7 matches (perfect)
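The gap between the two outcomes on this slide can be reproduced with a small sketch. The queue contents below are hypothetical (they are not the slide's table), and the function names are illustrative. The last entry of each list is assumed to be the oldest packet. The "smarter" algorithm is stood in for by a textbook augmenting-path maximum matching, which is an assumption, not necessarily what the authors used.

```python
def oldest_first(queues):
    """Each input offers only its oldest packet (assumed: last list entry)."""
    taken = {}  # output port -> input port that claimed it
    for inp, dests in enumerate(queues):
        if dests and dests[-1] not in taken:
            taken[dests[-1]] = inp
    return {inp: out for out, inp in taken.items()}


def maximum_matching(queues):
    """Augmenting-path (Kuhn) maximum bipartite matching of inputs to outputs."""
    owner = {}  # output port -> input port currently matched to it

    def try_assign(inp, seen):
        for out in queues[inp]:
            if out in seen:
                continue
            seen.add(out)
            # Take a free output, or evict the owner if it can move elsewhere.
            if out not in owner or try_assign(owner[out], seen):
                owner[out] = inp
                return True
        return False

    for inp in range(len(queues)):
        try_assign(inp, set())
    return {inp: out for out, inp in owner.items()}


# Hypothetical 4-input example: every input's oldest packet targets output 3.
queues = [[1, 2, 3], [1, 2, 3], [0, 2, 3], [0, 1, 3]]
print(len(oldest_first(queues)), "match(es) with oldest-packet-first")
print(len(maximum_matching(queues)), "match(es) with maximum matching")
```

When every oldest packet targets the same output, oldest-first serializes on that output, while considering younger packets lets every input be matched.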

Intel Slide 5 Simpler Algorithms Have Fewer Matches
Assumes all output ports are free
[Figure; axis label: complexity]

Intel Slide 6 Complexity may not pay off
[Figure; condition: 30% input buffer occupancy]

Intel Slide 7 Key Results
* Arbitration Algorithms
    * WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
    * PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
    * SPAA: Simple, Pipelined Arbitration Algorithm (21364)
* SPAA outperforms WFA & PIM1
    + SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy)
    + SPAA minimizes interactions between ports
    + SPAA can be pipelined more effectively
* Rotary Rule
    + avoids network saturation under very heavy load

Intel Slide 8 Wave Front Arbiter (WFA)
* Proposed by Tamir & Chi, 1993
    * used in the SGI Spider/Origin switch
* Implement via “connection” matrix
    Grant = Request & N & W
    S = N & NOT(Grant)
    E = W & NOT(Grant)
[Figure: connection matrix of cells (i,j) between input ports 0-3 and the output ports; each cell has signals Request, Grant, N, S, E, W]
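The three cell equations can be traced in software with a short sketch. It assumes rows are input ports and columns are output ports, evaluates the cells sequentially rather than as a parallel wave, and models the changing initial cell with a hypothetical `first_row` parameter. None of these names come from the slides.

```python
def wavefront_arbitrate(requests, first_row=0):
    """Evaluate a connection matrix using the slide's cell equations.

    requests[i][j] is True when input port i requests output port j.
    Returns the list of granted (input, output) pairs.
    """
    n_in, n_out = len(requests), len(requests[0])
    north = [True] * n_out  # N signal entering the top cell of each column
    grants = []
    for step in range(n_in):
        i = (first_row + step) % n_in  # rotate which row has top priority
        west = True  # W signal entering the leftmost cell of row i
        for j in range(n_out):
            grant = bool(requests[i][j]) and north[j] and west
            north[j] = north[j] and not grant  # S = N & NOT(Grant)
            west = west and not grant          # E = W & NOT(Grant)
            if grant:
                grants.append((i, j))
    return grants


requests = [
    [True, True, False],
    [True, False, False],
    [False, True, True],
]
print(wavefront_arbitrate(requests))
print(wavefront_arbitrate(requests, first_row=1))
```

Because a grant clears both the row signal and the column signal, no input or output is ever granted twice, which is what makes the result a valid crossbar configuration.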

Intel Slide 9 WFA Advantage & Pipeline
+ High degree of interaction among output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via a connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
[Pipeline diagram: stages (1), (2), (3)]

Intel Slide 10 WFA Limitations
- Higher number of estimated cycles
    * 4 cycles in 0.18 micron
- Harder to pipeline effectively
    * micropipelining waves (2) is difficult because initial cell changes every cycle
    * restarting (1) before (2) completes is complex
    * large in-flight packet table due to large number of nominations (up to 54)
    * may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)
[Pipeline diagram: successive arbitrations (1) (2) (3) start 3 cycles apart]

Intel Slide 11 Parallel Iterative Matching (PIM)
* Steps in One Iteration (PIM1)
    * Nominate: each input port nominates packets for every output port (same packet nominated multiple times …)
    * Grant: unmatched output port selects an input port packet randomly
    * Accept: unselected input port selects a grant randomly
[Figure: input ports 0-1 and output ports 0-1 shown in three panels: Nominate, Grant, Accept. Output Port 0 unused in this arbitration round]
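One PIM iteration can be sketched as follows. The data layout (a dict from input port to the set of output ports it nominates to) and the function name `pim1` are assumptions made for illustration. A seeded `random.Random` is passed in so runs are repeatable.

```python
import random


def pim1(requests, rng):
    """One Nominate/Grant/Accept iteration.

    requests maps each input port to the set of output ports it nominates to.
    Returns a dict mapping matched input ports to output ports.
    """
    # Grant: every nominated output picks one requesting input at random.
    grants = {}  # input port -> list of outputs that granted it
    outputs = sorted({out for outs in requests.values() for out in outs})
    for out in outputs:
        candidates = sorted(i for i, outs in requests.items() if out in outs)
        grants.setdefault(rng.choice(candidates), []).append(out)
    # Accept: every input holding grants keeps one of them at random.
    return {inp: rng.choice(outs) for inp, outs in sorted(grants.items())}


# Both inputs nominate to both outputs, as in the two-port figure.
requests = {0: {0, 1}, 1: {0, 1}}
print(pim1(requests, random.Random(1)))
```

When both outputs happen to grant the same input, that input accepts only one of them and the other output goes unused for the round, which is the collision case the slide's figure shows.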

Intel Slide 12 PIM1 Advantage & Pipeline
+ High interaction between input and output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
[Pipeline diagram: stages (1), (2), (3)]

Intel Slide 13 PIM1 Limitations
- Higher number of estimated cycles
    * 4 cycles in 0.18 micron
- Harder to pipeline effectively
    * restarting (1) before (2) completes is complex
    * same packet can be nominated multiple times requiring the “Accept” step (part of stage 2)
    * large in-flight packet table due to large number of nominations (up to 54)
    * may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)
[Pipeline diagram: successive arbitrations (1) (2) (3) start 3 cycles apart]

Intel Slide 14 Simple, Pipelined Arbitration Algorithm (SPAA)
used in the Alpha Router
* Algorithm
    * Nominate: each input port nominates packets for exactly one output port (one packet nominated only once)
    * Grant: each output port selects an input port packet based on the least-recently selected one
    * Reset: input ports reset state of all unselected packets and renominate them in subsequent cycles
[Figure: input ports 0-1 and output ports 0-1; panel labels: Nominate, Grant, Accept, Reset]
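A single SPAA round can be sketched in a few lines. The data layout is an assumption: `nominations` maps each input port to the one output it nominated to, and `lrs_order` keeps, per output, a list of input ports ordered from least to most recently selected. Pipelining and packet state are left out.

```python
def spaa_round(nominations, lrs_order):
    """One Nominate/Grant/Reset round.

    nominations: input port -> the single output port it nominated to.
    lrs_order: output port -> input ports, least recently selected first
               (updated in place when an input wins).
    Returns (granted, losers); losers renominate in a later round.
    """
    by_output = {}
    for inp, out in nominations.items():
        by_output.setdefault(out, []).append(inp)

    granted = {}
    for out, inputs in by_output.items():
        order = lrs_order[out]
        winner = min(inputs, key=order.index)  # least recently selected wins
        order.remove(winner)
        order.append(winner)  # winner is now the most recently selected
        granted[winner] = out

    losers = sorted(set(nominations) - set(granted))  # Reset step
    return granted, losers


lrs_order = {0: [0, 1, 2], 1: [0, 1, 2]}
nominations = {0: 1, 1: 1, 2: 0}  # inputs 0 and 1 collide on output 1
print(spaa_round(nominations, lrs_order))
print(spaa_round(nominations, lrs_order))  # same nominations one round later
```

Since each input nominates to only one output, no Accept step is needed: an output's choice is final, and a colliding input simply tries again.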

Intel Slide 15 SPAA’s Simplicity
* Low degree of interaction among ports
    - increases arbitration collisions
    + reduces complexity
Algorithm (no centralized matrix)
(1) Select packet at input port & load matrix (1 cycle)
(2) Forward packets to output ports (1 cycle)
(3) Output ports select packets and return feedback to input ports (1 cycle)
[Pipeline diagram: stages (1), (2), (3), 1 cycle each]

Intel Slide 16 SPAA’s Advantages
+ Fewer cycles
    * 3 cycles in 0.18 micron
+ Speculatively read out input buffer
    * prior to output port arbitration
    * because only one packet is nominated to one output port
+ Easier to pipeline
    * restart (1) for free input ports before (2) completes
    * only one packet nominated to one output port
    * small number (16) of in-flight packets
    * avoids any centralized matrix
    * speculative read allows data flits to follow header flits
[Pipeline diagram: successive arbitrations (1) (2) (3) start 1 cycle apart]

Intel Slide 17 Summary: Simpler is Better
                            WFA              PIM1             SPAA (Alpha)
# Matches Per Cycle         High             Medium           Lower
# cycles (0.18 microns)     4                4                3
Restart Rate                Every 3 cycles   Every 3 cycles   Every cycle

Intel Slide 18 Saturation Behavior
* Reasons: hot spots & tree saturation
* 21364’s router shows cyclic pattern (link utilization with time)
* Ideally, operate at saturation bandwidth
* Solution: throttle input load
[Figure label: saturation point]

Intel Slide 19 Rotary Rule
* 21364’s in-built throttling
    + maximum outstanding cache miss requests per processor = 16
* Rotary Rule: more throttling in a “direct” network
    + Rotary Rule prioritizes traffic in network ports over local ports
    + also, clears network congestion
    + relies on anti-starvation mechanism
* WFA+Rotary: change first cell
* SPAA+Rotary: change output port priority to the Rotary Rule
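The output-port priority change can be illustrated with a small sketch. The port numbering (input ports 0-3 as the network ports), the least-recently-selected tie-break, and the function name are assumptions for illustration. The anti-starvation mechanism the slide mentions is not modeled.

```python
# Assumed numbering: input ports 0-3 are network ports, 4-7 are local ports.
NETWORK_PORTS = frozenset({0, 1, 2, 3})


def select_input(nominators, lrs_order, rotary=True):
    """Pick the input port an output port grants this round.

    nominators: input ports nominating a packet to this output.
    lrs_order: input ports, least recently selected first.
    With rotary=True, network ports beat local ports whenever any nominate.
    """
    candidates = list(nominators)
    if rotary:
        in_network = [p for p in candidates if p in NETWORK_PORTS]
        if in_network:
            candidates = in_network  # traffic already in the network goes first
    return min(candidates, key=lrs_order.index)


lrs_order = [5, 0, 1, 2, 3, 4, 6, 7]
print(select_input({5, 2}, lrs_order, rotary=False))  # local port 5 is least recent
print(select_input({5, 2}, lrs_order, rotary=True))   # network port 2 is preferred
```

Favoring packets that are already in flight over newly injected local traffic acts as a throttle on injection, which is how the rule relieves congestion under heavy load.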

Intel Slide 20 Simulation Methodology
* Asim
    * modeling infrastructure
    * detailed timing model of network
    * selected design points validated against RTL
* Traffic Patterns
    * 70% three coherence hops, 30% two coherence hops
    * random destinations
    * other traffic combinations in paper and simulated internally

Intel Slide Node Network: Base Case
* SPAA outperforms WFA & PIM1
* 24% higher throughput at knee
[Figure label: Knee]

Intel Slide Node Network: With Rotary Rule
* Rotary Rule helps both SPAA & WFA

Intel Slide 23 Summary & Conclusions
* Arbitration Algorithms
    * WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
    * PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
    * SPAA: Simple, Pipelined Arbitration Algorithm (21364)
* SPAA outperforms WFA & PIM1
    + SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy)
    + SPAA minimizes interactions between ports
    + SPAA can be pipelined more effectively
* Rotary Rule
    + avoids network saturation under heavy load