Intel Slide 1 A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router
Shubu Mukherjee*, Federico Silla!, Peter Bannon$, Joel Emer*, Steve Lang*, & Dave Webb$ (ack: Richard Kessler)
Intel*, UPV!, & HP$
Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002

Intel Slide 2 Alpha 21364 Network
[Figure: network of nodes with attached Rambus memory (M) and I/O (IO). Each 21364 chip (including router) contains the 21264 core, L2 cache tags, L2 cache data, two memory controllers (MC1, MC2), and the router.]

Intel Slide 3 The Alpha 21364 8x7 Router
[Figure: crossbar connecting input ports to output ports]
Distributed arbitration algorithm controls the crossbar
• 8 input ports: 4 network, 2 memory, 1 cache, 1 I/O
• 7 output ports: 4 network, 2 memory/cache, 1 I/O
• Router pipeline length = 13/14 cycles
• Virtual cut-through

Intel Slide 4 Problem: Maximize # Matches
Input Port 0    1  2
Input Port 1    1  2  3
Input Port 2    1  2  3
Input Port 3    1  2  3
Input Port 4    1  6  3
Input Port 5    0  2  3
Input Port 6    4  2  3
Input Port 7    5  2  3
Numbers in table cells: destination output port
[Figure annotation: older packet at input port 3]
• Oldest Packet First: one match
• Smarter algorithm (shaded boxes): 7 matches (perfect)
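The "7 matches (perfect)" figure can be checked by treating the slide's problem as maximum bipartite matching between input ports and output ports. The following is a minimal sketch, assuming the destination lists read off the slide's table (input port 0 shows only two destinations in this transcript). The function name `max_matches` and the augmenting-path method are illustrative choices made here; this is not the router's hardware algorithm.

```python
def max_matches(requests, num_outputs):
    """Maximum number of input-to-output matches, found with augmenting paths."""
    owner = [None] * num_outputs  # owner[out] = input port currently matched to out

    def try_assign(inp, seen):
        # Try to give `inp` one of its requested outputs, moving earlier
        # assignments to other outputs when that frees one up.
        for out in requests[inp]:
            if out in seen:
                continue
            seen.add(out)
            if owner[out] is None or try_assign(owner[out], seen):
                owner[out] = inp
                return True
        return False

    return sum(try_assign(inp, set()) for inp in range(len(requests)))


# Destination output ports per input port, as read from the slide's table.
requests = [
    [1, 2], [1, 2, 3], [1, 2, 3], [1, 2, 3],
    [1, 6, 3], [0, 2, 3], [4, 2, 3], [5, 2, 3],
]
print(max_matches(requests, num_outputs=7))  # prints 7
```

With 8 input ports and only 7 output ports, 7 is the most any arbiter could achieve, which is why the slide calls it perfect.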

Intel Slide 5 Simpler Algorithms Have Fewer Matches
[Figure: algorithms ordered by complexity]
Assumes all output ports are free

Intel Slide 6 Complexity may not pay off
[Figure: algorithms ordered by complexity, at 30% input buffer occupancy]

Intel Slide 7 Key Results
• Arbitration Algorithms
  – WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
  – PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
  – SPAA: Simple, Pipelined Arbitration Algorithm (21364)
• SPAA outperforms WFA & PIM1
  + SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy)
  + SPAA minimizes interactions between ports
  + SPAA can be pipelined more effectively
• Rotary Rule
  + avoids network saturation under very heavy load

Intel Slide 8 Wave Front Arbiter (WFA)
• Proposed by Tamir & Chi, 1993
  – used in the SGI Spider/Origin switch
• Implement via “connection” matrix
[Figure: connection matrix with rows for input ports 0-3 and columns for output ports 1-7; cell (i,j) takes Request, N, and W and produces Grant, S, and E]
Grant = Request & N & W
S = N & NOT(Grant)
E = W & NOT(Grant)
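The three cell equations on this slide can be traced in software. The following is a minimal sketch, assuming N means "this output column is still free" and W means "this input row is still free", with evaluation starting from a configurable first cell and wrapping around. The function name and the `start_row`/`start_col` parameters are hypothetical. Hardware evaluates cells in parallel waves, whereas this sketch visits them sequentially in an order that respects the same north and west dependencies.

```python
def wave_front_arbitrate(request, start_row=0, start_col=0):
    """Return a list of (input, output) grants for a request matrix."""
    rows, cols = len(request), len(request[0])
    row_free = [True] * rows  # W signal entering the next cell of each row
    col_free = [True] * cols  # N signal entering the next cell of each column
    grants = []
    for di in range(rows):
        i = (start_row + di) % rows
        for dj in range(cols):
            j = (start_col + dj) % cols
            # Grant = Request & N & W
            grant = bool(request[i][j]) and col_free[j] and row_free[i]
            if grant:
                grants.append((i, j))
            col_free[j] = col_free[j] and not grant  # S = N & NOT(Grant)
            row_free[i] = row_free[i] and not grant  # E = W & NOT(Grant)
    return grants


# Three input ports, three output ports; 1 marks a request.
request = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]
print(wave_front_arbitrate(request))               # [(0, 0), (2, 1)]
print(wave_front_arbitrate(request, start_row=1))  # [(1, 0), (2, 1)]
```

Changing the first cell changes which requests win, which is how a rotating start position provides fairness in this sketch.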

Intel Slide 9 WFA Advantage & Pipeline
+ High degree of interaction among output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via a connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
[Figure: pipeline timeline of stages (1), (2), (3) with lengths 1.5, 1.5, and 1 cycles]

Intel Slide 10 WFA Limitations
- Higher number of estimated cycles
  • 4 cycles in 0.18 micron
- Harder to pipeline effectively
  • micropipelining waves (2) is difficult because initial cell changes every cycle
  • restarting (1) before (2) completes is complex
  • large in-flight packet table due to large number of nominations (up to 54)
  • may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)
[Figure: two successive arbitrations, each with stages (1), (2), (3) of 1.5, 1.5, and 1 cycles, the second starting 3 cycles after the first]

Intel Slide 11 Parallel Iterative Matching (PIM)
• Steps in One Iteration (PIM1)
  • Nominate: each input port nominates packets for every output port (same packet nominated multiple times …)
  • Grant: unmatched output port selects an input port packet randomly
  • Accept: unselected input port selects a grant randomly
[Figure: three panels (Nominate, Grant, Accept), each showing input ports 0-1 and output ports 0-1; output port 0 is unused in this arbitration round]
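The Nominate, Grant, and Accept steps of one PIM iteration can be sketched as follows. This is a minimal illustration under stated assumptions: `requests[inp]` lists the output ports for which input `inp` has packets, all output ports start unmatched, and the function name `pim1` and the `rng` parameter are hypothetical names introduced here.

```python
import random


def pim1(requests, num_outputs, rng=random):
    """One PIM iteration; returns {input_port: output_port} for accepted grants."""
    # Nominate: every input nominates to every output it has a packet for.
    nominations = {out: [] for out in range(num_outputs)}
    for inp, outs in enumerate(requests):
        for out in sorted(set(outs)):
            nominations[out].append(inp)

    # Grant: each output with nominations picks one nominating input at random.
    grants = {}
    for out, inputs in nominations.items():
        if inputs:
            grants.setdefault(rng.choice(inputs), []).append(out)

    # Accept: each input that received grants picks one of them at random.
    return {inp: rng.choice(outs) for inp, outs in grants.items()}


# Two inputs that both have packets for both outputs, as in the slide's figure.
matches = pim1([[0, 1], [0, 1]], num_outputs=2, rng=random.Random(0))
print(matches)
```

When both outputs happen to grant the same input, that input accepts only one of them and the other output goes unused for the round, which is the collision the slide's figure illustrates.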

Intel Slide 12 PIM1 Advantage & Pipeline
+ High interaction between input and output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
[Figure: pipeline timeline of stages (1), (2), (3) with lengths 1.5, 1.5, and 1 cycles]

Intel Slide 13 PIM1 Limitations
- Higher number of estimated cycles
  • 4 cycles in 0.18 micron
- Harder to pipeline effectively
  • restarting (1) before (2) completes is complex
  • same packet can be nominated multiple times requiring the “Accept” step (part of stage 2)
  • large in-flight packet table due to large number of nominations (up to 54)
  • may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)
[Figure: two successive arbitrations, each with stages (1), (2), (3) of 1.5, 1.5, and 1 cycles, the second starting 3 cycles after the first]

Intel Slide 14 Simple, Pipelined Arbitration Algorithm (SPAA)
used in the Alpha 21364 Router
• Algorithm
  • Nominate: each input port nominates packets for exactly one output port (one packet nominated only once)
  • Grant: each output port selects an input port packet based on the least-recently selected one
  • Reset: input ports reset state of all unselected packets and renominate them in subsequent cycles
[Figure: panels showing input ports 0-1 and output ports 0-1 at each step; panel labels as extracted: Nominate, Grant, Accept, Reset]
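The Nominate, Grant, and Reset steps can be sketched as a small software model. This is a minimal sketch under stated assumptions, not the 21364's implementation: the slide does not say how an input port picks which packet to nominate, so the nomination is taken as given; ties between never-selected inputs go to the lowest port number; the class name `SpaaArbiter` and its fields are hypothetical.

```python
class SpaaArbiter:
    def __init__(self, num_inputs, num_outputs):
        # last_selected[out][inp] = round in which `out` last granted `inp`
        # (-1 means never selected).
        self.last_selected = [[-1] * num_inputs for _ in range(num_outputs)]
        self.round = 0

    def arbitrate(self, nominations):
        """nominations[inp] is the single output nominated by `inp`, or None.

        Returns (grants, losers): grants maps input -> output, and losers
        lists the inputs that must reset and renominate in a later round.
        """
        # Nominate: group inputs by the one output each of them nominated.
        contenders = {}
        for inp, out in enumerate(nominations):
            if out is not None:
                contenders.setdefault(out, []).append(inp)

        # Grant: each output picks its least-recently selected contender.
        grants = {}
        for out, inputs in contenders.items():
            winner = min(inputs, key=lambda i: self.last_selected[out][i])
            self.last_selected[out][winner] = self.round
            grants[winner] = out
        self.round += 1

        # Reset: unselected inputs are reported so they can renominate.
        losers = [i for i, out in enumerate(nominations)
                  if out is not None and i not in grants]
        return grants, losers


arb = SpaaArbiter(num_inputs=2, num_outputs=2)
print(arb.arbitrate([1, 1]))  # ({0: 1}, [1])
print(arb.arbitrate([1, 1]))  # ({1: 1}, [0])
```

Because every input nominates to exactly one output, each output can decide independently and no Accept step is needed, at the cost of collisions such as the one in the first round.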

Intel Slide 15 SPAA’s Simplicity
• Low degree of interaction among ports
  - increases arbitration collisions
  + reduces complexity
Algorithm (no centralized matrix)
(1) Select packet at input port & load matrix (1 cycle)
(2) Forward packets to output ports (1 cycle)
(3) Output ports select packets and return feedback to input ports (1 cycle)
[Figure: pipeline timeline of stages (1), (2), (3), 1 cycle each]

Intel Slide 16 SPAA’s Advantages
+ Fewer cycles
  • 3 cycles in 0.18 micron
+ Speculatively read out input buffer
  • prior to output port arbitration
  • because only one packet is nominated to one output port
+ Easier to pipeline
  • restart (1) for free input ports before (2) completes
  • only one packet nominated to one output port
  • small number (16) of in-flight packets
  • avoids any centralized matrix
  • speculative read allows data flits to follow header flits
[Figure: two successive arbitrations, each with stages (1), (2), (3) of 1 cycle each, the second starting 1 cycle after the first]

Intel Slide 17 Summary: Simpler is Better
                          WFA              PIM1             SPAA (Alpha 21364)
# Matches per cycle       High             Medium           Lower
# Cycles (0.18 micron)    4                4                3
Restart rate              Every 3 cycles   Every 3 cycles   Every cycle

Intel Slide 18 Saturation Behavior
• Reasons: hot spots & tree saturation
• 21364’s router shows cyclic pattern (link utilization with time)
• Ideally, operate at saturation bandwidth
• Solution: throttle input load
[Figure: plot with the saturation point marked]

Intel Slide 19 Rotary Rule
• 21364’s in-built throttling
  + maximum outstanding cache miss requests per processor = 16
• Rotary Rule: more throttling
  + 21364 is a “direct” network
  + Rotary Rule prioritizes traffic in network ports over local ports
  + also, clears network congestion
  + relies on anti-starvation mechanism
• WFA+Rotary: change first cell
• SPAA+Rotary: change output port priority to the Rotary Rule
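For the SPAA case, changing the output port priority to the Rotary Rule can be sketched as a grant function that prefers network input ports over local ones. This is a minimal illustration under stated assumptions: input ports 0-3 are taken to be the network ports and 4-7 the local ports (memory, cache, I/O), a numbering chosen here and not given on the slides; least-recently-selected order is kept within each class; the anti-starvation mechanism the slide mentions is not modeled. The names `NETWORK_INPUTS` and `rotary_grant` are hypothetical.

```python
# Assumed numbering: inputs 0-3 are network ports, 4-7 are local ports.
NETWORK_INPUTS = frozenset({0, 1, 2, 3})


def rotary_grant(contenders, last_selected):
    """Pick the winning input for one output port under the Rotary Rule.

    contenders: input ports nominating a packet to this output.
    last_selected: maps input port -> round in which this output last
    granted it (-1 if never).
    """
    if not contenders:
        return None
    # Network inputs sort first (False < True); within a class, the
    # least-recently selected input wins.
    return min(contenders,
               key=lambda i: (i not in NETWORK_INPUTS, last_selected[i]))


# Local port 5 has never been served, but network port 2 still wins.
print(rotary_grant([5, 2], {2: 10, 5: -1}))  # 2
# With only local ports contending, least-recently selected decides.
print(rotary_grant([4, 6], {4: 3, 6: 1}))    # 6
```

Favoring packets already in the network over newly injected local traffic is what throttles the input load in this sketch; without a separate anti-starvation mechanism, a local port could wait indefinitely.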

Intel Slide 20 Simulation Methodology
• Asim
  • modeling infrastructure
  • detailed timing model of 21364 network
  • selected design points validated against RTL
• Traffic Patterns
  • 70% three coherence hops, 30% two coherence hops
  • random destinations
  • other traffic combinations in paper and simulated internally

Intel Slide 21 64 Node Network: Base Case
• SPAA outperforms WFA & PIM1
• 24% higher throughput at knee
[Figure: plot with the knee marked]

Intel Slide 22 64 Node Network: With Rotary Rule
• Rotary Rule helps both SPAA & WFA

Intel Slide 23 Summary & Conclusions
• Arbitration Algorithms
  – WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
  – PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992)
  – SPAA: Simple, Pipelined Arbitration Algorithm (21364)
• SPAA outperforms WFA & PIM1
  + SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy)
  + SPAA minimizes interactions between ports
  + SPAA can be pipelined more effectively
• Rotary Rule
  + avoids network saturation under heavy load
