Intel Slide 1 A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router
Shubu Mukherjee*, Federico Silla!, Peter Bannon$, Joel Emer*, Steve Lang*, & Dave Webb$ (ack: Richard Kessler)
Intel*, UPV!, & HP$
Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002
Intel Slide 2 Alpha Network Chip (including Router)
[Figure: block diagram labeled with Rambus memory (M), I/O (IO), L2 cache data, L2 cache tags, memory controllers MC1 and MC2, the router, and the core]
Intel Slide 3 The Alpha 8x7 Router
[Figure: crossbar connecting input ports to output ports]
Distributed arbitration algorithm controls the crossbar
8 input ports: 4 network, 2 memory, 1 cache, 1 I/O
7 output ports: 4 network, 2 memory/cache, 1 I/O
Router pipeline length = 13/14 cycles
Virtual cut-through
Intel Slide 4 Problem: Maximize # Matches
[Table: packets queued at each input port; the number in each cell is the packet's destination output port; shaded boxes mark the packets picked by the smarter algorithm; one label points out the older packet at input port 3]
Oldest Packet First: one match
Smarter algorithm (shaded boxes): 7 matches (perfect)
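The matching problem on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `queues` data is hypothetical (it is not the table from the slide), and the "smarter" side uses a textbook augmenting-path maximum matching, which stands in for whatever algorithm the slide shades and is not something a router would run in hardware.

```python
def oldest_first_matches(queues):
    """Each input port offers only its oldest packet (queues[i][0]).
    An output port accepts the first input port that asks for it."""
    taken = set()
    for q in queues:
        if q and q[0] not in taken:
            taken.add(q[0])
    return len(taken)


def maximum_matches(queues, num_outputs):
    """Size of a maximum input-to-output matching, found with augmenting paths."""
    owner = [None] * num_outputs  # owner[o] = input port currently matched to output o

    def try_match(i, seen):
        for o in queues[i]:
            if o in seen:
                continue
            seen.add(o)
            # Take a free output, or move its current owner to another output.
            if owner[o] is None or try_match(owner[o], seen):
                owner[o] = i
                return True
        return False

    return sum(try_match(i, set()) for i in range(len(queues)))


# Hypothetical queues, oldest packet first; values are destination output ports.
queues = [[1, 0], [1, 2], [1, 3], [1, 2, 3]]
print(oldest_first_matches(queues))  # 1: every oldest packet wants output port 1
print(maximum_matches(queues, 4))    # 4: every input port gets a distinct output port
```

The gap between the two numbers is the head-of-line blocking that the arbitration algorithms in the following slides try to reduce.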
Intel Slide 5 Simpler Algorithms Have Fewer Matches
[Figure: number of matches for algorithms ordered by complexity]
Assumes all output ports are free
Intel Slide 6 Complexity may not pay off
[Figure: results at 30% input buffer occupancy]
Intel Slide 7 Key Results
Arbitration Algorithms
– WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
– PIM1: Parallel Iterative Matching with one iteration (Anderson et al., ASPLOS 1992)
– SPAA: Simple, Pipelined Arbitration Algorithm (21364)
SPAA outperforms WFA & PIM1
+ SPAA’s matching power is similar to that of WFA & PIM1 (when many output ports are busy)
+ SPAA minimizes interactions between ports
+ SPAA can be pipelined more effectively
Rotary Rule
+ avoids network saturation under very heavy load
Intel Slide 8 Wave Front Arbiter (WFA)
Proposed by Tamir & Chi, 1993
– used in the SGI Spider/Origin switch
Implemented via a “connection” matrix
[Figure: matrix of arbitration cells with one row per input port (0 to 3) and one column per output port; cell (i, j) has inputs Request, N, and W and outputs Grant, S, and E]
Grant = Request & N & W
S = N & NOT(Grant)
E = W & NOT(Grant)
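The three cell equations above can be exercised with a small software model. This is a minimal sketch, not the hardware design: it assumes rows are input ports and columns are output ports, evaluates the cells sequentially instead of as a diagonal wave, and models the changing initial cell with two hypothetical parameters, `start_row` and `start_col`. The helper `matched_pairs` is likewise only for illustration.

```python
def wavefront_arbitrate(request, start_row=0, start_col=0):
    """request[i][j] is truthy when input port i has a packet for output port j.
    Returns a grant matrix with at most one grant per row and per column."""
    rows, cols = len(request), len(request[0])
    grant = [[False] * cols for _ in range(rows)]
    north = [True] * cols  # N signal entering the next cell of each column
    for dr in range(rows):
        i = (start_row + dr) % rows
        west = True        # W signal entering the first cell of this row
        for dc in range(cols):
            j = (start_col + dc) % cols
            g = bool(request[i][j]) and north[j] and west  # Grant = Request & N & W
            grant[i][j] = g
            north[j] = north[j] and not g                  # S = N & NOT(Grant)
            west = west and not g                          # E = W & NOT(Grant)
    return grant


def matched_pairs(grant):
    """List the (input port, output port) pairs that were granted."""
    return [(i, j) for i, row in enumerate(grant) for j, g in enumerate(row) if g]


req = [[1, 1, 0],
       [1, 0, 0],
       [0, 1, 1]]
print(matched_pairs(wavefront_arbitrate(req)))               # [(0, 0), (2, 1)]
print(matched_pairs(wavefront_arbitrate(req, start_col=1)))  # [(0, 1), (1, 0), (2, 2)]
```

In this example the same requests yield two or three matches depending on where the wave starts, which shows how the choice of initial cell sets the priority among competing requests.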
Intel Slide 9 WFA Advantage & Pipeline
+ High degree of interaction among output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via a connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
[Figure: pipeline timeline of stages (1), (2), and (3), lasting 1.5, 1.5, and 1 cycles]
Intel Slide 10 WFA Limitations
- Higher number of estimated cycles
   4 cycles in 0.18 micron
- Harder to pipeline effectively
   micropipelining waves (2) is difficult because the initial cell changes every cycle
   restarting (1) before (2) completes is complex
   large in-flight packet table due to the large number of nominations (up to 54)
   may require multiple copies of the matrix to buffer pipeline stages (these must avoid stale nominations)
[Figure: two successive arbitrations, each with stages (1), (2), and (3), the second starting 3 cycles after the first]
Intel Slide 11 Parallel Iterative Matching (PIM)
Steps in One Iteration (PIM1)
– Nominate: each input port nominates packets for every output port (same packet nominated multiple times …)
– Grant: each unmatched output port selects an input port packet randomly
– Accept: each unselected input port selects a grant randomly
[Figure: two input ports and two output ports shown at the Nominate, Grant, and Accept steps; output port 0 is unused in this arbitration round]
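The three PIM1 steps can be sketched as follows. This is an illustrative model only: the function name `pim1`, the `requests` representation (one set of wanted output ports per input port), and the injected random generator are assumptions made for the example, and a single iteration starts with every port unmatched.

```python
import random


def pim1(requests, num_outputs, rng):
    """One PIM iteration. requests[i] is the set of output ports for which
    input port i holds packets. Returns {input port: accepted output port}."""
    # Nominate: every input port nominates to every output port it has a packet for.
    nominations = {o: [i for i, outs in enumerate(requests) if o in outs]
                   for o in range(num_outputs)}
    # Grant: each output port with nominations picks one input port at random.
    grants = {}
    for o, inputs in nominations.items():
        if inputs:
            grants.setdefault(rng.choice(inputs), []).append(o)
    # Accept: each input port that received grants keeps one at random.
    return {i: rng.choice(outs) for i, outs in grants.items()}


# Hypothetical 2x2 case: input 0 wants both outputs, input 1 wants output 1 only.
rng = random.Random(0)
match = pim1([{0, 1}, {1}], num_outputs=2, rng=rng)
print(match)
```

When both output ports happen to grant input port 0, that port accepts only one of them and the other output port stays idle for the round, which is the situation the slide's figure labels as an unused output port.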
Intel Slide 12 PIM1 Advantage & Pipeline
+ High interaction between input and output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
[Figure: pipeline timeline of stages (1), (2), and (3), lasting 1.5, 1.5, and 1 cycles]
Intel Slide 13 PIM1 Limitations
- Higher number of estimated cycles
   4 cycles in 0.18 micron
- Harder to pipeline effectively
   restarting (1) before (2) completes is complex
   same packet can be nominated multiple times, requiring the “Accept” step (part of stage 2)
   large in-flight packet table due to the large number of nominations (up to 54)
   may require multiple copies of the matrix to buffer pipeline stages (these must avoid stale nominations)
[Figure: two successive arbitrations, each with stages (1), (2), and (3), the second starting 3 cycles after the first]
Intel Slide 14 Simple, Pipelined Arbitration Algorithm (SPAA)
Used in the Alpha Router
Algorithm
– Nominate: each input port nominates packets for exactly one output port (one packet nominated only once)
– Grant: each output port selects an input port packet based on the least-recently selected one
– Reset: input ports reset the state of all unselected packets and renominate them in subsequent cycles
[Figure: two input ports and two output ports shown at each step; step labels: Nominate, Grant, Accept, Reset]
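A single SPAA round can be sketched like this. It is a simplified model under stated assumptions: how an input port chooses which packet to nominate is not modeled (the `nominations` list is given), "least-recently selected" is tracked with per-output timestamps in a hypothetical `last_granted` table, and ties go to the lowest-numbered input port.

```python
def spaa_round(nominations, last_granted, now):
    """nominations[i] is the single output port nominated by input port i, or None.
    last_granted[o][i] is the time output port o last selected input port i.
    Returns (grants, losers): grants maps input port -> output port, and
    losers lists the input ports that must reset and renominate later."""
    by_output = {}
    for i, o in enumerate(nominations):
        if o is not None:
            by_output.setdefault(o, []).append(i)
    grants = {}
    for o, inputs in by_output.items():
        # Grant: pick the nominating input port selected least recently.
        winner = min(inputs, key=lambda i: last_granted[o][i])
        last_granted[o][winner] = now
        grants[winner] = o
    # Reset: unselected packets go back to the pool for later rounds.
    losers = [i for i, o in enumerate(nominations)
              if o is not None and i not in grants]
    return grants, losers


# Hypothetical 3-input, 2-output example; -1 means "never selected".
last_granted = [[-1] * 3 for _ in range(2)]
print(spaa_round([0, 0, 1], last_granted, now=0))     # ({0: 0, 2: 1}, [1])
print(spaa_round([0, 0, None], last_granted, now=1))  # ({1: 0}, [0])
```

Because each input port nominates to one output port only, no Accept step is needed: every grant is final, and in the second round the input port that lost earlier wins output port 0.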
Intel Slide 15 SPAA’s Simplicity
Low degree of interaction among ports
- increases arbitration collisions
+ reduces complexity
Algorithm (no centralized matrix)
(1) Select packet at input port & load matrix (1 cycle)
(2) Forward packets to output ports (1 cycle)
(3) Output ports select packets and return feedback to input ports (1 cycle)
[Figure: pipeline timeline of stages (1), (2), and (3), each lasting 1 cycle]
Intel Slide 16 SPAA’s Advantages
+ Fewer cycles
   3 cycles in 0.18 micron
+ Speculatively read out input buffer prior to output port arbitration
   because only one packet is nominated to one output port
+ Easier to pipeline
   restart (1) for free input ports before (2) completes
   only one packet nominated to one output port
   small number (16) of in-flight packets
   avoids any centralized matrix
   speculative read allows data flits to follow header flits
[Figure: two successive arbitrations, each with 1-cycle stages (1), (2), and (3), the second starting 1 cycle after the first]
Intel Slide 17 Summary: Simpler is Better
                          WFA              PIM1             SPAA (Alpha)
# matches per cycle       High             Medium           Lower
# cycles (0.18 micron)    4                4                3
Restart rate              Every 3 cycles   Every 3 cycles   Every cycle
Intel Slide 18 Saturation Behavior
Reasons: hot spots & tree saturation
21364’s router shows a cyclic pattern (link utilization with time)
Ideally, operate at saturation bandwidth
Solution: throttle input load
[Figure: plot with the saturation point marked]
Intel Slide 19 Rotary Rule
21364’s built-in throttling
+ maximum outstanding cache miss requests per processor = 16
Rotary Rule: more throttling in a “direct” network
+ Rotary Rule prioritizes traffic in network ports over local ports
+ also clears network congestion
+ relies on anti-starvation mechanism
WFA+Rotary: change first cell
SPAA+Rotary: change output port priority to the Rotary Rule
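One way to picture the SPAA+Rotary change is as a different sort key at the output port. The sketch below rests on assumptions: input ports 0 to 3 are taken to be the network ports (the numbering is hypothetical), the plain policy is least-recently-selected, and the anti-starvation mechanism the slide mentions is not modeled.

```python
NETWORK_INPUTS = {0, 1, 2, 3}  # assumed numbering: the 4 network input ports


def lru_grant(candidates, last_selected):
    """Plain policy: pick the least-recently selected nominating input port."""
    return min(candidates, key=lambda i: last_selected[i])


def rotary_grant(candidates, last_selected):
    """Rotary Rule sketch: packets already in the network (arriving on network
    input ports) beat packets from local ports; ties fall back to the plain policy."""
    return min(candidates,
               key=lambda i: (i not in NETWORK_INPUTS, last_selected[i]))


# Hypothetical state: time at which each of the 8 input ports was last selected.
last_selected = [9, 9, 10, 9, 9, 3, 7, 9]
candidates = [5, 2, 6]  # two local ports (5, 6) and one network port (2)
print(lru_grant(candidates, last_selected))     # 5: the local port waited longest
print(rotary_grant(candidates, last_selected))  # 2: the network port wins
```

Under heavy load this ordering drains packets that are already in flight before admitting new ones from local ports, which is the throttling effect the slide describes.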
Intel Slide 20 Simulation Methodology
Asim modeling infrastructure
– detailed timing model of network
– selected design points validated against RTL
Traffic Patterns
– 70% three coherence hops, 30% two coherence hops
– random destinations
– other traffic combinations in paper and simulated internally
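The 70/30 traffic mix could be generated along these lines. This is a sketch and not the Asim model: it assumes a three-hop transaction goes requester to home to owner and back to the requester, a two-hop transaction goes requester to home and back, and nodes are drawn uniformly at random. The function name and the node count of 16 are hypothetical.

```python
import random


def make_transaction(num_nodes, rng):
    """Return the list of (source, destination) messages of one transaction."""
    requester = rng.randrange(num_nodes)
    home = rng.choice([n for n in range(num_nodes) if n != requester])
    if rng.random() < 0.7:
        # Three coherence hops: requester -> home -> owner -> requester (assumed).
        owner = rng.choice([n for n in range(num_nodes)
                            if n not in (requester, home)])
        return [(requester, home), (home, owner), (owner, requester)]
    # Two coherence hops: requester -> home -> requester (assumed).
    return [(requester, home), (home, requester)]


rng = random.Random(1)
transactions = [make_transaction(16, rng) for _ in range(10_000)]
three_hop_share = sum(len(t) == 3 for t in transactions) / len(transactions)
print(round(three_hop_share, 2))  # close to 0.7
```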
Intel Slide Node Network: Base Case
SPAA outperforms WFA & PIM1
24% higher throughput at knee
[Figure: plot with the knee marked]
Intel Slide Node Network: With Rotary Rule
Rotary Rule helps both SPAA & WFA
Intel Slide 23 Summary & Conclusions
Arbitration Algorithms
– WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
– PIM1: Parallel Iterative Matching with one iteration (Anderson et al., ASPLOS 1992)
– SPAA: Simple, Pipelined Arbitration Algorithm (21364)
SPAA outperforms WFA & PIM1
+ SPAA’s matching power is similar to that of WFA & PIM1 (when many output ports are busy)
+ SPAA minimizes interactions between ports
+ SPAA can be pipelined more effectively
Rotary Rule
+ avoids network saturation under heavy load