Intel Slide 1
A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router
Shubu Mukherjee*, Federico Silla!, Peter Bannon$, Joel Emer*, Steve Lang*, and Dave Webb$ (ack: Richard Kessler)
Affiliations: * Intel, ! UPV, $ HP
Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002

Slide 2: Alpha 21364 Network
[Figure: network of 21364 chips (each including a router), each attached to Rambus memory (M) and I/O (IO)]
[Figure: 21364 chip diagram with blocks labeled 21264 core, L2 cache tags, L2 cache data (two blocks), router, MC1, and MC2]

Slide 3: The Alpha 21364 8x7 Router
[Figure: crossbar connecting input ports to output ports]
Distributed arbitration algorithm controls the crossbar
8 input ports: 4 network, 2 memory, 1 cache, 1 I/O
7 output ports: 4 network, 2 memory/cache, 1 I/O
Router pipeline length = 13/14 cycles
Virtual cut-through

Slide 4: Problem: Maximize # Matches
Input Port 0:  1  2
Input Port 1:  1  2  3
Input Port 2:  1  2  3
Input Port 3:  1  2  3
Input Port 4:  1  6  3
Input Port 5:  0  2  3
Input Port 6:  4  2  3
Input Port 7:  5  2  3
Numbers in table cells: destination output port
Table annotation: older packet at input port 3
Oldest Packet First: one match
Smarter algorithm (shaded boxes): 7 matches (perfect)
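
The "7 matches (perfect)" figure can be checked with a small sketch. It is not the router's arbitration logic. It is a textbook maximum bipartite matching (Kuhn's augmenting-path method), applied to the table as it appears in this transcript, where input port 0 shows only two entries. The names `REQUESTS` and `max_matches` are illustrative.

```python
# Destination output ports of the packets waiting at each input port,
# copied from the slide's table as transcribed (port 0 shows two entries).
REQUESTS = {
    0: [1, 2],
    1: [1, 2, 3],
    2: [1, 2, 3],
    3: [1, 2, 3],
    4: [1, 6, 3],
    5: [0, 2, 3],
    6: [4, 2, 3],
    7: [5, 2, 3],
}


def max_matches(requests):
    """Return a maximum matching as {input port: output port}."""
    owner = {}  # output port -> input port currently matched to it

    def try_assign(inp, seen):
        # Try each requested output; if it is taken, try to move its
        # current owner to another output (augmenting path).
        for out in requests[inp]:
            if out in seen:
                continue
            seen.add(out)
            if out not in owner or try_assign(owner[out], seen):
                owner[out] = inp
                return True
        return False

    for inp in requests:
        try_assign(inp, set())
    return {inp: out for out, inp in owner.items()}


matching = max_matches(REQUESTS)
print(len(matching), "matches:", dict(sorted(matching.items())))
```

With 8 input ports and 7 output ports, 7 is the largest number of matches possible, so one input port always stays unmatched.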

Slide 5: Simpler Algorithms Have Fewer Matches
Assumes all output ports are free
[Chart; axis label: complexity]

Slide 6: Complexity May Not Pay Off
@ 30% input buffer occupancy
[Chart; axis label: complexity]

Slide 7: Key Results
Arbitration algorithms
–WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
–PIM1: Parallel Iterative Matching with one iteration (Anderson et al., ASPLOS 1992)
–SPAA: Simple, Pipelined Arbitration Algorithm (21364)
SPAA outperforms WFA & PIM1
+ SPAA’s matching power is similar to WFA & PIM1 (when many output ports are busy)
+ SPAA minimizes interactions between ports
+ SPAA can be pipelined more effectively
Rotary Rule
+ avoids network saturation under very heavy load

Slide 8: Wave Front Arbiter (WFA)
Proposed by Tamir & Chi, 1993
–used in the SGI Spider/Origin switch
Implemented via a “connection” matrix
[Figure: connection matrix with rows for input ports 0 to 3 and columns for output ports labeled 1 to 7; cell (i,j) has Request and Grant signals and N, S, E, W connections]
Cell equations:
Grant = Request & N & W
S = N & NOT(Grant)
E = W & NOT(Grant)
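
The three cell equations can be traced in software with a minimal sketch. It assumes a fixed top-left priority: every column's N input and every row's W input start asserted, and cells are evaluated row by row. The hardware evaluates cells in diagonal waves and changes the starting cell over time, which this sketch does not model. The function name `wavefront_arbitrate` is illustrative.

```python
def wavefront_arbitrate(request):
    """request[i][j] is True when input port i wants output port j.

    Returns a grant matrix of the same shape.
    """
    rows, cols = len(request), len(request[0])
    north = [True] * cols  # N signal entering the top cell of each column
    grant = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        west = True  # W signal entering the leftmost cell of this row
        for j in range(cols):
            g = request[i][j] and north[j] and west  # Grant = Request & N & W
            grant[i][j] = g
            north[j] = north[j] and not g  # S = N & NOT(Grant), feeds the cell below
            west = west and not g          # E = W & NOT(Grant), feeds the cell to the right
    return grant


requests = [
    [True, True, False],
    [True, False, True],
    [True, True, True],
]
for row in wavefront_arbitrate(requests):
    print(row)
```

Because a grant clears both the S and the E signal, each row and each column receives at most one grant, which is what makes the result a valid crossbar configuration.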

Slide 9: WFA Advantage & Pipeline
+ High degree of interaction among output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via a connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
[Figure: pipeline timing of stages (1), (2), (3): 1.5, 1.5, and 1 cycles]

Slide 10: WFA Limitations
- Higher number of estimated cycles
  4 cycles in 0.18 micron
- Harder to pipeline effectively
  micropipelining waves (2) is difficult because the initial cell changes every cycle
  restarting (1) before (2) completes is complex
  large in-flight packet table due to large number of nominations (up to 54)
  may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)
[Figure: two successive arbitrations through stages (1), (2), (3); interval labeled 3 cycles]

Slide 11: Parallel Iterative Matching (PIM)
Steps in one iteration (PIM1)
Nominate: each input port nominates packets for every output port (same packet nominated multiple times …)
Grant: unmatched output port selects an input port packet randomly
Accept: unselected input port selects a grant randomly
[Figure: input ports 0 and 1 and output ports 0 and 1, shown for the Nominate, Grant, and Accept steps; output port 0 is unused in this arbitration round]
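
One PIM iteration can be sketched as follows. This is a software illustration of the Nominate, Grant, and Accept steps as the slide describes them, not the hardware design. The function name `pim_iteration` and the `requests` layout are assumptions made for the example, and a seeded `random.Random` stands in for the random choices.

```python
import random


def pim_iteration(requests, rng):
    """Run one Nominate/Grant/Accept iteration.

    requests: {input port: set of output ports it holds packets for}
    rng: a random.Random instance used for the random choices
    Returns the accepted matches as {input port: output port}.
    """
    # Nominate: each input nominates to every output it has a packet for.
    nominations = {}
    for inp in sorted(requests):
        for out in sorted(requests[inp]):
            nominations.setdefault(out, []).append(inp)

    # Grant: each output picks one of its nominating inputs at random.
    grants = {}
    for out in sorted(nominations):
        chosen = rng.choice(nominations[out])
        grants.setdefault(chosen, []).append(out)

    # Accept: each input that received grants picks one at random.
    # Outputs whose grant is not accepted stay unused this round.
    return {inp: rng.choice(outs) for inp, outs in grants.items()}


reqs = {0: {0, 1}, 1: {0, 1}, 2: {1}}
print(pim_iteration(reqs, random.Random(1)))
```

When both outputs happen to grant the same input, that input accepts only one of them and the other output goes unused, which is the situation the slide's figure illustrates.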

Slide 12: PIM1 Advantage & Pipeline
+ High interaction between input and output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
[Figure: pipeline timing of stages (1), (2), (3): 1.5, 1.5, and 1 cycles]

Slide 13: PIM1 Limitations
- Higher number of estimated cycles
  4 cycles in 0.18 micron
- Harder to pipeline effectively
  restarting (1) before (2) completes is complex
  same packet can be nominated multiple times, requiring the “Accept” step (part of stage 2)
  large in-flight packet table due to large number of nominations (up to 54)
  may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)
[Figure: two successive arbitrations through stages (1), (2), (3); interval labeled 3 cycles]

Slide 14: Simple, Pipelined Arbitration Algorithm (SPAA)
Used in the Alpha 21364 router
Algorithm
Nominate: each input port nominates packets for exactly one output port (one packet nominated only once)
Grant: each output port selects an input port packet based on the least-recently selected one
Reset: input ports reset state of all unselected packets and renominate them in subsequent cycles
[Figure: input ports 0 and 1 and output ports 0 and 1 at each step; panel labels: Nominate, Grant, Accept, Reset]
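
The Nominate, Grant, and Reset steps can be sketched in a few lines. This is a simplified software model, not the 21364 implementation: it ignores pipelining and busy output ports, and it represents "least-recently selected" with a per-output list ordered from least to most recently selected. The names `spaa_round` and `lru` are illustrative.

```python
def spaa_round(nominations, lru):
    """Run one arbitration round.

    nominations: {input port: output port}, one nomination per input port
    lru: {output port: list of input ports, least recently selected first}
    Returns (grants, losers): grants is {output port: winning input port},
    losers lists the input ports that must reset and renominate later.
    """
    # Nominate: group the single nomination of each input by output port.
    contenders = {}
    for inp, out in nominations.items():
        contenders.setdefault(out, []).append(inp)

    # Grant: each output picks its least-recently selected contender,
    # then moves the winner to the most-recently-selected end of its list.
    grants = {}
    for out, inps in contenders.items():
        winner = min(inps, key=lru[out].index)
        grants[out] = winner
        lru[out].remove(winner)
        lru[out].append(winner)

    # Reset: unselected inputs keep their packets for a later round.
    winners = set(grants.values())
    losers = [inp for inp in nominations if inp not in winners]
    return grants, losers


lru = {0: [0, 1, 2], 1: [0, 1, 2]}
print(spaa_round({0: 1, 1: 1, 2: 0}, lru))  # inputs 0 and 1 collide on output 1
print(spaa_round({0: 1, 1: 1}, lru))        # input 1 now wins output 1
```

Since each input nominates to a single output, a grant never has to be turned down, so no Accept step is needed.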

Slide 15: SPAA’s Simplicity
Low degree of interaction among ports
- increases arbitration collisions
+ reduces complexity
Algorithm (no centralized matrix)
(1) Select packet at input port & load matrix (1 cycle)
(2) Forward packets to output ports (1 cycle)
(3) Output ports select packets and return feedback to input ports (1 cycle)
[Figure: pipeline timing of stages (1), (2), (3): 1 cycle each]

Slide 16: SPAA’s Advantages
+ Fewer cycles
  3 cycles in 0.18 micron
+ Speculatively read out input buffer prior to output port arbitration
  because only one packet is nominated to one output port
+ Easier to pipeline
  restart (1) for free input ports before (2) completes
  only one packet nominated to one output port
  small number (16) of in-flight packets
  avoids any centralized matrix
  speculative read allows data flits to follow header flits
[Figure: two successive arbitrations through stages (1), (2), (3), 1 cycle each; interval labeled 1 cycle]

Slide 17: Summary: Simpler is Better
                          WFA             PIM1            SPAA (Alpha 21364)
# Matches per cycle       High            Medium          Lower
# Cycles (0.18 micron)    4               4               3
Restart rate              Every 3 cycles  Every 3 cycles  Every cycle
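
The restart-rate row can be made concrete with a little arithmetic. The sketch below counts how many arbitrations can be started in a fixed window when a new one begins every `restart_interval` cycles, starting at cycle 0. The 12-cycle window is an arbitrary choice for illustration, and the function name is hypothetical.

```python
def arbitrations_started(window_cycles, restart_interval):
    """Count arbitrations begun at cycles 0, interval, 2*interval, ... inside the window."""
    return (window_cycles + restart_interval - 1) // restart_interval


WINDOW = 12  # arbitrary window length, in cycles
print("restart every 3 cycles:", arbitrations_started(WINDOW, 3))
print("restart every cycle:   ", arbitrations_started(WINDOW, 1))
```

Over the same window, restarting every cycle starts three times as many arbitrations as restarting every 3 cycles, which helps offset a lower number of matches per arbitration.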

Slide 18: Saturation Behavior
Reasons: hot spots & tree saturation
21364’s router shows cyclic pattern (link utilization with time)
Ideally, operate at saturation bandwidth
Solution: throttle input load
[Figure label: saturation point]

Slide 19: Rotary Rule
21364’s in-built throttling
+ maximum outstanding cache miss requests per processor = 16
Rotary Rule: more throttling
+ 21364 is a “direct” network
+ Rotary Rule prioritizes traffic in network ports over local ports
+ also, clears network congestion
+ relies on anti-starvation mechanism
WFA+Rotary: change first cell
SPAA+Rotary: change output port priority to the Rotary Rule
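
A minimal sketch of how an output port's priority could follow the Rotary Rule: contenders arriving on network input ports are served before contenders from local ports, with least-recently-selected order breaking ties. The numbering of network input ports as 0 to 3 is an assumption for this example, the names `NETWORK_INPUTS` and `rotary_select` are hypothetical, and the anti-starvation mechanism the slide mentions is not modeled.

```python
NETWORK_INPUTS = {0, 1, 2, 3}  # assumption: the four network input ports are numbered 0-3


def rotary_select(contenders, lru):
    """Pick the winning input port for one output port.

    contenders: input ports nominating a packet to this output port
    lru: all input ports, least recently selected first
    """
    # Traffic already in the network takes priority over local traffic.
    network = [inp for inp in contenders if inp in NETWORK_INPUTS]
    pool = network if network else list(contenders)
    # Within the chosen group, fall back to least-recently selected.
    return min(pool, key=lru.index)


lru = [5, 6, 2, 0, 1, 3, 4, 7]
print(rotary_select([5, 2, 6], lru))  # network port 2 wins over local ports 5 and 6
print(rotary_select([5, 6], lru))     # only local ports contend, so port 5 wins
```

Without an anti-starvation mechanism, a local port could wait indefinitely under sustained network traffic, which is why the slide lists that mechanism as a requirement.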

Slide 20: Simulation Methodology
Asim modeling infrastructure
–detailed timing model of 21364 network
–selected design points validated against RTL
Traffic patterns
–70% three coherence hops, 30% two coherence hops
–random destinations
–other traffic combinations in paper and simulated internally

Slide 21: 64 Node Network: Base Case
SPAA outperforms WFA & PIM1
24% higher throughput at knee
[Figure label: knee]

Slide 22: 64 Node Network: With Rotary Rule
Rotary Rule helps both SPAA & WFA

Slide 23: Summary & Conclusions
Arbitration algorithms
–WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider)
–PIM1: Parallel Iterative Matching with one iteration (Anderson et al., ASPLOS 1992)
–SPAA: Simple, Pipelined Arbitration Algorithm (21364)
SPAA outperforms WFA & PIM1
+ SPAA’s matching power is similar to WFA & PIM1 (when many output ports are busy)
+ SPAA minimizes interactions between ports
+ SPAA can be pipelined more effectively
Rotary Rule
+ avoids network saturation under heavy load