Engineering
Jon Turner
Computer Science & Engineering
Washington University
Coarse-Grained Scheduling for Multistage Interconnects
Overview
System-level traffic regulation needed
  » arriving traffic unpredictable and largely uncontrolled
  » prevent congestion in interconnect
  » isolate uncongested links from effects of overloaded links
Methods used for crossbars don’t directly apply
  » scale issues: hundreds to thousands of ports
  » not practical to schedule every packet transmission
Alternate approach
  » maintain Virtual Output Queues (VOQs) at inputs
  » regulate traffic flows by controlling input sending rates
  » adjust sending rates periodically in response to traffic
  » trade-off: responsiveness vs. overhead
Coarse-Grained Scheduling
[Figure: inputs 1, 2, ..., n, each with a DS block and virtual output queues ("to output 1" ... "to output n"), connected through the interconnect to outputs 1, 2, ..., n]
Figure annotations
  » virtual output queues
  » scheduler paces queues to avoid overload
  » schedulers exchange periodic status reports
Coarse-grained nature of scheduling makes it scalable
  » limit status traffic to a fraction of interconnect bandwidth
Batch LOOFA Algorithm
Goals: avoid congestion and underflow at outputs
Preference to outputs with smallest queues
  » for output with smallest queue, send max # of cells allowed by switch bandwidth and data in input-side VOQs
  » repeat for output with second smallest queue, etc.
  » continue until no input/output pair can transfer more cells
Variants based on how inputs are selected
  » longest VOQ first
  » backlog proportional allocation
Batch LOOFA is coarse-grained, but not distributed
  » but, can be shown to be work-conserving for speedups ≥ 2
  » motivates variants that can be distributed
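The greedy pass described on this slide can be sketched in a few lines. This is only an illustrative sketch, not the author's implementation: the function name `batch_loofa`, the matrix layout `voq[i][j]`, and the choice of S*T cells as the per-input and per-output budget for one scheduling period are assumptions, and inputs are picked in the "longest VOQ first" order.

```python
def batch_loofa(voq, out_q, S, T):
    """Greedy Batch LOOFA sketch (hypothetical helper, not from the slides).

    voq[i][j] : cells queued at input i for output j
    out_q[j]  : cells currently queued at output j
    S, T      : speedup and period length; each input may send, and each
                output may receive, at most S*T cells (assumed budget)
    Returns sched[i][j], the number of cells input i sends to output j.
    """
    n = len(voq)
    cap = int(S * T)
    in_left = [cap] * n    # remaining sending budget per input
    out_left = [cap] * n   # remaining receiving budget per output
    sched = [[0] * n for _ in range(n)]

    # Visit outputs from smallest to largest output queue.
    for j in sorted(range(n), key=lambda j: (out_q[j], j)):
        # "Longest VOQ first" variant: serve inputs by decreasing VOQ length.
        for i in sorted(range(n), key=lambda i: (-voq[i][j], i)):
            x = min(voq[i][j], in_left[i], out_left[j])
            sched[i][j] = x
            in_left[i] -= x
            out_left[j] -= x
    return sched


if __name__ == "__main__":
    # Output 1 has the smaller queue, so it is served first.
    print(batch_loofa([[5, 3], [4, 0]], [2, 0], S=1, T=6))
```

After a pair (i, j) is processed, either its VOQ is fully scheduled or one of the two budgets is exhausted, so a single pass already leaves no pair that can transfer more cells.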
Implementing Batch LOOFA
Finding a maximal schedule is equivalent to finding a blocking flow (a flow that saturates at least one edge on every source-sink path)
  » blocking flows can be found in O(n^2) time
  » even when we also favor low-occupancy outputs
Hardware implementation possible
[Figure: example with S = 1.5, T = 8, in three panels. Scheduling Problem: inputs and outputs with VOQ levels and output queue levels. Scheduling Solution: inputs and outputs. Blocking Flow Problem with Solution: source s, input nodes a0 to a3, output nodes b0 to b3, sink t, each edge labeled capacity,flow.]
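To make the blocking-flow view concrete, the check below tests whether a proposed schedule is a blocking flow in the three-layer network s, a_i, b_j, t. It is a sketch under assumptions not stated on the slide: edges s→a_i and b_j→t are given capacity `cap` (for example S*T), edge a_i→b_j is given capacity `voq[i][j]`, and the name `is_blocking_flow` is hypothetical.

```python
def is_blocking_flow(flow, voq, cap):
    """Return True if `flow` is a feasible blocking flow (sketch).

    flow[i][j] : cells scheduled from input i to output j
    voq[i][j]  : assumed capacity of edge a_i -> b_j (VOQ contents)
    cap        : assumed capacity of every s -> a_i and b_j -> t edge
    """
    n = len(voq)
    sent = [sum(flow[i]) for i in range(n)]
    received = [sum(flow[i][j] for i in range(n)) for j in range(n)]

    # Feasibility: no edge carries more than its capacity.
    for i in range(n):
        for j in range(n):
            if not 0 <= flow[i][j] <= voq[i][j]:
                return False
    if any(x > cap for x in sent + received):
        return False

    # Blocking: every path s -> a_i -> b_j -> t has a saturated edge.
    for i in range(n):
        for j in range(n):
            if voq[i][j] == 0:
                continue  # no edge a_i -> b_j, hence no such path
            if sent[i] < cap and flow[i][j] < voq[i][j] and received[j] < cap:
                return False
    return True


if __name__ == "__main__":
    voq = [[5, 3], [4, 0]]
    print(is_blocking_flow([[3, 3], [3, 0]], voq, cap=6))  # maximal schedule
    print(is_blocking_flow([[0, 0], [0, 0]], voq, cap=6))  # nothing scheduled
```

A schedule passes this check exactly when no input/output pair could transfer more cells, which is the maximality condition used by Batch LOOFA.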
BLOOFA is Work-Conserving
Idealized view of coarse-grained scheduling
  » each input receives up to T bytes during input phase
  » during transfer phase, each input can send up to ST bytes and each output can receive up to ST bytes
  » each output sends up to T bytes during output phase
  » scheduler is work-conserving if, any time an output j sends < T bytes in an output phase, no input has data for j
Define
  » q_j = number of bytes in packets at output j
  » p_ij = number of bytes in V_ij and in VOQs that precede V_ij
  » slack_ij = q_j - p_ij
For speedup 2, slack_ij ≥ T before an output phase
  » because each transfer phase increases min_j slack_ij by 2T
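The slack definition can be illustrated with a small calculation for a single input i. The slide does not say how VOQs are ordered, so this sketch assumes that V_ij precedes V_ik when output j has the smaller output queue (ties broken by index); the function name `slacks` is also an assumption.

```python
def slacks(voq_i, out_q):
    """Compute slack_ij = q_j - p_ij for one input i (illustrative sketch).

    voq_i[j] : bytes in V_ij at this input
    out_q[j] : q_j, bytes queued at output j
    Assumed ordering: VOQs for outputs with smaller q_j come first.
    Returns a dict mapping j to slack_ij.
    """
    order = sorted(range(len(out_q)), key=lambda j: (out_q[j], j))
    slack = {}
    p = 0  # running p_ij: bytes in V_ij plus all VOQs that precede it
    for j in order:
        p += voq_i[j]
        slack[j] = out_q[j] - p
    return slack


if __name__ == "__main__":
    s = slacks(voq_i=[4, 0, 6], out_q=[10, 3, 5])
    print(s)                # slack per output
    print(min(s.values()))  # the minimum slack at this input
```

With these example numbers the VOQ order is output 1, output 2, output 0, and the smallest slack belongs to output 2, whose output queue holds fewer bytes than the input has queued for it and for the outputs ahead of it.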
Minimum Slack Increases
Define minSlack_i = min_j slack_ij
Outline of proof
  » slack_ij increases by 2T during each transfer phase in which V_ij is not passed
  » if slack_ij = minSlack_i + δ and δ < 2T, then slack_ij increases by at least 2T - δ during transfer phase
      to prove, must account for VOQs that pass V_ij
  » so, minSlack_i increases by at least 2T during each transfer phase
  » no VOQ can pass another during an input or output phase
  » minSlack_i never decreases during a busy period at input i
  » minSlack_i ≥ 0 before each output phase (so, slack_ij ≥ 0 also)
  » so, no wasted output phases
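The step from the per-VOQ bound to the bound on the minimum is a one-line calculation. As a sketch, write δ (a symbol assumed here) for the gap between slack_ij and minSlack_i before a transfer phase, and slack'_ij for the value after it:

```latex
\[
\mathit{slack}'_{ij}
  \;\ge\; \mathit{slack}_{ij} + (2T - \delta)
  \;=\; (\mathit{minSlack}_i + \delta) + (2T - \delta)
  \;=\; \mathit{minSlack}_i + 2T ,
  \qquad 0 \le \delta < 2T .
\]
```

So every VOQ whose slack starts within 2T of the minimum ends the transfer phase at least 2T above the old minimum. The case δ ≥ 2T is not spelled out on the slide.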
Distributed Batch LOOFA (DBL)
Periodic status exchange
  » input i sends V_ij (VOQ length) to output j (for all i,j)
  » output j sends q_j and Σ_i V_ij to input i (for all i,j)
Input i sets upper limit on rate to j to S × V_ij / Σ_i V_ij
  » guarantees traffic to j does not congest switch
Input i sets rates starting with shortest output queues first
  » for each output, go up to upper limit unless no remaining input-side bandwidth
  » can also limit by VOQ contents
Only approximates centralized BLOOFA
  » rate assignments may correspond to non-blocking flow
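The per-input rate computation can be sketched as follows. This is a hedged illustration rather than the actual DBL code: the name `dbl_rates`, the argument layout, and expressing rates as multiples of the external link rate (so that each input has S units of bandwidth to hand out) are assumptions.

```python
def dbl_rates(v_i, out_q, col_total, S):
    """Rates chosen by one input i under DBL (illustrative sketch).

    v_i[j]       : V_ij, length of this input's VOQ for output j
    out_q[j]     : q_j, reported by output j
    col_total[j] : sum over all inputs of their VOQ lengths for output j,
                   reported by output j
    S            : speedup; assumed to be this input's total sending rate
    Returns rates[j], the sending rate from this input to output j.
    """
    n = len(v_i)
    rates = [0.0] * n
    remaining = float(S)  # input-side bandwidth not yet assigned

    # Shortest output queues first.
    for j in sorted(range(n), key=lambda j: (out_q[j], j)):
        if v_i[j] == 0 or col_total[j] == 0:
            continue  # nothing queued for this output
        limit = S * v_i[j] / col_total[j]  # upper limit from the slide
        rates[j] = min(limit, remaining)
        remaining -= rates[j]
    return rates


if __name__ == "__main__":
    print(dbl_rates(v_i=[10, 0, 30], out_q=[5, 1, 2],
                    col_total=[20, 0, 40], S=2))
```

Because the upper limits for a given output j sum to S over all inputs, the combined rate into j never exceeds S, which is the congestion guarantee stated on the slide. An input that runs out of bandwidth before reaching its limit leaves some output capacity unused, which is one way the result can fall short of a blocking flow.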
Stress Test
Inputs combine to build backlogs for outputs 0, 2, ...
  » creates input contention
As inputs “drop out” they switch to unique output
  » must supply unique output, while clearing backlogs
Ideal switch can forward all packets from a p-phase test with k steps per phase in pk steps
  » overshoot = (excess steps)/pk
  » miss rate = (missed xmit opportunities)/(# of steps)
[Figure: traffic pattern in phase 0, phase 1, phase 2, phase 3]
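The two metrics are simple ratios, shown here as a small helper. The function name and arguments are hypothetical, the example numbers are made up for illustration, and "# of steps" is assumed to mean the total number of steps the run actually took.

```python
def stress_metrics(p, k, steps_taken, missed):
    """Overshoot and miss rate for a p-phase test with k steps per phase.

    steps_taken : steps the switch needed to forward all packets
    missed      : missed transmission opportunities during the run
    """
    ideal = p * k  # an ideal switch finishes in pk steps
    overshoot = (steps_taken - ideal) / ideal
    miss_rate = missed / steps_taken  # assumes "# of steps" = steps taken
    return overshoot, miss_rate


if __name__ == "__main__":
    # Made-up run: 4 phases of 100 steps, finished in 450 steps.
    overshoot, miss_rate = stress_metrics(p=4, k=100, steps_taken=450, missed=45)
    print(f"overshoot = {overshoot:.1%}, miss rate = {miss_rate:.1%}")
```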
Sample Stress Test
  » ideally, no input-side queueing for outputs 1, 3, 5
  » overshoot of 12.8%
Performance on Stress Tests
  » BLOOFA/L favors inputs with longest VOQs; BLOOFA/P uses backlog proportional allocation among inputs
  » Approximate Output Leveling (A-OLA) and Distributed Output Leveling (DOLA) algorithms seek to equalize output queue levels