
Towards Simple, High-performance Input-Queued Switch Schedulers Devavrat Shah Stanford University Berkeley, Dec 5 Joint work with Paolo Giaccone and Balaji.


1 Towards Simple, High-performance Input-Queued Switch Schedulers Devavrat Shah Stanford University Berkeley, Dec 5 Joint work with Paolo Giaccone and Balaji Prabhakar

2 2 Outline
Description of input-queued switches
Scheduling
–the problem
–some history
Simple, high-performance schedulers
–Laura
–Serena
–Apsara
Conclusions

3 3 The Input-Queued (IQ) Switch Architecture
N inputs, N outputs (in the figure, N = 3)
Time is slotted
–at most one packet can arrive per time-slot at each input
Equal-sized cells/packets
Buffers only at inputs
Use a crossbar for switching packets

4 4 Scheduling
The crossbar imposes these constraints in each time-slot:
–only one packet can be transferred to each output
–only one packet can be transferred from each input
The scheduling problem: subject to the above constraints, find a matching of inputs and outputs
–i.e. determine which output will receive a packet from which input in each time-slot
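
The two crossbar constraints amount to saying that a schedule is a matching. A minimal sketch of that check, assuming a schedule is stored as a list `match` where `match[i]` is the output served by input `i` (this representation and the function name are illustrative, not from the talk):

```python
def is_valid_schedule(match, n):
    """Return True if `match` obeys the crossbar constraints of an n x n switch.

    match[i] is the output served by input i, or None if input i is idle.
    The list representation is an assumption made for illustration.
    """
    if len(match) != n:
        return False
    used_outputs = [j for j in match if j is not None]
    # Each input sends at most one packet by construction (one slot per input),
    # so only the "one packet per output" constraint needs checking.
    in_range = all(0 <= j < n for j in used_outputs)
    return in_range and len(set(used_outputs)) == len(used_outputs)


print(is_valid_schedule([1, 2, 0], 3))  # every output used once
print(is_valid_schedule([1, 1, 0], 3))  # output 1 used twice
```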

5 5 Background to switch scheduling
1. [Karol et al. 1987] Throughput is limited due to head-of-line blocking (limited to 58% for Bernoulli IID uniform traffic)
2. [Tamir 1989] Observed that with “Virtual Output Queues” (VOQs) head-of-line blocking is eliminated.

6 6 Basic Switch Model
[Figure: N x N switch with arrival processes A_11(t), ..., A_NN(t), VOQ occupancies L_11(t), ..., L_NN(t), schedule S(t), and departure processes D_1(t), ..., D_N(t).]

7 7 Some definitions
3. Queue occupancies: L_11(t), ..., L_NN(t)

8 8 More background on theory
[Anderson et al. 1993] Finding a schedule is equivalent to finding a matching in the bipartite graph induced by the input and output nodes

9 9 Background
[McKeown et al. 1995]
(a) Maximum size match does not give 100% throughput.
(b) But maximum weight match can, where the weight can be the queue length or the age of a cell.
[Figure: a weighted bipartite graph and its maximum weight matching (MWM).]

10 10 Maximum Weight Matching
Maximum weight matching (MWM)
–100% throughput
–provable delay bounds for i.i.d. Bernoulli admissible traffic
–but finding the MWM is like solving a network-flow problem, which is too complex for high-speed networks
We seek to approximate maximum weight matching
Our goal:
–obtain a simply implementable approximation to MWM that performs competitively with MWM
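
To make the target concrete, here is a brute-force reference for MWM on a tiny switch. It is only a sketch: it enumerates all N! matchings, which is exactly the cost the talk wants to avoid, and the queue lengths in `queues` are made-up values.

```python
from itertools import permutations


def matching_weight(perm, weights):
    """Weight of the matching that connects input i to output perm[i]."""
    return sum(weights[i][perm[i]] for i in range(len(perm)))


def brute_force_mwm(weights):
    """Return a maximum weight matching by trying all N! permutations."""
    n = len(weights)
    return max(permutations(range(n)), key=lambda p: matching_weight(p, weights))


# Hypothetical VOQ lengths: queues[i][j] = cells at input i destined to output j.
queues = [[3, 0, 2],
          [1, 4, 0],
          [0, 5, 2]]
best = brute_force_mwm(queues)
print(best, matching_weight(best, queues))
```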

11 11 Approximating MWM
Two performance measures
–throughput
–delay
We first consider simple approximations to MWM that deliver 100% throughput (i.e. stability), and then deal with delay

12 12 Methods of Approximation
Randomization
–well-known method for simplifying implementation
Using information in packet arrivals
–since queue-sizes grow due to arrivals, and arrival times are a source of randomness
Hardware parallelism
–yields an efficient search procedure

13 13 Randomization
The main idea of randomized algorithms is
–to simplify the decision-making process by basing decisions upon a small, randomly chosen sample from the state rather than upon the complete state

14 14 An Illustrative Example
Find the oldest person from a population of 1 billion
Deterministic algorithm: linear search
–has a complexity of 1 billion
A randomized version: find the oldest of 30 randomly chosen people
–has a complexity of 30 (ignoring complexity of random sampling)
Performance
–linear search will find the absolute oldest person (rank = 1)
–if R is the person found by the randomized algorithm, we can make statements like P(R has rank < 100 million) > 0.95
–thus, we can say that the performance of the randomized algorithm is very good with a high probability
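
The kind of guarantee quoted for the sampled version follows from a short calculation: the best of k uniform samples misses the top fraction f of the population with probability (1 - f)^k. A sketch, assuming independent uniform samples (sampling with replacement, which is a close approximation when the population is far larger than the sample):

```python
def prob_best_in_top_fraction(samples, fraction):
    """P(the best of `samples` uniform draws lies in the top `fraction`)."""
    # The best draw misses the top fraction only if every draw misses it.
    return 1.0 - (1.0 - fraction) ** samples


# Top 100 million out of 1 billion is the top 10%; 30 people are sampled.
p = prob_best_in_top_fraction(30, 0.10)
print(round(p, 3))  # roughly 0.958
```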

15 15 Randomizing Iterative Schemes
Often, we want to perform some operation iteratively
Example: find the oldest person each year
Say in 2001 you choose 30 people at random
–and store the identity of the oldest person in memory
–in 2002 you choose 29 new people at random
–let R be the oldest person from these 29 + 1 = 30 people
P(R has rank < 100 million) or, P(R has rank < 50 million)
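
The yearly scheme can be sketched as follows. This is an illustration under simplifying assumptions that are not in the talk: the population and the age ordering stay fixed from year to year, and `ages` holds made-up values.

```python
import random


def oldest_with_memory(ages, years, fresh=29, seed=0):
    """Each year, keep the oldest among `fresh` new samples and last year's pick."""
    rng = random.Random(seed)
    people = range(len(ages))
    # First year: nothing is remembered yet, so draw fresh + 1 people.
    best = max(rng.sample(people, fresh + 1), key=lambda i: ages[i])
    history = [best]
    for _ in range(years - 1):
        # Later years: fresh new samples plus the remembered person.
        candidates = rng.sample(people, fresh) + [best]
        best = max(candidates, key=lambda i: ages[i])
        history.append(best)
    return history


age_rng = random.Random(1)
ages = [age_rng.random() for _ in range(10_000)]  # hypothetical ages
history = oldest_with_memory(ages, years=20)
```

Because the remembered person is always among the candidates, the age of the stored pick never decreases from one year to the next.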

16 16 Back to Switch Scheduling: Randomizing MWM
Choose d matchings at random and use the heaviest one as the schedule
Ideally we would like to have small d. However:
Theorem: Even with d = N this algorithm doesn’t yield 100% throughput!
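
A minimal sketch of this memoryless randomized scheme, assuming matchings are drawn uniformly as random permutations; the function names and the queue values are illustrative only.

```python
import random


def random_matching(n, rng):
    """Draw a uniformly random perfect matching as a permutation of outputs."""
    perm = list(range(n))
    rng.shuffle(perm)
    return perm


def heaviest_of_d(weights, d, rng):
    """Sample d random matchings and return the heaviest one."""
    n = len(weights)
    candidates = [random_matching(n, rng) for _ in range(d)]
    return max(candidates, key=lambda p: sum(weights[i][p[i]] for i in range(n)))


queues = [[3, 0, 2],
          [1, 4, 0],
          [0, 5, 2]]  # hypothetical VOQ lengths
schedule = heaviest_of_d(queues, d=3, rng=random.Random(0))
```

Nothing is carried over between time slots here, which is the property the theorem on this slide says is not enough for 100% throughput.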

17 17 Proof

18 18 Simulation Scenario
Switch size: 32 x 32
Input traffic (shown for a 4 x 4 switch)
–Bernoulli i.i.d. inputs
–diagonal load matrix: normalized load = x + y < 1, x = 2y
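
One way to build such a load matrix is sketched below. The slide only gives the relations load = x + y and x = 2y; placing x on the main diagonal and y on the next diagonal (wrapping around) is an assumption made for this sketch.

```python
def diagonal_load_matrix(n, load):
    """Arrival-rate matrix with x on the diagonal and y next to it, x = 2y, x + y = load."""
    y = load / 3.0
    x = 2.0 * y
    rates = [[0.0] * n for _ in range(n)]
    for i in range(n):
        rates[i][i] = x            # assumed position of the heavy entry
        rates[i][(i + 1) % n] = y  # assumed position of the light entry
    return rates


rates = diagonal_load_matrix(4, 0.9)
for row in rates:
    print(row)
```

With this layout every row and every column sums to the normalized load, so the traffic is admissible whenever the load is below 1.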


20 20 Crucial Observation
The state of the switch changes due to arrivals & departures
Between consecutive time slots, a queue’s length can change by at most 1
–hence a heavy matching tends to stay heavy
Therefore
–“remembering” a heavy matching should help in improving the performance

21 21 Tassiulas’ Algorithm
[Tassiulas 1998] proposed the following algorithm based on this observation:
–let S(t-1) be the matching used at time t-1
–let R(t) be a matching chosen uniformly at random
–and let S(t) be the heavier of R(t) and S(t-1)
This gives 100% throughput!
–note the boost in throughput is due to the use of memory
But, delays are very large
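
The three steps above can be sketched as a single update function. Representing a matching as a permutation list and the example queue values are assumptions made for illustration.

```python
import random


def weight(match, queues):
    """Sum of the queue lengths served by a matching (input i -> output match[i])."""
    return sum(queues[i][j] for i, j in enumerate(match))


def tassiulas_step(prev, queues, rng):
    """S(t) is the heavier of a uniformly random matching R(t) and S(t-1)."""
    candidate = list(range(len(queues)))
    rng.shuffle(candidate)  # R(t): uniform random permutation
    return candidate if weight(candidate, queues) > weight(prev, queues) else prev


queues = [[3, 0, 2],
          [1, 4, 0],
          [0, 5, 2]]  # hypothetical VOQ lengths at time t
prev = [1, 2, 0]      # S(t-1), weight 0 under these queues
nxt = tassiulas_step(prev, queues, random.Random(0))
```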


23 23 Derandomization
Let G be a fully-connected graph where each node is one of the N! possible schedules
Construct a Hamiltonian walk, H(t), on G
–H(t) cycles through the nodes of G
At any time t
–let R(t) = H(t mod N!)
–and let S(t) be the heavier of R(t) and S(t-1)
–this also has 100% throughput, but delays are large (derandomization will be useful later)
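
Since G is fully connected, any fixed ordering of the N! matchings is a Hamiltonian walk. The sketch below uses the order produced by `itertools.permutations`, which is a choice made here for convenience and is only practical for very small N.

```python
from itertools import cycle, permutations


def weight(match, queues):
    """Sum of the queue lengths served by a matching (input i -> output match[i])."""
    return sum(queues[i][j] for i, j in enumerate(match))


def derandomized_steps(n, start, snapshots):
    """Yield S(t) for each queue snapshot, comparing S(t-1) with H(t mod N!)."""
    walk = cycle(permutations(range(n)))  # visits all N! matchings, then repeats
    current = tuple(start)
    for queues in snapshots:
        candidate = next(walk)
        if weight(candidate, queues) > weight(current, queues):
            current = candidate
        yield current


queues = [[3, 0, 2],
          [1, 4, 0],
          [0, 5, 2]]  # hypothetical VOQ lengths, held fixed for the demo
schedules = list(derandomized_steps(3, (1, 2, 0), [queues] * 6))
```

With the queues held fixed for N! = 6 slots, the walk visits every matching once, so the last schedule is a maximum weight matching for those queues.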

24 24 Stability
Lemma: Consider an IQ switch with Bernoulli i.i.d. inputs. Let B be a matching algorithm which ensures W_B(t) >= W*(t) - c for every t. Then B is stable.
Theorem: W_DER(t) >= W*(t) - 2N·N!. Therefore, it is stable.

25 25 Delay
These simple approximations of MWM yield 100% throughput, but delays are large
To obtain good delays we’ll present three different algorithms which use the following features:
–selective remembrance (Laura)
–information in the arrivals (Serena)
–hardware parallelism (Apsara)

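26 26 Laura
Tassiulas: COMP = Maximum; R(t) is a uniform sample
Laura: COMP = Merge, which picks the best edges of the two matchings; R(t) is a non-uniform sample
[Figure: block diagram in which COMP combines S(t-1) and R(t) to produce S(t), which becomes S(t-1) at the next time slot.]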

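27 27 Merging Procedure
[Figure: merging S(t-1) and R. The union of the two matchings splits into cycles, and on each cycle the heavier set of edges is kept. Weight differences on the two cycles: 10 - 40 + 10 - 30 + 10 - 50 = -90 and 70 - 10 + 60 - 20 = 100. W(S(t-1)) = 160, W(R) = 150, W(S(t)) = 250.]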
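
A sketch of a merge that keeps, on each cycle formed by the two matchings, whichever matching is heavier there. The edge weights below are taken from the slide's totals (160, 150, 250), but which input and output each weight sits on is not recoverable from the transcript, so the layout in `w`, `s` and `r` is assumed.

```python
def merge(s, r, w):
    """Merge perfect matchings s and r (permutation lists) under weights w."""
    n = len(s)
    r_inv = [0] * n
    for i, j in enumerate(r):
        r_inv[j] = i  # r_inv[j] = input that r connects to output j
    merged = [None] * n
    seen = [False] * n
    for start in range(n):
        if seen[start]:
            continue
        # Walk one cycle: follow an s-edge to an output, then the r-edge back.
        cyc, i = [], start
        while not seen[i]:
            seen[i] = True
            cyc.append(i)
            i = r_inv[s[i]]
        ws = sum(w[i][s[i]] for i in cyc)
        wr = sum(w[i][r[i]] for i in cyc)
        source = s if ws >= wr else r  # keep the heavier side of this cycle
        for i in cyc:
            merged[i] = source[i]
    return merged


# Assumed layout: a 5 x 5 switch with one 3-input cycle and one 2-input cycle.
w = [[0] * 5 for _ in range(5)]
s = [0, 1, 2, 3, 4]  # S(t-1)
r = [1, 2, 0, 4, 3]  # R
for i, val in enumerate([10, 10, 10, 70, 60]):
    w[i][s[i]] = val  # W(S(t-1)) = 160
for i, val in enumerate([40, 30, 50, 10, 20]):
    w[i][r[i]] = val  # W(R) = 150
merged = merge(s, r, w)
print(merged, sum(w[i][merged[i]] for i in range(5)))
```

Both matchings use the same outputs on any given cycle, so choosing independently per cycle always yields a valid matching, and its weight is at least that of either input.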

28 28 Throughput
Theorem:
–LAURA is stable under any admissible Bernoulli i.i.d. input traffic.

29 29 Average Backlog via Simulation
Switch size: N = 32
Length of VOQ: Q_MAX = 10000
Comparison with
–iSLIP, iLQF, MUCS, RPA and MWM

30 30 Simulation
Traffic Matrices
–uniform
–diagonal
–sparse
–logdiagonal

31 31 Laura: Diagonal traffic

32 32 Laura: Sparse traffic

33 33 Serena
Since an increase in queue sizes is due to arrivals
And arrivals are a source of randomness
–use arrivals to generate a random matching

34 34 Serena
R(t) = matching generated using arrivals
[Figure: block diagram in which Merge combines S(t-1) and R(t) to produce S(t), which becomes S(t-1) at the next time slot.]
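
The slide does not spell out how the arrivals are turned into a matching. One plausible construction, given here purely as an assumption, keeps for each contended output the arriving input with the longest queue and then pairs the leftover inputs and outputs in index order.

```python
def matching_from_arrivals(arrivals, queues):
    """Build a full matching R(t) from this slot's arrivals (assumed procedure).

    arrivals[i] is the output targeted by the cell arriving at input i,
    or None if input i had no arrival in this slot.
    """
    n = len(queues)
    owner = {}  # output -> input chosen to keep that output
    for i, j in enumerate(arrivals):
        if j is None:
            continue
        # Several inputs may target one output; keep the longest queue.
        if j not in owner or queues[i][j] > queues[owner[j]][j]:
            owner[j] = i
    match = [None] * n
    for j, i in owner.items():
        match[i] = j
    # Complete the matching: unmatched inputs take the unused outputs in order.
    free_outputs = [j for j in range(n) if j not in owner]
    for i in range(n):
        if match[i] is None:
            match[i] = free_outputs.pop(0)
    return match


queues = [[0, 5, 0],
          [0, 7, 0],
          [0, 0, 0]]          # hypothetical VOQ lengths
arrivals = [1, 1, None]       # inputs 0 and 1 both receive a cell for output 1
print(matching_from_arrivals(arrivals, queues))
```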

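35 35 Merging Procedure
[Figure: the arrival graph is turned into a matching R with edge weights 89, 3, 5, 23, 1 and W(R) = 121. Merging R with S(t-1), where W(S(t-1)) = 209, gives S(t) with edge weights 89, 3, 23, 31, 97 and W(S(t)) = 243.]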

36 36 Throughput
Theorem:
–SERENA achieves 100% throughput under any admissible i.i.d. Bernoulli traffic pattern

37 37 Serena: Diagonal traffic

38 38 Apsara
One way to obtain MWM is to search the space of all N! matchings
A natural approximation: if S(t-1) is the current matching, then S(t) is the heaviest matching in a “neighborhood” of S(t-1)
It turns out that there is a convenient way of defining neighbors (both for theory and for practice)

39 39 Neighbors
Neighbors differ from S(t) in ONLY TWO edges (for all values of N)
[Figure: example for a 3 x 3 switch showing S(t) and its neighbors.]
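
A matching that differs from S(t) in exactly two edges can be obtained by letting two inputs exchange their outputs. A short sketch that enumerates all such neighbors, assuming the permutation-list representation of a matching:

```python
from itertools import combinations


def neighbors(match):
    """All matchings obtained from `match` by swapping the outputs of two inputs."""
    result = []
    for a, b in combinations(range(len(match)), 2):
        nb = list(match)
        nb[a], nb[b] = nb[b], nb[a]
        result.append(nb)
    return result


print(neighbors([0, 1, 2]))  # the 3 neighbors in a 3 x 3 switch
```

There are N(N-1)/2 neighbors, one per pair of inputs.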

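40 40 Apsara
Neighbors N1, N2, ..., Nk of S(t-1) are generated in parallel
[Figure: block diagram in which MAX selects S(t) from S(t-1), its neighbors N1, ..., Nk, and H(t), the matching supplied by the Hamiltonian walk; S(t) becomes S(t-1) at the next time slot.]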
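
One Apsara step can be sketched sequentially as below; in hardware the neighbors would be evaluated in parallel. Taking the maximum over S(t-1), its two-edge neighbors and the Hamiltonian-walk matching H(t) is this sketch's reading of the diagram, and the queue values are made up.

```python
from itertools import combinations


def weight(match, queues):
    """Sum of the queue lengths served by a matching (input i -> output match[i])."""
    return sum(queues[i][j] for i, j in enumerate(match))


def apsara_step(prev, hamiltonian_candidate, queues):
    """Pick the heaviest of S(t-1), its two-edge neighbors, and H(t)."""
    candidates = [list(prev), list(hamiltonian_candidate)]
    for a, b in combinations(range(len(prev)), 2):
        nb = list(prev)
        nb[a], nb[b] = nb[b], nb[a]  # swap the outputs of inputs a and b
        candidates.append(nb)
    return max(candidates, key=lambda m: weight(m, queues))


queues = [[3, 0, 2],
          [1, 4, 0],
          [0, 5, 2]]  # hypothetical VOQ lengths
prev = [1, 0, 2]      # S(t-1)
h_t = [2, 1, 0]       # H(t), supplied by the Hamiltonian walk
nxt = apsara_step(prev, h_t, queues)
print(nxt, weight(nxt, queues))
```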

41 41 Apsara: Throughput
Theorem: Apsara is stable under any admissible i.i.d. Bernoulli traffic. (stability due to Hamiltonian matching)
Also, note that W(S(t)) >= W(S(t-1),t)
Theorem: If W(S(t)) = W(S(t-1),t) then W(S(t)) >= 0.5 W*(t) (this is not enough to ensure stability)

42 42 Apsara: Diagonal traffic

43 43 Limited Parallelism
The Apsara algorithm searches over neighbors in parallel
If space is limited to K modules, then search over a randomly chosen subset of size K from all neighbors
And there are other (good) deterministic ways of searching a smaller neighborhood of matchings
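
A sketch of the randomized variant: only K randomly chosen neighbors are compared with S(t-1). The Hamiltonian-walk candidate is left out to keep the example short, and the names and queue values are illustrative.

```python
import random
from itertools import combinations


def weight(match, queues):
    """Sum of the queue lengths served by a matching (input i -> output match[i])."""
    return sum(queues[i][j] for i, j in enumerate(match))


def apsara_limited_step(prev, queues, k, rng):
    """Pick the heaviest of S(t-1) and k randomly chosen two-edge neighbors."""
    pairs = list(combinations(range(len(prev)), 2))
    candidates = [list(prev)]
    for a, b in rng.sample(pairs, min(k, len(pairs))):
        nb = list(prev)
        nb[a], nb[b] = nb[b], nb[a]  # swap the outputs of inputs a and b
        candidates.append(nb)
    return max(candidates, key=lambda m: weight(m, queues))


queues = [[3, 0, 2],
          [1, 4, 0],
          [0, 5, 2]]  # hypothetical VOQ lengths
prev = [1, 0, 2]
nxt = apsara_limited_step(prev, queues, k=2, rng=random.Random(0))
```

Because S(t-1) is always a candidate, the chosen matching is never lighter than the previous one under the current queue lengths.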

44 44 Apsara: Limited parallelism

45 45 Diagonal traffic

46 46 Conclusions
We have presented novel scheduling algorithms for input-queued switches
–Laura
–Serena
–Apsara
They are simple to implement and perform competitively with respect to the Maximum Weight Matching algorithm

47 47 References
1. L. Tassiulas, “Linear complexity algorithms for maximum throughput in radio networks and input-queued switches,” Proc. INFOCOM 1998.
2. D. Shah, P. Giaccone and B. Prabhakar, “An efficient randomized algorithm for input-queued switch scheduling,” Proc. of Hot Interconnects, 2001.
3. P. Giaccone, D. Shah and B. Prabhakar, “An Implementable Parallel Scheduler for Input-Queued Switches,” Proc. of Hot Interconnects, 2001.
4. P. Giaccone, B. Prabhakar and D. Shah, “Towards simple and efficient scheduler for high-aggregate IQ switches,” submitted to INFOCOM’02.
5. R. Motwani and P. Raghavan, Randomized Algorithms, Cambridge University Press, 1995.

48 48 Uniform traffic

49 49 LogDiagonal traffic

