CIST560 by M. Hamdi 1 Packet Scheduling/Arbitration in Virtual Output Queues: Maximal Matching Algorithms (Part II)

Presentation transcript:

CIST560 by M. Hamdi 1 Packet Scheduling/Arbitration in Virtual Output Queues: Maximal Matching Algorithms (Part II)

CIST560 by M. Hamdi 2 Pointer Desynchronization
– Performance: RRM < iSlip < FIRM. The three differ only in how they update their pointers.
– Observation: iSlip and FIRM can effectively desynchronize their output pointers.
– Pointer desynchronization is most effective when it is forced.

CIST560 by M. Hamdi 3 Static Round Robin Matching (SRR): To Achieve FULL Desynchronization
Initialization. The input pointers are set to 0. The output pointers are set to an initial pattern in which no two pointers hold the same value.
The 3 steps of one iteration are:
– Request. Each input sends a request to every output for which it has a queued cell.
– Grant. If an output receives any requests, it chooses the one that appears next in a fixed round-robin schedule, starting from the highest-priority element. The output notifies each input whether or not its request was granted. The pointer to the highest-priority element of the round-robin schedule is always incremented by one (modulo N), whether there is a grant or not.

CIST560 by M. Hamdi 4 SRR (Cont’d)
– Accept. If an input receives a grant, it accepts the one that appears next in a fixed round-robin schedule, starting from the highest-priority element. The pointer to the highest-priority element of the round-robin schedule is incremented (modulo N) to one location beyond the accepted one.
In DSRR (an improved version of SRR), the input pointers are also desynchronized.
Rotating DSRR (RDSRR):
– Unfairness among inputs under a special traffic model.
– Outputs search in clockwise and anti-clockwise directions alternately to decide grants.
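The three SRR steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's reference implementation: the function name `srr_iteration`, the boolean `requests` matrix and the choice of initial output pointers 0, 1, ..., N-1 are assumptions made for the example.

```python
def srr_iteration(requests, out_ptr, in_ptr):
    """One SRR iteration (hypothetical helper).

    requests[i][j] is True if input i has a queued cell for output j.
    out_ptr and in_ptr are updated in place. Returns {input: output}.
    """
    n = len(requests)
    grants = {}  # input -> list of outputs that granted it
    # Grant: each output scans the inputs round-robin from its pointer.
    for j in range(n):
        for k in range(n):
            i = (out_ptr[j] + k) % n
            if requests[i][j]:
                grants.setdefault(i, []).append(j)
                break
        # SRR rule: the output pointer always advances by one, grant or not.
        out_ptr[j] = (out_ptr[j] + 1) % n
    match = {}
    # Accept: each input scans its grants round-robin from its pointer.
    for i, offers in grants.items():
        for k in range(n):
            j = (in_ptr[i] + k) % n
            if j in offers:
                match[i] = j
                in_ptr[i] = (j + 1) % n  # one location beyond the accepted one
                break
    return match


n = 3
requests = [[True] * n for _ in range(n)]  # every VOQ is backlogged
out_ptr = list(range(n))                   # assumed initial pattern, no duplicates
in_ptr = [0] * n
match = srr_iteration(requests, out_ptr, in_ptr)
```

With every VOQ backlogged, the distinct output pointers make each output grant a different input, so a single iteration already yields a full matching.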

CIST560 by M. Hamdi 5 Simulation Results

CIST560 by M. Hamdi 6 Simulation Results

CIST560 by M. Hamdi 7 Simulation Results

CIST560 by M. Hamdi 8 Simulation Results

CIST560 by M. Hamdi 9 Stability Property A VOQ switch is considered stable if it approaches a steady state where the expected length of each VOQ is bounded. If it is stable, 100% throughput can be achieved under any admissible traffic pattern. RDSRR is more stable than iSlip and FIRM under various traffic patterns.

CIST560 by M. Hamdi 10 Stability Property (Cont’d)

CIST560 by M. Hamdi 11 3-Phase & 2-Phase Algorithms
– iSlip and FIRM are 3-phase algorithms: Request-Grant-Accept
– DRRM is a 2-phase algorithm: Grant-Accept
  – Each input sends one grant
  – Each output sends one accept
– 2-FIRM is the 2-phase version of FIRM

CIST560 by M. Hamdi 12 DRRM (Dual Round Robin Matching)

CIST560 by M. Hamdi 13 3-Phase & 2-Phase Algorithms

CIST560 by M. Hamdi 14 3-Phase & 2-Phase Algorithms

CIST560 by M. Hamdi 15 3-Phase & 2-Phase Algorithms
– In the general case, the traffic model changes from time to time.
– When the temporary non-uniformity is on the input side, the 3-phase scheme performs better.
– When the temporary non-uniformity is on the output side, the 2-phase scheme performs better.

CIST560 by M. Hamdi 16 2-stage Maximum Size Matching Algorithm: Description
The 2-stage algorithm works in the following way:
1. The pointers at both input and output sides are kept fully desynchronized.
2. In each iteration, there are 3 steps:
– Step 1: Each input sends a request to every output for which it has a queued cell.
– Step 2: Each input selects one VOQ, the one that appears next starting from its highest-priority output, and sends a grant to that output. Each output selects one of the requests received in step 1, the one that appears next starting from its highest-priority input, and sends a grant to that input. OutputCount = number of outputs receiving grants from inputs. InputCount = number of inputs receiving grants from outputs.

CIST560 by M. Hamdi 17 2-stage Maximum Size Matching Algorithm: Description
– Step 3: If OutputCount ≥ InputCount, each output selects one of the grants received in step 2, the one that appears next starting from its highest-priority input, and sends an accept. Otherwise, each input selects one of the grants received in step 2, the one that appears next starting from its highest-priority output, and sends an accept.
In simple words, the algorithm decides in each time slot whether to use the 2-phase or the 3-phase scheme, based on which one can make more matches.
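A minimal sketch of one iteration of the 2-stage decision follows. It assumes ties (OutputCount equal to InputCount) go to the 2-phase side, and it leaves out pointer updates because the slides do not spell them out. The names `first_from` and `two_stage_iteration` are hypothetical.

```python
def first_from(ptr, candidates, n):
    """Return the first candidate at or after ptr in round-robin order."""
    for k in range(n):
        c = (ptr + k) % n
        if c in candidates:
            return c
    return None


def two_stage_iteration(requests, in_ptr, out_ptr):
    """One iteration; requests[i][j] is truthy if input i has a cell for output j."""
    n = len(requests)
    # Step 2a: each input picks one VOQ and sends a grant to that output.
    in_grant = {}
    for i in range(n):
        j = first_from(in_ptr[i], {j for j in range(n) if requests[i][j]}, n)
        if j is not None:
            in_grant[i] = j
    # Step 2b: each output picks one request and sends a grant to that input.
    out_grant = {}
    for j in range(n):
        i = first_from(out_ptr[j], {i for i in range(n) if requests[i][j]}, n)
        if i is not None:
            out_grant[j] = i
    output_count = len(set(in_grant.values()))  # outputs that received grants
    input_count = len(set(out_grant.values()))  # inputs that received grants
    match = {}
    if output_count >= input_count:
        # 2-phase side wins: outputs accept among the grants sent by inputs.
        for j in range(n):
            senders = {i for i, o in in_grant.items() if o == j}
            i = first_from(out_ptr[j], senders, n)
            if i is not None:
                match[i] = j
    else:
        # 3-phase side wins: inputs accept among the grants sent by outputs.
        for i in range(n):
            senders = {j for j, g in out_grant.items() if g == i}
            j = first_from(in_ptr[i], senders, n)
            if j is not None:
                match[i] = j
    return match, output_count, input_count


requests = [[1, 1, 1], [1, 0, 0], [1, 0, 0]]
match, output_count, input_count = two_stage_iteration(requests, [1, 0, 0], [0, 1, 2])
```

In this example all three outputs would grant input 0, so the 3-phase side can match only one pair, while the input-side grants reach two distinct outputs and produce two matches.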

CIST560 by M. Hamdi 18 2-stage Maximum Size Matching Algorithm: Hardware Implementation

CIST560 by M. Hamdi 19 Performance Evaluation: Simulation Study Uniform Traffic

CIST560 by M. Hamdi 20 Performance Evaluation: Simulation Study
Comparisons shown: 2-stage over iSlip; SRR over iSlip
Load
Improvement Percentage: 67%, 196%, 81%, 58%, 60%, 84%, 43%
Normalized Improvement Percentage: 40%, 66%, 45%, 37%, 46%, 30%
Improvement Factor
Improvement Percentage: 7%, 75%, 92%, 54%, 59%, 83%, 43%
Normalized Improvement Percentage: 7%, 43%, 48%, 35%, 37%, 45%, 30%
Improvement Factor

CIST560 by M. Hamdi 21 Performance Evaluation: Simulation Study Bursty Traffic

CIST560 by M. Hamdi 22 Performance Evaluation: Simulation Study
Comparisons shown: 2-stage over iSlip; SRR over iSlip
Load
Improvement Percentage: 213%, 96%, 70%, 46%, 28%, 16%
Normalized Improvement Percentage: 68%, 49%, 41%, 31%, 22%, 14%
Improvement Factor
Improvement Percentage: 89%, 56%, 46%, 33%, 22%, 14%
Normalized Improvement Percentage: 47%, 36%, 32%, 25%, 18%, 12%
Improvement Factor

CIST560 by M. Hamdi 23 Performance Evaluation: Simulation Study Hotspot Traffic

CIST560 by M. Hamdi 24 Performance Evaluation: Simulation Study
Comparisons shown: 2-stage over iSlip; SRR over iSlip
Load
Improvement Percentage: 26%, 56%, 101626%, 160469%, 81633%
Normalized Improvement Percentage: 21%, 36%, 100%
Improvement Factor
Improvement Percentage: 5%, 9%, 56177%, 74631%, 19618%
Normalized Improvement Percentage: 5%, 8%, 99%, 100%, 99%
Improvement Factor

CIST560 by M. Hamdi 25 Performance Evaluation: Simulation Study Unbalanced Traffic

CIST560 by M. Hamdi 26 Performance Evaluation: Simulation Study
Comparisons shown: 2-stage over iSlip; SRR over iSlip
Load
Improvement Percentage: 12%, 39%, 53%, 142%, 552%, 8040%, 3351%
Normalized Improvement Percentage: 11%, 28%, 35%, 59%, 85%, 99%, 97%
Improvement Factor
Improvement Percentage: 4%, 35%, 74%, 225%, 843%, 11494%, 3499%
Normalized Improvement Percentage: 4%, 26%, 43%, 69%, 89%, 99%, 97%
Improvement Factor

CIST560 by M. Hamdi 27 A new algorithm – RDESRR
Real Desynchronized Round Robin Model (RDESRR)
– Based on the 2-phase RRM model (Request and Grant)
– Adds a small shared memory that every output can read and write (called Share Bits)
– The size of the memory is 1 bit per input
– If a bit is set, the corresponding input has already been granted by an output
– If a bit is not set, an output may grant to the corresponding input port

CIST560 by M. Hamdi 28 RDESRR Conceptual model Share Bits

CIST560 by M. Hamdi 29 RDESRR model
2 phases only:
– Request. Each input sends a request to every output for which it has a queued cell.
– Grant. If an output receives any requests, it chooses the one that appears next in a fixed round-robin schedule, starting from the highest-priority element. The output checks whether the corresponding share bit is set. If it is not set, the output sets the bit and notifies the input that its request was granted. Otherwise, the output looks at the next request, until all requests have been examined. The pointer g_i to the highest-priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input. If no request is received, the pointer stays unchanged.
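The grant phase with Share Bits can be sketched as below. Two assumptions go beyond the slide: outputs claim share bits in index order (the slide only says first come, first served), and an output that has requests but finds every requester already claimed leaves its pointer unchanged. The function name `rdesrr_slot` is hypothetical.

```python
def rdesrr_slot(requests, g):
    """One RDESRR time slot; requests[i][j] is truthy if input i has a cell for output j.

    g[j] is output j's round-robin pointer and is updated in place.
    Returns the matching as {input: output}.
    """
    n = len(requests)
    share_bits = [False] * n  # one bit per input, reset at the start of each slot
    match = {}
    for j in range(n):  # assumed order in which outputs reach the share bits
        for k in range(n):
            i = (g[j] + k) % n
            if requests[i][j] and not share_bits[i]:
                share_bits[i] = True   # claim the input
                match[i] = j
                g[j] = (i + 1) % n     # one location beyond the granted input
                break
        # No grant made: the pointer stays where it was.
    return match


n = 3
requests = [[True] * n for _ in range(n)]
g = [0, 0, 0]  # deliberately synchronized pointers
match = rdesrr_slot(requests, g)
```

Even with all pointers starting at the same input, the share bits steer the later outputs to unclaimed inputs, so the example ends with a full matching and desynchronized pointers.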

CIST560 by M. Hamdi 30 RDESRR Demo - Request Step 1: Request

CIST560 by M. Hamdi 31 RDESRR Demo – Add a shared memory at the outputs. Step 2: Grant. A small shared memory (the Share Bits) is added that every output can read and write.

CIST560 by M. Hamdi 32 RDESRR Demo – Output checks the share bits. Step 2: Grant. The output checks whether the corresponding share bit is set.

CIST560 by M. Hamdi 33 RDESRR Demo – When share bit is occupied. Step 2: Grant. If the bit is not set, the output sets it and notifies the input that its request was granted. Share bits are claimed first come, first served.

CIST560 by M. Hamdi 34 RDESRR Demo – Output looks for next request. Step 2: Grant. If the bit is set, the output looks at the next request, until all requests have been examined.

CIST560 by M. Hamdi 35 RDESRR Demo – All share bits are allocated. Step 2: Grant. When all share bits are allocated, every requesting input has been granted.

CIST560 by M. Hamdi 36 RDESRR Demo – Pointer update/Share bit reset. The pointer g_i to the highest-priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input. If no request is received, the pointer stays unchanged. The share bits are also reset.

CIST560 by M. Hamdi 37 SIM Results Run the test for 32x32 port in SIM using – l

CIST560 by M. Hamdi 38 Input Queueing: Longest Queue First or Oldest Cell First. Maximum weight matching with weight = { queue length, waiting time } gives 100% throughput.

CIST560 by M. Hamdi 39 Input Queueing
Why is serving long/old queues better than serving the maximum number of queues?
– When traffic is uniformly distributed, servicing the maximum number of queues leads to 100% throughput.
– When traffic is non-uniform, some queues become longer than others.
– A good algorithm keeps the queue lengths matched, and services a large number of queues.
(Figure: average occupancy versus VOQ number, one panel for uniform traffic and one for non-uniform traffic.)

CIST560 by M. Hamdi 40 Maximum/Maximal Weight Matching
100% throughput for admissible traffic (uniform or non-uniform)
Maximum Weight Matching
– OCF (Oldest Cell First): w = cell waiting time
– LQF (Longest Queue First): w = input queue occupancy
– LPF (Longest Port First): w = QL of the source port + sum of QL from the source port to the destination port
Maximal Weight Matching (practical algorithms)
– iOCF
– iLQF
– iLPF (the comparators in the critical path of iLQF are removed)

CIST560 by M. Hamdi 41 Maximal Weight Matching Algorithms: iLQF Request. Each unmatched input sends a request word of width bits to each output for which it has a queued cell, indicating the number of cells that it has queued to that output. Grant. If an unmatched output receives any requests, it chooses the largest valued request. Ties are broken randomly. Accept. If an unmatched input receives one or more grants, it accepts the one to which it made the largest valued request. Ties are broken randomly.
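The request, grant and accept steps of iLQF can be sketched as follows. One deliberate deviation from the slide: ties are broken by lowest index rather than randomly, so that the example is reproducible. The function name `ilqf` and the `queue_len` matrix layout are assumptions.

```python
def ilqf(queue_len, iterations):
    """Iterative longest-queue-first sketch.

    queue_len[i][j] is the occupancy of the VOQ at input i for output j.
    Returns the matching as {input: output}.
    """
    n = len(queue_len)
    match = {}
    matched_outputs = set()
    for _ in range(iterations):
        grants = {}  # input -> outputs that granted it in this iteration
        # Request + Grant: each unmatched output grants its largest request.
        for j in range(n):
            if j in matched_outputs:
                continue
            requesters = [i for i in range(n)
                          if i not in match and queue_len[i][j] > 0]
            if requesters:
                best = max(requesters, key=lambda i: (queue_len[i][j], -i))
                grants.setdefault(best, []).append(j)
        # Accept: each input accepts the grant for its longest queue.
        for i, offers in grants.items():
            j = max(offers, key=lambda j: (queue_len[i][j], -j))
            match[i] = j
            matched_outputs.add(j)
    return match


queue_len = [[5, 4],
             [3, 1]]
after_one = ilqf(queue_len, 1)
after_two = ilqf(queue_len, 2)
```

In this 2 x 2 example the first iteration serves only the longest queue (input 0 to output 0), because both outputs grant input 0. The second iteration then pairs the remaining input and output.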

CIST560 by M. Hamdi 42 Maximal Weight Matching Algorithms: iLQF The i-LQF algorithm has the following properties: Property 1. Independent of the number of iterations, the longest input queue is always served. Property 2. As with i-SLIP, the algorithm converges in at most logN iterations. Property 3. For an inadmissible offered load, an input queue may be starved.

CIST560 by M. Hamdi 43 Maximal Weight Matching Algorithms: iOCF The i-OCF algorithm works in a similar fashion to iLQF, and has the following properties: Property 1. Independent of the number of iterations, the cell that has been waiting the longest time in the input queues (it must be at the head of its queue) will be served. Property 2. As with i-LQF, the algorithm converges in at most logN iterations. Property 3. No input queue can be starved indefinitely. Property 4. It is difficult to keep time stamps on the cells.

CIST560 by M. Hamdi 44 iLQF - Implementation

CIST560 by M. Hamdi 45 iLPF - Implementation Complicated hardware

CIST560 by M. Hamdi 46 Other research efforts Packet-based arbitration Exhaustive-based arbitration Numerous other efforts

CIST560 by M. Hamdi 47 Packet Scheduling/Arbitration in Virtual Output Queues: Randomized Algorithms and Others

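CIST560 by M. Hamdi 48 Input-Queued Packet Switch
(Figure: N inputs and N outputs connected through a crossbar under a scheduler; VOQs are indexed (1,1), ..., (i,j), ..., (N,N).)
$X_{i,j}$ is the occupancy of VOQ $(i,j)$ and $\lambda_{i,j}$ is its arrival rate. Admissible traffic: $\sum_i \lambda_{i,j} < 1$ and $\sum_j \lambda_{i,j} < 1$.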

CIST560 by M. Hamdi 49 Bipartite Graph and Matrix inputs outputs

CIST560 by M. Hamdi 50 Stability of Scheduling Definition: Let X i,j (t) be the number of packets queued at input i for output j at time-slot t. Then an algorithm is stable iff:

CIST560 by M. Hamdi Maximum size matching Maximum weight matching Maximum Matching in VOQ Architecture

CIST560 by M. Hamdi 52 Complexity of Maximum Matchings
Maximum Size/Cardinality Matchings:
– It is not a stable algorithm
– Algorithm by Dinic: $O(N^{5/2})$
Maximum Weight Matchings:
– Algorithm by Kuhn: $O(N^3 \log N)$
– It is a stable algorithm
In general:
– Hard to implement in hardware (the obstacle is that it does not lend itself to a simple hardware implementation, not its serial time complexity)
– Slooooow.

CIST560 by M. Hamdi 53 Maximal Matching Algorithms
– Maximal matching algorithms are heuristic algorithms that try to approximate MSM or MWM.
– In general, maximal matching is much simpler to implement (again, not because of its time complexity) and has a much faster running time.
– A maximal size matching is at least half the size of a maximum size matching.
– A maximal weight matching has at least half the weight of a maximum weight matching.

CIST560 by M. Hamdi 54 Maximal Size Matching Algorithm: Performance and Properties
– Can have 100% throughput under uniform traffic
– They converge in log N iterations to a maximal size matching
– Their performance can be quite good (close to an ideal Output Queued Switch) with multiple iterations
– The best iterative maximal size matching algorithm takes $O(N^2 \log N)$ serial or $O(\log N)$ parallel time steps.
– If the number of iterations is constant, then it can be implemented in constant time (that is why it is practical).

CIST560 by M. Hamdi 55 Implementation of the parallel maximal matching algorithms
(Figure: block diagram with the state of input queues ($N^2$ bits), grant arbiters 1..N, request arbiters 1..N, and a decision register.)

CIST560 by M. Hamdi 56 Small Differences (in implementation) between RRM, iSlip & FIRM, But large difference in performance
Input pointer (RRM, iSlip and FIRM):
– No grant: unchanged
– Granted: one location beyond the accepted one
Output pointer:
– No request: unchanged (RRM, iSlip and FIRM)
– Grant accepted: one location beyond the granted one (RRM, iSlip and FIRM)
– Grant not accepted: RRM: one location beyond the previously granted one; iSlip: unchanged; FIRM: the granted one
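The only rule that separates the three schemes, the output pointer update, fits in one small function. This is a sketch of the rules listed on this slide; the function name and argument layout are assumptions.

```python
def next_output_pointer(scheme, pointer, granted, accepted, n):
    """Return an output's new round-robin pointer.

    scheme   : "RRM", "iSlip" or "FIRM"
    pointer  : current pointer value
    granted  : index of the input that was granted, or None if no request arrived
    accepted : True if that grant was accepted by the input
    n        : number of ports
    """
    if granted is None:
        return pointer                 # no request: unchanged in all schemes
    if accepted or scheme == "RRM":
        return (granted + 1) % n       # one location beyond the granted one
    if scheme == "iSlip":
        return pointer                 # grant not accepted: unchanged
    if scheme == "FIRM":
        return granted                 # grant not accepted: stay on the granted one
    raise ValueError(f"unknown scheme: {scheme}")


# Output with pointer 0 grants input 2 in a 4-port switch, but the grant is refused.
rrm = next_output_pointer("RRM", 0, 2, False, 4)
islip = next_output_pointer("iSlip", 0, 2, False, 4)
firm = next_output_pointer("FIRM", 0, 2, False, 4)
```

FIRM keeps pointing at the refused input, so that input is first in line in the next slot.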

CIST560 by M. Hamdi 57 Maximum/Maximal Weight Matching
100% throughput for admissible traffic (uniform or non-uniform)
Maximum Weight Matching
– OCF (Oldest Cell First): w = cell waiting time
– LQF (Longest Queue First): w = input queue occupancy
– LPF (Longest Port First): w = QL of the source port + sum of QL from the source port to the destination port
Maximal Weight Matching (practical iterative algorithms): make these maximal weight matching algorithms operate like iSLIP
– iOCF
– iLQF
– iLPF

CIST560 by M. Hamdi 58 Maximal Weight Matching Algorithms: iLQF Request. Each unmatched input sends a request word of width bits to each output for which it has a queued cell, indicating the number of cells that it has queued to that output. Grant. If an unmatched output receives any requests, it chooses the largest valued request (has the longest queue). Ties are broken randomly. Accept. If an unmatched input receives one or more grants, it accepts the one to which it made the largest valued request (has the longest queue). Ties are broken randomly.

CIST560 by M. Hamdi 59 Maximal Weight Matching Algorithms: iLQF The i-LQF algorithm has the following properties: Property 1. Independent of the number of iterations, the longest input queue is always served. Property 2. As with i-SLIP, the algorithm converges in at most logN iterations. Property 3. For an inadmissible offered load, an input queue may be starved. Property 4. It is a stable algorithm.

CIST560 by M. Hamdi 60 Maximal Weight Matching Algorithms: iOCF The i-OCF algorithm works in a similar fashion to iLQF, and has the following properties: Property 1. Independent of the number of iterations, the cell that has been waiting the longest time in the input queues (it must be at the head of its queue) will be served. Property 2. As with i-LQF, the algorithm converges in at most logN iterations. Property 3. No input queue can be starved indefinitely. Property 4. It is difficult to keep time stamps on the cells.

CIST560 by M. Hamdi 61 Can we do better than maximal matchings using Randomized Algorithms?

CIST560 by M. Hamdi 62 Motivation Networking problems suffer from the “curse of dimensionality” –algorithmic solutions do not scale well Typical causes –size: large number of users or large number of I/O –time: very high speeds of operation A good deterministic algorithm exists (Max Flow), but … –it requires too large a data structure –it needs state information, and “state” is too big –it “starts from scratch” in each iteration

CIST560 by M. Hamdi 63 Randomization
Randomized algorithms have frequently been used in situations where the state space is very large (e.g., the N! possible connection patterns between inputs and outputs).
Randomized algorithms
– are a powerful way of approximating
– it is often possible to randomize deterministic algorithms
– this simplifies the implementation while retaining a (surprisingly) high level of performance
The main idea is
– to simplify the decision-making process
– by basing decisions upon a small, randomly chosen sample of the state
– rather than upon the complete state

CIST560 by M. Hamdi 64 An Illustrative Example Find the largest element of a set S of size 1 billion Deterministic algorithm: linear search –has a complexity of 1 billion The randomized version: find the largest of 10 randomly chosen samples –has a complexity of 10 –(note: this ignores complexity of choosing 10 random samples) Performance –linear search will find the absolute largest element –if R is the element found by randomized algorithm, we can make statements like P(R is at least the 100 millionth largest element) =  thus, we can say that the performance of the randomized algorithm is very good with a high probability
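The kind of probability statement meant here can be worked out under one stated assumption: the 10 samples are drawn independently and uniformly at random (with replacement). The best sample then misses the top fraction f of the set only if all k samples miss it, so P = 1 - (1 - f)^k. The helper name below is hypothetical, and the number it produces is a consequence of that assumption, not a value taken from the slide.

```python
def prob_best_sample_in_top(fraction, samples):
    """P(the largest of `samples` i.i.d. uniform draws lies in the top `fraction`)."""
    return 1.0 - (1.0 - fraction) ** samples


# Top 100 million out of 1 billion elements is the top 10% of the set.
p_10 = prob_best_sample_in_top(0.1, 10)    # roughly 0.65
p_50 = prob_best_sample_in_top(0.1, 50)    # larger: more samples help quickly
```

The cost grows only with the number of samples, not with the size of the set, which is the point of the example.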

CIST560 by M. Hamdi 65 Randomizing Iterative Schemes (e.g., iSLIP) Often, we want to perform some operation iteratively Example: find the heaviest matching in a switch in every time slot Since, in each time slot –at most one packet can arrive at each input –and, at most one packet can depart from each output  the size of the queues, or the “state” of the switch, doesn’t change by much between successive time slots  so, a matching that was heavy at time t will quite likely continue to be heavy at time t+1 This suggests that –knowing a heavy matching at time t should help in determining a heavy matching at time t+1  there is no need to start from scratch in each time slot

CIST560 by M. Hamdi 66 Summarizing Randomized Algorithms Randomized algorithms can help simplify the implementation –by reducing the amount of work in each iteration If the state of the system doesn’t change by much between iterations, then –we can reduce the work even further by carrying information between iterations The big pay-off is  that, even though it is an approximation, the performance of a randomized scheme can be surprisingly good

CIST560 by M. Hamdi 67 Randomized Scheduling Algorithms: Example
Consider a 3 x 3 input-queued switch
– input traffic is Bernoulli IID with $\lambda_{i,j} = \alpha/3$ for all i, j, and $\alpha < 1$
– this is admissible
– note: there are a total of 6 (= 3!) possible service matrices

CIST560 by M. Hamdi 68 Random Scheduling Algorithms
In time slot n, let S(n) be one of the 6 possible matchings, chosen independently and uniformly at random (u.a.r.).
Stability of Random:
– Consider $L_{11}(n)$, the number of packets in VOQ11.
– Arrivals to VOQ11 occur according to $A_{11}(n)$, which is Bernoulli IID with input rate $\lambda_{11} = \alpha/3$.
– This queue gets served whenever the service matrix connects input 1 to output 1.
– There are 2 service matrices that connect input 1 to output 1.
– Since Random chooses service matrices u.a.r., input 1 is connected to output 1 for a fraction of time = 2/6 = 1/3, which is the service rate between input 1 and output 1.
– $E(L_{11}(n)) < \infty$ iff $\lambda_{11} < 1/3$, that is, iff $\alpha < 1$.
Under this uniform traffic, the random algorithm is stable.
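The 2/6 service rate can be checked by enumerating the service matrices directly. This is a small illustrative sketch: the function name is hypothetical and ports are numbered from 0 rather than 1.

```python
from fractions import Fraction
from itertools import permutations


def random_service_rate(n, i, j):
    """Fraction of the n! matchings that connect input i to output j."""
    matchings = list(permutations(range(n)))  # p[i] is the output given to input i
    hits = sum(1 for p in matchings if p[i] == j)
    return Fraction(hits, len(matchings))


rate_11 = random_service_rate(3, 0, 0)  # input 1 to output 1 in the slide's numbering
```

For a 3 x 3 switch this gives 1/3, matching the slide, and for general N the same count gives (N-1)!/N! = 1/N.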

CIST560 by M. Hamdi 69 Random Scheduling Algorithms
Instability of Random: now suppose $\lambda_{i,i} = \alpha$ for all i and $\lambda_{i,j} = 0$ for $i \neq j$
– clearly, this is admissible traffic for all $\alpha < 1$
– but, under Random, the service rate at VOQ11 is 1/3 at best
– hence VOQ11 and the switch will be unstable as soon as $\alpha \geq 1/3$
Stability (or 100% throughput) means being stable under all admissible traffic!

CIST560 by M. Hamdi 70 Simulation Scenario
– Switch size: 32 x 32
– Input traffic (shown for a 4 x 4 switch): diagonal load matrix, normalized load = x + y < 1, with x = 2y
– It is a good test case

CIST560 by M. Hamdi 71 Obvious Randomized Schemes
– Choose a matching at random and use it as the schedule: doesn’t give 100% throughput (already shown)
– Choose 2 matchings at random and use the heavier one as the schedule
– Choose N matchings at random and use the heaviest one as the schedule
None of these can give 100% throughput!

CIST560 by M. Hamdi 73 Bounds on Maximum Throughput

CIST560 by M. Hamdi 74 Iterative Randomized Scheme (Tassiulas) Say M is the matching used at time t Let R be a new matching chosen uniformly at random (u.a.r.) among the N! different matchings At time t+1, use the heavier of M and R Complexity is very low O(1) iterations This gives 100% throughput !  note the boost in throughput is due to memory (saving previous matchings) But, delays are very large
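The scheme amounts to a few lines per time slot. The sketch below assumes a matching is stored as a permutation list and that the weight of an edge is the VOQ occupancy. The names `weight` and `tassiulas_step` are hypothetical, and the queue matrix is held fixed only to keep the demonstration short.

```python
import random


def weight(matching, q):
    """Total weight of a matching; matching[i] is the output given to input i."""
    return sum(q[i][j] for i, j in enumerate(matching))


def tassiulas_step(prev, q, rng):
    """Keep the heavier of the previous matching and one fresh random matching."""
    candidate = list(range(len(prev)))
    rng.shuffle(candidate)  # uniform over the N! permutations
    return candidate if weight(candidate, q) > weight(prev, q) else prev


rng = random.Random(0)
q = [[9, 1, 0],
     [2, 7, 1],
     [0, 3, 8]]           # VOQ occupancies used as edge weights
m = [1, 2, 0]             # some initial matching
history = []
for _ in range(20):
    m = tassiulas_step(m, q, rng)
    history.append(weight(m, q))
```

With the queues held fixed, the weight in `history` can never decrease, which is the memory effect the slide credits for the throughput gain. In a real switch the queue lengths change every slot, so the remembered matching is re-weighed against the current state.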

CIST560 by M. Hamdi 76 Observations for Improvement Most of the weight of a matching is carried in a small number of edges Hence, remember edges not matchings We can have 100% throughput under all admissible traffic.

CIST560 by M. Hamdi 78 Finer Observations Let M be schedule used at time t Choose a “good’’ random matching R M’ = Merge(M,R) M’ includes best edges from M and R Use M’ as schedule at time t+1 Above procedure yields algorithm called LAURA There are many other small variations to this algorithm.

CIST560 by M. Hamdi 79 Merging Procedure
(Figure: matchings X, with W(X) = 12, and R, with W(R) = 10, are merged into a matching M with W(M) = 13.)
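One way to realize "the best edges from both matchings" is sketched below. It is an assumption about the procedure, not necessarily the exact one on the slide: the union of two full matchings splits into alternating cycles, and within each cycle the heavier of the two edge sets is kept. Matchings are stored as permutation lists, and the weights are made up for the example rather than taken from the slide's figure.

```python
def weight(matching, w):
    """Total weight of a matching; matching[i] is the output given to input i."""
    return sum(w[i][j] for i, j in enumerate(matching))


def merge(x, r, w):
    """Merge two full matchings, keeping the heavier side of every cycle."""
    n = len(x)
    r_inv = [0] * n
    for i, j in enumerate(r):
        r_inv[j] = i                      # input that r assigns to output j
    merged = [None] * n
    seen = [False] * n
    for start in range(n):
        if seen[start]:
            continue
        # Walk the cycle: follow x to an output, then r back to an input.
        cycle = []
        i = start
        while not seen[i]:
            seen[i] = True
            cycle.append(i)
            i = r_inv[x[i]]
        wx = sum(w[i][x[i]] for i in cycle)
        wr = sum(w[i][r[i]] for i in cycle)
        source = x if wx >= wr else r     # both sides use the same set of outputs
        for i in cycle:
            merged[i] = source[i]
    return merged


w = [[10, 2, 0, 0],
     [2, 1, 0, 0],
     [0, 0, 1, 5],
     [0, 0, 4, 1]]
x = [0, 1, 2, 3]
r = [1, 0, 3, 2]
m = merge(x, r, w)
```

Here both inputs weigh 13, while the merged matching takes the first cycle from x and the second from r and weighs 20. The merged weight can never fall below either input.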

CIST560 by M. Hamdi 81 Can we avoid having schedulers altogether !!!

CIST560 by M. Hamdi 82 Recap: Two Successive Scaling Problems
OQ routers:
+ work-conserving (QoS)
- memory bandwidth = (N+1)R
IQ routers:
+ memory bandwidth = 2R
- arbitration complexity (bipartite matching)

CIST560 by M. Hamdi 83 IQ Arbitration Complexity
Today: 64 ports at 10 Gbps, 64-byte cells.
– Arbitration Time = 64 bytes / 10 Gbps = 51.2 ns
– Request/Grant Communication BW = 17.5 Gbps
Scaling to 160 Gbps:
– Arbitration Time = 3.2 ns
– Request/Grant Communication BW = 280 Gbps
Two main alternatives for scaling:
1. Increase cell size
2. Eliminate arbitration
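The two arbitration times follow from the cell transmission time: the scheduler must finish one decision per cell slot. The small check below uses a hypothetical helper name. The request/grant bandwidth figures are not reproduced, since the slide does not show how they were derived.

```python
def cell_time_ns(cell_bytes, line_rate_gbps):
    """Time to transmit one cell: bits divided by Gbit/s gives nanoseconds."""
    return cell_bytes * 8 / line_rate_gbps


today = cell_time_ns(64, 10)     # 64-byte cells at 10 Gbps
scaled = cell_time_ns(64, 160)   # same cells at 160 Gbps
bigger = cell_time_ns(1024, 160) # option 1: a larger (assumed 1024-byte) cell
```

Raising the line rate 16 times shrinks the arbitration budget 16 times, and increasing the cell size by the same factor buys it back.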

CIST560 by M. Hamdi 84 Desirable Characteristics for Router Architecture Ideal: OQ 100% throughput Minimum delay Maintains packet order Necessary: able to regularly connect any input to any output What if the world was perfect? Assume Bernoulli iid uniform arrival traffic...

CIST560 by M. Hamdi 85 Round-Robin Scheduling Uniform & non-bursty traffic => 100% throughput Problem: traffic is non-uniform & bursty
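A fixed round-robin schedule needs no arbitration at all. The sketch below assumes the usual cyclic-shift form, in which input i is connected to output (i + t) mod N in slot t; the function name is hypothetical.

```python
def round_robin_schedule(n, t):
    """Matching used in time slot t: entry i is the output connected to input i."""
    return [(i + t) % n for i in range(n)]


n = 4
frame = [round_robin_schedule(n, t) for t in range(n)]
# Over N consecutive slots every input is connected to every output exactly once,
# so each VOQ is served at rate 1/N.
```

Each VOQ gets a service rate of exactly 1/N, which is enough only when every VOQ's arrival rate stays below 1/N, as under uniform, non-bursty traffic.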

CIST560 by M. Hamdi 86 Two-Stage Switch (I)
(Figure: external inputs 1..N, a first round-robin stage, internal inputs 1..N, a second round-robin stage, external outputs 1..N.)

CIST560 by M. Hamdi 87 Two-Stage Switch (I)
(Figure: the same two-stage switch, with the added label Load Balancing.)

CIST560 by M. Hamdi 88 Two-Stage Switch Characteristics
– 100% throughput
– Problem: unbounded mis-sequencing
(Figure: external inputs, internal inputs and external outputs 1..N, with a cyclic shift stage.)

CIST560 by M. Hamdi 89 Two-Stage Switch (II) New: $N^3$ instead of $N^2$

CIST560 by M. Hamdi 90 Expanding VOQ Structure Solution: expand VOQ structure by distinguishing among switch inputs a b

CIST560 by M. Hamdi 91 What is being done in practice (Cisco for example) They want schedulers that achieve 100% throughput and very low delay (Like MWM) They want it to be as simple as iSLIP in terms of hardware implementation Is there any solution to this !!!!!

CIST560 by M. Hamdi 92 Typical Performance of ISLIP-like Algorithms PIM with 4 iterations

CIST560 by M. Hamdi 93 What is being done in practice (Cisco for example)
Company | Switching Capacity | Switch Architecture | Fabric Overspeed
Agere | 40 Gbit/s-2.5 Tbit/s | Arbitrated crossbar | 2x
AMCC | Gbit/s | Shared memory | 1.0x
AMCC | 40 Gbit/s-1.2 Tbit/s | Arbitrated crossbar | 1-2x
Broadcom | Gbit/s | Buffered crossbar | 1-4x
Cisco | Gbit/s | Arbitrated crossbar | 2x