Distributed Crossbar Schedulers

Distributed Crossbar Schedulers (HPSR 2006)
Cyriel Minkenberg¹, Francois Abel¹, Enrico Schiattarella²
¹ IBM Research, Zurich Research Laboratory
² Dipartimento di Elettronica, Politecnico di Torino

Outline
- OSMOSIS overview
- Challenges in the OSMOSIS scheduler design
- Basics of crossbar scheduling
- Distributed scheduler
  - Architecture
  - Problems
  - Solutions
  - Results
- Implementation

OSMOSIS Overview
[Figure: the OSMOSIS all-optical switch. 64 ingress adapters with VOQs feed 8 broadcast units (8x1 combiner, optical amplifier, WDM mux, 1x128 star coupler); 128 select units (fast SOA 1x8 fiber selector gates and fast SOA 1x8 wavelength selector gates) drive the 64 egress adapters. A central scheduler running a bipartite graph matching (BGM) algorithm exchanges requests and grants with the adapters over 64 control links and issues the SOA switch commands.]
- 64 ports @ 40 Gb/s, 256-byte cells => 51.2 ns time slot
- Broadcast-and-select architecture (crossbar)
- Combination of wavelength- and space-division multiplexing
- Fast switching based on SOAs
- Electronic input and output adapters, electronic arbitration

Architectural Scheduler Challenges
- Latency < 1 µs
  - Problem: long permission latency (RTT + scheduling)
  - Solution: speculation
- Multicast support
  - Problem: fair integration with unicast scheduling; control channel overhead
  - Solution: independent schedulers with a filter, merge & feedback scheme
- Scheduling rate = cell rate
  - Problem: produce one high-quality matching every 51.2 ns
  - Solution: deeply pipelined matching with parallel sub-schedulers (FLPPR)
- FPGA-only scheduler implementation
  - Problem: does a 64-port scheduler fit in one FPGA device? If not, how do we distribute it over multiple devices while maintaining an acceptable level of performance?

Crossbar Scheduling: Bipartite Graph Matching
A crossbar is a non-blocking fabric that can transfer cells from any input to any output under the following constraints:
- At most one cell from any input
- At most one cell to any output
Finding a set of input-output pairs that satisfies these constraints is equivalent to Bipartite Graph Matching (BGM).
[Figure: a request matrix and three matchings derived from it: two maximal matchings of sizes 3 and 2, and a maximum matching of size 4.]
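
The distinction between maximal (no edge can be added) and maximum (largest possible) matchings can be seen in a few lines of Python. This is an illustrative sketch, not code from the talk; the request matrix below is a hypothetical example.

```python
def greedy_maximal_match(R):
    """Greedily add (input, output) edges until none fit. The result is
    maximal (cannot be extended) but not necessarily maximum (largest)."""
    n = len(R)
    in_used, out_used, match = set(), set(), []
    for i in range(n):
        for j in range(n):
            if R[i][j] and i not in in_used and j not in out_used:
                match.append((i, j))
                in_used.add(i)
                out_used.add(j)
                break
    return match

# R[i][j] = 1 means input i has a cell for output j.
R = [[1, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 0]]
# Greedy picks (0,0) and (2,2): a maximal matching of size 2, while the
# maximum matching (0,1), (1,0), (2,3), (3,2) has size 4.
print(greedy_maximal_match(R))
```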

Pointer-based Parallel Iterative Matching
- One matching must be computed in every time slot, so we need fast and simple algorithms
- A suitable class of algorithms is parallel, iterative, and based on round-robin pointers: i-SLIP (McKeown), DRRM (Chao)
- These algorithms have a number of desirable features:
  - 100% throughput under uniform i.i.d. traffic
  - Starvation-free: any VOQ is served within finite time under any traffic pattern
  - Iterative: sequential improvement of the matching by repeating steps
  - Amenable to fast hardware implementation: high degree of parallelism and symmetry

DRRM Operation
- Step 0: Initially, all inputs and outputs are unmatched
- Step 1: Each unmatched input requests the first unmatched output in round-robin order for which it has a packet, starting from pointer R[i]. R[i] ← (R[i] + 1) mod N iff the request is granted in Step 2 of the first iteration
- Step 2: Each output grants the first input in round-robin order that has requested it, starting from pointer G[o]. G[o] ← (G[o] + 1) mod N
- Iterate: Repeat Steps 1 and 2 until no more edges can be added or a fixed number of iterations is completed
The key to good performance is pointer desynchronization: if all VOQs are non-empty, the pointers eventually all point to different outputs, so there are no conflicts and performance is maximal.
[Figure: VOQ state feeding input selectors IS[1..4], which exchange requests and grants with output selectors OS[1..4].]
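
For intuition, here is a compact software model of the first DRRM iteration. It is a sketch, not the hardware implementation from the talk; names such as `voq` and `drrm_iteration` are invented for illustration.

```python
N = 4  # number of ports

def drrm_iteration(voq, R, G, matched_in, matched_out):
    """One (first) DRRM iteration. voq[i][j] is the number of cells input i
    holds for output j; R and G are the round-robin pointer arrays."""
    # Step 1: each unmatched input requests the first unmatched output,
    # in round-robin order starting at R[i], for which it has a cell.
    requests = {}  # output -> inputs requesting it
    for i in range(N):
        if i in matched_in:
            continue
        for k in range(N):
            j = (R[i] + k) % N
            if voq[i][j] > 0 and j not in matched_out:
                requests.setdefault(j, set()).add(i)
                break
    # Step 2: each requested output grants the first requester in
    # round-robin order starting at G[o], then advances its pointer.
    for o, reqs in requests.items():
        for k in range(N):
            i = (G[o] + k) % N
            if i in reqs:
                matched_in.add(i)
                matched_out.add(o)
                G[o] = (G[o] + 1) % N
                # R[i] advances only on a grant in the *first* iteration;
                # this rule is what drives pointer desynchronization.
                R[i] = (R[i] + 1) % N
                break
    return matched_in, matched_out
```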

Distribution Issues
Problem: the scheduler does not fit in a single device due to area constraints (quadratic complexity growth of the priority encoders).
A monolithic implementation rests on implicit temporal and spatial assumptions:
- All results are available before the next time slot (or iteration)
- All required information is available to all selectors
A distributed implementation breaks these assumptions. The time required for information to travel from the inputs to the outputs and back is the round-trip time (RTT); let Θ = RTT / (cell duration). For example, with the 51.2 ns OSMOSIS time slot and Θ = 4 (the value used below), the RTT is about 205 ns.
Main problem: an input selector issues a request at t0 and receives the result (granted or not) only at t0 + RTT, so
- an input selector does not know the results of requests issued during the last RTT, and
- selectors are only aware of local status information (e.g., matches made in previous iterations).
[Figure: timeline with RTT >> cell duration; input selectors IS[1..N] perform status update and selection, requests cross the control channel to output selectors OS[1..N], which perform output selection and status update and return grants one RTT later.]

Coping with Uncertainty (1)
Problem: uncertainty in the algorithm's status
- The pointer-update mechanism breaks
- No desynchronization → throughput loss
Solution: maintain a separate pointer set for each time slot in the RTT
- Basic idea: no pointer is reused before the last result is available
- Each input (output) selector maintains Θ distinct request (grant) pointers, labeled R[t] and G[t], with t ∈ [0, Θ-1]
- At time slot t the input selectors use set R[t mod Θ] to generate requests; each request carries the ID of the pointer set used
- Output selectors generate grants using G[t] in response to requests from R[t]
- Each pointer set is updated independently from the others, so they all desynchronize independently; therefore, all the good features of DRRM are preserved
- Pointer sets are only updated once every RTT, hence they take longer to desynchronize
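
A minimal sketch of the per-time-slot pointer sets, assuming Θ = 4 and N = 64 as in the OSMOSIS design (class and method names are invented for illustration):

```python
THETA = 4   # RTT expressed in time slots
N = 64      # ports

class InputSelector:
    """Keeps one request pointer per time slot in the RTT, so no pointer
    is reused before the result of its last request has returned."""
    def __init__(self):
        self.R = [0] * THETA  # one round-robin pointer per pointer set

    def pointer_for(self, t):
        # In time slot t, requests are generated from set (t mod Θ).
        return self.R[t % THETA]

    def on_result(self, t, granted):
        # The result for the request issued at time t arrives one RTT
        # later; it updates only the pointer set that issued it.
        if granted:
            self.R[t % THETA] = (self.R[t % THETA] + 1) % N
```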

Coping with Uncertainty (2)
Problem: uncertainty in the algorithm's status
- The VOQ-state update mechanism breaks: how many requests were successful?
- Excess requests may lead to "wasted" grants, reducing performance
Solution: maintain a pending request counter for every VOQ (sketched below)
- P(i,j) tracks the number of requests issued for VOQ(i,j) over the last RTT: increment when issuing a new request, decrement when the result arrives
- Filter requests: if P(i,j) exceeds the number of unserved cells in VOQ(i,j), do not submit further requests
- This massively reduces the number of wasted grants
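
A sketch of the pending-request-counter (PRC) bookkeeping, in the same illustrative Python as above (names are invented):

```python
class PendingRequestFilter:
    """P[i][j] counts requests for VOQ(i,j) still in flight over the
    last RTT; requests beyond the number of unserved cells are filtered
    out, which avoids grants that cannot be used."""
    def __init__(self, n):
        self.P = [[0] * n for _ in range(n)]

    def may_request(self, i, j, voq_len):
        # Submit a new request only while outstanding requests are
        # fewer than the cells actually waiting in VOQ(i,j).
        return self.P[i][j] < voq_len

    def on_request(self, i, j):
        self.P[i][j] += 1   # increment when issuing a new request

    def on_result(self, i, j):
        self.P[i][j] -= 1   # decrement when the result arrives, one RTT later
```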

Multi-pointer Approach (RTT = 4)
[Figure: input selectors IS[1..4] and output selectors OS[1..4], each holding a set of four request pointers R[t0..t3] or grant pointers G[t0..t3], together with the VOQ state and the pending request counters.]
Hardware cost:
- (Θ-1) additional pointers at each input/output, each log2(N) bits wide
- N² pending request counters
- Θ-to-1 multiplexers to select the active pointer set
- Selection logic is not duplicated
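
A quick back-of-the-envelope check of this cost for the OSMOSIS parameters (N = 64, Θ = 4); the script is illustrative only:

```python
import math

N, THETA = 64, 4
ptr_bits = math.ceil(math.log2(N))   # each pointer is log2(N) = 6 bits
extra_ptrs = THETA - 1               # extra pointers per input and per output
prc_count = N * N                    # one pending request counter per VOQ
print(f"{extra_ptrs} extra {ptr_bits}-bit pointers per selector, "
      f"{prc_count} pending request counters")
# -> 3 extra 6-bit pointers per selector, 4096 pending request counters
```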

Multiple Iterations
Additional uncertainty: which inputs/outputs have been matched in previous iterations?
1. Inputs should not request outputs that are already taken: wasted requests
2. Outputs should not grant inputs that are already taken: violation of the one-to-one matching property
Because of issue 2, the output selectors must be aware of all grants in previous iterations, including those made by other selectors; therefore all output selectors are implemented in one device (see the sketch below).
- Input selectors use a request flywheel pointer to create request diversity across multiple iterations
- PRC filtering applies only to the first iteration, which can lead to "premature" grants
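
The co-location requirement can be made concrete: a grant decision in iteration k must consult the global set of inputs matched in iterations 1..k-1 by any output selector. A hypothetical sketch, not the FPGA logic:

```python
def grant(o, requesters, matched_in, matched_out, G, N):
    """Grant for output o in a later iteration; matched_in/matched_out
    are *global* state shared by all output selectors, which is why they
    all live in one device."""
    if o in matched_out:
        return None
    for k in range(N):
        i = (G[o] + k) % N
        # The check against matched_in spans grants made by every other
        # output selector; it preserves the one-to-one matching property.
        if i in requesters and i not in matched_in:
            matched_in.add(i)
            matched_out.add(o)
            return i
    return None
```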

Distributed Scheduler Architecture
[Figure: VOQ state feeding input selectors IS[1..4]; each IS communicates with its output selector OS[1..4] over a dedicated control channel. The control channel interfaces sit on separate cards, the allocators on the midplane, and the output selectors drive the switch command channels.]

Performance Characteristics (16 ports)

Optical Switch Controller Module (OSCM)
[Figure: midplane (OSCB; prototype shown) carrying 40 daughter boards (OSCI, top right); board layout (bottom right).]

Thank You!