1 Network Algorithms, Lecture 1: Intro and Principles
George Varghese, UCSD

2

3 Course Goal Network Algorithms: Algorithms for key bottlenecks in a router (lookups, switching, QoS) How to attack new bottlenecks: Changing world -> measurement, security Algorithmics: Models and Principles. System thinking + Hardware + Algorithms

4 Different Backgrounds
Balaji: Stochastic Processes, randomized algorithms, QCN, DCTCP, 2 yr at Cisco (Nuova) Tom: Former Cisco VP and Fellow. Architect of Cisco Catalyst and Nexus Me: Algorithmic background. DRR, IP lookups. 1 year at Cisco (NetSift) All worked on router bottlenecks. Want to reconcile our viewpoints. Fun for you

5 Course Grading
4 Homeworks (20%): posted Weds, due Thurs by 2pm. Course Project (80%): teams of 2 students; ideally an original idea related to the course leading to a paper, or a detailed review of paper(s) on a topic; 1-page abstract due by Mon, Apr 18th; final project presentation + 5-page report. Office hours: Balaji, Tuesday 10-11; George, Wednesday 10-11. TA: Mohammed.

6 Course Outline Lectures 1 to 2: Principles and Models
Lectures 3 to 5: Lookups Lectures 6 to 9: Switching Lectures 10 to 11: QoS Later lectures: measurement, security, data center switching fabrics, congestion control, etc

7 Plan for Rest of Lecture
Part 1: Warm up example of algorithmic thinking Part 2: Ten Principles for Network Algorithms

8 Warm Up: Scenting an Attack
Observation: code buried in, say, a URL shows an unusually high frequency of uncommon characters. Goal: flag such packets for further observation.

9 Strawman Solution Strawman: increment a per-character count and flag if any count exceeds a threshold. Problem: requires a second pass over the count array after the packet.

10 Oliver Asks for Less Idea: relax the specification to alarm if a count exceeds the threshold with respect to the current, not the total, packet length. Problem: corner cases for small lengths.

11 Oliver uses Algorithmic Thinking
Idea: keep track of the maximum relativized count and compare it to the length L at the end. Problem: still 2 reads and 1 write to memory per character; 1 read is free (parallelism, wide memories).

12 Relax Specification: Approximate!
Idea: round thresholds up to powers of 2 so the comparisons use shifts; we have false positives anyway. Problem: counts must be initialized per packet.

13 Lazy initialization Idea: use a generation number per counter and initialize counts lazily. Problem: generation number wrap (scrub on wrap).
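A minimal Python sketch of where the warm-up design ends up, combining the last two ideas: per-character counters initialized lazily via generation numbers, and a threshold rounded to a power of two so the per-character check is a shift. All names and constants are illustrative, not from the lecture.

```python
GEN_MAX = 256        # generation counter wraps; scrub all stamps on wrap
THRESH_SHIFT = 3     # alarm if count / length > 1/2**3; power of 2 -> a shift
MIN_LEN = 16         # guard against the small-length corner cases

class SuspectScanner:
    def __init__(self):
        self.count = [0] * 256   # per-character counters
        self.gen = [-1] * 256    # generation stamp per counter
        self.cur_gen = 0

    def scan(self, packet: bytes) -> bool:
        """True if some character's relative frequency exceeds the threshold."""
        self.cur_gen = (self.cur_gen + 1) % GEN_MAX
        if self.cur_gen == 0:
            self.gen = [-1] * 256          # the scrub on generation wrap
        flagged = False
        for length, byte in enumerate(packet, start=1):
            if self.gen[byte] != self.cur_gen:   # stale counter: lazy init
                self.count[byte] = 0
                self.gen[byte] = self.cur_gen
            self.count[byte] += 1
            # count/length > 1/8  <=>  (count << 3) > length: a shift, no divide
            if length >= MIN_LEN and (self.count[byte] << THRESH_SHIFT) > length:
                flagged = True
        return flagged
```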

14 Part 2: Ten Principles for Network Algorithms

15 Summary P1: Relax specification for efficiency
P2: Utilize degrees of freedom
P3: Shift computation in time
P4: Avoid waste, seen in a holistic view
P5: Add state for efficiency
P6: Exploit hardware parallelism and memories
P7: Pass information (e.g., hints) in interfaces
P8: Use finite-universe methods (bucket sort etc.)
P9: Use algorithmic thinking
P10: Optimize the expected case

16 P1: Relax Specifications
Router output queue scheduling. IntServ spec: serve the smallest packet first, which requires sorting. DRR: throughput-fair, a modified round robin, O(1). [Figure: output queue holding packets of 1500, 800, and 50 bytes]

17 Forms of relaxing specification
Probabilistic: use randomization (Ethernet, RED). Approximate: RED, TCP round-trip weights. Shift computation in space: path MTU discovery, where endpoints calculate the minimum segment size instead of routers fragmenting. [Figure: R1 -> E -> R2 with link MTUs 4000 and 1500: fragment?]

18 P2: Utilize degrees of Freedom
Have you used all the information available (Polya)? TCP: reduce rate on a drop; later, when the ECN bit is set. DCTCP: use the number of ECN bits set in a window to calculate the send rate. [Figure: R1 -> E -> R2, a 10G link feeding a 1G link, passing the ECN bit]
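For concreteness, the DCTCP update as given in the DCTCP paper (a sketch; g = 1/16 is the gain suggested there):

```python
def dctcp_alpha_update(alpha, marked, total, g=1.0 / 16):
    """Once per window: F is the fraction of ECN-marked packets;
    alpha is a moving average of F, i.e. an estimate of congestion extent."""
    F = marked / total
    return (1 - g) * alpha + g * F

def dctcp_cwnd_on_congestion(cwnd, alpha):
    """Cut the window in proportion to congestion, not always by half."""
    return cwnd * (1 - alpha / 2)
```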

19 P2 in IP Lookups Have you used all knobs under your control?
Unibit trie: one bit at a time (looking up prefix P7 = * takes 5 reads). Multibit trie: many bits at a time, 2 reads.

20 P2 in Algorithms (keyword search)
Boyer-Moore: to find "truths" in "We hold these truths", compare the last character first. Noticing degrees of freedom is also a nice heuristic for algorithm design. Last-character-first is so fast because we can precompute that if the text character aligned with the pattern's last position is 'l' and not 's', there is no way "truths" can match there, as there is no 'l' in "truths"; so we can skip whole swaths of text (sublinear). But we do need to precompute some information, which brings us to the next principle.
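A runnable sketch of the Horspool simplification of Boyer-Moore, which is exactly the last-character-first plus precomputed-skip idea:

```python
def horspool_search(text: str, pattern: str) -> int:
    """Boyer-Moore-Horspool: compare the last character first and use a
    precomputed skip table to jump past mismatches. Returns index or -1."""
    m = len(pattern)
    # P3: precompute, for each character, how far we may safely shift.
    skip = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    i = m - 1
    while i < len(text):
        if text[i - m + 1:i + 1] == pattern:
            return i - m + 1
        i += skip.get(text[i], m)   # char not in pattern: skip the whole length
    return -1

print(horspool_search("We hold these truths", "truths"))  # -> 14
```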

21 P3: Shift Computation in Time
Precomputation: slow update, fast lookup. Lazy evaluation: counter initialization. Expense sharing (batching): timing wheels. Systems evolve in time and space. Computers have fab time, boot time, compile time, run time: computation can be done at any of these, trading flexibility for speed. Networks have similar time scales, where compilation corresponds to route computation and run time to packet forwarding. [Figure: prefixes P6 = 10000* and P7 = *, one a precomputed prefix of the other]

22 P4: Avoid Waste
When viewing the system holistically, e.g., avoid extra copies of a packet across a router or a host. [Figure: packet copies]

23 P5: Add State Those who do not learn from history . . .
BIC, QCN: remember the past high rate. The DRR active list is a simple example: round robin across hundreds of queues is wasteful if only 10 are busy, so simply add state, an active list storing the indices of queues with at least 1 packet, and round-robin across only those queues.

24 P6: Leverage Hardware Exploit parallelism (memory and processing)
Leverage different memories: SRAM, DRAM, EDRAM. Data touched rarely is kept in slow memory; data touched often is kept in fast memory (e.g., a large set of counters). Tom's lecture following this will tell you more.

25 P7: Pass information across layers
Do not hide information; pass "hints". Example: the file index returned by open acts as a hint, so later operations need no hash-table lookups. A more complex example: bypassing the kernel in VIA.

26 P8: Finite Universe Techniques
Bitmaps: save space and time via hardware parallelism. Bucket sorting: but how do we skip over empty elements? [Figure: timer queue with entries 11, 12, 13, 14]

27 P8: Finite Universe Techniques
Timer wheel: the cost of skipping empty buckets is shared with incrementing the current time (P2). [Figure: timer wheel with current time = 10 and buckets 11-14]
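A minimal single-level timing-wheel sketch in Python; the bounded horizon and the bucket count are assumptions of this illustration. Each clock tick visits exactly one bucket, so skipping empty buckets costs nothing beyond advancing the clock.

```python
class TimerWheel:
    """Single-level timing wheel (a sketch). Timers may be at most
    size-1 ticks in the future; each tick fires exactly one bucket."""
    def __init__(self, size=256):
        self.size = size
        self.buckets = [[] for _ in range(size)]
        self.now = 0

    def schedule(self, delay, callback):
        assert 0 < delay < self.size, "single-level wheel: bounded horizon"
        self.buckets[(self.now + delay) % self.size].append(callback)

    def tick(self):
        """Advance the clock by one and fire any timers in the new bucket."""
        self.now += 1
        bucket = self.buckets[self.now % self.size]
        for cb in bucket:
            cb()
        bucket.clear()
```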

28 P9: Use Algorithmic Ideas
Not blindly reusing algorithms, but using their ideas. Binary search on hash tables: instead of starting with the longest prefix length and working backwards (W memory accesses, where W is the length of the longest prefix), it would be faster to do binary search on prefix lengths, starting in the middle. This is just an idea, however; to make it work, we need to add state called markers and do lots of precomputation. [Figure: prefix lengths 1, 2, 3, 4, 8 with example prefixes 01* and 101*; where should the search start?]

29 P10: Optimize Expected Case
Design mostly for the worst case, sometimes for the expected case; the optimization should not be caught by testing! Example: instead of maintaining separate hardware threads for cached and uncached traffic, drop the uncached packets. [Figure: cache of D -> M translations in front of a router]

30 Summary P1: Relax specification for efficiency
P2: Utilize degrees of freedom
P3: Shift computation in time
P4: Avoid waste, seen in a holistic view
P5: Add state for efficiency
P6: Exploit hardware parallelism and memories
P7: Pass information (e.g., hints) in interfaces
P8: Use finite-universe methods (bucket sort etc.)
P9: Use algorithmic thinking
P10: Optimize the expected case

31 Summing up Course: router bottlenecks/network algorithms
Not just a recipe book but a way of thinking Will see principles in action in later lectures Next lecture: hardware models by Tom

32 Network Algorithms, Lecture 3: Exact Lookups
George Varghese

33

34 Course Goal Network Algorithms: Algorithms for key bottlenecks in a router (lookups, switching, QoS) How to attack new bottlenecks: Changing world -> measurement, security Algorithmics: Models and Principles. System thinking + Hardware + Algorithms

35 Plan for Rest of Lecture
Part 1: Exact Match Algorithms Part 2: Small Transition to Longest Match

36 Context "'Challenge-and-response' is a formula describing the free play of forces that provokes new departures in individual and social life. An effective challenge stimulates men to creative action" (Arnold Toynbee). Exact match: a query specifies a key K; retrieve the state associated with K. Memory references are crucial. The simplest lookup problem, yet crucial for a bridge. Birthed by 3 challenges.

37 Challenges that led to Bridges
Challenge: Ethernets under fire; routers slow. Response: turn a packet repeater into a learning bridge. [Figure: packet repeater vs. learning bridge]

38 Challenge 2: Wire Speed Without wire speed, packets that should be forwarded can be lost in a stream of packets that should be filtered. Finesse: 2 lookups per port in 51.2 usec (the time of a minimum-size Ethernet packet at 10 Mbps).

39 Attention to Algorithmics
Architecture: 4 ports, cheap DRAM with a cycle time of 100 nsec for packet buffers and lookup memory; bus parallelism, memory bandwidth, page mode. Data copying: Ethernet chips used DMA; packets copied from one port to the other by flipping pointers. Control overhead: interrupt overhead minimized by processor polling, staying in a loop after a packet interrupt. Lookups: used the cautionary questions (next slide); wrote software to verify the lookup bottleneck (Q2).

40 Eight cautionary questions
Q1: Worth improving performance? Yes. Q2: Really a bottleneck? 68000 code timing did not meet the 25.6 usec budget. Q3: Impact on system? Allows wire speed. Q4: Does initial analysis show benefit? DRAM references = log2 8000, about 13, so 1.3 usec. Q5: Worth adding custom hardware? Cheap using a PAL. Q6: Do prototypes confirm the promise? A lab prototype was tested before the product was built. Q7: Sensitive to environment changes? Worst-case design.

41 Challenge 3: Higher Speeds
Scaling via hashing for an FDDI bridge (100 Mbps). Collisions? Use perfect hashing: the hash is the remainder of A(x) * M(x) divided by G(x), with the multiplier A(x) picked randomly until there are no collisions.

42 Improvement 1: Wide Words
Intuition: more choice, d keys per wide memory word (Srinivasan-Varghese). [Figure: H(C) indexes a word holding A, B, C; another word holds D]

43 Improvement 2: Parallelism
Intuition: k independent hash choices (Broder-Karlin). [Figure: key B hashed by H1(B), H2(B), H3(B) into different tables]

44 State of the Art: d-left
Intuition: combine the two, wide words plus d independent choices (d-left: Broder, Mitzenmacher). [Figure: H1(B), H2(B), H3(B) index buckets holding pairs such as X, Y and R, B]
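A d-left sketch in Python (the parameters are illustrative): the table is split into d equal subtables, each key hashes to one bucket per subtable, and insertion goes to the least-loaded bucket with ties broken to the left. In hardware, the d buckets can be probed in parallel.

```python
import random

class DLeftTable:
    def __init__(self, d=3, buckets_per_table=1024, slots=4):
        self.d, self.n, self.slots = d, buckets_per_table, slots
        self.tables = [[[] for _ in range(buckets_per_table)] for _ in range(d)]
        self.seeds = [random.randrange(2**32) for _ in range(d)]

    def _bucket(self, i, key):
        return hash((self.seeds[i], key)) % self.n

    def insert(self, key, value) -> bool:
        candidates = [self.tables[i][self._bucket(i, key)] for i in range(self.d)]
        target = min(candidates, key=len)   # least loaded; min() breaks ties left
        if len(target) >= self.slots:
            return False                    # bucket overflow (rare by design)
        target.append((key, value))
        return True

    def lookup(self, key):
        # All d buckets are probed in parallel in hardware; serially here.
        for i in range(self.d):
            for k, v in self.tables[i][self._bucket(i, key)]:
                if k == key:
                    return v
        return None
```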

45 Binary Search can also work
Use hardware parallelism. Seemingly clever. Easier way to see this?

46 Lessons Challenge-response: bridges were invented to solve a problem (slow multi-protocol routers) that is not present today. Cost-performance is what justifies bridges today in so-called switches, together with a small number of extra features like VLANs. Introduced techniques: broke the wire-speed barrier, spread to routers; ideas like fast lookups, trading update time for lookup speed, and minimal hardware all became classical.

47 Summary P1: Relax specification for efficiency
P2: Utilize degrees of freedom
P3: Shift computation in time
P4: Avoid waste, seen in a holistic view
P5: Add state for efficiency
P6: Exploit hardware parallelism and memories
P7: Pass information (e.g., hints) in interfaces
P8: Use finite-universe methods (bucket sort etc.)
P9: Use algorithmic thinking
P10: Optimize the expected case

48 Network Algorithms, Lecture 4: Longest Matching Prefix Lookups
George Varghese

49

50 Plan for Rest of Lecture
Defining the problem and why it's important; trie-based algorithms: multibit tries, compressed tries; binary search; binary search on hash tables.

51 Longest Matching Prefix
Given N prefixes K_i of up to W bits, find the longest match with an input K of W bits. 3 prefix notations: slash (/31), mask, and wildcard (1*). N = 1M (ISPs) or as small as 5000 (enterprise). W can be 32 (IPv4), 64 (multicast), 128 (IPv6). For IPv4, CIDR makes all prefix lengths from 8 to 28 common, with density at 16 and 24.

52 Why Longest Match Much harder than exact match, so why is this dumped on routers? It is a form of compression: instead of a billion routes, around 500K prefixes; core routers need only a few routes for all Stanford stations. Really accelerated by the running out of Class B addresses and CIDR.
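To make the problem concrete, a unibit-trie sketch in Python (bit strings stand in for addresses, which is an assumption of this illustration); it costs one memory reference per address bit, which is exactly what the multibit and binary-search schemes below attack.

```python
class TrieNode:
    __slots__ = ("children", "prefix_value")
    def __init__(self):
        self.children = [None, None]
        self.prefix_value = None   # next hop if a prefix ends here

def insert(root, bits: str, next_hop):
    """Insert a prefix given as a bit string, e.g. '10*' -> '10'."""
    node = root
    for b in bits:
        i = int(b)
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.prefix_value = next_hop

def longest_match(root, addr_bits: str):
    """Walk one bit at a time, remembering the last prefix seen."""
    node, best = root, root.prefix_value
    for b in addr_bits:
        node = node.children[int(b)]
        if node is None:
            break
        if node.prefix_value is not None:
            best = node.prefix_value
    return best
```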

53 Sample Database

54

55 Skip versus Path Compression
Removing 1-way branches (path compression) ensures that the number of trie nodes is at most twice the number of prefixes. Skip counts (Berkeley code, Juniper patent) require exact match and backtracking: bad!

56 Multibit Tries

57 Optimal Expanded Tries
Pick stride s for root and solve recursively Srinivasan Varghese

58 Degermark et al Leaf pushing: entries that have both a pointer and a prefix have their prefixes pushed down to the leaves.

59

60 Why Compression is Effective
The number of breakpoints in the function (non-zero elements) is at most twice the number of prefixes.

61 Eatherton-Dittia-Varghese
Lulea uses large arrays; Tree Bitmap uses small arrays and counts bits in hardware. No leaf pushing; 2 bitmaps per node. Used in the Cisco CRS-1.

62 Binary Search Natural idea: reduce prefix matching to exact match by padding prefixes with 0's. Problem: addresses that map to different prefixes can end up in the same range of the table.

63 Modified Binary Search
Solution: encode a prefix A as a range by inserting two keys, A000 (low end) and AFFF (high end). Now each range between consecutive keys maps to a unique prefix that can be precomputed.

64 Why this works Any range corresponds to earliest L not followed by H. Precompute with a stack.

65 Modified Search Table Need to handle equality (=) separately from the case where the key falls within a region (>).

66 Transition to IPv6 So far: schemes with either log N or W/C memory references. What about IPv6? We describe a scheme (Waldvogel-Varghese-Turner) that takes O(log W) references: log 128 = 7 for IPv6. It uses binary search on prefix lengths, not on keys.

67 Why Markers are Needed

68 Why backtracking can occur
Markers announce "possibly better information to the right". This can lead to a wild goose chase.

69 Avoid backtracking by precomputing the longest match (bmp) of each marker.
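A sketch of the Waldvogel-Varghese-Turner idea in Python: one hash table per prefix length, markers inserted at shorter lengths, and each marker carrying its precomputed best matching prefix (bmp) so the search never backtracks. For simplicity this sketch places markers at every shorter length rather than only at the lengths binary search actually probes.

```python
def build_tables(prefixes):
    """prefixes: dict of bit string -> next hop. Returns the sorted list of
    prefix lengths and one hash table per length."""
    lengths = sorted({len(p) for p in prefixes})
    tables = {l: {} for l in lengths}
    for p, hop in prefixes.items():
        tables[len(p)][p] = ("prefix", hop)
    def lpm_linear(bits):   # slow reference LPM, used only at build time (P3)
        for l in range(len(bits), 0, -1):
            if bits[:l] in prefixes:
                return prefixes[bits[:l]]
        return None
    for p in list(prefixes):
        for l in lengths:            # markers at shorter lengths, with bmp
            if l < len(p) and p[:l] not in tables[l]:
                tables[l][p[:l]] = ("marker", lpm_linear(p[:l]))
    return lengths, tables

def lookup(addr_bits, lengths, tables):
    """Binary search on prefix lengths: O(log W) hash probes, no backtracking."""
    best, lo, hi = None, 0, len(lengths) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        l = lengths[mid]
        entry = tables[l].get(addr_bits[:l])
        if entry is None:
            hi = mid - 1        # no marker: nothing longer can match
        else:
            kind, hop = entry
            if hop is not None:
                best = hop      # marker's precomputed bmp (or the prefix itself)
            lo = mid + 1        # "possibly better information to the right"
    return best
```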

70 2011 Conclusions Fast lookups require fast memory such as SRAM -> compression -> the Eatherton scheme. Can also cheat by using several DRAM banks in parallel with replication. EDRAM -> binary search with high radix, as in B-trees. IPv6 is still a headache: possibly binary search on hash tables. For enterprises and reasonable-size databases, ternary CAMs are the way to go; simpler too.

71 Principles Used P1: Relax specification (fast lookup, slow insert)
P2: Utilize degrees of freedom (strides in tries)
P3: Shift computation in time (expansion)
P4: Avoid waste (variable stride)
P5: Add state for efficiency (add markers)
P6: Hardware parallelism (pipelined tries, CAM)
P8: Finite universe methods (Lulea bitmaps)
P9: Use algorithmic thinking (binary search)

72 Packet Classification
George Varghese

73 Original Motivation: Firewalls
Firewalls use packet filtering to block, say, ssh, and force access to web and mail via proxies. Still part of "defense in depth" today. Need fast, wire-speed packet filtering.

74 Simplified Internet Message Format
Dst and Src IP addresses (like telephone numbers). Dst and Src ports (like extensions) indicate the application: for instance, port 80 = Web, port 25 = Mail.

75 Sample Firewall Database

76 Beyond firewalls today

77 Service differentiation via classification
Every router in the world: if a packet is addressed to the router, packet classification is done before LPM. Extract 5 (or more) fields; if there is a match, treat the packet as specified by the highest-priority matching rule. Classification can be used to drop packets, give some applications more QoS, give different routes to some apps, etc. Standard solution: CAMs, but let's look at some algorithmic solutions, some of which are used. Routers often support 1000s of rules, so linear search (despite parallel logic) is too slow.

78 Plan of Attack First, 2-field (2D) packet classification, useful for measurement and multicast. Then we introduce a nice geometric model and move on to general K-field classification.

79 2D (two field) example

80 First attempt: Set Pruning Tries
Each destination prefix D points to a trie containing all source prefixes in rules whose destination field is a prefix of D. O(N^2) memory!

81 Worst-case example for storage

82 Less memory via backtracking
Source tries now contain only the sources of rules whose destination field is exactly D, so lookup must backtrack through D's destination prefixes: O(W^2) time.

83 Grid of Tries (Srinivasan-Varghese)
Use pre-computed switch pointers (dashed line). No backtracking and linear space.

84 Geometric Model (Lakshman-Stiliadis)
Example: F1 = (0*, 10*). Each field is a dimension in geometric space

85 Beyond 2D Bad news: a lower bound from computational geometry, O((W^k)/k!) time for linear storage. Good news (Gupta-McKeown): the number of disjoint classification regions in real databases is small. For example, theoretically in 2D we can have N^2 disjoint regions, but in practice we see O(N). Can we exploit this observation for speed with small storage? Yes, but not provably: heuristics.

86 Divide and Conquer? Natural to try LPM in each field separately and combine. Concatenation does not work!

87 Aside: Range to Prefix Matches
Real classifiers use ranges (e.g., < 1024 for well-known ports). Theorem: any range can be written as the union of a logarithmic number of prefix ranges. Example: [8,12] in 5 bits: 01* does not work (it covers [8,15]), but 0100*, 0101*, and 01100 do! This theorem is useful for CAM vendors as well, since they only support prefix ranges. Recall hardware!
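A sketch of the standard decomposition in Python; the slide's [8,12] example is worked in the final line.

```python
def range_to_prefixes(lo, hi, width):
    """Decompose the integer range [lo, hi] into wildcard prefixes.
    A W-bit range needs at most 2W - 2 prefixes."""
    out = []
    while lo <= hi:
        size = lo & -lo if lo else 1 << width   # largest block aligned at lo
        while size > hi - lo + 1:               # shrink until it fits the range
            size >>= 1
        bits = size.bit_length() - 1            # number of wildcarded low bits
        head = format(lo >> bits, f"0{width - bits}b") if bits < width else ""
        out.append(head + "*" * bits)
        lo += size
    return out

print(range_to_prefixes(8, 12, 5))  # -> ['010**', '01100']
```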

88 Bit Vector (Lakshman-Stiliadis)
Store with each field value M an N-bit vector whose bit i is set if M matches rule i in that field. AND the per-field vectors and find the first set bit (priority encoder).
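A sketch of the bit-vector scheme in Python. For brevity the per-field lookup is shown as an exact-match dictionary, an assumption of this illustration; the real scheme does an LPM or range lookup per field. Bit 0 is taken to be the highest-priority rule.

```python
def classify_bitvector(packet_fields, field_tables, num_rules):
    """field_tables[f] maps a field value to an N-bit integer whose bit i is
    set iff rule i matches on field f. Returns the best rule index or None."""
    result = (1 << num_rules) - 1                 # all rules still candidates
    for f, value in enumerate(packet_fields):
        result &= field_tables[f].get(value, 0)   # AND the per-field vectors
        if not result:
            return None
    # priority encoder: lowest-numbered set bit = highest-priority match
    return (result & -result).bit_length() - 1
```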

89 Why is Lucent Fast? Since the bit vectors are N bits long, from a CS perspective it is O(N), as bad as linear search. But it really reduces constants by using wide memories: Nk/W memory accesses, where W is the memory width. Recall that W = 1000 is feasible, so 1000-rule tables take a few accesses, many of which are parallel. Moral: know hardware complexity measures!

90 Cross-Products (Srinivasan-Varghese)
Theorem: the best matching rule for the crossproduct is the best matching rule for the packet.

91 Equivalenced Crossproducts (Gupta-McKeown): aka RFC
Idea: instead of "multiplying" in one fell swoop, crossproduct 2 fields at a time and equivalence the results at each step. Used in the Cisco GSR. Example: 16 crossproducts but only 8 equivalence classes!

92 HiCuts (Gupta-McKeown)
Different idea: a decision tree in geometric space to "zero in" on the narrowest matching region.

93 State of Art Woo's algorithm: like HiCuts but uses bit testing instead of range testing. HyperCuts (Singh): goes beyond Woo to test multiple bits at a time using arrays; used in the Cisco CRS. The space usage of HyperCuts/HiCuts can be reduced using 2 parallel trees (Brian Alleyne). EffiCuts (Purdue, SIGCOMM 2010) is a publicly available implementation of the best ideas so far. CAMs are still easier, though they need algorithmic tricks to reduce power.

94 Principles Used P1: Relax specification (heuristics beyond 2D)
P2: Degrees of freedom (HiCuts -> HyperCuts)
P3: Shift computation in time (grid-of-tries)
P4: Avoid waste (crossproducts -> RFC)
P5: Add state for efficiency (switch pointers)
P6: Hardware parallelism (bit vector)
P8: Finite universe methods (bit vector)
P9: Use algorithmic thinking (decision trees)

95 Students like you . . .
Pankaj: Stanford -> Sahasra -> NetLogic -> Twitter. Cheenu: UCSD -> Sahasra -> NetLogic -> Google. Sumeet: UCSD -> NetSift -> Cisco. SO DO A GREAT PROJECT!

96 EE384M Network Algorithms Spring 2011 A history of big routers
(slides from Nick McKeown’s EE 384X presentation) Balaji Prabhakar

97 Outline What is an Internet router? What limits performance: memory access time.
The early days: modified computers, programmable against uncertainty.
The middle years: specialized for performance; needed new architectures, theory, and practice.
So how did we do? The simple model is breaking down.

98 Definitions N = number of linecards, typically 8-32 per chassis.
R = line rate: 1Gb/s, 2.5Gb/s, 10Gb/s, 40Gb/s, 100Gb/s. Capacity of router = N x R.

99 What a Big Router Looks Like
Cisco GSR 12816: capacity 640Gb/s, power 5kW, 19" rack, 6ft tall, 2.5ft deep. Juniper T640: capacity 320Gb/s, power 3kW, 19" rack, 3ft tall, 2ft deep.

100 What Multirack Routers Looks Like
Cisco CRS-1, Juniper TX Matrix.

101 Lookup internet address
Check and update checksum. Check and update age (TTL).

102 Barebones Router Router Control and Management

103 Barebones Router

104 Barebones Router

105 Bottlenecks Memory, memory, …

106 Outline What is an Internet router? What limits performance: memory access time.
The early days: modified computers, programmable against uncertainty.
The middle years: specialized for performance; needed new architectures, theory, and practice.
So how did we do? The simple model is breaking down.

107 Early days: Modified Computer
Must run at rate N x R. [Figure: N linecards at rate R; bottlenecks marked]

108 2nd Generation Router

109 Early days: Modified Computer
Function was more important than speed. 1993 (the WWW) changed everything: we badly needed some new architecture, some theory, and some higher performance.

110 3rd Generation Router: Switch
N x R

111

112 [Figure: switched fabric, each linecard at 1 x R, with an arbiter]

113 [Figure: arbiters]

114 4th Generation Router Multirack; optics inside
[Figure: linecard racks linked to a switch rack by optical links 100s of metres long]

115 More 4th Generation Routers
Alcatel 7670 RSP, Juniper TX Matrix, Avici TSR, Cisco CRS-1.

116 Example of Theory There’s something special about “2”

117 Case 1: Placing calls A crosspoint switch supports all permutations, so it is "non-blocking". But it needs N^2 crosspoints. [Figure: crosspoint switch routing a permutation]

118 Case 1: Placing Calls Uncertainty costs
If I give you the whole permutation, you can route it. If I give you entries one at a time, you can't. Clos (1950s): but if you make the switch run 2 times faster, you can route calls one at a time.

119 Case 2: Mimicking N x R

120 Case 2: Mimicking 1 x R

121 Are they equivalent? No. [Figure: output-queued memory at rate NR vs. input queues at rate R]

122 Case 2: Mimicking [Figure: fabric running at ? x R plus a scheduling algorithm]

123 Now are they equivalent?
Yes, if it runs 2 times faster. [Figure: fabric at 2R plus a scheduling algorithm vs. memory at NR]

124 Case 3: Are they equivalent?
Yes, if it runs 2 times faster.

125 Case 4: Routing packets with uncertainty
(1) If you know the rates, you can find a sequence of permutations that serves them (a Birkhoff-von Neumann decomposition of the rate matrix). But we don't know the rates; they are always changing.

126 Case 4: Routing packets with uncertainty
(2) If you choose the permutations one at a time, and you can spend as long as you want choosing, then you can support any pattern of rates. (3) But if you have to make decisions quickly, one at a time, then the switch has to run 2 times faster.

127 Case 5: Load-balancing
Load-balancing to support all rate matrices requires the network to run 2 times faster, e.g., the VL2 (Valiant load balancing) architecture for data centers.

128 Summary of switching theory
Balaji Prabhakar, Stanford University

129 Outline of Notes Focus on four types of architecture
Output-queued switches (the ideal architecture; not much to say). Input-queued crossbars. Combined input- and output-queued switches. Buffered crossbars (mentioned briefly).

130 A Detailed Sketch of a Router
[Figure: line cards, each with a lookup engine, packet buffers and a network processor, feeding an interconnection fabric; an output scheduler sits at the outputs]

131 Things to Remember/Look for
Switch design is mainly influenced by: cost; heat dissipation.
Key technological factors affecting cost and heat: memory bandwidth (not the size of memory, but its speed); complexity of algorithms; number of off-chip operations (this affects speed).
Winning algorithms make the right trade-offs and are very simple.
In hardware architecture design, switch/router design seems an exception in that theory has made a surprising amount of difference to the practice.

132 Evolution of Switches In the beginning, there were only telephone switches; data packet/cell switches came in with ATM. Almost all original designs were either of the shared-memory or the output-queued architecture. These architectures were difficult to scale to high bandwidths because of their very high memory bandwidth requirement. Input-queued switches require a low memory bandwidth, and hence were seen as very scalable.

133 Evolution of Switches 1987: a very influential paper in switching, by Karol et al. IQ switches suffered from the head-of-line blocking phenomenon, which limits their throughput to 58%; this very poor performance nearly killed the IQ architecture. Switching theory bifurcates into the IQ and CIOQ research threads: the negative result of Karol et al. generated much interest in the combined input- and output-queued (CIOQ) architecture in the following years. We will return to CIOQ switches later; for now we look at developments in the IQ architecture.

134 Input-queued Switches

135 Evolution of IQ Switches
1993: appearance of the paper by Anderson et al. It showed that head-of-line blocking is easily overcome by the use of virtual output queues (VOQs), so higher throughputs are possible; however, VOQs require the switch fabric to be "scheduled" (a key trade-off: a scheduling problem in exchange for memory bandwidth). It also showed that switch scheduling is equivalent to bipartite graph matching, and introduced the Parallel Iterative Matching algorithm.

136 Evolution of IQ Switches
1995: Nick McKeown develops the iSLIP algorithm in his thesis; used, in 1996, in Cisco Systems' flagship GSR family of routers. 1996: influential paper by McKeown, Walrand and Anantharam; showed that Maximum Size Matching does not give 100% throughput, while Maximum Weight Matching does. 1992: paper by Tassiulas and Ephremides; showed that Maximum Weight Matching gives 100% throughput, plus many other interesting theoretical results. 1998: Tassiulas introduces a randomized version of the MWM algorithm; this simple algorithm gives 100% throughput, but its delay performance was very poor. 2000: Giaccone, Prabhakar and Shah introduce other randomized algorithms which give 100% throughput with delay very nearly equal to that of the MWM algorithm.

137 Performance Analysis of IQ Switches
Analyzing throughput. Bernoulli IID input processes: Lyapunov analysis of the Markov chain corresponding to the queue-size process (all papers mentioned previously). SLLN input processes: fluid models, introduced by Dai and Prabhakar. Adversarial input processes: analyzed by Andrews and Zhang. Analyzing delay performance. Bounds from Lyapunov analysis: Leonardi et al, Kopikare and Shah. Heavy-traffic analysis: Stolyar analyzes the MWM algorithm under heavy traffic; Shah and Wischik build on this and analyze MWM algorithms with different queue weights. See talks by Shah and Williams.

138 Combined Input- and Output-queued Switches

139 CIOQ Switches
Recall the negative result on IQ switches in the paper by Karol et al. It started a lot of work on CIOQ switches; the aim was to get the performance of OQ switches at very near the cost of IQ switches. A number of heuristic algorithms, simulations and special-case analyses showed that, with a speedup of about 4, a CIOQ switch could approach the performance of an OQ switch. IQ: speedup = 1, inexpensive, poor performance. CIOQ: speedup = 4 or 5?, inexpensive, great performance. OQ: speedup = N, expensive, great performance.

140 CIOQ Switches Prabhakar and McKeown (1999)
Prove that a CIOQ switch with a speedup of 4 exactly emulates an OQ switch; i.e., there does not exist an input pattern of packets that can distinguish the two switches. They introduced an algorithm called MUCF, which is of the stable marriage type. This result was later improved to a speedup of 2 by Chuang, Goel, McKeown and P. Related work is due to Charny et al and Krishna et al. Iyer, Zhang and McKeown (2002?) generalize the above to switches with a single stage of buffers, thereby making a theoretical analysis of the Juniper router architecture (which has a shared-memory architecture). Dai and Prabhakar (2000) and Leonardi et al (2000) show that any maximal matching algorithm delivers 100% throughput at a speedup of 2. This result has a lot of significance for practice because (essentially) all commercial switches employ a speedup close to 2 and (truncated) maximal matching algorithms; so it validated a popular practice.

141 Buffered Crossbars
This type of fabric is very attractive because it completely decouples the inputs from the outputs, and it can handle variable-length packets in a natural way. It sits in some hot-selling networking products, e.g. Cisco's Catalyst 6000 switch. Very ripe for theoretical study.

142 Scheduling algorithms for CIOQ switches
Balaji Prabhakar

143 Outline of lecture notes
We have seen a brief overview of switches and routers: practice and theory, the commonalities and the divergence; the evolution of switch architectures. The CIOQ switch architecture: overview; current practice (separate fabric and output schedulers); stable marriage algorithms for integrating the two schedulers; why they were unsuccessful; why they are interesting again…

144 A Detailed Sketch of a Router
[Figure: as before, line cards with lookup engines, packet buffers and network processors around an interconnection fabric, but now with both a fabric scheduler and an output scheduler]

145 Note Typically, there are two interlocking schedulers
The fabric scheduler and the output scheduler; speedup is the "grease" that smoothes their interaction.

146 What is speedup?
A switch with speedup S: in each time slot, at most 1 packet arrives at (departs from) each input (output), and at most S packets are switched from (to) each input (output). A crossbar fabric with speedup 1 is an input-queued switch (memory bandwidth = 2); speedup N gives an output-queued switch (memory bandwidth = N + 1); in general, a fabric with speedup S needs memory bandwidth S + 1.

147 CIOQ Switches
Probabilistic analyses (assume traffic models): Bruzzi and Patavina '90, Chen and Stern '91, Iliadis and Denzel '93, Lin and Sylvester '93, Chang, Paulraj and Kailath '94. Numerical methods (use simulated or actual traffic): Murata, Kubota, Miyahara '89, Lee and Li '91, Goli and Kumar '92, Lee and Liu '94, Bianchini and Kim '95, etc. These showed that switches which use a speedup of between 2 and 5 achieve the same mean delay and throughput as an output-queued switch (whose speedup is N). IQ: speedup = 1, inexpensive, poor performance. CIOQ: speedup = 2-5?, inexpensive, good performance. OQ: speedup = N, expensive, great performance.

148 Our approach
Arbitrary inputs and emulation of an OQ switch: can we say something about delay irrespective of traffic statistics? Competitive analysis: the idea is to compete with an OQ switch. The setup: under arbitrary, but identical, inputs (packet-by-packet), is it possible to replace an OQ switch by a CIOQ switch and schedule the CIOQ switch so that the outputs are identical, packet-by-packet? If yes, what is the scheduling algorithm?

149 More specifically Consider an N × N switch with (integer) speedup S > 1. Apply the same inputs, cell-by-cell, to both switches. We will assume that the OQ switch sends out packets in FIFO order, and we'll see if the CIOQ switch can match cells on the output side.

150 An algorithm

151 Key concept: Port threads

152 Theorem (P and McKeown, ‘97)

153 Subsequent work Interestingly (and surprisingly), the MUCF algorithm doesn't give OQ emulation at speedup 2; counterexample due to S-T. Chuang. Theorem (Chuang, Goel, McKeown and P, '98): an algorithm called Critical Cells First (CCF) achieves OQ emulation at a speedup of 2 - 1/N; moreover, this is necessary and sufficient. The output scheduling policy can be any "monotone scheduling policy", e.g. strict priority, WFQ, LIFO. Charny et al ('98): showed a speedup of 4 is sufficient for monotone scheduling policies with leaky-bucket constrained inputs. Krishna et al ('98): produced an algorithm that is work-conserving, like an OQ switch. However, none of these algorithms were "implementable".

154 Implementable algorithms
In a core router, switching decisions need to be made in 40ns; therefore, algorithms have to be very simple. The elaborate information exchange required by the previous algorithms "did them in". What is implemented: the Request-Grant-Accept (RGA) type of algorithm. In each iteration of the matching process, inputs send 1-bit requests to outputs; an output grants to one of the requesting inputs; an input which receives multiple grants accepts one of the outputs. All of these are 1-bit communications. The CIOQ algorithms, while being of the iterative type, lacked the RGA capability: the information exchanged was "global", e.g. time of arrival, flow id, … So this work seemed fundamentally limited.

155 A new perspective The output scheduling algorithms commercially implemented are "port-level fair": an output link's bandwidth is partitioned at the level of input ports. A policy is "port-level fair" if at any given time the output has a complete ordering of all the inputs for service, based only on the occupancy of the VIQs. FIFO is not port-level fair, because an output can only determine the departure time of a packet given its arrival time, not just based on which input it arrived at. Flow-level WFQ is also, clearly, not port-level fair.

156 Remarks Note that FIFO is the easiest policy for an OQ switch to implement, so port-level fair algorithms should be harder… Somewhat surprisingly, in a CIOQ switch this is inverted, as we will see. A port-level fair algorithm is popular in practice because a network operator can guarantee bandwidth at that level without knowing how many flows there are; individual flows at an input then get bandwidth guarantees from the input's share (which is known at the input) of the output bandwidth. A 2-step process.

157 The algorithm: FLGS We now describe the Fully Local Gale-Shapley (FLGS) algorithm. It is a "stable marriage" algorithm between inputs and outputs. The ranking list at an output at any time is defined by the port-level fair policy. Ranking list at an input: the VOQs are divided into two groups, empty and non-empty; all empty VOQs have rank "infinity"; when an empty VOQ has an arrival, it becomes ranked 1, and it remains at this rank until another empty VOQ receives an arrival or it becomes empty, whichever happens first. NOTE: the relative ranks of two non-empty VOQs never change while they remain non-empty.

158 The algorithm: FLGS Theorem (Firoozshahian, Manshadi, Goel and P, '06): the Fully Local Gale-Shapley (FLGS) algorithm enables the emulation of an OQ switch employing any port-level fair output scheduler at a speedup of 2. The algorithm seems counter-intuitive because the ranking at the inputs seems "dumb", too much so. But the algorithm works, and it was nearly known to us in 1997 (in the Chuang et al paper); however, we weren't looking for something so simple at that time!

159 At smaller speedups… The theorem guarantees emulation of an OQ switch at speedup 2. Using simulations, we can see how the algorithm performs relative to an OQ switch at speedups less than 2; here the comparison is with respect to average packet delays, not packet-by-packet. [Figure: 4 classes with WFQ and strict priority at a load of 99.9%; curves for the high-weight and low-weight WFQ classes]

160 Concluding remark So, why did it take “10 years?”
We became aware of "port-level fair" policies only recently. The implementors never really understood our original algorithms; they thought (correctly) that they were too complicated and, perhaps, dismissed the whole approach. The big "take away" for me: try to understand implementors and the process of implementation; this could have saved us a lot of time. The implementors (especially the good ones) aren't really looking at the theoretical literature.

161 Randomized switch scheduling algorithms
Balaji Prabhakar, Departments of EE and CS, Stanford University

162 Randomized algorithms
Randomization is a method that can be used to simplify implementation. The main idea: base decisions upon a small, randomly chosen sample of the state/input, instead of the complete state/input. Randomized algorithms are also robust to adversarial attacks (decisions depend on chance events).

163 An illustrative example
Find the youngest person in a population of 1 billion. Deterministic algorithm: linear search has a complexity of 1 billion. A randomized version: finding the youngest of 30 randomly chosen people has a complexity of 30. Performance: linear search will find the absolute youngest person (rank = 1); if R is the person found by the randomized algorithm, we can say P(R is within the youngest fraction f of the population) = 1 - (1 - f)^30. Thus the performance of the randomized algorithm is good with high probability.

164 Randomizing iterative schemes
Often, we want to perform some operation iteratively. Example: find the youngest person each year. Say in 2007 you choose 30 people at random and store the identity of the youngest in memory; in 2008 you choose 29 new people at random and let R be the youngest among these 29 plus the stored one, 30 people in total.

165 Randomized approximation to the Max Wt Matching algorithm joint work with Paolo Giaccone and Devavrat Shah

166 Notation and definitions
Arrivals: A_ij(t), i.i.d. Bernoulli, E(A_ij(t)) = λ_ij. Q(t) = [Q_ij(t)] are the backlogs at time t. Scheduling problem: given Q(t), determine a matching S(t) of inputs and outputs so as to maximize throughput and minimize backlogs. Queue dynamics: Q(t+1) = [Q(t) + A(t) - S(t)]^+.

167 Useful performance metrics
Throughput: an algorithm is stable (or delivers 100% throughput) if, for any admissible arrival process, the average backlog is bounded, i.e. limsup_{t→∞} E[Σ_ij Q_ij(t)] < ∞ (equivalent to positive recurrence of Q(t)). Also: minimize average backlogs or, equivalently, packet delays.

168 Scheduling: Bipartite graph matching
[Figure: VOQ backlogs (19, 3, 4, 21, 1, 18, 7) as edge weights of a bipartite graph; a schedule is a matching]

169 Scheduling algorithms
[Figure: same example backlogs] Max Weight Matching: stable (Tassiulas-Ephremides 92, McKeown et al. 96, Dai-Prabhakar 00). Max Size Matching: not stable (McKeown-Anantharam-Walrand 96). Practical maximal matchings: not stable.

170 The Maximum Weight Matching Algorithm
MWM performance: throughput stable (Tassiulas-Ephremides 92; McKeown et al 96; Dai-Prabhakar 00); backlogs very low on average (Leonardi et al 01; Shah-Kopikare 02). MWM implementation: cubic worst-case complexity (approx. 27,000 iterations for a 30-port switch); MWM algorithms involve backtracking, i.e. edges laid down in one iteration may be removed in a subsequent iteration, so the algorithm is not amenable to pipelining.

171 Switch algorithms
[Figure: spectrum of algorithms on the example backlogs. Max Weight Matching: stable and low backlogs. Maximal matching and Max Size Matching: not stable. One direction gives better performance; the other, easier implementation]

172 Randomized approximation to MWM
Consider the following randomized approximation: at every time, sample d matchings independently and uniformly, and use the heaviest of these d matchings to schedule packets. Ideally we would like to use a small value of d. However… Theorem: even with d = N, this algorithm is not stable; in fact, when d = N, the throughput is at most 1 - 1/e ≈ 63% (Giaccone-Prabhakar-Shah 02).

173 Proof Let E_ij be the edge connecting input i to output j. A uniformly random matching contains E_ij with probability 1/N, so P(E_ij is in at least one of d = N independent matchings) = 1 - (1 - 1/N)^N ≈ 1 - 1/e. Since the edge E_ij can be served only if it is chosen by at least one of the d matchings, it follows that the throughput is at most 63%.

174 Tassiulas’ algorithm Previous matching S(t-1) Random Matching R(t)
Next time MAX Current matching S(t)

175 Tassiulas’ algorithm MAX S(t-1) R(t) W(R(t))=150 W(S(t-1))=160 S(t) 10
40 30 MAX 10 70 10 60 20 S(t-1) R(t) W(S(t-1))=160 W(R(t))=150 S(t)
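A minimal Python sketch of one slot of Tassiulas' scheme (matchings are represented as permutations, an assumption of this illustration):

```python
import random

def weight(S, Q):
    """Weight of matching S (S[i] = output matched to input i) under backlogs Q."""
    return sum(Q[i][S[i]] for i in range(len(S)))

def tassiulas_step(S_prev, Q):
    """Draw one uniformly random matching and keep whichever of it and the
    previous matching is heavier; linear work per slot, yet 100% throughput."""
    R = list(range(len(Q)))
    random.shuffle(R)
    return R if weight(R, Q) > weight(S_prev, Q) else list(S_prev)
```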

176 Performance of Tassiulas’ algorithm
Theorem (Tassiulas 98): The above scheme is stable under any admissible Bernoulli IID inputs.

177 Backlogs under Tassiulas’ algorithm

178 Reducing backlogs: the Merge operation
[Figure: edge weights of S(t-1) (total 160) and R(t) (total 150); comparing cycle by cycle, 30 v/s 120 and 130 v/s 30]

179 Reducing backlogs: the Merge operation
[Figure: merging S(t-1) (weight 160) and R(t) (weight 150) cycle by cycle yields S(t) with W(S(t)) = 250]
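A sketch of Merge in Python: overlay the two matchings, walk each alternating cycle, and keep the heavier matching's edges within each cycle, so the result is at least as heavy as either input. Matchings are again represented as permutations.

```python
def merge(S1, S2, Q):
    """Merge two matchings (S[i] = output for input i) cycle by cycle,
    keeping whichever matching's edges weigh more within each cycle.
    Guarantees W(result) >= max(W(S1), W(S2))."""
    n = len(Q)
    result, seen = [None] * n, [False] * n
    inv2 = {S2[i]: i for i in range(n)}        # output -> input under S2
    for start in range(n):
        if seen[start]:
            continue
        cycle, i = [], start
        while not seen[i]:                     # walk the alternating cycle
            seen[i] = True
            cycle.append(i)
            i = inv2[S1[i]]
        w1 = sum(Q[i][S1[i]] for i in cycle)
        w2 = sum(Q[i][S2[i]] for i in cycle)
        src = S1 if w1 >= w2 else S2           # heavier side wins this cycle
        for i in cycle:
            result[i] = src[i]
    return result
```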

180 Performance of Merge algorithm
Theorem (GPS): The Merge scheme is stable under any admissible Bernoulli IID inputs.

181 Merge v/s Max

182 Use arrival information: Serena
[Figure: the matching S(t-1) with W(S(t-1)) = 209, and the arrival graph]

183 Use arrival information: Serena
[Figure: the arrival graph overlaid on S(t-1), W(S(t-1)) = 209]

184 Use arrival information: Serena
[Figure: merging S(t-1) (weight 209) with the weighted arrival graph (weight 121) yields S(t) with W(S(t)) = 243]

185 Performance of Serena algorithm
Theorem (GPS): The Serena algorithm is stable under any admissible Bernoulli IID inputs.

186 Backlogs under Serena

