Network Processor Algorithms: Design and Analysis
Stochastic Networks Conference, Montreal, July 22, 2004
Balaji Prabhakar, Stanford University

2 Overview
Network Processors
– What are they?
– Why are they interesting to industry and to researchers?
SIFT: a simple algorithm for identifying large flows
– The algorithm and its uses
Traffic statistics counters
– The basic problem and algorithms
Sharing processors and buffers
– A cost/benefit analysis

3 IP Routers
[Figure: two router chassis. Cisco GSR: 6 ft x 19 in x 2 ft, capacity 160 Gb/s, power 4.2 kW. Juniper M160: 3 ft x 19 in x 2.5 ft, capacity 80 Gb/s, power 2.6 kW. Capacity is the sum of the rates of the line cards.]

4 A Detailed Sketch
[Figure: line cards, each with a network processor, a lookup engine, and packet buffers, connected through an interconnection fabric to the outputs via a switch output scheduler.]

5 Network Processors
Network processors are an increasingly important component of IP routers
They perform a number of tasks (essentially everything except switching and route lookup):
– Buffer management
– Congestion control
– Output scheduling
– Traffic statistics counters
– Security
– …
They are programmable, hence add great flexibility to a router’s functionality

6 Network Processors
But, because they operate under severe constraints
– very high line rates
– heat constraints
the algorithms that they can support should be lightweight
They have become very attractive to industry
They give rise to some interesting algorithmic and performance-analytic questions

7 Rest Of The Talk
SIFT: a simple algorithm for identifying large flows
– The algorithm and its uses (with Arpita Ghosh and Costas Psounis)
Traffic statistics counters
– The basic problem and algorithms (with Sundar Iyer, Nick McKeown and Devavrat Shah)
Sharing processors and buffers
– A cost/benefit analysis (with Vivek Farias and Ciamac Moallemi)

8 SIFT: Motivation
Current egress buffers on router line cards serve packets in a FIFO manner
But giving the packets of short flows a higher priority, e.g. using the SRPT (Shortest Remaining Processing Time) policy
– reduces average flow delay
– given the heavy-tailed nature of the Internet flow-size distribution, the reduction in delay can be huge
[Figure: the same egress buffer contents served in FIFO order vs. SRPT order.]

9 But …
SRPT is unimplementable
– the router needs to know residual flow sizes for all enqueued flows: virtually impossible to implement
Other preemptive schemes like SFF (shortest flow first) or LAS (least attained service) are likewise too complicated to implement
This has led researchers to consider tagging flows at the edge, where the number of distinct flows is much smaller
– but this requires a different design of edge and core routers
– more importantly, it needs extra space in IP packet headers to signal flow size
Is something simpler possible?

10 SIFT: A Randomized Algorithm
Flip a coin with bias p (= 0.01, say) for heads on each arriving packet, independently from packet to packet
A flow is “sampled” if one of its packets has a head on it
[Figure: a packet stream T T T T T H; the flow whose packet draws the head is sampled.]

11 SIFT: A Randomized Algorithm
A flow of size X has roughly a 0.01X chance of being sampled
– the precise probability is 1 – (1 – 0.01)^X
– flows with 15 or fewer packets are sampled with probability less than 0.14
– flows with 100 or more packets are sampled with probability greater than 0.63
Most short flows will not be sampled; most long flows will be
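The sampling probability 1 – (1 – p)^X can be checked directly (a minimal sketch; the function name is illustrative):

```python
def sample_prob(x, p=0.01):
    """Probability that a flow of x packets gets at least one head,
    when each packet independently draws a head with probability p."""
    return 1 - (1 - p) ** x

print(round(sample_prob(15), 2))   # ~0.14 for a 15-packet flow
print(round(sample_prob(100), 2))  # ~0.63 for a 100-packet flow
```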

12 The Accuracy of Classification
Ideally, we would like to sample like the blue curve
Sampling with probability p gives the red curve
– there are false positives and false negatives
Can we get the green curve?
[Figure: probability of being sampled vs. flow size for the three curves.]

13 SIFT+
Sample with a coin of bias q = 0.1
– say that a flow is “sampled” if it gets two heads!
– this reduces the chance of making errors
– but you have to count the number of heads
So, how can we use SIFT at a router?
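The two-heads rule sharpens the classification curve; its sampling probability is easy to write down (a sketch, names illustrative):

```python
def siftplus_prob(x, q=0.1):
    """Probability that a flow of x packets gets at least two heads,
    with per-packet head probability q."""
    no_heads = (1 - q) ** x
    one_head = x * q * (1 - q) ** (x - 1)
    return 1 - no_heads - one_head

# A 5-packet flow is rarely sampled (~0.08), a 100-packet flow almost surely,
# whereas a single head at q = 0.1 would already tag the 5-packet flow
# about 41% of the time.
```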

14 SIFT at a router
Sample incoming packets
Place any packet with a head (or the second such packet) in the low-priority buffer
Place all further packets from this flow in the low-priority buffer (to avoid mis-sequencing)
[Figure: arriving packets are sampled and split between a short-flow buffer and a long-flow buffer of size B/2 each, in place of a single buffer of size B for all flows.]
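The per-packet logic of slide 14 can be sketched as follows (the class and names are illustrative, not from the talk):

```python
import random

class SiftClassifier:
    """Sketch of SIFT's queue assignment: coin-flip sampling plus a
    table of sampled (presumed long) flows."""
    def __init__(self, p=0.01):
        self.p = p
        self.sampled = set()   # ids of flows already tagged as long

    def queue_for(self, flow_id):
        if flow_id in self.sampled:
            return "low"                 # keep the flow's later packets in order
        if random.random() < self.p:     # this packet drew a head
            self.sampled.add(flow_id)
            return "low"
        return "high"                    # unsampled (presumed short) flows
```

Once a flow is sampled, every subsequent packet of that flow goes to the low-priority buffer, which is what avoids mis-sequencing within a flow.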

15 Simulation Results
[Figure: simulation topology, with traffic sources on one side and traffic sinks on the other.]

16 Overall Average Delays

17 Average Delay for Short Flows

18 Average Delay for Long Flows

19 Implementation Requirements
SIFT needs
– two logical queues in one physical buffer
– to sample arriving packets
– a table for maintaining the ids of sampled flows
– to check whether an incoming packet belongs to a sampled flow or not
All quite simple to implement

20 A Big Bonus
The buffer for the short flows has very low occupancy
– so, can we simply reduce it drastically without sacrificing performance?
More precisely, suppose
– we reduce the buffer size for the small flows, increase it for the large flows, and keep the total the same as FIFO

SIFT Incurs Fewer Drops
Buffer_Size(Short flows) = 10; Buffer_Size(Long flows) = 290; Buffer_Size(Single FIFO Queue) = 300
[Figure: packet drops over time for SIFT vs. FIFO.]

22 Reducing Total Buffer Size
Suppose we reduce the buffer size of the long flows as well
Questions:
– will packet drops still be fewer?
– will the delays still be as good?

Drops With Less Total Buffer
Buffer_Size(PRQ 0) = 10; Buffer_Size(PRQ 1) = 190; Buffer_Size(One Queue) = 300
[Figure: packet drops for SIFT with 200 packets of total buffer vs. a single 300-packet FIFO queue.]

Delay Histogram for Short Flows
[Figure: delay histograms for short flows under SIFT and FIFO.]

Delay Histogram for Long Flows
[Figure: delay histograms for long flows under SIFT and FIFO.]

26 Why SIFT Reduces Buffers
The amount of buffering needed to keep links fully utilized
– old formula: B = C × RTT = 10 Gb/s × 0.25 s = 2.5 Gb
– corrected to: B = C × RTT / √N ≈ 250 Mb
But this formula is for large (elephant) flows, not for short (mice) flows
– elephant arrival rate: 0.65 or 0.7 of C; hence smaller buffers suffice for them
– mice buffers are almost empty; due to their high priority, mice don’t cause elephant packet drops
– elephants use TCP to regulate their sending rate
[Figure: SIFT splits the buffer into a mice buffer and an elephants buffer.]
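The arithmetic behind the corrected buffer-sizing rule B = C × RTT / √N can be checked directly; the slide does not give N, so the N = 100 long-lived flows below is an assumption chosen to reproduce the quoted 250 Mb:

```python
import math

C   = 10e9   # line rate, bits per second
RTT = 0.25   # round-trip time, seconds
N   = 100    # assumed number of long-lived (elephant) flows

old_rule = C * RTT                  # bandwidth-delay product
new_rule = C * RTT / math.sqrt(N)   # corrected rule: divide by sqrt(N)

print(old_rule)  # 2.5 Gb of buffering
print(new_rule)  # 250 Mb of buffering
```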

27 Conclusions for SIFT
A randomized scheme; preliminary results show that
– it has a low implementation complexity
– it reduces delays drastically (users are happy)
– with 30-35% smaller buffers at egress line cards (router manufacturers are happy)
Leads to a “15 packets or fewer” lane on the Internet, which could be useful
Further work needed
– at the moment we have a good understanding of how to sample, and extensive (and encouraging) simulation tests
– need to understand the effect of reduced buffers on end-to-end congestion control algorithms

28 Traffic Statistics Counters: Motivation
Switches maintain statistics, typically using counters that are incremented when packets arrive
At high line rates, memory technology is a limiting factor in the implementation of counters; for example, in a 40 Gb/s switch, each packet must be processed in 8 ns
To maintain a counter per flow at these line rates, we would like an architecture with the speed of SRAM and the density (size) of DRAM
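The 8 ns figure follows from the minimum packet size (assuming 40-byte minimum-size packets, which the slide does not state explicitly):

```python
line_rate  = 40e9    # bits per second
min_packet = 40 * 8  # 40-byte minimum-size packet, in bits

# Time on the wire for one minimum-size packet: 320 bits / 40 Gb/s = 8 ns
print(min_packet / line_rate)  # 8e-09 seconds
```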

29 Hybrid Architecture
Shah, Iyer, Prabhakar, and McKeown (2001) proposed a hybrid SRAM/DRAM architecture
[Figure: arrivals (at most one per time slot) increment one of N small counters in SRAM; a counter management algorithm updates the corresponding counter in DRAM and empties the SRAM counter, once every b time slots.]

30 Counter Management Algorithm
Shah et al. place a requirement on the counter management algorithm (CMA): it must maintain all counter values accurately
That is, given N and b, what should the size of each SRAM counter be so that no counts are missed?

31 Some CMAs
Round robin
– maximum counter value is bN
Largest Counter First (LCF)
– optimal in terms of SRAM memory usage
– no counter can have a value larger than ln(bN) / ln(b/(b−1))

32 Analysis of LCF
This upper bound is proved by establishing a bound on a potential (Lyapunov) function
– let Q_i(t) be the size of counter i at time t, and consider F(t) = Σ_i (b/(b−1))^(Q_i(t))
– bounding F(t) by bN for all t shows that the size of the largest counter is at most ln(bN) / ln(b/(b−1))
E.g. for b = 2, this is log2(2N): about 21 for N = 1,000,000 counters
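A toy simulation of LCF under random arrivals stays within the bound ln(bN) / ln(b/(b−1)); this is a sketch, and the exact slot/service timing below is an assumption rather than something spelled out on the slides:

```python
import math
import random

def lcf_max(N=64, b=2, T=50000, seed=0):
    """Simulate LCF: one arrival per time slot increments a random counter;
    every b slots, the largest counter is flushed to DRAM (zeroed).
    Returns the largest SRAM counter value ever observed."""
    random.seed(seed)
    counters = [0] * N
    worst = 0
    for t in range(1, T + 1):
        counters[random.randrange(N)] += 1
        worst = max(worst, max(counters))
        if t % b == 0:
            counters[counters.index(max(counters))] = 0  # serve the largest
    return worst

bound = math.log(2 * 64) / math.log(2)  # ln(bN)/ln(b/(b-1)) = 7 for b=2, N=64
print(lcf_max() <= bound)  # True
```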

33 An Implementable Algorithm
LCF is difficult to implement
– with one counter per flow, we would like to support at least 1 million counters
– maintaining a sorted list of counters to determine the largest counter takes too much SRAM memory
Ramabhadran and Varghese (2003) proposed a simpler algorithm with the same memory usage as LCF

34 LCF with Threshold
The algorithm keeps track of the counters whose value is at least b
At any service time, let j be the counter with the largest value among those incremented since the previous service, and let c be its value
– if c ≥ b, serve counter j
– if c < b, serve any counter with value at least b; if no such counter exists, serve counter j
Maintaining the set of counters with values at least b is a non-trivial problem; it is solved using a bitmap and an additional data structure
Is something even simpler possible?
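The threshold rule can be sketched as follows; the bitmap is modeled with a Python set, and the slot/service timing is an assumption (the real data structure is in the paper):

```python
import random

def lcf_threshold(N=64, b=2, T=50000, seed=1):
    """LCF-with-threshold: serve the largest recently incremented counter
    if its value is at least b; otherwise serve any counter with value at
    least b, falling back to the recently incremented one."""
    random.seed(seed)
    counters = [0] * N
    over = set()    # stand-in for the bitmap of counters with value >= b
    recent = set()  # counters incremented since the previous service
    worst = 0
    for t in range(1, T + 1):
        i = random.randrange(N)
        counters[i] += 1
        recent.add(i)
        if counters[i] >= b:
            over.add(i)
        worst = max(worst, counters[i])
        if t % b == 0:  # one SRAM-to-DRAM service every b slots
            j = max(recent, key=lambda k: counters[k])
            s = j if counters[j] >= b else (next(iter(over)) if over else j)
            counters[s] = 0  # value moved to DRAM
            over.discard(s)
            recent.clear()
    return worst
```

The point of the rule is that only the maximum over the (at most b) recently incremented counters is needed per service, rather than a full sort of all N counters.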

35 Some Simpler Algorithms …
Possible approaches for a CMA that is simpler to implement:
– arrival information (serve the largest counter among those incremented)
– random sampling
– round-robin pointer
Trade-off between simplicity and performance: more SRAM is needed in the worst case for these schemes

36 An Alternative Architecture
Decision problem: given a counter with a particular value and the occupancy of the buffer, when should the counter value be moved to the FIFO buffer?
What size counters does this lead to?
– interesting question; tractable with Poisson arrivals and exponential services
[Figure: N SRAM counters drain through a FIFO buffer into DRAM, under a counter management algorithm.]

37 The Cost of Sharing
We have seen that there is a very limited amount of buffering and processing capability in each line card
In order to fully utilize these resources, it will become necessary to share them amongst the packets arriving at each line card
But sharing imposes a cost
– we may need to traverse the switch fabric more often than needed
– each of the two processors involved in a migration will need to do some processing; e.g. a small amount of local work plus 1 unit of remote work, instead of just 1
– or, the host processor may simply be worse at the processing; e.g. 1 unit of local work versus K (> 1) units of remote work
Need to understand the trade-off between costs and benefits
– will focus on a specific queueing model
– interested in simple rules
– benefit measured in reduction of backlogs

38 The Setup
Does sharing reduce backlogs?
[Figure: two queues, each with Poisson(λ) arrivals and an exp(1) server; a job served remotely costs a factor K more.]

39 Additive Threshold Policy
Job arrives at queue 1
Send the job to queue 2 if the length of queue 1 exceeds the length of queue 2 by more than a fixed additive threshold
Otherwise, keep the job in queue 1
Analogous policy for jobs arriving at queue 2

40 Additive Thresholds - Queue Tails
[Figure: queue-tail probabilities under the additive threshold policy vs. no sharing.]

41 Additive Thresholds - Stability
Theorem: the additive policy is stable for arrival rates λ below a critical value strictly less than 1, and unstable for λ above it
In other words, sharing with additive thresholds sacrifices some throughput

42 Inference
The pros/cons of sharing
– Reduction in backlogs
– Loss of throughput

43 Multiplicative Threshold Policy
Job arrives at queue 1
Send the job to queue 2 if the length of queue 1 exceeds a fixed multiple of the length of queue 2
Otherwise, keep the job in queue 1
Theorem: the multiplicative policy is stable for all λ < 1
Interestingly, this policy improves delays while preserving throughput!
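The exact threshold conditions appeared as formulas on the slides; a natural reading is: additive, migrate when the local queue exceeds the remote queue by more than a constant; multiplicative, migrate when it exceeds a constant multiple of the remote queue. A toy discrete-time simulation under these assumed forms (all parameter values are illustrative):

```python
import random

def avg_backlog(policy, lam=0.45, K=2.0, T=200000, dt=0.01, seed=3):
    """Two unit-rate servers; each arriving job brings 1 unit of work if
    kept locally, or K > 1 units if migrated to the other queue.
    policy(q_here, q_there) returns True to migrate."""
    random.seed(seed)
    q = [0.0, 0.0]
    area = 0.0
    for _ in range(T):
        for i in (0, 1):
            if random.random() < lam * dt:   # Poisson(lam) arrivals at queue i
                j = 1 - i
                if policy(q[i], q[j]):
                    q[j] += K                # migrated: remote work costs K
                else:
                    q[i] += 1.0
            q[i] = max(0.0, q[i] - dt)       # serve at unit rate
        area += (q[0] + q[1]) * dt
    return area / (T * dt)                   # time-average total backlog

no_share       = avg_backlog(lambda a, b: False)
additive       = avg_backlog(lambda a, b: a > b + 2.0)    # assumed form
multiplicative = avg_backlog(lambda a, b: a > 2.0 * b)    # assumed form
```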

44 Multiplicative Thresholds - Queue Tails
[Figure: queue-tail probabilities under the multiplicative threshold policy vs. no sharing.]

45 Multiplicative Thresholds - Delay
[Figure: average delay under the multiplicative threshold policy vs. no sharing.]

46 Conclusions
Network processors add useful features to a router’s function
There are many algorithmic questions that come up
– simple, high-performance algorithms are needed
For the theorist, there are many new and interesting questions; we have briefly seen three examples
– SIFT: a sampling algorithm
– Designing traffic statistics counters
– Sharing: a cost-benefit analysis