Network Processors and their memory
Network Processor Workshop, Madrid 2004
Nick McKeown
Departments of Electrical Engineering and Computer Science, Stanford University

2 Outline
- What I was going to say
- Network processors and their memory
  - Packet processing is all about getting packets into and out of a chip and memory.
  - Computation is a side-issue.
  - Memory speed is everything: speed matters more than size.
- Remarks

3 General Observations
- Up until about 1998:
  - Low-end packet switches used general-purpose processors.
  - Mid-range packet switches used FPGAs for the datapath and general-purpose processors for the control plane.
  - High-end packet switches used ASICs for the datapath and general-purpose processors for the control plane.
- More recently:
  - Third-party network processors are used in some low-end datapaths.
  - Home-grown network processors are used in mid- and high-end datapaths.

4 Why NPUs seem like a good idea
- What makes a CPU appealing for a PC:
  - Flexibility: supports many applications.
  - Time to market: allows quick introduction of new applications.
  - Future proof: supports as-yet-unthought-of applications.
- No one would consider using fixed-function ASICs for a PC.

5 Why NPUs seem like a good idea
- What makes an NPU appealing:
  - Time to market: saves ~18 months of building an ASIC; code re-use.
  - Flexibility: protocols and standards change.
  - Future proof: new protocols emerge.
  - Less risk: bugs are more easily fixed in software.
- Surely no one would consider using fixed-function ASICs for new networking equipment?

6 Why NPUs seem like a bad idea
- Jack of all trades, master of none.
- NPUs are difficult to program.
- NPUs inevitably consume more power, run more slowly, and cost more than an ASIC.
- They require domain expertise: why would a networking vendor educate its suppliers?
- They are designed for computation rather than memory-intensive operations.

7 NPU Characteristics
- NPUs try hard to hide memory latency.
- Conventional caching doesn't work:
  - Roughly equal numbers of reads and writes.
  - No temporal or spatial locality.
  - Cache misses lose throughput, confuse schedulers, and break pipelines.
- Therefore it is common to use multiple processors, each with multiple contexts.
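As a rough illustration of why multiple contexts beat caches here (my model, not from the talk): if each packet needs C cycles of computation and each off-chip reference stalls for L cycles, a single context idles most of the time, and roughly 1 + L/C contexts keep a processor fully busy. A minimal sketch in Python, with all cycle counts hypothetical:

```python
# Back-of-the-envelope model of latency hiding with hardware contexts.
# All cycle counts are hypothetical, for illustration only.

COMPUTE_CYCLES = 50    # useful work per packet between memory references
MEMORY_LATENCY = 150   # stall cycles per off-chip memory reference

def utilization(num_contexts: int) -> float:
    """Fraction of time the processor does useful work, assuming that
    while one context stalls on memory another is ready to run."""
    useful = num_contexts * COMPUTE_CYCLES
    period = COMPUTE_CYCLES + MEMORY_LATENCY   # one context's compute + stall
    return min(1.0, useful / period)

for n in (1, 2, 4):
    print(f"{n} context(s): {utilization(n):.0%} utilization")

# Full utilization needs about 1 + MEMORY_LATENCY/COMPUTE_CYCLES = 4 contexts,
# which is why NPUs multiply processors and threads rather than caches.
```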

8 Network Processors: Load-balancing
[Diagram: incoming packets pass through a dispatch stage to a pool of CPUs, each with a cache, backed by off-chip memory and dedicated hardware support (e.g., lookups).]
Incoming packets are dispatched to:
1. An idle processor, or
2. A processor dedicated to packets in this flow (to prevent mis-sequencing), or
3. A special-purpose processor for the flow, e.g., security, transcoding, or application-level processing.
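A toy sketch of the second dispatch rule (hypothetical code, not from the talk): hashing a flow identifier pins every packet of a flow to one CPU, so the parallelism can never re-order packets within a flow:

```python
# Toy flow-pinned dispatcher: packets of one flow always go to the same CPU,
# which prevents mis-sequencing. Names and tuple layout are hypothetical.
import zlib

NUM_CPUS = 8

def dispatch(src_ip: str, dst_ip: str, src_port: int, dst_port: int, proto: int) -> int:
    """Return the CPU index for this packet's flow (5-tuple hash)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % NUM_CPUS

# The same 5-tuple always lands on the same CPU:
print(dispatch("10.0.0.1", "10.0.0.2", 1234, 80, 6))
print(dispatch("10.0.0.1", "10.0.0.2", 1234, 80, 6))
```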

9 Network Processors: Pipelining
[Diagram: CPUs, each with its own cache, arranged in a pipeline in front of off-chip memory, with dedicated hardware support (e.g., lookups) at each stage.]
Processing is broken down into (hopefully balanced) steps; each processor performs one step of processing.
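The parenthetical "hopefully balanced" matters because a pipeline forwards packets only as fast as its slowest stage. A tiny illustrative model, with hypothetical stage times:

```python
# A pipeline's packet rate is set by its slowest stage, so an unbalanced
# partitioning wastes the other processors. Stage times are hypothetical.

def pipeline_rate(stage_ns):
    """Packets per second for a pipeline with the given per-stage times (ns)."""
    return 1e9 / max(stage_ns)

balanced   = [25, 25, 25, 25]   # 100 ns of total work, evenly split
unbalanced = [10, 10, 10, 70]   # the same 100 ns of work, badly split

print(f"balanced:   {pipeline_rate(balanced) / 1e6:.0f} Mpps")    # 40 Mpps
print(f"unbalanced: {pipeline_rate(unbalanced) / 1e6:.1f} Mpps")  # 14.3 Mpps
```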

10 Question
Is it clear that multiple small parallel processors are needed?

11 Doubts
- When are 10 processors at speed 1 better than 1 processor at speed 10?
- Network processors make sense if:
  - the application is parallelizable into multiple threads/contexts, and
  - uniprocessor performance is limited by load latency.
- If general-purpose processors evolve anyway to contain multiple processors per chip and to support hardware multi-threading, then perhaps they are better suited, because greater development effort means faster general-purpose processors and better development environments.

12 Outline
- What I was going to say
- Network processors and their memory
  - Packet processing is all about getting packets into and out of a chip and memory.
  - Computation is a side-issue.
  - Memory speed is everything: speed matters more than size.
- Remarks

13 NPUs and Memory
[Diagram: an NPU surrounded by its many memories: buffer memory, lookup tables, counters, schedule state, classification state, program data, and instruction code.]
A typical NPU or packet processor has 8-64 CPUs, 12 memory interfaces, and 2000 pins.

14 Trends in Technology, Routers & Traffic
- DRAM random-access time: 1.1x / 18 months
- Moore's Law: 2x / 18 months
- Router capacity: 2x / 18 months
- Line capacity: 2x / 7 months
- User traffic: 2x / 12 months
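Two gaps fall straight out of these doubling periods, and both reappear on later slides; a quick arithmetic check (my calculation, not from the slides):

```python
# Compound the doubling periods above to see the gaps they open up.

def growth(doubling_months: float, years: float) -> float:
    """Growth factor after `years`, given a doubling period in months."""
    return 2 ** (years * 12 / doubling_months)

# Moore's Law (2x/18mo) vs DRAM random-access time (1.1x/18mo): memory gets
# roughly 1.8x "further away" every 18 months -- slide 15 rounds this to 2x.
print(growth(18, 1.5) / 1.1)             # ~1.82

# Traffic (2x/12mo) vs router capacity (2x/18mo), compounded over 12 years:
print(growth(12, 12) / growth(18, 12))   # 4096 / 256 = 16x -- slide 18's gap
```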

15 Memory gets further away
- Accessing memory becomes twice as (relatively) expensive every 18 months.
- CPUs respond with:
  - bigger caches,
  - larger refill blocks and faster pins,
  - better pre-fetching algorithms.
- NPUs respond with... more CPUs?

16 Backbone router capacity
[Chart: router capacity per rack on a log scale from 1Gb/s to 1Tb/s, doubling every 18 months.]

17 Backbone router capacity
[Same chart with traffic overlaid: router capacity per rack doubles every 18 months, while traffic doubles every year.]

18 Trends and Consequences
Trend 1: CPU instructions available per minimum-length packet keep shrinking.
Trend 2: Disparity between traffic growth (2x every year) and router capacity growth (2x every 18 months): at these rates, roughly 100Tb/s of traffic against 1Tb/s of router capacity by 2015, a 16x disparity.
Consequences:
1. Per-packet processing is getting harder.
2. Efficient, simple processing will become more important.
3. Routers will get faster, simpler, and more efficient. (Weren't they supposed to be simple in the first place?)

19 Trends and Consequences (2)
Trend 3: Power consumption is out of control.
Trend 4: Disparity between line rate and memory access time.
Consequences:
1. Power efficiency will continue to be important.
2. Memories will seem slower and slower. Are we just going to keep adding more parallelism?

20 Predictions (1)
- Memory speed will matter more than size.
  - Memory speed will remain a problem; waiting for slow off-chip memory will become intolerable.
  - Memory size will become less of an issue.
- Memory size:
  - Packet buffers: today they are too big; they'll get smaller.

21 Memory Size
- The universally applied rule-of-thumb: a router needs a buffer of size B = 2T × C, where 2T is the round-trip propagation time and C is the capacity of the outgoing link.
- Background:
  - Mandated in backbone and edge routers.
  - Appears in RFPs and IETF architectural guidelines.
  - Has huge consequences for router design.
  - Comes from the dynamics of TCP congestion control.
  - Source: Villamizar and Song, "High Performance TCP in ANSNET", ACM CCR, 1994. Based on 16 TCP flows at speeds of up to 40 Mb/s.

22 Example
- A 10Gb/s linecard or router:
  - Requires 300Mbytes of buffering.
  - Must read and write a new packet every 32ns.
- Memory technologies:
  - SRAM: would require ~80 devices, 1kW, $2000.
  - DRAM: would require only 4 devices, but is too slow.
- The problem gets harder at 40Gb/s; hence RLDRAM, FCRAM, etc.
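The slide states only the results; checking them against B = 2T × C (my arithmetic, with the ~0.25s round-trip time inferred from the 300Mbyte figure):

```python
# Sanity-check the 10Gb/s linecard example against B = 2T * C.
# The 0.25 s round-trip time is inferred from the slide's 300 MB figure.

C   = 10e9    # link capacity, bits/s
RTT = 0.25    # 2T, seconds (typical long-haul assumption)

buffer_bits = RTT * C
print(f"buffer: {buffer_bits / 8 / 1e6:.0f} MB")        # ~312 MB ("300 Mbytes")

# Minimum-length packet of 40 bytes arriving at 10 Gb/s:
min_pkt_bits = 40 * 8
print(f"packet every {min_pkt_bits / C * 1e9:.0f} ns")  # 32 ns per packet
```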

23 Rule-of-thumb
- Where did the rule-of-thumb come from?
- Is it correct? (No.)
Joint work with Guido Appenzeller and Isaac Keslassy, Stanford.

24 Single TCP Flow
[Diagram: a source sends over a fast ingress link (C' > C) into a buffer of size B that drains at bottleneck rate C toward the destination; a plot shows the window size oscillating over time between limits set by the buffer size and the RTT.]
For every W ACKs received, the source sends W+1 packets.
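That "W+1 packets per W ACKs" rule is TCP's additive increase: the window grows by one packet per round-trip until a drop halves it, producing the sawtooth on the following slides. A minimal simulation, with a hypothetical loss threshold:

```python
# Minimal AIMD sketch: window += 1 per RTT, halved on loss. The loss
# threshold (pipe + buffer, in packets) is a hypothetical round number.

W_MAX = 100          # window at which the buffer overflows and a drop occurs
window = W_MAX // 2
sawtooth = []

for rtt in range(200):
    sawtooth.append(window)
    window += 1                  # for every W ACKs, send W+1 packets
    if window > W_MAX:           # buffer full -> drop -> multiplicative decrease
        window //= 2

print(min(sawtooth), max(sawtooth))   # oscillates between W_MAX/2 and W_MAX
```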

25 Over-buffered Link
[Plot: queue occupancy over time for an over-buffered link.]

26 Under-buffered Link
[Plot: queue occupancy over time for an under-buffered link.]

27 Buffer = Rule-of-thumb
[Plot: queue occupancy with the buffer sized exactly to the rule-of-thumb; one interval is magnified on the next slide.]

28 Microscopic TCP Behavior
[Magnified plot: after a drop, the sender pauses and the buffer drains for one RTT.]

29 Origin of rule-of-thumb
- While the source pauses, the buffer drains.
- The source pauses for 2T + B/C − W_max/2C seconds.
- The buffer drains in B/C seconds.
- Therefore the buffer never goes empty if B ≥ 2T × C.
- We can size B to keep the bottleneck link busy.
[Timeline: source and destination exchanging packets W_max−2, W_max−1, W_max, followed by a timeout and pause.]
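Spelling out the algebra those bullets compress (my reconstruction of the standard argument):

```latex
% Reconstruction: the buffer must not empty while the halved window recovers.
\begin{align*}
\underbrace{\tfrac{B}{C}}_{\text{drain time}}
  &\;\ge\; \underbrace{2T + \tfrac{B}{C} - \tfrac{W_{\max}}{2C}}_{\text{pause}}
  \;\iff\; \tfrac{W_{\max}}{2} \;\ge\; 2T\,C, \\
W_{\max} &= 2T\,C + B \quad\text{(at the drop, the window fills pipe + buffer)}, \\
\tfrac{2T\,C + B}{2} &\;\ge\; 2T\,C \;\iff\; B \;\ge\; 2T\,C.
\end{align*}
```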

30 Rule-of-thumb
- The rule-of-thumb makes sense for one flow.
- A typical backbone link carries > 20,000 flows.
- Does the rule-of-thumb still hold?
- Answer:
  - If flows are perfectly synchronized: yes.
  - If flows are desynchronized: no.

31 Buffer size is height of sawtooth
[Plot: queue occupancy over time; the buffer size B is the height of the sawtooth.]

32 If flows are synchronized
- The aggregate window has the same dynamics as a single flow's window.
- Therefore the buffer occupancy has the same dynamics.
- The rule-of-thumb still holds.

33 Two TCP Flows
[Plot: window traces of two TCP flows falling into lockstep.]
Two TCP flows can synchronize.

34 If flows are not synchronized
- The aggregate window has less variation.
- Therefore the buffer occupancy has less variation.
- The more flows, the smaller the variation.
- The rule-of-thumb does not hold.

35 If flows are not synchronized
- With a large number of flows (> 500), the central limit theorem applies: the fluctuation of the aggregate window shrinks as 1/√n.
- Therefore we can pick the utilization we want and determine the buffer size accordingly.
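The resulting rule (from the accompanying "Sizing Router Buffers" work, and consistent with the numbers on slide 38) is that n desynchronized long flows shrink the rule-of-thumb by √n: B = 2T × C / √n. A quick check against slide 38's figures:

```python
# Small-buffer rule B = 2T*C / sqrt(n), checked against slide 38's figures.
from math import sqrt

def small_buffer(rule_of_thumb_bits: float, n_flows: int) -> float:
    return rule_of_thumb_bits / sqrt(n_flows)

# 10 Gb/s linecard, 200,000 flows: 2.5 Gbit -> ~6 Mbit
print(f"{small_buffer(2.5e9, 200_000) / 1e6:.1f} Mbit")   # ~5.6

# 40 Gb/s linecard, 40,000 flows: 10 Gbit -> 50 Mbit
print(f"{small_buffer(10e9, 40_000) / 1e6:.0f} Mbit")     # 50
```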

36 Required buffer size
[Plot: required buffer size, simulation vs. model.]

37 Experiments with backbone router (GSR)
[Table, partly garbled in transcription: for two groups of TCP flow counts, router buffer sizes (in packets and in RAM, from a few hundred kb up to 8Mb) against link utilization as predicted by the model, simulated, and measured on the router; the three columns agree closely, at roughly 95-100% utilization even for the smallest buffers.]
Thanks: experiments conducted by Paul Barford and Joel Sommers, U. of Wisconsin.

38 In Summary
- Buffer size is dictated by long TCP flows.
- 10Gb/s linecard with 200,000 x 56kb/s flows:
  - Rule-of-thumb: buffer = 2.5Gbits. Requires external, slow DRAM.
  - Becomes: buffer = 6Mbits. Can use on-chip, fast SRAM; completion time is halved for short flows.
- 40Gb/s linecard with 40,000 x 1Mb/s flows:
  - Rule-of-thumb: buffer = 10Gbits.
  - Becomes: buffer = 50Mbits.

39 Outline
- What I was going to say
- Network processors and their memory
  - Packet processing is all about getting packets into and out of a chip and memory.
  - Computation is a side-issue.
  - Memory speed is everything: speed matters more than size.
- Remarks

40 My 2c on network processors
The nail (packet processing):
1. Stream processing.
2. Multiple flows.
3. Most processing is on the header, not the data.
4. Two sets of data: packets and context.
5. Packets have no temporal locality, and special spatial locality.
6. Context has temporal and spatial locality.
The hammer (a conventional CPU):
1. A shared in/out bus.
2. Optimized for data with spatial and temporal locality.
3. Optimized for register accesses.

41 A network uniprocessor
[Diagram: a processor with data cache(s) plus a separate context memory hierarchy over off-chip memory; packets stream through on-chip and off-chip FIFOs with head/tail pointers and mailbox registers.]
Add hardware support for multiple threads/contexts.

42 Recommendations
- Follow the CPU lead:
  - Develop quality public benchmarks; encourage comparison and debate.
  - Develop a "DLX" NPU to compare against, and to encourage innovation in instruction sets and in (parallel) programming languages and development tools.
- Ride the coat-tails of CPU development; watch for the CPU with networking extensions.
- NPUs are about memory, not computation.
- Memory speed matters more than size.