Introduction to High-Performance Internet Switches and Routers
Network Architecture (diagram, source: http://www.ust.hk/itsc/network/): a DWDM long-haul core interconnecting metropolitan networks; 10GbE core routers in the metro core, 10GbE edge routers and edge switches toward campus/residential networks, and GbE access switches and access routers at the edge.
How the Internet really is: Current Trend (modems, DSL; SONET/SDH; DWDM). In the access network, the majority of users connect to the Internet with either a phone modem or a DSL line. So why don't textbooks talk about these circuit switches? Because these circuits are not integrated with IP; IP sees them as static point-to-point links. We have this architecture for historic reasons, not because of a well-thought-out design process. In the past, when an Internet Service Provider on the West Coast wanted to connect with another ISP on the East Coast, it had two options: lay down a cable, which was very expensive, or rent a circuit from the long-distance telephone carriers, who were using circuit switching. Likewise, when the ISP wanted to reach residential customers, it had to go through one of the few companies with a connection to the home, the local phone company. Now, is this hybrid architecture the right network architecture? Wouldn't it be better to have one that uses only packets, or perhaps one that uses only circuits? Let me define some performance criteria that I will use to answer those questions.
What is Routing? (Diagram: hosts A-F attached to a network of routers R1-R5.)
Points of Presence (POPs) (Diagram: hosts A-F reached through POP1-POP8.)
Where High Performance Routers are Used (Diagram: a backbone of routers R1-R16 interconnected by 10 Gb/s links.)
Hierarchical arrangement: end hosts (1000s per mux) connect to access multiplexers, which feed edge routers, which feed core routers inside a POP; POPs are linked by 10 Gb/s "OC192" long-haul links. POP: Point of Presence, richly interconnected by a mesh of long-haul links. Typically 40 POPs per national network operator and 10-40 core routers per POP.
Typical POP Configuration: backbone (core) routers connect to the transport network (DWDM/SONET terminals) over 10G WAN transport links and to each other over 10G router-to-router intra-office links; aggregation switches/routers (edge switches) hang off the backbone routers. More than 50% of the high-speed interfaces are router-to-router.
Today's Network Equipment (protocol layers and the equipment that implements them):
Layer 3 - Internet (IP): routers
Layer 2 - FR & ATM: switches
Layer 1 - SONET
Layer 0 - DWDM
Functions in a packet switch (diagram): ingress linecard: framing, route lookup, TTL processing, buffering; interconnect: interconnect scheduling; egress linecard: buffering, QoS scheduling, framing; plus a control plane. The data path, control path and scheduling path are shown separately.
Functions in a circuit switch (diagram): ingress linecard: framing; interconnect: interconnect scheduling; egress linecard: framing; plus a control plane. Only a data path and a control path; no per-packet processing.
Our emphasis for now is on packet switches (IP, ATM, Ethernet, Frame Relay, etc.). If anyone has additional comments, please speak up.
What a Router Looks Like. Cisco CRS-1 (16-slot single-shelf system): full rack, 214 cm x 60 cm x 101 cm; capacity 640 Gb/s; power 13.2 kW; "up to 72 boxes for a total capacity of 92 terabits per second". Juniper T1600 (16-slot system): half a rack, 95 cm x 44 cm x 79 cm; capacity 1.6 Tb/s; power 9.1 kW. (Juniper TX: four T-640s.)
What a Router Looks Like. Cisco GSR 12416: 19" wide, 6 ft tall, 2 ft deep; capacity 160 Gb/s; power 4.2 kW. Juniper M160: 19" wide, 3 ft tall, 2.5 ft deep; capacity 80 Gb/s; power 2.6 kW.
A Router Chassis (photo: linecards, fans and power supplies).
Backplane: a circuit board with connectors for line cards; high-speed electrical traces connect the line cards to the fabric; usually passive; typically 30-layer boards.
Line Card Picture
What do these two have in common? Cisco CRS-1 Cisco Catalyst 3750G
What do these two have in common? CRS-1 linecard: 20" x (18"+11") x 1RU; 40 Gb/s, 80 Mpps; state-of-the-art 0.13 µm silicon; full IP routing stack including IPv4 and IPv6 support; distributed IOS; multi-chassis support. Cat 3750G switch: 19" x 16" x 1RU; 52 Gb/s, 78 Mpps; state-of-the-art 0.13 µm silicon; full IP routing stack including IPv4 and IPv6 support; distributed IOS; multi-chassis support.
What is different between them? Cisco CRS-1 Cisco Catalyst 3750G
A lot… CRS-1 linecard: up to 1024 linecards; fully programmable forwarding; 2M prefix entries and 512K ACLs; 46 Tb/s 3-stage switching fabric; MPLS support; H-A non-stop routing protocols. Cat 3750G switch: up to 9 stack members; hardwired ASIC forwarding; 11K prefix entries and 1.5K ACLs; 32 Gb/s shared stack ring; L2 switching support; re-startable routing applications. Also note that the CRS-1 line card is about 30x the material cost of the Cat 3750G.
Other packet switches: Cisco 7500 "edge" routers; Lucent GX550 core ATM switch; DSL routers.
What is Routing? (Diagram: routers R1-R5 connecting hosts A-F; each router holds a forwarding table of (Destination, Next Hop) entries, e.g. destinations D, E, F mapped to next hops such as R3 and R5.)
What is Routing? (Diagram: the same network, with the 20-byte IPv4 header expanded: Ver, HLen, Type of Service, Total Packet Length, Fragment ID, Flags, Fragment Offset, TTL, Protocol, Header Checksum, Source Address, Destination Address, Options (if any), then Data. The destination address is looked up in the (Destination, Next Hop) table.)
What is Routing? (Diagram: packets forwarded hop by hop from hosts A-C through R1-R5 to hosts D-F.)
Basic Architectural Elements of a Router. Control plane ("typically in software"): routing, routing table update (OSPF, RIP, IS-IS), admission control, congestion control, reservation. Switch, i.e. per-packet processing ("typically in hardware"): routing lookup, packet classification, switching, arbitration, scheduling.
Basic Architectural Components. Datapath: per-packet processing in three steps: 1. forwarding decision (using a forwarding table at each input); 2. interconnect; 3. output scheduling.
Per-packet processing in a Switch/Router 1. Accept packet arriving on an ingress line. 2. Lookup packet destination address in the forwarding table, to identify outgoing interface(s). 3. Manipulate packet header: e.g., decrement TTL, update header checksum. 4. Send packet to outgoing interface(s). 5. Queue until line is free. 6. Transmit packet onto outgoing line.
ATM Switch Lookup cell VCI/VPI in VC table. Replace old VCI/VPI with new. Forward cell to outgoing interface. Transmit cell onto link.
Ethernet Switch Lookup frame DA in forwarding table. If known, forward to correct port. If unknown, broadcast to all ports. Learn SA of incoming frame. Forward frame to outgoing interface. Transmit frame onto link.
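To make the learn/forward steps above concrete, here is a minimal Python sketch of a self-learning Ethernet switch. The class name, port numbering and MAC strings are invented for illustration; this is not any vendor's implementation.

```python
# Minimal sketch of Ethernet-switch forwarding with source-address learning.

class EthernetSwitch:
    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.mac_table = {}          # MAC address -> output port

    def handle_frame(self, in_port, src_mac, dst_mac):
        # Learn: remember which port the source address was seen on.
        self.mac_table[src_mac] = in_port
        # Forward: known destination -> one port; unknown -> flood all other ports.
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]
        return [p for p in range(self.num_ports) if p != in_port]

sw = EthernetSwitch(4)
print(sw.handle_frame(0, "aa:aa", "bb:bb"))   # unknown dst: flood to ports 1, 2, 3
print(sw.handle_frame(1, "bb:bb", "aa:aa"))   # known dst: forward to port 0
```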
IP Router Lookup packet DA in forwarding table. If known, forward to correct port. If unknown, drop packet. Decrement TTL, update header Cksum. Forward packet to outgoing interface. Transmit packet onto link.
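A minimal Python sketch of the IP forwarding steps above: longest-prefix match, TTL decrement, and incremental checksum update per RFC 1624. The FIB contents, next-hop names and the example checksum are made up for illustration.

```python
import ipaddress

# Hypothetical forwarding table: prefix -> next hop.
FIB = {
    ipaddress.ip_network("0.0.0.0/0"):   "R1",
    ipaddress.ip_network("10.0.0.0/8"):  "R3",
    ipaddress.ip_network("10.1.0.0/16"): "R5",
}

def lookup(dst_ip):
    addr = ipaddress.ip_address(dst_ip)
    matches = [net for net in FIB if addr in net]
    if not matches:
        return None                                  # unknown destination: drop
    return FIB[max(matches, key=lambda net: net.prefixlen)]   # longest prefix wins

def ones_add(a, b):                                  # 16-bit one's-complement add
    s = a + b
    return (s & 0xFFFF) + (s >> 16)

def decrement_ttl(ttl, proto, checksum):
    """Decrement TTL and update the header checksum incrementally (RFC 1624)."""
    old_word = (ttl << 8) | proto                    # 16-bit header word holding TTL
    new_word = ((ttl - 1) << 8) | proto
    hc = ones_add((~checksum) & 0xFFFF, (~old_word) & 0xFFFF)
    hc = ones_add(hc, new_word)                      # HC' = ~(~HC + ~m + m')
    return ttl - 1, (~hc) & 0xFFFF

print(lookup("10.1.2.3"))              # -> "R5" (more specific than 10.0.0.0/8)
print(decrement_ttl(64, 6, 0xB861))    # new TTL and updated checksum
```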
Special per packet/flow processing. The router can be equipped with additional capabilities to provide special services on a per-packet or per-class basis. It can perform additional processing on incoming packets: classifying the packet (IPv4, IPv6, MPLS, ...); delivering packets according to a pre-agreed service, absolute or relative (e.g., deliver a packet within a given deadline, or give one packet better service than another: IntServ / DiffServ); filtering packets for security reasons; treating multicast packets differently from unicast packets.
Per-packet processing must be fast!
Year | Aggregate line rate | Arriving rate of 40B POS packets (Mpps)
1997 | 622 Mb/s | 1.56
1999 | 2.5 Gb/s | 6.25
2001 | 10 Gb/s  | 25
2003 | 40 Gb/s  | 100
2006 | 80 Gb/s  | 200
2008 | …
Packet processing must be simple and easy to implement, and memory access time is the bottleneck: 200 Mpps x 2 lookups/pkt = 400 Mlookups/sec, i.e. 2.5 ns per lookup.
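A quick sanity check of the arithmetic in the last line, using the table's 2006 figure of 200 Mpps and the stated two lookups per packet:

```python
# Lookup-time budget implied by the 2006 row (200 Mpps at 80 Gb/s),
# assuming 2 lookups per packet as stated on the slide.
pps = 200e6                      # arriving 40-byte packets per second
lookups_per_sec = 2 * pps        # 400 Mlookups/s
print(1e9 / pps)                 # 5.0 ns available per packet
print(1e9 / lookups_per_sec)     # 2.5 ns available per lookup
```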
First Generation Routers (diagram): a shared backplane connects line interfaces (MAC) to a central CPU holding the route table and buffer memory; typically <0.5 Gb/s aggregate capacity.
Bus-based Router Architectures with a Single Processor. The first generation of IP router, based on software running on a single general-purpose CPU. Limitations: a serious processing bottleneck in the central processor; memory-intensive operations (e.g. table lookups and data movement) limit the effectiveness of the processor; and the shared input/output (I/O) bus severely limits overall router throughput.
Second Generation Routers (diagram): line cards, each with its own buffer memory, forwarding cache and MAC, connect over a shared bus to a CPU with the route table and buffer memory; typically <5 Gb/s aggregate capacity.
Bus-based Router Architectures with Multiple Processors (architectures with route caching). Packet forwarding is distributed across the network interface cards, each with its own processor and route cache, so packets cross the shared bus only once. Limitations: the central routing table is a bottleneck at high speeds; throughput is traffic-dependent (cache hit rate); and the shared bus is still a bottleneck.
Third Generation Routers (diagram): a switched backplane connects line cards (each with MAC, forwarding table and local buffer memory) and a CPU card holding the routing table; typically <50 Gb/s aggregate capacity.
Switch-based Router Architectures with Fully Distributed Processors. To avoid bottlenecks in processing power, memory bandwidth and internal bus bandwidth, each network interface is equipped with appropriate processing power and buffer space.
Fourth Generation Routers/Switches: optics inside a router for the first time. Linecards connect to the switch core over optical links hundreds of metres long; 0.3-10 Tb/s routers in development.
Juniper TX8/T640 Alcatel 7670 RSP Avici TSR Chiaro
Next Gen. Backbone Network Architecture: one backbone, multiple access networks (diagram). A (G)MPLS-based multi-service intelligent packet backbone connects DSL/FTTH/dial access networks (telecommuters, residential), cable networks, dual-stack IPv4-IPv6 enterprise networks, mobile networks (SGSN/GGSN), an IPv6 IX and ISPs offering native IPv6 services. CE routers attach to PE routers at service POPs. One backbone network maximizes speed, flexibility and manageability.
Current Generation: Generic Router Architecture (diagram). Header processing: look up the packet's IP destination address in an address table of ~1M prefixes (off-chip DRAM) to get the next hop, and update the header; then queue the packet in buffer memory sized for ~1M packets (off-chip DRAM).
Current Generation: Generic Router Architecture, input-queued (diagram): each of the N linecards does its own lookup and header update and queues packets in its own buffer memory; a central scheduler arbitrates transfers across the interconnect.
Current Generation: Generic Router Architecture, output-queued (diagram): each input does lookup and header update, and packets are immediately placed in the buffer memory of the destination output (1…N).
Basic Architectural Elements of a Current Router. Typical IP router linecard: physical layer; framing & maintenance; packet processing with lookup tables; buffer management & scheduling with buffer & state memory; all attached through the backplane to a buffered or bufferless fabric (e.g. crossbar, bus) with a scheduler. An OC192c linecard has ~10-30M gates, ~2 Gbits of memory, occupies ~2 square feet, costs >$10k to build and is priced around $100K.
Performance metrics. Capacity: "maximize C, s.t. volume < 2 m³ and power < 5 kW". Throughput: operators like to maximize usage of expensive long-haul links. Controllable delay: some users would like predictable delay; this is feasible with output queueing plus weighted fair queueing (WFQ).
Why do we Need Faster Routers? To prevent routers from becoming the bottleneck in the Internet. To increase POP capacity, and to reduce cost, size and power.
Why we Need Faster Routers: to prevent routers from being the bottleneck. Growth rates: line capacity 2x / 7 months; user traffic 2x / 12 months; router capacity 2.2x / 18 months; Moore's Law 2x / 18 months; DRAM random access time 1.1x / 18 months.
Why we Need Faster Routers 1: to prevent routers from being the bottleneck (graph: disparity between traffic growth and router capacity growth, roughly a 5-fold disparity).
Why we Need Faster Routers 2: to reduce cost, power & complexity of POPs. Big POPs need big routers (diagram: a POP with a few large routers vs. a POP with many smaller routers). Interfaces: price >$200k, power >400W. About 50-60% of interfaces are used for interconnection within the POP. The industry trend is towards a large, single router per POP.
A case study: the UUNET Internet backbone build-up. 1999 view (4Q): 8 OC-48 links between POPs (not parallel). 2002 view (4Q): 52 OC-48 links between POPs, many of them parallel, plus 3 OC-192 super-POP links with multiple parallel interfaces between POPs (D.C.-Chicago; NYC-D.C.). To meet the traffic growth, higher-performance routers with higher port speeds are required.
Why we Need Faster Routers 2: to reduce cost, power & complexity of POPs. Once a router is 99.999% available, this consolidation step (a single large router per POP) becomes possible: it further reduces CapEx and operational cost and further increases network stability.
Ideal POP (diagram): a single large router sits between the existing carrier equipment (carrier optical transport: SONET, DWDM and optical switches) and the service/aggregation layer: gigabit routers, VoIP gateways, DSL aggregation, ATM, Gigabit Ethernet and cable modem aggregation.
Why are Fast Routers Difficult to Make? Big disparity between line rates and memory access speed
Problem: Fast Packet Buffers. Example: a 40 Gb/s packet buffer; size = RTT x BW = 10 Gb; 64-byte packets; write rate R and read rate R of one packet every 12.8 ns each. How fast can the buffer memory be? Use SRAM? Fast enough random access time, but too low a density to store 10 Gb of data. Use DRAM? High density means we can store the data, but it is too slow (~50 ns random access time).
Memory Technology (2007)
Technology      | Max single-chip density | $/chip ($/MByte)       | Access speed | Watts/chip
Networking DRAM | 64 MB                   | $30-$50 ($0.50-$0.75)  | 40-80 ns     | 0.5-2 W
SRAM            | 8 MB                    | $50-$60 ($5-$8)        | 3-4 ns       | 2-3 W
TCAM            | 2 MB                    | $200-$250 ($100-$125)  | 4-8 ns       | 15-30 W
How fast a buffer can be made? ~5 ns per memory operation for SRAM, ~50 ns for DRAM, with a 64-byte-wide bus between the external line and the buffer memory. Rough estimate: two memory operations per packet (one write, one read), so the maximum is roughly 50 Gb/s with SRAM and 5 Gb/s with DRAM. Aside: buffers need to be large for TCP to work well, so DRAM is usually required.
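The "~50 / ~5 Gb/s" estimate follows directly from the numbers on this slide; a small sketch of the arithmetic, using the slide's assumptions (64-byte bus word, one write plus one read per packet):

```python
def max_buffer_throughput_gbps(access_time_ns, bus_width_bytes=64):
    ops_per_packet = 2                       # write on arrival + read on departure
    bits_moved = bus_width_bytes * 8         # one memory operation moves one bus word
    return bits_moved / (ops_per_packet * access_time_ns)   # bits per ns = Gb/s

print(max_buffer_throughput_gbps(5))    # SRAM:  ~51 Gb/s
print(max_buffer_throughput_gbps(50))   # DRAM:  ~5 Gb/s
```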
Packet Caches (diagram): a hybrid buffer in which the buffer manager keeps a small ingress SRAM cache of the FIFO tails and a small egress SRAM cache of the FIFO heads, while the bulk of each queue lives in DRAM buffer memory; packets are moved between SRAM and DRAM b >> 1 packets at a time.
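A rough behavioural sketch of the packet-cache idea for a single queue, assuming packets move between SRAM and DRAM in batches of b. The class and method names are invented for illustration, and the real memory-management algorithms behind this scheme are not modelled.

```python
from collections import deque

class HybridPacketBuffer:
    """Sketch: the tail and head of the FIFO live in fast SRAM; the middle lives
    in slow, dense DRAM and is moved b packets at a time, so the DRAM needs only
    about 1/b of the random-access rate of the line."""
    def __init__(self, b):
        self.b = b
        self.tail_sram = deque()   # most recently arrived packets
        self.dram = deque()        # bulk storage, accessed b packets at a time
        self.head_sram = deque()   # next packets to depart

    def write(self, pkt):
        self.tail_sram.append(pkt)
        if len(self.tail_sram) >= self.b:            # one wide DRAM write
            for _ in range(self.b):
                self.dram.append(self.tail_sram.popleft())

    def read(self):
        if not self.head_sram:
            if self.dram:                            # one wide DRAM read
                for _ in range(min(self.b, len(self.dram))):
                    self.head_sram.append(self.dram.popleft())
            else:                                    # nearly empty queue: bypass DRAM
                self.head_sram, self.tail_sram = self.tail_sram, deque()
        return self.head_sram.popleft() if self.head_sram else None
```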
Why are Fast Routers Difficult to Make? Packet processing gets harder (graph of instructions per arriving byte over time: what we'd like, namely more features such as QoS, multicast and security, versus what will actually happen).
Why are Fast Routers Difficult to Make? Clock cycles per minimum length packet since 1996
Options for packet processing: general-purpose processors (MIPS, PowerPC, Intel); network processors (Intel IXA and IXP processors, IBM Rainier; control-plane processors: SiByte (Broadcom), QED (PMC-Sierra)); FPGAs; ASICs.
General Observations. Up until about 2000: low-end packet switches used general-purpose processors; mid-range packet switches used FPGAs for the datapath and general-purpose processors for the control plane; high-end packet switches used ASICs for the datapath and general-purpose processors for the control plane. More recently: third-party network processors are used in many low- and mid-range datapaths, while home-grown network processors are used in the high end.
Why are Fast Routers Difficult to Make? Demand for Router Performance Exceeds Moore’s Law Growth in capacity of commercial routers (per rack): Capacity 1992 ~ 2Gb/s Capacity 1995 ~ 10Gb/s Capacity 1998 ~ 40Gb/s Capacity 2001 ~ 160Gb/s Capacity 2003 ~ 640Gb/s Capacity 2007 ~ 11.5Tb/s Average growth rate: 2.2x / 18 months.
Maximizing the throughput of a router Engine of the whole router Operators increasingly demand throughput guarantees: To maximize use of expensive long-haul links For predictability and planning Serve as many customers as possible Increase the lifetime of the equipment Despite lots of effort and theory, no commercial router today has a throughput guarantee.
Maximizing the throughput of a router: the engine of the whole router (diagram: ingress linecard with framing, route lookup, TTL processing and buffering; interconnect with interconnect scheduling; egress linecard with buffering, QoS scheduling and framing; plus the control plane and the data, control and scheduling paths).
Maximizing the throughput of a router: the engine of the whole router. Throughput depends on the switching architecture (input queued, output queued, shared memory) and on the arbitration/scheduling algorithms used within that architecture. This is key to the overall performance of the router.
Why are Fast Routers Difficult to Make? Power: it is exceeding practical limits.
Switching Architectures
Generic Router Architecture (diagram, output-queued style): each input does header processing (look up the IP address in the address table, update the header) and the packet is queued in the buffer memory of output 1…N; the output buffer memory must run at N times the line rate.
Generic Router Architecture (diagram, input-queued style): each of the N linecards does header processing and queues packets in its own buffer memory; a scheduler arbitrates transfers across the interconnect.
Interconnects: two basic techniques. Input queueing, usually with a non-blocking switch fabric (e.g. crossbar); and output queueing, usually with a fast bus.
Simple model of an output queued switch (diagram): N ingress links and N egress links, all at rate R; arriving packets are placed directly into queues at the egress links.
How an Output Queued (OQ) Switch Works (animation).
Characteristics of an output queued (OQ) switch Arriving packets are immediately written into the output queue, without intermediate buffering. The flow of packets to one output does not affect the flow to another output. An OQ switch has the highest throughput, and lowest delay. The rate of individual flows, and the delay of packets can be controlled (QoS).
The shared memory switch (diagram): a single physical memory device into which all N ingress links (rate R) write and from which all N egress links (rate R) read.
Characteristics of a shared memory switch
Memory bandwidth. Basic OQ switch: consider an OQ switch with N different physical memories and all links operating at rate R bits/s. In the worst case, packets may arrive continuously from all inputs, destined to just one output, so the maximum memory bandwidth requirement for each memory is (N+1)R bits/s. Shared memory switch: the maximum memory bandwidth requirement for the single memory is 2NR bits/s.
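The two worst-case figures above as one-line calculations (N ports at line rate R; the printed values use an assumed 32-port, 10 Gb/s example):

```python
def oq_memory_bw(N, R):       # per-output memory: N simultaneous writes + 1 read
    return (N + 1) * R

def shared_memory_bw(N, R):   # single memory: N writes + N reads every cell time
    return 2 * N * R

N, R_gbps = 32, 10
print(oq_memory_bw(N, R_gbps))       # 330 Gb/s per output memory
print(shared_memory_bw(N, R_gbps))   # 640 Gb/s for the shared memory
```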
How fast can we make a centralized shared memory switch? With 5 ns SRAM and a 200-byte-wide memory bus: 5 ns per memory operation and two memory operations per packet means up to 200 bytes every 10 ns, i.e. 200 x 8 bits / 10 ns = 160 Gb/s; in practice, closer to 80 Gb/s.
Output Queueing: the "ideal" (animation: packets arriving simultaneously for the same output are all written immediately into that output's queue).
How to Solve the Memory Bandwidth Problem? Use Input Queued Switches In the worst case, one packet is written and one packet is read from an input buffer Maximum memory bandwidth requirement for each memory is 2R bits/s. However, using FIFO input queues can result in what is called “Head-of-Line (HoL)” blocking
Input Queueing: Head-of-Line Blocking (graph of delay vs. load: with FIFO input queues, throughput saturates at 58.6% rather than 100%).
Head of Line Blocking
Virtual Output Queues (VoQ) At each input port, there are N queues – each associated with an output port Only one packet can go from an input port at a time Only one packet can be received by an output port at a time It retains the scalability of FIFO input-queued switches It eliminates the HoL problem with FIFO input Queues
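A minimal sketch of the VOQ structure and of one cell time under the two constraints listed above (at most one packet leaves each input and at most one packet enters each output). The trivial greedy matcher is only illustrative; real input-queued switches use iterative schedulers such as iSLIP.

```python
from collections import deque

class VOQSwitch:
    """Input-queued switch with virtual output queues: input i keeps a separate
    FIFO for every output j, so a packet blocked on a busy output never blocks
    packets behind it that want other outputs."""
    def __init__(self, n):
        self.n = n
        self.voq = [[deque() for _ in range(n)] for _ in range(n)]

    def enqueue(self, i, j, pkt):
        self.voq[i][j].append(pkt)

    def timeslot(self):
        """One cell time: greedily match inputs to free outputs."""
        used_outputs, delivered = set(), []
        for i in range(self.n):
            for j in range(self.n):
                if j not in used_outputs and self.voq[i][j]:
                    delivered.append((i, j, self.voq[i][j].popleft()))
                    used_outputs.add(j)
                    break                      # input i has sent its one packet
        return delivered

sw = VOQSwitch(3)
sw.enqueue(0, 2, "p1"); sw.enqueue(1, 2, "p2"); sw.enqueue(1, 0, "p3")
print(sw.timeslot())   # input 0 wins output 2; input 1 still sends p3 to output 0
```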
Input Queueing Virtual output queues
Input Queues with Virtual Output Queues (graph of delay vs. load: with VOQs, throughput can reach 100%).
Input Queueing (VOQ) (diagram): the memory bandwidth per linecard is only 2R, but the scheduler can be quite complex!
Combined IQ/SQ Architecture (diagram): inputs 1…N feed a routing fabric whose N output queues share one memory; flow control runs back from the shared memory to the inputs. Can be a good compromise.
A Comparison: memory speeds for a 32x32 switch, cell size = 64 bytes.
Line rate | Shared-memory BW, access time per cell | Input-queued BW, access time
100 Mb/s  | 6.4 Gb/s, 80 ns   | 200 Mb/s, 2.56 µs
1 Gb/s    | 64 Gb/s, 8 ns     | 2 Gb/s, 256 ns
2.5 Gb/s  | 160 Gb/s, 3.2 ns  | 5 Gb/s, 102.4 ns
10 Gb/s   | 640 Gb/s, 0.8 ns  | 20 Gb/s, 25.6 ns
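The table's numbers can be reproduced from the definitions used so far: the shared memory must do 2N operations per cell time (bandwidth 2NR), an input queue only 2 (bandwidth 2R), and the access time is the cell time divided by the number of operations.

```python
# Memory requirements for an N x N switch with 64-byte cells.
def requirements(line_rate_gbps, N=32, cell_bytes=64):
    cell_time_ns = cell_bytes * 8 / line_rate_gbps        # ns per cell at line rate
    return {
        "shared_mem_bw_gbps":   2 * N * line_rate_gbps,
        "shared_mem_access_ns": cell_time_ns / (2 * N),
        "iq_mem_bw_gbps":       2 * line_rate_gbps,
        "iq_access_ns":         cell_time_ns / 2,
    }

for rate in (0.1, 1, 2.5, 10):                            # Gb/s, as in the table
    print(rate, requirements(rate))
```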
Scalability of Switching Fabrics
Shared Bus. It is the simplest interconnect possible; its protocols are very well established; and multicasting and broadcasting are natural. However, it has a scalability problem because multiple transmissions cannot occur concurrently. Its maximum bandwidth is around 100 Gb/s, which limits the maximum number of I/O ports and/or the line rates. It is typically used for "small" shared-memory or output-queued switches, and is a very good choice for Ethernet switches.
Crossbars. The crossbar is becoming the preferred interconnect for high-speed switches: very high throughput, with support for QoS and multicast. It needs N² crosspoints, but that is no longer the real limitation nowadays. (Diagram: data in, data out, configuration.)
Limiting factors for a crossbar switch: N² crosspoints per chip; it's not obvious how to build a crossbar from multiple chips; and the "I/O" capacity per chip. State of the art: about 200 pins, each operating at 3.125 Gb/s, which is roughly 600 Gb/s per chip; only about 1/3 to 1/2 of this capacity is available in practice because of overhead and speedup. Crossbar chips today are limited by their I/O capacity.
Limitations to Building Large Crossbar Switches: I/O pins. The maximum practical bit rate per pin is ~3.125 Gb/s, and at this speed you need between 2 and 4 pins per data line. To achieve a 10 Gb/s (OC-192) line rate you need around 4 parallel data lines (4-bit parallel transmission). For example, consider a 4-bit-parallel 64-input crossbar designed to support OC-192 line rates per port: each port interface requires 4 x 3 = 12 pins in each direction, so a 64-port crossbar needs 12 x 64 x 2 = 1536 pins just for the I/O data lines. Hence the real problem is the I/O pin limitation. How do we solve it?
Scaling: trying to build a crossbar from multiple chips (diagram: to build a 16x16 crossbar from 4-input, 4-output building blocks, each block would need eight inputs and eight outputs!).
How to build a scalable crossbar: use bit slicing, i.e. parallel crossbars. For example, the previous design can be built from 4 parallel 1-bit crossbars. Each port interface then requires only 1 x 3 = 3 pins in each direction, so a 64-port crossbar needs 3 x 64 x 2 = 384 pins for the I/O data lines, which is reasonable (but we need 4 chips).
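The pin counts on this and the previous slide come from the same simple formula (assuming 3 pins per high-speed data line, as in the slides, and counting pins in both directions):

```python
def io_pins(ports, lines_per_port, pins_per_line=3):
    # data pins only, input direction + output direction
    return lines_per_port * pins_per_line * ports * 2

print(io_pins(64, 4))   # 4-bit-parallel ports on one chip: 1536 pins
print(io_pins(64, 1))   # bit-sliced, 1 bit per chip:        384 pins (times 4 chips)
```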
Scaling: bit-slicing (diagram): each cell is "striped" across multiple identical crossbar planes (a crossbar-switched "bus"); the scheduler makes the same decision for all slices.
Scaling: time-slicing (diagram): each cell goes over one plane and takes N cell times; the scheduler is unchanged and makes a decision for each slice in turn.
HKUST 10 Gb/s 256x256 Crossbar Switch Fabric Design. The overall switch fabric is an OC-192 256x256 crossbar switch, composed of 8 256x256 crossbar chips, each running at 2 Gb/s (to compensate for the overhead and to provide a switch speedup). The deserializer (DES) converts the OC-192 10 Gb/s data on the fiber link into 8 low-speed signals, and the serializer (SER) serializes the low-speed signals back onto the fiber link.
Architecture of the Crossbar Chip Crossbar Switch Core – fulfills the switch functions Control – configures the crossbar core High speed data link – communicates between this chip and SER/DES PLL – provides on-chip precise clock
Technical Specification of our Core-Crossbar Chip:
Full crossbar core: 256x256 (embedded with 2 bit-slices)
Technology: TSMC 0.25 µm SCN5M Deep (lambda = 0.12 µm)
Layout size: 14 mm x 8 mm
Transistor count: 2000k
Supply voltage: 2.5 V
Clock frequency: 1 GHz
Power: 40 W
Layout of a 256*256 crossbar switch core
HKUST Crossbar Chip in the News Researchers offer alternative to typical crossbar design http://www.eetimes.com/story/OEG20020820S0054 By Ron Wilson - EE Times August 21, 2002 (10:56 a.m. ET) PALO ALTO, Calif. — In a technical paper presented at the Hot Chips conference here Monday (Aug.19) researchers Ting Wu, Chi-Ying Tsui and Mounir Hamdi from Hong Kong University of Science and Technology (China) offered an alternative pipeline approach to crossbar design. Their approach has yielded a 256-by-256 signal switch with a 2-GHz input bandwidth, simulated in a 0.25-micron, 5-metal process. The growing importance of crossbar switch matrices, now used for on-chip interconnect as well as for switching fabric in routers, has led to increased study of the best ways to build these parts.
Scaling a crossbar Conclusion: scaling the capacity is relatively straightforward (although the chip count and power may become a problem). In each scheme so far, the number of ports stays the same, but the speed of each port is increased. What if we want to increase the number of ports? Can we build a crossbar-equivalent from multiple stages of smaller crossbars? If so, what properties should it have?
Multi-Stage Switches
Basic Switch Element: a 2x2 element with two states, cross and through, and optional buffering. It is the equivalent of a crosspoint in a crossbar (although counting crosspoints is no longer a good argument).
Example of a Multistage Switch: an 8x8 banyan built from 2x2 elements needs on the order of N log N internal crosspoints, fewer than the crossbar's N². (Diagram: inputs 000-111 and outputs 000-111 connected through three stages of 2x2 elements, with perfect-shuffle wiring between stages, i.e. interleaving one half of the "deck" of links with the other half.)
Packet Routing: the bits of the destination address provide the required routing tags; the digits of the destination port address set the state of the switch elements at each stage. (Diagram: the highlighted bit of the destination address, e.g. 011 or 101, controls the switch setting at stage 1, 2 and 3, with perfect-shuffle wiring between stages.)
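Destination-tag routing is easy to simulate. The sketch below models the fabric as an 8x8 shuffle-exchange (Omega-style) banyan, one common realization of the perfect-shuffle network drawn on the slide: bit k of the destination address (MSB first) sets the 2x2 element at stage k, so the cell reaches the right output regardless of which input it entered on.

```python
def route(src, dst, n=3):
    """Return the internal link (element-output position) used after each stage."""
    N, pos, path = 1 << n, src, []
    for k in range(n):
        pos = ((pos << 1) | (pos >> (n - 1))) & (N - 1)   # perfect shuffle (rotate left)
        bit = (dst >> (n - 1 - k)) & 1                    # destination bit, MSB first
        pos = (pos & ~1) | bit                            # 2x2 exchange: upper=0, lower=1
        path.append(pos)
    return path                                            # path[-1] == dst

print(route(0, 0b011))   # [0, 1, 3]: arrives at output 3
print(route(4, 0b010))   # [0, 1, 2]: arrives at output 2
# In this model the two paths contend for the same internal links after the
# first and second stages.
```

Running it for the two connections of the next slide (input 0 to output 3, input 4 to output 2) shows their paths sharing internal links, which is the internal blocking illustrated there.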
Internal blocking: internal link blocking as well as output blocking can happen in a multistage switch. (Diagram: connections from input 0 to output 3 (011) and from input 4 to output 2 (010) contend for the same internal link, a blocking link, even though they want different outputs.)
Output blocking. (Diagram: connections from input 1 to output 6 (110) and from input 3 to output 6 (110) both want the same output and therefore block each other.)
3-stage Clos Network (diagram): N = n x m ports; the first stage has m switches of size n x k, the middle stage has k switches of size m x m, and the last stage has m switches of size k x n, with k >= n.
Clos-network properties (expansion factors), in the notation of the previous slide (n inputs per first-stage switch, k middle-stage switches): strictly nonblocking iff k >= 2n - 1; rearrangeably nonblocking iff k >= n.
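These two conditions are trivial to check in code; the example values below are the ones from the 1024x1024 construction a few slides later (n = 32 inputs per first-stage switch, k = 48 middle-stage switches).

```python
# Clos-network conditions from the slide (n inputs per first-stage switch,
# k middle-stage switches).
def strictly_nonblocking(n, k):      return k >= 2 * n - 1
def rearrangeably_nonblocking(n, k): return k >= n

n = 32
print(strictly_nonblocking(n, 48))        # False: 48 < 2*32 - 1 = 63
print(rearrangeably_nonblocking(n, 48))   # True
print(strictly_nonblocking(n, 63))        # True; expansion 63/32 = 2 - 1/32
```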
3-stage Fabrics (Basic building block – a crossbar) Clos Network
3-Stage Fabrics (Clos network): the expansion factor required for strict nonblocking is (2n - 1)/n = 2 - 1/n, i.e. k = 2n - 1 middle-stage switches (but the fabric is still blocking for multicast).
4-Port Clos Network Strictly Non-blocking
Construction example: a 1024x1024 switch built from three stages.
Input stage: thirty-two 32x48 switches
Central stage: forty-eight 48x48 switches
Output stage: thirty-two 48x32 switches
Expansion: 48/32 = 1.5
(Diagram: inputs 1-1024 in groups of 32 onto the 32 input switches, and likewise for the outputs.)
Lucent Architecture Buffers
MSM Architecture
Cisco's 46 Tb/s Switch System (diagram): 80 chassis in total, 72 line card chassis (LCC) holding up to 1152 linecards and 576 S1/S3 18x18 switch elements, plus 8 fabric card chassis (FCC) holding 144 S2 72x72 switch elements; a 1296x1296 3-stage Benes switch fabric with 8 switch planes, 12.5G links and a speedup of 2.5; multicast performed in the fabric; 1:N fabric redundancy; 40 Gb/s packet processor (188 RISCs) per line card.
Massively Parallel Switches Instead of using tightly coupled fabrics like a crossbar or a bus, they use massively parallel interconnects such as hypercube, 2D torus, and 3D torus. Few companies use this design architecture for their core routers These fabrics are generally scalable However: It is very difficult to guarantee QoS and to include value-added functionalities (e.g., multicast, fair bandwidth allocation) They consume a lot of power They are relatively costly
Massively Parallel Switches
3D Switching Fabric: Avici. Three components: topology, a 3D torus; routing, source routing with randomization; flow control, virtual channels and virtual networks. Maximum configuration: 14 x 8 x 5 = 560 nodes; channel speed is 10 Gb/s.
Packaging Uniformly short wires between adjacent nodes Can be built in passive backplanes Run at high speed Figures are from Scalable Switching Fabrics for Internet Routers, by W. J. Dally (can be found at www.avici.com)
Avici: Velociti™ Switch Fabric Toroidal direct connect fabric (3D Torus) Scales to 560 active modules Each element adds switching & forwarding capacity Each module connects to 6 other modules
Switch fabric chips comparison. Here is some information comparing switch-fabric chips from different companies. The crossbar is the dominant architecture for switch fabrics designed today. Interestingly, switch fabrics tend to be priced according to switching capacity (quoted per 10 Gbit/s). Shared-memory architectures generally have relatively low power consumption, while buffered-crossbar architectures usually have higher power consumption. http://www.lightreading.com/document.asp?doc_id=47959