1
Introduction to High-Performance Internet Switches and Routers
2
Network Architecture
Diagram: long-haul DWDM core with 10GbE core routers, metropolitan networks with 10GbE edge routers and edge switches, and campus/residential access switches and access routers connected over GbE/10GbE.
3
Diagram: points of presence (POPs) interconnected across the backbone.
4
How the Internet really is: Current Trend
Diagram labels: modems and DSL in the access network; SONET/SDH and DWDM in the core. Speaker notes: In the access network, the majority of users use either a phone modem or a DSL line to connect to the Internet. So why don't textbooks talk about these circuit switches? Because these circuits are not integrated with IP; IP sees them as static point-to-point links. We have this architecture for historical reasons, not because of a well-thought-out design process. In the past, when an Internet Service Provider on the West Coast wanted to connect with another ISP on the East Coast, it had two options: lay down a cable, which was very expensive, or rent a circuit from the long-distance telephone carriers, who were using circuit switching. Also, when the ISP wanted to reach residential customers, it would go through one of the few companies with a connection to the home, the local phone company. Now, is this hybrid architecture the right network architecture? Wouldn't it be better to have one that uses only packets, or perhaps one that uses only circuits? Well, let me define some performance criteria that I will use to answer those questions.
5
Circuit-switched crossconnects, DWDM, etc.
The Internet is a mesh of routers mostly interconnected by (ATM and) SONET (and DWDM), riding on circuit-switched TDM crossconnects. Speaker notes: The Internet, as most of us know it, is comprised of a mesh of routers interconnected by links. Nodes on the Internet, both end hosts and routers, communicate using the Internet Protocol, commonly known as IP. IP packets travel over links from one router to the next on their way towards the final destination.
6
Typical (BUT NOT ALL) IP Backbone (Late 1990’s)
Diagram: core router, ATM switch, SONET/SDH DCS, ADMs, and MUXes. Speaker notes: This slide shows the equipment layering for a typical IP backbone in the late 1990s. Data was piggybacked over a traditional voice/TDM transport network. Historically, this made sense; today, it doesn't.
7
Points of Presence (POPs)
Diagram: sites A-F interconnected through POP1-POP8.
8
Where High Performance Routers are Used
Diagram: routers R1-R16 interconnected by 2.5 Gb/s links.
9
Hierarchical arrangement
Diagram: end hosts (1000s per access multiplexer) connect through access multiplexers to edge routers, which connect to core routers; POPs are linked by 10 Gb/s "OC192" long-haul links. POP: Point of Presence, richly interconnected by a mesh of long-haul links. Typically ~40 POPs per national network operator, with multiple core routers per POP.
10
Typical POP Configuration
Diagram: transport network (DWDM/SONET terminal) connects to the backbone (core) routers over 10G WAN transport links; core routers are interconnected by 10G router-to-router intra-office links; aggregation switches/routers (edge switches) sit below. More than 50% of the high-speed interfaces are router-to-router.
11
Today’s Network Equipment
Layer 3: Routers | Internet protocols
Layer 2: Switches | FR & ATM
Layer 1: SONET | SONET
Layer 0: DWDM | DWDM
12
Functions in a packet switch
Diagram: ingress linecard (framing, route lookup, TTL processing, buffering), interconnect (with interconnect scheduling), egress linecard (buffering, QoS scheduling, framing), plus a control plane. The data path, control path, and scheduling path are marked.
13
Functions in a circuit switch
Diagram: ingress linecard (framing), interconnect (with interconnect scheduling), egress linecard (framing), plus a control plane. The data path and control path are marked.
14
Our emphasis for now is on packet switches (IP, ATM, Ethernet, frame relay, etc.)
If anyone has additional comments, please speak up.
15
What a Router Looks Like
Cisco GSR 12416: 19" rack, 6 ft tall, 2 ft deep; capacity 160 Gb/s; power 4.2 kW.
Juniper M160: 19" rack, 3 ft tall, 2.5 ft deep; capacity 80 Gb/s; power 2.6 kW.
16
A Router Chassis: fans/power supplies, linecards.
17
Backplane: a circuit board with connectors for line cards
High-speed electrical traces connecting line cards to the fabric
Usually passive
Typically 30-layer boards
18
Line Card Picture
19
What do these two have in common?
Cisco CRS-1 Cisco Catalyst 3750G
20
What do these two have in common?
CRS-1 linecard: 20" x (18"+11") x 1RU; 40 Gbps, 80 Mpps; state-of-the-art 0.13u silicon; full IP routing stack including IPv4 and IPv6 support; distributed IOS; multi-chassis support.
Cat 3750G Switch: 19" x 16" x 1RU; 52 Gbps, 78 Mpps; state-of-the-art 0.13u silicon; full IP routing stack including IPv4 and IPv6 support; distributed IOS; multi-chassis support.
21
What is different between them?
Cisco CRS-1 Cisco Catalyst 3750G
22
A lot…
CRS-1 linecard: up to 1024 linecards; fully programmable forwarding; 2M prefix entries and 512K ACLs; 46 Tbps 3-stage switching fabric; MPLS support; H-A non-stop routing protocols.
Cat 3750G Switch: up to 9 stack members; hardwired ASIC forwarding; 11K prefix entries and 1.5K ACLs; 32 Gbps shared stack ring; L2 switching support; re-startable routing applications.
Also note that the CRS-1 line card is about 30x the material cost of the Cat 3750G.
23
Other packet switches
Cisco 7500 "edge" routers
Lucent GX550 core ATM switch
DSL router
24
What is Routing?
Diagram: hosts A-F connected through routers R1-R5; a routing table maps Destination (D, E, F) to Next Hop (e.g., R3, R5).
25
What is Routing?
Diagram: the same network with the 20-byte IPv4 header expanded: Version, HLen, Type of Service, Total Packet Length, Fragment ID, Flags, Fragment Offset, TTL, Protocol, Header Checksum, Source Address, Destination Address, Options (if any), then data. The Destination Address is what gets looked up in the Destination-to-Next-Hop table.
26
What is Routing?
Diagram: the packet is forwarded hop by hop through routers R1-R5 toward its destination (hosts A-F).
27
Basic Architectural Elements of a Router
Control plane ("typically in software"): routing, routing table updates (OSPF, RIP, IS-IS), admission control, congestion control, reservation.
Switching / per-packet processing ("typically in hardware"): routing lookup, packet classification, switching, arbitration, scheduling.
28
Basic Architectural Components Datapath: per-packet processing
Diagram: (1) a forwarding decision is made against the forwarding table at each input, (2) the packet crosses the interconnect, (3) output scheduling transmits it.
29
Per-packet processing in a Switch/Router
1. Accept the packet arriving on an ingress line.
2. Look up the packet's destination address in the forwarding table to identify the outgoing interface(s).
3. Manipulate the packet header: e.g., decrement the TTL, update the header checksum.
4. Send the packet to the outgoing interface(s).
5. Queue the packet until the line is free.
6. Transmit the packet onto the outgoing line.
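To make the sequence concrete, here is a minimal Python sketch of those six steps. The Packet type, table contents, and interface names are made up for illustration, and the lookup is a plain exact match (a real router performs a longest-prefix match, sketched a few slides further on):

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Packet:
    dst: str                 # destination address, kept as a plain string here
    ttl: int
    payload: bytes = b""

@dataclass
class OutputPort:
    queue: deque = field(default_factory=deque)

def process_packet(pkt, forwarding_table, ports):
    """Steps 1-5: accept, look up, edit header, hand to output, queue."""
    out_if = forwarding_table.get(pkt.dst)      # step 2 (exact match; real routers
                                                # do a longest-prefix match)
    if out_if is None or pkt.ttl <= 1:
        return None                             # no route or TTL expired: drop
    pkt.ttl -= 1                                # step 3: decrement TTL (checksum omitted)
    ports[out_if].queue.append(pkt)             # steps 4-5: to outgoing interface, queue
    return out_if

def transmit(port):
    """Step 6: transmit the head-of-queue packet onto the outgoing line."""
    return port.queue.popleft() if port.queue else None

ports = {"if0": OutputPort(), "if1": OutputPort()}
table = {"host-D": "if0", "host-E": "if1"}      # hypothetical destinations
process_packet(Packet(dst="host-D", ttl=64), table, ports)
print(transmit(ports["if0"]))
```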
30
ATM Switch
Look up the cell's VCI/VPI in the VC table.
Replace the old VCI/VPI with the new one.
Forward the cell to the outgoing interface.
Transmit the cell onto the link.
31
Ethernet Switch
Look up the frame's destination address (DA) in the forwarding table.
If known, forward to the correct port; if unknown, broadcast to all ports.
Learn the source address (SA) of the incoming frame.
Forward the frame to the outgoing interface.
Transmit the frame onto the link.
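A small sketch of that flood-and-learn behavior, with hypothetical port numbers and MAC strings rather than any product's forwarding code:

```python
class EthernetSwitch:
    """Flood-and-learn forwarding, as described above."""
    def __init__(self, num_ports):
        self.ports = list(range(num_ports))
        self.fdb = {}                      # forwarding table: MAC address -> port

    def handle_frame(self, in_port, src_mac, dst_mac):
        self.fdb[src_mac] = in_port        # learn the SA of the incoming frame
        out = self.fdb.get(dst_mac)
        if out is not None and out != in_port:
            return [out]                   # known DA: forward to the correct port
        return [p for p in self.ports if p != in_port]   # unknown DA: broadcast

sw = EthernetSwitch(num_ports=4)
print(sw.handle_frame(0, "aa:aa", "bb:bb"))   # bb:bb unknown -> flood to ports 1, 2, 3
print(sw.handle_frame(1, "bb:bb", "aa:aa"))   # aa:aa was learned on port 0 -> [0]
```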
32
IP Router
Look up the packet's destination address (DA) in the forwarding table.
If known, forward to the correct port; if unknown, drop the packet.
Decrement the TTL and update the header checksum.
Forward the packet to the outgoing interface.
Transmit the packet onto the link.
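In an IP router the lookup is a longest-prefix match rather than an exact match, and the TTL decrement is paired with an incremental checksum update. A rough sketch using Python's ipaddress module, with a hypothetical three-entry FIB and a simplified RFC 1141-style checksum adjustment (real linecards use tries or TCAMs, not a linear scan):

```python
import ipaddress

# hypothetical forwarding table: prefix -> outgoing port
FIB = {
    ipaddress.ip_network("10.0.0.0/8"): "port1",
    ipaddress.ip_network("10.1.0.0/16"): "port2",
    ipaddress.ip_network("0.0.0.0/0"): "port0",     # default route
}

def longest_prefix_match(dst):
    addr = ipaddress.ip_address(dst)
    matches = [(net, port) for net, port in FIB.items() if addr in net]
    if not matches:
        return None                                  # unknown destination: drop
    return max(matches, key=lambda m: m[0].prefixlen)[1]

def decrement_ttl(ttl, checksum):
    """Simplified incremental IPv4 checksum update after TTL -= 1."""
    ttl -= 1
    # TTL sits in the high byte of its 16-bit header word, so that word drops
    # by 0x0100; the one's-complement checksum rises by the same amount.
    checksum += 0x0100
    checksum = (checksum & 0xFFFF) + (checksum >> 16)   # end-around carry
    return ttl, checksum

print(longest_prefix_match("10.1.2.3"))   # -> port2 (the most specific prefix wins)
print(decrement_ttl(64, 0x1234))
```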
33
Special per packet/flow processing
The router can be equipped with additional capabilities to provide special services on a per-packet or per-class basis. It can perform additional processing on incoming packets:
Classifying the packet: IPv4, IPv6, MPLS, ...
Delivering packets according to a pre-agreed service, absolute or relative (e.g., send a packet within a given deadline, or give one packet better service than another: IntServ / DiffServ)
Filtering packets for security reasons
Treating multicast packets differently from unicast packets
34
Per-packet Processing Must be Fast!!!
Year | Aggregate line rate | Arrival rate of 40B POS packets (Mpps)
1997 | 622 Mb/s | 1.56
1999 | 2.5 Gb/s | 6.25
2001 | 10 Gb/s | 25
2003 | 40 Gb/s | 100
2006 | 80 Gb/s | 200
Packet processing must be simple and easy to implement. Memory access time is the bottleneck: 200 Mpps x 2 lookups/pkt = 400 Mlookups/sec, i.e., 2.5 ns per lookup.
35
First Generation Routers
Diagram: line interfaces (MAC) on a shared backplane with a central CPU, route table, and buffer memory. Typically <0.5 Gb/s aggregate capacity.
36
Bus-based Router Architectures with Single Processor
The first generation of IP routers was based on software implementations on a single general-purpose CPU. Limitations:
A serious processing bottleneck in the central processor.
Memory-intensive operations (e.g., table lookup and data movement) limit the effectiveness of the processor power.
The input/output (I/O) bus is a severe limiting factor on overall router throughput.
37
Second Generation Routers
Diagram: line cards (MAC, buffer memory, forwarding cache) on a shared bus, with a central CPU, route table, and buffer memory. Typically <5 Gb/s aggregate capacity.
38
Bus-based Router Architectures with Multiple Processors
Architectures with route caching (second-generation IP routers) distribute the packet forwarding operations across the network interface cards, each with its own processor and route cache, so packets are transmitted only once over the shared bus. Limitations:
The central routing table is a bottleneck at high speeds.
Throughput is traffic dependent.
The shared bus is still a bottleneck.
39
Limitation of IP Packet Forwarding based on Route Caching
Routing changes invalidate existing cache entries, which then need to be re-established. The performance depends on: (a) how big the cache is, (b) how the cache is maintained, and (c) the performance of the slow path.
Solution: use a forwarding database in each network interface.
Benefits: performance, scalability, network resilience, and functionality.
40
Third Generation Routers
Diagram: line cards (MAC, local buffer memory, forwarding table) and a CPU card (routing table memory) attached to a switched backplane. Typically <50 Gb/s aggregate capacity.
41
Switch-based Router Architectures with Fully Distributed Processors
To avoid bottlenecks in processing power, memory bandwidth, and internal bus bandwidth, each network interface is equipped with appropriate processing power and buffer space.
42
Fourth Generation Routers/Switches Optics inside a router for the first time
Diagram: linecards connected to a switch core over optical links hundreds of metres long. Tb/s routers in development.
43
Examples: Juniper TX8/T640, Alcatel 7670 RSP, Avici TSR, Chiaro.
44
Next Gen. Backbone Network Architecture – One backbone, multiple access networks
Diagram: a single (G)MPLS-based, multi-service intelligent packet backbone connecting multiple access networks: DSL/FTTH/dial access for telecommuters and residential users, dual-stack IPv4-IPv6 enterprise networks, cable networks, mobile networks (SGSN/GGSN service POPs), an IPv6 IX, and ISPs offering native IPv6 services, all attached via CE routers and PE routers (service POPs). One backbone network maximizes speed, flexibility, and manageability.
45
Current Generation: Generic Router Architecture
Diagram: header processing per packet: the IP address is looked up in the address table (~1M prefixes, off-chip DRAM) to find the next hop, the header is updated, and the packet is queued in buffer memory (~1M packets, off-chip DRAM).
46
Current Generation: Generic Router Architecture (IQ)
Diagram (input queued): each of the N linecards performs header processing (IP address lookup against its address table, header update) and queues packets in its own buffer memory; a central scheduler arbitrates access to the interconnect.
47
Current Generation: Generic Router Architecture (OQ)
Diagram (output queued): each packet is header-processed at its input and immediately placed into the buffer memory (queue) of its output port 1 through N.
48
Basic Architectural Elements of a Current Router
Diagram: a typical IP router linecard contains the physical layer, framing & maintenance, packet processing with lookup tables, and buffer management & scheduling with buffer & state memory, connected over the backplane to a buffered or bufferless fabric (e.g., crossbar, bus) with a scheduler. An OC192c linecard: ~10-30M gates, ~2 Gbits of memory, ~2 square feet, >$10k cost, ~$100k price.
49
Performance metrics
Capacity: "maximize C, s.t. volume < 2 m³ and power < 5 kW."
Throughput: operators like to maximize the usage of expensive long-haul links.
Controllable delay: some users would like predictable delay; this is feasible with output queueing plus weighted fair queueing (WFQ).
50
Why do we Need Faster Routers?
To prevent routers from becoming the bottleneck in the Internet. To increase POP capacity, and to reduce cost, size and power.
51
Why we Need Faster Routers To prevent routers from being the bottleneck
Line capacity: 2x / 7 months
User traffic: 2x / 12 months
Router capacity: 2.2x / 18 months
Moore's Law: 2x / 18 months
DRAM random access time: 1.1x / 18 months
52
Why we Need Faster Routers 1: To prevent routers from being the bottleneck
Chart: disparity between traffic growth and router capacity growth, roughly a 5-fold disparity.
53
Why we Need Faster Routers 2: To reduce cost, power & complexity of POPs
Big POPs need big routers. Comparison: a POP with a few large routers versus a POP with many smaller routers. Interfaces: price >$200k, power >400W. About 50-60% of interfaces are used for interconnection within the POP. The industry trend is towards a large, single router per POP.
54
A Case study: UUNET Internet Backbone Build Up
1999 view (4Q): 8 OC-48 links between POPs (not parallel). 2000 view (4Q): 52 OC-48 links between POPs, many of them parallel, plus 3 OC-192 super-POP links with multiple parallel interfaces between POPs (D.C. – Chicago; NYC – D.C.). Speaker notes: So let's take a real case, a real network: UUNET's. Here is the view of the UUNET backbone as of 4Q 1999: there were a total of 8 OC-48 segments, none of which required multiple OC-48 links. Now, here is the same network 11 months later. There are now 52 OC-48 segments, many of which required multiple parallel OC-48 links. In addition, there are now 3 OC-192 segments, two of which require parallel links. We visited UUNET and discussed their network requirements. Their highly conservative estimate indicated that they would need 20G links per point-to-point segment within 2 years, and 40G and higher in 2003 and beyond. They clearly indicated that, to meet the traffic growth demand, higher port speed is required in addition to total system capacity. (Additional data point: AT&T has 4 10G segments now, with 15 planned for 2001.) To meet the traffic growth, higher-performance routers with higher port speeds are required.
55
Why we Need Faster Routers 2: To reduce cost, power & complexity of POPs
Once a router is sufficiently available, it is possible to make this step. This further reduces CapEx and operational cost, and further increases network stability.
56
Ideal POP
Diagram: gigabit routers sit directly on the carrier optical transport, alongside the existing carrier equipment: VoIP gateways, SONET, DWDM and optical switches, DSL aggregation, ATM, Gigabit Ethernet, and cable modem aggregation.
57
Why are Fast Routers Difficult to Make?
Big disparity between line rates and memory access speed
58
Problem: Fast Packet Buffers
Example: a 40 Gb/s packet buffer. Size = RTT x BW = 10 Gb; 64-byte packets. Diagram: the buffer manager writes into and reads from the buffer memory at rate R, one packet every 12.8 ns in each direction. Speaker notes (motivation): OC-768 line-rate buffering is the goal; why is it not possible today? Use SRAM? Its random access time is fast enough, but its density is too low to store 10 Gb of data. Use DRAM? Its high density means we can store the data, but it is too slow (~50 ns random access time).
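The two numbers on this slide come from small calculations: the rule-of-thumb buffer size RTT x BW, and the per-packet time budget. A quick sketch, where the 0.25 s RTT is the value implied by 10 Gb at 40 Gb/s rather than something stated explicitly:

```python
line_rate = 40e9          # 40 Gb/s
rtt = 0.25                # seconds; 0.25 s * 40 Gb/s gives the 10 Gb on the slide
pkt_bits = 64 * 8         # 64-byte packets

buffer_bits = rtt * line_rate
pkt_time_ns = pkt_bits / line_rate * 1e9

print(f"buffer = {buffer_bits/1e9:.0f} Gb")        # 10 Gb
print(f"one packet every {pkt_time_ns:.1f} ns")    # 12.8 ns per write and per read
```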
59
Memory Technology (2006)
Technology | Max single-chip density | $/chip ($/MByte) | Access speed | Watts/chip
Networking DRAM | 64 MB | $30-$50 ($0.50-$0.75) | 40-80 ns | 0.5-2 W
SRAM | 8 MB | $50-$60 ($5-$8) | 3-4 ns | 2-3 W
TCAM | 2 MB | $200-$250 ($100-$125) | 4-8 ns | 15-30 W
60
How fast can a buffer be made?
~5 ns per memory operation for SRAM, ~50 ns for DRAM, with a 64-byte-wide bus between the external line and the buffer memory. Rough estimate: two memory operations per packet (one write, one read), so the maximum is roughly 50 Gb/s with SRAM or 5 Gb/s with DRAM. Aside: buffers need to be large for TCP to work well, so DRAM is usually required.
61
Packet Caches
Diagram: the buffer manager keeps a small SRAM cache of the head and tail of each FIFO queue, while the bulk of each queue lives in DRAM buffer memory. Arriving packets go into the tail cache, departing packets are read from the head cache, and transfers between SRAM and DRAM move b >> 1 packets at a time.
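A toy model of that head/tail caching idea, assuming, as the figure suggests, that DRAM is touched only in bursts of b packets per queue while the SRAM caches absorb single-packet arrivals and departures. The class name and burst size are illustrative, not a real buffer manager's design:

```python
from collections import deque

class CachedFifo:
    """One logical FIFO: small SRAM head/tail caches, bulk storage in DRAM."""
    def __init__(self, b=8):
        self.b = b                       # DRAM burst size (b >> 1 packets at a time)
        self.tail_sram = deque()         # recently arrived packets
        self.dram = deque()              # bulk of the queue
        self.head_sram = deque()         # packets staged for departure

    def write(self, pkt):                # arrival path touches SRAM only
        self.tail_sram.append(pkt)
        if len(self.tail_sram) >= self.b:          # spill a whole burst to DRAM
            self.dram.extend(self.tail_sram.popleft() for _ in range(self.b))

    def read(self):                      # departure path touches SRAM only
        if not self.head_sram:
            if self.dram:                           # refill a whole burst from DRAM
                n = min(self.b, len(self.dram))
                self.head_sram.extend(self.dram.popleft() for _ in range(n))
            elif self.tail_sram:                    # short queue: bypass DRAM
                self.head_sram.append(self.tail_sram.popleft())
        return self.head_sram.popleft() if self.head_sram else None

q = CachedFifo(b=4)
for i in range(10):
    q.write(i)
print([q.read() for _ in range(10)])     # packets come out in FIFO order
```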
62
Why are Fast Routers Difficult to Make?
Packet processing gets harder. Chart: instructions available per arriving byte over time; what we'd like (more features: QoS, multicast, security, ...) versus what will happen.
63
Why are Fast Routers Difficult to Make?
Clock cycles per minimum length packet since 1996
64
Options for packet processing
General-purpose processor: MIPS, PowerPC, Intel
Network processor: Intel IXA and IXP processors, IBM Rainier; control-plane processors: SiByte (Broadcom), QED (PMC-Sierra)
FPGA
ASIC
65
General Observations
Up until about 2000:
Low-end packet switches used general-purpose processors.
Mid-range packet switches used FPGAs for the datapath and general-purpose processors for the control plane.
High-end packet switches used ASICs for the datapath and general-purpose processors for the control plane.
More recently:
Third-party network processors are now used in many low- and mid-range datapaths.
Home-grown network processors are used in the high end.
66
Why are Fast Routers Difficult to Make?
Demand for router performance exceeds Moore's Law. Growth in capacity of commercial routers (per rack):
1992: ~2 Gb/s
1995: ~10 Gb/s
1998: ~40 Gb/s
2001: ~160 Gb/s
2003: ~640 Gb/s
Average growth rate: 2.2x / 18 months.
67
Maximizing the throughput of a router Engine of the whole router
Operators increasingly demand throughput guarantees:
To maximize use of expensive long-haul links
For predictability and planning
To serve as many customers as possible
To increase the lifetime of the equipment
Despite lots of effort and theory, no commercial router today has a throughput guarantee.
68
Maximizing the throughput of a router Engine of the whole router
Diagram: ingress linecard (framing, route lookup, TTL processing, buffering), interconnect (with interconnect scheduling), egress linecard (buffering, QoS scheduling, framing), plus a control plane; the data path, control path, and scheduling path are highlighted.
69
Maximizing the throughput of a router Engine of the whole router
This depends on the switching architecture: input queued, output queued, or shared memory. It also depends on the arbitration/scheduling algorithms within the specific architecture. This is key to the overall performance of the router.
70
Why are Fast Routers Difficult to Make?
Power: it is exceeding practical limits.
71
Switching Architectures
72
Generic Router Architecture
Diagram (output queued): each packet is header-processed (IP address lookup, header update) and then written into the buffer memory of its output port; the path into each output buffer must run at N times the line rate.
73
Generic Router Architecture
Diagram (input queued): each of the N linecards performs header processing (IP address lookup against its address table, header update) and queues packets in its own buffer memory; a central scheduler connects the inputs to the outputs.
74
Interconnects Two basic techniques
Input queueing: usually a non-blocking switch fabric (e.g., crossbar).
Output queueing: usually a fast bus.
75
Simple model of output queued switch
Diagram: four links at rate R; packets arriving on any ingress are written straight into the egress queue of the link they are destined to.
76
Output Queued (OQ) Switch
Animation: how an OQ switch works.
77
Characteristics of an output queued (OQ) switch
Arriving packets are immediately written into the output queue, without intermediate buffering. The flow of packets to one output does not affect the flow to another output. An OQ switch has the highest throughput, and lowest delay. The rate of individual flows, and the delay of packets can be controlled (QoS).
78
The shared memory switch
A single, physical memory device. Diagram: the ingress and egress of links 1 through N (each at rate R) all read and write the one shared memory.
79
Characteristics of a shared memory switch
80
Memory bandwidth
Basic OQ switch: consider an OQ switch with N different physical memories and all links operating at rate R bits/s. In the worst case, packets may arrive continuously from all inputs, destined to just one output, so the maximum memory bandwidth requirement for each memory is (N+1)R bits/s.
Shared memory switch: the maximum memory bandwidth requirement for the single memory is 2NR bits/s.
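Those worst-case requirements are easy to tabulate. A small helper with N and R as free parameters; the input-queued 2R case anticipates the slides that follow:

```python
def memory_bw_per_device(n_ports, line_rate_bps, architecture):
    """Worst-case memory bandwidth per memory device, per the argument above."""
    if architecture == "output-queued":
        # one output memory may receive from all N inputs while also being read once
        return (n_ports + 1) * line_rate_bps
    if architecture == "shared-memory":
        # the single memory must absorb all N writes and all N reads
        return 2 * n_ports * line_rate_bps
    if architecture == "input-queued":
        # each input buffer sees at most one write and one read per packet time
        return 2 * line_rate_bps
    raise ValueError(architecture)

for arch in ("output-queued", "shared-memory", "input-queued"):
    bw = memory_bw_per_device(32, 10e9, arch)
    print(f"{arch}: {bw/1e9:.0f} Gb/s per memory for N=32, R=10 Gb/s")
```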
81
How fast can we make a centralized shared memory switch?
With 5 ns SRAM as the shared memory and a 200-byte-wide bus: 5 ns per memory operation and two memory operations per packet, therefore up to 160 Gb/s (200 bytes x 8 bits / 10 ns). In practice, closer to 80 Gb/s.
82
Output Queueing The “ideal”
83
How to Solve the Memory Bandwidth Problem?
Use input queued switches: in the worst case, one packet is written into and one packet is read from an input buffer per packet time, so the maximum memory bandwidth requirement for each memory is 2R bits/s. However, using FIFO input queues can result in what is called "head-of-line (HoL)" blocking.
84
Input Queueing Head of Line Blocking
Chart: delay versus load for FIFO input queueing; throughput saturates at 58.6% of the line rate (the x-axis runs to 100%).
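The 58.6% figure (2 minus the square root of 2) can be reproduced with a short Monte Carlo simulation of saturated FIFO input queues under uniform random traffic. This is only a sketch: the asymptotic value holds for large N, so a finite-size run lands slightly above it:

```python
import random
from collections import defaultdict

def hol_throughput(n_ports=32, slots=20000, seed=1):
    """Saturated FIFO input queues with uniform random destinations."""
    rng = random.Random(seed)
    heads = [rng.randrange(n_ports) for _ in range(n_ports)]  # HoL packet destinations
    delivered = 0
    for _ in range(slots):
        contenders = defaultdict(list)
        for inp, dst in enumerate(heads):
            contenders[dst].append(inp)
        for dst, inputs in contenders.items():
            winner = rng.choice(inputs)             # each output serves one HoL packet
            heads[winner] = rng.randrange(n_ports)  # winner's queue reveals a new head
            delivered += 1                          # losers stay blocked behind their HoL
    return delivered / (n_ports * slots)

print(round(hol_throughput(), 3))   # approaches 2 - sqrt(2) ~= 0.586 for large N
```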
85
Head of Line Blocking
88
Virtual Output Queues (VoQ)
At each input port there are N queues, one associated with each output port.
Only one packet can leave an input port at a time.
Only one packet can be received by an output port at a time.
VoQs retain the scalability of FIFO input-queued switches.
They eliminate the HoL problem of FIFO input queues.
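A compact sketch of VoQs with a deliberately simple greedy matching step, just to show where the N queues per input sit and how the one-packet-per-input and one-packet-per-output constraints appear each time slot. Real switches use matching algorithms such as iSLIP; this is not one of them:

```python
from collections import deque

class VoqInput:
    def __init__(self, n_outputs):
        self.voq = [deque() for _ in range(n_outputs)]   # one queue per output

    def enqueue(self, pkt, out):
        self.voq[out].append(pkt)

def greedy_match(inputs):
    """One packet per input, one per output, per time slot (a greedy sketch)."""
    used_outputs = set()
    matches = []
    for i, inp in enumerate(inputs):
        for out, q in enumerate(inp.voq):
            if q and out not in used_outputs:
                matches.append((i, out, q.popleft()))
                used_outputs.add(out)
                break                                    # this input is done this slot
    return matches

inputs = [VoqInput(4) for _ in range(4)]
inputs[0].enqueue("a", 2); inputs[1].enqueue("b", 2); inputs[1].enqueue("c", 3)
print(greedy_match(inputs))   # input 0 gets output 2, input 1 falls back to output 3
```

Note that input 1 is not stuck waiting behind its packet for the busy output, which is exactly the HoL problem the VoQs remove.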
89
Input Queueing Virtual output queues
90
Input Queues Virtual Output Queues
Chart: delay versus load with virtual output queues; throughput can reach 100%.
91
Input Queueing (VoQ): memory bandwidth = 2R; the scheduler can be quite complex!
92
Combined IQ/SQ Architecture Can be a good compromise
Diagram: inputs 1 … N feed a routing fabric; the N output queues live in one shared memory, with flow control signals back to the inputs and packets (data) flowing forward.
93
A Comparison: memory speeds for a 32x32 switch, cell size = 64 bytes
Line rate | Shared-memory BW | Access time per cell | Input-queued BW | Access time
100 Mb/s | 6.4 Gb/s | 80 ns | 200 Mb/s | 2.56 µs
1 Gb/s | 64 Gb/s | 8 ns | 2 Gb/s | 256 ns
2.5 Gb/s | 160 Gb/s | 3.2 ns | 5 Gb/s | 102.4 ns
10 Gb/s | 640 Gb/s | 0.8 ns | 20 Gb/s | 25.6 ns
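The table follows from the cell time at each line rate: the shared memory must perform 2N accesses per cell time, an input queue only 2. A short script that reproduces the rows above:

```python
def memory_speeds(line_rate_bps, n_ports=32, cell_bytes=64):
    cell_time = cell_bytes * 8 / line_rate_bps
    shared_bw = 2 * n_ports * line_rate_bps          # 2N accesses per cell time
    shared_access = cell_time / (2 * n_ports)
    iq_bw = 2 * line_rate_bps                        # 2 accesses per cell time
    iq_access = cell_time / 2
    return shared_bw, shared_access, iq_bw, iq_access

for rate in (100e6, 1e9, 2.5e9, 10e9):
    sbw, sat, ibw, iat = memory_speeds(rate)
    print(f"{rate/1e9:g} Gb/s: shared {sbw/1e9:g} Gb/s, {sat*1e9:g} ns; "
          f"IQ {ibw/1e9:g} Gb/s, {iat*1e9:g} ns")
```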
94
Scalability of Switching Fabrics
95
Shared Bus It is the simplest interconnect possible
Protocols are very well established.
Multicasting and broadcasting are natural.
It has a scalability problem, since we cannot have multiple concurrent transmissions.
Its maximum bandwidth is around 100 Gbps, which limits the maximum number of I/O ports and/or the line rates.
It is typically used for "small" shared-memory or output-queued switches; a very good choice for Ethernet switches.
96
Crossbars
The crossbar is becoming the preferred interconnect for high-speed switches.
Very high throughput; supports QoS and multicast.
N² crosspoints, but that is no longer the real limitation nowadays.
Diagram: data in, data out, and a configuration input.
97
Limiting factors for a crossbar switch:
N² crosspoints per chip.
It's not obvious how to build a crossbar from multiple chips.
Capacity of the "I/O"s per chip. State of the art: about 200 pins, each operating at 3.125 Gb/s, roughly 600 Gb/s per chip; only about 1/3 to 1/2 of this capacity is available in practice because of overhead and speedup.
Crossbar chips today are limited by their "I/O" capacity.
98
Limitations to Building Large Crossbar Switches: I/O pins
Maximum practical bit rate per pin: on the order of Gbits/sec. At this speed you need between 2 and 4 pins per bit. To achieve a 10 Gb/s (OC-192) line rate, you need around 4 parallel data lines (4-bit parallel transmission). For example, consider a 4-bit parallel, 64-input crossbar designed to support OC-192 line rates per port. Each port interface would require 4 x 3 = 12 pins in each direction, so a 64-port crossbar would need 12 x 64 x 2 = 1536 pins just for the I/O data lines. Hence, the real problem is I/O pin limitation. How do we solve this problem?
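The pin counts are straightforward multiplication; a tiny helper makes the tradeoff explicit. The pins_per_line = 3 default mirrors the slide's 4 x 3 = 12 figure and is not a general rule:

```python
def io_pins(ports, data_lines_per_port, pins_per_line=3):
    """Total data I/O pins for a crossbar: both directions, every port."""
    per_direction = data_lines_per_port * pins_per_line
    return per_direction * ports * 2

print(io_pins(64, 4))   # 4-bit parallel, 64 ports -> 1536 pins (as above)
print(io_pins(64, 1))   # 1-bit slice, 64 ports   ->  384 pins per plane
```

The second call corresponds to the bit-sliced design on the next slide: 384 pins per 1-bit plane, at the cost of four chips.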
99
Scaling: Trying to build a crossbar from multiple chips
Diagram: a 16x16 crossbar switch built from a 4-input x 4-output building block; each block ends up needing eight inputs and eight outputs.
100
How to build a scalable crossbar
Use bit slicing (parallel crossbars). For example, we can implement the previous example with 4 parallel 1-bit crossbars. Each port interface would then require 1 x 3 = 3 pins in each direction, so a 64-port crossbar would need 3 x 64 x 2 = 384 pins for the I/O data lines, which is reasonable (but we need 4 chips).
101
Scaling: Bit-slicing
Diagram: each cell is "striped" across N identical crossbar planes (a crossbar-switched "bus"); the scheduler makes the same decision for all slices.
102
Scaling: Time-slicing
Each cell goes over one plane and takes N cell times; the scheduler is unchanged and makes a decision for each slice in turn.
103
HKUST 10Gb/s 256x256 Crossbar Switch Fabric Design
Our overall switch fabric is an OC-192, 256x256 crossbar switch. The system is composed of 8 parallel 256x256 crossbar chips, each running at 2 Gb/s (to compensate for the overhead and to provide a switch speedup). The deserializer (DES) converts the 10 Gb/s OC-192 data from the fiber link into 8 low-speed signals, while the serializer (SER) serializes the low-speed signals back onto the fiber link.
104
Architecture of the Crossbar Chip
Crossbar switch core: fulfills the switching function.
Control: configures the crossbar core.
High-speed data link: communicates between this chip and the SER/DES.
PLL: provides a precise on-chip clock.
105
Technical Specification of our Core-Crossbar Chip
Full crossbar core: 256x256 (embedded with 2 bit-slices)
Technology: TSMC 0.25 µm SCN5M Deep (lambda = 0.12 µm)
Layout size: 14 mm x 8 mm
Transistor count: 2000k
Supply voltage: 2.5 V
Clock frequency: 1 GHz
Power: 40 W
106
Layout of a 256*256 crossbar switch core
107
HKUST Crossbar Chip in the News
Researchers offer alternative to typical crossbar design By Ron Wilson - EE Times August 21, 2002 (10:56 a.m. ET) PALO ALTO, Calif. — In a technical paper presented at the Hot Chips conference here Monday (Aug.19) researchers Ting Wu, Chi-Ying Tsui and Mounir Hamdi from Hong Kong University of Science and Technology (China) offered an alternative pipeline approach to crossbar design. Their approach has yielded a 256-by-256 signal switch with a 2-GHz input bandwidth, simulated in a 0.25-micron, 5-metal process. The growing importance of crossbar switch matrices, now used for on-chip interconnect as well as for switching fabric in routers, has led to increased study of the best ways to build these parts.
108
Scaling a crossbar Conclusion: scaling the capacity is relatively straightforward (although the chip count and power may become a problem). In each scheme so far, the number of ports stays the same, but the speed of each port is increased. What if we want to increase the number of ports? Can we build a crossbar-equivalent from multiple stages of smaller crossbars? If so, what properties should it have?
109
Multi-Stage Switches
110
Basic Switch Element: this is equivalent to a crosspoint in the crossbar (no longer a strong argument). Diagram: a 2x2 element with two states, cross and through, and optional buffering.
111
Example of Multistage Switch
It needs N log N internal switching elements (crosspoints), fewer than a crossbar. Diagram: an 8x8 network of 2x2 elements with inputs/outputs numbered 000-111, stages joined by perfect-shuffle wiring (one half of the deck interleaved with the other half).
112
Packet Routing
The bits of the destination address provide the required routing tags: the digits of the destination port address set the state of the switch elements, one bit per stage (the highlighted bit controls the switch setting in each stage). Diagram: an 8x8 example routing cells to destination ports 011 and 101 through Stage 1, Stage 2, and Stage 3, with perfect-shuffle wiring between stages.
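Self-routing can be sketched in a few lines: before each stage the ports are perfect-shuffled, and the 2x2 element then picks its upper or lower output from the next bit of the destination address. The shuffle-exchange (omega-style) wiring below is an illustrative model; the exact wiring differs between banyan variants:

```python
def route_omega(src, dst, n_bits=3):
    """Trace one cell through an 8x8 shuffle-exchange (omega-style) network.

    At each stage the wiring is a perfect shuffle (rotate the port number left
    by one bit); the 2x2 element then sends the cell to its upper or lower
    output according to the next bit of the destination address, MSB first.
    """
    pos = src
    path = [pos]
    for stage in range(n_bits):
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & ((1 << n_bits) - 1)  # shuffle
        dst_bit = (dst >> (n_bits - 1 - stage)) & 1                       # routing tag bit
        pos = (pos & ~1) | dst_bit                                        # element output
        path.append(pos)
    return path

print(route_omega(0b000, 0b011))   # ends at port 3 (011)
print(route_omega(0b101, 0b110))   # ends at port 6 (110)
```

Two cells that want the same element output in the same stage would collide, which is exactly the internal blocking shown on the next slides.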
113
Internal blocking
Both internal link blocking and output blocking can happen in a multistage switch. Diagram: an example of internal blocking, where the connections from input 0 to output 3 (011) and from input 4 to output 2 (010) contend for the same internal link.
114
Output Blocking
Diagram: an example of output blocking, where the connections from input 1 to output 6 (110) and from input 3 to output 6 (110) contend for the same output.
115
A Solution: Batcher Sorter
One solution to the contention problem is to sort the cells into monotonically increasing order of desired destination port. This is done with a bitonic sorter called a Batcher sorter, which places the M cells into a gap-free increasing sequence on the first M input ports and eliminates duplicate destinations.
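For reference, here is a software sketch of the bitonic sort that a Batcher network implements with hardware comparators, sorting a Python list of destination ports (the length must be a power of two; this is not a hardware description):

```python
def bitonic_sort(seq, ascending=True):
    """Batcher's bitonic sorter; len(seq) must be a power of two."""
    if len(seq) <= 1:
        return list(seq)
    half = len(seq) // 2
    first = bitonic_sort(seq[:half], True)        # build a bitonic sequence:
    second = bitonic_sort(seq[half:], False)      # one half up, one half down
    return bitonic_merge(first + second, ascending)

def bitonic_merge(seq, ascending):
    if len(seq) <= 1:
        return list(seq)
    half = len(seq) // 2
    seq = list(seq)
    for i in range(half):                         # one stage of compare-exchange elements
        if (seq[i] > seq[i + half]) == ascending:
            seq[i], seq[i + half] = seq[i + half], seq[i]
    return (bitonic_merge(seq[:half], ascending) +
            bitonic_merge(seq[half:], ascending))

print(bitonic_sort([6, 1, 3, 7, 4, 0, 5, 2]))   # cells sorted by destination port
```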
116
Batcher-Banyan Example
Animation (several slides): a set of cells tagged with destination ports is first sorted by the Batcher network into monotonically increasing order, then delivered through the banyan network without internal blocking.
123
Simple Sort & Route Network
Diagram: cells with destinations 3, 6, 5, 4 pass through Sort, Filter, Adder, Concentrator, and Route stages, all built from simple components with no buffering. The filter eliminates duplicates by comparing consecutive addresses and returns an ack to the inputs; the adder computes and inserts the "rank" of each cell; the concentrator uses the rank as the output address; the routing network delivers cells to their outputs. The adder, concentrator, and routing network each have log2(n) stages.
124
3-stage Clos Network
Diagram: N = n x m ports, with k >= n. The first stage has m switches of size n x k, the middle stage has k switches of size m x m, and the third stage has m switches of size k x n.
125
Clos-network Blocking
When a connection is made, it can exclude the possibility of certain other connections being made.
Non-blocking: a new connection can always be made without disturbing the existing connections.
Rearrangeably non-blocking: a new connection can always be made, but it might be necessary to reconfigure some other connections on the switch.
Speaker notes: We have already used the terms blocking and non-blocking, whose definitions are fairly obvious. Non-blocking switches are further divided into two sets: wide-sense non-blocking requires a certain rule for setting up connections to be followed in order to avoid getting into a blocked state, while strict-sense non-blocking switches can use any valid path at any time. There is a further class of switch networks known as rearrangeably non-blocking: in such switches it is always possible to connect a given set of inputs to a given (valid, of course) set of outputs, but when a new connection is required it may not 'fit' with the existing set of connections, and new paths have to be found for the old connections as well as the new one.
126
Diagram (left): a connection request from input 4 to output 1 is blocked; the connection cannot be set up.
Diagram (right): the same request can be satisfied by rearranging the existing connection from input 2 to output 2, after which the connection from input 4 to output 1 can be set up.
127
Clos-network Properties Expansion factors
Strictly non-blocking iff m >= 2n - 1.
Rearrangeably non-blocking iff m >= n.
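These two conditions are one-liners to check. In the sketch below, n is the number of inputs per first-stage switch and m the number of middle-stage switches, matching the slide's notation:

```python
def clos_properties(n, m):
    """Non-blocking properties of a symmetric 3-stage Clos network."""
    return {
        "strictly_nonblocking": m >= 2 * n - 1,
        "rearrangeably_nonblocking": m >= n,
        "expansion_factor": m / n,
    }

print(clos_properties(n=32, m=48))   # rearrangeable (48 >= 32), not strict (48 < 63)
```

Applied to the 1024x1024 construction example a few slides later (n = 32, m = 48), the network is rearrangeably non-blocking with expansion 1.5, but not strictly non-blocking (48 < 63).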
128
3-stage Fabrics (Basic building block – a crossbar) Clos Network
129
3-Stage Fabrics Clos Network
Expansion factor required = 2-1/N (but still blocking for multicast)
130
4-Port Clos Network Strictly Non-blocking
131
Construction example Switch size 1024x1024 Construction module
Switch size: 1024x1024
Input stage: thirty-two 32x48 switches
Central stage: forty-eight 48x48 switches
Output stage: thirty-two 48x32 switches
Expansion: 48/32 = 1.5
Diagram: inputs 1-1024 enter input modules #1-#32, which connect to central modules #1-#48, which connect to output modules #1-#32.
132
Lucent Architecture Buffers
133
MSM Architecture
134
Cisco’s 46Tbps Switch System
Diagram: line card chassis (LCC) and fabric card chassis (FCC) connected by 12.5G links, up to 80 chassis in total. 8 switch planes with a speedup of 2.5; 1152 line cards; a 1296x1296, 3-stage Benes switch fabric; multicast performed in the switch; 1:N fabric redundancy; 40 Gbps packet processors (188 RISC cores). Each LCC holds 16 40G line cards and 8 S1/S3 (18x18) switch elements; each FCC holds 18 S2 (72x72) switch elements; the full system has 72 LCCs and 8 FCCs (1152 LCs, 576 S1/S3, 144 S2).
135
Massively Parallel Switches
Instead of using tightly coupled fabrics like a crossbar or a bus, these switches use massively parallel interconnects such as a hypercube, 2D torus, or 3D torus. A few companies use this design for their core routers. These fabrics are generally scalable; however:
It is very difficult to guarantee QoS and to include value-added functionality (e.g., multicast, fair bandwidth allocation).
They consume a lot of power.
They are relatively costly.
136
Massively Parallel Switches
137
3D Switching Fabric: Avici
Three components:
Topology: 3D torus
Routing: source routing with randomization
Flow control: virtual channels and virtual networks
Maximum configuration: 14 x 8 x 5 = 560 nodes; channel speed is 10 Gbps.
138
Packaging: uniformly short wires between adjacent nodes can be built into passive backplanes and run at high speed. Figures are from "Scalable Switching Fabrics for Internet Routers" by W. J. Dally.
139
Avici: Velociti™ Switch Fabric
Toroidal direct connect fabric (3D Torus) Scales to 560 active modules Each element adds switching & forwarding capacity Each module connects to 6 other modules
140
Switch fabric chips comparison
Speaker notes: Here is some information comparing switch fabric chips from different companies. You will see that the crossbar is the dominant architecture for designing switch fabrics nowadays. It is also interesting that switch fabrics are priced according to their switching capacity, counted per 10 Gbit/s. Shared-memory architectures generally have relatively low power consumption, while buffered-crossbar architectures usually have higher power consumption.