Introduction to High-Performance Internet Switches and Routers
Network Architecture (diagram, source: http://www.ust.hk/itsc/network/): a DWDM long-haul core interconnecting metropolitan networks; 10GbE core routers in the metro core, 10GbE edge routers and edge switches toward campus/residential networks, and GbE access switches and access routers at the edge.
How the Internet really is: Current Trend (modems, DSL; SONET/SDH; DWDM). In the access network, the majority of users connect to the Internet with either a phone modem or a DSL line. So why don't textbooks talk about these circuit switches? Because these circuits are not integrated with IP; IP sees them as static point-to-point links. We have this architecture for historic reasons, not because of a well-thought-out design process. In the past, when an Internet Service Provider on the West Coast wanted to connect with another ISP on the East Coast, it had two options: lay down a cable, which was very expensive, or rent a circuit from the long-distance telephone carriers, who were using circuit switching. Likewise, when the ISP wanted to reach residential customers, it had to go through one of the few companies with a connection to the home, the local phone company. Now, is this hybrid architecture the right network architecture? Wouldn't it be better to have one that uses only packets, or perhaps one that uses only circuits? Let me define some performance criteria that I will use to answer those questions.
What is Routing? (Diagram: hosts A-F attached to a network of routers R1-R5.)
Points of Presence (POPs) (Diagram: hosts A-F reached through POP1-POP8.)
Where High Performance Routers are Used (Diagram: a backbone of routers R1-R16 interconnected by 10 Gb/s links.)
Hierarchical arrangement: end hosts (1000s per mux) connect to access multiplexers, which feed edge routers, which feed core routers inside a POP; POPs are linked by 10 Gb/s "OC192" long-haul links. POP: Point of Presence, richly interconnected by a mesh of long-haul links. Typically 40 POPs per national network operator and 10-40 core routers per POP.
Typical POP Configuration: backbone (core) routers connect to the transport network (DWDM/SONET terminals) over 10G WAN transport links and to each other over 10G router-to-router intra-office links; aggregation switches/routers (edge switches) hang off the backbone routers. More than 50% of the high-speed interfaces are router-to-router.
Today's Network Equipment (protocol layers and the equipment that implements them):
Layer 3 - Internet (IP): routers
Layer 2 - FR & ATM: switches
Layer 1 - SONET
Layer 0 - DWDM
Functions in a packet switch (diagram): ingress linecard: framing, route lookup, TTL processing, buffering; interconnect: interconnect scheduling; egress linecard: buffering, QoS scheduling, framing; plus a control plane. The data path, control path and scheduling path are shown separately.
Functions in a circuit switch (diagram): ingress linecard: framing; interconnect: interconnect scheduling; egress linecard: framing; plus a control plane. Only a data path and a control path; no per-packet processing.
Our emphasis for now is on packet switches (IP, ATM, Ethernet, Frame Relay, etc.). If anyone has additional comments, please speak up.
What a Router Looks Like. Cisco CRS-1 (16-slot single-shelf system): full rack, 214 cm x 60 cm x 101 cm; capacity 640 Gb/s; power 13.2 kW; "up to 72 boxes for a total capacity of 92 terabits per second". Juniper T1600 (16-slot system): half a rack, 95 cm x 44 cm x 79 cm; capacity 1.6 Tb/s; power 9.1 kW. (Juniper TX: four T-640s.)
What a Router Looks Like. Cisco GSR 12416: 19" wide, 6 ft tall, 2 ft deep; capacity 160 Gb/s; power 4.2 kW. Juniper M160: 19" wide, 3 ft tall, 2.5 ft deep; capacity 80 Gb/s; power 2.6 kW.
A Router Chassis (photo: linecards, fans and power supplies).
Backplane: a circuit board with connectors for line cards; high-speed electrical traces connect the line cards to the fabric; usually passive; typically 30-layer boards.
Line Card Picture
What do these two have in common? Cisco CRS-1 Cisco Catalyst 3750G
What do these two have in common? CRS-1 linecard: 20" x (18"+11") x 1RU; 40 Gb/s, 80 Mpps; state-of-the-art 0.13 µm silicon; full IP routing stack including IPv4 and IPv6 support; distributed IOS; multi-chassis support. Cat 3750G switch: 19" x 16" x 1RU; 52 Gb/s, 78 Mpps; state-of-the-art 0.13 µm silicon; full IP routing stack including IPv4 and IPv6 support; distributed IOS; multi-chassis support.
What is different between them? Cisco CRS-1 Cisco Catalyst 3750G
A lot… CRS-1 linecard: up to 1024 linecards; fully programmable forwarding; 2M prefix entries and 512K ACLs; 46 Tb/s 3-stage switching fabric; MPLS support; H-A non-stop routing protocols. Cat 3750G switch: up to 9 stack members; hardwired ASIC forwarding; 11K prefix entries and 1.5K ACLs; 32 Gb/s shared stack ring; L2 switching support; re-startable routing applications. Also note that the CRS-1 line card is about 30x the material cost of the Cat 3750G.
Other packet switches: Cisco 7500 "edge" routers; Lucent GX550 core ATM switch; DSL routers.
What is Routing? (Diagram: routers R1-R5 connecting hosts A-F; each router holds a forwarding table of (Destination, Next Hop) entries, e.g. destinations D, E, F mapped to next hops such as R3 and R5.)
What is Routing? (Diagram: the same network, with the 20-byte IPv4 header expanded: Ver, HLen, Type of Service, Total Packet Length, Fragment ID, Flags, Fragment Offset, TTL, Protocol, Header Checksum, Source Address, Destination Address, Options (if any), then Data. The destination address is looked up in the (Destination, Next Hop) table.)
What is Routing? (Diagram: packets forwarded hop by hop from hosts A-C through R1-R5 to hosts D-F.)
Basic Architectural Elements of a Router. Control plane ("typically in software"): routing, routing table update (OSPF, RIP, IS-IS), admission control, congestion control, reservation. Switch, i.e. per-packet processing ("typically in hardware"): routing lookup, packet classification, switching, arbitration, scheduling.
Basic Architectural Components. Datapath: per-packet processing in three steps: 1. forwarding decision (using a forwarding table at each input); 2. interconnect; 3. output scheduling.
Per-packet processing in a Switch/Router 1. Accept packet arriving on an ingress line. 2. Lookup packet destination address in the forwarding table, to identify outgoing interface(s). 3. Manipulate packet header: e.g., decrement TTL, update header checksum. 4. Send packet to outgoing interface(s). 5. Queue until line is free. 6. Transmit packet onto outgoing line.
ATM Switch Lookup cell VCI/VPI in VC table. Replace old VCI/VPI with new. Forward cell to outgoing interface. Transmit cell onto link.
Ethernet Switch Lookup frame DA in forwarding table. If known, forward to correct port. If unknown, broadcast to all ports. Learn SA of incoming frame. Forward frame to outgoing interface. Transmit frame onto link.
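To make the learn/forward steps above concrete, here is a minimal Python sketch of a self-learning Ethernet switch. The class name, port numbering and MAC strings are invented for illustration; this is not any vendor's implementation.

```python
# Minimal sketch of Ethernet-switch forwarding with source-address learning.

class EthernetSwitch:
    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.mac_table = {}          # MAC address -> output port

    def handle_frame(self, in_port, src_mac, dst_mac):
        # Learn: remember which port the source address was seen on.
        self.mac_table[src_mac] = in_port
        # Forward: known destination -> one port; unknown -> flood all other ports.
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]
        return [p for p in range(self.num_ports) if p != in_port]

sw = EthernetSwitch(4)
print(sw.handle_frame(0, "aa:aa", "bb:bb"))   # unknown dst: flood to ports 1, 2, 3
print(sw.handle_frame(1, "bb:bb", "aa:aa"))   # known dst: forward to port 0
```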
IP Router Lookup packet DA in forwarding table. If known, forward to correct port. If unknown, drop packet. Decrement TTL, update header Cksum. Forward packet to outgoing interface. Transmit packet onto link.
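A minimal Python sketch of the IP forwarding steps above: longest-prefix match, TTL decrement, and incremental checksum update per RFC 1624. The FIB contents, next-hop names and the example checksum are made up for illustration.

```python
import ipaddress

# Hypothetical forwarding table: prefix -> next hop.
FIB = {
    ipaddress.ip_network("0.0.0.0/0"):   "R1",
    ipaddress.ip_network("10.0.0.0/8"):  "R3",
    ipaddress.ip_network("10.1.0.0/16"): "R5",
}

def lookup(dst_ip):
    addr = ipaddress.ip_address(dst_ip)
    matches = [net for net in FIB if addr in net]
    if not matches:
        return None                                  # unknown destination: drop
    return FIB[max(matches, key=lambda net: net.prefixlen)]   # longest prefix wins

def ones_add(a, b):                                  # 16-bit one's-complement add
    s = a + b
    return (s & 0xFFFF) + (s >> 16)

def decrement_ttl(ttl, proto, checksum):
    """Decrement TTL and update the header checksum incrementally (RFC 1624)."""
    old_word = (ttl << 8) | proto                    # 16-bit header word holding TTL
    new_word = ((ttl - 1) << 8) | proto
    hc = ones_add((~checksum) & 0xFFFF, (~old_word) & 0xFFFF)
    hc = ones_add(hc, new_word)                      # HC' = ~(~HC + ~m + m')
    return ttl - 1, (~hc) & 0xFFFF

print(lookup("10.1.2.3"))              # -> "R5" (more specific than 10.0.0.0/8)
print(decrement_ttl(64, 6, 0xB861))    # new TTL and updated checksum
```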
Special per packet/flow processing. The router can be equipped with additional capabilities to provide special services on a per-packet or per-class basis. It can perform additional processing on incoming packets: classifying the packet (IPv4, IPv6, MPLS, ...); delivering packets according to a pre-agreed service, absolute or relative (e.g., deliver a packet within a given deadline, or give one packet better service than another: IntServ / DiffServ); filtering packets for security reasons; treating multicast packets differently from unicast packets.
Per-packet processing must be fast!
Year | Aggregate line rate | Arriving rate of 40B POS packets (Mpps)
1997 | 622 Mb/s | 1.56
1999 | 2.5 Gb/s | 6.25
2001 | 10 Gb/s  | 25
2003 | 40 Gb/s  | 100
2006 | 80 Gb/s  | 200
2008 | …
Packet processing must be simple and easy to implement, and memory access time is the bottleneck: 200 Mpps x 2 lookups/pkt = 400 Mlookups/sec, i.e. 2.5 ns per lookup.
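A quick sanity check of the arithmetic in the last line, using the table's 2006 figure of 200 Mpps and the stated two lookups per packet:

```python
# Lookup-time budget implied by the 2006 row (200 Mpps at 80 Gb/s),
# assuming 2 lookups per packet as stated on the slide.
pps = 200e6                      # arriving 40-byte packets per second
lookups_per_sec = 2 * pps        # 400 Mlookups/s
print(1e9 / pps)                 # 5.0 ns available per packet
print(1e9 / lookups_per_sec)     # 2.5 ns available per lookup
```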
First Generation Routers (diagram): a shared backplane connects line interfaces (MAC) to a central CPU holding the route table and buffer memory; typically <0.5 Gb/s aggregate capacity.
Bus-based Router Architectures with a Single Processor. The first generation of IP router, based on software running on a single general-purpose CPU. Limitations: a serious processing bottleneck in the central processor; memory-intensive operations (e.g. table lookups and data movement) limit the effectiveness of the processor; and the shared input/output (I/O) bus severely limits overall router throughput.
Second Generation Routers (diagram): line cards, each with its own buffer memory, forwarding cache and MAC, connect over a shared bus to a CPU with the route table and buffer memory; typically <5 Gb/s aggregate capacity.
Bus-based Router Architectures with Multiple Processors (architectures with route caching). Packet forwarding is distributed across the network interface cards, each with its own processor and route cache, so packets cross the shared bus only once. Limitations: the central routing table is a bottleneck at high speeds; throughput is traffic-dependent (cache hit rate); and the shared bus is still a bottleneck.
Third Generation Routers (diagram): a switched backplane connects line cards (each with MAC, forwarding table and local buffer memory) and a CPU card holding the routing table; typically <50 Gb/s aggregate capacity.
Switch-based Router Architectures with Fully Distributed Processors. To avoid bottlenecks in processing power, memory bandwidth and internal bus bandwidth, each network interface is equipped with appropriate processing power and buffer space.
Fourth Generation Routers/Switches: optics inside a router for the first time. Linecards connect to the switch core over optical links hundreds of metres long; 0.3-10 Tb/s routers in development.
Juniper TX8/T640 Alcatel 7670 RSP Avici TSR Chiaro
Next Gen. Backbone Network Architecture: one backbone, multiple access networks (diagram). A (G)MPLS-based multi-service intelligent packet backbone connects DSL/FTTH/dial access networks (telecommuters, residential), cable networks, dual-stack IPv4-IPv6 enterprise networks, mobile networks (SGSN/GGSN), an IPv6 IX and ISPs offering native IPv6 services. CE routers attach to PE routers at service POPs. One backbone network maximizes speed, flexibility and manageability.
Current Generation: Generic Router Architecture (diagram). Header processing: look up the packet's IP destination address in an address table of ~1M prefixes (off-chip DRAM) to get the next hop, and update the header; then queue the packet in buffer memory sized for ~1M packets (off-chip DRAM).
Current Generation: Generic Router Architecture, input-queued (diagram): each of the N linecards does its own lookup and header update and queues packets in its own buffer memory; a central scheduler arbitrates transfers across the interconnect.
Current Generation: Generic Router Architecture, output-queued (diagram): each input does lookup and header update, and packets are immediately placed in the buffer memory of the destination output (1…N).
Basic Architectural Elements of a Current Router. Typical IP router linecard: physical layer; framing & maintenance; packet processing with lookup tables; buffer management & scheduling with buffer & state memory; all attached through the backplane to a buffered or bufferless fabric (e.g. crossbar, bus) with a scheduler. An OC192c linecard has ~10-30M gates, ~2 Gbits of memory, occupies ~2 square feet, costs >$10k to build and is priced around $100K.
Performance metrics. Capacity: "maximize C, s.t. volume < 2 m³ and power < 5 kW". Throughput: operators like to maximize usage of expensive long-haul links. Controllable delay: some users would like predictable delay; this is feasible with output queueing plus weighted fair queueing (WFQ).
Why do we Need Faster Routers? To prevent routers from becoming the bottleneck in the Internet. To increase POP capacity, and to reduce cost, size and power.
Why we Need Faster Routers: to prevent routers from being the bottleneck. Growth rates: line capacity 2x / 7 months; user traffic 2x / 12 months; router capacity 2.2x / 18 months; Moore's Law 2x / 18 months; DRAM random access time 1.1x / 18 months.
Why we Need Faster Routers 1: to prevent routers from being the bottleneck (graph: disparity between traffic growth and router capacity growth, roughly a 5-fold disparity).
Why we Need Faster Routers 2: to reduce cost, power & complexity of POPs. Big POPs need big routers (diagram: a POP with a few large routers vs. a POP with many smaller routers). Interfaces: price >$200k, power >400W. About 50-60% of interfaces are used for interconnection within the POP. The industry trend is towards a large, single router per POP.
A case study: the UUNET Internet backbone build-up. 1999 view (4Q): 8 OC-48 links between POPs (not parallel). 2002 view (4Q): 52 OC-48 links between POPs, many of them parallel, plus 3 OC-192 super-POP links with multiple parallel interfaces between POPs (D.C.-Chicago; NYC-D.C.). To meet the traffic growth, higher-performance routers with higher port speeds are required.
Why we Need Faster Routers 2: to reduce cost, power & complexity of POPs. Once a router is 99.999% available, this consolidation step (a single large router per POP) becomes possible: it further reduces CapEx and operational cost and further increases network stability.
Ideal POP (diagram): a single large router sits between the existing carrier equipment (carrier optical transport: SONET, DWDM and optical switches) and the service/aggregation layer: gigabit routers, VoIP gateways, DSL aggregation, ATM, Gigabit Ethernet and cable modem aggregation.
Why are Fast Routers Difficult to Make? Big disparity between line rates and memory access speed
Problem: Fast Packet Buffers. Example: a 40 Gb/s packet buffer; size = RTT x BW = 10 Gb; 64-byte packets; write rate R and read rate R of one packet every 12.8 ns each. How fast can the buffer memory be? Use SRAM? Fast enough random access time, but too low a density to store 10 Gb of data. Use DRAM? High density means we can store the data, but it is too slow (~50 ns random access time).
Memory Technology (2007)
Technology      | Max single-chip density | $/chip ($/MByte)       | Access speed | Watts/chip
Networking DRAM | 64 MB                   | $30-$50 ($0.50-$0.75)  | 40-80 ns     | 0.5-2 W
SRAM            | 8 MB                    | $50-$60 ($5-$8)        | 3-4 ns       | 2-3 W
TCAM            | 2 MB                    | $200-$250 ($100-$125)  | 4-8 ns       | 15-30 W
How fast a buffer can be made? ~5 ns per memory operation for SRAM, ~50 ns for DRAM, with a 64-byte-wide bus between the external line and the buffer memory. Rough estimate: two memory operations per packet (one write, one read), so the maximum is roughly 50 Gb/s with SRAM and 5 Gb/s with DRAM. Aside: buffers need to be large for TCP to work well, so DRAM is usually required.
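The "~50 / ~5 Gb/s" estimate follows directly from the numbers on this slide; a small sketch of the arithmetic, using the slide's assumptions (64-byte bus word, one write plus one read per packet):

```python
def max_buffer_throughput_gbps(access_time_ns, bus_width_bytes=64):
    ops_per_packet = 2                       # write on arrival + read on departure
    bits_moved = bus_width_bytes * 8         # one memory operation moves one bus word
    return bits_moved / (ops_per_packet * access_time_ns)   # bits per ns = Gb/s

print(max_buffer_throughput_gbps(5))    # SRAM:  ~51 Gb/s
print(max_buffer_throughput_gbps(50))   # DRAM:  ~5 Gb/s
```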
Packet Caches (diagram): a hybrid buffer in which the buffer manager keeps a small ingress SRAM cache of the FIFO tails and a small egress SRAM cache of the FIFO heads, while the bulk of each queue lives in DRAM buffer memory; packets are moved between SRAM and DRAM b >> 1 packets at a time.
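A rough behavioural sketch of the packet-cache idea for a single queue, assuming packets move between SRAM and DRAM in batches of b. The class and method names are invented for illustration, and the real memory-management algorithms behind this scheme are not modelled.

```python
from collections import deque

class HybridPacketBuffer:
    """Sketch: the tail and head of the FIFO live in fast SRAM; the middle lives
    in slow, dense DRAM and is moved b packets at a time, so the DRAM needs only
    about 1/b of the random-access rate of the line."""
    def __init__(self, b):
        self.b = b
        self.tail_sram = deque()   # most recently arrived packets
        self.dram = deque()        # bulk storage, accessed b packets at a time
        self.head_sram = deque()   # next packets to depart

    def write(self, pkt):
        self.tail_sram.append(pkt)
        if len(self.tail_sram) >= self.b:            # one wide DRAM write
            for _ in range(self.b):
                self.dram.append(self.tail_sram.popleft())

    def read(self):
        if not self.head_sram:
            if self.dram:                            # one wide DRAM read
                for _ in range(min(self.b, len(self.dram))):
                    self.head_sram.append(self.dram.popleft())
            else:                                    # nearly empty queue: bypass DRAM
                self.head_sram, self.tail_sram = self.tail_sram, deque()
        return self.head_sram.popleft() if self.head_sram else None
```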
Why are Fast Routers Difficult to Make? Packet processing gets harder (graph of instructions per arriving byte over time: what we'd like, namely more features such as QoS, multicast and security, versus what will actually happen).
Why are Fast Routers Difficult to Make? Clock cycles per minimum length packet since 1996
Options for packet processing: general-purpose processors (MIPS, PowerPC, Intel); network processors (Intel IXA and IXP processors, IBM Rainier; control-plane processors: SiByte (Broadcom), QED (PMC-Sierra)); FPGAs; ASICs.
General Observations. Up until about 2000: low-end packet switches used general-purpose processors; mid-range packet switches used FPGAs for the datapath and general-purpose processors for the control plane; high-end packet switches used ASICs for the datapath and general-purpose processors for the control plane. More recently: third-party network processors are used in many low- and mid-range datapaths, while home-grown network processors are used in the high end.
Why are Fast Routers Difficult to Make? Demand for Router Performance Exceeds Moore’s Law Growth in capacity of commercial routers (per rack): Capacity 1992 ~ 2Gb/s Capacity 1995 ~ 10Gb/s Capacity 1998 ~ 40Gb/s Capacity 2001 ~ 160Gb/s Capacity 2003 ~ 640Gb/s Capacity 2007 ~ 11.5Tb/s Average growth rate: 2.2x / 18 months.
Maximizing the throughput of a router Engine of the whole router Operators increasingly demand throughput guarantees: To maximize use of expensive long-haul links For predictability and planning Serve as many customers as possible Increase the lifetime of the equipment Despite lots of effort and theory, no commercial router today has a throughput guarantee.
Maximizing the throughput of a router: the engine of the whole router (diagram: ingress linecard with framing, route lookup, TTL processing and buffering; interconnect with interconnect scheduling; egress linecard with buffering, QoS scheduling and framing; plus the control plane and the data, control and scheduling paths).
Maximizing the throughput of a router: the engine of the whole router. Throughput depends on the switching architecture (input queued, output queued, shared memory) and on the arbitration/scheduling algorithms used within that architecture. This is key to the overall performance of the router.
Why are Fast Routers Difficult to Make? Power: it is exceeding practical limits.
Switching Architectures
Generic Router Architecture (diagram, output-queued style): each input does header processing (look up the IP address in the address table, update the header) and the packet is queued in the buffer memory of output 1…N; the output buffer memory must run at N times the line rate.
Generic Router Architecture (diagram, input-queued style): each of the N linecards does header processing and queues packets in its own buffer memory; a scheduler arbitrates transfers across the interconnect.
Interconnects: two basic techniques. Input queueing, usually with a non-blocking switch fabric (e.g. crossbar); and output queueing, usually with a fast bus.
Simple model of an output queued switch (diagram): N ingress links and N egress links, all at rate R; arriving packets are placed directly into queues at the egress links.
How an Output Queued (OQ) Switch Works (animation).
Characteristics of an output queued (OQ) switch Arriving packets are immediately written into the output queue, without intermediate buffering. The flow of packets to one output does not affect the flow to another output. An OQ switch has the highest throughput, and lowest delay. The rate of individual flows, and the delay of packets can be controlled (QoS).
The shared memory switch (diagram): a single physical memory device into which all N ingress links (rate R) write and from which all N egress links (rate R) read.
Characteristics of a shared memory switch
Memory bandwidth. Basic OQ switch: consider an OQ switch with N different physical memories and all links operating at rate R bits/s. In the worst case, packets may arrive continuously from all inputs, destined to just one output, so the maximum memory bandwidth requirement for each memory is (N+1)R bits/s. Shared memory switch: the maximum memory bandwidth requirement for the single memory is 2NR bits/s.
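The two worst-case figures above as one-line calculations (N ports at line rate R; the printed values use an assumed 32-port, 10 Gb/s example):

```python
def oq_memory_bw(N, R):       # per-output memory: N simultaneous writes + 1 read
    return (N + 1) * R

def shared_memory_bw(N, R):   # single memory: N writes + N reads every cell time
    return 2 * N * R

N, R_gbps = 32, 10
print(oq_memory_bw(N, R_gbps))       # 330 Gb/s per output memory
print(shared_memory_bw(N, R_gbps))   # 640 Gb/s for the shared memory
```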
How fast can we make a centralized shared memory switch? With 5 ns SRAM and a 200-byte-wide memory bus: 5 ns per memory operation and two memory operations per packet means up to 200 bytes every 10 ns, i.e. 200 x 8 bits / 10 ns = 160 Gb/s; in practice, closer to 80 Gb/s.
Output Queueing: the "ideal" (animation: packets arriving simultaneously for the same output are all written immediately into that output's queue).
How to Solve the Memory Bandwidth Problem? Use Input Queued Switches In the worst case, one packet is written and one packet is read from an input buffer Maximum memory bandwidth requirement for each memory is 2R bits/s. However, using FIFO input queues can result in what is called “Head-of-Line (HoL)” blocking
Input Queueing: Head-of-Line Blocking (graph of delay vs. load: with FIFO input queues, throughput saturates at 58.6% rather than 100%).
Head of Line Blocking
Virtual Output Queues (VoQ) At each input port, there are N queues – each associated with an output port Only one packet can go from an input port at a time Only one packet can be received by an output port at a time It retains the scalability of FIFO input-queued switches It eliminates the HoL problem with FIFO input Queues
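A minimal sketch of the VOQ structure and of one cell time under the two constraints listed above (at most one packet leaves each input and at most one packet enters each output). The trivial greedy matcher is only illustrative; real input-queued switches use iterative schedulers such as iSLIP.

```python
from collections import deque

class VOQSwitch:
    """Input-queued switch with virtual output queues: input i keeps a separate
    FIFO for every output j, so a packet blocked on a busy output never blocks
    packets behind it that want other outputs."""
    def __init__(self, n):
        self.n = n
        self.voq = [[deque() for _ in range(n)] for _ in range(n)]

    def enqueue(self, i, j, pkt):
        self.voq[i][j].append(pkt)

    def timeslot(self):
        """One cell time: greedily match inputs to free outputs."""
        used_outputs, delivered = set(), []
        for i in range(self.n):
            for j in range(self.n):
                if j not in used_outputs and self.voq[i][j]:
                    delivered.append((i, j, self.voq[i][j].popleft()))
                    used_outputs.add(j)
                    break                      # input i has sent its one packet
        return delivered

sw = VOQSwitch(3)
sw.enqueue(0, 2, "p1"); sw.enqueue(1, 2, "p2"); sw.enqueue(1, 0, "p3")
print(sw.timeslot())   # input 0 wins output 2; input 1 still sends p3 to output 0
```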
Input Queueing Virtual output queues
Input Queues with Virtual Output Queues (graph of delay vs. load: with VOQs, throughput can reach 100%).
Input Queueing (VOQ) (diagram): the memory bandwidth per linecard is only 2R, but the scheduler can be quite complex!
Combined IQ/SQ Architecture (diagram): inputs 1…N feed a routing fabric whose N output queues share one memory; flow control runs back from the shared memory to the inputs. Can be a good compromise.
A Comparison: memory speeds for a 32x32 switch, cell size = 64 bytes.
Line rate | Shared-memory BW, access time per cell | Input-queued BW, access time
100 Mb/s  | 6.4 Gb/s, 80 ns   | 200 Mb/s, 2.56 µs
1 Gb/s    | 64 Gb/s, 8 ns     | 2 Gb/s, 256 ns
2.5 Gb/s  | 160 Gb/s, 3.2 ns  | 5 Gb/s, 102.4 ns
10 Gb/s   | 640 Gb/s, 0.8 ns  | 20 Gb/s, 25.6 ns
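The table's numbers can be reproduced from the definitions used so far: the shared memory must do 2N operations per cell time (bandwidth 2NR), an input queue only 2 (bandwidth 2R), and the access time is the cell time divided by the number of operations.

```python
# Memory requirements for an N x N switch with 64-byte cells.
def requirements(line_rate_gbps, N=32, cell_bytes=64):
    cell_time_ns = cell_bytes * 8 / line_rate_gbps        # ns per cell at line rate
    return {
        "shared_mem_bw_gbps":   2 * N * line_rate_gbps,
        "shared_mem_access_ns": cell_time_ns / (2 * N),
        "iq_mem_bw_gbps":       2 * line_rate_gbps,
        "iq_access_ns":         cell_time_ns / 2,
    }

for rate in (0.1, 1, 2.5, 10):                            # Gb/s, as in the table
    print(rate, requirements(rate))
```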
Scalability of Switching Fabrics
Shared Bus. It is the simplest interconnect possible; its protocols are very well established; and multicasting and broadcasting are natural. However, it has a scalability problem because multiple transmissions cannot occur concurrently. Its maximum bandwidth is around 100 Gb/s, which limits the maximum number of I/O ports and/or the line rates. It is typically used for "small" shared-memory or output-queued switches, and is a very good choice for Ethernet switches.
Crossbars. The crossbar is becoming the preferred interconnect for high-speed switches: very high throughput, with support for QoS and multicast. It needs N² crosspoints, but that is no longer the real limitation nowadays. (Diagram: data in, data out, configuration.)
Limiting factors for a crossbar switch: N² crosspoints per chip; it's not obvious how to build a crossbar from multiple chips; and the "I/O" capacity per chip. State of the art: about 200 pins, each operating at 3.125 Gb/s, which is roughly 600 Gb/s per chip; only about 1/3 to 1/2 of this capacity is available in practice because of overhead and speedup. Crossbar chips today are limited by their I/O capacity.
Limitations to Building Large Crossbar Switches: I/O pins. The maximum practical bit rate per pin is ~3.125 Gb/s, and at this speed you need between 2 and 4 pins per data line. To achieve a 10 Gb/s (OC-192) line rate you need around 4 parallel data lines (4-bit parallel transmission). For example, consider a 4-bit-parallel 64-input crossbar designed to support OC-192 line rates per port: each port interface requires 4 x 3 = 12 pins in each direction, so a 64-port crossbar needs 12 x 64 x 2 = 1536 pins just for the I/O data lines. Hence the real problem is the I/O pin limitation. How do we solve it?
Scaling: trying to build a crossbar from multiple chips (diagram: to build a 16x16 crossbar from 4-input, 4-output building blocks, each block would need eight inputs and eight outputs!).
How to build a scalable crossbar: use bit slicing, i.e. parallel crossbars. For example, the previous design can be built from 4 parallel 1-bit crossbars. Each port interface then requires only 1 x 3 = 3 pins in each direction, so a 64-port crossbar needs 3 x 64 x 2 = 384 pins for the I/O data lines, which is reasonable (but we need 4 chips).
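The pin counts on this and the previous slide come from the same simple formula (assuming 3 pins per high-speed data line, as in the slides, and counting pins in both directions):

```python
def io_pins(ports, lines_per_port, pins_per_line=3):
    # data pins only, input direction + output direction
    return lines_per_port * pins_per_line * ports * 2

print(io_pins(64, 4))   # 4-bit-parallel ports on one chip: 1536 pins
print(io_pins(64, 1))   # bit-sliced, 1 bit per chip:        384 pins (times 4 chips)
```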
Scaling: bit-slicing (diagram): each cell is "striped" across multiple identical crossbar planes (a crossbar-switched "bus"); the scheduler makes the same decision for all slices.
Scaling: time-slicing (diagram): each cell goes over one plane and takes N cell times; the scheduler is unchanged and makes a decision for each slice in turn.
HKUST 10 Gb/s 256x256 Crossbar Switch Fabric Design. The overall switch fabric is an OC-192 256x256 crossbar switch, composed of 8 256x256 crossbar chips, each running at 2 Gb/s (to compensate for the overhead and to provide a switch speedup). The deserializer (DES) converts the OC-192 10 Gb/s data on the fiber link into 8 low-speed signals, and the serializer (SER) serializes the low-speed signals back onto the fiber link.
Architecture of the Crossbar Chip Crossbar Switch Core – fulfills the switch functions Control – configures the crossbar core High speed data link – communicates between this chip and SER/DES PLL – provides on-chip precise clock
Technical Specification of our Core-Crossbar Chip:
Full crossbar core: 256x256 (embedded with 2 bit-slices)
Technology: TSMC 0.25 µm SCN5M Deep (lambda = 0.12 µm)
Layout size: 14 mm x 8 mm
Transistor count: 2000k
Supply voltage: 2.5 V
Clock frequency: 1 GHz
Power: 40 W
Layout of a 256*256 crossbar switch core
HKUST Crossbar Chip in the News Researchers offer alternative to typical crossbar design http://www.eetimes.com/story/OEG20020820S0054 By Ron Wilson - EE Times August 21, 2002 (10:56 a.m. ET) PALO ALTO, Calif. — In a technical paper presented at the Hot Chips conference here Monday (Aug.19) researchers Ting Wu, Chi-Ying Tsui and Mounir Hamdi from Hong Kong University of Science and Technology (China) offered an alternative pipeline approach to crossbar design. Their approach has yielded a 256-by-256 signal switch with a 2-GHz input bandwidth, simulated in a 0.25-micron, 5-metal process. The growing importance of crossbar switch matrices, now used for on-chip interconnect as well as for switching fabric in routers, has led to increased study of the best ways to build these parts.
Scaling a crossbar Conclusion: scaling the capacity is relatively straightforward (although the chip count and power may become a problem). In each scheme so far, the number of ports stays the same, but the speed of each port is increased. What if we want to increase the number of ports? Can we build a crossbar-equivalent from multiple stages of smaller crossbars? If so, what properties should it have?
Multi-Stage Switches
Basic Switch Element: a 2x2 element with two states, cross and through, and optional buffering. It is the equivalent of a crosspoint in a crossbar (although counting crosspoints is no longer a good argument).
Example of a Multistage Switch: an 8x8 banyan built from 2x2 elements needs on the order of N log N internal crosspoints, fewer than the crossbar's N². (Diagram: inputs 000-111 and outputs 000-111 connected through three stages of 2x2 elements, with perfect-shuffle wiring between stages, i.e. interleaving one half of the "deck" of links with the other half.)
Packet Routing: the bits of the destination address provide the required routing tags; the digits of the destination port address set the state of the switch elements at each stage. (Diagram: the highlighted bit of the destination address, e.g. 011 or 101, controls the switch setting at stage 1, 2 and 3, with perfect-shuffle wiring between stages.)
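Destination-tag routing is easy to simulate. The sketch below models the fabric as an 8x8 shuffle-exchange (Omega-style) banyan, one common realization of the perfect-shuffle network drawn on the slide: bit k of the destination address (MSB first) sets the 2x2 element at stage k, so the cell reaches the right output regardless of which input it entered on.

```python
def route(src, dst, n=3):
    """Return the internal link (element-output position) used after each stage."""
    N, pos, path = 1 << n, src, []
    for k in range(n):
        pos = ((pos << 1) | (pos >> (n - 1))) & (N - 1)   # perfect shuffle (rotate left)
        bit = (dst >> (n - 1 - k)) & 1                    # destination bit, MSB first
        pos = (pos & ~1) | bit                            # 2x2 exchange: upper=0, lower=1
        path.append(pos)
    return path                                            # path[-1] == dst

print(route(0, 0b011))   # [0, 1, 3]: arrives at output 3
print(route(4, 0b010))   # [0, 1, 2]: arrives at output 2
# In this model the two paths contend for the same internal links after the
# first and second stages.
```

Running it for the two connections of the next slide (input 0 to output 3, input 4 to output 2) shows their paths sharing internal links, which is the internal blocking illustrated there.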
Internal blocking: internal link blocking as well as output blocking can happen in a multistage switch. (Diagram: connections from input 0 to output 3 (011) and from input 4 to output 2 (010) contend for the same internal link, a blocking link, even though they want different outputs.)
Output blocking. (Diagram: connections from input 1 to output 6 (110) and from input 3 to output 6 (110) both want the same output and therefore block each other.)
3-stage Clos Network (diagram): N = n x m ports; the first stage has m switches of size n x k, the middle stage has k switches of size m x m, and the last stage has m switches of size k x n, with k >= n.
Clos-network properties (expansion factors), in the notation of the previous slide (n inputs per first-stage switch, k middle-stage switches): strictly nonblocking iff k >= 2n - 1; rearrangeably nonblocking iff k >= n.
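These two conditions are trivial to check in code; the example values below are the ones from the 1024x1024 construction a few slides later (n = 32 inputs per first-stage switch, k = 48 middle-stage switches).

```python
# Clos-network conditions from the slide (n inputs per first-stage switch,
# k middle-stage switches).
def strictly_nonblocking(n, k):      return k >= 2 * n - 1
def rearrangeably_nonblocking(n, k): return k >= n

n = 32
print(strictly_nonblocking(n, 48))        # False: 48 < 2*32 - 1 = 63
print(rearrangeably_nonblocking(n, 48))   # True
print(strictly_nonblocking(n, 63))        # True; expansion 63/32 = 2 - 1/32
```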
3-stage Fabrics (Basic building block – a crossbar) Clos Network
3-Stage Fabrics (Clos network): the expansion factor required for strict nonblocking is (2n - 1)/n = 2 - 1/n, i.e. k = 2n - 1 middle-stage switches (but the fabric is still blocking for multicast).
4-Port Clos Network Strictly Non-blocking
Construction example: a 1024x1024 switch built from three stages.
Input stage: thirty-two 32x48 switches
Central stage: forty-eight 48x48 switches
Output stage: thirty-two 48x32 switches
Expansion: 48/32 = 1.5
(Diagram: inputs 1-1024 in groups of 32 onto the 32 input switches, and likewise for the outputs.)
Lucent Architecture Buffers
MSM Architecture
Cisco's 46 Tb/s Switch System (diagram): 80 chassis in total, 72 line card chassis (LCC) holding up to 1152 linecards and 576 S1/S3 18x18 switch elements, plus 8 fabric card chassis (FCC) holding 144 S2 72x72 switch elements; a 1296x1296 3-stage Benes switch fabric with 8 switch planes, 12.5G links and a speedup of 2.5; multicast performed in the fabric; 1:N fabric redundancy; 40 Gb/s packet processor (188 RISCs) per line card.
Massively Parallel Switches Instead of using tightly coupled fabrics like a crossbar or a bus, they use massively parallel interconnects such as hypercube, 2D torus, and 3D torus. Few companies use this design architecture for their core routers These fabrics are generally scalable However: It is very difficult to guarantee QoS and to include value-added functionalities (e.g., multicast, fair bandwidth allocation) They consume a lot of power They are relatively costly
Massively Parallel Switches
3D Switching Fabric: Avici. Three components: topology, a 3D torus; routing, source routing with randomization; flow control, virtual channels and virtual networks. Maximum configuration: 14 x 8 x 5 = 560 nodes; channel speed is 10 Gb/s.
Packaging Uniformly short wires between adjacent nodes Can be built in passive backplanes Run at high speed Figures are from Scalable Switching Fabrics for Internet Routers, by W. J. Dally (can be found at www.avici.com)
Avici: Velociti™ Switch Fabric Toroidal direct connect fabric (3D Torus) Scales to 560 active modules Each element adds switching & forwarding capacity Each module connects to 6 other modules
Switch fabric chips comparison. Here is some information comparing switch-fabric chips from different companies. The crossbar is the dominant architecture for switch fabrics designed today. Interestingly, switch fabrics tend to be priced according to switching capacity (quoted per 10 Gbit/s). Shared-memory architectures generally have relatively low power consumption, while buffered-crossbar architectures usually have higher power consumption. http://www.lightreading.com/document.asp?doc_id=47959