Packet Switching on Raw

Packet Switching on Raw
Research Qualifying Exam, Gleb A. Chuvpilo, January 28, 2005

Project Publications
High-Bandwidth Packet Switching on the Raw General-Purpose Architecture, Gleb A. Chuvpilo and Saman Amarasinghe, In Proceedings of the International Conference on Parallel Processing (ICPP-03), Kaohsiung, Taiwan, Republic of China, October 6-9, 2003.
High-Bandwidth Packet Switching on the Raw General-Purpose Architecture, Gleb A. Chuvpilo, S.M. Thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts, August 2002.
RawNet: Network Processing on the Raw Processor, David Wentzlaff, Gleb A. Chuvpilo, Arvind Saraf, Saman Amarasinghe, and Anant Agarwal, In Research Abstracts of the MIT Laboratory for Computer Science, Cambridge, Massachusetts, March 2002.
Gigabit IP Routing on Raw, Gleb A. Chuvpilo, David Wentzlaff, and Saman Amarasinghe, In Proceedings of the 1st HPCA Workshop on Network Processors, Cambridge, Massachusetts, February 3, 2002.
Also, unpublished work on Network Calculus at the Computer Engineering and Networks Laboratory of the ETH Swiss Federal Institute of Technology.

Outline: Introduction; Packet Switching on Raw; Results; Conclusion
Raw Processor Overview; Internet Router Overview; Packet Switching on Raw; Raw Router Architecture; Rotating Crossbar Design for Switch Fabric; Distributed Scheduling Algorithm; Minimization and Scheduling; Results; Conclusion

Introduction

Goal
Build an IP router on a general-purpose processor. Why?
Flexibility: new protocols and services
Price: economies of scale

Raw Processor
A scalable computation fabric
4 x 4 mesh of tiles, each tile is a RISC microprocessor
Ultra-fast interconnect network
Exposes the wires to the compiler
Compiler orchestrates the communication

Raw Facts: Performance
16 OPS/FLOPS per cycle
230 Gb/s of on-chip "bisection bandwidth"
201 Gb/s off-chip I/O bandwidth
57 GB/s of on-chip memory bandwidth

Raw Facts: Layout
Longest wire is the length of a tile, which allows fast clocking
Each tile: MIPS R router + interconnect
Per tile: 32 KB IMEM, 32 KB data cache, 64 KB SMEM, giving 2 MB total per chip
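The 2 MB figure can be checked from the per-tile sizes. The short sketch below only redoes the arithmetic from the slide; the variable names are illustrative, and it assumes all 16 tiles of the 4 x 4 mesh carry the same three memories.

```python
# Per-tile on-chip memories as listed on the slide, in KB.
TILE_MEMORY_KB = {"IMEM": 32, "data cache": 32, "SMEM": 64}
NUM_TILES = 4 * 4  # 4 x 4 mesh of tiles

per_tile_kb = sum(TILE_MEMORY_KB.values())
total_mb = NUM_TILES * per_tile_kb / 1024

print(per_tile_kb, "KB per tile")
print(total_mb, "MB per chip")
```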

Raw Facts: Instruction Set Architecture
Eight-stage pipeline: FETCH, DECODE, RF/STALL, EXE, MUL, MEM, FPU
MIPS instruction set
28 general-purpose registers
4 register-mapped network ports
2-way set-associative cache, 3-cycle latency, 32-byte lines

Raw Facts: Implementation
ASIC @ 250 MHz worst case
122 million transistors (P4: 43 million)
18.2 mm x 18.2 mm die (P4: 15 mm x 15 mm)
1080 signal I/O pins
25 Watts
IBM SA-27E 6-layer-metal copper 0.15μ process (P4: 0.13μ)

Raw Layout

Communication Mechanisms
2 static networks
2 dynamic networks

Static Networks
Destinations known at compile time
Message size known at compile time
Cycle-by-cycle switch schedule
Three-cycle nearest-neighbor send-to-use latency
No processing overhead

Static Network: Send
A tile wants to communicate a value to its southern neighbor

Static Network: Receive

Dynamic Networks
Unpredictable events
External asynchronous interrupts
Cache misses
15- to 30-cycle nearest-neighbor send-to-use latency (message header processing overhead)
Wormhole routed, two-stage pipelined, dimension-ordered
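To make "dimension-ordered" concrete, here is a minimal sketch of dimension-ordered routing on a 4 x 4 mesh. It is not taken from the Raw documentation: the row-major tile numbering, the choice to resolve X before Y, and the function name `xy_route` are all assumptions made for illustration.

```python
from typing import List

MESH_WIDTH = 4  # Raw is a 4 x 4 mesh of tiles


def xy_route(src: int, dst: int, width: int = MESH_WIDTH) -> List[int]:
    """Return the tiles visited from src to dst, moving along X first, then Y.

    Tiles are numbered row-major (an assumption): tile = y * width + x.
    """
    x, y = src % width, src // width
    dst_x, dst_y = dst % width, dst // width
    path = [src]
    # Resolve the X dimension completely before touching Y.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append(y * width + x)
    # Then resolve the Y dimension.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append(y * width + x)
    return path


print(xy_route(0, 15))  # [0, 1, 2, 3, 7, 11, 15]
```

Because every message finishes one dimension before starting the other, the route between any two tiles is fixed by their coordinates alone.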

Routing

What is Routing?
RM OSI…
Let’s take a look at the Open Systems Interconnection Reference Model to figure out where routers stand.

IP Router
[Figure labels: Network Processor, Switch Fabric, Forwarding Engine, Interface]

Switch Fabric Cisco Gigabit Switch Router backplane interconnecting multiple line cards. A centralized scheduler connects to each line card and determines the configuration of the crossbar switch for each time slot

Click Modular Router
Modular software router
MIT Parallel and Distributed OS Group
435, byte packets a second on a 700 MHz Pentium III (commodity hardware)
Flexible, configurable, and easy to understand
Interconnected collection of modules called elements

Click Modular Router Software router running on Intel x86 architecture

Packet Switching on Raw

Problem: Four Networks…
[Figure: four networks, labeled 1 to 4]

… and Sixteen Tiles:

What is the Mapping?
[Figure labels: Static Interconnect, Dynamic Communication]

Solution: Rotating Crossbar
[Figure labels: PORT 0 to PORT 3; Ingress Processor, Lookup Processor, Crossbar Processor, Egress Processor; ROTATING CROSSBAR in the center; In 0 to In 3; Out 0 to Out 3]
Notice the symmetry of the design. Now, let’s jump inside the center of the picture…

Switch Fabric Design
Borrows the idea of a Token Ring network, which gives absolute fairness
The algorithm uses the two static networks; the dynamic networks are idle
All deadlock-free configurations are scheduled at compile time
Four headers and the token location define a global configuration
The global configuration is computed in a distributed manner at run time
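The fairness argument can be illustrated with a minimal sketch of token-based arbitration. This is not the scheduler from the talk: it assumes each header is either an output port number or `None` (empty), that the token holder is served first with the remaining ports following in ring order, and that the token advances by one port per round. It also ignores which crossbar links a transfer occupies. The names `arbitrate` and `next_token` are hypothetical.

```python
from typing import List, Optional

NUM_PORTS = 4


def arbitrate(headers: List[Optional[int]], token: int) -> List[Optional[int]]:
    """Grant each input port its requested output, or None if it must wait.

    Ports are visited in ring order starting at the token holder, so the
    token holder always wins a conflict over an output port.
    """
    granted: List[Optional[int]] = [None] * NUM_PORTS
    taken = set()
    for offset in range(NUM_PORTS):
        port = (token + offset) % NUM_PORTS
        dest = headers[port]
        if dest is not None and dest not in taken:
            granted[port] = dest
            taken.add(dest)
    return granted


def next_token(token: int) -> int:
    """Rotate the token to the next port so priority is shared over time."""
    return (token + 1) % NUM_PORTS


# Ports 0 and 1 both want output 2; port 3 wants output 0; port 2 is empty.
headers = [2, 2, None, 0]
print(arbitrate(headers, token=0))  # [2, None, None, 0]
print(arbitrate(headers, token=1))  # [None, 2, None, 0]
```

With the token at port 0 the conflict over output 2 goes to port 0; after one rotation it goes to port 1, which is the round-robin behavior the Token Ring analogy suggests.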

Rotating Crossbar Illustrated
[Figure labels: PORT 0 to PORT 3; Ingress Processor, Lookup Processor, Crossbar Processor, Egress Processor; ROTATING CROSSBAR]

Phases of the Algorithm
[Figure: TILE PROCESSOR and SWITCH PROCESSOR timelines with phases headers_request, headers, send_prev_config, choose_new_config, route_body, update_token, confirm]
Pipelining = overlap routing with computation of configuration

Distributed Scheduling Algorithm
Let’s enumerate the configurations:
SPACE = |Hdr0| x … x |Hdr3| x |Token|, where |Hdr0| = … = |Hdr3| = 5 and |Token| = 4
Therefore SPACE = 5^4 x 4 = 2,500 distinct configurations
The most straightforward enumeration of the configuration space is the product of four headers and a token: each header can take one of 5 values, and the token can be in four different locations in the crossbar. This enumeration is global.
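The count can be reproduced by brute-force enumeration. The sketch below only assumes what the slide states, namely four headers with 5 possible values each and 4 token positions; it does not model what the header values mean, and the constant names are illustrative.

```python
from itertools import product

NUM_HEADERS = 4      # Hdr0 .. Hdr3
HEADER_VALUES = 5    # |Hdr_i| from the slide
TOKEN_POSITIONS = 4  # |Token| from the slide

# Every global configuration is a tuple (hdr0, hdr1, hdr2, hdr3, token).
space = list(
    product(*[range(HEADER_VALUES)] * NUM_HEADERS, range(TOKEN_POSITIONS))
)

print(len(space))  # 2500
```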

So What?...
Each tile has 8,192 words of instruction memory; the same is true for the switch
8,192 / 2,500 = 3.3 instructions per configuration: not enough!
Would need to use off-chip memory: slow!
Need to minimize SPACE
You may ask: “So what, memory is cheap!” But here’s the thing: each tile of the Raw processor only has 8 K words of instruction memory. The same is true for the switch. So what are we left with? 8,192 divided by 2,500 leaves us 3.3 instructions per configuration.

Minimization
[Figure: Crossbar Processor of PORT 0 with its Ingress Processor and Egress Processor; links labeled in, out, cwprev, cwnext, ccwprev, ccwnext]
Let’s think locally!
The symmetry of our design lets us enumerate configurations locally. Let’s shift the focus in order to minimize the configuration space and make things simpler: instead of enumerating global configurations of the Rotating Crossbar, let’s concentrate on a specific Crossbar Processor. As you can see in the figure, what we need to do is name all possible clients, or potential incoming occupants, of a Crossbar Processor’s servers, which are the static networks connecting a Crossbar Processor to its outgoing neighboring tiles.

Clients and Servers of a Crossbar Processor
Servers: out, cwnext, ccwnext
Clients: in, cwprev, ccwprev
Here are the possible values that “servers” and “clients” can take: three for servers (out, cwnext, ccwnext), and four for clients (empty, in, cwprev, and ccwprev)

Minimization and Scheduling
We cut down the number of configurations by a factor of 78!
Now there are only 32 entries, so the program can fit in the local instruction memory
Code is generated by an automatic compile-time scheduler
In addition, software pipelining and loop unrolling are applied to the assembly code of the crossbar’s switch processors to avoid deadlock
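The effect of the minimization on the instruction budget is simple arithmetic on the numbers already quoted: 8,192 words of instruction memory, 2,500 global configurations, and 32 local entries. The sketch below only restates those figures; the constant names are illustrative.

```python
IMEM_WORDS = 8192      # instruction memory per tile (same for the switch)
GLOBAL_CONFIGS = 2500  # 5^4 x 4 global configurations
LOCAL_ENTRIES = 32     # entries after minimization

reduction = GLOBAL_CONFIGS / LOCAL_ENTRIES    # about 78x fewer entries
words_before = IMEM_WORDS / GLOBAL_CONFIGS    # about 3.3 words per entry
words_after = IMEM_WORDS // LOCAL_ENTRIES     # words per entry after minimization

print(f"reduction: {reduction:.1f}x")
print(f"budget before: {words_before:.1f} words per configuration")
print(f"budget after: {words_after} words per entry")
```

Going from roughly 3 words to 256 words per entry is what lets each entry hold a real routing sequence in local instruction memory.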

Scheduler Output

    /* AUTOGENERATED SCHEDULE FOR PORT 0 */
    /* Tile Processor */
    /* … */
    conf_1_0303:
        mtsri SW_PC, %lo(sw_conf_1000)
        j conf_done
    conf_1_0304:
    conf_1_0310:
        mtsri SW_PC, %lo(sw_conf_2001)
    conf_1_0311:
        mtsri SW_PC, %lo(sw_conf_1210)

    /* HAND-CODED SCHEDULE FOR PORT 0 */
    /* Switch Processor */
    /* … */
    /* in->out, prev->next, dist=1 */
    sw_conf_1210:
        nop route $IN->$OUT
        nop route $IN->$OUT, $PREV->$NEXT

Results

Implementation
Raw Router was tested in a cycle-accurate simulator of the Raw processor
Raw prototype clock speed is assumed to be 250 MHz
The focus of the research is on the switch fabric, NOT on route lookup, etc.
Over 75,000 lines of assembly code, many of them hand-coded
As a disclaimer, I would like to note that the research presented so far has focused on the design and implementation of the switch fabric of the Raw Router; the rest of the router implementation is future work.

Raw Router Results
Features: 4-port edge router
3.3 Mpps
26.9 Gbps
Uses Raw static networks to stream data
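A rough sanity check relates these figures to the 250 MHz clock. The sketch below rests on two assumptions that the slides do not state: that each of the four ports can move one 32-bit word per cycle, and that the Mpps and Gbps figures come from the same workload. The derived numbers are therefore illustrative, not reported results.

```python
CLOCK_HZ = 250e6         # assumed Raw clock speed from the slides
THROUGHPUT_BPS = 26.9e9  # reported throughput
PACKET_RATE_PPS = 3.3e6  # reported packet rate
NUM_PORTS = 4            # 4-port router
WORD_BITS = 32           # assumption: one 32-bit word per port per cycle

bits_per_cycle = THROUGHPUT_BPS / CLOCK_HZ
peak_bps = NUM_PORTS * WORD_BITS * CLOCK_HZ
utilization = THROUGHPUT_BPS / peak_bps
# Only meaningful if both reported figures describe the same workload.
implied_packet_bytes = THROUGHPUT_BPS / PACKET_RATE_PPS / 8

print(f"{bits_per_cycle:.1f} bits per cycle")
print(f"{utilization:.0%} of the assumed {peak_bps / 1e9:.0f} Gbps peak")
print(f"about {implied_packet_bytes:.0f} bytes per packet")
```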

Conclusion

Conclusion
Implemented a gigabit switch on Raw
Mapped dynamic communication to static interconnect
Can intermix switch fabric with computation
High-bandwidth I/O allows performance of custom ASIC processors

Future Work + Critique
Take advantage of dynamic networks
Implement IP route lookup
Add computation on data (encryption)
Add support for multicast traffic
Implement Quality of Service
Add virtual output queueing
Explore larger router configurations
We are planning to implement longest prefix match for IP route lookup. Multicast traffic is traffic in which a single source simultaneously sends the same information to a set of subscribers. Quality of Service is a set of mechanisms that allow prioritization of traffic according to its pricing. Virtual output queueing is a method to avoid head-of-line blocking of packets.

End of the “official” part!

Current Research Probabilistic Robotics with Prof. John Leonard
Robust Feature-Relative Navigation for Autonomous Underwater Vehicles

Robotic Kayaks

Questions?
