High-Bandwidth Packet Switching on the Raw General-Purpose Architecture Gleb Chuvpilo Saman Amarasinghe MIT LCS Computer Architecture Group January 9,


1 High-Bandwidth Packet Switching on the Raw General-Purpose Architecture Gleb Chuvpilo Saman Amarasinghe MIT LCS Computer Architecture Group January 9, 2003

2 Talk at a Glance: Raw Processor Overview; Raw Router Architecture; Related Work: SimpleFit Analysis Framework; Discussion.

3 We are on…

4 What is Raw? (Taylor, 1999) Next-generation general-purpose processor!

5 More Specifically, Raw is… A scalable computation fabric; a 4 x 4 mesh of tiles, each tile a simple microprocessor; an ultra-fast interconnect network. It exposes the wires to the compiler; the compiler orchestrates the communication.

6 Raw Facts: Performance. 16 OPS/FLOPS per cycle; 462 Gb/s of on-chip “bisection bandwidth”; 201 Gb/s I/O bandwidth; 57 GB/s of on-chip memory bandwidth.

7 Raw Facts: Layout. Longest wire is the length of a tile → fast clocking! 16 tiles. Each tile: MIPS R4000 + router + interconnect; 32 KB IMEM; 32 KB data cache; 64 KB SMEM. 2048 KB total!

8 Raw Facts: Instruction Set Architecture. Eight-stage pipeline: FETCH, DECODE, RF/STALL, EXE, MUL, MEM, FPU; MIPS instruction set; 28 general-purpose registers; 4 register-mapped network ports; 2-way set-associative cache, 3-cycle latency, 32-byte lines. Fast!

9 Raw Facts: Implementation. ASIC @ 250 MHz; 122 million transistors (P4: 43 million); 18.2 mm x 18.2 mm die (P4: 15 mm x 15 mm); 1080 signal I/O pins; 25 Watts; IBM SA-27E 6-layer metal copper 0.15μ process (P4: 0.13μ).

10 Raw Layout

11 Communication Mechanisms: 2 static networks; 2 dynamic networks.

12 Static Networks: destinations known at compile time; message size known at compile time; cycle-by-cycle switch schedule; three-cycle nearest-neighbor send-to-use latency; no processing overhead.

13 Static Network: Send

14 Static Network: Receive

15 Dynamic Networks: unpredictable events (external asynchronous interrupts, cache misses); 15- to 30-cycle nearest-neighbor send-to-use latency (message header processing overhead); wormhole routed, two-stage pipelined, dimension-ordered.
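
The "dimension-ordered" routing named on the Dynamic Networks slide can be illustrated with a minimal sketch. It assumes (the slides do not say) that the X dimension is resolved before Y and that tiles are addressed by hypothetical (x, y) coordinates on the 4 x 4 mesh; `xy_route` is an illustrative name, not part of any Raw toolchain.

```python
def xy_route(src, dst):
    """Return the tiles visited from src to dst, resolving X first, then Y.

    Coordinates are hypothetical (x, y) pairs on a mesh; the X-before-Y
    order is an assumption made for illustration.
    """
    x, y = src
    path = [(x, y)]
    # Travel along the X dimension until the destination column is reached.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Then travel along the Y dimension to the destination row.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 3)))
```

Because every message finishes its X hops before starting any Y hop, the path between two tiles is fixed, which is what makes this style of routing simple to implement in hardware.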

16 How to Program? StreamIt! (Thies et al., 2001) Hierarchical structures: Pipeline, SplitJoin, Feedback Loop. Basic programmable unit: Filter.
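
To make the hierarchical structures concrete, here is a rough Python analogy of a Pipeline containing a SplitJoin of Filters. This is not StreamIt syntax: the helper names (`pipeline`, `split_join`, `scale`, `offset`) are hypothetical, the splitter is assumed to duplicate its input, the joiner is assumed to be round-robin, and the Feedback Loop is left out.

```python
from typing import Callable, List

# A "filter" here is just a function from a list of items to a list of items.
Filter = Callable[[List[int]], List[int]]


def pipeline(*stages: Filter) -> Filter:
    """Compose filters so each stage consumes the previous stage's output."""
    def run(items: List[int]) -> List[int]:
        for stage in stages:
            items = stage(items)
        return items
    return run


def split_join(*branches: Filter) -> Filter:
    """Duplicate the input to every branch, then join outputs round-robin."""
    def run(items: List[int]) -> List[int]:
        outputs = [branch(items) for branch in branches]
        joined: List[int] = []
        for group in zip(*outputs):
            joined.extend(group)
        return joined
    return run


def scale(k: int) -> Filter:
    return lambda items: [k * v for v in items]


def offset(k: int) -> Filter:
    return lambda items: [v + k for v in items]


if __name__ == "__main__":
    graph = pipeline(scale(2), split_join(offset(1), offset(-1)))
    print(graph([1, 2, 3]))
```

The point of the analogy is only that the program is a graph of small units with explicit data movement between them, which is the property the compiler relies on.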

17 StreamIt In Action

18 Compiling StreamIt: The StreamIt language exposes the data movement; graph structure is architecture independent. Each architecture is different in granularity and topology; communication is exposed to the compiler. The compiler needs to efficiently bridge the abstraction: map the computation and communication pattern of the program to the processors, memory and the communication substrate. The StreamIt compiler: Partitioning, Placement, Scheduling, Code generation.

19 We are on…

20 Motivation: Build a fast IP router on a general-purpose architecture. Why? Flexibility → new protocols and services; price → economies of scale.

21 Raw Router (Chuvpilo et al., 2002). Features: 4-port edge router; 3.3 Mpps; 26.9 Gbps; uses one Raw static network to stream data.

22 What is Routing? RM OSI…

23 Architecture of Internet Routers: Network Processor, Switch Fabric, Forwarding Engine, Interface, Forwarding Engine, Interface.

24 Switch Fabric

25 Click Modular Router

26 Problem: Four Networks…

27 … and Sixteen Tiles:

28 What is the Mapping? Dynamic Communication; Static Interconnect.

29 Solution: Rotating Crossbar [Figure: tile assignment for the four ports (PORT 0 to PORT 3; inputs In 0 to In 3, outputs Out 0 to Out 3). Tiles are labeled Lookup Processor, Ingress Processor, Egress Processor and Crossbar Processor; the crossbar is labeled ROTATING CROSSBAR.]

30 Switch Fabric Design: The idea of a Token Ring network → absolute fairness; algorithm uses two static networks, dynamic networks are idle; all deadlock-free configurations are scheduled at compile time; four headers and token location define a global configuration; global configuration is computed in a distributed manner at run time.
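
The token-based fairness idea can be sketched as a small arbitration routine. This is a simplified illustration, not the router's actual algorithm: it assumes each input header names a requested output port or is empty, that the token holder gets first choice with the remaining ports served in ring order, and that the token advances by one port each round. It ignores the clockwise/counterclockwise link conflicts a real ring would have. The names `arbitrate` and `next_token` are hypothetical.

```python
from typing import List, Optional

NUM_PORTS = 4


def arbitrate(headers: List[Optional[int]], token: int) -> List[Optional[int]]:
    """Return, for each input port, the granted output port or None.

    headers[i] is the output requested by input i (None means no packet).
    The port holding the token is served first, then the others in ring order.
    """
    grants: List[Optional[int]] = [None] * NUM_PORTS
    taken = set()
    for step in range(NUM_PORTS):
        port = (token + step) % NUM_PORTS
        want = headers[port]
        if want is not None and want not in taken:
            grants[port] = want
            taken.add(want)
    return grants


def next_token(token: int) -> int:
    """Rotate the token to the next port so priority is shared over time."""
    return (token + 1) % NUM_PORTS


if __name__ == "__main__":
    headers = [2, 2, None, 0]      # inputs 0 and 1 both want output 2
    token = 0
    for _ in range(2):
        print(token, arbitrate(headers, token))
        token = next_token(token)
```

With the token at port 0, input 0 wins the contested output; one round later the token has moved and input 1 wins instead, which is the fairness property the slide attributes to the token ring.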

31 Rotating Crossbar Illustrated [Figure: the rotating crossbar tile layout with Lookup, Ingress, Egress and Crossbar Processors for PORT 0 to PORT 3.]

32 Rotating Crossbar Illustrated [Figure: the rotating crossbar tile layout with Lookup, Ingress, Egress and Crossbar Processors for PORT 0 to PORT 3.]

33 Phases of the Algorithm [Figure: interaction between TILE PROCESSOR and SWITCH PROCESSOR through the phases headers_request, headers, choose_new_config, send_prev_config, route_body, confirm, update_token.]

34 Distributed Scheduling Algorithm: Let’s enumerate the number of configurations: $\text{SPACE} = |Hdr_0| \times \dots \times |Hdr_3| \times |Token|$, where $|Hdr_0| = \dots = |Hdr_3| = 5$ and $|Token| = 4$ → therefore $\text{SPACE} = 5^4 \times 4 = 2{,}500$ distinct configurations.
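
The count can be checked by brute-force enumeration. In this sketch the five header values are assumed to be the four output ports plus an "empty" value; the slide only gives the size 5, so that interpretation is a guess made for illustration.

```python
from itertools import product

# Assumption: a header is either empty or names one of the four output ports.
HEADER_VALUES = ("empty", 0, 1, 2, 3)
TOKEN_VALUES = range(4)  # the token can sit at any of the four ports

# One global configuration = four headers plus the token location.
configs = [
    (headers, token)
    for headers in product(HEADER_VALUES, repeat=4)
    for token in TOKEN_VALUES
]

if __name__ == "__main__":
    print(len(configs))
```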

35 So What?... Each tile has 8,192 words of instruction memory, same for switch → 8,192/2,500 = 3.3 instructions per configuration → not enough! → need to use off-chip memory → slow! → need to minimize SPACE.

36 Minimization [Figure: the Egress Processor, Ingress Processor and Crossbar Processor of PORT 0, with the crossbar links labeled in, out, cwnext, ccwprev, cwprev, ccwnext.]

37 Clients and Servers of a Crossbar Processor: servers → out, cwnext, ccwnext; clients → in, cwprev, ccwprev.

38 Outcome of Minimization: We cut down the number of configurations by a factor of 78! Now there are only 32 entries! → The program can fit in the local instruction memory!
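
The arithmetic behind the minimization argument is short enough to verify directly. The sketch below uses only numbers quoted on the slides (8,192 words of instruction memory, 2,500 configurations before, 32 after); the function name `words_per_config` is made up for this example.

```python
IMEM_WORDS = 8192  # instruction memory words per tile, as quoted on the slides


def words_per_config(num_configs: int) -> float:
    """Instruction words available per configuration if all must fit on chip."""
    return IMEM_WORDS / num_configs


if __name__ == "__main__":
    before, after = 2500, 32
    print(round(words_per_config(before), 1))  # budget before minimization
    print(words_per_config(after))             # budget after minimization
    print(round(before / after))               # reduction factor
```

Going from roughly 3 words per configuration to 256 is what turns an impossible fit into a comfortable one.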

39 Implementation: Raw Router was tested in a cycle-accurate simulator of the Raw processor and on the FPGA emulator; Raw prototype clock speed is assumed to be 250 MHz; the focus of research is on switch fabric, NOT on route lookup, etc.

40 Peak Throughput

41 Average Throughput

42 Future Work: take advantage of dynamic networks; implement IP route lookup; add computation on data (encryption); add support for multicast traffic; implement Quality of Service; add virtual output queueing; explore larger router configurations.

43 Conclusion: implemented a gigabit switch on Raw; mapped dynamic communication to static interconnect; can intermix switch fabric with computation; high-bandwidth I/O allows the performance of custom ASIC processors.

44 We are on…

45 SimpleFit (Moritz et al., 2001): A Framework for Analyzing Design Tradeoffs in Raw Architectures

46 Analytical Framework

47 Architecture Model: Constrained optimization problem: find $P$, $p$, $c$, $m$ to minimize $T = \max(T_p, T_c)$ subject to $B \ge K$, where $T_p$ and $T_c$ are the performance of the application in terms of processing and communication, $B$ is the area budget, and $K$ is the cost.
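
The shape of this optimization problem can be shown with a toy grid search. Everything numeric below is invented for illustration: the runtime and cost formulas, their coefficients, the candidate values, and the reading of P as tile count and p, c, m as per-tile processing, communication and memory resources are all assumptions, not the SimpleFit models.

```python
from itertools import product


def runtime(P, p, c, work=1_000_000, words=200_000):
    """Toy T = max(T_p, T_c); both formulas are hypothetical placeholders."""
    t_proc = work / (P * p)    # processing time shrinks with more, faster tiles
    t_comm = words / (P * c)   # communication time shrinks with more bandwidth
    return max(t_proc, t_comm)


def cost(P, p, c, m):
    """Toy area cost K; the per-resource weights are made up."""
    return P * (4 * p + 2 * c + 1 * m)


def best_design(budget, mem_needed=64):
    """Search a small grid for the lowest runtime with cost <= budget."""
    best = None
    grid = product((1, 2, 4, 8, 16, 32, 64), (1, 2, 4), (1, 2, 4), (1, 2, 4, 8, 16))
    for P, p, c, m in grid:
        if cost(P, p, c, m) > budget:   # area constraint: B >= K
            continue
        if P * m < mem_needed:          # memory sufficiency constraint
            continue
        t = runtime(P, p, c)
        if best is None or t < best[0]:
            best = (t, (P, p, c, m))
    return best


if __name__ == "__main__":
    print(best_design(256))
    print(best_design(1024))
```

A real analysis would replace the grid search with a nonlinear solver and the placeholder formulas with application-specific models; the sketch only shows how the objective and constraints fit together.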

48 Cost Model: processor; memory; communication; global communication; global latency.

49 Application Model: required processing per node; required amount of memory words; required number of words of local communication per node; required local communication events; required latency of events; required global communication; required global communication events.

50 Performance Functions: the maximum of the runtimes in terms of processing and communication.

51 Optimization Problem: Constraint-based nonlinear optimization problem. Given: a fixed chip area or budget and problem size. Objective: minimum runtime. Constraints: budget, balanced local and global computation and communication, sufficiency of memory on a tile.

52 Results: application-specific results; sensitivity to grain size; sensitivity to different processor cost model assumptions; sensitivity to communication overlapping assumptions; design comparisons.

53 Example: Processors vs. Problem Size

54 Conclusions of the Talk: Raw is good for streaming applications and for combining computation with communication; StreamIt is a good interface to tiled architectures; routing on Raw achieves the performance of custom ASICs, but remains flexible; the SimpleFit framework provides good reasoning about Raw.

55 We are on…

56 Discussion: Questions? Comments? Ideas?

57 References: Check out the website: http://cag.lcs.mit.edu

