Published by Daniela Bryant. Modified over 9 years ago.
1
High-Bandwidth Packet Switching on the Raw General-Purpose Architecture. Gleb Chuvpilo, Saman Amarasinghe. MIT LCS Computer Architecture Group. January 9, 2003
2
2 Talk at a Glance: Raw Processor Overview; Raw Router Architecture; Related Work: SimpleFit Analysis Framework; Discussion
3
3 We are on…
4
4 What is Raw? (Taylor, 1999) Next-generation general-purpose processor!
5
5 More Specifically, Raw is… a scalable computation fabric; a 4 x 4 mesh of tiles, where each tile is a simple microprocessor; an ultra-fast interconnect network; it exposes the wires to the compiler; the compiler orchestrates the communication
6
6 Raw Facts: Performance. 16 OPS/FLOPS per cycle; 462 Gb/s of on-chip “bisection bandwidth”; 201 Gb/s I/O bandwidth; 57 GB/s of on-chip memory bandwidth
7
7 Raw Facts: Layout. The longest wire is the length of a tile, which allows fast clocking! 16 tiles; each tile: MIPS R4000 + router + interconnect, with 32 KB IMEM, 32 KB data cache, 64 KB SMEM; 2048 KB total!
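The 2048 KB total follows directly from the per-tile figures on this slide. A minimal check, assuming SMEM is the switch memory and that the total simply sums the three memories over all 16 tiles (the constant names are illustrative, not from the talk):

```python
# Per-tile memory sizes in KB, as listed on the slide.
IMEM_KB = 32    # tile instruction memory
DCACHE_KB = 32  # data cache
SMEM_KB = 64    # assumed here to be the switch memory
TILES = 16      # 4 x 4 mesh

per_tile_kb = IMEM_KB + DCACHE_KB + SMEM_KB
total_kb = TILES * per_tile_kb
print(per_tile_kb, total_kb)  # 128 2048
```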
8
8 Raw Facts: Instruction Set Architecture. Eight-stage pipeline: FETCH, DECODE, RF/STALL, EXE, MUL, MEM, FPU; MIPS instruction set; 28 general-purpose registers; 4 register-mapped network ports; 2-way set-associative cache, 3-cycle latency, 32-byte lines. Fast!
9
9 Raw Facts: Implementation. ASIC @ 250 MHz; 122 million transistors (P4: 43 million); 18.2 mm x 18.2 mm die (P4: 15 mm x 15 mm); 1080 signal I/O pins; 25 Watts; IBM SA-27E 6-layer metal copper 0.15 μm process (P4: 0.13 μm)
10
10 Raw Layout
11
11 Communication Mechanisms: 2 static networks; 2 dynamic networks
12
12 Static Networks: destinations known at compile time; message size known at compile time; cycle-by-cycle switch schedule; three-cycle nearest-neighbor send-to-use latency; no processing overhead
13
13 Static Network: Send
14
14 Static Network: Receive
15
15 Dynamic Networks: for unpredictable events (external asynchronous interrupts, cache misses); 15- to 30-cycle nearest-neighbor send-to-use latency (message header processing overhead); wormhole routed, two-stage pipelined, dimension-ordered
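Dimension-ordered routing on a mesh can be sketched in a few lines. This is a generic illustration, not Raw's actual router logic: it assumes tiles are addressed as (x, y) coordinates and that the X dimension is resolved before Y (the slide does not say which dimension goes first). The function name `xy_route` is hypothetical.

```python
def xy_route(src, dst):
    """Return the tiles visited from src to dst, moving along X first, then Y.

    Assumption: X-before-Y order; the talk only says "dimension-ordered".
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:              # resolve the X dimension completely
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:              # then resolve the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because every message finishes one dimension before starting the next, two messages can never wait on each other in a cycle, which is the usual reason meshes use this scheme.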
16
16 How to Program? StreamIt! (Thies et al., 2001) Hierarchical structures: Pipeline, SplitJoin, Feedback Loop. Basic programmable unit: Filter
17
17 StreamIt In Action
18
18 Compiling StreamIt. The StreamIt language exposes the data movement; the graph structure is architecture independent. Each architecture is different in granularity and topology; communication is exposed to the compiler. The compiler needs to efficiently bridge the abstraction: map the computation and communication pattern of the program to the processors, memory and the communication substrate. The StreamIt compiler: Partitioning, Placement, Scheduling, Code generation
19
19 We are on…
20
20 Motivation: build a fast IP router on a general-purpose architecture. Why? Flexibility: new protocols and services. Price: economies of scale
21
21 Raw Router (Chuvpilo et al., 2002) Features: 4-port edge router; 3.3 Mpps; 26.9 Gbps; uses one Raw static network to stream data
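To put the 26.9 Gbps figure in context, it can be compared with a rough ceiling for four ports. This is a back-of-the-envelope sketch under assumptions not stated on the slide: each port streams at most one 32-bit word per cycle over the static network, and the clock is the 250 MHz quoted on the implementation slide.

```python
CLOCK_HZ = 250e6     # clock speed quoted elsewhere in the talk
PORTS = 4            # 4-port edge router
WORD_BITS = 32       # assumption: one 32-bit word per port per cycle
REPORTED_GBPS = 26.9 # throughput listed on this slide

ceiling_gbps = PORTS * WORD_BITS * CLOCK_HZ / 1e9
utilization = REPORTED_GBPS / ceiling_gbps
print(ceiling_gbps)            # 32.0
print(round(utilization, 3))   # 0.841
```

Under these assumptions the reported throughput is roughly 84% of the streaming ceiling; the remainder would be spent on headers and scheduling, though the slide does not break this down.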
22
22 What is Routing? OSI Reference Model…
23
23 Architecture of Internet Routers [Figure: block diagram with Network Processor, Switch Fabric, Forwarding Engine and Interface blocks]
24
24 Switch Fabric
25
25 Click Modular Router
26
26 Problem: Four Networks… [Figure: four networks, numbered 1 to 4]
27
27 … and Sixteen Tiles:
28
28 What is the Mapping? Dynamic Communication Static Interconnect
29
29 Solution: Rotating Crossbar [Figure: tile layout of the router. Ports 0 to 3 are served by Ingress, Lookup and Egress Processor tiles; the Crossbar Processor tiles in the center form the rotating crossbar, connecting inputs In 0 to In 3 to outputs Out 0 to Out 3]
30
30 Switch Fabric Design. Borrows the idea of a Token Ring network, which gives absolute fairness. The algorithm uses two static networks; the dynamic networks are idle. All deadlock-free configurations are scheduled at compile time. Four headers and the token location define a global configuration. The global configuration is computed in a distributed manner at run time
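The token-based fairness idea can be illustrated with a small centralized sketch. It is not the distributed algorithm from the talk: it ignores the physical ring paths and the compile-time switch schedules, and only shows how four headers plus a token position determine which inputs get their requested outputs. The function names and the use of `None` for "no packet" are assumptions made for this example.

```python
NUM_PORTS = 4

def arbitrate(headers, token):
    """Grant outputs to inputs, starting from the token holder.

    headers[i] is the output port requested by input i, or None if input i
    has no packet (hypothetical encoding). Returns {input: granted output}.
    """
    grants = {}
    taken = set()
    for offset in range(NUM_PORTS):
        i = (token + offset) % NUM_PORTS   # token holder has top priority
        out = headers[i]
        if out is not None and out not in taken:
            grants[i] = out
            taken.add(out)
    return grants

def next_token(token):
    """Pass the token to the next port so priority rotates every round."""
    return (token + 1) % NUM_PORTS

# Inputs 0 and 1 both want output 2; the token is at port 1, so input 1 wins.
print(arbitrate([2, 2, None, 0], token=1))  # {1: 2, 3: 0}
```

Rotating the token each round means every input periodically holds top priority, which is the sense in which a token ring is fair.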
31
31 Rotating Crossbar Illustrated [Figure: the rotating crossbar on the tile layout of Ingress, Lookup, Egress and Crossbar Processor tiles, ports 0 to 3]
32
32 Rotating Crossbar Illustrated [Figure: the rotating crossbar on the tile layout of Ingress, Lookup, Egress and Crossbar Processor tiles, ports 0 to 3]
33
33 Phases of the Algorithm (tile processor and switch processor): headers_request, headers, choose_new_config, send_prev_config, route_body, confirm, update_token
34
34 Distributed Scheduling Algorithm. Let’s enumerate the configurations: $\text{SPACE} = |Hdr_0| \times \dots \times |Hdr_3| \times |Token|$, where $|Hdr_0| = \dots = |Hdr_3| = 5$ and $|Token| = 4$; therefore $\text{SPACE} = 5^4 \times 4 = 2{,}500$ distinct configurations
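The count can be reproduced by brute-force enumeration. In this sketch the header values are just the integers 0 to 4 and the token positions 0 to 3; what each header value means (for example four destinations plus "empty") is not spelled out on the slide, so the encoding here is only a placeholder.

```python
from itertools import product

HEADER_VALUES = range(5)    # 5 possible values per header (encoding assumed)
TOKEN_POSITIONS = range(4)  # token can sit at any of the 4 ports

# One global configuration = (hdr0, hdr1, hdr2, hdr3, token)
space = list(product(HEADER_VALUES, HEADER_VALUES, HEADER_VALUES,
                     HEADER_VALUES, TOKEN_POSITIONS))
print(len(space))  # 2500
```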
35
35 So What?... Each tile has 8,192 words of instruction memory, and the same for the switch. 8,192 / 2,500 ≈ 3.3 instructions per configuration: not enough! We would need to use off-chip memory: slow! So we need to minimize SPACE
36
36 Minimization [Figure: the Egress, Ingress and Crossbar Processor tiles of port 0; the Crossbar Processor's links are labeled in, out, cwnext, ccwprev, cwprev, ccwnext]
37
37 Clients and Servers of a Crossbar Processor. Servers: out, cwnext, ccwnext. Clients: in, cwprev, ccwprev
38
38 Outcome of Minimization. We cut down the number of configurations by a factor of 78! Now there are only 32 entries, so the program can fit in the local instruction memory!
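The numbers on this and the preceding slides fit together as simple arithmetic. The sketch below only re-derives them from the quoted figures (8,192 instruction words, 2,500 configurations before minimization, 32 after); the variable names are illustrative.

```python
IMEM_WORDS = 8192   # instruction words per tile (same for the switch)
FULL_SPACE = 2500   # configurations before minimization
MINIMIZED = 32      # entries after minimization

print(FULL_SPACE / MINIMIZED)            # 78.125  (the "78 times" reduction)
print(round(IMEM_WORDS / FULL_SPACE, 1)) # 3.3     (words per config before)
print(IMEM_WORDS // MINIMIZED)           # 256     (words per config after)
```

With 32 entries, each configuration has a budget of 256 instruction words instead of about 3, which is why the schedule fits in local instruction memory.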
39
39 Implementation. The Raw Router was tested in a cycle-accurate simulator of the Raw processor and on the FPGA emulator. The Raw prototype clock speed is assumed to be 250 MHz. The focus of the research is on the switch fabric, NOT on route lookup, etc.
40
40 Peak Throughput
41
41 Average Throughput
42
42 Future Work: take advantage of dynamic networks; implement IP route lookup; add computation on data (encryption); add support for multicast traffic; implement Quality of Service; add virtual output queueing; explore larger router configurations
43
43 Conclusion: implemented a gigabit switch on Raw; mapped dynamic communication to static interconnect; can intermix switch fabric with computation; high-bandwidth I/O allows the performance of custom ASIC processors
44
44 We are on…
45
45 SimpleFit (Moritz et al., 2001): A Framework for Analyzing Design Tradeoffs in Raw Architectures
46
46 Analytical Framework
47
47 Architecture Model. Constrained optimization problem: find $P$, $p$, $c$, $m$ to minimize $T = \max(T_p, T_c)$ subject to $B \ge K$, where $T_p$ and $T_c$ are the performance of the application in terms of processing and communication, $B$ is the area budget, and $K$ is the cost
48
48 Cost Model: processor; memory; communication; global communication; global latency
49
49 Application Model: required processing per node; required amount of memory words; required number of words of local communication per node; required local communication events; required latency of events; required global communication; required global communication events
50
50 Performance Functions: the maximum of the runtimes in terms of processing and communication
51
51 Optimization Problem. Constraint-based nonlinear optimization problem. Given: a fixed chip area or budget, and a problem size. Objective: minimum runtime. Constraints: budget; balanced local and global computation and communication; sufficiency of memory on a tile
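The shape of such a problem can be shown with a toy brute-force search. Everything in this sketch is hypothetical and is not the SimpleFit model: it optimizes only a tile count `P`, uses made-up runtime functions (processing time shrinking as `work / P`, communication time growing with the mesh side `sqrt(P)`), and a made-up linear area cost. It illustrates only the structure "minimize max(T_p, T_c) subject to cost within budget".

```python
import math

def runtime(P, work=1e6, comm_per_hop=200.0):
    """Hypothetical runtime model: the slower of processing and communication."""
    t_p = work / P                      # assumed: perfectly parallel processing
    t_c = comm_per_hop * math.sqrt(P)   # assumed: grows with mesh side length
    return max(t_p, t_c)

def cost(P, tile_area=1.0):
    """Hypothetical cost model: area proportional to the number of tiles."""
    return P * tile_area

def best_tile_count(budget, max_tiles=1024):
    """Exhaustively pick the feasible tile count with the smallest runtime."""
    feasible = [P for P in range(1, max_tiles + 1) if cost(P) <= budget]
    return min(feasible, key=runtime)

# With a small budget the budget constraint binds: use every tile we can afford.
print(best_tile_count(budget=64))  # 64
```

With a large budget, the toy optimum instead lands where the two runtime terms balance, mirroring the "balanced computation and communication" constraint named on the slide.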
52
52 Results: application-specific results; sensitivity to grain size; sensitivity to different processor cost model assumptions; sensitivity to communication overlapping assumptions; design comparisons
53
53 Example: Processors vs. Problem Size
54
54 Conclusions of the Talk: Raw is good for streaming applications and for combining computation with communication; StreamIt is a good interface to tiled architectures; routing on Raw achieves the performance of custom ASICs, but remains flexible; the SimpleFit framework provides good reasoning about Raw
55
55 We are on…
56
56 Discussion Questions? Comments? Ideas?
57
57 References Check out the website: http://cag.lcs.mit.edu