Streaming Supercomputer Strawman Architecture November 27, 2001 Ben Serebrin.

1 Streaming Supercomputer Strawman Architecture November 27, 2001 Ben Serebrin

2 High-level Programming Model
- Streams are partitioned across nodes

3 Programming: Partitioning
- Across nodes: straightforward domain decomposition
- Within nodes we have 2 choices (SW)
  - Domain decomposition
  - Each cluster receives neighboring record
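The two partitioning choices can be illustrated with a minimal sketch. It assumes that "domain decomposition" means contiguous blocks and that "each cluster receives neighboring record" means round-robin interleaving; the function names are hypothetical and not from the slides.

```python
def block_partition(records, parts):
    """Domain decomposition: each part gets one contiguous block of records."""
    base, extra = divmod(len(records), parts)
    out, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)  # spread any remainder over the first parts
        out.append(records[start:start + size])
        start += size
    return out


def interleaved_partition(records, parts):
    """Round-robin: neighboring records land in neighboring parts (assumed reading)."""
    return [records[p::parts] for p in range(parts)]


records = list(range(8))
blocks = block_partition(records, 4)         # contiguous blocks per part
striped = interleaved_partition(records, 4)  # record i goes to part i % 4
```

Contiguous blocks keep neighbors together inside one part, while interleaving spreads neighbors across parts; which suits a kernel depends on its access pattern.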

4 High-level Programming Model
- Parallelism within a node

5 Streams vs. Vectors
Streams:
- Compound operations on records
  -> Traverse operations first and records second
- Temporary values encapsulated within kernel
- Global instruction bandwidth is of kernels
  -> Group whole records into streams
- Gather records from memory: one stream buffer per record type
Vectors:
- Simple operations on vectors of elements
  -> First fetch all elements of all records then operate
- Large set of temporary values
- Global instruction bandwidth is of many simple operations
  -> Group like-elements of records into vectors
- Gather elements from memory: one stream buffer per record element type

6 Example – Vertex Transform
[Figure: data flow of a 4x4 vertex transform]
- input record: x, y, z, w
- intermediate results:
  t00*x, t10*x, t20*x, t30*x
  t01*y, t11*y, t21*y, t31*y
  t02*z, t12*z, t22*z, t32*z
  t03*w, t13*w, t23*w, t33*w
- result record: x', y', z', w'

7 Example
- Streams: encapsulate intermediate results -> enable small and fast LRFs
- Vectors: large working set of intermediates -> must use the global RF
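The contrast can be sketched in plain Python. This is an illustration under assumptions, not the machine's programming model: the matrix is indexed as t[i][j] (row i produces output component i), and both function names are hypothetical.

```python
def transform_stream(records, t):
    """Stream style: the kernel consumes one whole record at a time.

    The 16 products for a record exist only inside one loop iteration,
    so the live temporaries stay small (the LRF-friendly case).
    """
    out = []
    for rec in records:
        out.append(tuple(sum(t[i][j] * rec[j] for j in range(4)) for i in range(4)))
    return out


def transform_vector(records, t):
    """Vector style: operate on vectors of like elements.

    Every product becomes a full-length temporary vector, so the
    working set grows with the number of records.
    """
    n = len(records)
    cols = [[r[j] for r in records] for j in range(4)]  # xs, ys, zs, ws
    # 16 temporary vectors, each of length n
    prods = [[[t[i][j] * cols[j][k] for k in range(n)] for j in range(4)]
             for i in range(4)]
    res = [[sum(prods[i][j][k] for j in range(4)) for k in range(n)]
           for i in range(4)]
    return [tuple(res[i][k] for i in range(4)) for k in range(n)]


# Translation by (10, 20, 30) applied to two homogeneous vertices.
T = [[1, 0, 0, 10], [0, 1, 0, 20], [0, 0, 1, 30], [0, 0, 0, 1]]
verts = [(1, 2, 3, 1), (4, 5, 6, 1)]
stream_result = transform_stream(verts, T)
vector_result = transform_vector(verts, T)
```

Both versions compute the same result; they differ only in how many intermediate values are live at once.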

8 Instruction Set Architecture
Machine State
- Program Counter (pc)
- Scalar Registers: part of MIPS/ARM core
- Local Registers (LRF): local to each ALU in cluster
- Scratchpad: small RAM within the cluster
- Stream Buffers (SB): between SRF and clusters
  -> Serve to make SRF appear multi-ported

9 Instruction Set Architecture
Machine State (continued)
- Stream Register File (SRF): clustered memory that sources most data
- Stream Cache (SC): to make graph stream accesses efficient. With SRF or outside?
- Segment Registers: a set of registers to provide paging and protection
- Global Memory (M)

10 ISA: Instruction Types
Scalar processor
- Scalar: standard RISC
- Stream Load/Store
- Stream Prefetch (graph stream)
- Execute Kernel
Clusters
- Kernel Instructions: VLIW instructions

11 ISA: Memory Model
- Memory model for global shared addressing
- Segmented (to allow time-sharing?)
- Descriptor contains node and size information
  - Length of segment (power of 2)
  - Base address (aligned to multiple of length)
  - Range of nodes owning the data (power of 2)
  - Interleaving (which bits select nodes)
  - Cache behavior? (non-cached, read-only, (full?))
- No paging, no TLBs
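A segment descriptor of this kind could map a global address to a (node, local offset) pair roughly as follows. This is a sketch under assumptions: the field names, the `first_node` field, and the choice of a contiguous bit field starting at `interleave_shift` are all hypothetical; the slide only lists what the descriptor contains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentDescriptor:
    base: int              # aligned to a multiple of length
    length: int            # power of 2
    first_node: int        # first node of the owning range (assumed field)
    num_nodes: int         # power of 2
    interleave_shift: int  # lowest offset bit that selects the node (assumed encoding)

    def __post_init__(self):
        for name in ("length", "num_nodes"):
            v = getattr(self, name)
            if v <= 0 or v & (v - 1):
                raise ValueError(f"{name} must be a power of 2")
        if self.base % self.length:
            raise ValueError("base must be aligned to a multiple of length")

    def translate(self, addr):
        """Return (node, local_offset) for a global address inside the segment."""
        offset = addr - self.base
        if not 0 <= offset < self.length:
            raise ValueError("address outside segment")
        node_bits = self.num_nodes.bit_length() - 1
        node = self.first_node + ((offset >> self.interleave_shift) & (self.num_nodes - 1))
        # Squeeze the node-select bits out of the offset to form the local offset.
        low = offset & ((1 << self.interleave_shift) - 1)
        high = offset >> (self.interleave_shift + node_bits)
        return node, (high << self.interleave_shift) | low


seg = SegmentDescriptor(base=0x10000, length=0x10000,
                        first_node=8, num_nodes=4, interleave_shift=6)
```

Because lengths and node counts are powers of 2 and the base is aligned, the whole translation is shifts and masks, which fits the "no paging, no TLBs" goal.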

12 ISA: Caching
- Stream cache improves bandwidth and latency for graph accesses (irregular structures)
- Pseudo read-only (like a texture cache: changes very infrequently)
- Explicit gang-invalidation
- Scalar processor has instruction and data caches

13 Global Mechanisms
- Remote memory access
  - Processor can busy-wait on a location until a remote processor updates it
- Signal and Wait (on named broadcast signals)
- Fuzzy barriers (split barriers)
  - Processor signals "I'm done" and can continue with other work
  - When the next phase is reached, the processor waits for all other processors to signal
  - Barriers are named and can be implemented with signals and atomic ops
- Atomic remote operations
  - Fetch&op (add, or, etc.)
  - Compare&Swap
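A fuzzy (split) barrier can be sketched with Python threads. This is only an analogy: a `threading.Condition` stands in for the hardware signals and atomic ops, and the class and method names are hypothetical.

```python
import threading


class FuzzyBarrier:
    """Split barrier: signal() announces arrival, wait() blocks only later."""

    def __init__(self, parties):
        self.parties = parties
        self.count = 0
        self.generation = 0
        self.cond = threading.Condition()

    def signal(self):
        """'I'm done': never blocks; returns a ticket for the matching wait()."""
        with self.cond:
            ticket = self.generation
            self.count += 1
            if self.count == self.parties:  # last arrival opens the barrier
                self.count = 0
                self.generation += 1
                self.cond.notify_all()
            return ticket

    def wait(self, ticket):
        """Block until every party has signalled for this ticket's phase."""
        with self.cond:
            while self.generation == ticket:
                self.cond.wait()


def run_demo(n=4):
    barrier = FuzzyBarrier(n)
    phase1_done, seen_at_phase2 = [], []
    lock = threading.Lock()

    def worker(i):
        with lock:
            phase1_done.append(i)      # phase-1 work
        ticket = barrier.signal()      # "I'm done"
        _ = sum(range(1000))           # other work overlapping the barrier
        barrier.wait(ticket)           # next phase needs everyone's phase-1 result
        with lock:
            seen_at_phase2.append(len(phase1_done))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_at_phase2
```

The gap between `signal()` and `wait()` is where a processor does work that does not depend on the other processors, which is the point of splitting the barrier.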

14 Scan Example
Prefix-sum operation, recursively:
-> Higher-level processor ("thread"):
   - clear memory locations for partial sums and ready bits
   - signal S_i
   - poll ready bits and add to local sum when ready
-> Lower-level processor:
   - calculate local sum
   - wait on S_i
   - write local sum to prepared memory location
   - atomic update of ready bit in higher level
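One level of this protocol can be simulated with threads. This is a sketch under assumptions: a `threading.Event` stands in for the named signal S_i, a lock stands in for the atomic ready-bit update, and `scan_level` is a hypothetical name. Only the summation step described on the slide is modeled.

```python
import threading
import time


def scan_level(chunks):
    """Run one higher-level processor over len(chunks) lower-level processors."""
    n = len(chunks)
    partial = [None] * n
    ready = [None] * n
    s_i = threading.Event()   # stands in for the named signal S_i
    lock = threading.Lock()   # stands in for the atomic ready-bit update

    def lower(i):
        local = sum(chunks[i])  # calculate local sum
        s_i.wait()              # wait on S_i
        partial[i] = local      # write local sum to the prepared location
        with lock:
            ready[i] = True     # atomic update of ready bit

    workers = [threading.Thread(target=lower, args=(i,)) for i in range(n)]
    for w in workers:
        w.start()

    # Higher level: clear locations, signal, then poll and accumulate.
    for i in range(n):
        partial[i] = 0
        ready[i] = False
    s_i.set()
    total, pending = 0, set(range(n))
    while pending:
        for i in sorted(pending):
            with lock:
                is_ready = ready[i]
            if is_ready:
                total += partial[i]
                pending.discard(i)
        time.sleep(0)  # yield while polling
    for w in workers:
        w.join()
    return total, partial


total, partials = scan_level([[1, 2], [3, 4], [5]])
```

The lower level waits on S_i before writing, so it can never race with the clearing step; the full prefix sums would need a further pass that hands each chunk its offset, which the slide does not show.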

15 System Architecture

16 Node Microarchitecture

17 uArch: Scalar Processor
- Standard RISC (MIPS, ARM)
- Scalar ops and stream dispatch are interleaved (no synchronization needed)
- Accesses same memory space (SRF & global memory) as clusters
- I and D caches
- Small RTOS

18 uArch: Arithmetic Clusters
- 16 identical arithmetic clusters
- 2 ADD, 2 MUL, 1 DSQ, scratchpad (?)
- ALUs connect to SRF via Stream Buffers and Local Register Files
  -> LRF: one for each ALU input, 32 64-bit entries each
- Local inter-cluster crossbar
- Statically-scheduled VLIW control
- SIMD/MIMD?

19 uArch: Stream Register File
Stream Register File (SRF)
- Arranged in clusters parallel to Arithmetic Clusters
- Accessible by clusters, scalar processor, memory system
- Kernels refer to stream number (and offset?)
  -> Stream Descriptor Registers track start, end, direction of streams

20 uArch: Memory
- Address generator (above cache)
  - Creates a stream of addresses for strided accesses
  - Accepts a stream of addresses for gather/scatter
- Memory access
  - Check: In cache?
  - Check: In local memory?
  - Else: Get from network
- Network
  - Send and receive memory requests
- Memory Controller
  - Talks to SRF and to Network
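The address-generator modes and the lookup order can be sketched as follows. All names and parameters here (`strided_addresses`, `record_size`, `fetch_remote`, and so on) are hypothetical; dictionaries and lists stand in for the cache, local memory, and network.

```python
def strided_addresses(base, stride, count, record_size=1):
    """Generate the address stream for a strided access of `count` records."""
    for i in range(count):
        start = base + i * stride
        for k in range(record_size):
            yield start + k


def gather(memory, addresses):
    """Gather: consume a stream of addresses, produce a stream of values."""
    return [memory[a] for a in addresses]


def scatter(memory, addresses, values):
    """Scatter: write a stream of values to a stream of addresses."""
    for a, v in zip(addresses, values):
        memory[a] = v


def load(addr, cache, local_mem, fetch_remote):
    """Lookup order from the slide: cache, then local memory, else network."""
    if addr in cache:
        return cache[addr], "cache"
    if addr in local_mem:
        return local_mem[addr], "local"
    return fetch_remote(addr), "network"


memory = list(range(100, 120))
addrs = list(strided_addresses(base=2, stride=4, count=3, record_size=2))
values = gather(memory, addrs)
```

A strided access only needs a base, stride, and count, whereas gather/scatter needs the address stream itself to be supplied as data.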

21 Feeds and Speeds: in node
- 2 GByte DRDRAM local memory: 38 GByte/s
- On-chip memory: 64 GByte/s
- Stream registers: 256 GByte/s
- Local registers: 1520 GByte/s

22 Feeds and Speeds: Global
- Card-level (16 nodes): 20 GByte/s
- Backplane (64 cards): 10 GByte/s
- System (16 backplanes): 4 GByte/s
- Expect < 1 µs latency (500 ns?) for memory request to random address
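The figures on the two "Feeds and Speeds" slides imply a system size and a bandwidth taper, worked out below. The sketch assumes a fully populated hierarchy (every card, backplane, and system slot filled); the variable names are illustrative.

```python
NODES_PER_CARD = 16
CARDS_PER_BACKPLANE = 64
BACKPLANES_PER_SYSTEM = 16
LOCAL_MEMORY_GBYTE = 2  # DRDRAM per node

# Assumes every level of the hierarchy is fully populated.
total_nodes = NODES_PER_CARD * CARDS_PER_BACKPLANE * BACKPLANES_PER_SYSTEM
total_memory_gbyte = total_nodes * LOCAL_MEMORY_GBYTE

# In-node bandwidths in GByte/s, as listed on the in-node slide.
in_node_bw = {
    "local DRDRAM": 38,
    "on-chip memory": 64,
    "stream registers": 256,
    "local registers": 1520,
}
# Bandwidth of each level relative to local DRAM.
ratio_to_dram = {name: bw / in_node_bw["local DRDRAM"] for name, bw in in_node_bw.items()}

print(total_nodes, total_memory_gbyte)
for name, r in ratio_to_dram.items():
    print(f"{name}: {r:.1f}x DRAM bandwidth")
```

Under that assumption the system has 16 * 64 * 16 nodes, and the local registers supply 40 times the bandwidth of local DRAM.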

23 Open Issues
- 2-port DRF? Currently, the ALUs all have LRFs for each input

24 Open Issues
- Is rotate enough, or do we want a fully random-access SRF with reduced BW if accessing the same bank?
  - Rotate allows arbitrary linear rotation and is simpler
  - Full random access requires a big switch
    -> Can trade BW for size

25 Open Issues
- Do we need an explicitly managed cache (for example, for locking the root of a tree)?

26 Open Issues
- Do we want messaging? (probably yes)
  - allows elegant distributed control
  - allows complex "fetch&ops" (remote procedures)
  - can build software coherency protocols and such
- Do we need coherency in the scalar part?

27 Open Issues
- Is dynamic migration important?
  - Moving data from one node to another is not possible without pages or COMA

28 Open Issues
- Exceptions?
  - No external exceptions
  - Arithmetic overflow/underflow, div by 0, etc.
  - Exception on cache miss? (Can we guarantee no cache misses?)
  - Disrupts stream sequencing and control flow
- Interrupts and scalar/stream sync
  - Interrupts from Network?
  - From stream to scalar? From scalar to stream?

29 Experiments
- Conditionals experiment
  - Are predication and conditional streams sufficient?
  - Experiment with adding instruction sequencers for each cluster (quasi-MIMD)
  - Examine cost and performance
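The two conditional mechanisms named here can be contrasted in a small sketch. It rests on one common reading, which the slide does not spell out: predication evaluates both branches for every record and selects a result, while conditional streams route each record into a per-branch stream so that each kernel only sees records it needs. Function names are hypothetical.

```python
def predicated(records, cond, then_fn, else_fn):
    """Predication: every record pays for both branches; one result is kept."""
    out = []
    for r in records:
        a, b = then_fn(r), else_fn(r)  # both sides are always evaluated
        out.append(a if cond(r) else b)
    return out


def conditional_streams(records, cond, then_fn, else_fn):
    """Conditional streams: split the input, then run one kernel per stream."""
    taken = [r for r in records if cond(r)]
    not_taken = [r for r in records if not cond(r)]
    return [then_fn(r) for r in taken], [else_fn(r) for r in not_taken]


def is_even(r):
    return r % 2 == 0


def double(r):
    return r * 2


def negate(r):
    return -r


data = [1, 2, 3, 4]
pred_out = predicated(data, is_even, double, negate)
then_out, else_out = conditional_streams(data, is_even, double, negate)
```

Predication keeps all clusters in lockstep and preserves record order at the cost of wasted work; the split version does no wasted work but produces two separate output streams.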

