Streaming Supercomputer Strawman Architecture November 27, 2001 Ben Serebrin.

High-level Programming Model
- Streams are partitioned across nodes

Programming: Partitioning
- Across nodes: straightforward domain decomposition
- Within nodes there are two choices (in software):
  - Domain decomposition: each cluster receives neighboring records (see the sketch below)
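A minimal sketch of this two-level assignment, assuming 64 nodes of 16 clusters and block (domain) decomposition at both levels; the constants and function names are invented for illustration, not part of the strawman spec:

    #include <stdio.h>

    #define NODES             64
    #define CLUSTERS_PER_NODE 16

    /* Across nodes: block (domain) decomposition -- each node owns a
       contiguous slice of the stream. */
    static int owner_node(long record, long n_records) {
        long block = (n_records + NODES - 1) / NODES;
        return (int)(record / block);
    }

    /* Within a node: domain decomposition again -- each cluster receives
       a run of neighboring records from the node's slice. */
    static int owner_cluster(long record, long n_records) {
        long node_block    = (n_records + NODES - 1) / NODES;
        long within_node   = record % node_block;
        long cluster_block = (node_block + CLUSTERS_PER_NODE - 1) / CLUSTERS_PER_NODE;
        return (int)(within_node / cluster_block);
    }

    int main(void) {
        long n = 1 << 20;
        printf("record 123456 -> node %d, cluster %d\n",
               owner_node(123456, n), owner_cluster(123456, n));
        return 0;
    }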

High-level Programming Model
- Parallelism within a node

Streams vs. Vectors
Streams:
- Compound operations on records: traverse the operations first and the records second (all operations are applied to a record before moving to the next)
- Temporary values are encapsulated within the kernel
- Global instruction bandwidth is in units of kernels
- Group whole records into streams
- Gather records from memory: one stream buffer per record type
Vectors:
- Simple operations on vectors of elements: first fetch all elements of all records, then operate
- A large set of temporary values
- Global instruction bandwidth is in units of many simple operations
- Group like elements of records into vectors
- Gather elements from memory: one stream buffer per record element type
(A concrete contrast is sketched below.)
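To make the contrast concrete, here is a hedged C sketch with an invented normalize computation: the stream form runs one compound kernel per record with temporaries as kernel locals, while the vector form sweeps each simple operation over all elements and carries full-length temporary arrays.

    #include <math.h>
    #include <stddef.h>

    typedef struct { float x, y, z; } Rec;

    /* Stream style: one compound kernel per record. The temporaries r2
       and inv live only inside the kernel body, i.e. in cluster LRFs. */
    void normalize_stream(Rec *out, const Rec *in, size_t n) {
        for (size_t i = 0; i < n; i++) {
            float r2  = in[i].x * in[i].x + in[i].y * in[i].y + in[i].z * in[i].z;
            float inv = 1.0f / sqrtf(r2);
            out[i].x = in[i].x * inv;
            out[i].y = in[i].y * inv;
            out[i].z = in[i].z * inv;
        }
    }

    /* Vector style: like elements are grouped into vectors x[], y[], z[];
       each simple operation sweeps a whole vector, so the temporaries
       r2[] and inv[] become full-length arrays -- a large, global working
       set instead of kernel-local values. */
    void normalize_vector(float *x, float *y, float *z,
                          float *r2, float *inv, size_t n) {
        for (size_t i = 0; i < n; i++) r2[i]  = x[i]*x[i] + y[i]*y[i] + z[i]*z[i];
        for (size_t i = 0; i < n; i++) inv[i] = 1.0f / sqrtf(r2[i]);
        for (size_t i = 0; i < n; i++) x[i] *= inv[i];
        for (size_t i = 0; i < n; i++) y[i] *= inv[i];
        for (size_t i = 0; i < n; i++) z[i] *= inv[i];
    }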

Example – Vertex Transform
[Figure: a 4x4 transform applied to an input record (x, y, z, w). The sixteen products t_ij * component are the intermediate results; the result record is
  x' = t00*x + t01*y + t02*z + t03*w
  y' = t10*x + t11*y + t12*z + t13*w
  z' = t20*x + t21*y + t22*z + t23*w
  w' = t30*x + t31*y + t32*z + t33*w]
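A minimal C sketch of this kernel (the Vec4 type and the row-major matrix layout are invented for the example): every t[i][j] * component product is an ordinary local, which is exactly what lets the stream formulation keep intermediates in small per-ALU LRFs.

    typedef struct { float v[4]; } Vec4;

    /* Apply the 4x4 transform t to one input record. The accumulator is
       an intermediate value that never leaves the kernel. */
    Vec4 vertex_transform(const float t[4][4], Vec4 in) {
        Vec4 out;
        for (int i = 0; i < 4; i++) {
            float acc = 0.0f;                /* kernel-local intermediate */
            for (int j = 0; j < 4; j++)
                acc += t[i][j] * in.v[j];
            out.v[i] = acc;
        }
        return out;
    }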

Example: Encapsulation
- Streams encapsulate intermediate results, enabling small and fast LRFs
- Vectors have a large working set of intermediates and must use the global RF

Instruction Set Architecture: Machine State
- Program Counter (pc)
- Scalar Registers: part of the MIPS/ARM core
- Local Registers (LRF): local to each ALU in a cluster
- Scratchpad: small RAM within the cluster
- Stream Buffers (SB): between the SRF and the clusters; serve to make the SRF appear multi-ported

Instruction Set Architecture: Machine State (continued)
- Stream Register File (SRF): clustered memory that sources most data
- Stream Cache (SC): to make graph-stream accesses efficient; within the SRF or outside?
- Segment Registers: a set of registers to provide paging and protection
- Global Memory (M)

ISA: Instruction Types
Scalar processor:
- Scalar: standard RISC
- Stream Load/Store
- Stream Prefetch (graph stream)
- Execute Kernel
Clusters:
- Kernel instructions: VLIW instructions

ISA: Memory Model
Memory model for global shared addressing:
- Segmented (to allow time-sharing?)
- Descriptor contains node and size information (a possible layout is sketched below):
  - Length of segment (power of 2)
  - Base address (aligned to a multiple of the length)
  - Range of nodes owning the data (power of 2)
  - Interleaving (which address bits select nodes)
  - Cache behavior? (non-cached, read-only, full?)
- No paging, no TLBs
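One possible C rendering of such a descriptor and its node decode, purely illustrative since the transcript fixes no field widths:

    #include <stdint.h>

    typedef struct {
        uint64_t base;        /* aligned to a multiple of the length     */
        uint8_t  log2_len;    /* segment length, a power of two          */
        uint16_t first_node;  /* first node owning the data              */
        uint8_t  log2_nodes;  /* number of owning nodes, a power of two  */
        uint8_t  ilv_shift;   /* which address bits select the node      */
        uint8_t  cache_mode;  /* e.g. non-cached / read-only             */
    } SegDesc;

    /* Which node holds this address? The interleave field picks the
       node-select bits out of the segment offset. */
    static uint16_t addr_to_node(const SegDesc *s, uint64_t addr) {
        uint64_t off = addr - s->base;                       /* offset in segment */
        uint64_t sel = (off >> s->ilv_shift) & ((1u << s->log2_nodes) - 1);
        return (uint16_t)(s->first_node + sel);
    }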

ISA: Caching
- Stream cache improves bandwidth and latency for graph accesses (irregular structures)
- Pseudo read-only (like a texture cache: contents change very infrequently)
- Explicit gang invalidation
- Scalar processor has instruction and data caches

Global Mechanisms
- Remote memory access: a processor can busy-wait on a location until a remote processor updates it
- Signal and Wait (on named broadcast signals)
- Fuzzy (split) barriers:
  - A processor signals "I'm done" and can continue with other work
  - When the next phase is reached, the processor waits for all other processors to signal
  - Barriers are named and can be implemented with signals and atomic ops (see the sketch below)
- Atomic remote operations:
  - Fetch&op (add, or, etc.)
  - Compare&swap
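To make the split-barrier idea concrete, here is a minimal one-shot sketch in C, with C11 atomics standing in for the machine's remote fetch&add and busy-wait mechanisms; the function names and the single shared counter are assumptions for illustration:

    #include <stdatomic.h>

    static atomic_int arrived;   /* named barrier counter in global memory */

    /* "I'm done": one remote fetch&add, modeled with a C11 atomic. */
    static void barrier_signal(void) { atomic_fetch_add(&arrived, 1); }

    /* Busy-wait on the location until every processor has signaled. */
    static void barrier_wait(int nprocs) {
        while (atomic_load(&arrived) < nprocs) /* spin */;
    }

    /* Usage: signal early, overlap independent work, block only when the
       next phase actually needs everyone. (One-shot for simplicity;
       reuse would need a second counter or a sense-reversing variant.) */
    void phase(int nprocs) {
        /* ... produce results other processors depend on ... */
        barrier_signal();
        /* ... continue with other work while others catch up ... */
        barrier_wait(nprocs);
        /* ... next phase ... */
    }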

Scan Example
Prefix-sum operation, done recursively (modeled in C below):
- Higher-level processor ("thread"):
  1. Clear memory locations for partial sums and ready bits
  2. Signal S_i
  3. Poll ready bits and add to the local sum when ready
- Lower-level processor:
  1. Calculate local sum
  2. Wait on S_i
  3. Write local sum to the prepared memory location
  4. Atomically update the ready bit in the higher level
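Below is a single-address-space C model of one level of this protocol, a sketch only: C11 atomics stand in for the named broadcast signal S_i and the remote atomic ready-bit update, and WORKERS, the data layout, and the function names are assumptions.

    #include <stdatomic.h>

    #define WORKERS 16

    static long        partial[WORKERS];   /* prepared partial-sum slots      */
    static atomic_uint ready;              /* one ready bit per worker        */
    static atomic_int  go;                 /* models the broadcast signal S_i */

    /* Higher-level processor ("thread") */
    long coordinator(void) {
        for (int i = 0; i < WORKERS; i++) partial[i] = 0;  /* clear sums     */
        atomic_store(&ready, 0);                           /* clear bits     */
        atomic_store(&go, 1);                              /* signal S_i     */
        long sum = 0;
        unsigned seen = 0;
        while (seen != (1u << WORKERS) - 1) {              /* poll ready bits */
            unsigned now = atomic_load(&ready) & ~seen;
            for (int i = 0; i < WORKERS; i++)
                if (now & (1u << i)) sum += partial[i];    /* add when ready */
            seen |= now;
        }
        return sum;
    }

    /* Lower-level processor i */
    void worker(int i, const long *data, int n) {
        long local = 0;
        for (int k = 0; k < n; k++) local += data[k];      /* local sum      */
        while (!atomic_load(&go)) /* wait on S_i */;
        partial[i] = local;                                /* prepared slot  */
        atomic_fetch_or(&ready, 1u << i);                  /* atomic ready bit */
    }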

System Architecture

Node Microarchitecture

uArch: Scalar Processor
- Standard RISC (MIPS, ARM)
- Scalar ops and stream dispatch are interleaved (no synchronization needed)
- Accesses the same memory space (SRF & global memory) as the clusters
- I and D caches
- Small RTOS

uArch: Arithmetic Clusters
- 16 identical arithmetic clusters
- Each: 2 ADD, 2 MUL, 1 DSQ, scratchpad (?)
- ALUs connect to the SRF via Stream Buffers and Local Register Files
  - LRF: one for each ALU input
- Local inter-cluster crossbar
- Statically scheduled VLIW control
- SIMD/MIMD?

uArch: Stream Register File
- Stream Register File (SRF)
- Arranged in clusters parallel to the arithmetic clusters
- Accessible by the clusters, the scalar processor, and the memory system
- Kernels refer to a stream number (and offset?)
  - Stream Descriptor Registers track the start, end, and direction of streams

uArch: Memory
- Address generator (above the cache; its two modes are sketched below):
  - Creates a stream of addresses for strided access
  - Accepts a stream of addresses for gather/scatter
- Memory access: check: in cache? check: in local memory? else: get from the network
- Network: sends and receives memory requests
- Memory controller: talks to the SRF and to the network
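As an illustration of the two address-generator modes, a hedged C sketch; the names, widths, and index-stream format are assumptions:

    #include <stdint.h>
    #include <stddef.h>

    /* Strided mode: the generator creates the address stream itself. */
    void gen_strided(uint64_t *addr_out, uint64_t base,
                     uint64_t stride, size_t n) {
        for (size_t i = 0; i < n; i++)
            addr_out[i] = base + i * stride;
    }

    /* Gather/scatter mode: the generator accepts a stream of indices
       and turns it into an address stream. */
    void gen_gather(uint64_t *addr_out, uint64_t base, size_t elem_size,
                    const uint32_t *index_stream, size_t n) {
        for (size_t i = 0; i < n; i++)
            addr_out[i] = base + (uint64_t)index_stream[i] * elem_size;
    }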

Feeds and Speeds: In Node
- 2 GByte DRDRAM local memory: 38 GByte/s
- On-chip memory: 64 GByte/s
- Stream registers: 256 GByte/s
- Local registers: 1520 GByte/s
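(Note the hierarchy these numbers imply: the local registers supply 40x the off-chip DRAM bandwidth (1520/38) and the stream registers roughly 6.7x (256/38), which is why kernels keep temporaries in LRFs and stage streams through the SRF.)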

Feeds and Speeds: Global
- Card level (16 nodes): 20 GByte/s
- Backplane (64 cards): 10 GByte/s
- System (16 backplanes): 4 GByte/s
- Expect < 1 us latency (500 ns?) for a memory request to a random address

Open Issues
- 2-port DRF? Currently, the ALUs all have an LRF for each input

Open Issues
- Is rotate enough, or do we want a fully random-access SRF (with reduced BW when accessing the same bank)?
  - Rotate allows arbitrary linear rotation and is simpler
  - Full random access requires a big switch; can trade BW for size

Open Issues
- Do we need an explicitly managed cache (for locking the root of a tree, for example)?

Open Issues
- Do we want messaging? (probably yes)
  - Allows elegant distributed control
  - Allows complex "fetch&ops" (remote procedures)
  - Can build software coherency protocols and the like
- Do we need coherency in the scalar part?

Open Issues
- Is dynamic migration important? Moving data from one node to another is not possible without pages or COMA

Open Issues
- Exceptions?
  - No external exceptions
  - Arithmetic overflow/underflow, divide by 0, etc.
  - Exception on cache miss? (Can we guarantee no cache misses?)
  - Disrupts stream sequencing and control flow
- Interrupts and scalar/stream sync
  - Interrupts from the network? From stream to scalar? From scalar to stream?

Experiments
Conditionals experiment:
- Are predication and conditional streams sufficient?
- Experiment with adding an instruction sequencer to each cluster (quasi-MIMD)
- Examine cost and performance