Streaming Supercomputer Strawman Bill Dally, Jung-Ho Ahn, Mattan Erez, Ujval Kapasi, Tim Knight, Ben Serebrin April 15, 2002.

Slides:



Advertisements
Similar presentations
Network II.5 simulator ..
Advertisements

CPU Structure and Function
Threads, SMP, and Microkernels
More on Processes Chapter 3. Process image _the physical representation of a process in the OS _an address space consisting of code, data and stack segments.
Slides Prepared from the CI-Tutor Courses at NCSA By S. Masoud Sadjadi School of Computing and Information Sciences Florida.
AMD OPTERON ARCHITECTURE Omar Aragon Abdel Salam Sayyad This presentation is missing the references used.
CGS 3763 Operating Systems Concepts Spring 2013 Dan C. Marinescu Office: HEC 304 Office hours: M-Wd 11: :30 AM.
Chapter 12 CPU Structure and Function. CPU Sequence Fetch instructions Interpret instructions Fetch data Process data Write data.
Computer Organization and Architecture
Computer Organization and Architecture
CSC457 Seminar YongKang Zhu December 6 th, 2001 About Network Processor.
CS 258 Parallel Computer Architecture Lecture 15.1 DASH: Directory Architecture for Shared memory Implementation, cost, performance Daniel Lenoski, et.
CSCI 8150 Advanced Computer Architecture Hwang, Chapter 1 Parallel Computer Models 1.2 Multiprocessors and Multicomputers.
OS2-1 Chapter 2 Computer System Structures. OS2-2 Outlines Computer System Operation I/O Structure Storage Structure Storage Hierarchy Hardware Protection.
June 11, 2002 SS-SQ-W: 1 Stanford Streaming Supercomputer (SSS) Spring Quarter Wrapup Meeting Bill Dally, Computer Systems Laboratory Stanford University.
Streaming Supercomputer Strawman Architecture November 27, 2001 Ben Serebrin.
Multiprocessors ELEC 6200: Computer Architecture and Design Instructor : Agrawal Name: Nam.
Inter Process Communication:  It is an essential aspect of process management. By allowing processes to communicate with each other: 1.We can synchronize.
Home: Phones OFF Please Unix Kernel Parminder Singh Kang Home:
Analysis and Performance Results of a Molecular Modeling Application on Merrimac Erez, et al. Stanford University 2004 Presented By: Daniel Killebrew.
Informationsteknologi Friday, November 16, 2007Computer Architecture I - Class 121 Today’s class Operating System Machine Level.
SSS Software Update Ian Buck Mattan Erez August 2002.
Chapter 12 CPU Structure and Function. Example Register Organizations.
Memory: Virtual MemoryCSCE430/830 Memory Hierarchy: Virtual Memory CSCE430/830 Computer Architecture Lecturer: Prof. Hong Jiang Courtesy of Yifeng Zhu.
1Hot Chips 2000Imagine IMAGINE: Signal and Image Processing Using Streams William J. Dally, Scott Rixner, Ujval J. Kapasi, Peter Mattson, Jinyung Namkoong,
Main Memory. Background Program must be brought (from disk) into memory and placed within a process for it to be run Main memory and registers are only.
Group 5 Alain J. Percial Paula A. Ortiz Francis X. Ruiz.
I/O Tanenbaum, ch. 5 p. 329 – 427 Silberschatz, ch. 13 p
General System Architecture and I/O.  I/O devices and the CPU can execute concurrently.  Each device controller is in charge of a particular device.
Processor Structure & Operations of an Accumulator Machine
ITEC 325 Lecture 29 Memory(6). Review P2 assigned Exam 2 next Friday Demand paging –Page faults –TLB intro.
Synchronization and Communication in the T3E Multiprocessor.
Chapter 8 Memory Management Dr. Yingwu Zhu. Outline Background Basic Concepts Memory Allocation.
Chapter 2 Parallel Architecture. Moore’s Law The number of transistors on a chip doubles every years. – Has been valid for over 40 years – Can’t.
A Programmable Processing Array Architecture Supporting Dynamic Task Scheduling and Module-Level Prefetching Junghee Lee *, Hyung Gyu Lee *, Soonhoi Ha.
Hardware process When the computer is powered up, it begins to execute fetch-execute cycle for the program that is stored in memory at the boot strap entry.
Chapter 2: Computer-System Structures Computer System Operation I/O Structure Storage Structure Storage Hierarchy Hardware Protection Network Structure.
CE Operating Systems Lecture 14 Memory management.
Multiple Processor Systems. Multiprocessor Systems Continuous need for faster computers –shared memory model ( access nsec) –message passing multiprocessor.
1 Memory Management (b). 2 Paging  Logical address space of a process can be noncontiguous; process is allocated physical memory whenever the latter.
Next Generation ISA Itanium / IA-64. Operating Environments IA-32 Protected Mode/Real Mode/Virtual Mode - if supported by the OS IA-64 Instruction Set.
Operating Systems ECE344 Ashvin Goel ECE University of Toronto Demand Paging.
Multilevel Caches Microprocessors are getting faster and including a small high speed cache on the same chip.
Hardware process When the computer is powered up, it begins to execute fetch-execute cycle for the program that is stored in memory at the boot strap entry.
Review of Computer System Organization. Computer Startup For a computer to start running when it is first powered up, it needs to execute an initial program.
Different Microprocessors Tamanna Haque Nipa Lecturer Dept. of Computer Science Stamford University Bangladesh.
Fundamentals of Programming Languages-II
The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.
Module 3 Distributed Multiprocessor Architectures.
1 Chapter Seven. 2 SRAM: –value is stored on a pair of inverting gates –very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: –value.
CDA-5155 Computer Architecture Principles Fall 2000 Multiprocessor Architectures.
On-chip Parallelism Alvin R. Lebeck CPS 220/ECE 252.
UT-Austin CART 1 Mechanisms for Streaming Architectures Stephen W. Keckler Computer Architecture and Technology Laboratory Department of Computer Sciences.
CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis.
Computer Architecture Lecture 12: Virtual Memory I
Chapter 2: Computer-System Structures
Distributed Shared Memory
CMSC 611: Advanced Computer Architecture
Lecture on High Performance Processor Architecture (CS05162)
Lecture 14 Virtual Memory and the Alpha Memory Hierarchy
Memory Management-I 1.
Implementing an OpenFlow Switch on the NetFPGA platform
Virtual Memory Overcoming main memory size limitation
Computer Architecture
Page Allocation and Replacement
Main Memory Background
Presentation transcript:

Streaming Supercomputer Strawman Bill Dally, Jung-Ho Ahn, Mattan Erez, Ujval Kapasi, Tim Knight, Ben Serebrin April 15, 2002

2 Outline Overview Operation Stream-ISA Kernel-ISA Micro-architecture Next steps

3 System Overview Roundtrip memory access latency ~ 500ns = 500 processor cycles

4 Board Overview Chips –16 SS nodes –4 router chips – provide 4 independent routing planes on each board Ports –32 to back-plane network –16 to global network

5 Node Overview

6 Node Operation [1] MIPS assembly COP2 instructions encode stream instructions VLIW microcode Called by instructions in the scalar program

7 Node Operation [2] Example Execution Transfer data from DRDRAM and network into SRF Execute Kernel k1 Execute Kernel k2 Synchronize (across nodes) Store results to DRDRAM and network Synchronize

8 Stream-ISA Visible State State which is used by a scalar (MIPS) program: MIPS registers –Standard processor registers –Coprocessor 2 interface registers – the MIPS program’s interface to the SSS SRF Global memory: all memory on all nodes can be addressed by any node Control registers: –Segment registers: implement virtual memory –Stream descriptor registers & memory descriptor registers: hold parameters such as length, record size, etc. Stream cache

9 Stream Instruction Set MIPS instructions –Standard processor instructions –COP2 instruction used to issue stream instructions Stream instructions –Write control registers –Stream load, store –Stream cache prefetch, invalidate –Kernel load, execute More on… –Messaging instructions –Global Synchronization –Exceptions/Interrupts Open Issues –COP2 interface –Critical sections for sending stream instructions

10 Memory Model Address space shared by scalar/stream units No data-duplication of global data in local DRAM Virtual memory implemented via programmable segment registers –Virtual address: (SegNum, SegOffset) –Physical address: (NodeNum, LocalOffset) Segments can be interleaved across nodes in user- specified amounts. Read-mostly stream cache with gang invalidation –MIPS instructions specify which stream elements to cache No HW paging support Open Issues –Cache write policy –Coherence

11 Global Synchronization Minimal global synchronization and communication mechanisms –Barrier –Remote update - Fetch-and-add/Compare-and-swap Implement all other synchronization primitives in software –General messaging mechanism interrupts scalar processor –e.g., General-purpose synchronization For example, end a parallel search as soon as one node finds a match Open Issues –Hardware acceleration Locking: a table with 1K entries for locked locations

12 Exceptions/Interrupts Scalar processor handles all interrupts –e.g., Message received from remote node interrupts scalar processor Stream operations delay exceptions until the end of the operation –Examples: divide-by-0 in clusters, invalid address in memory system –Information about exception saved –Scalar processor must read this state to figure out nature of exception Open Issues –Best way to save info about stream operation exceptions and transfer this to scalar processor –Multithreading scalar processor

13 Kernel-ISA Visible State State which is used by a kernel function: Per-cluster: –Local register files Per ALU register files Cluster condition code registers Scratchpad: small indexable register file, used for: –lookup tables –complex data structures –register spills Per-node: –Microcontroller register file, used for: Loop counters Data to be broadcast to clusters Passing parameters between the MIPS processor and the kernel functions –Microcode store contains the kernel VLIW instructions. –Microcontroller condition code registers, for looping and conditional streams –Microprogram counter

14 Kernel Instruction Set Kernel microprogram consists of VLIW instructions Each cluster in the node is sent the same control signals each cycle (i.e.) they execute in lockstep (SIMD) Kernel VLIW instructions control: –Microcontroller units and register files –Cluster arithmetic units and register files –Inter-cluster switch –Transfers of stream data between the clusters and the SRF More on… –Conditional Operations

15 Conditionals Hardware select (analog to C “?:” operator) Scratchpad (using register indexing) Conditional Streams Open Issues –Assuming that Brook will retain ‘if-statements’ Need to automatically map code to conditional streams or predication Requires a method to “split” a kernel

16 Scalar Execution Unit Scalar processor is a MIPS (or Tensilica) core with data and instruction caches Scalar processor issues stream instructions to the stream controller –Stream Controller stores them in a scoreboard –Instructions issued when 1.all required resources are available 2.all inter-instruction dependencies have been satisfied. Open Issues –On-chip L2 cache for MIPS Core –Best method to encode dependencies between stream instructions

17 Stream Execution Unit Kernel instructions issued by microcontroller –SIMD clusters –VLIW control of ALUs within a cluster Open Issues –“Aspect Ratio” –Inter-cluster switch implementation –Local register file organization –FU Mix –Division / Square root support –Integer unit for logical ops

18 Stream Register File Single-ported SRAM 64KW 1024 x 2048b 64W/cycle Arbiter To/From Arithmetic Clusters Streambuffers

19 Memory System Memory Unit –Translates virtual addresses –Routes requests and replies –Frames new network message for each external word Requestors –Address generators –Scalar processor –Stream cache –Network Suppliers –Local DRAM –Stream cache –Network Open Issues –Multidimensional strides

20 Network Interface Flit-reservation flow-control Expect to be able to service messages faster than arrival rate Open Issues: –Need to flesh out the details of this module

21 Next Steps SVM will be used to study system-wide architecture –e.g. System bandwidths, synchronization mechanisms, etc. Cycle-accurate simulator (ssim) –Used for node architecture studies –Validate multi-node results –Current emphasis on getting single-node simulations to work –Feedback single-node results (i.e., kernel results) into SVM for quick but fairly accurate system-level results Global architecture studies –Feasibility, Global mechanisms, network topologies Node architecture studies –Aspect ratio, FU mix, inter-cluster switch, conditionals, SRF size/bandwidth Most are gated on getting apps ported We can start with Imagine apps for now –Area/Power estimates End-of-quarter project-wide goal is to have Brook apps running on SVM and ssim. –We should be able to conduct architectural experiments at that point