Presentation transcript:

Rough Schedule
1:30-2:15  IRAM overview
2:15-3:00  ISTORE overview
break
3:15-3:30  Financial
4:00-5:00  Future

IRAM Hardware and Software
Kathy Yelick
Computer Science Division, UC Berkeley

Intelligent RAM: IRAM
[Figure: two die options side by side: a processor in a logic fab (Proc, L2$, bus, I/O) with external DRAM, vs. an IRAM in a DRAM fab (DRAM, Proc, bus, I/O)]
Microprocessor & DRAM on a single chip:
- 10X capacity vs. on-chip SRAM
- on-chip memory latency 5-10X, bandwidth 50-100X
- improve energy efficiency 2X-4X (no off-chip bus)
- serial I/O 5-10X vs. buses
- smaller board area/volume
IRAM advantages extend to:
- a single-chip system
- a building block for larger systems
Fab lines for logic and for memory cost $B each, so a single chip means either the processor is built in the DRAM fab or the memory in the logic fab.

VIRAM: System on a Chip
- 0.18 um embedded DRAM/logic (EDL) process
- 16 MB DRAM, 8 banks
- MIPS scalar core and caches @ 200 MHz
- 4 64-bit vector unit pipelines @ 200 MHz
- 17x17 mm, 2 Watts target
- 25.6 GB/s memory bandwidth (6.4 GB/s per direction and per Xbar)
- 0.8 Gflops (64-bit), 6.4 GOPs (16-bit); see the arithmetic sketch below
[Floorplan: Memory (64 Mbits / 8 MBytes), 4 vector pipes/lanes, CPU + $, Xbar, Memory (64 Mbits / 8 MBytes)]
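A quick sanity check of these peak numbers as a minimal C program. The lane count, clock, and per-crossbar bandwidth come from the slide; the assumptions that a 16-bit multiply-add counts as 2 operations per subword and that the 25.6 GB/s splits as 2 crossbars x 2 directions are ours, inferred from the quoted 6.4 GOPs and the parenthetical above.

#include <stdio.h>

int main(void) {
    double clock_hz = 200e6;  /* scalar core and vector unit at 200 MHz */
    int lanes = 4;            /* 64-bit vector pipelines */

    /* 64-bit FP: one flop per lane per cycle */
    double gflops64 = lanes * clock_hz / 1e9;
    /* 16-bit: 4 subwords per 64-bit lane, 2 ops per multiply-add (assumed) */
    double gops16 = lanes * 4 * 2 * clock_hz / 1e9;
    /* memory: 2 crossbars x 2 directions x 6.4 GB/s each (assumed split) */
    double mem_gbs = 2 * 2 * 6.4;

    printf("64-bit peak: %.1f Gflop/s\n", gflops64);  /* 0.8  */
    printf("16-bit peak: %.1f GOP/s\n", gops16);      /* 6.4  */
    printf("memory:      %.1f GB/s\n", mem_gbs);      /* 25.6 */
    return 0;
}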

IRAM Chip Update
- IBM supplying embedded DRAM/logic (100%)
  - Agreement in place and technology files available
- MIPS supplying scalar core (100%)
  - MIPS processor, caches, TLB
- MIT supplying FPU (100%)
- VIRAM-1 tape-out scheduled for late 2000
- Simplifications: floating point, network interface

VIRAM-1 Chip Design Status
- MIPS scalar core
  - Synthesizable RTL code received from MIPS
  - Cache RAMs to be compiled for IBM technology
- FPU
  - RTL code almost complete
- Vector unit
  - RTL models for sub-blocks developed; currently being integrated and tested
  - Control logic to be compiled for IBM technology
  - Full-custom layout for multipliers/adders developed; layout for shifters to be developed
- Memory system
  - Synthesizable model for DRAM controllers done
  - To be integrated with IBM DRAM macros
  - Full-custom layout for crossbar under development
- Testing infrastructure
  - Environment developed for automatic test & validation
  - Directed tests for single/multiple instruction groups developed
  - Random instruction sequence generator developed

IRAM Architecture Update
- ISA mostly frozen since 6/99
  - Changes in 2H99 for a better fixed-point model (illustrated below) and some instructions for short vectors (auto-increment and in-register permutations)
  - Minor changes in 1H00 to address the new co-processor interface in the MIPS core
- ISA manual publicly available at http://www.cs.berkeley.edu
- Suite of simulators actively used
  - vsim-isa (functional); major rewrite underway for the new scalar processor; all UCB code
  - vsim-p (performance), vsim-db (debugger), vsim-sync (memory synchronization)
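What a "better fixed-point model" buys, in rough C terms: saturating arithmetic and rounded fixed-point multiplies, so DSP-style code needs no explicit overflow checks. The exact VIRAM-1 semantics are defined in the ISA manual linked above; this sketch only illustrates the generic operations.

#include <stdint.h>

/* Saturating 16-bit add: clamp to the representable range instead of
   wrapping on overflow. */
int16_t sat_add16(int16_t a, int16_t b) {
    int32_t s = (int32_t)a + b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* Q15 fixed-point multiply with rounding: (a*b + 0.5 ulp) >> 15,
   saturated so that -1.0 * -1.0 yields the largest positive value. */
int16_t mul_q15(int16_t a, int16_t b) {
    int32_t p = ((int32_t)a * b + (1 << 14)) >> 15;
    if (p > INT16_MAX) p = INT16_MAX;
    return (int16_t)p;
}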

IRAM Compiler Status
[Diagram: C, C++, and Fortran frontends feed a vectorizer and the PDGCS code generators, with C90 and IRAM backends; a retarget of the Cray compiler backend]
Steps in compiler development:
- Build MIPS backend (done)
- Build VIRAM backend for vectorized loops (done; strip-mining sketched below)
- Instruction scheduling for VIRAM-1 (works, but could be improved)
- Insertion of memory barriers (using the Cray strategy, improving)
- Optimizations for short loops (reduce overhead)
- Feed results back to Cray, new version from Cray (ongoing)
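The basic code shape the VIRAM backend emits for a vectorized loop is strip-mining: process at most MVL elements per trip and set the vector length register (VL) to the strip size. A minimal sketch in C; MVL = 128 matches the 16-bit case quoted later and varies with data width on the real machine.

enum { MVL = 128 };  /* max vector length for 16-bit elements (assumed) */

/* y[i] += a * x[i], strip-mined the way the vectorizer would emit it:
   each inner loop corresponds to single vector instructions of length vl. */
void saxpy16(int n, short a, const short *x, short *y) {
    for (int i = 0; i < n; i += MVL) {
        int vl = (n - i < MVL) ? (n - i) : MVL;  /* set VL for this strip */
        for (int j = 0; j < vl; j++)
            y[i + j] += a * x[i + j];
    }
}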

IRAM Compiler Update
- Study of compiler quality using 100 "Dongarra loops"
- 70 vectorized
  - Average 10x reduction in dynamic instruction count
  - Average vector length of 42
- 30 did not, usually due to a dependence (see the example below)
  - Some reductions missed
  - Vector versions of math libraries (sin, cos, etc.) needed
  - Some failed due to bugs in the benchmark
- Identified 2 specific areas for improvement in loop overhead
  - Use VL and MVL more carefully
  - Use the auto-increment instruction more extensively
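A minimal illustration of the dividing line, assuming nothing about the specific Dongarra loops: the first loop has independent iterations and vectorizes; the second carries a value from one iteration to the next, the kind of dependence that kept most of the remaining 30 loops scalar.

/* Independent iterations: vectorizes. */
void add_vectors(int n, double *a, const double *b, const double *c) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Loop-carried dependence: a[i] needs last iteration's a[i-1], so the
   iterations form a serial chain and cannot be vectorized. */
void recurrence(int n, double *a) {
    for (int i = 1; i < n; i++)
        a[i] = 0.5 * a[i - 1] + 1.0;
}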

Compiled Applications Update
Applications using the compiler:
- Speech processing under development
  - Developed a new small-memory algorithm for speech processing
  - Uses some existing kernels (FFT and MM)
  - Vector search algorithm is the most challenging part
- DIS image understanding application under development
  - Compiles, but does not yet vectorize well
- Singular Value Decomposition (challenging)
  - Better than 2 VLIW machines (TI C67 and Trimedia TM-1100)
- Kernels: SAXPY, MVM, etc. (a BLAS-2 sketch follows below)
  - BLAS-1,2 work well on IRAM because of memory bandwidth
  - Will include DIS stress-marks
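Why BLAS-2 suits IRAM, in code: matrix-vector multiply touches each element of A exactly once, so there is no cache reuse to exploit and performance is set by memory bandwidth, which is exactly what the on-chip DRAM supplies. A plain C sketch; row-major layout assumed.

#include <stddef.h>

/* y = A*x for an m-by-n row-major A: one pass over A, unit-stride
   inner loop, bandwidth-bound rather than flop-bound. */
void mvm(int m, int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < (size_t)n; j++)
            sum += A[(size_t)i * n + j] * x[j];
        y[i] = sum;
    }
}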

[Chart: SVD performance, 10n x n matrix, rank 10 (from Herman, Loo, Tang, CS252 project)]

Hand-Coded Applications Update
[Chart: performance of image processing kernels (old FPU model); note the BLAS-2 performance]

Problem: General Element Permutation
[Figure: full crossbar routing any of 16 source elements to any destination]
Hardware for a full vector permutation instruction (128 16b elements, 256b datapath):
- Datapath: 16 x 16 (x 16b) crossbar; scales as O(N^2)
- Control: 16 16-to-1 multiplexors; scales as O(N log N)
- Time/energy wasted on the wide vector register file port
(Reference semantics sketched below.)
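For reference, the semantics such a fully general permutation instruction would have to implement in one shot; names and types here are illustrative, not the VIRAM ISA.

/* dst[i] = src[idx[i]] for every element: any destination may take any
   source, which is what forces the O(N^2) crossbar described above. */
void vperm(int vl, short *dst, const short *src, const unsigned char *idx) {
    for (int i = 0; i < vl; i++)
        dst[i] = src[idx[i]];
}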

Simple Vector Permutations
[Figure: butterfly exchange pattern across 16 elements]
Simple steps of butterfly permutations:
- A register provides the butterfly radix
- Separate instructions for moving elements to the left/right
Sufficient semantics for:
- Fast reductions of vector registers (dot products); see the sketch below
- Fast FFT kernels
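How the butterfly moves enable fast reductions, sketched in C under the assumption that the live length is a power of two: each step moves the upper half left and adds it in, so a length-vl sum takes log2(vl) vector operations instead of vl-1 scalar ones.

/* Tree reduction of v[0..vl-1] using halving "move left + add" steps,
   the pattern the butterfly permutation instructions provide. */
short vsum(int vl, short *v) {
    for (int half = vl / 2; half >= 1; half /= 2) {
        for (int i = 0; i < half; i++)
            v[i] += v[i + half];  /* one butterfly-move + vector add */
    }
    return v[0];
}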

Hardware for Simple Permutations
[Figure: bus-and-shifter datapath]
Hardware for 128 16b elements, 256b datapath:
- Datapath: 2 buses, 8 tristate drivers, 4 multiplexors, 4 shifters (by 0, 16b, 32b only); scales as O(N)
- Control: 6 control cases; scales as O(N)
Other benefits:
- Consecutive result elements written together
- Buses used only for small radices

FFT: Uses In-Register Permutations
[Chart: FFT performance with and without in-register permutations]

Summary
- IRAM takes advantage of high on-chip bandwidth
  - BLAS-2 performance confirms this
- The Vector IRAM ISA utilizes this bandwidth
  - Unit, strided, and indexed memory access patterns supported (sketched below)
  - Exploits fine-grained parallelism, even with pointer chasing
- Compiler
  - Well-understood compiler model, semi-automatic
  - Still some work on code generation quality
- Application benchmarks
  - Compiled and hand-coded
  - Include FFT, SVD, MVM, sparse MVM, and other kernels used in image and signal processing
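The three access patterns in plain C; on VIRAM each loop body maps onto vector memory instructions (unit-stride, strided, and indexed/gather respectively). Function names are ours, for illustration only.

#include <stddef.h>

void load_unit(int n, double *dst, const double *src) {
    for (int i = 0; i < n; i++) dst[i] = src[i];                     /* unit stride */
}

void load_strided(int n, double *dst, const double *src, int stride) {
    for (int i = 0; i < n; i++) dst[i] = src[(ptrdiff_t)i * stride]; /* strided */
}

void load_indexed(int n, double *dst, const double *src, const int *index) {
    for (int i = 0; i < n; i++) dst[i] = src[index[i]];              /* gather */
}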

IRAM as Building Block for ISTORE
- System-on-a-chip enables computer, memory, and redundant network interfaces without significantly increasing the size of the disk
- Target for 5-7 years out, the building block: a 2006 MicroDrive integrated with IRAM
  - 9 GB disk, 50 MB/sec (projected)
  - connected via crossbar switch
  - O(10) Gflops
- 10,000+ nodes fit into one rack!