Welcome
Three related projects at Berkeley
– Intelligent RAM (IRAM)
– Intelligent Storage (ISTORE)
– OceanStore
Groundrules
– Questions are welcome during talks
– Feedback required Friday morning
– Time for rafting and talking
Introductions

Overview of the IRAM Project
Kathy Yelick
Aaron Brown, James Beck, Rich Fromm, Joe Gebis, Paul Harvey, Adam Janin, Dave Judd, Christoforos Kozyrakis, David Martin, Thinh Nguyen, David Oppenheimer, Steve Pope, Randi Thomas, Noah Treuhaft, Sam Williams, John Kubiatowicz, and David Patterson
http://iram.cs.berkeley.edu/
Summer 2000 Retreat

Outline
– IRAM Motivation
– VIRAM architecture and VIRAM-1 microarchitecture
– Benchmarks
– Compiler
– Back to the future: 1997 and today

Original IRAM Motivation: Processor-DRAM Gap (latency)
[Chart: performance (log scale, 1 to 1000) vs. time, 1980-2000. CPU performance improves at 60%/yr. while DRAM latency improves at only 7%/yr., so the processor-memory performance gap grows about 50% per year.]
Speaker note: the latency cliché; note that the x86 did not have on-chip cache until 1989.

Intelligent RAM: IRAM
[Diagram: today's organization, with the processor, L2$, and bus built in a logic fab and the DRAM and I/O on separate chips from a DRAM fab, versus IRAM, which puts the microprocessor and DRAM on a single chip.]
Microprocessor & DRAM on a single chip:
– 10X capacity vs. DRAM
– on-chip memory latency 5-10X, memory bandwidth 50-100X
– improve energy efficiency 2X-4X (no off-chip bus)
– smaller board area/volume
IRAM advantages extend to:
– a single-chip system
– a building block for larger systems
Speaker note: fabs cost billions of dollars ($B), with separate lines for logic and memory; a single chip means either the processor in a DRAM fab or the memory in a logic fab.

1997 “Vanilla” IRAM Study
Estimated performance of an IRAM version of the Alpha (same caches, benchmarks, standard DRAM)
– Assumed logic slower, DRAM faster
– Results: Spec92 slower, sparse matrices faster, DBs even
Two conclusions
– Conventional benchmarks like Spec match conventional architectures
– Conventional architectures do not utilize memory bandwidth
Research plan
– Focus on power/area advantages, including portable, hand-held devices
– Focus on multimedia benchmarks to match these devices
– Develop an architecture that can exploit the enormous memory bandwidth

Vector IRAM Architecture
[Register diagram: data registers vr0 through vr31, each holding elements for virtual processors VP0, VP1, ..., VPvl-1, each element of width vpw; Maximum Vector Length (mvl) = number of elements per register.]
– The maximum vector length is given by the read-only register mvl
  E.g., in the VIRAM-1 implementation, each register holds 32 64-bit values
– The vector length is given by the register vl
  This is the number of “active” elements or “virtual processors”
– To handle variable-width data (8-, 16-, 32-, 64-bit), the width of each VP is given by the register vpw
  vpw is one of {8b, 16b, 32b, 64b} (no 8b in VIRAM-1)
– mvl depends on the implementation and on vpw: 32 64-bit, 64 32-bit, 128 16-bit, ...
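
To make the mvl/vl mechanics concrete, here is a minimal strip-mining sketch in plain C (my illustration, not VIRAM code): the constant MVL plays the role of the read-only mvl register for 64-bit data, and the per-iteration chunk size vl corresponds to the vector length register.

```c
#include <stddef.h>

#define MVL 32   /* assumed mvl for 64-bit elements, as in VIRAM-1 */

/* Add two vectors of arbitrary length n on hardware whose vector
 * registers hold at most MVL elements.  On VIRAM the inner loop would
 * be a single vector add over vl "virtual processors"; it is written
 * out as scalar C here purely for illustration. */
void vadd(double *c, const double *a, const double *b, size_t n)
{
    size_t i = 0;
    while (i < n) {
        size_t vl = (n - i < MVL) ? (n - i) : MVL;  /* active elements */
        for (size_t j = 0; j < vl; j++)
            c[i + j] = a[i + j] + b[i + j];
        i += vl;
    }
}
```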

IRAM Architecture Update
ISA mostly frozen since 6/99
– Changes in 2H99 for a better fixed-point model and some instructions for short vectors (auto-increment and in-register permutations)
– Minor changes in 1H00 to address the new co-processor interface in the MIPS core
ISA manual publicly available: http://www.cs.berkeley.edu
Suite of simulators actively used
– vsim-isa (functional): major rewrite nearly complete for the new scalar processor; all UCB code
– vsim-p (performance), vsim-db (debugger), vsim-sync (memory synchronization)

VIRAM-1 Implementation
– 16 MB DRAM, 8 banks
– MIPS scalar core and caches @ 200 MHz
– Vector unit @ 200 MHz: 4 64-bit lanes (= 8 32-bit or 16 16-bit virtual lanes)
– 0.18 um EDL (embedded DRAM/logic) process
– 17x17 mm
– 2 Watt power target
[Block diagram: CPU + $, crossbar, vector pipes/lanes, and two memory halves of 64 Mbits / 8 MBytes each.]
The design easily scales in the number of lanes, e.g.:
– 2 64-bit lanes: lower-power version
– 8 64-bit lanes: higher-performance version
– The number of memory banks is independent of the number of lanes

VIRAM-1 Microarchitecture
Memory system
– 8 DRAM banks
– 256-bit synchronous interface
– 1 sub-bank per bank
– 16 MBytes total capacity
Peak performance
– 3.2 GOPS64, 12.8 GOPS16 (with madd)
– 1.6 GOPS64, 6.4 GOPS16 (without madd)
– 0.8 GFLOPS64, 1.6 GFLOPS32
– 6.4 GByte/s memory bandwidth consumed by the vector unit
2 arithmetic units
– both execute integer operations
– one executes FP operations
– 4 64-bit datapaths (lanes) per unit
2 flag processing units
– for conditional execution and speculation support
1 load-store unit
– optimized for strides 1, 2, 3, and 4
– 4 addresses/cycle for indexed and strided operations
– decoupled indexed and strided stores
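
As a sanity check (my arithmetic, not from the slide), the peak numbers follow from the 200 MHz clock and the lane counts above; a multiply-add counts as two operations, and each 64-bit lane acts as four 16-bit virtual lanes:

```latex
\begin{align*}
\text{GOPS}_{64}\ (\text{no madd})   &= 2\ \text{units} \times 4\ \text{lanes} \times 0.2\ \text{GHz} = 1.6\\
\text{GOPS}_{64}\ (\text{with madd}) &= 2 \times 1.6 = 3.2\\
\text{GOPS}_{16}                     &= 4 \times \text{GOPS}_{64} = 6.4\ \text{or}\ 12.8\\
\text{VU memory bandwidth}           &= 256\ \text{bits/cycle} \times 200\ \text{MHz} = 6.4\ \text{GB/s}
\end{align*}
```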

VIRAM-1 block diagram

IRAM Chip Update
– IBM supplying embedded DRAM/logic (100%); agreement in place as of June 1, 2000
– MIPS supplying scalar core (100%): MIPS processor, caches, TLB
– MIT supplying FPU (100%)
– VIRAM-1 tape-out scheduled for January 2001
– Simplifications: floating point, network interface

Hand-Coded Benchmark Review
– Image processing kernels (old FPU model)
– Note BLAS-2 performance

Baseline system comparison
– All numbers in cycles/pixel
– MMX and VIS results assume all data in the L1 cache

FFT: Uses In-Register Permutations
[Chart comparing FFT performance with and without in-register permutations.]

IRAM Compiler Status
[Compiler structure: C, Fortran, and C++ frontends feed the vectorizer and PDGCS, which drive code generators for the Cray C90 and for IRAM; this is a retarget of the Cray compiler.]
Steps in compiler development
– Build MIPS backend (done)
– Build VIRAM backend for vectorized loops (done)
– Instruction scheduling for VIRAM-1 (done)
– Insertion of memory barriers (using the Cray strategy, improving)
– Additional optimizations (ongoing)
– Feedback results to Cray, new version from Cray (ongoing)
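
For illustration (my example, not from the slides), the kind of loop this vectorizer is built for is a dense loop with unit-stride accesses and no loop-carried dependence, such as DAXPY; the VIRAM backend would map the body onto strip-mined vector loads, a vector multiply-add, and vector stores:

```c
#include <stddef.h>

/* y = a*x + y (DAXPY): unit-stride accesses and no dependence between
 * iterations, so the whole body vectorizes directly. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```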

Compiled Applications Update
Applications using the compiler
– Speech processing under development
  Developed a new small-memory algorithm for speech processing
  Uses some existing kernels (FFT and MM)
  The vector search algorithm is the most challenging
– DIS image understanding application under development
  Compiles, but does not yet vectorize well
– Singular Value Decomposition
  Better than 2 VLIW machines (TI C67 and TM 1100)
  Challenging BLAS-1,2 work well on IRAM because of the memory BW
Kernels
– Simple floating-point kernels are very competitive with hand-coded versions
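
To show why BLAS-2 favors memory bandwidth (a sketch of my own, not code from the project): matrix-vector multiply reads each matrix element exactly once, roughly 2 flops per 8-byte load, so caches provide little reuse and on-chip DRAM bandwidth becomes the limiting resource.

```c
#include <stddef.h>

/* Row-major matrix-vector multiply, the core BLAS-2 operation.
 * Each element of A is loaded once and used for one multiply-add,
 * so the kernel is bandwidth-bound rather than compute-bound. */
void dgemv(size_t m, size_t n, const double *A, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];   /* unit stride over row i */
        y[i] = sum;
    }
}
```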

[Chart: 10n x n SVD, rank 10. From Herman, Loo, Tang, CS252 project.]

IRAM Latency Advantage
1997 estimate: 5-10x improvement
– No parallel DRAMs, memory controller, bus to turn around, SIMM module, pins, ...
– 30 ns for IRAM (or much lower with a DRAM redesign)
– Compare to the Alpha 600: 180 ns for 128b; 270 ns for 512b
2000 estimate: 5x improvement
– IRAM memory latency is 25 ns for 256 bits, a fixed pipeline delay
– Alpha 4000/4100: 120 ns
Speaker note: the first reason to innovate inside the DRAM; the advantage holds even compared to the latest Alpha.

IRAM Bandwidth Advantage
1997 estimate: 100x
– 1024 1-Mbit modules, each 1 Kbit wide (1 Gbit chip), 10% @ 40 ns RAS/CAS = 320 GBytes/sec
– If a crossbar switch or multiple busses deliver 1/3 to 2/3 of the total => 100-200 GBytes/sec
– Compare to: AlphaServer 8400 = 1.2 GBytes/sec, 4100 = 1.1 GBytes/sec
2000 estimate: 10-100x
– VIRAM-1: 16 MB chip divided into 8 banks => 51.2 GB/s peak from the memory banks
– Crossbar can consume 12.8 GB/s: 6.4 GB/s from the vector unit + 6.4 GB/s from either scalar or I/O
Speaker note: the second reason; compare with the delivered bandwidth on the AlphaServer.
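
The quoted figures can be reproduced directly (my arithmetic, assuming 10% of the modules are active per 40 ns access and the 200 MHz, 256-bit-per-bank numbers from the microarchitecture slide):

```latex
\begin{align*}
\text{1997 estimate:}\quad & \frac{0.10 \times 1024\ \text{modules} \times 1024\ \text{bits}}{40\ \text{ns}}
  \approx 2.6\ \text{Tbit/s} \approx 320\ \text{GB/s}\\
\text{VIRAM-1 banks:}\quad & 8 \times 256\ \text{bits} \times 200\ \text{MHz} = 51.2\ \text{GB/s peak}\\
\text{Crossbar:}\quad & 6.4\ \text{GB/s (vector unit)} + 6.4\ \text{GB/s (scalar or I/O)} = 12.8\ \text{GB/s}
\end{align*}
```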

Power and Energy Advantages
1997 case study of the StrongARM memory hierarchy vs. an IRAM memory hierarchy
– Cell-size advantages => much larger cache => fewer off-chip references => up to 2X-4X energy efficiency for memory-intensive algorithms
– Less energy per bit access for DRAM
Power target for VIRAM-1
– 2 watt goal (scalar core included)
– Based on preliminary SPICE runs, this looks very feasible today
Speaker note: the main reason is cell size (cache in a logic process vs. SRAM in an SRAM process vs. DRAM in a DRAM process), which allows bigger caches or less memory on the board.

Summary
– IRAM takes advantage of high on-chip bandwidth
– Vector IRAM ISA utilizes this bandwidth
  Unit, strided, and indexed memory access patterns supported
  Exploits fine-grained parallelism, even with pointer chasing
– Compiler
  Well-understood compiler model, semi-automatic
  Still some work on code generation quality
– Application benchmarks
  Compiled and hand-coded
  Include FFT, SVD, MVM, sparse MVM, and other kernels used in image and signal processing
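
For concreteness, the three memory access patterns mentioned above, written as scalar C loops (an illustrative sketch of my own; on VIRAM each loop maps onto the corresponding vector load):

```c
#include <stddef.h>

/* The three vector memory access patterns supported by the ISA. */
void access_patterns(size_t n, size_t stride, double *dst,
                     const double *src, const int *index)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];              /* unit-stride                    */

    for (size_t i = 0; i < n; i++)
        dst[i] = src[i * stride];     /* strided (e.g. a matrix column) */

    for (size_t i = 0; i < n; i++)
        dst[i] = src[index[i]];       /* indexed (gather)               */
}
```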

IRAM Applications: Intelligent PDA
Pilot PDA
+ Game Boy, cell phone, radio, timer, camera, TV remote, AM/FM radio, garage door opener, ...
+ Wireless data (WWW)
+ Speech and vision recognition
+ Voice output for conversations
Speech control + vision to see, scan documents, read bar codes, ...

IRAM as a Building Block for ISTORE
– System-on-a-chip enables a computer, memory, and redundant network interfaces without significantly increasing the size of the disk
– Target for +5-7 years: the building block is a 2006 MicroDrive integrated with IRAM
  9 GB disk, 50 MB/sec (projected)
  connected via a crossbar switch
  O(10) Gflops
– 10,000+ nodes fit into one rack!