Welcome
Three related projects at Berkeley:
  Intelligent RAM (IRAM)
  Intelligent Storage (ISTORE)
  OceanStore
Groundrules:
  Questions are welcome during talks
  Feedback required Friday morning
  Time for rafting and talking
Introductions
Overview of the IRAM Project
Kathy Yelick Aaron Brown, James Beck, Rich Fromm, Joe Gebis, Paul Harvey, Adam Janin, Dave Judd, Christoforos Kozyrakis, David Martin, Thinh Nguyen, David Oppenheimer, Steve Pope, Randi Thomas, Noah Treuhaft, Sam Williams, John Kubiatowicz, and David Patterson Summer 2000 Retreat
Outline
IRAM Motivation
VIRAM architecture and VIRAM-1 microarchitecture
Benchmarks
Compiler
Back to the future: 1997 and today
Original IRAM Motivation: Processor-DRAM Gap (latency)
[Chart: performance (log scale) vs. time, 1980-2000. CPU performance improves 60%/year while DRAM improves 7%/year, so the processor-memory performance gap grows about 50%/year. Note that x86 processors had no on-chip cache until 1989.]
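The ~50%/year gap quoted on the chart follows directly from the two per-year improvement rates; a quick check (the 60% and 7% rates are the slide's, the 20-year span matches the chart's time axis):

```python
# CPU performance improves ~60%/year, DRAM ~7%/year (the slide's figures);
# the processor-memory gap therefore grows by the ratio of the two each year.
cpu_rate, dram_rate = 1.60, 1.07
gap_growth = cpu_rate / dram_rate
print(f"gap grows ~{100 * (gap_growth - 1):.0f}% per year")   # ~50% per year

# Compounded over the chart's 1980-2000 span:
print(f"cumulative gap after 20 years: ~{gap_growth ** 20:.0f}x")
```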
Intelligent RAM: IRAM
[Diagram: today a processor and L2$ are built in a logic fab and reach DRAM and I/O over an off-chip bus; IRAM integrates processor, bus, I/O, and DRAM on one chip in a DRAM fab.]
Microprocessor & DRAM on a single chip:
  10X on-chip memory capacity (DRAM cells vs. SRAM cells)
  5-10X lower on-chip memory latency
  much higher memory bandwidth
  2X-4X better energy efficiency (no off-chip bus)
  smaller board area/volume
IRAM advantages extend to:
  a single-chip system
  a building block for larger systems
(Fabs cost billions of dollars, with separate lines for logic and memory; a single chip means putting either the processor in a DRAM fab or the memory in a logic fab.)
1997 “Vanilla” IRAM Study
Estimated the performance of an IRAM version of an Alpha (same caches, same benchmarks, standard DRAM)
Assumed logic slower, DRAM faster
Results: Spec92 slower, sparse matrices faster, databases even
Two conclusions:
  Conventional benchmarks like Spec match conventional architectures
  Conventional architectures do not utilize memory bandwidth
Research plan:
  Focus on power/area advantages, including portable, hand-held devices
  Focus on multimedia benchmarks to match these devices
  Develop an architecture that can exploit the enormous memory bandwidth
Vector IRAM Architecture
Maximum Vector Length (mvl) = # of elements per register
[Diagram: vector data registers vr0-vr31, each holding virtual processors VP0 … VPvl-1, each of width vpw.]
The maximum vector length is given by a read-only register, mvl; e.g., in the VIRAM-1 implementation each register holds a fixed number of vpw-bit values
The vector length is given by the register vl: the number of “active” elements, or “virtual processors”
To handle variable-width data (8-, 16-, 32-, 64-bit), the width of each VP is given by the register vpw
vpw is one of {8b, 16b, 32b, 64b} (no 8b in VIRAM-1)
mvl depends on the implementation and on vpw
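The mvl/vl pair is what lets one binary run on implementations with different register lengths: software strip-mines a loop, setting vl to at most mvl each iteration. A toy Python model of that pattern (MVL value and function names are illustrative, not the ISA's):

```python
# Toy model of vector strip-mining: each iteration processes up to mvl
# elements; vl = min(remaining, mvl), mirroring a setvl-style instruction.
MVL = 32  # illustrative maximum vector length, e.g. at 64-bit vpw

def vector_add(a, b):
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        vl = min(n - i, MVL)     # number of "active" virtual processors
        for vp in range(vl):     # one vector instruction over vl elements
            out[i + vp] = a[i + vp] + b[i + vp]
        i += vl                  # advance by however many were active
    return out

print(vector_add(list(range(70)), [1] * 70)[:5])  # [1, 2, 3, 4, 5]
```

Note that 70 is not a multiple of 32: the final iteration simply runs with vl = 6, with no scalar clean-up loop needed.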
IRAM Architecture Update
ISA mostly frozen since 6/99
  Changes in 2H99 for a better fixed-point model and some instructions for short vectors (auto-increment and in-register permutations)
  Minor changes in 1H00 to address the new co-processor interface in the MIPS core
ISA manual publicly available
Suite of simulators actively used:
  vsim-isa (functional): major rewrite nearly complete for the new scalar processor; all UCB code
  vsim-p (performance), vsim-db (debugger), vsim-sync (memory synchronization)
VIRAM-1 Implementation
16 MB DRAM, 8 banks
MIPS scalar core and caches @ 200 MHz
Vector unit @ 200 MHz: 4 64-bit lanes (8 32-bit or 16 16-bit virtual lanes)
0.18 um EDL process
17x17 mm² die, 2 Watt power target
[Floorplan: CPU + $ and crossbar (Xbar) between two memory halves (64 Mbits / 8 MBytes each), plus the vector pipes/lanes.]
Design easily scales in number of lanes, e.g.:
  2 64-bit lanes: lower-power version
  8 64-bit lanes: higher-performance version
The number of memory banks is independent of the number of lanes
VIRAM-1 Microarchitecture
Memory system:
  8 DRAM banks (1 sub-bank per bank)
  256-bit synchronous interface
  16 MBytes total capacity
Peak performance:
  3.2 GOPS64, 12.8 GOPS16 (with madd)
  1.6 GOPS64, 6.4 GOPS16 (without madd)
  0.8 GFLOPS64, 1.6 GFLOPS32
  6.4 GByte/s memory bandwidth consumed by the vector unit
2 arithmetic units: both execute integer operations, one also executes FP operations; 4 64-bit datapaths (lanes) per unit
2 flag processing units, for conditional execution and speculation support
1 load-store unit: optimized for strides 1, 2, 3, and 4; 4 addresses/cycle for indexed and strided operations; decoupled indexed and strided stores
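The peak numbers above are straight products of clock rate, units, and lanes; a sketch that reproduces them (assuming, as the slide states, a madd counts as two operations and narrower types pack more subwords per 64-bit datapath):

```python
# Peak-rate arithmetic behind the GOPS/GFLOPS figures (200 MHz clock,
# 2 integer units x 4 64-bit lanes; madd = two operations per cycle).
clock = 200e6
lanes, int_units = 4, 2

gops64 = int_units * lanes * clock / 1e9
assert gops64 == 1.6              # GOPS64 without madd
assert gops64 * 2 == 3.2          # GOPS64 with madd
assert gops64 * 4 == 6.4          # GOPS16: four 16-bit subwords per lane
assert gops64 * 4 * 2 == 12.8     # GOPS16 with madd

# FP runs on only one of the two units: 0.8 GFLOPS64, doubled at 32-bit.
assert 1 * lanes * clock / 1e9 == 0.8

# Vector unit's memory interface: 256 bits per cycle.
assert 256 / 8 * clock / 1e9 == 6.4   # GB/s
```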
VIRAM-1 block diagram
IRAM Chip Update
IBM supplying embedded DRAM/logic (100%): agreement in place as of June 1, 2000
MIPS supplying scalar core (100%): MIPS processor, caches, TLB
MIT supplying FPU (100%)
VIRAM-1 tape-out scheduled for January 2001
Simplifications: floating point, network interface
Hand-Coded Benchmark Review
Image processing kernels (old FPU model). Note the BLAS-2 performance.
Baseline system comparison
All numbers are in cycles/pixel
MMX and VIS results assume all data in the L1 cache
FFT: Uses In-Register Permutations
Without in-register permutations
IRAM Compiler Status
[Diagram: C, C++, and Fortran frontends feed the vectorizer and PDGCS, with code generators targeting the C90 and IRAM — a retarget of the Cray compiler.]
Steps in compiler development:
  Build MIPS backend (done)
  Build VIRAM backend for vectorized loops (done)
  Instruction scheduling for VIRAM-1 (done)
  Insertion of memory barriers (using the Cray strategy, improving)
  Additional optimizations (ongoing)
  Feedback results to Cray, new version from Cray (ongoing)
Compiled Applications Update
Applications using the compiler:
  Speech processing under development
    New small-memory algorithm developed for speech processing
    Uses some existing kernels (FFT and MM)
    The vector search algorithm is the most challenging part
  DIS image understanding application under development
    Compiles, but does not yet vectorize well
  Singular Value Decomposition
    Better than 2 VLIW machines (TI C67 and TM 1100)
    The challenging BLAS-1,2 operations work well on IRAM because of its memory bandwidth
Kernels: simple floating-point kernels are very competitive with hand-coded versions
(10n x n SVD, rank 10) (From Herman, Loo, Tang, CS252 project)
IRAM Latency Advantage
1997 estimate: 5-10x improvement
  No parallel DRAMs, memory controller, bus turnaround, SIMM module, pins, …
  30 ns for IRAM (or much lower with a DRAM redesign)
  Compare to the Alpha 600: 180 ns for 128 bits; 270 ns for 512 bits
2000 estimate: 5x improvement
  IRAM memory latency is 25 ns for 256 bits (fixed pipeline delay)
  Compare to the Alpha 4000/4100: 120 ns
(The advantage holds even against the latest Alpha; the underlying reason is the freedom to innovate inside the DRAM.)
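The 2000-era 5x figure is just the ratio of the two quoted latencies, rounded:

```python
# IRAM 256-bit access vs. Alpha 4000/4100 memory latency (slide's numbers).
iram_ns, alpha_ns = 25, 120
print(f"{alpha_ns / iram_ns:.1f}x")   # 4.8x, i.e. roughly the claimed 5x
```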
IRAM Bandwidth Advantage
1997 estimate: 100x
  1024 1-Mbit modules, each 1 Kbit wide (1 Gbit chip)
  40 ns RAS/CAS ⇒ 320 GBytes/sec
  If a crossbar switch or multiple busses deliver 1/3 to 2/3 of the total ⇒ roughly 100-200 GBytes/sec
  Compare to: AlphaServer 8400 = 1.2 GBytes/sec (1.1 GBytes/sec delivered)
2000 estimate:
  VIRAM-1 16 MB chip divided into 8 banks ⇒ 51.2 GB/s peak from the memory banks
  Crossbar can consume 12.8 GB/s: 6.4 GB/s from the vector unit, 6.4 GB/s from scalar and I/O
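The VIRAM-1 figures can be reconstructed from the bank and interface widths given on the microarchitecture slide (8 banks, 256-bit interfaces, 200 MHz clock — parameters taken from those slides, not measured numbers):

```python
# VIRAM-1 bandwidth arithmetic from the slide's parameters.
clock = 200e6
banks, bank_bits = 8, 256            # 8 DRAM banks, 256-bit interfaces

peak = banks * bank_bits / 8 * clock / 1e9
print(peak)                          # 51.2 GB/s peak out of the memory banks

vu = bank_bits / 8 * clock / 1e9
print(vu)                            # 6.4 GB/s consumable by the vector unit
print(2 * vu)                        # 12.8 GB/s total through the crossbar
```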
Power and Energy Advantages
1997 case study of the StrongARM memory hierarchy vs. an IRAM memory hierarchy:
  Cell-size advantages ⇒ much larger cache ⇒ fewer off-chip references ⇒ up to 2X-4X energy efficiency for memory-intensive algorithms
  Less energy per bit access for DRAM (the main reason)
Power target for VIRAM-1:
  2 Watt goal, scalar core included
  Based on preliminary SPICE runs, this looks very feasible today
(The tradeoff is a bigger cache vs. less memory on board: cache in a logic process vs. SRAM in an SRAM process vs. DRAM in a DRAM process.)
Summary
IRAM takes advantage of high on-chip bandwidth
Vector IRAM ISA utilizes this bandwidth:
  Unit, strided, and indexed memory access patterns supported
  Exploits fine-grained parallelism, even with pointer chasing
Compiler:
  Well-understood compiler model, semi-automatic
  Still some work on code-generation quality
Application benchmarks:
  Compiled and hand-coded
  Include FFT, SVD, MVM, sparse MVM, and other kernels used in image and signal processing
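The three supported access patterns can be modeled simply on plain lists (a sketch of the semantics only, not the ISA encoding; function names are illustrative):

```python
# The three vector memory access patterns: unit-stride, strided, indexed.
def unit_load(mem, base, vl):
    return [mem[base + i] for i in range(vl)]            # unit stride

def strided_load(mem, base, stride, vl):
    return [mem[base + i * stride] for i in range(vl)]   # constant stride

def indexed_load(mem, index_vec):
    return [mem[i] for i in index_vec]                   # gather

mem = list(range(100, 132))
print(unit_load(mem, 0, 4))          # [100, 101, 102, 103]
print(strided_load(mem, 0, 3, 4))    # [100, 103, 106, 109]
print(indexed_load(mem, [5, 1, 9]))  # [105, 101, 109]
```

Strided loads cover column accesses in matrices; indexed loads (gathers) cover sparse-matrix and pointer-chasing patterns mentioned above.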
IRAM Applications: Intelligent PDA
Pilot PDA
+ gameboy, cell phone, radio, timer, camera, TV remote, am/fm radio, garage door opener, ...
+ wireless data (WWW)
+ speech and vision recognition
+ voice output for conversations
Speech control + vision to see, scan documents, read bar codes, ...
IRAM as Building Block for ISTORE
System-on-a-chip enables computer, memory, and redundant network interfaces without significantly increasing the size of the disk
Target building block: a 2006 MicroDrive integrated with IRAM
  9 GB disk, 50 MB/sec (projected)
  connected via a crossbar switch
  O(10) Gflops
10,000+ nodes fit into one rack!