
Presentation on theme: "Welcome Three related projects at Berkeley –Intelligent RAM (IRAM) –Intelligent Storage (ISTORE) –OceanStore Groundrules –Questions are welcome during."— Presentation transcript:

1 Welcome
Three related projects at Berkeley
–Intelligent RAM (IRAM)
–Intelligent Storage (ISTORE)
–OceanStore
Groundrules
–Questions are welcome during talks
–Feedback required Friday morning
–Time for rafting and talking
Introductions

2 Overview of the IRAM Project
Kathy Yelick
Aaron Brown, James Beck, Rich Fromm, Joe Gebis, Paul Harvey, Adam Janin, Dave Judd, Christoforos Kozyrakis, David Martin, Thinh Nguyen, David Oppenheimer, Steve Pope, Randi Thomas, Noah Treuhaft, Sam Williams, John Kubiatowicz, and David Patterson
http://iram.cs.berkeley.edu/
Summer 2000 Retreat

3 Outline
IRAM motivation
VIRAM architecture and VIRAM-1 microarchitecture
Benchmarks
Compiler
Back to the future: 1997 and today

4 Original IRAM Motivation: Processor-DRAM Gap (latency)
(Chart: performance vs. time, 1980-2000, log scale. Processor performance grows ~60%/year while DRAM latency improves only ~7%/year, so the processor-memory performance gap grows ~50% per year.)

5 Intelligent RAM: IRAM
Microprocessor & DRAM on a single chip:
–10X capacity vs. SRAM
–5-10X lower on-chip memory latency
–50-100X higher memory bandwidth
–2X-4X better energy efficiency (no off-chip bus)
–smaller board area/volume
IRAM advantages extend to:
–a single chip system
–a building block for larger systems
(Diagram: a conventional system built in separate DRAM and logic fabs, with processor, caches, bus, DRAM, and I/O as separate parts, versus processor, DRAM, and I/O integrated on one chip.)

6 1997 “Vanilla” IRAM Study
Estimated the performance of an IRAM version of the Alpha (same caches, same benchmarks, standard DRAM)
–Assumed logic slower, DRAM faster
–Results: Spec92 slower, sparse matrices faster, databases even
Two conclusions
–Conventional benchmarks like Spec match conventional architectures
–Conventional architectures do not utilize memory bandwidth
Research plan
–Focus on power/area advantages, including portable, hand-held devices
–Focus on multimedia benchmarks to match these devices
–Develop an architecture that can exploit the enormous memory bandwidth

7 Vector IRAM Architecture
(Diagram: vector registers vr0..vr31, each holding virtual processors VP0..VP(vl-1) of width vpw.)
Maximum vector length is given by a read-only register mvl (= # elements per register)
–E.g., in the VIRAM-1 implementation, each register holds 32 64-bit values
Vector length is given by the register vl
–This is the # of “active” elements or “virtual processors”
To handle variable-width data (8, 16, 32, 64-bit), the width of each VP is given by the register vpw
–vpw is one of {8b, 16b, 32b, 64b} (no 8b in VIRAM-1)
–mvl depends on implementation and vpw: 32 64-bit, 64 32-bit, 128 16-bit, …
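
The mvl/vl/vpw model above can be sketched in a few lines of Python. This is an illustrative model, not VIRAM code: it assumes only what the slide states, namely a fixed 2048-bit register (32 x 64b in VIRAM-1) whose element count depends on the element width, and the usual strip-mining pattern where vl is set to min(remaining, mvl) before each vector iteration.

```python
# Illustrative model of VIRAM's mvl/vl/vpw registers (assumption:
# register size is fixed at 32 x 64 bits, per the slide).
REGISTER_BITS = 32 * 64

def mvl(vpw):
    """Maximum vector length (elements per register) for a given
    virtual-processor width vpw in bits."""
    assert vpw in (8, 16, 32, 64)  # 8b is not supported in VIRAM-1
    return REGISTER_BITS // vpw

def strip_mined_add(a, b):
    """Strip-mine a vector add: each pass processes vl = min(remaining,
    mvl) 'active' virtual processors, mimicking a vectorized loop."""
    vpw = 32
    out = []
    i = 0
    while i < len(a):
        vl = min(len(a) - i, mvl(vpw))  # active elements this pass
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out
```

With this model, mvl(64) is 32, mvl(32) is 64, and mvl(16) is 128, matching the slide's table.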

8 IRAM Architecture Update
ISA mostly frozen since 6/99
–Changes in 2H 99 for a better fixed-point model and some instructions for short vectors (auto-increment and in-register permutations)
–Minor changes in 1H 00 to address the new co-processor interface in the MIPS core
ISA manual publicly available
–http://www.cs.berkeley.edu
Suite of simulators actively used
–vsim-isa (functional): major rewrite nearly complete for the new scalar processor; all UCB code
–vsim-p (performance), vsim-db (debugger), vsim-sync (memory synchronization)

9 VIRAM-1 Implementation
16 MB DRAM, 8 banks
MIPS scalar core and caches @ 200 MHz
Vector unit @ 200 MHz
–4 64-bit lanes
–8 32-bit virtual lanes
–16 16-bit virtual lanes
0.18 um EDL process
17x17 mm die
2 Watt power target
(Diagram: CPU + caches, vector pipes/lanes, and memory connected through a crossbar.)
Design easily scales in the number of lanes, e.g.:
–2 64-bit lanes: lower-power version
–8 64-bit lanes: higher-performance version
The number of memory banks is independent of the number of lanes

10 VIRAM-1 Microarchitecture
2 arithmetic units
–both execute integer operations
–one executes FP operations
–4 64-bit datapaths (lanes) per unit
2 flag processing units
–for conditional execution and speculation support
1 load-store unit
–optimized for strides 1, 2, 3, and 4
–4 addresses/cycle for indexed and strided operations
–decoupled indexed and strided stores
Memory system
–8 DRAM banks
–256-bit synchronous interface
–1 sub-bank per bank
–16 Mbytes total capacity
Peak performance
–3.2 GOPS64, 12.8 GOPS16 (with madd)
–1.6 GOPS64, 6.4 GOPS16 (without madd)
–0.8 GFLOPS64, 1.6 GFLOPS32
–6.4 Gbyte/s memory bandwidth consumed by VU
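
The peak-performance figures above follow directly from the unit and lane counts at 200 MHz. This short Python sketch is my own back-of-the-envelope arithmetic, not project code, but it reproduces the slide's numbers: two arithmetic units, each with 256 bits of datapath, where a multiply-add counts as two operations.

```python
# Illustrative derivation of VIRAM-1 peak numbers (my arithmetic,
# assuming the configuration stated on the slide).
CLOCK_HZ = 200e6
ARITH_UNITS = 2
DATAPATH_BITS = 4 * 64   # 4 64-bit lanes per arithmetic unit

def peak_gops(elem_bits, madd):
    """Peak integer GOPS: units * (datapath width / element width)
    ops per cycle, doubled when madd counts as two operations."""
    ops_per_cycle = ARITH_UNITS * (DATAPATH_BITS // elem_bits)
    if madd:
        ops_per_cycle *= 2
    return ops_per_cycle * CLOCK_HZ / 1e9

def vu_memory_bw_gbytes():
    """One 256-bit memory interface access per cycle at 200 MHz."""
    return (256 / 8) * CLOCK_HZ / 1e9
```

peak_gops(64, True) gives 3.2 and peak_gops(16, True) gives 12.8, matching the "with madd" line; vu_memory_bw_gbytes() gives the 6.4 GB/s consumed by the vector unit.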

11 VIRAM-1 block diagram

12 IRAM Chip Update
IBM supplying embedded DRAM/logic (100%)
–Agreement in place as of June 1, 2000
MIPS supplying scalar core (100%)
–MIPS processor, caches, TLB
MIT supplying FPU (100%)
VIRAM-1 tape-out scheduled for January 2001
Simplifications
–Floating point
–Network interface

13 Hand-Coded Benchmark Review
Image processing kernels (old FPU model)
–Note BLAS-2 performance

14 Base-line system comparison
(Table of results; all numbers in cycles/pixel.)
MMX and VIS results assume all data in L1 cache

15 FFT: Uses In-Register Permutations
(Chart comparing FFT performance with and without in-register permutations.)

16 Problem: General Element Permutation
Hardware for a full vector permutation instruction (128 16b elements, 256b datapath):
–Datapath: 16 x 16 (x 16b) crossbar; scales as O(N^2)
–Control: 16 16-to-1 multiplexors; scales as O(N log N)
–Time/energy wasted on the wide vector register file port
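
The scaling argument above can be made concrete with a toy cost model. This Python sketch is illustrative only (the constants are mine, not the slide's circuit counts): a full crossbar needs one N-to-1 multiplexor per output, so crosspoint cost grows quadratically and control (select) bits grow as N log N.

```python
import math

# Toy cost model (illustrative constants) for a full N-element
# permutation crossbar, as criticized on the slide.

def crossbar_crosspoints(n):
    """Each of the n outputs is an n-to-1 mux: n*n crosspoints, O(N^2)."""
    return n * n

def crossbar_control_bits(n):
    """Each n-to-1 mux needs log2(n) select bits: O(N log N) control."""
    return n * int(math.log2(n))
```

For the 16-lane case on the slide this gives 256 crosspoints and 64 select bits; doubling the element count to 32 quadruples the crosspoints, which is the scaling problem motivating the simpler permutations on the next slide.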

17 Simple Vector Permutations
Simple steps of butterfly permutations
–A register provides the butterfly radix
–Separate instructions for moving elements to the left/right
Sufficient semantics for
–Fast reductions of vector registers (dot products)
–Fast FFT kernels
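
To see why butterfly steps suffice for reductions, here is a small Python model. It is a simplification of the ISA described above: the slide specifies separate left/right move instructions, whereas this sketch models a single exchange across the radix (element i swaps with element i XOR radix), which is enough to show the log-step reduction pattern.

```python
# Illustrative model (not VIRAM semantics): a butterfly step of a
# power-of-two radix exchanges elements across that distance.

def butterfly(v, radix):
    """Exchange element i with element i XOR radix."""
    return [v[i ^ radix] for i in range(len(v))]

def reduce_sum(v):
    """Reduce a vector register to its sum in log2(n) butterfly+add
    steps, e.g. the final phase of a dot product."""
    n = len(v)           # assume a power of two, like mvl
    radix = n // 2
    while radix >= 1:
        moved = butterfly(v, radix)
        v = [a + b for a, b in zip(v, moved)]
        radix //= 2
    return v[0]          # every element now holds the full sum
```

After log2(n) steps every element holds the total, so a dot product needs only a vector multiply followed by this short reduction rather than a full general permutation.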

18 Hardware for Simple Permutations
Hardware for 128 16b elements, 256b datapath:
–Datapath: 2 buses, 8 tristate drivers, 4 multiplexors, 4 shifters (by 0, 16b, 32b only); scales as O(N)
–Control: 6 control cases; scales as O(N)
Other benefits
–Consecutive result elements written together
–Buses used only for small radices

19 IRAM Compiler Status
Retarget of the Cray compiler
Steps in compiler development
–Build MIPS backend (done)
–Build VIRAM backend for vectorized loops (done)
–Instruction scheduling for VIRAM-1 (done)
–Insertion of memory barriers (using Cray strategy, improving)
–Additional optimizations (ongoing)
–Feed results back to Cray, new version from Cray (ongoing)
(Diagram: C, Fortran, and C++ frontends feed the vectorizer and PDGCS, with code generators targeting IRAM and the C90.)

20 Compiled Applications Update
Applications using the compiler
–Speech processing under development
  Developed a new small-memory algorithm for speech processing
  Uses some existing kernels (FFT and MM)
  The vector search algorithm is the most challenging part
–DIS image understanding application under development
  Compiles, but does not yet vectorize well
–Singular Value Decomposition
  Better than 2 VLIW machines (TI C67 and TM 1100)
  Challenging BLAS-1,2 work well on IRAM because of memory BW
–Kernels
  Simple floating point kernels are very competitive with hand-coded versions

21 (Chart: 10n x n SVD, rank 10. From Herman, Loo, Tang, CS252 project.)

22 IRAM Latency Advantage
1997 estimate: 5-10x improvement
–No parallel DRAMs, memory controller, bus turnaround, SIMM module, pins…
–30 ns for IRAM (or much lower with a DRAM redesign)
–Compare to Alpha 600: 180 ns for 128b; 270 ns for 512b
2000 estimate: 5x improvement
–IRAM memory latency is 25 ns for 256 bits, fixed pipeline delay
–Alpha 4000/4100: 120 ns

23 IRAM Bandwidth Advantage
1997 estimate: 100x
–1024 1-Mbit modules, each 1 Kb wide (1 Gb chip)
–10% active @ 40 ns RAS/CAS = 320 GBytes/sec
–If a crossbar switch or multiple busses deliver 1/3 to 2/3 of the total => 100-200 GBytes/sec
–Compare to: AlphaServer 8400 = 1.2 GBytes/sec, 4100 = 1.1 GBytes/sec
2000 estimate: 10-100x
–VIRAM-1 16 MB chip divided into 8 banks => 51.2 GB/s peak from the memory banks
–Crossbar can consume 12.8 GB/s
–6.4 GB/s from the vector unit + 6.4 GB/s from either scalar or I/O
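
The bandwidth estimates above are straightforward arithmetic, reproduced here as a Python sketch (my own check of the slide's numbers, not project code). The 1997 figure assumes 1024 modules, each 1 Kbit wide, with 10% active every 40 ns; the 2000 figure is 8 banks with a 256-bit interface at 200 MHz.

```python
# Illustrative check of the bandwidth arithmetic on the slide.

def estimate_1997_gbytes_per_sec():
    """1024 modules x 1 Kbit wide, 10% active per 40 ns cycle."""
    modules, width_bits, duty, cycle_s = 1024, 1024, 0.10, 40e-9
    bytes_per_cycle = modules * duty * width_bits / 8
    return bytes_per_cycle / cycle_s / 1e9

def viram1_bank_gbytes_per_sec():
    """8 banks, each with a 256-bit interface, at 200 MHz."""
    return 8 * (256 / 8) * 200e6 / 1e9
```

The 1997 model works out to about 328 GB/s, consistent with the slide's rounded 320 GB/s; the VIRAM-1 banks give exactly the 51.2 GB/s peak quoted above.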

24 Power and Energy Advantages
1997 case study of StrongARM memory hierarchy vs. IRAM memory hierarchy
–Cell-size advantages => much larger cache => fewer off-chip references => up to 2X-4X energy efficiency for memory-intensive algorithms
–Less energy per bit access for DRAM
Power target for VIRAM-1
–2 watt goal
–Based on preliminary spice runs, this looks very feasible today
–Scalar core included

25 Summary
IRAM takes advantage of high on-chip bandwidth
The Vector IRAM ISA utilizes this bandwidth
–Unit, strided, and indexed memory access patterns supported
–Exploits fine-grained parallelism, even with pointer chasing
Compiler
–Well-understood compiler model, semi-automatic
–Still some work on code generation quality
Application benchmarks
–Compiled and hand-coded
–Include FFT, SVD, MVM, sparse MVM, and other kernels used in image and signal processing

26 IRAM Applications: Intelligent PDA
Pilot PDA + Gameboy, cell phone, radio, timer, camera, TV remote, AM/FM radio, garage door opener, …
+ Wireless data (WWW)
+ Speech and vision recognition
+ Voice output for conversations
Speech control
+ Vision to see, scan documents, read bar codes, …

27 IRAM as Building Block for ISTORE
System-on-a-chip enables computer, memory, and redundant network interfaces without significantly increasing the size of the disk
Target for 5-7 years out:
–Building block: 2006 MicroDrive integrated with IRAM
  9 GB disk, 50 MB/sec (projected)
  connected via crossbar switch
  O(10) Gflops
–10,000+ nodes fit into one rack!

