Presentation transcript:

Welcome
Three related projects at Berkeley:
–Intelligent RAM (IRAM)
–Intelligent Storage (ISTORE)
–OceanStore
Groundrules:
–Questions are welcome during talks
–Feedback required Friday morning
–Time for rafting and talking
Introductions

Overview of the IRAM Project
Kathy Yelick
Aaron Brown, James Beck, Rich Fromm, Joe Gebis, Paul Harvey, Adam Janin, Dave Judd, Christoforos Kozyrakis, David Martin, Thinh Nguyen, David Oppenheimer, Steve Pope, Randi Thomas, Noah Treuhaft, Sam Williams, John Kubiatowicz, and David Patterson
Summer 2000 Retreat

Outline
–IRAM motivation
–VIRAM architecture and VIRAM-1 microarchitecture
–Benchmarks
–Compiler
–Back to the future: 1997 and today

Original IRAM Motivation: Processor-DRAM Gap (latency)
[Chart: performance vs. time, from 1982 onward; µProc performance improves ~60%/yr while DRAM improves ~7%/yr, so the processor-memory performance gap grows ~50%/yr]
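To make the "grows 50% / year" figure concrete, here is a small back-of-envelope C program (an illustration, not from the talk) that compounds the two trend lines; since 1.60 / 1.07 ≈ 1.5, the gap widens by roughly half each year.

```c
#include <stdio.h>

int main(void) {
    double cpu = 1.0, dram = 1.0;
    for (int year = 1; year <= 10; year++) {
        cpu  *= 1.60;   /* processor performance: +60%/yr */
        dram *= 1.07;   /* DRAM performance: +7%/yr */
        printf("year %2d: gap = %5.1fx\n", year, cpu / dram);
    }
    return 0;
}
```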

Intelligent RAM: IRAM
Microprocessor & DRAM on a single chip:
–10X capacity vs. SRAM
–on-chip memory latency 5-10X
–memory bandwidth 50-100X (see the 1997 estimate below)
–improve energy efficiency 2X-4X (no off-chip bus)
–smaller board area/volume
IRAM advantages extend to:
–a single-chip system
–a building block for larger systems
[Diagram: conventional system built in a logic fab (Proc, $, L2$, bus to off-chip DRAM and I/O) vs. IRAM built in a DRAM fab (Proc, DRAM, and I/O on one chip)]

1997 “Vanilla” IRAM Study
Estimated the performance of an IRAM version of the Alpha (same caches, benchmarks, standard DRAM)
–Assumed logic slower, DRAM faster
–Results: Spec92 slower, sparse matrices faster, databases even
Two conclusions:
–Conventional benchmarks like Spec match conventional architectures
–Conventional architectures do not utilize memory bandwidth
Research plan:
–Focus on power/area advantages, including portable, hand-held devices
–Focus on multimedia benchmarks to match these devices
–Develop an architecture that can exploit the enormous memory bandwidth

Vector IRAM Architecture
[Diagram: vector data registers vr0…vr31; each register holds virtual processors VP0…VP(vl-1), each of width vpw]
Maximum Vector Length (mvl) = # elements per register
–Maximum vector length is given by a read-only register mvl
  –E.g., in the VIRAM-1 implementation, each register holds 32 64-bit values
–Vector Length is given by the register vl
  –This is the # of “active” elements or “virtual processors”
–To handle variable-width data (8, 16, 32, 64-bit), the width of each VP is given by the register vpw
  –vpw is one of {8b, 16b, 32b, 64b} (no 8b in VIRAM-1)
  –mvl depends on implementation and vpw: 32 64-bit, 64 32-bit, 128 16-bit, …
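A minimal C sketch of the strip-mining idiom this register model supports; set_vl and MVL are illustrative stand-ins for the hardware's vector-length mechanism (not actual VIRAM names), assuming mvl = 64 elements at vpw = 32b.

```c
#include <stddef.h>

enum { MVL = 64 };  /* assumed: 64 32-bit elements per register (vpw = 32b) */

/* Stand-in for the "set vector length" operation: vl = min(n, mvl). */
static size_t set_vl(size_t n) { return n < MVL ? n : MVL; }

/* Each while-iteration models one batch of vector instructions
   operating on vl virtual processors at once. */
void vadd(const int *a, const int *b, int *c, size_t n) {
    while (n > 0) {
        size_t vl = set_vl(n);
        for (size_t i = 0; i < vl; i++)   /* one vector add, vl elements */
            c[i] = a[i] + b[i];
        a += vl; b += vl; c += vl; n -= vl;
    }
}
```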

IRAM Architecture Update
–ISA mostly frozen since 6/99
  –Changes in 2H99 for a better fixed-point model and some instructions for short vectors (auto-increment and in-register permutations)
  –Minor changes in 1H00 to address the new co-processor interface in the MIPS core
–ISA manual publicly available
–Suite of simulators actively used
  –vsim-isa (functional): major rewrite nearly complete for the new scalar processor; all UCB code
  –vsim-p (performance), vsim-db (debugger), vsim-sync (memory synchronization)

VIRAM-1 Implementation
–16 MB DRAM, 8 banks
–MIPS scalar core and caches at 200 MHz
–Vector unit at 200 MHz
  –4 64-bit lanes
  –8 32-bit virtual lanes
  –16 16-bit virtual lanes
–0.18 um EDL process
–17x17 mm die
–2 Watt power target
[Floorplan: CPU + $, vector pipes/lanes, crossbar (Xbar), memory (64 Mbits / 8 MBytes)]
Design easily scales in the number of lanes, e.g.:
–2 64-bit lanes: lower-power version
–8 64-bit lanes: higher-performance version
–Number of memory banks is independent

VIRAM-1 Microarchitecture
–2 arithmetic units
  –both execute integer operations
  –one executes FP operations
  –4 64-bit datapaths (lanes) per unit
–2 flag processing units
  –for conditional execution and speculation support
–1 load-store unit
  –optimized for strides 1, 2, 3, and 4
  –4 addresses/cycle for indexed and strided operations
  –decoupled indexed and strided stores
–Memory system
  –8 DRAM banks
  –256-bit synchronous interface
  –1 sub-bank per bank
  –16 MBytes total capacity
–Peak performance
  –3.2 GOPS (64-bit) / 12.8 GOPS (16-bit) with multiply-add
  –1.6 GOPS (64-bit) / 6.4 GOPS (16-bit) without multiply-add
  –0.8 GFLOPS (64-bit) / 1.6 GFLOPS (32-bit)
  –6.4 GByte/s memory bandwidth consumed by the vector unit
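A quick check of where the peak figures come from (an illustration, assuming 2 arithmetic units of 4 64-bit lanes each at 200 MHz, a multiply-add counted as 2 operations, and each 64-bit lane splitting into four 16-bit virtual lanes):

```c
#include <stdio.h>

int main(void) {
    const double hz = 200e6;            /* 200 MHz clock */
    const int units = 2, lanes64 = 4;   /* 2 units x 4 64-bit lanes */
    double gops64 = units * lanes64 * hz / 1e9;
    printf("64b ops, no madd: %4.1f GOPS\n", gops64);      /* 1.6 */
    printf("64b ops, w/ madd: %4.1f GOPS\n", gops64 * 2);  /* 3.2 */
    /* each 64-bit lane provides four 16-bit virtual lanes */
    printf("16b ops, no madd: %4.1f GOPS\n", gops64 * 4);  /* 6.4 */
    printf("16b ops, w/ madd: %4.1f GOPS\n", gops64 * 8);  /* 12.8 */
    return 0;
}
```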

VIRAM-1 block diagram

IRAM Chip Update
–IBM supplying embedded DRAM/logic (100%)
  –Agreement in place as of June 1, 2000
–MIPS supplying scalar core (100%)
  –MIPS processor, caches, TLB
–MIT supplying FPU (100%)
–VIRAM-1 tape-out scheduled for January 2001
–Simplifications
  –Floating point
  –Network interface

Hand-Coded Benchmark Review
[Chart: image processing kernels (old FPU model); note BLAS-2 performance]

Base-line system comparison
[Table: all numbers in cycles/pixel; MMX and VIS results assume all data in the L1 cache]

FFT: Uses In-Register Permutations
[Chart: FFT performance with vs. without in-register permutations]

Problem: General Element Permutation
Hardware for a full vector permutation instruction (128 16b elements, 256b datapath):
–Datapath: 16 x 16 (x 16b) crossbar; scales by O(N^2)
–Control: 16-to-1 multiplexors; scales by O(N*logN)
–Time/energy wasted on a wide vector register file port
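A quick cost count behind the scaling claim (an illustration, using crosspoint count and mux select bits as the cost metrics):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    for (int n = 4; n <= 64; n *= 2) {
        int crosspoints = n * n;            /* full crossbar: O(N^2) */
        double ctrl = n * log2((double)n);  /* select bits: O(N*logN) */
        printf("N=%2d: crossbar=%4d crosspoints, control~%3.0f bits\n",
               n, crosspoints, ctrl);
    }
    return 0;
}
```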

Simple Vector Permutations
–Simple steps of butterfly permutations
  –A register provides the butterfly radix
  –Separate instructions for moving elements to left/right
–Sufficient semantics for
  –Fast reductions of vector registers (dot products)
  –Fast FFT kernels
[Diagram: butterfly permutation steps across the elements of a vector register]
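A rough C model of why a few butterfly steps give fast reductions; move_down is a hypothetical stand-in for the "move elements left by the radix" instruction (not a VIRAM mnemonic), and each loop body models one vector instruction, so a length-n reduction takes log2(n) permutation/add pairs instead of n-1 scalar adds.

```c
#include <stdio.h>

/* Stand-in for "move elements left by radix r": dst[i] = src[i + r]. */
static void move_down(const double *src, double *dst, int r) {
    for (int i = 0; i < r; i++) dst[i] = src[i + r];
}

/* Dot-product style reduction of one vector register image.
   n must be a power of two, at most 128 (the register length). */
double vreduce(double *v, int n) {
    double tmp[128];
    for (int r = n / 2; r >= 1; r /= 2) {
        move_down(v, tmp, r);                        /* one permutation */
        for (int i = 0; i < r; i++) v[i] += tmp[i];  /* one vector add  */
    }
    return v[0];
}

int main(void) {
    double v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("sum = %g\n", vreduce(v, 8));   /* prints 36 */
    return 0;
}
```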

Hardware for Simple Permutations
Hardware for 128 16b elements, 256b datapath:
–Datapath: 2 buses, 8 tristate drivers, 4 multiplexors, 4 shifters (by 0, 16b, 32b only); scales by O(N)
–Control: 6 control cases; scales by O(N)
Other benefits:
–Consecutive result elements written together
–Buses used only for small radices

IRAM Compiler Status
Retarget of the Cray compiler
[Diagram: C, C++, and Fortran frontends feed the vectorizer and PDGCS, which drive code generators for IRAM and the C90]
Steps in compiler development:
–Build MIPS backend (done)
–Build VIRAM backend for vectorized loops (done)
–Instruction scheduling for VIRAM-1 (done)
–Insertion of memory barriers (using Cray strategy, improving)
–Additional optimizations (ongoing)
–Feed results back to Cray, new version from Cray (ongoing)
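For reference, plain C loops exercising the three memory access patterns the VIRAM ISA supports (unit-stride, strided, indexed); these are illustrative examples of the kind of loops the vectorizer targets, not code from the project.

```c
void unit_stride(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)        /* unit-stride loads and stores */
        c[i] = a[i] + b[i];
}

void strided(const float *a, float *c, int n, int s) {
    for (int i = 0; i < n; i++)        /* strided loads, stride s */
        c[i] = a[i * s];
}

void indexed(const float *a, const int *idx, float *c, int n) {
    for (int i = 0; i < n; i++)        /* indexed (gather) loads */
        c[i] = a[idx[i]];
}
```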

Compiled Applications Update
Applications using the compiler:
–Speech processing under development
  –Developed a new small-memory algorithm for speech processing
  –Uses some existing kernels (FFT and MM)
  –The vector search algorithm is the most challenging part
–DIS image understanding application under development
  –Compiles, but does not yet vectorize well
–Singular Value Decomposition
  –Better than 2 VLIW machines (TI C67 and TM 1100)
  –Challenging; BLAS-1,2 work well on IRAM because of memory BW
–Kernels
  –Simple floating point kernels are very competitive with hand-coded versions
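A plain C matrix-vector multiply (an illustration, not project code) shows why BLAS-2 favors IRAM: it performs O(n^2) memory accesses for O(n^2) flops, so there is little reuse for caches to exploit and performance is bound by memory bandwidth, which is exactly what IRAM supplies.

```c
/* y = A * x for an n x n row-major matrix A. */
void mvm(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];   /* unit-stride row access */
        y[i] = sum;
    }
}
```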

[Chart: SVD performance, 10n x n matrix, rank 10; from Herman, Loo, Tang, CS252 project]

IRAM Latency Advantage
–1997 estimate: 5-10x improvement
  –No parallel DRAMs, memory controller, bus to turn around, SIMM module, pins…
  –30 ns for IRAM (or much lower with a DRAM redesign)
  –Compare to Alpha 600: 180 ns for 128b; 270 ns for 512b
–2000 estimate: 5x improvement
  –IRAM memory latency is 25 ns for 256 bits, fixed pipeline delay
  –Alpha 4000/4100: 120 ns

IRAM Bandwidth Advantage
–1997 estimate: 100x
  –1024 1-Mbit modules (1 Gb chip), each 1 Kb wide; 40 ns RAS/CAS = 320 GBytes/sec
  –If a crossbar switch or multiple busses deliver 1/3 to 2/3 of total: ~100-200 GBytes/sec
  –Compare to: AlphaServer 8400 = 1.2 GBytes/sec, 4100 = 1.1 GBytes/sec
–2000 estimate:
  –VIRAM-1 16 MB chip divided into 8 banks => 51.2 GB/s peak from the memory banks
  –Crossbar can consume 12.8 GB/s
  –6.4 GB/s from the vector unit, with additional bandwidth from either the scalar core or I/O
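The arithmetic behind the VIRAM-1 figures, as a quick check (assuming each of the 8 banks supplies its 256-bit interface once per 200 MHz cycle, and that the vector unit's 6.4 GB/s corresponds to one 256-bit access per cycle):

```c
#include <stdio.h>

int main(void) {
    const double hz = 200e6;                  /* 200 MHz clock */
    double bank_peak = 8 * (256 / 8.0) * hz;  /* 8 banks x 32 bytes/cycle */
    double vu_demand = (256 / 8.0) * hz;      /* one 256-bit access/cycle */
    printf("memory banks peak: %.1f GB/s\n", bank_peak / 1e9);  /* 51.2 */
    printf("vector unit:       %.1f GB/s\n", vu_demand / 1e9);  /* 6.4 */
    return 0;
}
```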

Power and Energy Advantages
–1997 case study of StrongARM memory hierarchy vs. IRAM memory hierarchy
  –cell-size advantages → much larger cache → fewer off-chip references → up to 2X-4X energy efficiency for memory-intensive algorithms
  –less energy per bit access for DRAM
–Power target for VIRAM-1
  –2 watt goal
  –Based on preliminary SPICE runs, this looks very feasible today
  –Scalar core included

Summary
–IRAM takes advantage of high on-chip bandwidth
–The Vector IRAM ISA utilizes this bandwidth
  –Unit, strided, and indexed memory access patterns supported
  –Exploits fine-grained parallelism, even with pointer chasing
–Compiler
  –Well-understood compiler model, semi-automatic
  –Still some work on code generation quality
–Application benchmarks
  –Compiled and hand-coded
  –Include FFT, SVD, MVM, sparse MVM, and other kernels used in image and signal processing

IRAM Applications: Intelligent PDA
–Pilot PDA + Game Boy, cell phone, radio, timer, camera, TV remote, AM/FM radio, garage door opener, …
–+ Wireless data (WWW)
–+ Speech and vision recognition
–+ Voice output for conversations
–Speech control
–+ Vision to see, scan documents, read bar codes, …

IRAM as Building Block for ISTORE
–System-on-a-chip enables computer, memory, and redundant network interfaces without significantly increasing the size of the disk
–Target for coming years:
  –Building block: 2006 MicroDrive integrated with IRAM
    –9 GB disk, 50 MB/sec from disk (projected)
    –connected via crossbar switch
    –O(10) Gflops
  –10,000+ nodes fit into one rack!