IRAM: A Microprocessor for the Post-PC Era

Presentation transcript:

IRAM: A Microprocessor for the Post-PC Era
David A. Patterson
EECS, University of California, Berkeley, CA 94720-1776
http://cs.berkeley.edu/~patterson/talks
patterson@cs.berkeley.edu
Notes: Early: as a result of thinking about 2020, are we really going to keep spending billions per fab, separate for memory and microprocessor, for the next 25 years?
Slides to add: breakdown of RAS/CAS times and where the time goes, to show what 60 ns means. Put back in DRAM opening chip. Get photos/GIFs of boards and chips of Sun Server? History of sizes of Excel, Word? Ask Gray? Ask Sites on quote?

Perspective on Post-PC Era
PostPC Era will be driven by 2 technologies:
1) Mobile Consumer Devices, e.g., successor to PDA, cell phone, wearable computers
2) Infrastructure to Support such Devices, e.g., successor to Big Fat Web Servers, Database Servers

A Better Media for Mobile Multimedia MPUs: Logic+DRAM
Crash of DRAM market inspires new use of wafers
Faster logic in DRAM process
DRAM vendors offer faster transistors + same number of metal layers as a good logic process? @ ~20% higher cost per wafer?
Called Intelligent RAM (“IRAM”) since most of the transistors will be DRAM
Notes: Lessons of the last 20 years: large memory, uniform memory access

IRAM Vision Statement
Microprocessor & DRAM on a single chip:
  on-chip memory latency 5-10X, bandwidth 50-100X
  improve energy efficiency 2X-4X (no off-chip bus)
  serial I/O 5-10X v. buses
  smaller board area/volume
  adjustable memory size/width
[Figure: block diagrams labeled Proc, $, L2$, Bus, I/O, DRAM, and fab]
Notes: $B for separate lines for logic and memory. Single chip: either processor in DRAM or memory in logic fab.

Potential Multimedia Architecture
“New” model: VSIW = Very Short Instruction Word!
Compact: describe N operations with 1 short instruction
Predictable (real-time) performance vs. statistical performance (cache)
Multimedia ready: choose N*64b, 2N*32b, 4N*16b
Easy to get high performance
Easy to scale hardware
Cost/Performance up, down
Compiler technology already developed, for sale!
Don’t have to write all programs in assembly language
Notes: Why MPP? Best potential performance! Few successes. Operate on vectors of registers. It's easier to vectorize than parallelize. Scales well: more hardware and slower clock rate. Crazy research.
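The "Multimedia ready: choose N*64b, 2N*32b, 4N*16b" point above can be illustrated with a small sketch. It is not taken from the talk: the helper names `pack`, `unpack`, and `packed_add` are hypothetical, and the code only models, in software, how one 64-bit datapath word can be treated as one 64-bit, two 32-bit, or four 16-bit lanes that are added independently.

```python
def pack(lanes, lane_bits):
    """Pack unsigned lane values into one word, lane 0 in the low bits."""
    mask = (1 << lane_bits) - 1
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & mask) << (i * lane_bits)
    return word


def unpack(word, lane_bits, word_bits=64):
    """Split a word back into its unsigned lane values, lane 0 first."""
    mask = (1 << lane_bits) - 1
    return [(word >> shift) & mask for shift in range(0, word_bits, lane_bits)]


def packed_add(a, b, lane_bits, word_bits=64):
    """Add two words lane by lane; each lane wraps around on overflow
    and never carries into its neighbor (hypothetical model, not VIRAM's ISA)."""
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, word_bits, lane_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift
    return result


# One 64-bit word holds four 16-bit pixels; a single "instruction" adds all four.
pixels = pack([1, 2, 3, 4], 16)
offsets = pack([10, 20, 30, 40], 16)
summed = unpack(packed_add(pixels, offsets, 16), 16)
```

With narrower elements the same word carries more lanes, which is why the slide quotes N, 2N, and 4N operations for 64-, 32-, and 16-bit data.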

Revive Vector (= VSIW) Architecture!
Cost: ~$1M each? => Single-chip CMOS MPU/IRAM
Low latency, high BW memory system? => IRAM
Code density? => Much smaller than VLIW
Compilers? => For sale, mature (>20 years)
Power/Energy? => Parallel to save energy, keep perf
Limited to scientific applications? => Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b
Notes: Supercomputer industry dead? Very attractive to scale. New class of applications. Before, vector machines had a lousy scalar processor; a modest CPU will do well on many programs, and vector will do great on others.

V-IRAM1: 0.18 µm, Fast Logic, 200 MHz, 1.6 GFLOPS(64b)/6.4 GOPS(16b)/32MB
[Figure: block diagram with these labels: 2-way Superscalar Processor; 16K I cache; 16K D cache; Vector Instruction Queue; Vector Registers; +, x, ÷, and Load/Store units; 4 x 64 or 8 x 32 or 16 x 16; Memory Crossbar Switch; M ...; 4 x 64; Serial I/O]
1Gbit technology
Put in perspective: 10X of Cray T90 today
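The peak figures on this slide can be reproduced with simple arithmetic. The sketch below is an illustration, not the authors' calculation: it assumes two vector arithmetic units issue in parallel (the figure shows "+" and "x" units), each 256 bits wide (4 x 64, 8 x 32, or 16 x 16), at the stated 200 MHz. The constant `ARITHMETIC_UNITS` and the function name are hypothetical.

```python
CLOCK_HZ = 200_000_000   # 200 MHz, from the slide
DATAPATH_BITS = 256      # 4 x 64 = 8 x 32 = 16 x 16, from the slide
ARITHMETIC_UNITS = 2     # assumption: "+" and "x" units both busy every cycle


def peak_ops_per_second(element_bits):
    """Peak operations per second for a given element width in bits."""
    lanes = DATAPATH_BITS // element_bits
    return lanes * ARITHMETIC_UNITS * CLOCK_HZ


peak_64b = peak_ops_per_second(64)
peak_32b = peak_ops_per_second(32)
peak_16b = peak_ops_per_second(16)
```

Under these assumptions the 64-bit and 16-bit results match the slide's 1.6 GFLOPS and 6.4 GOPS; the 32-bit peak of 3.2 GOPS is not stated on the slide and follows only from the same assumptions.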

Tentative VIRAM-1 Floorplan
0.18 µm DRAM: 16-32 MB in 16 banks x 256b
0.18 µm, 5 Metal Logic
~200 MHz MIPS IV, 16K I$, 16K D$
~4 200 MHz FP/int. vector units
die: ~20 x 20 mm
xtors: ~130-250M
power: ~2 Watts
[Figure: floorplan with two blocks labeled Memory (128 Mbits / 16 MBytes), 4 Vector Pipes/Lanes, CPU+$, Ring-based Switch, and I/O]
Notes: Floor plan showing memory in purple, crossbar in blue (need to match vector unit, not maximum memory system), vector units in pink, CPU in orange, I/O in yellow. How to spend 1B transistors vs. all CPU! VFU size based on looking at 3 MPUs in 0.25 micron technology: MIPS 12000, 15 mm^2 for 1 FPU (mul, add, misc); IBM Power3, 48 mm^2 for 2 FPUs (2 mul/add units); HAL SPARC III, 40 mm^2 for 2 FPUs (2 multiply, add units).
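As a rough sanity check on the memory and transistor figures above, the following sketch converts megabytes to megabits and counts DRAM cell transistors. It assumes binary megabytes and the standard one-transistor DRAM cell; it ignores the logic, cache, and peripheral transistors, so it is a back-of-the-envelope estimate rather than the designers' own count. The function names are hypothetical.

```python
BITS_PER_BYTE = 8
BYTES_PER_MB = 2 ** 20   # assumption: binary megabytes


def mbytes_to_mbits(mbytes):
    """Convert a capacity in MBytes to Mbits."""
    return mbytes * BITS_PER_BYTE


def dram_cell_transistors(mbytes):
    """Transistors in the DRAM array alone, assuming one transistor per bit."""
    return mbytes * BYTES_PER_MB * BITS_PER_BYTE


mbits_16mb = mbytes_to_mbits(16)
cells_16mb = dram_cell_transistors(16)
cells_32mb = dram_cell_transistors(32)
```

The 16 MB case reproduces the slide's "128 Mbits / 16 MBytes" label, and the cell counts for 16 MB and 32 MB (about 134M and 268M) are in the same range as the slide's "~130-250M" transistor estimate.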

VIRAM-1 Simulated Performance
Kernel                    GOPS   % Peak   Cycles/pixel (small=fast)
                                          VIRAM   MMX    TMS'C82
16-bit Compositing        6.40   100%     0.13    --     --
16b iDCT                  3.10   48%      0.75    3.75   5.70
32b Color Conversion      2.95   92%      0.78    8.00   --
32-bit Convolution        3.16   99%      1.21    5.49   6.50
32b FP Matrix Multiply    3.19   97%      n.a.    n.a.   n.a.
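The cycles-per-pixel columns can be turned into ratios against VIRAM, as sketched below. The values are read from the table above, with `None` standing in for the "--" entries; the dictionary layout and the function name `cycle_ratio` are hypothetical. The ratio compares cycle counts only and does not account for any difference in clock rate between the chips, which the slide does not give.

```python
# Cycles per pixel as read from the slide (smaller is faster).
# None marks entries shown as "--" in the source.
CYCLES_PER_PIXEL = {
    "Compositing": {"VIRAM": 0.13, "MMX": None, "TMS'C82": None},
    "iDCT": {"VIRAM": 0.75, "MMX": 3.75, "TMS'C82": 5.70},
    "Color Conversion": {"VIRAM": 0.78, "MMX": 8.00, "TMS'C82": None},
    "Convolution": {"VIRAM": 1.21, "MMX": 5.49, "TMS'C82": 6.50},
}


def cycle_ratio(kernel, other, baseline="VIRAM"):
    """How many times more cycles per pixel `other` needs than `baseline`.

    Returns None when either entry is missing from the table.
    """
    row = CYCLES_PER_PIXEL[kernel]
    if row[other] is None or row[baseline] is None:
        return None
    return row[other] / row[baseline]


idct_vs_mmx = cycle_ratio("iDCT", "MMX")
color_vs_mmx = cycle_ratio("Color Conversion", "MMX")
```

For example, the iDCT row gives 3.75 / 0.75, so MMX needs five times as many cycles per pixel as simulated VIRAM on that kernel.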

Tentative VIRAM-”0.25” Floorplan
Demonstrate scalability via 2nd layout (automatic from 1st)
4-8 MB in 2 banks x 256b
~200 MHz CPU, 8K I$, 8K D$
1 ~200 MHz FP/int. vector unit
die: ~5 x 20 mm
xtors: ~35M - 70M
power: ~0.5 Watts
Kernel       GOPS V-1   GOPS V-0.25
Comp.        6.40       1.6
iDCT         3.10       0.8
Clr.Conv.    2.95       0.8
Convol.      3.16       0.8
FP Matrix    3.19       0.8
[Figure: floorplan with two blocks labeled Memory (32 Mb / 4 MB), CPU+$, and 1 VU]
Notes: Floor plan showing memory in purple, crossbar in blue (need to match vector unit, not maximum memory system), vector units in pink, CPU in orange, I/O in yellow. How to spend 1B transistors vs. all CPU! VFU size based on looking at 3 MPUs in 0.25 micron technology: MIPS 12000, 15 mm^2 for 1 FPU (mul, add, misc); IBM Power3, 48 mm^2 for 2 FPUs (2 mul/add units); HAL SPARC III, 40 mm^2 for 2 FPUs (2 multiply, add units).

V-IRAM-1 Tentative Plan
Phase 1: Feasibility stage (H2’98): test chip, CAD agreement, architecture defined
Phase 2: Design & Layout stage (~’99): test chip, simulated design and layout
Phase 3: Verification (~1Q’00): tape-out Q2’00
Phase 4: Fabrication, Testing, and Demonstration (~3Q’00): functional integrated circuit
100M transistor microprocessor before Intel?

IRAM not a new idea
Stone, ‘70 “Logic-in memory”
Barron, ‘78 “Transputer”
Dally, ‘90 “J-machine”
Patterson, ‘90 panel session
Kogge, ‘94 “Execube”
[Figure: scatter plot with axes "Bits of Arithmetic Unit" and "Mbits of Memory". Points: IRAMUNI?, IRAMMPP?, PPRAM, Mitsubishi M32R/D, PIP-RAM, Computational RAM, Pentium Pro, Execube, Alpha 21164, Transputer T9, Terasys. Legend: SIMD on chip (DRAM), Uniprocessor (SRAM), MIMD on chip (DRAM), Uniprocessor (DRAM), MIMD component (SRAM)]
Scale no. of processors with memory capacity => on-chip MPP => difficult SW problem, especially with limited memory/proc
Scale memory capacity with processor speed => uniprocessor => easier SW problem, especially with more memory/proc

IRAM Chip Challenges
Merged Logic-DRAM process: cost of wafer, impact on yield, testing cost of logic and DRAM
Price of on-chip DRAM vs. separate DRAM chips?
Time delay in transistor speeds, memory cell sizes in merged process vs. logic only or DRAM only
DRAM block: flexibility via DRAM “compiler” (vary size, width, no. of subbanks) vs. fixed block; synchronous interface available?
Applications: advantages in memory bandwidth, energy, system size to offset above challenges?
Notes: Or speed, area, power, yield of DRAM in a logic process. Can a portion slow down in performance and the chip still be attractive? Testing time much worse, or better due to BIST? DRAM operating at 1 watt: every 10 degree increase in operating temperature doubles the refresh rate; what to do? IRAM: acts as MP, acts as cache to real memory, acts as low part of physical address space + OS?
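The refresh rule of thumb quoted in the notes above (the refresh rate doubles for every 10 degree rise in operating temperature) can be written as a multiplier of 2^(ΔT/10). The sketch below applies just that rule; the function name is hypothetical, and the temperature rises in the examples are illustrative values, not figures from the talk.

```python
def refresh_rate_multiplier(temperature_rise_degrees):
    """Factor by which the DRAM refresh rate grows for a given temperature rise,
    using the slide's rule of thumb: doubling every 10 degrees."""
    return 2 ** (temperature_rise_degrees / 10)


# Illustrative temperature rises (not from the talk).
no_rise = refresh_rate_multiplier(0)
ten_degrees = refresh_rate_multiplier(10)
thirty_degrees = refresh_rate_multiplier(30)
```

Because the growth is exponential, a 30 degree rise implies eight times the refresh activity, which is why on-chip logic power matters for a merged logic-DRAM part.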

Commercial IRAM highway is governed by memory per IRAM?
[Figure: highway of application classes by memory per chip: Video Games and Graphics Acc. at 2 MB; Network Computer and Super PDA/Phone at 8 MB; Laptop at 32 MB]
Notes: Limited by DRAM on chip: DRAM/chip increases faster than application memory demand, so I expect new applications to become popular as memory per chip increases from 1 MB to 4 MB to 16 MB to 64 MB (1 Gbit = 128 MB).
(Slide prepared February 1997)

Sony Playstation 2000
Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5)
MIPS core + vector coprocessor + graphics/DRAM
Claim: Toy Story realism brought to games!

Infrastructure for Next Generation
Servers today based on desktop MPUs: Central Processor Units + Peripheral Disks
What would servers look like if based on mobile, multimedia microprocessors?
Include processor, network interface inside disk
ISTORE: a HW/software architecture for building scaleable, self-maintaining storage
An introspective system: processor/disk => it monitors itself and acts on its observations
No administrators to configure, monitor, tune

ISTORE-I Hardware
ISTORE uses “intelligent” hardware
Intelligent Chassis: scaleable, redundant, fast network + UPS
Intelligent Disk “Brick”: a disk, plus a fast embedded CPU, memory, and redundant network interfaces
[Figure labels: Device, CPU, memory, NI]

2006 ISTORE
IBM MicroDrive: 1.7” x 1.4” x 0.2”
  1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
  2006: 9 GB, 50 MB/s?
ISTORE node: MicroDrive + IRAM
Crossbar switches growing by Moore’s Law: 16 x 16 in 1999 => 64 x 64 in 2005
ISTORE rack (19” x 33” x 84”):
  1 tray (3” high) => 16 x 32 => 512 ISTORE nodes
  20 trays + switches + UPS => 10,240 ISTORE nodes(!)
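The rack arithmetic on this slide can be checked with a short sketch. The node counts come directly from the slide (16 x 32 nodes per tray, 20 trays per rack). The aggregate capacity and bandwidth lines go one step further than the slide: they multiply by the per-drive 2006 figures (9 GB, 50 MB/s), which the slide itself marks with a question mark, so they should be read as speculative. All names are hypothetical.

```python
NODES_PER_TRAY = 16 * 32   # from the slide: 1 tray => 16 x 32 nodes
TRAYS_PER_RACK = 20        # from the slide

# Per-node 2006 projections; the slide marks these with "?".
DRIVE_CAPACITY_GB = 9
DRIVE_BANDWIDTH_MB_S = 50

nodes_per_rack = NODES_PER_TRAY * TRAYS_PER_RACK

# Derived here, not stated on the slide: simple sums over all nodes,
# ignoring switch limits, redundancy, and any other overhead.
rack_capacity_gb = nodes_per_rack * DRIVE_CAPACITY_GB
rack_bandwidth_mb_s = nodes_per_rack * DRIVE_BANDWIDTH_MB_S
```

The node totals agree with the slide's 512 per tray and 10,240 per rack; the capacity and bandwidth sums are upper bounds under the stated assumptions.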

IRAM Conclusion
IRAM potential in mem/IO BW, energy, board area; challenges in power/performance, testing, yield
10X-100X improvements based on technology shipping for 20 years (not JJ, photons, MEMS, ...)
Suppose IRAM is successful: revolution in computer implementation
Potential Impact #1: turn server industry inside-out?
Potential #2: shift semiconductor balance of power? Who ships the most memory? Most microprocessors?
Notes: The captain-of-industry challenge is taking advantage of new technology once you see the quantification. Balance of power: microprocessor companies shipping most of the DRAM, or DRAM companies shipping most of the microprocessors? Not talking about exotic technology based on photons or neurons, but about opening up technology that has shipped for 20 years.

Acknowledgments Looking for ideas of VIRAM enabled apps Contact us if you’re interested: email: patterson@cs.berkeley.edu http://iram.cs.berkeley.edu/ Thanks for advice/support: DARPA, California MICRO, Hitachi, IBM, Intel, LG Semicon, Micron, Microsoft, Neomagic, Sandcraft, SGI/Cray, Sun Microsystems, TI, TSMC

Backup Slides (The following slides are used to help answer questions)

Near-term IRAM Applications (slide done in 1997)
“Intelligent” Set-top
  2.6M Nintendo 64 (~$150) sold in 1st year
  4-chip Nintendo => 1-chip: 3D graphics, sound, fun!
“Intelligent” Personal Digital Assistant
  0.6M PalmPilots (~$300) sold in 1st 6 months
  Handwriting + learn new alphabet ([glyph] = K, [glyph] = T, [glyph] = 4) v. speech input
Notes: A supercomputer you could lose? Honey, I can’t find my supercomputer; have you seen it? Look at the speed of the processor and the amount of I/O: it seems that one can have a balanced system using GHz serial I/O. Point 2: DRAM vs. disk: now 10^4 faster latency and bandwidth.

Words to Remember “...a strategic inflection point is a time in the life of a business when its fundamentals are about to change. ... Let's not mince words: A strategic inflection point can be deadly when unattended to. Companies that begin a decline as a result of its changes rarely recover their previous greatness.” Only the Paranoid Survive, Andrew S. Grove, 1996