IRAM: A Microprocessor for the Post-PC Era

David A. Patterson
EECS, University of California, Berkeley, CA
Notes: Early on, as a result of thinking about 2020: are we really going to keep spending billions per fab, separate for memory and microprocessor, for the next 25 years?
Slides to add: breakdown of RAS/CAS times and where the time goes, to show what 60 ns means; put back in DRAM opening chip; get photos/GIFs of boards and chips of Sun server? History of sizes of Excel, Word? Ask Gray? Ask Sites on quote?

Perspective on Post-PC Era
PostPC Era will be driven by 2 technologies:
1) Mobile Consumer Devices
  e.g., successor to PDA, cell phone, wearable computers
2) Infrastructure to Support such Devices
  e.g., successor to Big Fat Web Servers, Database Servers

A Better Medium for Mobile Multimedia MPUs: Logic+DRAM
Crash of DRAM market inspires new use of wafers
Faster logic in DRAM process
  DRAM vendors offer faster transistors + same number of metal layers as good logic
  ≈ 20% higher cost per wafer?
Called Intelligent RAM (“IRAM”) since most of the transistors will be DRAM
Notes: Lessons from the last 20 years: large memory, uniform memory access

IRAM Vision Statement

Microprocessor & DRAM on a single chip:
  on-chip memory latency 5-10X, bandwidth X
  improve energy efficiency 2X-4X (no off-chip bus)
  serial I/O 5-10X v. buses
  smaller board area/volume
  adjustable memory size/width
[Figure: a conventional system (processor with $ and L2$, buses, I/O, and separate DRAM chips, built in separate logic and DRAM fabs) next to an IRAM (processor, DRAM, and I/O on one chip)]
Notes: $B for separate lines for logic and memory. Single chip: either processor in DRAM or memory in logic fab.

Potential Multimedia Architecture
“New” model: VSIW = Very Short Instruction Word!
  Compact: describe N operations with 1 short instruction
  Predictable (real-time) performance vs. statistical performance (cache)
  Multimedia ready: choose N*64b, 2N*32b, 4N*16b
  Easy to get high performance
  Compiler technology already developed, for sale!
    Don’t have to write all programs in assembly language
Notes: Why MPP? Best potential performance! Few successes. Operate on vectors of registers. It’s easier to vectorize than parallelize. Scales well: more hardware and slower clock rate. Crazy research.

Revive Vector (= VSIW) Architecture!
Objection → Answer
Cost: ≈ $1M each? → Single-chip CMOS MPU/IRAM
Low latency, high BW memory system? → IRAM
Code density? → Much smaller than VLIW
Compilers? → For sale, mature (>20 years)
Performance? → Easy to scale speed with technology
Power/Energy? → Parallel to save energy, keep perf
Limited to scientific applications? → Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b
Notes: Supercomputer industry dead? Very attractive to scale. New class of applications. Before, vector machines had a lousy scalar processor; a modest CPU will do well on many programs, and vector will do great on others.

V-IRAM1: 0.18 µm, Fast Logic, 200 MHz, 1.6 GFLOPS(64b)/6.4 GOPS(16b)/16MB
[Figure: block diagram. A 2-way superscalar processor with a 16K I cache and a 16K D cache feeds a vector instruction queue. The vector registers and the vector units (+, x, ÷, load/store) are each 4 x 64, or 8 x 32, or 16 x 16 wide. A memory crossbar switch connects them over 4 x 64 paths to the DRAM banks (M) and to serial I/O.]
Notes: 1 Gbit technology. Put in perspective: 10X of Cray T90 today.
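The headline rates in this slide's title can be cross-checked with simple arithmetic. The sketch below assumes that peak rate = clock x vector units x lanes x elements packed per 64-bit lane, and it assumes 2 vector units issuing per cycle, which is one reading of the "2-way" label rather than something the slide states outright. The function name `peak_gops` is hypothetical.

```python
def peak_gops(clock_mhz, vector_units, lanes, element_bits, lane_bits=64):
    """Peak operations per second, in units of 10^9, for a lane-based vector unit.

    Assumption: every lane of every vector unit completes one operation
    per packed element per cycle.
    """
    if lane_bits % element_bits != 0:
        raise ValueError("element width must divide the lane width")
    elements_per_lane = lane_bits // element_bits   # 1, 2, or 4 for 64b, 32b, 16b
    ops_per_cycle = vector_units * lanes * elements_per_lane
    return clock_mhz * 1e6 * ops_per_cycle / 1e9


# 200 MHz, 2 vector units (assumed), 4 lanes of 64 bits each
for bits in (64, 32, 16):
    print(f"{bits}b elements: {peak_gops(200, 2, 4, bits):.1f} G ops/s")
```

Under these assumptions the 64b and 16b cases reproduce the 1.6 and 6.4 figures from the title, and the 4 x 64, 8 x 32, 16 x 16 widths in the diagram correspond to the three element sizes.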

Tentative VIRAM-1 Floorplan
0.18 µm DRAM MB in 16 banks x 256b
0.18 µm, 5 Metal Logic
≈ 200 MHz MIPS IV, K I$, 16K D$
≈ MHz FP/int. vector units
die: ≈ 20x20 mm
xtors: ≈ M
power: ≈2 Watts
[Figure: floor plan with two memory blocks (128 Mbits / 16 MBytes each), 4 vector pipes/lanes, CPU+$, ring-based switch, and I/O]
Notes: Floor plan shows memory in purple, crossbar in blue (need to match vector unit, not maximum memory system), vector units in pink, CPU in orange, I/O in yellow. How to spend 1B transistors vs. all CPU! VFU size based on looking at 3 MPUs in 0.25 micron technology: MIPS mm2 for 1 FPU (Mul, Add, misc); IBM Power3 48 mm2 for 2 FPUs (2 mul/add units); HAL SPARC III 40 mm2 for 2 FPUs (2 multiply, add units).

VIRAM-1 Simulated Performance
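[Table: columns are Kernel, GOPS, % Peak, and Cycles/pixel (small=fast) for 16b VIRAM, MMX, and TMS‘C82. Rows are Compositing, 16b iDCT, 32b Color Conversion, 32b Convolution, and 32b FP Matrix Multiply. The numeric entries are not preserved in this transcript.]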

Tentative VIRAM-“0.25” Floorplan
Demonstrate scalability via 2nd layout (automatic from 1st)
8 MB in 2 banks x 256b, 32 subbanks
≈ 200 MHz CPU, 8K I$, 8K D$
1 ≈ 200 MHz FP/int. vector units
die: ≈ 5 x 20 mm
xtors: ≈ 70M
power: ≈0.5 Watts
[Table: Kernel GOPS for V-1 and V-0.25 on Comp, iDCT, Clr.Conv, Convol, and FP Matrix. The numeric entries are not preserved in this transcript.]
[Figure: floor plan with two memory blocks (32 Mb / 4 MB each), CPU+$, and 1 VU]
Notes: Floor plan shows memory in purple, crossbar in blue (need to match vector unit, not maximum memory system), vector units in pink, CPU in orange, I/O in yellow. How to spend 1B transistors vs. all CPU! VFU size based on looking at 3 MPUs in 0.25 micron technology: MIPS mm2 for 1 FPU (Mul, Add, misc); IBM Power3 48 mm2 for 2 FPUs (2 mul/add units); HAL SPARC III 40 mm2 for 2 FPUs (2 multiply, add units).

V-IRAM-1 Tentative Plan
Phase 1: Feasibility stage (≈H2’98)
  Test chip, CAD agreement, architecture defined
Phase 2: Design & Layout stage (≈’99)
  Test chip, simulated design and layout
Phase 3: Verification (≈1Q’00)
  Tape-out Q2’00
Phase 4: Fabrication, Testing, and Demonstration (≈3Q’00)
  Functional integrated circuit
100M transistor microprocessor before Intel?

IRAM not a new idea
Prior work:
  Stone, ’70 “Logic-in memory”
  Barron, ’78 “Transputer”
  Dally, ’90 “J-machine”
  Patterson, ’90 panel session
  Kogge, ’94 “Execube”
[Figure: log-log plot of Bits of Arithmetic Unit (1 to 1000) against Mbits of Memory (0.1 to 10000). Plotted designs: IRAMUNI?, IRAMMPP?, PPRAM, Mitsubishi M32R/D, PIP-RAM, Computational RAM, Pentium Pro, Execube, Alpha 21164, Transputer T9, Terasys. Legend: SIMD on chip (DRAM), Uniprocessor (SRAM), MIMD on chip (DRAM), Uniprocessor (DRAM), MIMD component (SRAM).]
Scale no. of processors with memory capacity ⇒ on-chip MPP ⇒ difficult SW problem, especially with limited memory/proc
Scale memory capacity with processor speed ⇒ uniprocessor ⇒ easier SW problem, especially with more memory/proc

IRAM Chip Challenges

Merged Logic-DRAM process: cost of wafer, impact on yield, testing cost of logic and DRAM
Price of on-chip DRAM vs. separate DRAM chips?
Time delay of transistor speeds, memory cell sizes in merged process vs. logic only or DRAM only
DRAM block: flexibility via DRAM “compiler” (vary size, width, no. of subbanks) vs. fixed block; synchronous interface available?
Applications: advantages in memory bandwidth, energy, system size to offset above challenges?
Notes: Or speed, area, power, yield of DRAM in logic process. Can one portion slow down in performance and the chip still be attractive? Testing time much worse, or better due to BIST? DRAM operates at 1 watt: every 10 degrees increase in operating temperature doubles refresh rate; what to do? IRAM: acts as MP, acts as cache to real memory, acts as low part of physical address space + OS?
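The refresh note on this slide amounts to an exponential rule of thumb. A minimal sketch, assuming the doubling per 10 degrees holds exactly across the range of interest; the function name and the reference temperature are hypothetical placeholders, and no absolute refresh rate is taken from the slides.

```python
def refresh_rate_multiplier(temp_c, reference_temp_c, degrees_per_doubling=10.0):
    """Factor by which the required refresh rate grows relative to a reference temperature.

    Assumption: the rate doubles for every `degrees_per_doubling` degrees of
    temperature rise, as the slide's rule of thumb says.
    """
    return 2.0 ** ((temp_c - reference_temp_c) / degrees_per_doubling)


# Hypothetical example: a die running 30 degrees hotter than the reference point
print(refresh_rate_multiplier(temp_c=75.0, reference_temp_c=45.0))
```

Because the factor compounds, a modest rise in die temperature from on-chip logic translates into several times more refresh activity, which is why the slide flags it as an open problem.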

Sony Playstation 2000

Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5)
Superscalar MIPS core + vector coprocessor + graphics/DRAM
Claim: Toy Story realism brought to games!

Infrastructure for Next Generation
Servers today based on desktop MPUs: Central Processor Units + Peripheral Disks
What would servers look like if based on mobile, multimedia microprocessors?
  Include processor, network interface inside disk
ISTORE: a HW/software architecture for building scaleable, self-maintaining storage
  An introspective system: processor/disk ⇒ it monitors itself and acts on its observations
  No administrators to configure, monitor, tune

ISTORE-I Hardware
ISTORE uses “intelligent” hardware
  Intelligent Chassis: scaleable, redundant, fast network + UPS
  Intelligent Disk “Brick”: a disk, plus a fast embedded CPU, memory, and redundant network interfaces
[Figure: a device paired with CPU, memory, and NI]

IRAM Conclusion

IRAM potential in mem/IO BW, energy, board area; challenges in power/performance, testing, yield
10X-100X improvements based on technology shipping for 20 years (not JJ, photons, MEMS, ...)
Suppose IRAM is successful: revolution in computer implementation
  Potential Impact #1: turn server industry inside-out?
  Potential Impact #2: shift semiconductor balance of power? Who ships the most memory? Most microprocessors?
Notes: The captain-of-industry challenge is taking advantage of new technology once they see the quantification. Balance of power: microprocessor (MPer) companies shipping most of the DRAM, or DRAM companies shipping most of the MPers. Not talking about exotic technology based on photons or neurons; this is based on opening up technology that has shipped for 20 years.

Acknowledgments

Looking for ideas of VIRAM enabled apps
Contact us if you’re interested:
Thanks for advice/support: DARPA, California MICRO, Hitachi, IBM, Intel, LG Semicon, Microsoft, Neomagic, Sandcraft, SGI/Cray, Sun Microsystems, TI, TSMC

Backup Slides
(The following slides are used to help answer questions)

Commercial IRAM highway is governed by memory per IRAM?
[Figure: application tiers by memory per IRAM]
  32 MB: Laptop, Network Computer
  8 MB: Super PDA/Phone
  2 MB: Video Games, Graphics Acc.
Notes: Limited by DRAM on chip: DRAM/chip increases faster than application memory demand, so I expect new applications to become popular as memory per chip increases from 1 MB to 4 MB to 16 MB to 64 MB (1 Gbit = 128 MB).
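The capacity figures on these slides follow from dividing bits by 8. A small sketch, assuming binary prefixes (1 Gbit = 1024 Mbit), which is the convention under which the slide's 128 MB figure comes out; the helper name and the 4x-per-step list are illustrative only.

```python
def mbits_to_mbytes(mbits):
    """Convert a DRAM capacity in Mbits to MBytes (8 bits per byte)."""
    return mbits / 8


# Assumption: 1 Gbit = 1024 Mbit (binary prefix)
print(mbits_to_mbytes(1024))  # 1 Gbit chip
print(mbits_to_mbytes(128))   # one 128 Mbit memory block
print(mbits_to_mbytes(32))    # one 32 Mb memory block

# The slide's progression of memory per chip grows 4x per step
steps_mb = [1 * 4 ** i for i in range(4)]
print(steps_mb)
```

Each step in the progression crosses one of the application tiers in the figure, which is the sense in which memory per chip governs which markets open up.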

Near-term IRAM Applications
“Intelligent” Set-top
  2.6M Nintendo 64 (≈ $150) sold in 1st year
  4-chip Nintendo ⇒ 1-chip: 3D graphics, sound, fun!
“Intelligent” Personal Digital Assistant
  0.6M PalmPilots (≈ $300) sold in 1st 6 months
  Handwriting + learn new alphabet (glyphs for K, T, 4) v. Speech input
Notes: A supercomputer you could lose? “Honey, I can’t find my supercomputer; have you seen it?” Look at the speed of the processor and the amount of I/O: it seems that one can have a balanced system using GHz serial I/O. Point 2: DRAM vs. disk: now 10^4 faster in latency and bandwidth.

Words to Remember

“...a strategic inflection point is a time in the life of a business when its fundamentals are about to change. ... Let's not mince words: A strategic inflection point can be deadly when unattended to. Companies that begin a decline as a result of its changes rarely recover their previous greatness.”
Only the Paranoid Survive, Andrew S. Grove, 1996

2006 ISTORE
IBM MicroDrive: 1.7” x 1.4” x 0.2”
  1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
  2006: 9 GB, 50 MB/s?
ISTORE node: MicroDrive + IRAM
Crossbar switches growing by Moore’s Law
  16 x 16 in 1999 ⇒ 64 x 64 in 2005
ISTORE rack (19” x 33” x 84”)
  1 tray (3” high) ⇒ 16 x 32 ⇒ 512 ISTORE nodes
  20 trays + switches + UPS ⇒ 10,240 ISTORE nodes(!)
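The rack arithmetic on this slide can be checked directly. The sketch below reproduces the node counts and, as a derived illustration that is not on the slide, multiplies through the projected 2006 per-drive figures (9 GB, 50 MB/s). The aggregate bandwidth is a raw sum over disks and ignores switch and network limits; the function and parameter names are hypothetical.

```python
def rack_totals(tray_rows=16, tray_cols=32, trays=20,
                gb_per_node=9, mb_per_s_per_node=50):
    """Node count and raw aggregate disk figures for one ISTORE rack.

    Assumptions: every tray is fully populated, and the per-node capacity
    and bandwidth are the slide's 2006 projections (decimal units).
    """
    nodes_per_tray = tray_rows * tray_cols
    nodes_per_rack = nodes_per_tray * trays
    return {
        "nodes_per_tray": nodes_per_tray,
        "nodes_per_rack": nodes_per_rack,
        "capacity_tb": nodes_per_rack * gb_per_node / 1000,
        "raw_disk_bandwidth_gb_per_s": nodes_per_rack * mb_per_s_per_node / 1000,
    }


for name, value in rack_totals().items():
    print(f"{name}: {value}")
```

The two node counts match the slide's 512 and 10,240; the capacity and bandwidth totals are only as good as the projected per-drive numbers.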
