1 The Future of Microprocessors Embedded in Memory David A. Patterson EECS, University.

Slides:



Advertisements
Similar presentations
The University of Adelaide, School of Computer Science
Advertisements

System Design Tricks for Low-Power Video Processing Jonah Probell, Director of Multimedia Solutions, ARC International.
Computer Architecture & Organization
Lecture 2: Modern Trends 1. 2 Microprocessor Performance Only 7% improvement in memory performance every year! 50% improvement in microprocessor performance.
1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 1 Fundamentals of Quantitative Design and Analysis Computer Architecture A Quantitative.
Room: E-3-31 Phone: Dr Masri Ayob TK 2123 COMPUTER ORGANISATION & ARCHITECTURE Lecture 4: Computer Performance.
Chapter 1. Introduction This course is all about how computers work But what do we mean by a computer? –Different types: desktop, servers, embedded devices.
Memory Hierarchy.1 Review: Major Components of a Computer Processor Control Datapath Memory Devices Input Output.
Introduction What is Parallel Algorithms? Why Parallel Algorithms? Evolution and Convergence of Parallel Algorithms Fundamental Design Issues.
VIRAM-1 Architecture Update and Status Christoforos E. Kozyrakis IRAM Retreat January 2000.
Retrospective on the VIRAM-1 Design Decisions Christoforos E. Kozyrakis IRAM Retreat January 9, 2001.
CIS 314 : Computer Organization Lecture 1 – Introduction.
1 IRAM: A Microprocessor for the Post-PC Era David A. Patterson EECS, University of.
6/30/2015HY220: Ιάκωβος Μαυροειδής1 Moore’s Law Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips.
CPE 731 Advanced Computer Architecture Multiprocessor Introduction
CalStan 3/2011 VIRAM-1 Floorplan – Tapeout June 01 Microprocessor –256-bit media processor –12-14 MBytes DRAM – Gops –2W at MHz –Industrial.
1 New Directions in Computer Architecture David A. Patterson EECS, University of California.
1 Chapter 4 The Central Processing Unit and Memory.
ASIC Design Introduction - 1 The history of Integrated Circuit (IC) The base for such a significant progress –Well understanding of semiconductor physics.
Computer performance.
NEW TRENDS IN COMPUTER ARCHITECTURE DESIGN Saeid Nooshabadi Arthur Sale University of Tasmania.
UC Berkeley 1 The Datacenter is the Computer David Patterson Director, RAD Lab January, 2007.
Motivation Mobile embedded systems are present in: –Cell phones –PDA’s –MP3 players –GPS units.
Semiconductor Memory 1970 Fairchild Size of a single core –i.e. 1 bit of magnetic core storage Holds 256 bits Non-destructive read Much faster than core.
1 Homework Reading –None (Finish all previous reading assignments) Machine Projects –Continue with MP5 Labs –Finish lab reports by deadline posted in lab.
Lecture 03: Fundamentals of Computer Design - Trends and Performance Kai Bu
Introduction CSE 410, Spring 2008 Computer Systems
Company LOGO High Performance Processors Miguel J. González Blanco Miguel A. Padilla Puig Felix Rivera Rivas.
Computers organization & Assembly Language Chapter 0 INTRODUCTION TO COMPUTING Basic Concepts.
Sogang University Advanced Computing System Chap 1. Computer Architecture Hyuk-Jun Lee, PhD Dept. of Computer Science and Engineering Sogang University.
1 Recap (from Previous Lecture). 2 Computer Architecture Computer Architecture involves 3 inter- related components – Instruction set architecture (ISA):
1 Computer System Organization I/O systemProcessor Compiler Operating System (Windows 98) Application (Netscape) Digital Design Circuit Design Instruction.
Memory Intensive Benchmarks: IRAM vs. Cache Based Machines Parry Husbands (LBNL) Brain Gaeke, Xiaoye Li, Leonid Oliker, Katherine Yelick (UCB/LBNL), Rupak.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 7: February 6, 2012 Memories.
3/15/2002CSE Final Remarks Concluding Remarks SOAP.
RISC By Ryan Aldana. Agenda Brief Overview of RISC and CISC Features of RISC Instruction Pipeline Register Windowing and renaming Data Conflicts Branch.
Computer Organization and Design Computer Abstractions and Technology
CPE 731 Advanced Computer Architecture Technology Trends Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of California,
CS/EE 5810 CS/EE 6810 F00: 1 Multimedia. CS/EE 5810 CS/EE 6810 F00: 2 New Architecture Direction “… media processing will become the dominant force in.
Sean Mathews, Christopher Kiser, Haoxiang Chen. Processor Design Tradeoffs: Instruction Set Design Support useful functions while implementing as efficiently.
Chapter 5: Computer Systems Design and Organization Dr Mohamed Menacer Taibah University
Succeeding with Technology Chapter 2 Hardware Designed to Meet the Need The Digital Revolution Integrated Circuits and Processing Storage Input, Output,
CPU Transforms Input and Output Each computer contains one Collection of electronic circuits Processor Interpretates and execute instructions in a program.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 8: February 19, 2014 Memories.
Lecture # 10 Processors Microcomputer Processors.
CS203 – Advanced Computer Architecture Performance Evaluation.
Introduction CSE 410, Spring 2005 Computer Systems
1 IRAM Vision Microprocessor & DRAM on a single chip: –on-chip memory latency 5-10X, bandwidth X –improve energy efficiency 2X-4X (no off-chip bus)
SPRING 2012 Assembly Language. Definition 2 A microprocessor is a silicon chip which forms the core of a microcomputer the concept of what goes into a.
William Stallings Computer Organization and Architecture 6th Edition
CS203 – Advanced Computer Architecture
Hardware Technology Trends and Database Opportunities
Visit for more Learning Resources
Chapter1 Fundamental of Computer Design
Architecture & Organization 1
Morgan Kaufmann Publishers
Vector Processing => Multimedia
CS775: Computer Architecture
IRAM: A Microprocessor for the Post-PC Era
Architecture & Organization 1
Performance of computer systems
A High Performance SoC: PkunityTM
Computer Evolution and Performance
IRAM: A Microprocessor for the Post-PC Era
COMS 361 Computer Organization
Chapter 4 Multiprocessors
Performance of computer systems
IRAM: A Microprocessor for the Post-PC Era
IRAM Vision Microprocessor & DRAM on a single chip:
Presentation transcript:

1 The Future of Microprocessors Embedded in Memory David A. Patterson EECS, University of California Berkeley, CA

2 Outline Desktop/Server Microprocessor State of the Art A New Technology: Embedded DRAM Mobile Multimedia Computing as New Direction A New Architecture for Mobile Multimedia Computing Berkeley’s Mobile Multimedia Microprocessor Historical Perspective Challenges & Potential Industrial Impact

3 Processor-DRAM Gap (latency) µProc 60%/yr. DRAM 7%/yr DRAM CPU 1982 Processor-Memory Performance Gap: (grows 50% / year) Performance Time “Moore’s Law”

4 Processor-Memory Performance Gap “Tax” Processor % Area %Transistors (≈cost)(≈power) Alpha %77% StrongArm SA11061%94% Pentium Pro64%88% –2 dies per package: Proc/I$/D$ + L2$ Caches have no inherent value, only try to close performance gap

5 Today’s Situation: Microprocessor MIPS MPUs R5000R k/5k Clock Rate200 MHz 195 MHz1.0x On-Chip Caches32K/32K 32K/32K 1.0x Instructions/Cycle 1(+ FP)4 4.0x Pipe stages x ModelIn-orderOut-of-order--- Die Size (mm 2 ) x –without cache, TLB x Development (man yr.) x SPECint_base x

6 Desktop/Server State of the Art Processor performance doubling / 18 months Microprocessor-DRAM performance gap & tax Cost fixed at ≈$500/chip, power whatever can cool –10X cost, 10X power => 2X performance? Desktop apps slow at rate processors speedup? –Word struggles to keep up with typing? Consolidation of industry? PowerPC PA-RISC MIPS Alpha IA-64 SPARC

7 Outline Desktop/Server Microprocessor State of the Art A New Technology: Embedded DRAM Mobile Multimedia Computing as New Direction A New Architecture for Mobile Multimedia Computing Berkeley’s Mobile Multimedia Microprocessor Historical Perspective Challenges & Potential Industrial Impact

8 A More Revolutionary Approach: Processor Embedded in DRAM Faster logic in DRAM process –DRAM vendors offer faster transistors + same number metal layers as good logic ≈ 20% higher cost per wafer? –As die cost ≈ f(die area 4 )  4% die shrink  equal cost Called Intelligent RAM (“IRAM”) since most of transistors will be DRAM

9 IRAM Vision Statement Microprocessor & DRAM on a single chip: –on-chip memory latency 5-10X, bandwidth X –improve energy efficiency 2X-4X (no off-chip bus) –serial I/O 5-10X v. buses –smaller board area/volume –adjustable memory size/width DRAMDRAM fabfab Proc Bus DRAM $$ Proc L2$ LogicLogic fabfab Bus DRAM I/O Bus

10 Outline Desktop/Server Microprocessor State of the Art A New Technology: Embedded DRAM Mobile Multimedia Computing as New Direction A New Architecture for Mobile Multimedia Computing Berkeley’s Mobile Multimedia Microprocessor Historical Perspective Challenges & Potential Industrial Impact

11 Intelligent PDA ( 2003?) Pilot PDA (todo,calendar, calculator, addresses,...) + Gameboy (Tetris,...) + Nikon Coolpix (camera) + Cell Phone, Pager, GPS, tape recorder, TV remote, am/fm radio, garage door opener,... + Wireless data (WWW) + Speech, vision recog. – Speech control of all devices – Vision to see surroundings, scan documents, read bar codes, measure room + Voice output for conversations

12 New Architecture Directions “…media processing will become the dominant force in computer arch. & microprocessor design.” “... new media-rich applications... involve significant real-time processing of continuous media streams, and make heavy use of vectors of packed 8-, 16-, and 32-bit integer and Fl. Pt.” Needs include high memory BW, high network BW, continuous media data types, real-time response, fine grain parallelism –“How Multimedia Workloads Will Change Processor Design”, Diefendorff & Dubey, IEEE Computer (9/97)

13 Outline Desktop/Server Microprocessor State of the Art A New Technology: Embedded DRAM Mobile Multimedia Computing as New Direction A New Architecture for Mobile Multimedia Computing Berkeley’s Mobile Multimedia Microprocessor Historical Perspective Challenges & Potential Industrial Impact

14 Potential IRAM Architecture “New” model: VSIW=Very Short Instruction Word! –Compact: Describe N operations with 1 short instruct. –Predictable (real-time) perf. vs. statistical perf. (cache) –Multimedia ready: choose N*64b, 2N*32b, 4N*16b –Easy to get high performance; N operations: »are independent »use same functional unit »access disjoint registers »access registers in same order as previous instructions »access contiguous memory words or known pattern »hides memory latency (and any other latency) –Compiler technology already developed, for sale!

15 Operation & Instruction Count: RISC v. “VSIW” Processor (from F. Quintana, U. Barcelona.) Spec92fp Operations (M) Instructions (M) Program RISC VSIW R / V RISC VSIW R / V swim x x hydro2d x x nasa x x su2cor x x tomcatv x x wave x x mdljdp x x VSIW reduces ops by 1.2X, instructions by 20X!

16 Revive Vector (= VSIW) Architecture! Cost: ≈ $1M each? Low latency, high BW memory system? Code density? Compilers? Vector Performance? Power/Energy? Scalar performance? Real-time? Limited to scientific applications? Single-chip CMOS MPU/IRAM IRAM = low latency, high bandwidth memory Much smaller than VLIW/EPIC For sale, mature (>20 years) Easy scale speed with technology Parallel to save energy, keep perf Include modern, modest CPU  OK scalar (MIPS 5K v. 10k) No caches, no speculation  repeatable speed as vary input Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b

17 Software Technology Trends Affecting New Direction? any CPU + vector coprocessor/memory –scalar/vector interactions are limited, simple –Example architecture based on ARM 9, MIPS IV Vectorizing compilers built for 25 years –can buy one for new machine from The Portland Group Can change HW without changing programs –More arithmetic units, registers OK Lowers software effort –Media Processors/DSP, > 50% staff are programmers

18 Software Technology Trends Affecting New Direction? IRAM has limited memory => Desktop OS wrong Real-Time Operating Systems –“Microkernel” - exclude modules not used by application –Small footprint even with all features ( 0.5 to 2 MB) –Examples: WRS VxWorks, QNX, Windows CE Applications not distributed in binary –WWW software distribution (Java, Javascript, TCL) –Open Source movement (Linix, Netscape, PERL)

19 Outline Desktop/Server Microprocessor State of the Art A New Technology: Embedded DRAM Mobile Multimedia Computing as New Direction A New Architecture for Mobile Multimedia Computing Berkeley’s Mobile Multimedia Microprocessor Historical Perspective Challenges & Potential Industrial Impact

20 V-IRAM1: 0.18 µm, Fast Logic, 200 MHz 1.6 GFLOPS(64b)/6.4 GOPS(16b)/32MB Memory Crossbar Switch M M … M M M … M M M … M M M … M M M … M M M … M … M M … M M M … M M M … M M M … M + Vector Registers x ÷ Load/Store 8K I cache 8K D cache 2-way Superscalar Vector Processor 4 x 64 or 8 x 32 or 16 x 16 4 x 64 Queue Instruction I/O Serial I/O

21 Ring- based Switch C P U +$ Tentative VIRAM-1 Floorplan I/O 0.18 µm DRAM 32 MB in 16 banks x 256b, 128 subbanks 0.18 µm, 5 Metal Logic ≈ 200 MHz MIPS, 16K I$, 16K D$ ≈ MHz FP/int. vector units die: ≈ 16x16 mm xtors: ≈ 270M power: ≈2 Watts 4 Vector Pipes/Lanes Memory (128 Mbits / 16 MBytes)

22 C P U +$ Tentative VIRAM-”0.25” Floorplan Demonstrate scalability via 2nd layout (automatic from 1st) 8 MB in 4 banks x 256b, 32 subbanks ≈ 200 MHz CPU, 16K I$, 16K D$ 1 ≈ 200 MHz FP/int. vector units die: ≈ 5 x 16 mm xtors: ≈ 70M power: ≈0.5 Watts 1 VU Memory (32 Mb / 4 MB) Memory (32 Mb / 4 MB)

23 V-IRAM-1 Tentative Plan Phase I: Feasibility stage (≈H1’98) –Test chip, CAD agreement, architecture defined Phase 2: Design & Layout Stage (≈H2’98) –Test chip, Simulated design and layout Phase 3: Verification (≈H2’99) –Tape-out Phase 4: Fabrication,Testing, and Demonstration (≈H1’00) –Functional integrated circuit First microprocessor ≥ B transistors?

24 Outline Desktop/Server Microprocessor State of the Art A New Technology: Embedded DRAM Mobile Multimedia Computing as New Direction A New Architecture for Mobile Multimedia Computing Berkeley’s Mobile Multimedia Microprocessor Historical Perspective Challenges & Potential Industrial Impact

25 SIMD on chip (DRAM) Uniprocessor (SRAM) MIMD on chip (DRAM) Uniprocessor (DRAM) MIMD component (SRAM ) Mbits of Memory Computational RAM PIP-RAM Mitsubishi M32R/D Execube Pentium Pro Alpha Transputer T IRAM UNI ? PPRAM Bits of Arithmetic Unit Terasys IRAM not a new idea Stone, ‘70 “Logic-in memory” Barron, ‘78 “Transputer” Kogge, ‘94 “Execube” Shimizu, ‘96 “M32R/D” Murakami, ‘97 “PPRAM”

26 Why IRAM now? Lower risk than before DRAM manufacturers now willing to listen –Before not interested, so early IRAM = SRAM Faster Logic + DRAM available soon? Past efforts memory limited  multiple chips  1st solve the unsolved (parallel processing) –Gigabit DRAM  ≈100 MB; OK for many apps? Systems headed to 2 chips: CPU + memory Embedded apps leverage energy efficiency, adjustable mem. capacity, smaller board area  “Post-PC Era”: PDA >> desktop

27 IRAM Challenges Chip –Good performance and reasonable power? –Cost for Embedded DRAM process? »Cost of 16 Mbit DRAM $1.78 in June 1998: How compete? –Speed in Embedded DRAM? (time behind logic fab?) –Testing time of IRAM vs DRAM vs microprocessor? –BW/Latency oriented DRAM tradeoffs? Architecture –How to turn high memory bandwidth into performance for real applications? –Extensible IRAM: Large program/data solution?

28 Commercial IRAM highway is governed by memory per IRAM? Graphics Acc. Super PDA/Phone Video Games Network Computer Laptop 8 MB2 MB32 MB

29 Words to Remember “...a strategic inflection point is a time in the life of a business when its fundamentals are about to change.... Let's not mince words: A strategic inflection point can be deadly when unattended to. Companies that begin a decline as a result of its changes rarely recover their previous greatness.” –Only the Paranoid Survive, Andrew S. Grove, 1996

30 Apps/metrics of future to design computer of future Mobile Multimedia >> “Stationary” Computers? IRAM potential in mem/IO BW, energy, board area; challenges in cost, power/performance, testing 10X-100X improvements based on technology shipping for 20 years (not JJ, photons, MEMS,...) Potential: shift semiconductor balance of power? Who ships the most memory? Most microprocessors? Conclusion

31 Interested in Participating? Looking for ideas of IRAM enabled apps, partners to transfer technology Contact us if you’re interested: Thanks for advice/support: DARPA, California MICRO, Hitachi, Intel, LG Semicon, Microsoft, Neomagic, Sandcraft, SGI/Cray, Sun Microsystems, TI, TSMC

32 Backup Slides (The following slides are used to help answer questions)

33 VIRAM-1 Specs/Goals Technology micron, 5-6 metal layers, fast xtor Memory16-32 MB Die size≈ mm 2 Vector pipes/lanes4 64-bit (or 8 32-bit or bit) Serial I/O4 1 Gbit/s Power university ≈ volt logic Clock university 200scalar/200vector MHz Perf university 1.6 GFLOPS 64 – 6.4 GOPS 16 Power industry ≈ volt logic Clock industry 400scalar/400vector MHz Perf industry 3.2 GFLOPS 64 – 12.8 GOPS 16 2X

34 Mediaprocesing Functions (Dubey) KernelVector length Matrix transpose/multiply# vertices at once DCT (video, comm.)image width FFT (audio) Motion estimation (video)image width, i.w./16 Gamma correction (video)image width Haar transform (media mining)image width Median filter (image process.)image width Separable convolution (““)image width (from

35 A More Revolutionary Approach: New Architecture Directions “...wires are not keeping pace with scaling of other features. … In fact, for CMOS processes below 0.25 micron... an unacceptably small percentage of the die will be reachable during a single clock cycle.” “Architectures that require long-distance, rapid interaction will not scale well...” –“Will Physical Scalability Sabotage Performance Gains?” Matzke, IEEE Computer (9/97)

36 “Vanilla” IRAM - Performance Conclusions IRAM systems with existing architectures provide moderate performance benefits High bandwidth / low latency used to speed up memory accesses, not computation Reason: existing architectures developed under assumption of low bandwidth memory system –Need something better than “build a bigger cache” –Important to investigate alternative architectures that better utilize high bandwidth and low latency of IRAM