Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 The Future of Microprocessors Embedded in Memory David A. Patterson EECS, University.

Similar presentations


Presentation on theme: "1 The Future of Microprocessors Embedded in Memory David A. Patterson EECS, University."— Presentation transcript:

1 1 The Future of Microprocessors Embedded in Memory David A. Patterson patterson@cs.berkeley.edu http://cs.berkeley.edu/~patterson/talks EECS, University of California Berkeley, CA 94720-1776

2 2 Outline Desktop/Server Microprocessor State of the Art A New Technology: Embedded DRAM Mobile Multimedia Computing as New Direction A New Architecture for Mobile Multimedia Computing Berkeley’s Mobile Multimedia Microprocessor Historical Perspective Challenges & Potential Industrial Impact

3 3 Processor-DRAM Gap (latency) µProc 60%/yr. DRAM 7%/yr. 1 10 100 1000 19801981198319841985198619871988198919901991199219931994199519961997199819992000 DRAM CPU 1982 Processor-Memory Performance Gap: (grows 50% / year) Performance Time “Moore’s Law”

4 4 Processor-Memory Performance Gap “Tax” Processor % Area %Transistors (≈cost)(≈power) Alpha 2116437%77% StrongArm SA11061%94% Pentium Pro64%88% –2 dies per package: Proc/I$/D$ + L2$ Caches have no inherent value, only try to close performance gap

5 5 Today’s Situation: Microprocessor MIPS MPUs R5000R1000010k/5k Clock Rate200 MHz 195 MHz1.0x On-Chip Caches32K/32K 32K/32K 1.0x Instructions/Cycle 1(+ FP)4 4.0x Pipe stages55-71.2x ModelIn-orderOut-of-order--- Die Size (mm 2 ) 84 2983.5x –without cache, TLB32205 6.3x Development (man yr.)603005.0x SPECint_base955.78.81.6x

6 6 Desktop/Server State of the Art Processor performance doubling / 18 months Microprocessor-DRAM performance gap & tax Cost fixed at ≈$500/chip, power whatever can cool –10X cost, 10X power => 2X performance? Desktop apps slow at rate processors speedup? –Word struggles to keep up with typing? Consolidation of industry? PowerPC PA-RISC MIPS Alpha IA-64 SPARC

7 7 Outline Desktop/Server Microprocessor State of the Art A New Technology: Embedded DRAM Mobile Multimedia Computing as New Direction A New Architecture for Mobile Multimedia Computing Berkeley’s Mobile Multimedia Microprocessor Historical Perspective Challenges & Potential Industrial Impact

8 8 A More Revolutionary Approach: Processor Embedded in DRAM Faster logic in DRAM process –DRAM vendors offer faster transistors + same number metal layers as good logic process? @ ≈ 20% higher cost per wafer? –As die cost ≈ f(die area 4 )  4% die shrink  equal cost Called Intelligent RAM (“IRAM”) since most of transistors will be DRAM

9 9 IRAM Vision Statement Microprocessor & DRAM on a single chip: –on-chip memory latency 5-10X, bandwidth 50-100X –improve energy efficiency 2X-4X (no off-chip bus) –serial I/O 5-10X v. buses –smaller board area/volume –adjustable memory size/width DRAMDRAM fabfab Proc Bus DRAM $$ Proc L2$ LogicLogic fabfab Bus DRAM I/O Bus

10 10 Outline Desktop/Server Microprocessor State of the Art A New Technology: Embedded DRAM Mobile Multimedia Computing as New Direction A New Architecture for Mobile Multimedia Computing Berkeley’s Mobile Multimedia Microprocessor Historical Perspective Challenges & Potential Industrial Impact

11 11 Intelligent PDA ( 2003?) Pilot PDA (todo,calendar, calculator, addresses,...) + Gameboy (Tetris,...) + Nikon Coolpix (camera) + Cell Phone, Pager, GPS, tape recorder, TV remote, am/fm radio, garage door opener,... + Wireless data (WWW) + Speech, vision recog. – Speech control of all devices – Vision to see surroundings, scan documents, read bar codes, measure room + Voice output for conversations

12 12 New Architecture Directions “…media processing will become the dominant force in computer arch. & microprocessor design.” “... new media-rich applications... involve significant real-time processing of continuous media streams, and make heavy use of vectors of packed 8-, 16-, and 32-bit integer and Fl. Pt.” Needs include high memory BW, high network BW, continuous media data types, real-time response, fine grain parallelism –“How Multimedia Workloads Will Change Processor Design”, Diefendorff & Dubey, IEEE Computer (9/97)

13 13 Outline Desktop/Server Microprocessor State of the Art A New Technology: Embedded DRAM Mobile Multimedia Computing as New Direction A New Architecture for Mobile Multimedia Computing Berkeley’s Mobile Multimedia Microprocessor Historical Perspective Challenges & Potential Industrial Impact

14 14 Potential IRAM Architecture “New” model: VSIW=Very Short Instruction Word! –Compact: Describe N operations with 1 short instruct. –Predictable (real-time) perf. vs. statistical perf. (cache) –Multimedia ready: choose N*64b, 2N*32b, 4N*16b –Easy to get high performance; N operations: »are independent »use same functional unit »access disjoint registers »access registers in same order as previous instructions »access contiguous memory words or known pattern »hides memory latency (and any other latency) –Compiler technology already developed, for sale!

15 15 Operation & Instruction Count: RISC v. “VSIW” Processor (from F. Quintana, U. Barcelona.) Spec92fp Operations (M) Instructions (M) Program RISC VSIW R / V RISC VSIW R / V swim25611595 1.1x1150.8142x hydro2d58401.4x 580.8 71x nasa769411.7x 692.2 31x su2cor51351.4x 511.8 29x tomcatv15101.4x 151.3 11x wave527251.1x 277.2 4x mdljdp232520.6x 3215.8 2x VSIW reduces ops by 1.2X, instructions by 20X!

16 16 Revive Vector (= VSIW) Architecture! Cost: ≈ $1M each? Low latency, high BW memory system? Code density? Compilers? Vector Performance? Power/Energy? Scalar performance? Real-time? Limited to scientific applications? Single-chip CMOS MPU/IRAM IRAM = low latency, high bandwidth memory Much smaller than VLIW/EPIC For sale, mature (>20 years) Easy scale speed with technology Parallel to save energy, keep perf Include modern, modest CPU  OK scalar (MIPS 5K v. 10k) No caches, no speculation  repeatable speed as vary input Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b

17 17 Software Technology Trends Affecting New Direction? any CPU + vector coprocessor/memory –scalar/vector interactions are limited, simple –Example architecture based on ARM 9, MIPS IV Vectorizing compilers built for 25 years –can buy one for new machine from The Portland Group Can change HW without changing programs –More arithmetic units, registers OK Lowers software effort –Media Processors/DSP, > 50% staff are programmers

18 18 Software Technology Trends Affecting New Direction? IRAM has limited memory => Desktop OS wrong Real-Time Operating Systems –“Microkernel” - exclude modules not used by application –Small footprint even with all features ( 0.5 to 2 MB) –Examples: WRS VxWorks, QNX, Windows CE Applications not distributed in binary –WWW software distribution (Java, Javascript, TCL) –Open Source movement (Linix, Netscape, PERL)

19 19 Outline Desktop/Server Microprocessor State of the Art A New Technology: Embedded DRAM Mobile Multimedia Computing as New Direction A New Architecture for Mobile Multimedia Computing Berkeley’s Mobile Multimedia Microprocessor Historical Perspective Challenges & Potential Industrial Impact

20 20 V-IRAM1: 0.18 µm, Fast Logic, 200 MHz 1.6 GFLOPS(64b)/6.4 GOPS(16b)/32MB Memory Crossbar Switch M M … M M M … M M M … M M M … M M M … M M M … M … M M … M M M … M M M … M M M … M + Vector Registers x ÷ Load/Store 8K I cache 8K D cache 2-way Superscalar Vector Processor 4 x 64 or 8 x 32 or 16 x 16 4 x 64 Queue Instruction I/O Serial I/O

21 21 Ring- based Switch C P U +$ Tentative VIRAM-1 Floorplan I/O 0.18 µm DRAM 32 MB in 16 banks x 256b, 128 subbanks 0.18 µm, 5 Metal Logic ≈ 200 MHz MIPS, 16K I$, 16K D$ ≈ 4 200 MHz FP/int. vector units die: ≈ 16x16 mm xtors: ≈ 270M power: ≈2 Watts 4 Vector Pipes/Lanes Memory (128 Mbits / 16 MBytes)

22 22 C P U +$ Tentative VIRAM-”0.25” Floorplan Demonstrate scalability via 2nd layout (automatic from 1st) 8 MB in 4 banks x 256b, 32 subbanks ≈ 200 MHz CPU, 16K I$, 16K D$ 1 ≈ 200 MHz FP/int. vector units die: ≈ 5 x 16 mm xtors: ≈ 70M power: ≈0.5 Watts 1 VU Memory (32 Mb / 4 MB) Memory (32 Mb / 4 MB)

23 23 V-IRAM-1 Tentative Plan Phase I: Feasibility stage (≈H1’98) –Test chip, CAD agreement, architecture defined Phase 2: Design & Layout Stage (≈H2’98) –Test chip, Simulated design and layout Phase 3: Verification (≈H2’99) –Tape-out Phase 4: Fabrication,Testing, and Demonstration (≈H1’00) –Functional integrated circuit First microprocessor ≥ 0.15-.25B transistors?

24 24 Outline Desktop/Server Microprocessor State of the Art A New Technology: Embedded DRAM Mobile Multimedia Computing as New Direction A New Architecture for Mobile Multimedia Computing Berkeley’s Mobile Multimedia Microprocessor Historical Perspective Challenges & Potential Industrial Impact

25 25 SIMD on chip (DRAM) Uniprocessor (SRAM) MIMD on chip (DRAM) Uniprocessor (DRAM) MIMD component (SRAM ) 10 100 1000 10000 0.1 1 10 100 Mbits of Memory Computational RAM PIP-RAM Mitsubishi M32R/D Execube Pentium Pro Alpha 21164 Transputer T9 1000 IRAM UNI ? PPRAM Bits of Arithmetic Unit Terasys IRAM not a new idea Stone, ‘70 “Logic-in memory” Barron, ‘78 “Transputer” Kogge, ‘94 “Execube” Shimizu, ‘96 “M32R/D” Murakami, ‘97 “PPRAM”

26 26 Why IRAM now? Lower risk than before DRAM manufacturers now willing to listen –Before not interested, so early IRAM = SRAM Faster Logic + DRAM available soon? Past efforts memory limited  multiple chips  1st solve the unsolved (parallel processing) –Gigabit DRAM  ≈100 MB; OK for many apps? Systems headed to 2 chips: CPU + memory Embedded apps leverage energy efficiency, adjustable mem. capacity, smaller board area  “Post-PC Era”: PDA >> desktop

27 27 IRAM Challenges Chip –Good performance and reasonable power? –Cost for Embedded DRAM process? »Cost of 16 Mbit DRAM $1.78 in June 1998: How compete? –Speed in Embedded DRAM? (time behind logic fab?) –Testing time of IRAM vs DRAM vs microprocessor? –BW/Latency oriented DRAM tradeoffs? Architecture –How to turn high memory bandwidth into performance for real applications? –Extensible IRAM: Large program/data solution?

28 28 Commercial IRAM highway is governed by memory per IRAM? Graphics Acc. Super PDA/Phone Video Games Network Computer Laptop 8 MB2 MB32 MB

29 29 Words to Remember “...a strategic inflection point is a time in the life of a business when its fundamentals are about to change.... Let's not mince words: A strategic inflection point can be deadly when unattended to. Companies that begin a decline as a result of its changes rarely recover their previous greatness.” –Only the Paranoid Survive, Andrew S. Grove, 1996

30 30 Apps/metrics of future to design computer of future Mobile Multimedia >> “Stationary” Computers? IRAM potential in mem/IO BW, energy, board area; challenges in cost, power/performance, testing 10X-100X improvements based on technology shipping for 20 years (not JJ, photons, MEMS,...) Potential: shift semiconductor balance of power? Who ships the most memory? Most microprocessors? Conclusion

31 31 Interested in Participating? Looking for ideas of IRAM enabled apps, partners to transfer technology Contact us if you’re interested: http://iram.cs.berkeley.edu/ email: patterson@cs.berkeley.edu Thanks for advice/support: DARPA, California MICRO, Hitachi, Intel, LG Semicon, Microsoft, Neomagic, Sandcraft, SGI/Cray, Sun Microsystems, TI, TSMC

32 32 Backup Slides (The following slides are used to help answer questions)

33 33 VIRAM-1 Specs/Goals Technology0.18-0.20 micron, 5-6 metal layers, fast xtor Memory16-32 MB Die size≈ 200-300 mm 2 Vector pipes/lanes4 64-bit (or 8 32-bit or 16 16-bit) Serial I/O4 lines @ 1 Gbit/s Power university ≈2 w @ 1-1.5 volt logic Clock university 200scalar/200vector MHz Perf university 1.6 GFLOPS 64 – 6.4 GOPS 16 Power industry ≈1 w @ 1-1.5 volt logic Clock industry 400scalar/400vector MHz Perf industry 3.2 GFLOPS 64 – 12.8 GOPS 16 2X

34 34 Mediaprocesing Functions (Dubey) KernelVector length Matrix transpose/multiply# vertices at once DCT (video, comm.)image width FFT (audio)256-1024 Motion estimation (video)image width, i.w./16 Gamma correction (video)image width Haar transform (media mining)image width Median filter (image process.)image width Separable convolution (““)image width (from http://www.research.ibm.com/people/p/pradeep/tutor.html)

35 35 A More Revolutionary Approach: New Architecture Directions “...wires are not keeping pace with scaling of other features. … In fact, for CMOS processes below 0.25 micron... an unacceptably small percentage of the die will be reachable during a single clock cycle.” “Architectures that require long-distance, rapid interaction will not scale well...” –“Will Physical Scalability Sabotage Performance Gains?” Matzke, IEEE Computer (9/97)

36 36 “Vanilla” IRAM - Performance Conclusions IRAM systems with existing architectures provide moderate performance benefits High bandwidth / low latency used to speed up memory accesses, not computation Reason: existing architectures developed under assumption of low bandwidth memory system –Need something better than “build a bigger cache” –Important to investigate alternative architectures that better utilize high bandwidth and low latency of IRAM


Download ppt "1 The Future of Microprocessors Embedded in Memory David A. Patterson EECS, University."

Similar presentations


Ads by Google