1 How to Hurt Scientific Productivity David A. Patterson Pardee Professor of Computer Science, U.C. Berkeley President, Association for Computing Machinery.

Slides:



Advertisements
Similar presentations
Slides Prepared from the CI-Tutor Courses at NCSA By S. Masoud Sadjadi School of Computing and Information Sciences Florida.
Advertisements

Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
Lecture 6: Multicore Systems
Why GPUs? Robert Strzodka. 2Overview Computation / Bandwidth / Power CPU – GPU Comparison GPU Characteristics.
Chapter1 Fundamental of Computer Design Dr. Bernard Chen Ph.D. University of Central Arkansas.
1 Burroughs B5500 multiprocessor. These machines were designed to support HLLs, such as Algol. They used a stack architecture, but part of the stack was.
1 Jan 07 RAMP PI Report: Plans until next Retreat & Beyond Krste Asanovíc (MIT), Derek Chiou (Texas), James Hoe(CMU), Christos Kozyrakis (Stanford), Shih-Lien.
1 RAMP White RAMP Retreat, BWRC, Berkeley, CA 20 January 2006 RAMP collaborators: Arvind (MIT), Krste Asanovíc (MIT), Derek Chiou (Texas), James Hoe (CMU),
1 Research Accelerator for MultiProcessing Dave Patterson, UC Berkeley January RAMP collaborators: Arvind (MIT), Krste Asanovíc (MIT), Derek Chiou.
Future of Computer Architecture
Behavioral Design Outline –Design Specification –Behavioral Design –Behavioral Specification –Hardware Description Languages –Behavioral Simulation –Behavioral.
UC Berkeley 1 Time dilation in RAMP Zhangxi Tan and David Patterson Computer Science Division UC Berkeley.
Introduction What is Parallel Algorithms? Why Parallel Algorithms? Evolution and Convergence of Parallel Algorithms Fundamental Design Issues.
Analysis and Performance Results of a Molecular Modeling Application on Merrimac Erez, et al. Stanford University 2004 Presented By: Daniel Killebrew.
Research Accelerator for Multiple Processors
1 A Community Vision for a Shared Experimental Parallel HW/SW Platform Dave Patterson, Pardee Professor of Comp. Science, UC Berkeley President, Association.
1 Introduction to Research Accelerator for Multiple Processors David Patterson (Berkeley, CO-PI), Arvind (MIT), Krste Asanovíc (Berkeley/MIT), Derek Chiou.
GSRC Annual Symposium Sep 29-30, 2008 Full-System Chip Multiprocessor Power Evaluations Using FPGA-Based Emulation Abhishek Bhattacharjee, Gilberto Contreras,
Parallel Processing Architectures Laxmi Narayan Bhuyan
PSU CS 106 Computing Fundamentals II Introduction HM 1/3/2009.
CPE 731 Advanced Computer Architecture Multiprocessor Introduction
1 RAMP Infrastructure Krste Asanovic UC Berkeley RAMP Tutorial, ISCA/FCRC, San Diego June 10, 2007.
CMSC 611: Advanced Computer Architecture Parallel Computation Most slides adapted from David Patterson. Some from Mohomed Younis.
1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 1 Fundamentals of Quantitative Design and Analysis Computer Architecture A Quantitative.
Computer performance.
Chapter1 Fundamental of Computer Design Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2010.
1 Layers of Computer Science, ISA and uArch Alexander Titov 20 September 2014.
Introduction CSE 410, Spring 2008 Computer Systems
1b.1 Types of Parallel Computers Two principal approaches: Shared memory multiprocessor Distributed memory multicomputer ITCS 4/5145 Parallel Programming,
Topics Speedup Amdahl’s law Execution timeReadings January 10, 2012 CSCE 713 Computer Architecture.
Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.
RAMPing Down Chuck Thacker Microsoft Research August 2010.
Sogang University Advanced Computing System Chap 1. Computer Architecture Hyuk-Jun Lee, PhD Dept. of Computer Science and Engineering Sogang University.
Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.
C OMPUTER O RGANIZATION AND D ESIGN The Hardware/Software Interface 5 th Edition Chapter 1 Computer Abstractions and Technology Sections 1.5 – 1.11.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture.
10/27: Lecture Topics Survey results Current Architectural Trends Operating Systems Intro –What is an OS? –Issues in operating systems.
Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.
EE3A1 Computer Hardware and Digital Design
FPGA-based Fast, Cycle-Accurate Full System Simulators Derek Chiou, Huzefa Sanjeliwala, Dam Sunwoo, John Xu and Nikhil Patil University of Texas at Austin.
1 Retreat (Advance) John Wawrzynek UC Berkeley January 15, 2009.
3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.
1 Adapted from UC Berkeley CS252 S01 Lecture 17: Reducing Cache Miss Penalty and Reducing Cache Hit Time Hardware prefetching and stream buffer, software.
CDA-5155 Computer Architecture Principles Fall 2000 Multiprocessor Architectures.
3/12/07CS Visit Days1 A Sea Change in Processor Design Uniprocessor SpecInt Performance: From Hennessy and Patterson, Computer Architecture: A Quantitative.
Lecture 1: Introduction CprE 585 Advanced Computer Architecture, Fall 2004 Zhao Zhang.
CS203 – Advanced Computer Architecture Performance Evaluation.
VU-Advanced Computer Architecture Lecture 1-Introduction 1 Advanced Computer Architecture CS 704 Advanced Computer Architecture Lecture 1.
Tools and Libraries for Manycore Computing Kathy Yelick U.C. Berkeley and LBNL.
SPRING 2012 Assembly Language. Definition 2 A microprocessor is a silicon chip which forms the core of a microcomputer the concept of what goes into a.
Computer Architecture & Operations I
William Stallings Computer Organization and Architecture 6th Edition
CS203 – Advanced Computer Architecture
Lynn Choi School of Electrical Engineering
Computer Architecture & Operations I
Chapter1 Fundamental of Computer Design
Parallel Computers Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
Andrew Putnam University of Washington RAMP Retreat January 17, 2008
1 Future of Computer Architecture David A. Patterson Pardee Professor of Computer Science, U.C. Berkeley President, Association for Computing Machinery.
Derek Chiou The University of Texas at Austin
CS775: Computer Architecture
David Patterson Electrical Engineering and Computer Sciences
Lecture 14: Reducing Cache Misses
Parallel Processing Architectures
A High Performance SoC: PkunityTM
Chapter 1 Introduction.
Chapter 4 Multiprocessors
The University of Adelaide, School of Computer Science
Utsunomiya University
Presentation transcript:

1 How to Hurt Scientific Productivity David A. Patterson Pardee Professor of Computer Science, U.C. Berkeley President, Association for Computing Machinery February, 2006

2 High Level Message Everything is changing; Old conventional wisdom is out We DESPERATELY need a new architectural solution for microprocessors based on parallelism  21 st Century target: systems that enhance scientific productivity Need to create a “watering hole” to bring everyone together to quickly find that solution  architects, language designers, application experts, numerical analysts, algorithm designers, programmers, …

3 Computer Architecture Hurt #1: Aim High (and Ignore Amdahl’s Law) Peak Performance Sells + Increases employment of computer scientists at companies trying to get larger fraction of peak Examples  Very deep pipeline / very high clock rate  Relaxed write consistency  Out-Of-Order message delivery

4 Computer Architecture Hurt #2: Promote Mystery (and Hide Thy Real Performance) Predictability suggests no sophistication + If its unsophisticated, how can it be expensive? Examples  Out-of-order execution processors  Memory/disk controllers with secret prefetch algorithms  N levels of on-chip caches, where N   (Year – 1975) / 10 

5 Computer Architecture Hurt #3: Be “Interesting” (and Have a Quirky Personality) Programmers enjoy a challenge + Job security since must rewrite application with each new generation Examples  Message-passing clusters composed of shared address multiprocessors  Pattern sensitive interconnection networks  Computing using Graphical Processor Units  TLBs exceptions if access all cache memory on chip

6 Computer Architecture Hurt #4: Accuracy & Reliability are for Wimps (Speed Kills Competition) Don’t waste resources on accuracy, reliability + Probably blame Microsoft anyways Examples  Cray et al 754 Floating Point Format, yet not compliant, so get different results from desktop  No ECC on Memory of Virginia Tech Apple G5 cluster  “Error Free” intercommunication networks make error checking in messages “unnecessary”  No ECC on L2 Cache of Sun UltraSPARC 2

7 Alternatives to Hurting Productivity Aim High (& Ignore Amdahl’s Law)?  No! Delivered productivity >> Peak performance Promote Mystery (& Hide Thy Real Performance)?  No! Promote a simple, understandable model of execution and performance Be “Interesting” (& Have a Quirky Personality)  No programming surprises! Accuracy & Reliability are for Wimps? (Speed Kills)  No! You’re not going fast if you’re headed in the wrong direction Computer designers neglected productivity in past  No excuse for 21st century computing to be based on untrustworthy, mysterious, I/O-starved, quirky HW where peak performance is king

8 Outline Part I: How to Hurt Scientific Productivity  via Computer Architecture Part II: A New Agenda for Computer Architecture  1 st Review Conventional Wisdom (New & Old) in Technology and Computer Architecture  21 st century kernels, New classifications of apps and architecture Part III: A “Watering Hole” for Parallel Systems Exploration  Research Accelerator for Multiple Processors

9 Old CW: Power is free, Transistors expensive New CW: “Power wall” Power expensive, Xtors free (Can put more on chip than can afford to turn on) Old: Multiplies are slow, Memory access is fast New: “Memory wall” Memory slow, multiplies fast (200 clocks to DRAM memory, 4 clocks for FP multiply) Old : Increasing Instruction Level Parallelism via compilers, innovation (Out-of-order, speculation, VLIW, …) New CW: “ILP wall” diminishing returns on more ILP New: Power Wall + Memory Wall + ILP Wall = Brick Wall  Old CW: Uniprocessor performance 2X / 1.5 yrs  New CW: Uniprocessor performance only 2X / 5 yrs? Conventional Wisdom (CW) in Computer Architecture

10 Uniprocessor Performance (SPECint) VAX : 25%/year 1978 to 1986 RISC + x86: 52%/year 1986 to 2002 RISC + x86: ??%/year 2002 to present From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006,  Sea change in chip design: multiple “cores” or processors per chip from IBM, Sun, AMD, Intel today 3X

11 Old CW: Since cannot know future programs, find set of old programs to evaluate designs of computers for the future  E.g., SPEC2006 What about parallel codes?  Few available, tied to old models, languages, architectures, … New approach: Design computers of future for numerical methods important in future Claim: key methods for next decade are 7 dwarves (+ a few), so design for them!  Representative codes may vary over time, but these numerical methods will be important for > 10 years 21 st Century Computer Architecture

12 High-end simulation in the physical sciences = 7 numerical methods : 1. Structured Grids (including locally structured grids, e.g. Adaptive Mesh Refinement) 2. Unstructured Grids 3. Fast Fourier Transform 4. Dense Linear Algebra 5. Sparse Linear Algebra 6. Particles 7. Monte Carlo Well-defined targets from algorithmic, software, and architecture standpoint Phillip Colella’s “Seven dwarfs” If add 4 for embedded, covers all 41 EEMBC benchmarks 8. Search/Sort 9. Filter 10. Combinational logic 11. Finite State Machine Note: Data sizes (8 bit to 32 bit) and types (integer, character) differ, but algorithms the same Slide from “Defining Software Requirements for Scientific Computing”, Phillip Colella, 2004

13 SPECfp  8 Structured grid 3 using Adaptive Mesh Refinement  2 Sparse linear algebra  2 Particle methods  5 TBD: Ray tracer, Speech Recognition, Quantum Chemistry, Lattice Quantum Chromodynamics (many kernels inside each benchmark?) SPECint  8 Finite State Machine  2 Sorting/Searching  2 Dense linear algebra (data type differs from dwarf)  1 TBD: 1 C compiler (many kernels?) 6/11 Dwarves Covers 24/30 SPEC2006

14 21 st Century Code Generation Old CW: Takes a decade for compilers to introduce an architecture innovation New approach: “Auto-tuners” 1st run variations of program on computer to find best combinations of optimizations (blocking, padding, …) and algorithms, then produce C code to be compiled for that computer  E.g., PHiPAC (Portable High Performance Ansi C ), Atlas (BLAS), Sparsity (Sparse linear algebra), Spiral (DSP), FFT-W  Can achieve large speedup over conventional compiler One Auto-tuner per dwarf?  Exist for Dense Linear Algebra, Sparse Linear Algebra, Spectral

15 Reference Best: 4x2 Mflop/s Sparse Matrix – Search for Blocking for finite element problem [Im, Yelick, Vuduc, 2005]

16 21 st Century Classification Old CW:  SISD vs. SIMD vs. MIMD 3 “new” measures of parallelism  Size of Operands  Style of Parallelism  Amount of Parallelism

17 Operand Size and Type Programmer should be able to specify data size, type independent of algorithm 1 bit (Boolean*) 8 bits (Integer, ASCII) 16 bits (Integer, DSP fixed pt, Unicode*) 32 bits (Integer, SP Fl. Pt., Unicode*) 64 bits (Integer, DP Fl. Pt.) 128 bits (Integer*, Quad Precision Fl. Pt.*) 1024 bits (Crypto*) * Not supported well in most programming languages and optimizing compilers

18 Style of Parallelism Explicitly Parallel Data Level Parallel (  SIMD) Thread Level Parallel (  MIMD) Streaming (time is one dimension) General DLP No Coupling TLP Barrier Synch. TLP Tight Coupling TLP Programmer wants code to run on as many parallel architectures as possible so (if possible) Architect wants to run as many different types of parallel programs as possible so Less HW Control, Simpler Prog. model More Flexible

19 Parallel Framework – Apps (so far) Original 7 dwarves: 6 data parallel, 1 no coupling TLP Bonus 4 dwarves: 2 data parallel, 2 no coupling TLP EEMBC (Embedded): Stream 10, DLP 19, Barrier TLP 2 SPEC (Desktop): 14 DLP, 2 no coupling TLP E EEMBCEEMBC E EEMBCEEMBC SPECSPEC SPECSPEC DwarfSDwarfS DWARFSDWARFS Streaming DLP DLP No coupling TLP Barrier TLP Tight TLP Most New Architectures, Languages Most Important Apps?

20 New Parallel Framework CryptoBoolean Given natural operand size and level of parallelism, how parallel is computer or how must parallelism available in application? Proposed Parallel Framework for Arch and Apps E EEMBCEEMBC E EEMBCEEMBC SPECSPEC SPECSPEC DWARFSDWARFS DWARFSDWARFS >

21 Parallel Framework - Architecture CryptoBoolean Examples of good architectural matches to each style MMX TCCTCC CM5CM5 CLUSTERCLUSTER Vec- tor IMAGINEIMAGINE >

22 Outline Part I: How to Hurt Scientific Productivity  via Computer Architecture Part II: A New Agenda for Computer Architecture  1 st Review Conventional Wisdom (New & Old) in Technology and Computer Architecture  21 st century kernels, New classifications of apps and architecture Part III: A “Watering Hole” for Parallel Systems Exploration  Research Accelerator for Multiple Processors Conclusion

23 1. Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, Libraries, … not ready for 1000 CPUs / chip 2. Only companies can build HW, and it takes years $M mask costs, $M for ECAD tools, GHz clock rates, >100M transistors 3. Software people don’t start working hard until hardware arrives 3 months after HW arrives, SW people list everything that must be fixed, then we all wait 4 years for next iteration of HW/SW 4. How get 1000 CPU systems in hands of researchers to innovate in timely fashion on in algorithms, compilers, languages, OS, architectures, … ? 5. Avoid waiting years between HW/SW iterations? Problems with Sea Change

24 Build Academic MPP from FPGAs As  25 CPUs will fit in Field Programmable Gate Array (FPGA), 1000-CPU system from  40 FPGAs? bit simple “soft core” RISC at 150MHz in 2004 (Virtex-II) FPGA generations every 1.5 yrs;  2X CPUs,  1.2X clock rate HW research community does logic design (“gate shareware”) to create out-of-the-box, MPP  E.g., 1000 processor, standard ISA binary-compatible, 64-bit, cache- coherent  100 MHz/CPU in 2007  RAMPants: Arvind (MIT), Krste Asanovíc (MIT), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley, Co-PI), Jan Rabaey (Berkeley), and John Wawrzynek (Berkeley, PI) “Research Accelerator for Multiple Processors”

25 Completed Dec (14x17 inch 22-layer PCB) Board: 5 Virtex II FPGAs, 18 banks DDR2-400 memory, 20 10GigE conn. RAMP 1 Hardware BEE2: Berkeley Emulation Engine 2 By John Wawrzynek and Bob Brodersen with students Chen Chang and Pierre Droz 1.5W / computer, 5 cu. in. /computer, $100 / computer 1000 CPUs :  1.5 KW,  ¼ rack,  $100,000 Box: 8 compute modules in 8U rack mount chassis

26 RAMP Milestones NameGoalTargetCPUsDetails Red (Stanf ord) Get Started 1H068 PowerPC 32b hard cores Transactional memory SMP Blue (Cal) Scale2H06  b soft (Microblaze) Cluster, MPI White (All) Full Features 1H07?128? soft 64b, Multiple commercial ISAs CC-NUMA, shared address, deterministic, debug/monitor 2.03 rd party sells it 2H07?4X CPUs of ‘04 FPGA New ’06 FPGA, new board

27 Can RAMP keep up? FGPA generations: 2X CPUs / 18 months  2X CPUs / 24 months for desktop microprocessors 1.1X to 1.3X performance / 18 months  1.2X? / year per CPU on desktop? However, goal for RAMP is accurate system emulation, not to be the real system  Goal is accurate target performance, parameterized reconfiguration, extensive monitoring, reproducibility, cheap (like a simulator) while being credible and fast enough to emulate 1000s of OS and apps in parallel (like hardware)

28 RAMP + Auto-tuners = Promised land? Auto-tuners in reaction to fixed, hard to understand hardware RAMP enables perpendicular exploration For each algorithm, how can the architecture be modified to achieve maximum performance given the resource limitations (e.g., bandwidth, cache-sizes,...) Auto-tuning searches can focus on comparing different algorithms for each dwarf rather than also spending time massaging computer quirks

29 Multiprocessing Watering Hole Killer app:  All CS Research, Advanced Development RAMP attracts many communities to shared artifact  Cross-disciplinary interactions  Ramp up innovation in multiprocessing RAMP as next Standard Research/AD Platform? (e.g., VAX/BSD Unix in 1980s, Linux/x86 in 1990s) Parallel file system Flight Data RecorderTransactional Memory Fault insertion to check dependability Data center in a box Internet in a box Dataflow language/computer Security enhancements Router designCompile to FPGA Parallel languages RAMP 128-bit Floating Point Libraries

30 Conclusion: [1 / 2] Alternatives to Hurting Productivity Delivered productivity >> Peak performance Promote a simple, understandable model of execution and performance No programming surprises! You’re not going fast if you’re going the wrong way Use Programs of Future to design Computers, Languages, … of the Future 7 + 5? Dwarves, Auto-Tuners, RAMP Although architect’s, language designers focusing toward right, most dwarves are toward left Streaming DLP DLP No coupling TLP Barrier TLP Tight TLP

31 Research Accelerator for Multiple Processors Carpe Diem: Researchers need it ASAP  FPGAs ready, and getting better  Stand on shoulders vs. toes: standardize on Berkeley FPGA platforms (BEE, BEE2) by Wawrzynek et al  Architects aid colleagues via gateware RAMP accelerates HW/SW generations  System emulation + good accounting vs. FPGA computer  Emulate, Trace, Reproduce anything; Tape out every day “Multiprocessor Research Watering Hole” ramp up research in multiprocessing via common research platform  innovate across fields  hasten sea change from sequential to parallel computing Conclusions [2 / 2]

32 Acknowledgments Material comes from discussions on new directions for architecture with:  Professors Krste Asanovíc (MIT), Raz Bodik, Jim Demmel, Kurt Kuetzer, John Wawrzynek, and Kathy Yelick  LBNL discussants Parry Husbands, Bill Kramer, Lenny Oliker, and John Shalf  UCB Grad students Joe Gebis and Sam Williams RAMP based on work of RAMP Developers:  Arvind (MIT), Krste Asanovíc (MIT), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley, Co-PI), Jan Rabaey (Berkeley), and John Wawrzynek (Berkeley, PI) See ramp.eecs.berkeley.edu

33 Backup Slides

34 Summary of Dwarves (so far) Original 7: 6 data parallel, 1 no coupling TLP Bonus 4: 2 data parallel, 2 no coupling TLP  To Be Done: FSM EEMBC (Embedded): Stream 10, DLP 19  Barrier (2), 11 more to characterize SPEC (Desktop): 14 DLP, 2 no coupling TLP  6 dwarves cover 24/30; To Be Done: 8 FSM, 6 Big SPEC Although architect’s focusing toward right, most dwarves are toward left Streaming DLP DLP No coupling TLP Barrier TLP Tight TLP

35 Supporters (wrote letters to NSF) Gordon Bell (Microsoft) Ivo Bolsens (Xilinx CTO) Norm Jouppi (HP Labs) Bill Kramer (NERSC/LBL) Craig Mundie (MS CTO) G. Papadopoulos (Sun CTO) Justin Rattner (Intel CTO) Ivan Sutherland (Sun Fellow) Chuck Thacker (Microsoft) Kees Vissers (Xilinx) Doug Burger (Texas) Bill Dally (Stanford) Carl Ebeling (Washington) Susan Eggers (Washington) Steve Keckler (Texas) Greg Morrisett (Harvard) Scott Shenker (Berkeley) Ion Stoica (Berkeley) Kathy Yelick (Berkeley) RAMP Participants: Arvind (MIT), Krste Asanovíc (MIT), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley, Co-PI), Jan Rabaey (Berkeley), and John Wawrzynek (Berkeley, PI)

36 RAMP FAQ Q: What about power, cost, space in RAMP? A:  1.5 watts per computer  $100-$200 per computer  5 cubic inches per computer  1000 computers for $100k to $200k, 1.5 KW, 1/3 rack Using very slow clock rate, very simple CPUs, and very large FPGAs

37 RAMP FAQ Q: How will FPGA clock rate improve? A1: 1.1X to 1.3X / 18 months  Note that clock rate now going up slowly on desktop A2: Goal for RAMP is system emulation, not to be the real system  Hence, value accurate accounting of target clock cycles, parameterized design (Memory BW, network BW, …), monitor, debug over performance  Goal is just fast enough to emulate OS, app in parallel

38 RAMP FAQ Q: What about power, cost, space in RAMP? A:  1.5 watts per computer  $100-$200 per computer  5 cubic inches per computer Using very slow clock rate, very simple CPUs in a very large FPGA (RAMP blue)

39 RAMP FAQ Q: How can many researchers get RAMPs? A1: RAMP 2.0 to be available for purchase at low margin from 3 rd party vendor A2: Single board RAMP 2.0 still interesting as FPGA 2X CPUs/18 months  RAMP 2.0 FPGA two generations later than RAMP 1.0, so 256? simple CPUs per board vs. 64?

40 Parallel FAQ Q: Won’t the circuit or processing guys solve CPU performance problem for us? A1: No. More transistors, but can’t help with ILP wall, and power wall is close to fundamental problem  Memory wall could be lowered some, but hasn’t happened yet commercially A2: One time jump. IBM using “strained silicon” on Silicon On Insulator to increase electron mobility (Intel doesn’t have SOI)  clock rate  or leakage power   Continue making rapid semiconductor investment?

41 Parallel FAQ Q: How afford 2 processors if power is the problem? A: Simpler core, lower voltage and frequency  Power  Capacitance x Volt 2 x Frequency :  0.5  Also, single complex CPU inefficient in transistors, power

42 RAMP Development Plan 1. Distribute systems internally for RAMP 1 development  Xilinx agreed to pay for production of a set of modules for initial contributing developers and first full RAMP system  Others could be available if can recover costs 2. Release publicly available out-of-the-box MPP emulator  Based on standard ISA (IBM Power, Sun SPARC, …) for binary compatibility  Complete OS/libraries  Locally modify RAMP as desired 3. Design next generation platform for RAMP 2  Base on 65nm FPGAs (2 generations later than Virtex-II)  Pending results from RAMP 1, Xilinx will cover hardware costs for initial set of RAMP 2 machines  Find 3rd party to build and distribute systems (at near-cost), open source RAMP gateware and software  Hope RAMP 3, 4, … self-sustaining NSF/CRI proposal pending to help support effort  2 full-time staff (one HW/gateware, one OS/software)  Look for grad student support at 6 RAMP universities from industrial donations

43 the stone soup of architecture research platforms I/O Patterson Monitoring Kozyrakis Net Switch Oskin Coherence Hoe Cache Asanovic PPC Arvind x86 Lu Glue-support Chiou Hardware Wawrzynek

44 Gateware Design Framework Design composed of units that send messages over channels via ports Units (10,000 + gates)  CPU + L1 cache, DRAM controller…. Channels (  FIFO)  Lossless, point-to-point, unidirectional, in-order message delivery…

45 Gateware Design Framework Insight: almost every large building block fits inside FPGA today  what doesn’t is between chips in real design Supports both cycle-accurate emulation of detailed parameterized machine models and rapid functional-only emulations Carefully counts for Target Clock Cycles Units in any hardware design language (will work with Verilog, VHDL, BlueSpec, C,...) RAMP Design Language (RDL) to describe plumbing to connect units in

46 Quick Sanity Check BEE2 uses old FPGAs (Virtex II), 4 banks DDR2-400/cpu bit Microblazes per Virtex II FPGA, 0.75 MB memory for caches  32 KB direct mapped Icache, 16 KB direct mapped Dcache Assume 150 MHz, CPI is 1.5 (4-stage pipe)  I$ Miss rate is 0.5% for SPECint2000  D$ Miss rate is 2.8% for SPECint2000, 40% Loads/stores BW need/CPU = 150/1.5*4B*(0.5% + 40%*2.8%) = 6.4 MB/sec BW need/FPGA = 16*6.4 = 100 MB/s Memory BW/FPGA = 4*200 MHz*2*8B = 12,800 MB/s Plenty of BW for tracing, …

47 RAMP FAQ on ISAs Which ISA will you pick?  Goal is replaceable ISA/CPU L1 cache, rest infrastructure unchanged (L2 cache, router, memory controller, …) What do you want from a CPU?  Standard ISA (binaries, libraries, …), simple (area), 64-bit (coherency), DP Fl.Pt. (apps)  Multithreading? As an option, but want to get to 1000 independent CPUs When do you need it? 3Q06 RAMP people port my ISA, fix my ISA?  Our plates are full already Type A vs. Type B gateware Router, Memory controller, Cache coherency, L2 cache, Disk module, protocol for each Integration, testing

48 Handicapping ISAs Got it: Power 405 (32b), SPARC v8 (32b), Xilinx Microblaze (32b) Very Likely: SPARC v9 (64b) Likely: IBM Power 64b Probably (haven’t asked): MIPS32, MIPS64 No: x86, x86-64  But Derek Chiou of UT looking at x86 binary translation We’ll sue: ARM  But pretty simple ISA & MIT has good lawyers

49 Related Approaches (1) Quickturn, Axis, IKOS, Thara:  FPGA- or special-processor based gate-level hardware emulators  Synthesizable HDL is mapped to array for cycle and bit-accurate netlist emulation  RAMP’s emphasis is on emulating high-level architecture behaviors Hardware and supporting software provides architecture-level abstractions for modeling and analysis Targets architecture and software research Provides a spectrum of tradeoffs between speed and accuracy/precision of emulation RPM at USC in early 1990’s:  Up to only 8 processors  Only the memory controller implemented with configurable logic

50 Related Approaches (2) Software Simulators Clusters (standard microprocessors) PlanetLab (distributed environment) Wisconsin Wind Tunnel (used CM-5 to simulate shared memory) All suffer from some combination of:  Slowness, inaccuracy, scalability, unbalanced computation/communication, target inflexibility

51 RAMP uses (internal) Internet-in-a-Box Patterson TCC Kozyrakis Dataflow Oskin Reliable MP Hoe 1M-way MT Asanovic BlueSpec Arvind x86 Lu Net-uP Chiou Wawrzynek BEE

52 RAMP Example: UT FAST 1MHz to 100MHz, cycle-accurate, full-system, multiprocessor simulator  Well, not quite that fast right now, but we are using embedded 300MHz PowerPC 405 to simplify X86, boots Linux, Windows, targeting to Pentium M-like designs  Heavily modified Bochs, supports instruction trace and rollback Working on “superscalar” model  Have straight pipeline 486 model with TLBs and caches Statistics gathered in hardware  Very little if any probe effect Work started on tools to semi-automate micro- architectural and ISA level exploration  Orthogonality of models makes both simpler Derek Chiou, UTexas

53 Example: Transactional Memory Processors/memory hierarchy that support transactional memory Hardware/software infrastructure for performance monitoring and profiling  Will be general for any type of event Transactional coherence protocol Christos Kozyrakis, Stanford

54 Example: PROTOFLEX Hardware/Software Co-simulation/test methodology Based on FLEXUS C++ full-system multiprocessor simulator  Can swap out individual components to hardware Used to create and test a non-block MSI invalidation-based protocol engine in hardware James Hoe, CMU

55 Example: Wavescalar Infrastructure Dynamic Routing Switch Directory-based coherency scheme and engine Mark Oskin, U Washington

56 Example RAMP App: “Internet in a Box” Building blocks also  Distributed Computing RAMP vs. Clusters (Emulab, PlanetLab)  Scale: RAMP O(1000) vs. Clusters O(100)  Private use: $100k  Every group has one  Develop/Debug: Reproducibility, Observability  Flexibility: Modify modules (SMP, OS)  Heterogeneity: Connect to diverse, real routers Explore via repeatable experiments as vary parameters, configurations vs. observations on single (aging) cluster that is often idiosyncratic David Patterson, UC Berkeley

57 Old CW: Programming is hard New CW: Parallel programming is really hard  2 kinds of Scientific Programmers Those using single processor Those who can use up to 100 processors  Big steps for programmers From 1 processor to 2 processors From 100 processors to 1000 processors  Can computer architecture make many processors look like fewer processors, ideally one? Old CW: Who cares about I/O in Supercomputing? New CW: Supercomputing = Massive data + Massive Computation Conventional Wisdom (CW) in Scientific Programming

58 Size of Parallel Computer What parallelism achievable with good or bad architectures, good or bad algorithms?  32-way: anything goes  100-way: good architecture and bad algorithms or bad architecture and good algorithms  1000-way: good architecture and good algorithm

59 Parallel Framework - Benchmarks CryptoBoolean EEMBC Angle to Time CAN Remote Bit Manipulation Basic Int Cache Buster

60 IIR PWM Road Speed Matrix Parallel Framework - Benchmarks CryptoBoolean EEMBC iDCT Table Lookup FFT iFFT FIR Pointer Chasing

61 JPEG Parallel Framework - Benchmarks CryptoBoolean EEMBC Hi Pass Gray Scale RGB To YIQ RGB To CMYK JPEG

62 Parallel Framework - Benchmarks CryptoBoolean EEMBC IP NAT, QoS OSPF, TCP IP Packet Check Route Lookup

63 Parallel Framework - Benchmarks CryptoBoolean EEMBC Text Processing Dithering Image Rotation

64 Parallel Framework - Benchmarks CryptoBoolean EEMBC Bit Alloc Autocor Convolution, Viterbi

65 SPECintCPU: 32-bit integer FSM: perlbench, bzip2, minimum cost flow (MCF), Hidden Markov Models (hmm), video (h264avc), Network discrete event simulation, 2D path finding library (astar), XML Transformation (xalancbmk) Sorting/Searching: go (gobmk), chess (sjeng), Dense linear algebra: quantum computer (libquantum), video (h264avc) TBD: compiler (gcc)

66 SPECfpCPU: 64-bit Fl. Pt. Structured grid: Magnetohydrodynamics (zeusmp), General relativity (cactusADM), Finite element code (calculix), Maxwell's E&M eqns solver (GemsFDTD), Fluid dynamics (lbm; leslie3d-AMR), Finite element solver (dealII-AMR), Weather modeling (wrf-AMR) Sparse linear algebra: Fluid dynamics (bwaves), Linear program solver (soplex), Particle methods: Molecular dynamics (namd, 64- bit; gromacs, 32-bit), TBD: Quantum chromodynamics (milc), Quantum chemistry (gamess), Ray tracer (povray), Quantum crystallography (tonto), Speech recognition (sphinx3)

67 Parallel Framework - Benchmarks CryptoBoolean 7 Dwarfs: Use simplest parallel model that works Monte Carlo Dense Structured Unstructured Sparse FFT Particle

68 Parallel Framework - Benchmarks CryptoBoolean Additional 4 Dwarfs (not including FSM, Ray tracing) Comb. Logic Searching / Sorting cryptoFilter

69 CryptoBoolean Angle to Time CAN Remote Bit Manipulation Basic Int Cache Buster Number EEMBC kernels ParallelismStyleOperand Data bit 5100Data bit 10 Stream bit 210Tightly Coupled bit Parallel Framework – EEMBC Benchmarks