CS61C L43 Hardware Parallel Computing (1) Garcia, Spring 2007 © UCB
Leakage solved!  Insulators to separate wires on chips have always had problems with current leakage. Air is much better, but hard to manufacture. IBM announces they've found a way!  (news.bbc.co.uk/2/hi/technology/ stm)
Lecturer SOE Dan Garcia   inst.eecs.berkeley.edu/~cs61c
UC Berkeley CS61C : Machine Structures
Lecture 43 – Hardware Parallel Computing
Thanks to Dave Patterson for his Berkeley View slides: view.eecs.berkeley.edu

CS61C L43 Hardware Parallel Computing (2) Garcia, Spring 2007 © UCB
Background: Threads
A thread ("thread of execution") is a single stream of instructions. A program can split, or fork, itself into separate threads, which can (in theory) execute simultaneously. Each thread has its own registers, PC, etc. Threads from the same process operate in the same virtual address space, so switching threads is faster than switching processes! Threads are an easy way to describe and think about parallelism. A single CPU can execute many threads by Time Division Multiplexing.
[Figure: one CPU's time divided among Thread0, Thread1, and Thread2]
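Not from the slides: a minimal POSIX-threads (pthreads) sketch of the fork/join pattern described above. The thread count and the per-thread work are invented for illustration; on a single CPU the OS time-multiplexes these threads, while on a multicore chip they may genuinely run at the same time.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 3

    /* Each forked thread runs this function; arg identifies the thread. */
    void *worker(void *arg) {
        long id = (long)arg;
        for (int i = 0; i < 3; i++)
            printf("thread %ld: step %d\n", id, i);  /* interleaving is up to the scheduler */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);  /* fork NTHREADS threads */
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);                        /* wait for each to finish */
        return 0;  /* all threads shared main's virtual address space */
    }

Compile with gcc -pthread; the output interleaving differs from run to run, which is exactly the scheduling freedom the slide describes.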

CS61C L43 Hardware Parallel Computing (3) Garcia, Spring 2007 © UCB
Background: Multithreading
Multithreading is running multiple threads through the same hardware. Could we do Time Division Multiplexing better in hardware? Sure, if we had the HW to support it!

CS61C L43 Hardware Parallel Computing (4) Garcia, Spring 2007 © UCB
Background: Multicore
Put multiple CPUs on the same die. Why is this better than multiple dies? Smaller, cheaper, and closer, so lower inter-processor latency; the cores can share an L2 cache (complicated); less power. The cost of multicore: complexity and slower single-thread execution.
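Not from the lecture: a hedged sketch of how software typically exploits several cores on one die, assuming a hypothetical 4-core chip. Each thread sums its own slice of an array into its own result slot, and the main thread combines the partial sums, so no locking is needed.

    #include <pthread.h>
    #include <stdio.h>

    #define N       (1 << 20)
    #define NCORES  4                 /* assumed core count for this sketch */

    static double a[N];
    static double partial[NCORES];    /* one result slot per thread: no locks needed */

    /* Each thread sums a contiguous slice of the array. */
    void *sum_slice(void *arg) {
        long id = (long)arg;
        long lo = id * (N / NCORES), hi = lo + N / NCORES;
        double s = 0.0;
        for (long i = lo; i < hi; i++)
            s += a[i];
        partial[id] = s;
        return NULL;
    }

    int main(void) {
        for (long i = 0; i < N; i++) a[i] = 1.0;

        pthread_t t[NCORES];
        for (long i = 0; i < NCORES; i++)
            pthread_create(&t[i], NULL, sum_slice, (void *)i);

        double total = 0.0;
        for (int i = 0; i < NCORES; i++) {
            pthread_join(t[i], NULL);
            total += partial[i];      /* combine per-core results serially */
        }
        printf("sum = %.0f\n", total);  /* expect 1048576 */
        return 0;
    }

With the threads spread over separate cores the slices really do run simultaneously; the low on-die latency mentioned above matters most when threads share data more heavily than this embarrassingly parallel example does.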

CS61C L43 Hardware Parallel Computing (5) Garcia, Spring 2007 © UCB
Multicore Example (IBM Power5)
[Die photo: Core #1, Core #2, and the shared stuff between them]

CS61C L43 Hardware Parallel Computing (6) Garcia, Spring 2007 © UCB
Real World Example: Cell Processor
Multicore, and more… the heart of the PlayStation 3.

CS61C L43 Hardware Parallel Computing (7) Garcia, Spring 2007 © UCB
Real World Example 1: Cell Processor
9 cores (1 PPE, 8 SPEs) at 3.2 GHz.
Power Processing Element (PPE): supervises all activities and allocates work; is multithreaded (2 threads).
Synergistic Processing Element (SPE): where the work gets done; very superscalar; no cache, only a "Local Store".
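Not the actual Cell SDK: a plain-C sketch of the programming style the SPEs' Local Store forces on you. The buffer size, tile size, and function names are invented for illustration; on the real chip the two memcpy calls would be explicit DMA transfers between main memory and the 256 KB Local Store.

    #include <stdio.h>
    #include <string.h>

    #define TILE 16384                      /* floats per chunk; fits comfortably in a 256 KB local store */

    static float local_store[TILE];         /* stand-in for the SPE's on-chip memory */

    /* Pull one tile into the "local store", compute on it, push it back. */
    static void process_tile(float *main_mem, long n) {
        memcpy(local_store, main_mem, n * sizeof(float));  /* stands in for a DMA "get" */
        for (long i = 0; i < n; i++)
            local_store[i] *= 2.0f;                        /* compute only on local data */
        memcpy(main_mem, local_store, n * sizeof(float));  /* stands in for a DMA "put" */
    }

    int main(void) {
        enum { N = 1 << 20 };
        static float data[N];
        for (long i = 0; i < N; i++) data[i] = (float)i;

        /* Walk through main memory one local-store-sized tile at a time. */
        for (long off = 0; off < N; off += TILE) {
            long n = (N - off < TILE) ? N - off : TILE;
            process_tile(data + off, n);
        }

        printf("data[10] = %.1f\n", data[10]);  /* 20.0 */
        return 0;
    }

Because there is no cache, the programmer (or compiler) must schedule these transfers explicitly; that is the price the Cell pays for predictable, high-bandwidth local memory.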

CS61C L43 Hardware Parallel Computing (8) Garcia, Spring 2007 © UCB
Peer Instruction
A. The majority of the PS3's processing power comes from the Cell processor
B. Berkeley profs believe multicore is the future of computing
C. Current multicore techniques can scale well to many (32+) cores
Answer choices (ABC): 0: FFF  1: FFT  2: FTF  3: FTT  4: TFF  5: TFT  6: TTF  7: TTT

CS61C L43 Hardware Parallel Computing (9) Garcia, Spring 2007 © UCB
Peer Instruction Answer
A. The whole PS3 is 2.18 TFLOPS; the Cell is only 204 GFLOPS (the GPU can do a lot…).  FALSE
B. Not multicore, manycore!  FALSE
C. Shared memory and caches are a huge barrier; that's why the Cell has a Local Store!  FALSE
(A. The majority of the PS3's processing power comes from the Cell processor  B. Berkeley profs believe multicore is the future of computing  C. Current multicore techniques can scale well to many (32+) cores)
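A quick check of answer A using only the numbers on the slide (the split between the Cell and the rest of the system is implied, not stated):

\[
\frac{204\ \text{GFLOPS (Cell)}}{2180\ \text{GFLOPS (whole PS3)}} \approx 9\% ,
\]

so roughly 90% of the PS3's peak processing power comes from outside the Cell, chiefly the GPU.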

CS61C L43 Hardware Parallel Computing (10) Garcia, Spring 2007 © UCB
Upcoming Calendar
Week #15 (this week), Mon / Wed / Thu lab / Fri: Re-configurable Computing (Michael) | Parallel Computing in Software (Matt) | Parallel? I/O Networking & 61C Feedback Survey | Parallel Computing in Hardware (Dan's last OH 3pm)
Week #16 (last week o' classes): LAST CLASS: Summary, Review, & HKN Evals | Perf comp due Tues | Wed 2pm Review, 10 Evans | FINAL EXAM Sat 12:30pm-3:30pm, 2050 VLSB

CS61C L43 Hardware Parallel Computing (11) Garcia, Spring 2007 © UCB
High Level Message
Everything is changing; the old conventional wisdom is out. We desperately need a new approach to HW and SW based on parallelism, since industry has bet its future that parallelism works. We need to create a "watering hole" to bring everyone together to quickly find that solution: architects, language designers, application experts, numerical analysts, algorithm designers, programmers, …

CS61C L43 Hardware Parallel Computing (12) Garcia, Spring 2007 © UCB
Conventional Wisdom (CW) in Computer Architecture
1. Old CW: Power is free, but transistors are expensive. New CW: the Power wall. Power is expensive, transistors are "free"; we can put more transistors on a chip than we have the power to turn on.
2. Old CW: Multiplies are slow, but loads are fast. New CW: the Memory wall. Loads are slow, multiplies are fast; about 200 clocks to DRAM, but even an FP multiply takes only 4 clocks.
3. Old CW: More ILP via compiler / architecture innovation (branch prediction, speculation, out-of-order execution, VLIW, …). New CW: the ILP wall. Diminishing returns on more ILP.
4. Old CW: 2X CPU performance every 18 months. New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall.
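A back-of-the-envelope illustration of the memory wall, plugging in the slide's 200-cycle DRAM latency and 4-cycle FP multiply (the assumption that every operand misses the cache is mine):

\[
\text{FPU utilization} \approx \frac{4}{4 + 200} \approx 2\% ,
\qquad
\text{multiplies needed per DRAM access to stay busy} \approx \frac{200}{4} = 50 .
\]

Unless caches or other parallelism hide most of that latency, the fast multiplier simply waits on memory.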

CS61C L43 Hardware Parallel Computing (13) Garcia, Spring 2007 © UCB
Uniprocessor Performance (SPECint)
[Plot of SPECint performance vs. year, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006. Annotations: VAX: 25%/year, 1978 to 1986; RISC + x86: 52%/year, 1986 to 2002; RISC + x86: ??%/year, 2002 to present; a 3X gap is marked against the extrapolated 52%/year trend.]
Sea change in chip design: multiple "cores" or processors per chip.
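To make those growth rates concrete (the arithmetic is mine, not the slide's):

\[
1.25^{8} \approx 6\times \ \text{over 1978 to 1986},
\qquad
1.52^{16} \approx 800\times \ \text{over 1986 to 2002}.
\]

Even a few years off the 52%/year curve compounds into the large gap the plot marks.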

CS61C L43 Hardware Parallel Computing (14) Garcia, Spring 2007 © UCB
Sea Change in Chip Design
Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm² chip.
RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm² chip.
A 125 mm² chip in 65 nm CMOS holds 2312 copies of RISC II + FPU + Icache + Dcache; RISC II shrinks to roughly 0.02 mm² at 65 nm. The processor is the new transistor!
Caches via DRAM or 1-transistor SRAM or 3D chip stacking. Proximity Communication via capacitive coupling at > 1 TB/s? (Ivan Sutherland, Sun / Berkeley)

CS61C L43 Hardware Parallel Computing (15) Garcia, Spring 2007 © UCB
Parallelism again? What's different this time?
"This shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs in novel software and architectures for parallelism; instead, this plunge into parallelism is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional uniprocessor architectures." – Berkeley View, December 2006
The HW/SW industry has bet its future that breakthroughs will appear before it's too late. view.eecs.berkeley.edu

CS61C L43 Hardware Parallel Computing (16) Garcia, Spring 2007 © UCB
Need a New Approach
Berkeley researchers from many backgrounds met between February 2005 and December 2006 to discuss parallelism: circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis.
Krste Asanovic, Ras Bodik, Jim Demmel, John Kubiatowicz, Edward Lee, George Necula, Kurt Keutzer, Dave Patterson, Koushik Sen, John Shalf, Kathy Yelick + others.
They tried to learn from successes in embedded and high performance computing, which led to 7 questions to frame parallel research.

CS61C L43 Hardware Parallel Computing (17) Garcia, Spring 2007 © UCB
7 Questions for Parallelism
Applications: 1. What are the apps? 2. What are kernels of apps?
Hardware: 3. What are HW building blocks? 4. How to connect them?
Programming Model & Systems Software: 5. How to describe apps & kernels? 6. How to program the HW?
Evaluation: 7. How to measure success?
(Inspired by a view of the Golden Gate Bridge from Berkeley)

CS61C L43 Hardware Parallel Computing (18) Garcia, Spring 2007 © UCB
Hardware Tower: What are the problems?
Power limits leading-edge chip designs: the Intel Tejas Pentium 4 was cancelled due to power issues.
Yield on leading-edge processes is dropping dramatically: IBM quotes yields of 10-20% on the 8-processor Cell.
Design/validation of a leading-edge chip is becoming unmanageable: verification teams are now larger than design teams on leading-edge processors.

CS61C L43 Hardware Parallel Computing (19) Garcia, Spring 2007 © UCB
HW Solution: Small is Beautiful
Expect modestly pipelined (5- to 9-stage) CPUs, FPUs, vector units, and Single Instruction Multiple Data (SIMD) Processing Elements (PEs); small cores are not much slower than large cores.
Parallelism is the energy-efficient path to performance: P ≈ V², and lowering threshold and supply voltages lowers the energy per operation.
Redundant processors can improve chip yield: Cisco Metro ships 188 CPUs + 4 spares; Sun Niagara sells 6- or 8-CPU parts.
Small, regular processing elements are easier to verify.
One size fits all? Amdahl's Law ⇒ heterogeneous processors?
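A rough sketch of the P ≈ V² argument (the 0.7 scaling factor is an illustrative assumption; to first order, achievable frequency scales with supply voltage):

\[
P_{\text{dyn}} \approx \alpha C V^{2} f, \qquad f \propto V .
\]

One big core at \((V, f)\): throughput \(\propto f\), power \(\propto V^{2} f\).
Two small cores at \((0.7V, 0.7f)\): throughput \(\propto 2 \cdot 0.7 f = 1.4 f\), power \(\propto 2\,(0.7V)^{2}(0.7f) \approx 0.69\, V^{2} f\).

Roughly 40% more throughput for roughly 30% less power, provided the work actually parallelizes.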

CS61C L43 Hardware Parallel Computing (20) Garcia, Spring 2007 © UCB
Number of Cores/Socket
We need revolution, not evolution. Software or architecture alone can't fix the parallel programming problem; we need innovations in both.
"Multicore": 2X cores per generation: 2, 4, 8, …
"Manycore": 100s of cores gives the highest performance per unit area and per Watt, then 2X per generation: 64, 128, 256, 512, 1024, …
Multicore architectures & programming models good for 2 to 32 cores won't evolve to manycore systems of 1000s of processors. We desperately need HW/SW models that work for manycore, or we will run out of steam (as ILP ran out of steam at about 4 instructions per clock).
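One way to see why techniques that work for 2 to 32 cores don't simply stretch to 1000s is Amdahl's Law (the 3% serial fraction below is an illustrative assumption, not a number from the slides):

\[
\text{Speedup}(n) = \frac{1}{s + \dfrac{1-s}{n}} .
\]

With \(s = 0.03\): Speedup(32) \(\approx 17\), Speedup(1024) \(\approx 32\), and the limit as \(n \to \infty\) is only \(1/s \approx 33\).

A serial fraction you barely notice at 32 cores caps a 1024-core machine at about 32x, which is why manycore demands rethinking the software and memory system rather than just adding cores.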

CS61C L43 Hardware Parallel Computing (21) Garcia, Spring 2007 © UCB
Measuring Success: What are the problems?
1. Only companies can build HW, and it takes years.
2. Software people don't start working hard until hardware arrives; 3 months after the HW arrives, SW people list everything that must be fixed, then we all wait 4 years for the next iteration of HW/SW.
3. How do we get 1000-CPU systems into the hands of researchers so they can innovate in a timely fashion on algorithms, compilers, languages, OS, architectures, …?
4. Can we avoid waiting years between HW/SW iterations?

CS61C L43 Hardware Parallel Computing (22) Garcia, Spring 2007 © UCB
Build Academic Manycore from FPGAs
As roughly 16 CPUs will fit in one Field Programmable Gate Array (FPGA), build a 1000-CPU system from roughly 64 FPGAs?
8 simple 32-bit "soft core" RISC CPUs at 100 MHz fit in 2004 (Virtex-II); FPGA generations come every 1.5 yrs, with about 2X the CPUs and 1.2X the clock rate.
The HW research community does the logic design ("gate shareware") to create an out-of-the-box Manycore, e.g. 1000 processors, standard-ISA binary-compatible, 64-bit, cache-coherent, about 150 MHz/CPU in 2007.
RAMPants: 10 faculty at Berkeley, CMU, MIT, Stanford, Texas, and Washington. "Research Accelerator for Multiple Processors" as a vehicle to attract many to the parallel challenge.
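Plugging in the slide's scaling numbers (the arithmetic is mine): two FPGA generations between 2004 and 2007 give

\[
100\ \text{MHz} \times 1.2^{2} = 144\ \text{MHz} \approx 150\ \text{MHz},
\qquad
8 \times 2^{2} = 32\ \text{simple soft cores per FPGA},
\]

and the slide's more conservative figure of about 16 CPUs per FPGA (presumably heavier 64-bit, cache-coherent cores) times about 64 FPGAs is roughly 1024 CPUs, i.e. the 1000-CPU target.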

CS61C L43 Hardware Parallel Computing (23) Garcia, Spring 2007 © UCB
Why Good for Research Manycore?
Scalability (1k CPUs): SMP C | Cluster A | Simulate A | RAMP A
Cost (1k CPUs): SMP F ($40M) | Cluster C ($2-3M) | Simulate A+ ($0M) | RAMP A ($ M)
Cost of ownership: SMP A | Cluster D | Simulate A | RAMP A
Power/Space (kilowatts, racks): SMP D (120 kw, 12 racks) | Cluster D (120 kw, 12 racks) | Simulate A+ (.1 kw, 0.1 racks) | RAMP A (1.5 kw, 0.3 racks)
Community: SMP D | Cluster A | Simulate A | RAMP A
Observability: SMP D | Cluster C | Simulate A+ | RAMP A+
Reproducibility: SMP B | Cluster D | Simulate A+ | RAMP A+
Reconfigurability: SMP D | Cluster C | Simulate A+ | RAMP A+
Credibility: SMP A+ | Cluster A+ | Simulate F | RAMP B+/A-
Performance (clock): SMP A (2 GHz) | Cluster A (3 GHz) | Simulate F (0 GHz) | RAMP C (0.1 GHz)
GPA: SMP C | Cluster B- | Simulate B | RAMP A-

CS61C L43 Hardware Parallel Computing (24) Garcia, Spring 2007 © UCB
Multiprocessing Watering Hole
Killer app: all CS research and advanced development. RAMP attracts many communities to a shared artifact, which means cross-disciplinary interactions. RAMP as the next standard Research/AD platform? (e.g., VAX/BSD Unix in the 1980s)
[Figure: "RAMP" at the center, surrounded by projects it could host: parallel file system, flight data recorder, transactional memory, fault insertion to check dependability, data center in a box, Internet in a box, dataflow language/computer, security enhancements, router design, compile to FPGA, parallel languages, 128-bit floating point libraries]

CS61C L43 Hardware Parallel Computing (25) Garcia, Spring 2007 © UCB
Reasons for Optimism towards a Parallel Revolution this time
End of the sequential microprocessor and ever-faster clock rates: no looming sequential juggernaut to kill the parallel revolution; the SW & HW industries are fully committed to parallelism (end of the La-Z-Boy Programming Era).
Moore's Law continues, so soon we can put 1000s of simple cores on an economical chip.
Communication between cores within a chip has low latency (20X) and high bandwidth (100X): processor-to-processor is fast even if memory is slow.
All cores are an equal distance from shared main memory: fewer data distribution challenges.
The Open Source Software movement means the SW stack can evolve more quickly than in the past.
RAMP as a vehicle to ramp up parallel research.