Parallelism: A Serious Goal or a Silly Mantra (some half-thought-out ideas)

Presentation transcript:

Parallelism: A Serious Goal or a Silly Mantra (some half-thought-out ideas)

Random thoughts on Parallelism
Why the sudden preoccupation with parallelism?
The Silliness (or what I call Meganonsense)
– Break the problem → use half the energy
– 1000 mickey mouse cores
– Hardware is sequential
– Server throughput (how many pins?)
– What about GPUs and databases?
Current obstacles to exploiting parallelism (or are they?)
– Dark silicon
– Amdahl's Law
– The Cloud
The answer
– The fundamental concept vis-à-vis parallelism
– What it means re: the transformation hierarchy

It starts with the raw material (Moore's Law)
The first microprocessor (Intel 4004), 1971
– 2,300 transistors
– 108 kHz
The Pentium chip, 1992
– 3.1 million transistors
– 66 MHz
Today
– more than one billion transistors
– frequencies in excess of 5 GHz
Tomorrow?
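As a rough sanity check on those numbers (my arithmetic, not the slide's): doubling the transistor count every two years, starting from the 4004's 2,300 transistors in 1971, predicts for 2011

$$2300 \times 2^{(2011-1971)/2} = 2300 \times 2^{20} \approx 2.4 \times 10^9,$$

which lands right at the "more than one billion transistors" mark.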

And what we have done with this raw material

Too many people do not realize: parallelism did not start with multi-core
– Pipelining
– Out-of-order execution
– Multiple operations in a single microinstruction
– VLIW (horizontal microcode exposed to the software)
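To make the point concrete (a minimal C sketch of my own; the slide names the techniques but shows no code): even a strictly sequential loop is full of parallelism that the machinery below the ISA already exploits.

```c
#include <stdio.h>

int main(void) {
    int a[4] = {1, 2, 3, 4};
    int b[4] = {5, 6, 7, 8};
    int c[4];

    /* The four multiplies below are independent: no iteration reads a
     * value another iteration writes.  A pipelined core overlaps their
     * fetch/decode/execute stages; an out-of-order core issues them as
     * soon as operands are ready; a VLIW compiler can pack several into
     * one wide instruction.  The programmer wrote sequential code and
     * never asked for any of this. */
    for (int i = 0; i < 4; i++)
        c[i] = a[i] * b[i];

    printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```

Nothing in the source asked for parallel execution; the hardware and compiler found it anyway.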

One thousand mickey mouse cores
Why not a million? Why not ten million?
Let's start with 16
– What if we could replace 4 with one more powerful core?
…and we learned:
– One more powerful core is not enough
– Sometimes we need several
– MorphCore was born
– BUT not all MorphCore (fixed function vs. flexibility)

The Asymmetric Chip Multiprocessor (ACMP)
[Figure: three chip layouts compared]
– "Niagara" approach: all small, Niagara-like cores
– "Tile-Large" approach: all large cores
– ACMP approach: one large core plus many Niagara-like cores
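A back-of-the-envelope model of why the asymmetric layout is attractive, in the style of Hill and Marty's "Amdahl's Law in the Multicore Era" (the budget and numbers here are my illustration, not the slide's): give the chip a budget of n = 16 small-core equivalents, let a large core built from r of them run sqrt(r) times faster than a small core, and let a fraction f of the work be parallelizable. Then

$$\mathrm{Speedup}_{\mathrm{ACMP}} = \frac{1}{\dfrac{1-f}{\sqrt{r}} + \dfrac{f}{\sqrt{r} + (n-r)}}.$$

With f = 0.9 and r = 4, the ACMP gets 1/(0.05 + 0.9/14) ≈ 8.7×, while 16 small cores ("Niagara") get 1/(0.1 + 0.9/16) = 6.4× and four large cores ("Tile-Large") get about 6.2×. The large core attacks the serial term; the sea of small cores attacks the parallel term.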

Large core vs. Small core
Large core:
– Out-of-order
– Wide fetch (e.g., 4-wide)
– Deeper pipeline
– Aggressive branch predictor (e.g., hybrid)
– Many functional units
– Trace cache
– Memory dependence speculation
Small core:
– In-order
– Narrow fetch (e.g., 2-wide)
– Shallow pipeline
– Simple branch predictor (e.g., gshare)
– Few functional units

Throughput vs. Serial Performance

Server throughput
The Good News: not a software problem
– Each core runs its own problem
The Bad News: how many pins?
– Memory bandwidth
More Bad News: how much energy?
– Each core runs its own problem
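To see why pins are the worry (illustrative numbers of my choosing, not the slide's): suppose each of 64 cores streams a modest 2 GB/s to and from memory. A single DDR4-3200 channel delivers 3200 MT/s × 8 B ≈ 25.6 GB/s, so the chip needs

$$\frac{64 \times 2\ \mathrm{GB/s}}{25.6\ \mathrm{GB/s\ per\ channel}} = 5\ \text{memory channels},$$

each costing on the order of a hundred package pins, and that is before any core shares anything with a neighbor.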

What about GPUs and databases?
In theory, absolutely!
GPUs (SMT + SIMD + predication)
– Provided there are no conditional branches (divergence)
– Provided memory accesses line up nicely (coalescing)
Databases
– Provided there are no critical sections
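A sketch of what divergence costs (plain C standing in for the GPU, since the slide shows no code; the group-of-lanes model is my simplification): when the lanes of a SIMD group disagree on a branch, the hardware executes both paths with inactive lanes masked off, so the group pays the sum of the two path lengths instead of just one.

```c
#include <stdio.h>

#define LANES 32  /* one SIMD group's worth of lanes */

/* Toy SIMT cost model: if any lane takes the 'then' path and any lane
 * takes the 'else' path, the group executes BOTH paths with inactive
 * lanes masked off, paying then_cost + else_cost instead of one. */
static int group_cost(const int take_then[LANES], int then_cost, int else_cost) {
    int any_then = 0, any_else = 0;
    for (int lane = 0; lane < LANES; lane++) {
        if (take_then[lane]) any_then = 1;
        else                 any_else = 1;
    }
    return (any_then ? then_cost : 0) + (any_else ? else_cost : 0);
}

int main(void) {
    int uniform[LANES]   = {0};  /* every lane takes the same path      */
    int divergent[LANES] = {0};  /* alternate lanes take opposite paths */
    for (int lane = 0; lane < LANES; lane += 2)
        divergent[lane] = 1;

    printf("uniform branch:   %d cycles\n", group_cost(uniform, 10, 10));
    printf("divergent branch: %d cycles\n", group_cost(divergent, 10, 10));
    return 0;
}
```

The uniform branch costs 10 cycles, the divergent one 20: the "provided there are no conditional branches" caveat in numbers.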

Dark Silicon
Too many transistors: we cannot power them all
– All those cores powered down
– All that parallelism wasted
Not really: the refrigerator! (a.k.a. accelerators)
– Fork (in parallel)
– Although not all at the same time!

Amdahl's Law
The serial bottleneck always limits performance
Heterogeneous cores AND control over them can minimize the effect
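For reference, the law itself (the standard statement, not printed on the slide): with a fraction f of the work parallelizable across n cores,

$$\mathrm{Speedup}(f,n) = \frac{1}{(1-f) + \dfrac{f}{n}} \;\le\; \frac{1}{1-f}.$$

Even 95% parallel code caps out at 20× on infinitely many cores. A faster serial core shrinks the (1 − f) term directly, which is exactly the argument for heterogeneous cores.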

The Cloud
It is behind the curtain; how do we manage it?
– Answer: the on-chip run-time system
– Answer: pragmas beyond the Cloud

The fundamental concept: Synchronization

The transformation hierarchy:
Problem → Algorithm → Program → ISA (Instruction Set Architecture) → Microarchitecture → Circuits → Electrons

At every layer we synchronize
– Algorithm: task dependencies
– ISA: sequential control flow (implicit)
– Microarchitecture: ready bits
– Circuit: clock cycle (implicit)
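The microarchitecture line is the easiest to miss, so here is a toy version (my sketch; real scoreboards are wired logic, not a software loop): operations carry ready bits, and an operation fires as soon as both bits are set, whatever order the program listed them in.

```c
#include <stdio.h>

/* Toy "ready bit" scheduler in the spirit of restricted data flow:
 * an operation fires as soon as both operands are ready, regardless
 * of the order in which the operations were written. */
typedef struct {
    const char *name;
    int src1, src2;     /* producer indices; -1 means constant (ready) */
    int ready1, ready2; /* the ready bits */
    int done;
} Op;

int main(void) {
    Op ops[] = {
        { "op0: load A", -1, -1, 1, 1, 0 },
        { "op1: load B", -1, -1, 1, 1, 0 },
        { "op2: A + B",   0,  1, 0, 0, 0 },
        { "op3: A * 2",   0, -1, 0, 1, 0 },
    };
    const int n = 4;
    int finished = 0, cycle = 0;

    while (finished < n) {
        int fired[4], nf = 0;
        printf("cycle %d:", cycle);
        /* fire every op whose ready bits are both set */
        for (int i = 0; i < n; i++)
            if (!ops[i].done && ops[i].ready1 && ops[i].ready2) {
                ops[i].done = 1;
                fired[nf++] = i;
                printf("  %s", ops[i].name);
            }
        /* end of cycle: broadcast results, setting consumers' ready bits */
        for (int k = 0; k < nf; k++)
            for (int j = 0; j < n; j++) {
                if (ops[j].src1 == fired[k]) ops[j].ready1 = 1;
                if (ops[j].src2 == fired[k]) ops[j].ready2 = 1;
            }
        finished += nf;
        cycle++;
        printf("\n");
    }
    return 0;
}
```

The two loads fire in cycle 0; the add and the multiply fire together in cycle 1, because their ready bits, not their program order, decide when they run.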

Who understands this?
– Should this be part of students' parallelism education?
– Where should it come in the curriculum?
– Can students even understand these different layers?

Parallel to Sequential to Parallel
Guri says: think sequential, execute parallel
– i.e., don't throw away 60 years of computing experience
– The original HPS model of out-of-order execution
– Synchronization is obvious: restricted data flow
At the higher level, parallel at larger granularity
– Pragmas in Java? Who would have thought!
– Dave Kuck's CEDAR project, vintage 1985
– Synchronization is necessary: coarse-grain data flow
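What "think sequential, execute parallel" looks like at the higher level (my illustration; the talk names pragmas but shows no code): the programmer writes an ordinary sequential loop, and one pragma hands the parallelization to the layers below.

```c
#include <stdio.h>

#define N 1000000

static double a[N];

int main(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++) a[i] = (double)i;

    /* Written, read, and debugged as a sequential loop.  The single
     * pragma delegates the parallelization (and the synchronization
     * it needs, here a reduction) to the compiler and run-time. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f\n", sum);  /* N*(N-1)/2 = 499999500000 */
    return 0;
}
```

Compile with `gcc -fopenmp`; without the flag the pragma is ignored and the same sequential program runs to the same answer, which is rather the point.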

Can we do more?
The run-time system: part of the chip design
– The chip knows the chip's resources
– On-chip monitoring can supply information
– The run-time system can direct the use of those resources
The Cloud: the other extreme, and today's be-all
– How do we harness its capability?
– What is needed from the hierarchy to make it work?

My message
Parallelism is a serious goal IF we want to solve the most challenging problems (cure cancer, predict tsunamis)
Telling people to think parallel is nice, but often silly
Examining the transformation hierarchy and seeing where we can leverage each layer seems to me a sounder approach