High Performance in the Multi-core Era: The Role of the Transformation Hierarchy
Yale Patt, The University of Texas at Austin
University of California, Irvine, March 2, 2012

Outline
What is the transformation hierarchy?
How do we deal with it?
–First, the nonsense
–Second, what we need to do differently
–What can we expect if we do?
How do we make that happen?

The transformation hierarchy:
Problem
Algorithm
Program
ISA (Instruction Set Architecture)
Microarchitecture
Circuits
Electrons

Outline
What is the transformation hierarchy?
How do we deal with it?
–First, the Mega-nonsense
–Second, what we need to do differently
–What can we expect if we do?
How do we make that happen?

Important to disabuse ourselves of a lot of Mega-nonsense:
Multi-core was an architectural solution
Hardware works sequentially
Make the hardware simple – thousands of cores
ILP is dead
Abstraction is a pure good
All programmers need to be protected
Thinking in parallel is hard

Some of the nonsense:
Multi-core was an architectural solution
Hardware works sequentially
Make the hardware simple – thousands of cores
ILP is dead
Abstraction is a pure good
All programmers need to be protected
Thinking in parallel is hard

Multi-core: How we got to where we are (i.e., Moore's Law)
The first microprocessor (Intel 4004), 1971
–2300 transistors
–108 kHz
The Pentium chip, 1993
–3.1 million transistors
–66 MHz
Today
–More than one billion transistors
–Frequencies in excess of 5 GHz
Tomorrow?
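
As a back-of-the-envelope check on the doubling rate these numbers imply (my arithmetic, not the slide's; it takes one billion transistors in 2012 as the endpoint):

```latex
\frac{\log_2\!\bigl(10^{9}/2300\bigr)}{2012 - 1971}
  \approx \frac{18.7}{41}
  \approx 0.46 \ \text{doublings per year}
```

That is one doubling roughly every two years – Moore's Law as commonly stated.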

What have we done with these transistors?

Multi-core: How we got to where we are
In the beginning: a better and better uniprocessor
–improving performance on the hard problems
–…until it just got too hard
Followed by: a uniprocessor with a bigger L2 cache
–forsaking further improvement on the "hard" problems
–utilizing chip area sub-optimally
Today: adding more and more cores
–dual-core, quad-core, octa-core, 32 cores and counting…
–Why? It was the obvious, most cost-effective next step
Tomorrow: ???

So, What's the Point?
Multi-core exists; we will have 1000-core chips (…if we are not careful!)
But it wasn't because of computer architects
Ergo, we do not have to accept multi-core as it is
i.e., we can get it right in the future, and that means:
–What goes on the chip
–What are the interfaces

Some of the nonsense:
Multi-core was an architectural solution
Hardware works sequentially
Make the hardware simple – thousands of cores
ILP is dead
Abstraction is a pure good
All programmers need to be protected
Thinking in parallel is hard

The Asymmetric Chip Multiprocessor (ACMP)
Three ways to spend the same chip area:
–"Tile-Large" approach: a few large cores
–"Niagara" approach: many small, Niagara-like cores
–ACMP approach: one large core plus many small, Niagara-like cores

Heterogeneous, not Homogeneous
My mantra since 2002 (the Robert Chien lecture at Illinois)
–Pentium X / Niagara Y
More recently, ACMP
–First published as a UT tech report in February 2007
–Later: the PhD dissertation (2010) of M. Aater Suleman
–In between: ISCA 2010 (Data Marshaling), ASPLOS (ACS – handling critical sections)
Today, we are working on MorphCore

Large Core vs. Small Core

Large core:
–Out-of-order
–Wide fetch (e.g., 4-wide)
–Deeper pipeline
–Aggressive branch predictor (e.g., hybrid)
–Many functional units
–Trace cache
–Memory dependence speculation

Small core:
–In-order
–Narrow fetch (e.g., 2-wide)
–Shallow pipeline
–Simple branch predictor (e.g., gshare)
–Few functional units

Throughput vs. Serial Performance
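
The graph from this slide does not survive in the transcript, but the tradeoff it plots can be illustrated with Amdahl's law. A minimal sketch under assumed numbers – an area budget of 16 small cores, where one large core costs the area of 4 small cores and delivers twice the serial performance; these values are illustrative, not from the talk:

```c
#include <stdio.h>

/* Amdahl's-law speedup for a program with parallel fraction f,
 * serial performance s (relative to one small core), and total
 * parallel throughput t (in small-core equivalents). */
static double speedup(double f, double s, double t) {
    return 1.0 / ((1.0 - f) / s + f / t);
}

int main(void) {
    double f = 0.9;  /* assumed parallel fraction */
    printf("Niagara    (16 small):           %.2f\n", speedup(f, 1.0, 16.0));
    printf("Tile-Large (4 large):            %.2f\n", speedup(f, 2.0, 8.0));
    printf("ACMP       (1 large + 12 small): %.2f\n", speedup(f, 2.0, 14.0));
    return 0;
}
```

With these assumptions the ACMP wins (about 8.8 versus 6.4 and 6.2): it keeps large-core serial performance while retaining most of the small-core throughput, which is the argument for heterogeneity.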

Some of the nonsense:
Multi-core was an architectural solution
Hardware works sequentially
Make the hardware simple – thousands of cores
ILP is dead
Abstraction is a pure good
All programmers need to be protected
Thinking in parallel is hard

ILP is dead
We double the number of transistors on the chip
–Pentium M: 77 million transistors (50M for the L2 cache)
–2nd generation: 140 million (110M for the L2 cache)
We see 5% improvement in IPC
Ergo: ILP is dead!
Perhaps we have blamed the wrong culprit.
The EV4, 5, 6, 7, 8 data, from EV4 to EV8:
–Performance improvement: 55X
–Performance from frequency: 7X
–Ergo: 55/7 > 7 – more than half due to microarchitecture
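
The arithmetic behind that last bullet, assuming the total speedup factors multiplicatively into a frequency part and a microarchitecture part:

```latex
S_{\text{total}} = S_{\text{freq}} \times S_{\mu\text{arch}}
\quad\Longrightarrow\quad
S_{\mu\text{arch}} = \frac{55}{7} \approx 7.9 > 7 = S_{\text{freq}}
```

In this decomposition, microarchitecture contributed a larger factor than frequency did.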

Moore's Law
A law of physics
A law of process technology
A law of microarchitecture
A law of psychology

Some of the nonsense:
Multi-core was an architectural solution
Hardware works sequentially
Make the hardware simple – thousands of cores
ILP is dead
Abstraction is a pure good
All programmers need to be protected
Thinking in parallel is hard

Can Abstractions lead to trouble?
–Taxi to the airport
–The Scheme chip
–Sorting
–Microsoft developers (deeper understanding)

Some of the nonsense:
Multi-core was an architectural solution
Hardware works sequentially
Make the hardware simple – thousands of cores
ILP is dead
Abstraction is a pure good
All programmers need to be protected
Thinking in parallel is hard

Two kinds of people with problems to solve
Those who need the utmost in performance
–and are willing to trade ease of use for extra performance
–For them, we need to provide a more powerful engine (heavyweight cores for ILP, accelerators for specific tasks)
Those who just need to get their job done
–and are content with trading performance for ease of use (they really do not want to know how computers work)
–For them, we need a software layer to bridge the gap

Some of the nonsense:
Multi-core was an architectural solution
Hardware works sequentially
Make the hardware simple – thousands of cores
ILP is dead
Abstraction is a pure good
All programmers need to be protected
Thinking in parallel is hard

Thinking in Parallel is Hard?

Perhaps: Thinking is Hard

Thinking in Parallel is Hard?
Perhaps: Thinking is Hard
How do we get people to believe: thinking in parallel is natural?

Outline
What is the transformation hierarchy?
How do we deal with it?
–First, the nonsense
–Second, what we need to do differently
–What can we expect if we do?
How do we make that happen?

How do we harness tens of billions of transistors on each chip?
In the old days, just add "stuff"
–Everything was transparent to the software
–Performance came because the engine was more powerful (branch prediction, out-of-order execution, larger caches)
–But lately, the "stuff" has been more cores, which is no longer transparent to the software
Today, bandwidth and energy can kill you
Tomorrow, both will be worse…
Unless we find a better paradigm

In my view, that means exploiting the transformation hierarchy.
I propose four steps to help make that happen.

The transformation hierarchy:
Problem
Algorithm
Program
ISA (Instruction Set Architecture)
Microarchitecture
Circuits
Electrons

Step 1: We Must Break the Layers
(We already have in limited cases)
–Pragmas in the language (a small example follows this list)
–The Refrigerator
–X + Superscalar
–The algorithm, the language, the compiler, and the microarchitecture all working together
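
One everyday instance of "pragmas in the language" – not an example from the talk, just an illustration – is a programmer hint that flows down through the compiler to the branch-handling hardware. A minimal sketch in C, using the real GCC/Clang builtin __builtin_expect:

```c
/* The programmer knows something the lower layers cannot infer:
 * this error check almost never fires. __builtin_expect passes
 * that knowledge down, so the compiler can keep the hot path
 * fall-through (and emit static branch hints where the ISA has them). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int sum_checked(const int *buf, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (unlikely(buf[i] < 0))  /* rare error path */
            return -1;
        sum += buf[i];
    }
    return sum;
}
```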

IF we break the layers:
Compiler, Microarchitecture
–Multiple levels of cache
–Block-structured ISA
–Part by compiler, part by uarch
–Fast track, slow track
Algorithm, Compiler, Microarchitecture
–X + superscalar – the Refrigerator
–Niagara X / Pentium Y
Microarchitecture, Circuits
–Verification hooks
–Internal fault tolerance

Step 2: We must recognize we need ILP cores
Probably MorphCores:
–High-performance out-of-order for serial code
–SMT in-order when throughput dominates

Step 3: We need more than one interface
Those who need the utmost in performance
–and are willing to trade ease of use for extra performance
–For them, we need to provide a more powerful engine (heavyweight cores for ILP, accelerators for specific tasks)
Those who just need to get their job done
–and are content with trading performance for ease of use
–For them, we need a software layer to bridge the gap to the multi-core
…which means:
–Chip designers understanding what algorithms need
–Algorithms folks understanding what hardware provides
–Knowledgeable software folks bridging the two interfaces

Step 4: The future run-time system
–Much closer to the microarchitecture
–Procedures called by compiled code
–Inputs from on-chip monitoring
–Decides how to utilize on-chip resources
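
A minimal sketch of what such a run-time hook might look like. Everything here is hypothetical – the struct, the counter names, and rt_place_thread are invented for illustration; the talk does not specify an interface:

```c
#include <stdint.h>

/* Hypothetical monitoring data the chip would expose to the run-time. */
struct hw_counters {
    uint64_t retired_insts;
    uint64_t stall_cycles;
    uint64_t l2_misses;
};

enum core_kind { CORE_LARGE_OOO, CORE_SMALL_INORDER };

/* Hypothetical hook, called by compiled code at phase boundaries:
 * read the monitors, decide where the thread should run next. */
enum core_kind rt_place_thread(const struct hw_counters *c) {
    uint64_t cycles = c->retired_insts + c->stall_cycles;
    if (cycles == 0)
        return CORE_SMALL_INORDER;  /* no data yet */
    double busy = (double)c->retired_insts / (double)cycles;
    /* Mostly stalled => latency-bound, serial-looking phase:
     * give it the large out-of-order core, which can hide latency. */
    return (busy < 0.5) ? CORE_LARGE_OOO : CORE_SMALL_INORDER;
}
```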

Outline
What is the transformation hierarchy?
How do we deal with it?
–First, the nonsense
–Second, what we need to do differently
–What can we expect if we do?
How do we make that happen?

If we are willing:
IF we are willing to continue to pursue ILP
–Because high-performance tasks cannot do without it
IF we are willing to break the layers
–Because we don't have a choice anymore
IF we are willing to embrace parallel programming
IF we are willing to accept more than one interface
–One for those who can exploit details of the hardware
–Another for those who cannot

Then we can provide the chips of tomorrow:
–Performance that keeps pace with demand
–Realistic power requirements
–Fault tolerance
–Verifiable
–Secure
–…

And we can support both sets of users:
For those who wish to "cure cancer"
–Chip designers must provide the low-level interface
–Knowledgeable algorithm designers will exploit it
For those who wish to use a high-level interface
–Chip designers must provide the hooks
–Knowledgeable software designers must bridge the gap

Outline
What is the transformation hierarchy?
How do we deal with it?
–First, the nonsense
–Second, what we need to do differently
–What can we expect if we do?
How do we make that happen?

To start with: Education
i.e., do not accept the premise: "It is okay to know only one layer."

Students can understand more than one layer
What if we get rid of "Objects FIRST"?
–Students do not get it – they have no underpinnings
–Objects are too high a level of abstraction
–So, students end up memorizing
–Memorizing isn't learning (and certainly not understanding)
What if we START with "motivated" bottom-up?
–Students build on what they already know
–Memorizing is replaced with real learning
–Continually raising the level of abstraction
The student sees the layers from the start
–The student makes the connections
–The student understands what is going on
–The layers get broken naturally

Why do I waste your time talking about education?
–Microsoft developers who started with my motivated bottom-up approach say it makes better programmers of them
–Bill Gates has complained more than once about the capability of the new graduates they interview
–You can make a difference

Again: IF we understand more than our own layer of the transformation hierarchy, so we really can talk to each other, THEN maybe we can make all of the above happen.

Thank you!