1 Thread level parallelism: It’s time now ! André Seznec IRISA/INRIA CAPS team.


2 Focus of high performance computer architecture
- Up to 1980: mainframes
- Up to 1990: supercomputers
- Till now: general purpose microprocessors
- Coming: mobile computing, embedded computing

3 Uniprocessor architecture has driven progress so far
The famous "Moore's Law"

4 Moore's "Law": transistors on a microprocessor
- The number of transistors on a microprocessor chip doubles every 18 months
- 1971: 2,300 transistors (Intel 4004)
- 1989: 1 M transistors (Intel 80486)
- 1999: 130 M transistors (HP PA-8500)
- 2005: 1.7 billion transistors (Intel Itanium Montecito)

5 Moore's "law": performance
Performance doubles every 18 months
- 1989: Intel 486 (< 1 inst/cycle)
- 1995: Pentium Pro, 150 MHz × 3 inst/cycle
- 2002: Pentium 4, 3 GHz × 3 inst/cycle
- 09/2005: Pentium 4, 3.2 GHz × 3 inst/cycle × 2 processors !!

6 Moore's "Law": memory
Memory capacity doubles every 18 months:
- 1983: 64 Kbit chips
- 1989: 1 Mbit chips
- 2005: 1 Gbit chips
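As a quick arithmetic check of the doubling rule, here is a small sketch; the 18-month period and the 1983 and 2005 data points come from the slide, the rest is back-of-the-envelope:

```python
def project_capacity(start_bits, years, doubling_months=18):
    """Project capacity under a fixed doubling period (Moore-style growth)."""
    return start_bits * 2 ** (years * 12 / doubling_months)

# 64 Kbit in 1983 projected forward to 2005, expressed in Gbit:
bits = project_capacity(64 * 1024, 2005 - 1983)
print(round(bits / 2**30, 1))  # → 1.6
```

Projecting the 1983 chip forward 22 years lands within a factor of two of the 1 Gbit chips of 2005, consistent with the slide's rule of thumb.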

7 And parallel machines, so far...
Parallel machines have been built from every processor generation:
- Tightly coupled shared-memory processors:
  - Dual-processor boards
  - Servers with up to 8 processors
- Distributed-memory parallel machines:
  - Hardware-coherent memory (NUMA): servers
  - Software-managed memory: clusters, clusters of clusters...

8 Hardware thread level parallelism has not been mainstream so far
But it might change...
But it will change

9 What has prevented hardware thread parallelism from prevailing ?
- Economic issue:
  - Hardware cost grew superlinearly with the number of processors
- Performance:
  - Parallel machines were never able to use the latest-generation microprocessor
- Scalability issue:
  - Bus snooping does not scale well above 4-8 processors
- Parallel applications are missing:
  - Writing parallel applications requires thinking "parallel"
  - Automatic parallelization works only on small code segments

10 What has prevented hardware thread parallelism from prevailing ? (2)
- We (the computer architects) were also guilty:
  - We kept finding ways to use these transistors in a uniprocessor
  - IC technology only brings the transistors and the frequency
  - We brought the performance:
    - The compiler folks helped a little bit

11 Up to now, what was microarchitecture about ?
- Memory access time is 100 ns
- Program semantics are sequential
- Instruction life (fetch, decode, .., execute, .., memory access, ..) is … ns
- How can we use the transistors to achieve the highest possible performance ?
- So far: up to 4 instructions every 0.3 ns

12 The processor architect challenge
- 300 mm² of silicon
- 2 technology generations ahead
- What can we use for performance ?
  - Pipelining
  - Instruction Level Parallelism
  - Speculative execution
  - Memory hierarchy

13 Pipelining
Just slice the instruction life into equal stages and launch concurrent execution:
[Pipeline diagram: successive instructions I0, I1, I2 advance through the IF, DC, EX, M, CT stages one cycle apart]
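The payoff of this slicing can be quantified with the textbook ideal-speedup formula, assuming no stalls or hazards (which real pipelines do not enjoy):

```python
def pipeline_speedup(n_instructions, n_stages):
    """Ideal speedup of an n_stages-deep pipeline over sequential execution."""
    sequential_cycles = n_instructions * n_stages
    # The pipeline fills for n_stages cycles, then retires one instruction per cycle.
    pipelined_cycles = n_stages + (n_instructions - 1)
    return sequential_cycles / pipelined_cycles

# With the 5 stages on the slide (IF, DC, EX, M, CT), a long
# instruction stream approaches a speedup equal to the pipeline depth:
print(round(pipeline_speedup(1_000_000, 5), 2))  # → 5.0
```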

14 + Instruction Level Parallelism
[Diagram: multiple instructions enter the IF, DC, EX, M, CT pipeline each cycle and proceed in parallel]

15 + out-of-order execution
[Diagram: an instruction waits until its operands are valid, then executes as soon as it is ready, possibly out of program order]

16 + speculative execution
- …% of instructions are branches:
  - Cannot afford to wait 30 cycles for direction and target
- Predict and execute speculatively:
  - Validate at execution time
  - State-of-the-art predictors: ≈ 2 mispredictions per 1000 instructions
- Also predict:
  - Memory (in)dependence
  - (limited) data values
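As an illustration of the prediction principle, here is a toy single 2-bit saturating counter, one of the classic schemes; the state-of-the-art predictors mentioned above are far more elaborate:

```python
def predict_2bit(outcomes):
    """Simulate one 2-bit saturating counter as a branch predictor.
    Counter states 0-1 predict not-taken, states 2-3 predict taken.
    Returns the fraction of correct predictions."""
    counter = 2  # start in "weakly taken"
    correct = 0
    for taken in outcomes:
        prediction = counter >= 2
        if prediction == taken:
            correct += 1
        # Saturating update toward the actual outcome.
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return correct / len(outcomes)

# A loop branch taken 9 times then not taken once, repeated 100 times:
trace = ([True] * 9 + [False]) * 100
print(predict_2bit(trace))  # → 0.9
```

The 2-bit hysteresis means the single loop-exit misprediction does not flip the prediction for the next loop, which is exactly why this beats a 1-bit scheme on loops.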

17 + memory hierarchy
- Main memory response time: ≈ 100 ns ≈ 1000 instructions
- Use a memory hierarchy:
  - L1 caches: 1-2 cycles, 8-64 KB
  - L2 cache: 10 cycles, 256 KB-2 MB
  - L3 cache (coming): 25 cycles, 2-8 MB
- + prefetching to avoid cache misses
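The benefit of the hierarchy follows from the standard average-memory-access-time arithmetic; the 100 ns memory latency is from the slide, while the per-level hit times and hit rates below are illustrative assumptions:

```python
def amat(levels, memory_ns):
    """Average memory access time for a cache hierarchy.
    levels: list of (hit_time_ns, hit_rate) from L1 outward.
    Each level's hit time is paid by the fraction of accesses that
    reach it; misses fall through to the next level, then to memory."""
    total, reach = 0.0, 1.0
    for hit_time_ns, hit_rate in levels:
        total += reach * hit_time_ns
        reach *= 1.0 - hit_rate
    return total + reach * memory_ns

# Illustrative numbers: L1 0.5 ns / 95 %, L2 3 ns / 80 %, L3 8 ns / 60 %
hierarchy = [(0.5, 0.95), (3.0, 0.80), (8.0, 0.60)]
print(round(amat(hierarchy, 100.0), 2))  # → 1.13, far below the raw 100 ns
```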

18 Can we continue to just throw transistors at uniprocessors ?
- Increasing the superscalar degree ?
- Larger caches ?
- New prefetch mechanisms ?

19 One billion transistors now !! The uniprocessor road seems over
- A 16-way uniprocessor seems out of reach:
  - Just not enough ILP
  - Quadratic complexity of a few key (power-hungry) components (register file, bypass network, issue logic)
  - Avoiding temperature hot spots would require very long intra-CPU communications
- 5-7 years to design a 4-way superscalar core: how long to design a 16-way ?

20 One billion transistors: thread level parallelism, it's time now !
- Chip multiprocessor
- Simultaneous multithreading:
  - TLP on a uniprocessor !

21 General purpose Chip MultiProcessor (CMP): why it did not (really) appear before 2003
- Until 2003 there was a better (economic) use for the transistors:
  - Single-process performance is the most important
  - More complex superscalar implementations
  - More cache space:
    - Bring the L2 cache on-chip
    - Enlarge the L2 cache
    - Include an L3 cache (now)
- Diminishing returns !!

22 General Purpose CMP: why it still should not appear as mainstream
- No further (significant) benefit from making single processors more complex:
  - Logically we should use smaller and cheaper chips
  - Or integrate more functionality on the same chip: e.g. the graphics pipeline
- Very poor catalog of parallel applications:
  - Single-processor software is still mainstream
  - Parallel programming is the privilege (knowledge) of a few

23 General Purpose CMP: why they appear as mainstream now !
The economic factor:
- The consumer user pays … euros for a PC
- The professional user pays … euros for a PC
- A constant: the processor represents …% of the PC price
- Intel and AMD will not cut their share

24 The Chip Multiprocessor
- Put a shared-memory multiprocessor on a single die:
  - Duplicate the processor and its L1 cache (maybe the L2 as well)
  - Keep the caches coherent
  - Share the last level of the memory hierarchy (maybe)
  - Share the external interface (to memory and system)

25 Chip multiprocessor: what is the situation (2005) ?
- PCs: dual-core Pentium 4 and AMD64
- Servers:
  - Itanium Montecito: dual-core
  - IBM Power5: dual-core
  - Sun Niagara: 8-processor CMP

26 The server vision: IBM Power 4

27 Simultaneous Multithreading (SMT): parallel processing on a uniprocessor
- Functional units are underused on superscalar processors
- SMT: share the functional units of a superscalar processor between several processes
- Advantages:
  - A single process can use all the resources
  - Dynamic sharing of all structures on parallel/multiprocess workloads

28 Superscalar vs. SMT
[Diagram: issue slots over time; the superscalar core leaves many slots empty, while SMT fills them with instructions from several threads]
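The utilization gap can be illustrated with a toy issue-slot model; the thread count, issue width, and ready-instruction distribution below are arbitrary assumptions, not measurements:

```python
import random

def issued(per_thread_ready, width, smt):
    """Issue slots filled in one cycle by a `width`-wide core.
    A plain superscalar issues from a single thread; SMT draws
    from the ready instructions of all threads."""
    if smt:
        return min(width, sum(per_thread_ready))
    return min(width, per_thread_ready[0])

random.seed(0)
# Four threads, each with 0-2 ready instructions per cycle, 10,000 cycles.
cycles = [[random.randint(0, 2) for _ in range(4)] for _ in range(10_000)]
ss = sum(issued(c, 4, smt=False) for c in cycles) / len(cycles)
smt = sum(issued(c, 4, smt=True) for c in cycles) / len(cycles)
print(ss < smt)  # SMT fills more issue slots on average
```

Any one thread rarely has 4 instructions ready, so the superscalar core averages well under its width; pooling the threads' ready instructions fills most of the slots.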

29 The programmer view

30 SMT: Alpha 21464 (cancelled June 2001)
- 8-way superscalar
- Ultimate performance on a single process
- SMT: up to 4 contexts
- Extra cost in silicon, design, etc.: estimated at 5-10 %

31 General Purpose Multicore SMT: an industry reality (Intel and IBM)
- The Intel Pentium 4 was developed as a 2-context SMT:
  - Coined "hyperthreading" by Intel
  - Dual-core SMT
- Intel Itanium Montecito: dual-core, 2-context SMT
- IBM Power5: dual-core, 2-context SMT

32 The programmer view of a multi-core SMT !

33 Hardware TLP is there !! But where are the threads ? A unique opportunity for the software industry: hardware parallelism comes for free

34 Waiting for the threads (1)
- Artificially generate threads to increase the performance of single threads:
  - Speculative threads: predict threads at medium granularity, either in software or in hardware
  - Helper threads: run ahead a speculative skeleton of the application to:
    - Avoid branch mispredictions
    - Prefetch data

35 Waiting for the threads (2)
- Hardware transient faults are becoming a concern:
  - Run the same thread twice on two cores and check integrity
- Security:
  - Array bound checking is nearly free on an out-of-order core
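The fault-detection idea above can be sketched in software: run the same computation on two threads and cross-check the results. Real designs do this in hardware across lock-stepped cores; this is only an illustration:

```python
import queue
import threading

def redundant_run(fn, *args):
    """Run fn twice on separate threads and cross-check the results:
    a software sketch of redundant execution for transient-fault detection."""
    results = queue.Queue()
    workers = [threading.Thread(target=lambda: results.put(fn(*args)))
               for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    first, second = results.get(), results.get()
    if first != second:
        # A disagreement signals a possible transient fault in one copy.
        raise RuntimeError("results diverge: possible transient fault")
    return first

print(redundant_run(sum, range(100)))  # → 4950
```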

36 Waiting for the threads (3)
- Hardware clock frequency is limited by:
  - The power budget with every core running
  - Temperature hot spots
- On single-thread workloads:
  - Increase the clock frequency and migrate the process

37 Conclusion
- Hardware TLP is becoming mainstream on general-purpose hardware
- Moderate degrees of hardware TLP will be available for the mid-term
- This is the first real opportunity for the whole software industry to go parallel !
- But it might demand a new generation of application developers !!