1 Thread level parallelism: It’s time now ! André Seznec IRISA/INRIA CAPS team.


2 Disclaimer  No nice mathematical formulas  No theorems  BTW: I passed the agrégation in mathematics, but that was a long time ago

3 Focus of high performance computer architecture  Up to 1980:  Mainframes  Up to 1990:  Supercomputers  Until now:  General-purpose microprocessors  Coming:  Mobile computing, embedded computing

4 Uniprocessor architecture has driven performance progress so far The famous “Moore’s law”

5 “Moore’s Law”: predicts exponential growth  Every 18 months:  The number of transistors on a chip doubles  The performance of the processor doubles  The size of the main memory doubles  Over 30 years: × 1,000,000

6 Moore’s Law: transistors on a processor chip  [Chart: transistor counts per chip over time, from the HP 8500 era up to Montecito; annotated with IRISA headcount growing from 80 through 200 and 350 to 550 people]

7 Moore’s Law: bits per DRAM chip  [Chart: DRAM capacity growing from 1 Kbit through 64 Kbit, 1 Mbit and 64 Mbit to 1 Gbit; annotated with the successive IRISA directors: M. Métivier, G. Ruget, L. Kott, J.P. Verjus, J.P. Banatre, C. Labit]

8 Moore’s Law: performance  [Chart: processor performance growth over time; annotated with J. Lenfant, A. Seznec, P. Michaud, F. Bodin, I. Puaut and the CAPS group]

9 And parallel machines, so far…  Parallel machines have been built from every processor generation:  Hardware coherent shared memory multiprocessors: Dual-processor boards Up to 8-processor servers Large (very expensive) hardware coherent NUMA architectures  Distributed memory systems: Intel iPSC, Paragon Clusters, clusters of clusters…

10 Parallel machines have not been mainstream so far  But it might change…  But it will change!

11 What has prevented parallel machines from prevailing?  Economic issue:  Hardware cost grew superlinearly with the number of processors  Performance: never able to use the latest-generation microprocessor  Scalability issue: bus snooping does not scale well above 4-8 processors  Parallel applications are missing:  Writing parallel applications requires thinking “parallel”  Automatic parallelization works only on small code segments

12 What has prevented parallel machines from prevailing? (2)  We (the computer architects) were also guilty:  We had just figured out how to use those transistors in a uniprocessor  IC technology brought the transistors and the frequency  We brought the performance:  Compiler guys also helped a little bit

13 Up to now, what was microarchitecture about?  Memory access time is 100 ns  Program semantics are sequential  Instruction life (fetch, decode, …, execute, …, memory access, …) is ns  How can we use the transistors to achieve the highest performance possible?  So far, up to 4 instructions every 0.3 ns

14 The architect tool box for uniprocessor performance  Pipelining  Instruction Level Parallelism  Speculative execution  Memory hierarchy

15 Pipelining  Just slice the instruction life into equal stages and launch concurrent executions:

I0: IF DC EX M CT
I1:    IF DC EX M CT
I2:       IF DC EX M CT
                → time
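The gain from slicing can be put in one formula: with an n-stage pipeline, k instructions complete in n + k − 1 cycles instead of n·k, approaching one instruction per cycle for large k. A minimal timing model (idealized: no hazards, no stalls):

```c
/* Idealized pipeline timing model: no hazards, no stalls. */
long pipelined_cycles(long n_stages, long k_insts) {
    /* First instruction fills the pipe, then one retires per cycle. */
    return n_stages + k_insts - 1;
}

long unpipelined_cycles(long n_stages, long k_insts) {
    /* Each instruction occupies the whole datapath before the next starts. */
    return n_stages * k_insts;
}
```

For 1000 instructions on a 5-stage pipeline this gives 1004 cycles instead of 5000, i.e. nearly one instruction per cycle.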

16 + Instruction Level Parallelism  [Diagram: a superscalar pipeline with several instructions entering IF DC EX M CT together each cycle]

17 + out-of-order execution  To optimize resource usage: execute as soon as operands are valid  [Diagram: an instruction waiting for its operands while later independent instructions proceed through IF DC EX M CT out of program order]

18 + speculative execution  % branches:  On Pentium 4: direction and target known at cycle 31!!  Predict and execute speculatively:  Validate at execution time  State-of-the-art predictors: ≈2 mispredictions per 1000 instructions  Also predict:  Memory (in)dependency  (limited) data values

19 + memory hierarchy  [Diagram: processor with small I$ and D$ (1-2 cycles), an L2 cache (10 cycles), and main memory 300 cycles ≈ 1000 instructions away]
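Those latency ratios are why access order matters to software: walking an array along cache lines pays the 1-2 cycle level, while jumping across lines keeps paying the outer levels. A sketch contrasting the two orders (both compute the same sum; only locality differs):

```c
#define ROWS 256
#define COLS 256

/* Row-major traversal: consecutive accesses stay in the same cache line. */
long sum_row_major(int m[ROWS][COLS]) {
    long s = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal: each access jumps COLS * sizeof(int) bytes,
   touching a different cache line almost every time for large matrices. */
long sum_col_major(int m[ROWS][COLS]) {
    long s = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}
```

Both return the same result; on a real machine the row-major version runs markedly faster once the matrix exceeds the cache.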

20 + prefetch  On a miss in the L2 cache:  the processor stalls for 300 cycles!?!  Try to predict which memory block will be missing in the near future and bring it into the L2 cache
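Software can request blocks explicitly too: GCC and Clang expose a `__builtin_prefetch(addr, rw, locality)` intrinsic. A sketch, where the lookahead distance of 16 elements is an arbitrary assumption to be tuned per machine:

```c
/* Sum with software prefetch: request data ~16 elements ahead so it
   reaches the cache before the loop needs it (GCC/Clang builtin). */
long sum_with_prefetch(const long *a, long n) {
    long s = 0;
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], /* rw: read */ 0, /* locality */ 1);
        s += a[i];
    }
    return s;
}
```

The prefetch is only a hint: the result is identical with or without it, but a well-chosen distance hides part of the miss latency.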

21 Can we continue to just throw transistors at uniprocessors?  Increasing instructions per cycle?  Larger caches?  New prefetch mechanisms?

22 One billion transistors now!! The uniprocessor road seems over  way uniprocessor seems out of reach:  just not enough ILP  quadratic complexity of a few key components: register file, bypass network, issue logic, …  avoiding temperature hot spots would require very long intra-CPU communications  5-7 years to design a 4-way superscalar core: how long to design a 16-way?

23 One billion transistors: Process/Thread level parallelism.. it’s time now !  Simultaneous multithreading:  TLP on a uniprocessor !  Chip multiprocessor Combine both !!

24 Simultaneous Multithreading (SMT): parallel processing on a uniprocessor  Functional units are underused on superscalar processors  SMT:  share the functional units of a superscalar processor between several processes  Advantages:  a single process can use all the resources  dynamic sharing of all structures on parallel/multiprocess workloads

25 [Diagram: issue slots over time on a superscalar vs an SMT processor; the superscalar leaves many slots unused, while SMT fills them with instructions from threads 1-4]

26 The Chip Multiprocessor  Put a shared memory multiprocessor on a single die:  Duplicate the processor, its L1 caches, maybe its L2  Keep the caches coherent  Share the last level of the memory hierarchy (maybe)  Share the external interface (to memory and system)

27 General purpose Chip MultiProcessor (CMP): why it did not (really) appear before 2003  Until 2003, there was a better (economic) use for transistors:  Single-process performance is what matters most  More complex superscalar implementations  More cache space  Diminishing returns!!

28 General Purpose CMP: why it should still not appear as mainstream  No further (significant) benefit from making single processors more complex:  Logically we should use smaller and cheaper chips  or integrate more functionality on the same chip: e.g. the graphics pipeline  Poor catalog of parallel applications:  Single processor is still mainstream  Parallel programming is the privilege (knowledge) of a few

29 General Purpose CMP: why they appear as mainstream now ! The economic factor: -The consumer user pays euros for a PC -The professional user pays euros for a PC A constant: The processor represents % of the PC price Intel and AMD will not cut their share

30 General Purpose Multicore SMT in 2005: an industry reality  PCs:  Dual-core 2-thread SMT Pentium 4  Dual-core AMD64  Servers:  Itanium Montecito: dual-core, 2-thread SMT  IBM Power 5: dual-core, 2-thread SMT  Sun Niagara: 8-core, 4-thread

31 A 4-core 4-thread SMT: think about tuning for performance!  [Diagram: four cores P0-P3, each running 4 threads, over a shared L3 cache and a huge main memory]

32 A 4-core 4-thread SMT: think about tuning for performance! (2)  Write the application with more than 16 threads  Take into account first-level memory hierarchy sharing!  Take into account second-level memory hierarchy sharing!  Optimize for instruction level parallelism  Competing for memory bandwidth!!
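A minimal shape for such a program, using POSIX threads (a sketch: 16 threads summing disjoint contiguous chunks; the per-thread results are padded to a 64-byte cache line so cores that share their first-level hierarchy do not false-share):

```c
#include <pthread.h>

#define NTHREADS 16
#define N (1L << 20)

/* Pad each partial sum to a 64-byte line: avoids false sharing
   between cores (the first-level sharing issue above). */
static struct { double v; char pad[64 - sizeof(double)]; } part[NTHREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)  /* contiguous chunk: good locality */
        s += 1.0;
    part[id].v = s;
    return NULL;
}

double parallel_sum(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    double total = 0.0;
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        total += part[i].v;
    }
    return total;
}
```

With all 16 threads competing for the same memory interface, the chunked layout keeps each thread's traffic in its own cache lines, but memory bandwidth remains the shared bottleneck.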

33 Hardware TLP is there !! But where are the threads/processes ? A unique opportunity for the software industry: hardware parallelism comes for free

34 Waiting for the threads (1)  Generate threads to increase single-thread performance:  Speculative threads: predict threads at medium granularity - either software or hardware  Helper threads: run ahead a speculative skeleton of the application to: - avoid branch mispredictions - prefetch data

35 Waiting for the threads (2)  Hardware transient faults are becoming a concern:  run the same thread twice on two cores and check integrity  Security:  array bound checking is nearly free on an out-of-order core  encryption/decryption
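The redundant-execution idea fits in a few lines. A sequential sketch (on real hardware the two runs would go to two cores; `workload` is a hypothetical deterministic stand-in for a thread's work):

```c
/* Hypothetical deterministic workload standing in for a thread's work. */
static long workload(long n) {
    long s = 0;
    for (long i = 1; i <= n; i++)
        s += i;
    return s;
}

/* Redundant-execution sketch: run the same work twice and compare;
   a mismatch would signal a transient hardware fault. */
int run_checked(long n, long *out) {
    long a = workload(n);
    long b = workload(n);
    if (a != b)
        return -1;          /* integrity check failed */
    *out = a;
    return 0;
}
```

The spare hardware thread makes the second run nearly free, which is exactly the point of the slide.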

36 Waiting for the threads (3)  Hardware clock frequency is limited by:  the power budget when every core is running  temperature hot-spots  On a single-thread workload:  increase the clock frequency and migrate the process between cores

37 Conclusion  Hardware TLP is becoming mainstream for general-purpose computing  Moderate degrees of hardware TLP will be available for the mid-term  This is the first real opportunity for the whole software industry to go parallel!  But it might demand a new generation of application developers!!