1 Thread level parallelism: It’s time now ! André Seznec IRISA/INRIA CAPS team.


2 Focus of high performance computer architecture
- Up to 1980: mainframes
- Up to 1990: supercomputers
- Till now: general purpose microprocessors
- Coming: mobile computing, embedded computing

3 Uniprocessor architecture has driven progress so far
The famous "Moore's Law"

4 Moore's "Law": transistors on a microprocessor
- The number of transistors on a microprocessor chip doubles every 18 months
- 1971: 2,300 transistors (Intel 4004)
- 1989: 1 M transistors (Intel 80486)
- 1999: 130 M transistors (HP PA-8500)
- 2005: 1.7 billion transistors (Intel Itanium Montecito)

5 Moore's "law": performance
Performance doubles every 18 months
- 1989: Intel 486 (< 1 inst/cycle)
- 1995: Pentium Pro, 150 MHz × 3 inst/cycle
- 2002: Pentium 4, 3 GHz × 3 inst/cycle
- 09/2005: Pentium 4, 3.2 GHz × 3 inst/cycle × 2 processors !!

6 Moore's "Law": memory
Memory capacity doubles every 18 months:
- 1983: 64 Kbit chips
- 1989: 1 Mbit chips
- 2005: 1 Gbit chips
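As a quick arithmetic check of the doubling rule, here is a small sketch; the 18-month period and the 1983 and 2005 data points come from the slide, the rest is back-of-the-envelope:

```python
def project_capacity(start_bits, years, doubling_months=18):
    """Project capacity under a fixed doubling period (Moore-style growth)."""
    return start_bits * 2 ** (years * 12 / doubling_months)

# 64 Kbit in 1983 projected forward to 2005, expressed in Gbit:
bits = project_capacity(64 * 1024, 2005 - 1983)
print(round(bits / 2**30, 1))  # → 1.6
```

Projecting the 1983 chip forward 22 years lands within a factor of two of the 1 Gbit chips of 2005, consistent with the slide's rule of thumb.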

7 And parallel machines, so far...
Parallel machines have been built from every processor generation:
- Tightly coupled shared-memory processors:
  - Dual-processor boards
  - Servers with up to 8 processors
- Distributed-memory parallel machines:
  - Hardware-coherent memory (NUMA): servers
  - Software-managed memory: clusters, clusters of clusters...

8 Hardware thread level parallelism has not been mainstream so far
But it might change...
But it will change

9 What has prevented hardware thread parallelism from prevailing ?
- Economic issue:
  - Hardware cost grew superlinearly with the number of processors
- Performance:
  - Parallel machines were never able to use the latest-generation microprocessor
- Scalability issue:
  - Bus snooping does not scale well above 4-8 processors
- Parallel applications are missing:
  - Writing parallel applications requires thinking "parallel"
  - Automatic parallelization works only on small code segments

10 What has prevented hardware thread parallelism from prevailing ? (2)
- We (the computer architects) were also guilty:
  - We kept finding ways to use these transistors in a uniprocessor
  - IC technology only brings the transistors and the frequency
  - We brought the performance:
    - The compiler folks helped a little bit

11 Up to now, what was microarchitecture about ?
- Memory access time is 100 ns
- Program semantics are sequential
- Instruction life (fetch, decode, .., execute, .., memory access, ..) is … ns
- How can we use the transistors to achieve the highest possible performance ?
- So far: up to 4 instructions every 0.3 ns

12 The processor architect challenge
- 300 mm² of silicon
- 2 technology generations ahead
- What can we use for performance ?
  - Pipelining
  - Instruction Level Parallelism
  - Speculative execution
  - Memory hierarchy

13 Pipelining
Just slice the instruction life into equal stages and launch concurrent execution:
[Pipeline diagram: successive instructions I0, I1, I2 advance through the IF, DC, EX, M, CT stages one cycle apart]
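The payoff of this slicing can be quantified with the textbook ideal-speedup formula, assuming no stalls or hazards (which real pipelines do not enjoy):

```python
def pipeline_speedup(n_instructions, n_stages):
    """Ideal speedup of an n_stages-deep pipeline over sequential execution."""
    sequential_cycles = n_instructions * n_stages
    # The pipeline fills for n_stages cycles, then retires one instruction per cycle.
    pipelined_cycles = n_stages + (n_instructions - 1)
    return sequential_cycles / pipelined_cycles

# With the 5 stages on the slide (IF, DC, EX, M, CT), a long
# instruction stream approaches a speedup equal to the pipeline depth:
print(round(pipeline_speedup(1_000_000, 5), 2))  # → 5.0
```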

14 + Instruction Level Parallelism
[Diagram: multiple instructions enter the IF, DC, EX, M, CT pipeline each cycle and proceed in parallel]

15 + out-of-order execution
[Diagram: an instruction waits until its operands are valid, then executes as soon as it is ready, possibly out of program order]

16 + speculative execution
- …% of instructions are branches:
  - Cannot afford to wait 30 cycles for direction and target
- Predict and execute speculatively:
  - Validate at execution time
  - State-of-the-art predictors: ≈ 2 mispredictions per 1000 instructions
- Also predict:
  - Memory (in)dependence
  - (limited) data values
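As an illustration of the prediction principle, here is a toy single 2-bit saturating counter, one of the classic schemes; the state-of-the-art predictors mentioned above are far more elaborate:

```python
def predict_2bit(outcomes):
    """Simulate one 2-bit saturating counter as a branch predictor.
    Counter states 0-1 predict not-taken, states 2-3 predict taken.
    Returns the fraction of correct predictions."""
    counter = 2  # start in "weakly taken"
    correct = 0
    for taken in outcomes:
        prediction = counter >= 2
        if prediction == taken:
            correct += 1
        # Saturating update toward the actual outcome.
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return correct / len(outcomes)

# A loop branch taken 9 times then not taken once, repeated 100 times:
trace = ([True] * 9 + [False]) * 100
print(predict_2bit(trace))  # → 0.9
```

The 2-bit hysteresis means the single loop-exit misprediction does not flip the prediction for the next loop, which is exactly why this beats a 1-bit scheme on loops.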

17 + memory hierarchy
- Main memory response time: ≈ 100 ns ≈ 1000 instructions
- Use a memory hierarchy:
  - L1 caches: 1-2 cycles, 8-64 KB
  - L2 cache: 10 cycles, 256 KB-2 MB
  - L3 cache (coming): 25 cycles, 2-8 MB
- + prefetching to avoid cache misses
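The benefit of the hierarchy follows from the standard average-memory-access-time arithmetic; the 100 ns memory latency is from the slide, while the per-level hit times and hit rates below are illustrative assumptions:

```python
def amat(levels, memory_ns):
    """Average memory access time for a cache hierarchy.
    levels: list of (hit_time_ns, hit_rate) from L1 outward.
    Each level's hit time is paid by the fraction of accesses that
    reach it; misses fall through to the next level, then to memory."""
    total, reach = 0.0, 1.0
    for hit_time_ns, hit_rate in levels:
        total += reach * hit_time_ns
        reach *= 1.0 - hit_rate
    return total + reach * memory_ns

# Illustrative numbers: L1 0.5 ns / 95 %, L2 3 ns / 80 %, L3 8 ns / 60 %
hierarchy = [(0.5, 0.95), (3.0, 0.80), (8.0, 0.60)]
print(round(amat(hierarchy, 100.0), 2))  # → 1.13, far below the raw 100 ns
```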

18 Can we continue to just throw transistors at uniprocessors ?
- Increasing the superscalar degree ?
- Larger caches ?
- New prefetch mechanisms ?

19 One billion transistors now !! The uniprocessor road seems over
- A 16-way uniprocessor seems out of reach:
  - Just not enough ILP
  - Quadratic complexity of a few key (power-hungry) components (register file, bypass network, issue logic)
  - Avoiding temperature hot spots would require very long intra-CPU communications
- 5-7 years to design a 4-way superscalar core: how long to design a 16-way ?

20 One billion transistors: thread level parallelism, it's time now !
- Chip multiprocessor
- Simultaneous multithreading:
  - TLP on a uniprocessor !

21 General purpose Chip MultiProcessor (CMP): why it did not (really) appear before 2003
- Until 2003 there was a better (economic) use for the transistors:
  - Single-process performance is the most important
  - More complex superscalar implementations
  - More cache space:
    - Bring the L2 cache on-chip
    - Enlarge the L2 cache
    - Include an L3 cache (now)
- Diminishing returns !!

22 General Purpose CMP: why it still should not appear as mainstream
- No further (significant) benefit from making single processors more complex:
  - Logically we should use smaller and cheaper chips
  - Or integrate more functionality on the same chip: e.g. the graphics pipeline
- Very poor catalog of parallel applications:
  - Single-processor software is still mainstream
  - Parallel programming is the privilege (knowledge) of a few

23 General Purpose CMP: why they appear as mainstream now !
The economic factor:
- The consumer user pays … euros for a PC
- The professional user pays … euros for a PC
- A constant: the processor represents …% of the PC price
- Intel and AMD will not cut their share

24 The Chip Multiprocessor
- Put a shared-memory multiprocessor on a single die:
  - Duplicate the processor and its L1 cache (maybe the L2 as well)
  - Keep the caches coherent
  - Share the last level of the memory hierarchy (maybe)
  - Share the external interface (to memory and system)

25 Chip multiprocessor: what is the situation (2005) ?
- PCs: dual-core Pentium 4 and AMD64
- Servers:
  - Itanium Montecito: dual-core
  - IBM Power5: dual-core
  - Sun Niagara: 8-processor CMP

26 The server vision: IBM Power 4

27 Simultaneous Multithreading (SMT): parallel processing on a uniprocessor
- Functional units are underused on superscalar processors
- SMT: share the functional units of a superscalar processor between several processes
- Advantages:
  - A single process can use all the resources
  - Dynamic sharing of all structures on parallel/multiprocess workloads

28 Superscalar vs. SMT
[Diagram: issue slots over time; the superscalar core leaves many slots empty, while SMT fills them with instructions from several threads]
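The utilization gap can be illustrated with a toy issue-slot model; the thread count, issue width, and ready-instruction distribution below are arbitrary assumptions, not measurements:

```python
import random

def issued(per_thread_ready, width, smt):
    """Issue slots filled in one cycle by a `width`-wide core.
    A plain superscalar issues from a single thread; SMT draws
    from the ready instructions of all threads."""
    if smt:
        return min(width, sum(per_thread_ready))
    return min(width, per_thread_ready[0])

random.seed(0)
# Four threads, each with 0-2 ready instructions per cycle, 10,000 cycles.
cycles = [[random.randint(0, 2) for _ in range(4)] for _ in range(10_000)]
ss = sum(issued(c, 4, smt=False) for c in cycles) / len(cycles)
smt = sum(issued(c, 4, smt=True) for c in cycles) / len(cycles)
print(ss < smt)  # SMT fills more issue slots on average
```

Any one thread rarely has 4 instructions ready, so the superscalar core averages well under its width; pooling the threads' ready instructions fills most of the slots.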

29 The programmer view

30 SMT: Alpha 21464 (cancelled June 2001)
- 8-way superscalar
- Ultimate performance on a single process
- SMT: up to 4 contexts
- Extra cost in silicon, design, etc.: estimated at 5-10 %

31 General Purpose Multicore SMT: an industry reality (Intel and IBM)
- The Intel Pentium 4 was developed as a 2-context SMT:
  - Coined "hyperthreading" by Intel
  - Dual-core SMT
- Intel Itanium Montecito: dual-core, 2-context SMT
- IBM Power5: dual-core, 2-context SMT

32 The programmer view of a multi-core SMT !

33 Hardware TLP is there !! But where are the threads ? A unique opportunity for the software industry: hardware parallelism comes for free

34 Waiting for the threads (1)
- Artificially generate threads to increase the performance of single threads:
  - Speculative threads: predict threads at medium granularity, either in software or in hardware
  - Helper threads: run ahead a speculative skeleton of the application to:
    - Avoid branch mispredictions
    - Prefetch data

35 Waiting for the threads (2)
- Hardware transient faults are becoming a concern:
  - Run the same thread twice on two cores and check integrity
- Security:
  - Array bound checking is nearly free on an out-of-order core
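The fault-detection idea above can be sketched in software: run the same computation on two threads and cross-check the results. Real designs do this in hardware across lock-stepped cores; this is only an illustration:

```python
import queue
import threading

def redundant_run(fn, *args):
    """Run fn twice on separate threads and cross-check the results:
    a software sketch of redundant execution for transient-fault detection."""
    results = queue.Queue()
    workers = [threading.Thread(target=lambda: results.put(fn(*args)))
               for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    first, second = results.get(), results.get()
    if first != second:
        # A disagreement signals a possible transient fault in one copy.
        raise RuntimeError("results diverge: possible transient fault")
    return first

print(redundant_run(sum, range(100)))  # → 4950
```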

36 Waiting for the threads (3)
- Hardware clock frequency is limited by:
  - The power budget with every core running
  - Temperature hot spots
- On single-thread workloads:
  - Increase the clock frequency and migrate the process

37 Conclusion
- Hardware TLP is becoming mainstream on general-purpose hardware
- Moderate degrees of hardware TLP will be available for the mid-term
- This is the first real opportunity for the whole software industry to go parallel !
- But it might demand a new generation of application developers !!