High Performance Microprocessors
André Seznec, CAPS Team, IRISA/INRIA


Where are we (Fall 2000)?
- Clock: 400 MHz to 1.1 GHz
- 1 cycle = crossing the ALU + bypassing the result
- A floating-point operation: 2-4 cycles
- Register file read/write: 1 cycle (2-3 cycles in the next generation)
- L1 cache read/write: 1-3 cycles (trade-off between size, associativity and hit time)

Where are we (Fall 2000)?
- Integration: 0.25 µm to 0.18 µm (2001)
- Millions of transistors in logic
- Memory structures: up to 150 million transistors
- Power: 20 to 60 W, soon more than 100 W
- Pins: soon more than 1000

Where are we (Fall 2000)?
- x86 processors:
  - low end: 500 MHz, less than $100
  - high end: 1 GHz, about $1000
- DRAM: $1 per Mbyte
- SRAM: $50 per Mbyte

Binary compatibility
- PCs!
- Introducing a new ISA is a risky business!
- It might change with:
  - embedded applications (cell phones, automotive, ...)
  - the need for 64-bit architectures

32-bit or 64-bit architecture?
- Virtual addresses computed on 32 bits: PowerPC, x86, ...
- Virtual addresses computed on 64 bits: MIPS III, Alpha, UltraSparc, HP-PA 2.x, IA64
- In 2005: games? Word?
- 64 bits can't be avoided: what about x86?

Which instruction set?
- CISC: x86
- RISC: Sparc, MIPS, Alpha, PowerPC, HP-PA
- EPIC: IA-64
- Does it matter?

RISC ISA
- A single instruction width: 32 bits
  - simple decode, simple next-address computation
- Load/store architecture
- Simple addressing: based and indexed
- Register-register operations
- Advantage: very simple to pipeline

RISC (2)
- General registers + floating-point registers
- Memory accesses: 1-8 bytes, ALIGNED
- Limited support for speculative execution

CISC: x86
- Variable instruction size
- Operands in registers and in memory
- Destructive instructions: the result overwrites one of the operands
- Non-deterministic instruction execution time

EPIC: IA64
- EPIC IA64 = RISC
  + lots of registers (128 INT, 128 FP)
  + 3 instructions encoded in a 128-bit bundle
  + 64 predicate registers
  + hardware interlocks
  + (?) advanced loads
- The compiler is in charge of the ILP

EPIC IA64: not so good
- ISA support for software pipelining: rotating registers, loop management
- Register windows: with variable size!
- No based addressing, only post-incremented addressing
- (?) advanced loads

ISA: does it matter?
- 32 or 64 bits: towards x86 death (or mutation!)
- Performance:
  - x86? :=)
  - floating point!!
- At equal technology?

ISA: does it matter? (2)
- x86: translation into µops, 4 lost cycles! Or use a trace cache
- x86: too few floating-point registers
- Alpha 21264: 2 operands, 1 result; the best-performing RISC
- Itanium: IA64 in order, 800 MHz in 0.18 µm
- Alpha: ... MHz in 0.35 µm (1997)

So?
- x86: + installed base, + Intel, - must mutate to 64 bits
- IA64: + speculative execution, + Intel
- Alpha: + "clean" ISA, - what is Compaq's strategy?
- Sparc: = ISA, + Sun
- Power(PC): - complex ISA, = IBM and Motorola

The architect's challenge
- 400 mm² of silicon
- 3 technology generations ahead
- What will you use for performance?
  - pipelining
  - ILP
  - speculative execution
  - memory hierarchy
  - thread parallelism

Deeper and deeper pipelines
- Fewer gates crossed in a single cycle: 1 cycle = ALU crossing + feeding the result back
- Longer and longer intra-CPU communications: several cycles to cross the chip
- More and more complex control: more instructions in parallel, more speculation

A few pipeline depths
- ... cycles on Intel Pentium III
- ... cycles on AMD Athlon
- 7-9 cycles on Alpha
- 9 cycles on UltraSparc
- 10 cycles on Itanium
- 20 cycles on Willamette
- And it is likely to continue!

Instruction-level parallelism
- Superscalar:
  - the hardware is responsible for parallel issue
  - binary compatibility is preserved
- All general-purpose processors are superscalar

The superscalar degree
- Hard to define: "the performance that cannot be exceeded"
- 4 instructions/cycle on almost all existing processors
- 6 instructions/cycle on Itanium
- 8 instructions/cycle on the (future) Alpha
- ... on Alpha?? In 2012? :=)

Superscalar: the difficulties
- ILP is limited in applications: 3 to 8 instructions per cycle
- Register files:
  - the number of ports is increasing
  - access is becoming a critical path
- How to feed the FUs with instructions? With operands?
- How to manage data dependencies?
- Branch issues

In-order or out-of-order execution?
- In-order execution: ah! if only all latencies were statically known...
  - UltraSparc 3, Itanium
  - the compiler must do the work
- Out-of-order execution:
  - Alpha 21264, Pentium III, Athlon, Power(PC), HP-PA 8500

Why in-order execution?
- Simple to implement:
  - fewer transistors
  - shorter design cycle
  - faster clock (maybe)
  - shorter pipeline
- The compiler "sees" everything: the microarchitecture and the application

Why out-of-order execution?
- Static ILP is limited on non-regular codes
- The compiler does not see everything:
  - latencies unknown at compile time
  - memory (in)dependencies unknown at compile time
- No need for recompiling... hum!!

Out-of-order execution (1)
- Principle: execute as soon as possible, whenever operands and FUs are available
- Implementation: a large instruction window for selecting fireable instructions

Out-of-order execution (2)
- Sequencing consists of:
  - reading instructions in parallel
  - marking dependencies
  - renaming registers
  - dispatching to the FUs
  - waiting...
- Dependency management uses space and time (see the sketch below)
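To make the "waiting" step concrete, here is a minimal sketch of wakeup-and-select over an instruction window. All structures and names are hypothetical simplifications, not any real machine's logic: an entry issues once both source registers are ready and the issue width is not exhausted, oldest first.

```c
#include <stdbool.h>
#include <stdio.h>

#define WINDOW 8
#define NREGS  16

/* One entry of a (much simplified) instruction window. */
typedef struct {
    int  dest, src1, src2;   /* architectural register numbers */
    bool issued;
} Entry;

bool reg_ready[NREGS];       /* operand availability (scoreboard) */

/* One issue cycle: pick up to `width` entries whose operands are
   ready, oldest first; mark their destinations not-ready until
   "execution" completes.  Returns the number of issued entries. */
int issue_cycle(Entry *win, int n, int width) {
    int issued = 0;
    for (int i = 0; i < n && issued < width; i++) {
        Entry *e = &win[i];
        if (!e->issued && reg_ready[e->src1] && reg_ready[e->src2]) {
            e->issued = true;
            reg_ready[e->dest] = false;  /* consumers must wait */
            printf("issue: r%d <- r%d op r%d\n", e->dest, e->src1, e->src2);
            issued++;
        }
    }
    return issued;
}

int main(void) {
    for (int r = 0; r < NREGS; r++) reg_ready[r] = true;
    Entry win[2] = {
        { .dest = 3, .src1 = 1, .src2 = 2 },  /* r3 = r1 op r2 */
        { .dest = 4, .src1 = 3, .src2 = 2 },  /* depends on r3 */
    };
    issue_cycle(win, 2, 4);  /* only the first entry issues      */
    reg_ready[3] = true;     /* pretend r3 completed this cycle  */
    issue_cycle(win, 2, 4);  /* now the dependent entry issues   */
    return 0;
}
```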

Renaming registers: removing false dependencies
- WAW and WAR register hazards are not true dependencies; renaming eliminates them (see the sketch below)
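A minimal sketch of renaming, assuming a simple map table and an ever-growing allocation counter (real designs recycle physical registers at commit): each destination gets a fresh physical register, so WAW and WAR hazards on the architectural name disappear.

```c
#include <stdio.h>

#define NARCH 8     /* architectural registers      */
#define NPHYS 32    /* physical registers (assumed) */

int map[NARCH];     /* architectural -> physical    */
int next_phys = NARCH;

/* Rename one instruction "rd = rs1 op rs2": sources read the current
   mapping, the destination gets a fresh physical register, so a later
   writer of rd no longer conflicts (WAW) and pending readers of the
   old rd are unaffected (WAR). */
void rename_inst(int rd, int rs1, int rs2) {
    int p1 = map[rs1], p2 = map[rs2];
    int pd = next_phys++;        /* free-list allocation, simplified */
    map[rd] = pd;
    printf("p%d = p%d op p%d\n", pd, p1, p2);
}

int main(void) {
    for (int r = 0; r < NARCH; r++) map[r] = r;
    rename_inst(1, 2, 3);   /* r1 = r2 op r3                        */
    rename_inst(1, 4, 5);   /* r1 = r4 op r5: WAW on r1 disappears  */
    return 0;
}
```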

Memory dependencies
- To execute a store, the data to be stored is needed
- Every subsequent load is potentially dependent on any previous store
- Solution (old):
  - addresses are computed in program order
  - loads bypass stores, with hazard detection

Memory dependencies (2)
- Solution (current): optimistic out-of-order execution, repaired when a true dependency is violated
- Problem: repair cost
- Next generation: dependencies are predicted (see the sketch below)
  - Moshovos and Sohi, MICRO-30, 1997
  - Chrysos and Emer, ISCA-25, 1998
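The older, conservative scheme can be sketched as follows (hypothetical structures): a load searches the queue of older stores for an address match; on a hit it forwards the youngest matching data, otherwise it may safely bypass the stores and read the cache.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SQ_SIZE 8

typedef struct {
    uint32_t addr;
    uint32_t data;
    bool     valid;
} StoreEntry;

StoreEntry store_queue[SQ_SIZE];   /* oldest at index 0 */

/* A load may bypass older stores only if no older store writes the
   same address; on a match, forward the youngest matching store. */
bool load_bypass(uint32_t addr, uint32_t *out) {
    for (int i = SQ_SIZE - 1; i >= 0; i--) {  /* youngest first */
        if (store_queue[i].valid && store_queue[i].addr == addr) {
            *out = store_queue[i].data;       /* store-to-load forward */
            return true;
        }
    }
    return false;  /* no hazard: safe to read the data cache instead */
}

int main(void) {
    store_queue[0] = (StoreEntry){ .addr = 0x100, .data = 42, .valid = true };
    uint32_t v;
    if (load_bypass(0x100, &v))  printf("forwarded %u\n", v);
    if (!load_bypass(0x200, &v)) printf("0x200: read the cache\n");
    return 0;
}
```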

Control hazards
- A significant fraction of instructions are branches
- Target and direction are known very late in the pipeline:
  - cycle 7 on the DEC Alpha
  - cycle 11 on the Intel Pentium II
  - cycle 18 on Willamette
- Don't waste (all) these cycles!

Dynamic branch prediction
- Use history to predict the branch and fetch subsequent instructions speculatively
- Implementation: tables read in parallel with the I-cache
- What is predicted:
  - conditional branch direction and target
  - procedure returns
  - indirect branches

Dynamic branch prediction
- 1992: DEC 21064, 1-bit scheme
- 1993: Pentium, 2-bit scheme
- 1995: Pentium Pro, local history
- 1998: DEC 21264, hybrid predictor
- A sketch of the classic 2-bit scheme follows
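The 2-bit scheme of the Pentium era can be sketched in a few lines: a table of saturating counters indexed by the program counter, with hysteresis so one anomalous outcome does not flip a well-established prediction. Table size and indexing below are illustrative, not any particular machine's.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PHT_BITS 12
#define PHT_SIZE (1u << PHT_BITS)

/* Pattern history table of 2-bit saturating counters:
   0,1 -> predict not-taken; 2,3 -> predict taken. */
static uint8_t pht[PHT_SIZE];   /* zero-initialized: strongly not-taken */

static inline uint32_t pht_index(uint32_t pc) {
    return (pc >> 2) & (PHT_SIZE - 1);  /* low PC bits index the table */
}

bool predict(uint32_t pc) {
    return pht[pht_index(pc)] >= 2;
}

/* Update after resolution: saturate at 0 and 3 so a single anomalous
   outcome does not flip a well-established prediction. */
void update(uint32_t pc, bool taken) {
    uint8_t *c = &pht[pht_index(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}

int main(void) {
    uint32_t pc = 0x4000;
    for (int i = 0; i < 4; i++) update(pc, true);  /* train taken      */
    printf("predict taken? %d\n", predict(pc));    /* 1                */
    update(pc, false);                             /* one anomaly      */
    printf("still taken? %d\n", predict(pc));      /* 1: hysteresis    */
    return 0;
}
```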

Dynamic branch prediction: general trend
- More and more complex schemes
- Target prediction and direction prediction are decoupled
- A return address stack for procedure returns
- ISA support for speculative execution: CMOV, IA64 predicates (if-conversion, illustrated below)
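Predication is essentially if-conversion. The sketch below shows the transformation at the C level: the branchy form risks a misprediction on a deep pipeline, while the selecting form typically maps to a conditional move (x86, Alpha CMOV) or to a pair of predicated instructions (IA64).

```c
#include <stdio.h>

/* Branch form: a conditional branch the predictor may miss. */
int max_branch(int a, int b) {
    if (a > b) return a;
    return b;
}

/* Branchless form: both values are available and the condition
   selects the result; compilers typically emit CMOV or predicated
   instructions for this pattern. */
int max_predicated(int a, int b) {
    int p = (a > b);    /* predicate      */
    return p ? a : b;   /* selection only */
}

int main(void) {
    printf("%d %d\n", max_branch(3, 7), max_predicated(3, 7));  /* 7 7 */
    return 0;
}
```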

Out-of-order execution: "undoing"
- What must be undone:
  - incorrect branch prediction
  - incorrect independence prediction
  - interrupts, exceptions
- Validate (commit) in program order
- No definitive action is taken out of order

Being able to "undo"
- Either a copy of the register file is updated in program order,
  or a map of logical registers onto physical registers is saved (sketched below)
- Stores are performed at validation (commit)
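Recovery via a saved map can be sketched by checkpointing the rename table at each predicted branch (sizes hypothetical): on a misprediction, restoring the map undoes every rename performed by wrong-path instructions in a single step.

```c
#include <string.h>

#define NARCH    8
#define MAX_CKPT 4

int rename_map[NARCH];             /* current logical -> physical map    */
int checkpoints[MAX_CKPT][NARCH];  /* one saved map per in-flight branch */

/* Save the map when a branch is predicted... */
void checkpoint(int id) {
    memcpy(checkpoints[id], rename_map, sizeof rename_map);
}

/* ...and restore it on a misprediction: all renames performed past
   the branch disappear in one copy. */
void restore(int id) {
    memcpy(rename_map, checkpoints[id], sizeof rename_map);
}
```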

Memory hierarchy may dictate performance
- Example:
  - 4 instructions/cycle, 1 memory access per cycle
  - 10-cycle miss penalty on an L2 hit, 50 cycles for accessing memory
  - 2% instruction L1 misses, 4% data L1 misses, 1/4 of L1 misses also miss in L2
  - 400 instructions: 395 cycles (one possible accounting is sketched below)
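The accounting behind such an example can be reproduced with a back-of-envelope computation. The slide leaves some assumptions implicit (data references per instruction, overlap of penalties), so the sketch below, which assumes one data reference per cycle and no overlap, lands in the same region rather than on the exact figure.

```c
#include <stdio.h>

/* Back-of-envelope cycle accounting for the slide's example.  Treat
   this as one plausible reading, not the author's exact arithmetic. */
int main(void) {
    double insts        = 400.0;
    double issue_width  = 4.0;    /* ideal: 4 instructions/cycle */
    double data_refs    = 100.0;  /* assumption: 1 access/cycle  */
    double i_miss_rate  = 0.02;   /* per instruction             */
    double d_miss_rate  = 0.04;   /* per data reference          */
    double l2_miss_frac = 0.25;   /* fraction of L1 misses       */
    double l2_penalty   = 10.0, mem_penalty = 50.0;

    double base    = insts / issue_width;
    double l1_miss = i_miss_rate * insts + d_miss_rate * data_refs;
    double stalls  = l1_miss * ((1 - l2_miss_frac) * l2_penalty
                              + l2_miss_frac * mem_penalty);
    /* Same order of magnitude as the ~395 cycles quoted: CPI near 1
       despite 4-wide issue, which is the slide's point. */
    printf("base %.0f + stalls %.0f = %.0f cycles\n",
           base, stalls, base + stalls);
    return 0;
}
```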

Hum... in practice
- L1 caches are non-blocking
- The L2 cache is pipelined
- Memory is pipelined
- Hardware and software prefetch
- Latency and throughput are BOTH important

L1 caches: general trend
- 1-2 cycles for a read or write
- Multi-ported
- Non-blocking
- Small degree of associativity
- Will remain small

L2 caches: general trend
- Mandatory!
- On-chip or on the MCM
- Pipelined access
- "Short" latency: 7-12 cycles
- 16-byte bus, WILL GROW
- Cycle time: 1-3 processor cycles
- Contention on the L2 cache might become a bottleneck

Main memory
- Far, very far, too far from the core: several hundred instructions' worth of latency
- Memory on the system bus: too slow!
- Separate memory bus + system bus

Main memory on the system bus [figure]

Glueless memory connection [figure]

How to spend a billion transistors?
- IRAM: a processor and its memory on one die
- Ultimate speculation on a uniprocessor
- Thread parallelism:
  - on-chip multiprocessor
  - SMT processor

IRAM
- Processor and memory on a single die: huge memory bandwidth
- Not for general-purpose computing: memory requirements increase, so how does it scale?
- Technological issue: mixing DRAM and logic
- Cost-effective solution for lots of embedded applications

Ultimate speculation on a uniprocessor
- 16- or 32-way superscalar processor
- Hyperspeculation: branches, dependencies, data values...
- Challenges:
  - prediction accuracy
  - communications on the chip
  - contention: caches, registers, ...

Thread parallelism on the chip: it's time now!
- On-chip thread parallelism is possible and will arrive
- On-chip multiprocessor? IBM Power4 (end of 2001)
- Simultaneous multithreading? Compaq Alpha (2003)

On-chip multiprocessor
- May be the solution, but...
  - memory bandwidth?
  - single-thread performance?

On-chip multiprocessor: IBM Power4 (2001)
- Market segment: servers
- 2 superscalar processors on a chip, with a shared L2 cache
- 4 chips on an MCM (multi-chip module)
- Huge bandwidth on the chip and on the MCM

Programmer view [figure]

Simultaneous multithreading (SMT)
- FUs are underused
- SMT: share the FUs of a superscalar processor among several threads
- Advantages:
  - a single thread may use all the resources
  - structures are dynamically shared (caches, predictors, FUs)

SMT: Alpha (2003)
- 8-way superscalar
- Ultimate performance on a single thread
- If multithreaded or multiprogrammed, 4 threads share the hardware: overhead 5-10%

Programmer view [figure]

SMT: research directions
- Speculative multithreading: one step beyond branch prediction
- Multipath execution: execute both paths of a branch
- Subordinate threads: a slave thread works for the principal thread (prefetching, branch anticipation)

Predicting the future: the PC processor in 2010
- x86+ ISA, with a 64-bit mode
- 10-way superscalar SMT
- < 50 mm²
- Cache coherence by snooping

Predicting the future: the server processor in 2010
- IA64 or Alpha+ ISA, with predicate registers
- Multiple 10-way SMT processors
- Directory-based coherence

Programmer view! [figure]

Challenges
- Design complexity?
- How to program (and compile) for these monsters?