Lynn Choi School of Electrical Engineering Microprocessor Microarchitecture The Past, Present, and Future of CPU Architecture

Contents
  Performance of Microprocessors
  Past: ILP Saturation
    I. Superscalar Hardware Complexity
    II. Limits of ILP
    III. Power Inefficiency
  Present: TLP Era
    I. Multithreading
    II. Multicore
  Present: Today's Microprocessor
    Intel Core 2 Quad, Sun Niagara II, and ARM Cortex-A9 MPCore
  Future: Looking into the Future
    I. Manycores
    II. Multiple Systems on Chip
    III. Trend – Change of Wisdoms

CPU Performance
  T_exe (execution time per program) = NI × CPI × T_cycle
    NI = # of instructions / program (program size)
    CPI = clock cycles / instruction
    T_cycle = seconds / clock cycle (clock cycle time)
    (A worked example follows below.)
  To increase performance:
    Decrease NI (program size): instruction set architecture (CISC vs. RISC), compilers
    Decrease CPI (i.e., increase IPC): instruction-level parallelism (superscalar, VLIW)
    Decrease T_cycle (i.e., increase clock speed): pipelining, process technology
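As a quick worked illustration of the equation above (all numbers are hypothetical, chosen only to make the arithmetic concrete), a minimal C sketch:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical workload: values chosen only to illustrate
       T_exe = NI * CPI * T_cycle, not measured on any real CPU. */
    double ni      = 1e9;           /* instructions per program     */
    double cpi     = 1.5;           /* clock cycles per instruction */
    double f_clock = 2.8e9;         /* clock frequency in Hz        */
    double t_cycle = 1.0 / f_clock; /* seconds per clock cycle      */

    double t_exe = ni * cpi * t_cycle;
    printf("T_exe        = %.3f s\n", t_exe);  /* ~0.536 s here */

    /* Halving CPI (e.g., via wider superscalar issue) halves T_exe. */
    printf("T_exe, CPI/2 = %.3f s\n", ni * (cpi / 2.0) * t_cycle);
    return 0;
}
```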

Advances in Intel Microprocessors
  Figure: SPECInt95 performance vs. clock speed
    486 DX2 66MHz (pipelined)
    Pentium 100MHz (superscalar, in-order)
    Pentium Pro 200MHz (superscalar, out-of-order): 8.09
    Pentium II 300MHz (superscalar, out-of-order): 11.6
    Pentium III 600MHz (superscalar, out-of-order): 24
    Pentium IV 1.7GHz (superscalar, out-of-order): (projected)
    Pentium IV 2.8GHz (superscalar, out-of-order): 81.3 (projected)
  Of the overall gain: ~42X from clock speed increases, ~2X from IPC increases

Microprocessor Performance Curve

ILP Saturation I – Hardware Complexity
  Superscalar hardware is not scalable in terms of issue width!
    Limited instruction fetch bandwidth
    Renaming complexity ∝ (issue width)²
    Wakeup & selection logic ∝ (instruction window size)²
    Bypass logic complexity ∝ (# of FUs)²
    Also: on-chip wire delays, # of register and memory access ports, etc.
  Higher IPC implies lowering the clock speed! (A toy calculation follows below.)
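To make the quadratic growth concrete, a toy calculation assuming a fully connected bypass network (every result produced per cycle can feed every source operand read per cycle; an illustrative model, not any particular design):

```c
#include <stdio.h>

/* Toy model: with full bypassing, each of N results produced per cycle
   can feed each of ~2N source operands read per cycle, so the number
   of bypass paths grows as 2*N*N, i.e., quadratically in issue width. */
int main(void) {
    for (int width = 2; width <= 16; width *= 2)
        printf("issue width %2d -> %4d bypass paths\n",
               width, 2 * width * width);
    return 0;
}
```

Doubling the issue width quadruples the bypass paths (and similarly the renaming and wakeup/selection logic), which lengthens critical wires and forces a slower clock.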

ILP Saturation II – Limits of ILP
  Even with a very aggressive superscalar microarchitecture:
    2K-entry instruction window
    Max. 64 instruction issues per cycle
    8K-entry tournament predictors
    2K-entry jump and return predictors
    256 integer and 256 FP registers
  The available ILP is only 3 ~ 6!

ILP Saturation III – Power Inefficiency
  Increasing the issue rate is not energy efficient
  Increasing the clock rate is also not energy efficient
    A higher clock rate increases transistor switching frequency
    A faster clock needs a deeper pipeline, but the pipelining overhead grows even faster
  Existing processors already reach the power limit
    A 1.6GHz Itanium 2 consumes 130W of power!
    Temperature problem: Pentium power density passed that of a hot plate ('98) and was projected to pass a nuclear reactor in 2005, and eventually a rocket nozzle
  Higher IPC and higher clock speed have been pushed to their limits!
  Figure: peak issue rate vs. sustained issue rate & performance, as hardware complexity & power grow

TLP Era I – Multithreading
  Multithreading
    Interleaves multiple independent threads into the pipeline every cycle
    Each thread has its own PC, register file, and branch prediction structures, but shares the instruction pipeline and backend execution units
    Increases resource utilization & throughput for multiple-issue processors
    Improves total system throughput (IPC) at the expense of single-program performance
  Figure: issue-slot utilization under superscalar, fine-grain multithreading, and SMT (a software-level sketch follows below)
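The same throughput argument can be seen from the software side. A minimal pthreads sketch (the work loop is hypothetical, purely for illustration) in which two independent threads give an SMT or multicore CPU two instruction streams to interleave:

```c
#include <pthread.h>
#include <stdio.h>

/* Each thread runs an independent instruction stream; SMT or multicore
   hardware can overlap them, raising total throughput even when one
   stream alone cannot fill the issue slots. */
static void *worker(void *arg) {
    long id = (long)arg;
    volatile double x = 1.0;            /* volatile: keep the loop alive */
    for (long i = 0; i < 100000000L; i++)
        x = x * 1.0000001 + id;         /* hypothetical compute kernel   */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    puts("both threads done");
    return 0;
}
```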

TLP Era I – Multithreading
  IBM 8-processor POWER5 with SMT (2 threads per core)
    Running two copies of an application in SMT mode versus single-thread mode yields a 23% improvement in SPECintRate and a 16% improvement in SPECfpRate

TLP Era II – Multicore
  Multicore: single-chip multiprocessing
    Easy to design and verify functionally
    Excellent performance/watt: P_dyn = α·C_L·V_DD²·F
      A dual core at half the clock speed (and, correspondingly, roughly half the supply voltage) can achieve the same throughput with only ¼ of the power consumption!
      The dual core consumes 2 · C·(0.5·V_DD)²·(0.5·F) = 0.25 · C·V_DD²·F (see the sketch below)
    Packaging, cooling, reliability
      Power also determines the cost of packaging/cooling
      Chip temperature must be limited to avoid reliability issues and leakage power dissipation
    Improved throughput with minor degradation in single-program performance
      For multiprogramming workloads and multi-threaded applications
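A minimal sketch of the arithmetic above, under the stated assumption that supply voltage can be scaled down roughly in proportion to frequency:

```c
#include <stdio.h>

/* Dynamic power: P = a * C * V^2 * F (activity factor, switched
   capacitance, supply voltage, frequency). Normalize a*C = 1 and
   compare one core at (V, F) against two cores at (V/2, F/2), which
   give the same total throughput if the workload parallelizes. */
int main(void) {
    double V = 1.0, F = 1.0;
    double p_single = V * V * F;                           /* 1.00 */
    double p_dual   = 2.0 * (0.5*V)*(0.5*V) * (0.5*F);     /* 0.25 */
    printf("single core: %.2f, dual core: %.2f\n", p_single, p_dual);
    /* Prints 1.00 vs 0.25: same throughput at one quarter the power. */
    return 0;
}
```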

Today's Microprocessor
  Intel Core 2 Quad processor (code name "Yorkfield")
  Technology
    45nm process, 820M transistors, two 107 mm² dies
    2.83 GHz, two 64-bit dual-core dies in one MCM package
  Core microarchitecture
    Next-generation multicore microarchitecture introduced in 2006
    Derived from the P6 microarchitecture
    Optimized for multicore operation and lower power consumption
      Lower clock speeds for lower power but higher performance
      About 1/2 the power (up to 65W) but more performance compared to the dual-core Pentium D
    14-stage, 4-issue out-of-order (OOO) pipeline
    64-bit Intel architecture (x86-64)
    Two unified 6MB L2 caches (one per die)
    1333MHz system bus

Today's Microprocessor
  Sun UltraSPARC T2 processor ("Niagara II")
  Multithreaded multicore technology
    Eight 1.4 GHz cores, 8 threads per core → 64 threads in total
    65nm process, 1831-pin BGA, 503M transistors, 84W power consumption
  Core microarchitecture
    Two-issue, 8-stage instruction pipelines & a pipelined FPU per core
    4MB L2 cache in 8 banks; 64 FB-DIMMs; 60+ GB/s memory bandwidth
    Security coprocessor per core; dual 10Gb Ethernet; PCI Express

Today's Microprocessor
  ARM Cortex-A9 MPCore
    ARMv7 ISA
    Supports complex OSes and multiuser applications
    2-issue superscalar, 8-stage OOO pipeline
    FPU supports both SP and DP operations
    NEON SIMD media processing engine
    MPCore technology that can support 1 ~ 4 cores

Future CPU Microarchitecture I – MANYCORE
  Idea
    Double the number of cores on a chip with each silicon generation
    1000 cores will be possible with 30nm technology
  Figure: # of cores per chip – Intel Pentium 4 (1), Intel Pentium D (2), Intel Core 2 Duo (2), Intel Core 2 Quad (4), Intel Dunnington (6), Intel Core i7 (8), IBM Power4 (2), IBM Cell (9), Sun UltraSPARC T1 (8), Sun Victoria Falls (16), Intel Teraflops (80)

Future CPU Microarchitecture I – MANYCORE
  Architecture
    Core architecture
      Should be the most efficient in MIPS/watt and MIPS/silicon
      Modestly pipelined (8~14 stages) in-order pipeline
    System architecture
      Heterogeneous vs. homogeneous MP
        Heterogeneous in terms of functionality (e.g., CPU + DSP + GPU on one chip)
        Heterogeneous in terms of performance: by Amdahl's Law, the serial fraction limits speedup, so at least one fast core pays off (see the sketch below)
      Shared vs. distributed memory MP
        Shared-memory multicore: most existing multicores; preserves the programming paradigm via binary compatibility and cache coherence
        Distributed-memory multicore: more scalable hardware, suitable for manycore architectures
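Why Amdahl's Law argues for at least one fast core: a small sketch computing the speedup limit for a hypothetical 10% serial fraction (the fraction is illustrative only):

```c
#include <stdio.h>

/* Amdahl's Law: speedup(N) = 1 / (s + (1 - s)/N), where s is the
   serial fraction of the program and N the number of cores. */
static double amdahl(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void) {
    double s = 0.10;                    /* hypothetical serial fraction */
    for (int n = 1; n <= 1024; n *= 4)
        printf("%4d cores -> %5.2fx speedup\n", n, amdahl(s, n));
    /* Saturates near 1/s = 10x: the serial part dominates, so making
       one core faster (a heterogeneous MP) is worth its silicon. */
    return 0;
}
```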

Future CPU Microarchitecture I – MANYCORE
  Issues
    On-chip interconnects
      Buses and crossbars will not scale to 1000 cores!
      Packet-switched point-to-point interconnects: ring (IBM Cell), 2D/3D mesh/torus (RAW) networks
      These can provide scalable bandwidth. But how about latency? (See the toy model below.)
    Cache coherence
      Bus-based snooping protocols cannot be used!
      Directory-based protocols work for up to ~100 cores
      More simplified and flexible coherence protocols will be needed to leverage the improved bandwidth and low latency
      Caches can be adapted between private and shared configurations
      More direct control over the memory hierarchy, or software-managed caches
    Off-chip pin bandwidth
      Manycores will unleash a much higher number of MIPS in a single chip
      More demand on I/O pin bandwidth: need to achieve 100 GB/s ~ 1TB/s of memory bandwidth
      More demand on DRAM out of total system silicon
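To see the latency concern, a toy model (assuming uniform random traffic and unit-cost hops; the constants are illustrative only) comparing a crossbar with a k × k 2D mesh:

```c
#include <stdio.h>

/* Toy model: a crossbar reaches any core in ~1 hop regardless of N,
   while on a k x k 2D mesh (N = k*k) the average Manhattan distance
   under uniform random traffic is about k/3 per dimension, i.e.
   ~2k/3 hops total: latency grows as sqrt(N) even though the mesh's
   aggregate bandwidth scales with N. */
int main(void) {
    for (int k = 4; k <= 32; k *= 2) {
        int n = k * k;
        double mesh_hops = 2.0 * k / 3.0;   /* average hops, open mesh */
        printf("%4d cores: crossbar ~1 hop, mesh ~%.1f hops\n",
               n, mesh_hops);
    }
    return 0;
}
```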

Future CPU Microarchitecture I – MANYCORE
  Projection
    Pin I/O bandwidth cannot sustain the memory demands of manycores
    Multicore may work well from 2 to 8 processors on a chip
      Diminishing returns as 16 or 32 processors are reached, just as returns fell with ILP beyond the 4~6 issue widths available now
    But for applications with high TLP, manycore will be a good design choice
      Network processors, Intel's RMS (Recognition, Mining, Synthesis)

Future CPU Architecture II – Multiple SoC
  Idea – system on chip!
    Integrate main memory on chip
      Much higher memory bandwidth and reduced memory access latencies
  Memory hierarchy issues
    For memory expansion, off-chip DRAM may still need to be provided
      This implies multiple levels of DRAM in the memory hierarchy
      On-chip DRAM can then be used as a cache for the off-chip DRAM
    On-chip memory is divided into SRAM and DRAM
      Should we use the SRAM for caches?
  Multiple systems on chip (figure)
    A single monolithic DRAM shared by multiple cores, vs. DRAM blocks distributed across the cores

Intel Terascale processor
  Features
    80 processor cores; 1.01 TFLOPS at 1.0V; 62W; 100M transistors
    3D stacked memory
    Mesh interconnect providing 80GB/s of bandwidth
  Challenges
    On-die power dissipation
    Off-chip memory bandwidth
    Cache hierarchy design and coherence

Trend – Change of Wisdoms
  1. Old: Power is free, but transistors are expensive.
     New ("power wall"): Power is expensive, but transistors are "free".
  2. Old: Regarding power, the only concern is dynamic power.
     New: For desktops/servers, static power due to leakage can be 40% of total power.
  3. Old: We can reveal more ILP via compiler/architecture innovation.
     New ("ILP wall"): There are diminishing returns on finding more ILP.
  4. Old: Multiply is slow, but load and store is fast.
     New ("memory wall"): Load and store is slow, but multiply is fast. It takes ~200 clocks to access DRAM, but an FP multiply may take only 4 clock cycles.
  5. Old: Uniprocessor performance doubles every 18 months.
     New (power wall + memory wall + ILP wall): The doubling of uniprocessor performance may now take 5 years.
  6. Old: Don't bother parallelizing your application; just wait and run it on a faster sequential computer.
     New: It will be a very long wait for a faster sequential computer.
  7. Old: Increasing clock frequency is the primary method of improving processor performance.
     New: Increasing parallelism is the primary method of improving processor performance.