ECE 4100/6100 (1) Multicore Computing - Evolution

ECE 4100/6100 (2) Performance Scaling [Chart: performance scaling across the Pentium®, Pentium® Pro, and Pentium® 4 architectures] Source: Shekhar Borkar, Intel Corp.

ECE 4100/6100 (3) Intel: homogeneous cores, bus-based on-chip interconnect, shared memory, traditional I/O. Classic OOO cores: reservation stations, issue ports, schedulers, etc. Large, shared, set-associative caches with prefetching. Source: Intel Corp.

ECE 4100/6100 (4) IBM Cell Processor: heterogeneous multicore with co-processor accelerators, a classic (stripped-down) core, high-bandwidth multiple buses, and high-speed I/O. Source: IBM

ECE 4100/6100 (5) AMD Au1200 System-on-Chip: custom cores, embedded processor, on-chip I/O, on-chip buses. Source: AMD

ECE 4100/6100 (6) PlayStation 2 Die Photo (SoC) [Die photo; annotated region: floating-point MACs] Source: IEEE Micro, March/April 2000

ECE 4100/6100 (7) Multi-* is Happening Source: Intel Corp.

ECE 4100/6100 (8) Intel’s Roadmap for Multicore [Roadmap chart: desktop, mobile, and enterprise lines evolving from single-core (SC) parts with 512KB–2MB caches, through dual-core (DC) parts with 2MB–16MB caches (several shared; 3MB/6MB shared at 45nm), to quad-core (QC) parts with 4MB–16MB shared caches and eight-core (8C) parts with 12MB shared caches at 45nm] Drivers are –Market segments –More cache –More cores Source: Adapted from Tom’s Hardware

ECE 4100/6100 (9) Distillation Into Trends Technology Trends –What can we expect/project? Architecture Trends –What are the feasible outcomes? Application Trends –What are the driving deployment scenarios? –Where are the volumes?

ECE 4100/6100 (10) Technology Scaling 30% scaling down in dimensions  doubles transistor density Power per transistor –V_dd scaling  lower power Transistor delay = C_gate V_dd / I_SAT –C_gate, V_dd scaling  lower delay [MOSFET cross-sections: gate, source, drain, body, oxide thickness t_ox, channel length L]
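The arithmetic behind these bullets is classical (Dennard) scaling; a sketch with scale factor s ≈ 0.7 per generation (the 30% shrink quoted above), assuming ideal scaling of voltage and current:

\[ L,\, W,\, t_{ox},\, V_{dd} \;\to\; s\,(L,\, W,\, t_{ox},\, V_{dd}), \qquad s \approx 0.7 \]
\[ \text{density} \propto \frac{1}{L\,W} \;\to\; \frac{1}{s^{2}} \approx 2\times, \qquad \tau = \frac{C_{gate}\,V_{dd}}{I_{SAT}} \;\to\; s\,\tau \]
\[ P_{transistor} \propto C_{gate}\,V_{dd}^{2}\,f \;\to\; s^{2}\,P \quad\Rightarrow\quad \text{constant power density} \]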

ECE 4100/6100 (11) Fundamental Trends [Table rows, across successive high-volume-manufacturing years, technology nodes (nm), and integration capacities (billions of transistors, BT)]
–Delay = CV/I scaling: 0.7, then ~0.7, then >0.7  delay scaling will slow down
–Energy/logic-op scaling: >0.35, then >0.5  energy scaling will slow down
–Bulk planar CMOS: high probability  low probability
–Alternate devices, 3G, etc.: low probability  high probability
–Variability: medium  high  very high
–ILD (K): ~3, then <3, reducing slowly
–RC delay and metal layers: up to 1 additional metal layer per generation
Source: Shekhar Borkar, Intel Corp.

ECE 4100/6100 (12) Moore’s Law How do we use the increasing number of transistors? What are the challenges that must be addressed? Source: Intel Corp.

ECE 4100/6100 (13) Impact of Moore’s Law To Date
–Memory: push the memory wall  larger caches
–Frequency: increase frequency  deeper pipelines
–ILP: increase ILP  concurrent threads, branch prediction, and SMT
–Power: manage power  clock gating, activity minimization
[Die photo: IBM Power5] Source: IBM

ECE 4100/6100 (14) Shaping Future Multicore Architectures The ILP Wall –Limited ILP in applications The Frequency Wall –Not much headroom The Power Wall –Dynamic and static power dissipation The Memory Wall –Gap between compute bandwidth and memory bandwidth Manufacturing –Non-recurring engineering costs –Time to market

ECE 4100/6100 (15) The Frequency Wall Not much headroom left in the stage-to-stage times (currently 8-12 FO4 delays) Increasing frequency leads to the power wall Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, Doug Burger. Clock rate versus IPC: the end of the road for conventional microarchitectures. In ISCA 2000

ECE 4100/6100 (16) Options Increase performance via parallelism –On chip this has been largely at the instruction/data level; the 1990s through 2005 was the era of instruction-level parallelism –Single-instruction multiple-data/vector parallelism: MMX, SSE (streaming SIMD), vector co-processors –Out-of-order (OOO) execution cores –Explicitly Parallel Instruction Computing (EPIC) Have we exhausted the options within a single thread?

ECE 4100/6100 (17) The ILP Wall - Past the Knee of the Curve? [Plot: performance vs. “effort”, with design points for scalar in-order, moderate-pipeline superscalar/OOO, and very-deep-pipeline aggressive superscalar/OOO] Made sense to go superscalar/OOO: good ROI. Very little gain for substantial further effort. Source: G. Loh

ECE 4100/6100 (18) The ILP Wall Limiting phenomena for ILP extraction: –Clock rate: at the wall, each increase in clock rate has a corresponding CPI increase (branches, other hazards) –Instruction fetch and decode: at the wall, no more instructions can be fetched and decoded per clock cycle –Cache hit rate: poor locality can limit ILP and it adversely affects memory bandwidth –ILP in applications: the serial fraction of applications Reality: –Limit studies cap IPC at (using an ideal processor) –Current processors achieve IPC of only 1-2
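To make the interaction explicit, the standard performance equation (textbook material, not from the slide) ties these items together:

\[ T_{exec} = N_{instr} \times CPI \times t_{clk} = \frac{N_{instr}}{IPC \times f_{clk}} \]

Raising f_clk through deeper pipelining raises CPI via branch and hazard penalties, and raising IPC requires wider fetch/decode and better cache behavior, which is exactly where the bullets above place the wall.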

ECE 4100/6100 (19) The ILP Wall: Options Increase granularity of parallelism –Simultaneous Multi-threading to exploit TLP TLP has to exist  otherwise poor utilization results –Coarse grain multithreading –Throughput computing New languages/applications –Data intensive computing in the enterprise –Media rich applications

ECE 4100/6100 (20) The Memory Wall [Plot: performance vs. time; µProc performance (“Moore’s Law”) improves ~60%/yr while DRAM improves ~7%/yr, so the processor-memory performance gap grows ~50% per year]

ECE 4100/6100 (21) The Memory Wall Increasing the number of cores increases the demanded memory bandwidth What architectural techniques can meet this demand? [Plot: average memory access time vs. year]
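The quantity on that plot is average memory access time (AMAT); its standard decomposition (textbook formula, not from the slide) shows where extra cores hurt:

\[ AMAT = t_{hit} + (\text{miss rate}) \times (\text{miss penalty}) \]

More cores sharing one memory system raise the aggregate miss traffic, which inflates the effective miss penalty through contention and queuing at the memory controllers.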

ECE 4100/6100 (22) The Memory Wall [Die photos: AMD dual-core Athlon FX (CPU 0 / CPU 1) and IBM Power5] On-die caches are both area-intensive and power-intensive –StrongARM dissipates more than 43% of its power in caches –Caches incur huge area costs Larger caches never deliver the near-universal performance boost offered by frequency ramping (Source: Intel)

ECE 4100/6100 (23) The Power Wall Power per transistor scales with frequency but also scales with V_dd –Lower V_dd can be compensated for with increased pipelining to keep throughput constant –Power per transistor is not the same as power per unit area  power density is the problem! –Multiple units can be run at lower frequencies to keep throughput constant, while saving power
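The slide is leaning on the standard CMOS dynamic-power relation; writing it out (textbook formula, not from the slide):

\[ P_{dyn} = \alpha\, C\, V_{dd}^{2}\, f \]

Because the maximum operating frequency falls roughly in proportion to V_dd (when V_dd is well above V_th), scaling voltage and frequency together cuts dynamic power roughly as the cube of the scale factor, which is what makes the "more, slower units" strategy attractive.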

ECE 4100/6100 (24) Leakage Power Basics Sub-threshold leakage –Increases with lower V_th, and with higher T and W Gate-oxide leakage –Increases with lower T_ox, higher W –High-K dielectrics offer a potential solution Reverse-biased pn-junction leakage –Very sensitive to T and V (in addition to diffusion area)
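For reference, the usual textbook form of the sub-threshold leakage current (not from the slide) makes these dependences explicit:

\[ I_{sub} \;\approx\; I_{0}\,\frac{W}{L}\; e^{\frac{V_{GS}-V_{th}}{n\,V_{T}}}\left(1 - e^{-\frac{V_{DS}}{V_{T}}}\right), \qquad V_{T} = \frac{kT}{q} \]

Leakage grows exponentially as V_th drops, linearly with device width W, and strongly with temperature through V_T (and through the temperature dependence of V_th itself).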

ECE 4100/6100 (25) The Current Power Trend [Plot: power density (W/cm²) vs. year for Intel processors through the Pentium® family, with reference levels for a hot plate, nuclear reactor, rocket nozzle, and the Sun’s surface] Source: Intel Corp.

ECE 4100/6100 (26) Improving Power/Performance Consider constant die size and decreasing core area each generation = more cores/chip –Effect of lowering voltage and frequency  power reduction –Increasing cores/chip  performance increase Better power/performance!
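A back-of-the-envelope version of the argument, using the dynamic-power relation from slide 23 with an assumed 0.7x voltage and frequency scaling (the 0.7 figure is illustrative, not from the slide):

\[ P_{1\,\text{core}} = \alpha\, C\, V^{2} f, \qquad P_{2\,\text{cores}} \approx 2\,\alpha\, C\,(0.7V)^{2}(0.7f) \approx 0.69\, P_{1\,\text{core}} \]

Two slower cores deliver roughly 1.4x the aggregate throughput of the original core for about 70% of its dynamic power, provided the workload parallelizes.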

ECE 4100/6100 (27) Accelerators [Die photo: 2.23 mm × 3.54 mm, 260K transistors] Opportunities: network processing engines, MPEG encode/decode engines, speech engines, TCP/IP offload engines Source: Shekhar Borkar, Intel Corp.

ECE 4100/6100 (28) Low-Power Design Techniques Circuit and gate level methods –Voltage scaling –Transistor sizing –Glitch suppression –Pass-transistor logic –Pseudo-nMOS logic –Multi-threshold gates Functional and architectural methods –Clock gating –Clock frequency reduction –Supply voltage reduction –Power down/off –Algorithmic and software techniques Two decades worth of research and development!

ECE 4100/6100 (29) The Economics of Manufacturing Where are the costs of developing the next generation of processors? –Design costs –Manufacturing costs What type of chip-level solutions does the economics imply? Assessing the implications of Moore’s Law is an exercise in mass production

ECE 4100/6100 (30) The Cost of An ASIC Example: a design with 80M transistors in 100nm technology, estimated cost $85M–$90M [Timeline: design, implementation and verification, prototype, and production over 12–18 months] Cost and risk are rising to unacceptable levels Top cost drivers –Verification (40%) –Architecture design (23%) –Embedded software design: 1400 man-months (SW) vs. 1150 man-months (HW) –HW/SW integration *Handel H. Jones, “How to Slow the Design Cost Spiral,” Electronics Design Chain, September 2002

ECE 4100/6100 (31) The Spectrum of Architectures [Diagram: a spectrum from customization fully in hardware to customization fully in software: custom ASICs and structured ASICs (LSI Logic, Leopard Logic), FPGAs (Xilinx, Altera), polymorphic computing architectures and tiled architectures (MONARCH, SM, RAW, TRIPS; PACT, PICOChip), fixed + variable ISA (Tensilica, Stretch Inc.), and microprocessors. Hardware development via synthesis sits at one end, software development via compilation at the other; design NRE, effort, and time to market increase with the degree of hardware customization]

ECE 4100/6100 (32) Interlocking Trade-offs [Diagram: power, memory, frequency, and ILP form an interlocking set of trade-offs, coupled through speculation, bandwidth, dynamic power, dynamic penalties, miss penalty, and leakage power]

ECE 4100/6100 (33) Multi-core Architecture Drivers Addressing ILP limits –Multiple threads –Coarse grain parallelism  raise the level of abstraction Addressing frequency and power limits –Multiple slower cores across technology generations –Scaling via increasing the number of cores rather than frequency –Heterogeneous cores for improved power/performance Addressing memory system limits –Deep, distributed cache hierarchies –OS replication  shared memory remains dominant Addressing manufacturing issues –Design and verification costs  replication  the network becomes more important!

ECE 4100/6100 (34) Parallelism

ECE 4100/6100 (35) Beyond ILP Performance is limited by the serial fraction [Diagram: the parallelizable portion of a program spread across 1, 2, 3, and 4 CPUs while the serial portion stays fixed] Coarse-grain parallelism in the post-ILP era –Thread, process, and data parallelism Learn from the lessons of the parallel processing community –Revisit the classifications and architectural techniques
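The diagram is Amdahl's law in pictorial form; written out (standard formula, not on the slide), with p the parallelizable fraction and N the number of CPUs:

\[ \text{Speedup}(N) = \frac{1}{(1-p) + p/N} \;\longrightarrow\; \frac{1}{1-p} \quad (N \to \infty) \]

However many cores are added, the serial fraction (1 - p) bounds the achievable speedup, which is why the rest of the lecture focuses on raising p through thread, process, and data parallelism.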

ECE 4100/6100 (36) Flynn’s Model Flynn’s Classification –Single instruction stream, single data stream (SISD) The conventional, word-sequential architecture including pipelined computers –Single instruction stream, multiple data stream (SIMD) The multiple ALU-type architectures (e.g., array processor) –Multiple instruction stream, single data stream (MISD) Not very common –Multiple instruction stream, multiple data stream (MIMD) The traditional multiprocessor system M.J. Flynn, “Very high speed computing systems,” Proc. IEEE, vol. 54(12), pp. 1901–1909, 1966.

ECE 4100/6100 (37) SIMD/Vector Computation SIMD and vector models are spatial and temporal analogs of each other A rich architectural history dating back to 1953! [Figures: Cray vector machine (Source: Cray); IBM Cell SPE organization and SPE pipeline diagram (Source: IBM)]
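A minimal C sketch of the SIMD idea on this slide: one instruction applied to several data elements at once. The SSE intrinsics used here (from <immintrin.h>) are the x86 descendants of the MMX/SSE extensions mentioned on slide 16; this is an illustration, not code from the lecture.

#include <immintrin.h>  /* x86 SSE intrinsics */

/* Scalar (SISD) version: one addition per instruction. */
void add_scalar(float *c, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* SIMD version: a single _mm_add_ps instruction adds four floats at once
 * (spatial parallelism); a vector machine would instead stream the elements
 * through one deeply pipelined unit (temporal parallelism).
 * Assumes n is a multiple of 4 for brevity. */
void add_simd(float *c, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
}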

ECE 4100/6100 (38) SIMD/Vector Architectures VIRAM - Vector IRAM –Logic is slow in a DRAM process –Put a vector unit in a DRAM and provide a port between a traditional processor and the vector IRAM, instead of putting a whole processor in DRAM [Figure: Berkeley Vector IRAM] Source: Berkeley

ECE 4100/6100 (39) MIMD Machines Parallel processing has catalyzed the development of several generations of parallel processing machines Unique features include the interconnection network, support for system-wide synchronization, and programming languages/compilers [Diagram: four nodes, each with a processor + cache (P + C), directory (Dir), and memory, connected by an interconnection network]

ECE 4100/6100 (40) Basic Models for Parallel Programs Shared Memory –Coherency/consistency are driving concerns –Programming model is simplified at the expense of system complexity Message Passing –Typically implemented on distributed memory machines –System complexity is simplified at the expense of increased effort by the programmer

ECE 4100/6100 (41) Shared Memory Model That’s basically it… –Need to fork/join threads and synchronize (typically with locks) [Diagram: CPU 0 writes X and CPU 1 reads X through a shared main memory]
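A minimal POSIX-threads sketch of this model: two threads communicate simply by reading and writing the same variable, with a mutex for synchronization (illustrative code, not from the lecture).

#include <pthread.h>
#include <stdio.h>

/* Shared state lives in ordinary memory visible to every thread. */
static int x = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *writer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);    /* synchronize: typically locks */
    x = 42;                       /* "CPU 0 writes X" */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);   /* fork a thread */
    pthread_join(t, NULL);                    /* join */

    pthread_mutex_lock(&lock);
    printf("x = %d\n", x);                    /* "CPU 1 reads X" */
    pthread_mutex_unlock(&lock);
    return 0;
}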

ECE 4100/6100 (42) Recv Message Passing Protocols Explicitly send data from one thread to another –need to track ID’s of other CPUs –broadcast may need multiple send’s –each CPU has own memory space Hardware: send/recv queues between CPUs Send CPU 0 CPU 1

ECE 4100/6100 (43) Shared Memory Vs. Message Passing Shared memory doesn’t scale as well to larger numbers of nodes –Communications are broadcast-based –The bus becomes a severe bottleneck Message passing doesn’t need a centralized bus –Can arrange the multiprocessor like a graph: nodes = CPUs, edges = independent links/routes –Can have multiple communications/messages in transit at the same time

ECE 4100/6100 (44) Two Emerging Challenges Programming models and compilers? Interconnection networks [Figures: Source: IBM; Source: Intel Corp.]