Karu Sankaralingam University of Wisconsin-Madison Collaborators: Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, and Doug Burger The Dark Silicon Implications.

Slides:



Advertisements
Similar presentations
Dynamic Power Redistribution in Failure-Prone CMPs Paula Petrica, Jonathan A. Winter * and David H. Albonesi Cornell University *Google, Inc.
Advertisements

International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems.
Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.
Computer Abstractions and Technology
Computing Systems Roadmap and its Impact on Software Development Michael Ernst, BNL HSF Workshop at SLAC January, 2015.
MULTICORE PROCESSOR TECHNOLOGY.  Introduction  history  Why multi-core ?  What do you mean by multicore?  Multi core architecture  Comparison of.
Power Reduction Techniques For Microprocessor Systems
Lecture 2: Modern Trends 1. 2 Microprocessor Performance Only 7% improvement in memory performance every year! 50% improvement in microprocessor performance.
1 Introduction Background: CS 3810 or equivalent, based on Hennessy and Patterson’s Computer Organization and Design Text for CS/EE 6810: Hennessy and.
- Sam Ganzfried - Ryan Sukauye - Aniket Ponkshe. Outline Effects of asymmetry and how to handle them Design Space Exploration for Core Architecture Accelerating.
An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu
Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel.
On the Limits of Leakage Power Reduction in Caches Yan Meng, Tim Sherwood and Ryan Kastner UC, Santa Barbara HPCA-2005.
1 Introduction Background: CS 3810 or equivalent, based on Hennessy and Patterson’s Computer Organization and Design Text for CS/EE 6810: Hennessy and.
Power Management in Multicores Minshu Zhao. Outline Introduction Review of Power management technique Power management in Multicore ◦ Identify Multicores.
1 Lecture 1: CS/ECE 3810 Introduction Today’s topics:  logistics  why computer organization is important  modern trends.
1 Introduction Background: CS 3810 or equivalent, based on Hennessy and Patterson’s Computer Organization and Design Text for CS/EE 6810: Hennessy and.
ECE 510 Brendan Crowley Paper Review October 31, 2006.
Lecture 37: Chapter 7: Multiprocessors Today’s topic –Introduction to multiprocessors –Parallelism in software –Memory organization –Cache coherence 1.
Single-Chip Multi-Processors (CMP) PRADEEP DANDAMUDI 1 ELEC , Fall 08.
Operating Systems Should Manage Accelerators Sankaralingam Panneerselvam Michael M. Swift Computer Sciences Department University of Wisconsin, Madison,
Alternative Computing Technologies
UW-Madison Computer Sciences Vertical Research Group© 2010 A Unified Model for Timing Speculation: Evaluating the Impact of Technology Scaling, CMOS Design.
1 The Performance Potential for Single Application Heterogeneous Systems Henry Wong* and Tor M. Aamodt § *University of Toronto § University of British.
Multi Core Processor Submitted by: Lizolen Pradhan
Lecture 03: Fundamentals of Computer Design - Trends and Performance Kai Bu
1 Lecture 1: CS/ECE 3810 Introduction Today’s topics:  Why computer organization is important  Logistics  Modern trends.
Amdahl’s Law in the Multicore Era Mark D.Hill & Michael R.Marty 2008 ECE 259 / CPS 221 Advanced Computer Architecture II Presenter : Tae Jun Ham 2012.
Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.
Multi-core Programming Introduction Topics. Topics General Ideas Moore’s Law Amdahl's Law Processes and Threads Concurrency vs. Parallelism.
Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee and Margaret Martonosi.
® 1 VLSI Design Challenges for Gigascale Integration Shekhar Borkar Intel Corp. October 25, 2005.
Hardware Trends. Contents Memory Hard Disks Processors Network Accessories Future.
1 CS/EE 6810: Computer Architecture Class format:  Most lectures on YouTube *BEFORE* class  Use class time for discussions, clarifications, problem-solving,
Grad Student Visit DayUniversity of Wisconsin-Madison Wisconsin Computer Architecture Guri SohiMark HillMikko LipastiDavid WoodKaru Sankaralingam Nam Sung.
[Tim Shattuck, 2006][1] Performance / Watt: The New Server Focus Improving Performance / Watt For Modern Processors Tim Shattuck April 19, 2006 From the.
Nicolas Tjioe CSE 520 Wednesday 11/12/2008 Hyper-Threading in NetBurst Microarchitecture David Koufaty Deborah T. Marr Intel Published by the IEEE Computer.
1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.
HPC User Forum Back End Compiler Panel SiCortex Perspective Kevin Harris Compiler Manager April 2009.
CISC 879 : Advanced Parallel Programming Vaibhav Naidu Dept. of Computer & Information Sciences University of Delaware Importance of Single-core in Multicore.
Computational Sprinting on a Real System: Preliminary Results Arun Raghavan *, Marios Papaefthymiou +, Kevin P. Pipe +#, Thomas F. Wenisch +, Milo M. K.
Chapter 1 Performance & Technology Trends Read Sections 1.5, 1.6, and 1.8.
Department of Computer Science 1 Beyond CUDA/GPUs and Future Graphics Architectures Karu Sankaralingam University of Wisconsin-Madison Adapted from “Toward.
Shashwat Shriparv InfinitySoft.
Authors – Jeahyuk huh, Doug Burger, and Stephen W.Keckler Presenter – Sushma Myneni Exploring the Design Space of Future CMPs.
Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? Wasim Shaikh Date: 10/29/2015.
MULTICORE PROCESSOR TECHNOLOGY.  Introduction  history  Why multi-core ?  What do you mean by multicore?  Multi core architecture  Comparison of.
0 1 Thousand Core Chips A Technology Perspective Shekhar Borkar Intel Corp. June 7, 2007.
Computer Architecture Lecture 26 Past and Future Ralph Grishman November 2015 NYU.
Computer Science and Engineering Power-Performance Considerations of Parallel Computing on Chip Multiprocessors Jian Li and Jose F. Martinez ACM Transactions.
An Integrated GPU Power and Performance Model (ISCA’10, June 19–23, 2010, Saint-Malo, France. International Symposium on Computer Architecture)
Computer Science and Engineering Copyright by Hesham El-Rewini Advanced Computer Architecture CSE 8383 May 2, 2006 Session 29.
Computer Organization Yasser F. O. Mohammad 1. 2 Lecture 1: Introduction Today’s topics:  Why computer organization is important  Logistics  Modern.
CISC 879 : Advanced Parallel Programming Vaibhav Naidu Dept. of Computer & Information Sciences University of Delaware Dark Silicon and End of Multicore.
CS203 – Advanced Computer Architecture
Core Architecture Optimization for Heterogeneous CMPs R. Kumar, D. M. Tullsen, and N.P. Jouppi İlker YILDIRIM
CS203 – Advanced Computer Architecture
Lecture 2: Performance Today’s topics:
Lynn Choi School of Electrical Engineering
The Problem Finding a needle in haystack An expert (CPU)
Multi-Processing in High Performance Computer Architecture:
Lecture 2: Performance Today’s topics: Technology wrap-up
Using Packet Information for Efficient Communication in NoCs
Parallel Processing Sharing the load.
EE 193: Parallel Computing
Leveraging Optical Technology in Future Bus-based Chip Multiprocessors
Computer Evolution and Performance
병렬처리시스템 2005년도 2학기 채 수 환
The University of Adelaide, School of Computer Science
EE 155 / Comp 122 Parallel Computing
Presentation transcript:

Karu Sankaralingam University of Wisconsin-Madison Collaborators: Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, and Doug Burger The Dark Silicon Implications for Microprocessors

Multicore Decade? 2 We have relied on multicore scaling for over five years. How much longer will it be our primary performance scaling technique? Pentium Extreme Dual-Core Core 2 Quad-Core i7 980x Hex-Core ?

Finding Optimal Multicore Designs 3 Comprehensive design space: Fixed area budget Fixed power budget Two sets of CMOS scaling projections Optimal core and diverse multicore organizations Parallel benchmarks For next 5 technology generations, we find the best performing multicore from a comprehensive design space search for each of the PARSEC benchmarks

Symmetric Multicore Projections 4 Symmetric multicores alone will not sustain the multicore era. 3.4x in 10 years 18x

Multicore Solutions 5 Asymmetric Topologies 3.5x

Multicore Solutions 6 Dynamic Topologies [Chakraborty (2008), Suleman et al (2009)] 3.5x

Multicore Solutions 7 Composed/Fused Topologies [Ipek et al (2007), Kim et al (2007)] 3.7x

Multicore Solutions 8 GPU-Style Cores 2.7x

Multicore Era Projections 9 The best designs speed up 14% per year rather than the recent trend of 34% per year 3.7x 18x

Why Diminishing Returns? 10 Transistor area is still scaling Voltage and capacitance scaling have slowed Result: designs are power, not area, limited

Devices Find the best case technology scaling Cores Find the best cores Multicores Find the best multicore organization Projections Predict best case multicore performance for each technology generation Overview 11

ConservativeOptimistic Area32x Power4.5x8.3x Frequency1.3x3.9x [Borkar 2007][ITRS 2010] Device Scaling Projections 12 From 45 nm to 8 nm:

Modeling Ideal Core Power/Perf. 13 Atom Nehalem Pareto Frontier includes all optimal power/performance points Repeat using core area for optimal area/performance points

Combining Device and Core Models nm Frontier 32 nm Frontier Device Scaling

Devices Find the best case technology scaling Cores Find the best cores Multicores Find the best multicore organization Projections Predict best case multicore performance for each technology generation Overview 15

What belongs in multicore model? 16 Styles Applications Topologies Area & Power / Performance Tradeoffs Architectures PARSEC f parallel, Data Use Number of Threads, Cache Sizes Area & Power Budget Cache & memory latencies, memory bandwidth Pareto Frontiers

Multicore Speedup Model 17 Multicore Speedup 1 1-f parallel Serial Speedup f parallel Parallel Speedup = +

Multicore Performance Model 18 Performance is limited by: and Memory bandwidth BW max / (instructions per byte from memory) Computation N cores  (core frequency/CPI exe )  core utilization [Guz et al, 2009]

Core Utilization Model 19 Core utilization is limited by: Fraction of Time Core is Ready to Issue Number of Threads in Core / Number of Threads to Keep Busy [Guz et al, 2009]

Multicore Model & Pareto Frontiers points A(q), P(q) q

Translating from SPECmark From q, find core’s SPECmark speedup 2. Frequency linearly distributed from Atom to Nehalem 3. Recall: model predicts benchmark performance as f(benchmark chars, frequency, CPI exe ) 4. Compute CPI exe such that Benchmark Speedup = SPECmark Speedup

Area and Power Constraints 22 N cores x A(q) ≤ Area Budget N cores x P(q) ≤ Power Budget Dark silicon = N cores / # of cores that fit in chip area

Devices Find the best case technology scaling Cores Find the best cores Multicores Find the best multicore organization Projections Predict best case multicore performance for each technology generation Overview 23

Dark Silicon 24 Sources of Dark Silicon: Power + Limited Parallelism At 22 nm: At 8 nm: 17% 26% 51% 71%

Overall Performance 25 Target f parallel = x 16x 8x 6x 3x

Conclusions Multicore performance gains are limited Need at least 18%-40% per generation from architecture alone without additional power 26 Unicore EraMulticore Era ?

27 Specialization Shrinking chips Pervasive Efficiency