University of Michigan

Presentation transcript:

University of Michigan Heterogeneous Microarchitectures Trump Voltage Scaling for Low-Power Cores Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Ronald Dreslinski Jr., Thomas F. Wenisch, and Scott Mahlke University of Michigan PACT’23 August 26th, 2014

Low-Power Cores
Nokia 9000 (1996): Intel 80386 @ 24MHz, 800 mAh battery
Nexus 5 (2013): Krait 400 4x @ 2.3GHz, 2300 mAh battery
?: HD video + Web surfing, all-day battery?
How do we get there?

The Big Picture
Study efficiency of heterogeneous architectures
Efficiency depends on the schedule
Factor out scheduling efficiency
Study architectural efficiency

Heterogeneity
[Figure: an application running on a core, divided into quanta (epochs) with different behavior: pointers, ILP, branch misses, MLP, floating point]
Decompose core?

Heterogeneity
Single-ISA Heterogeneous Microarchitectures (HMs): migrate core microarchitecture (i.e. Big (OoO) and Little (InOrder)) [Kumar’03]
Dynamic Voltage/Frequency Scaling (DVFS): change voltage/frequency points (figure labels: 1000 MHz, 250 MHz) to improve efficiency [Horowitz’94]
More dimensions?

Quanta Size
                 DVFS                      HMs                           Quantum
Coarse Grained   20-70 uSec [Mazouz’13]    15-25 uSec [Greenhalgh’11]    10M Insts
Fine Grained     10-20 nSec [Kim’12]       10-30 nSec [Padmanabha’13]    1K Insts
Coarse Grained: off-chip regulators (DVFS), cache transfer (HMs)
Fine Grained: on-chip regulators (DVFS), shared caches (HMs)
[Figure labels: Big @ 1GHz, Big @ 750MHz, Little Core]

The Goal
Which is most efficient?
DVFS vs. HMs
Coarse vs. Fine
[Figure: grid of DVFS and HMs against Coarse Grained and Fine Grained, with cells labeled Yesterday’s Cores, Today’s Cores, and Future Cores?]

Schedules
[Figure: IPC over time, split into quanta (epochs), showing performance on the Big core and performance on the Little core]
Schedule: Big Little Big Little Big Little Big
Most efficient schedule?

Pareto-Optimal Schedules
Schedule: pick one core for each quantum
              Quantum 1         Quantum 2         Quantum 3
              Delay   Energy    Delay   Energy    Delay   Energy
Big Core      10 ms   50 mJ     20 ms   60 mJ     30 ms   60 mJ
Little Core   20 ms   20 mJ     40 ms   50 mJ     35 ms   40 mJ
Number  Schedule  Delay   Energy
1       {B,B,B}   60 ms   170 mJ
2       {B,B,L}   65 ms   150 mJ
3       {B,L,B}   80 ms   160 mJ
4       {B,L,L}   85 ms   140 mJ
5       {L,B,B}   70 ms   140 mJ
6       {L,B,L}   75 ms   120 mJ
7       {L,L,B}   90 ms   130 mJ
8       {L,L,L}   95 ms   110 mJ
How efficient is this schedule? {L,L,B} => ( 90 ms, 130 mJ )

Pareto-Optimal Schedules
Number  Schedule  Delay   Energy
1       {B,B,B}   60 ms   170 mJ
2       {B,B,L}   65 ms   150 mJ
3       {B,L,B}   80 ms   160 mJ
4       {B,L,L}   85 ms   140 mJ
5       {L,B,B}   70 ms   140 mJ
6       {L,B,L}   75 ms   120 mJ
7       {L,L,B}   90 ms   130 mJ
8       {L,L,L}   95 ms   110 mJ
[Figure: energy vs. delay plot of schedules 1-8, with "Better" and "Worse" directions marked]
Some schedules are just better ( #6 > #3 )
Pareto-optimal schedules determine the best tradeoffs for a given architecture
Schedule efficiency affects architectural efficiency
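
The enumeration in this example can be reproduced with a short script. What follows is a minimal Python sketch, not the authors' tooling: the names `BIG`, `LITTLE`, `evaluate`, and `pareto_optimal` are hypothetical, and the per-quantum values that do not survive in the transcript (Big quantum 3 energy, Little quantum 1 delay, Little quantum 2 energy) are assumed from the schedule totals.

```python
from itertools import product

# Per-quantum (delay ms, energy mJ) for each core in the three-quantum example.
# ASSUMPTION: entries missing from the transcript are inferred from the totals.
BIG = [(10, 50), (20, 60), (30, 60)]
LITTLE = [(20, 20), (40, 50), (35, 40)]
COSTS = {"B": BIG, "L": LITTLE}


def evaluate(schedule):
    """Return (total delay, total energy) for a schedule such as ('B', 'L', 'B')."""
    delay = sum(COSTS[core][q][0] for q, core in enumerate(schedule))
    energy = sum(COSTS[core][q][1] for q, core in enumerate(schedule))
    return delay, energy


def pareto_optimal(points):
    """Return the names of schedules that no other schedule beats on both axes."""
    frontier = []
    best_energy = float("inf")
    # Visit schedules from fastest to slowest; a slower schedule is worth
    # keeping only if it strictly lowers the best energy seen so far.
    for name, (delay, energy) in sorted(points.items(), key=lambda kv: kv[1]):
        if energy < best_energy:
            frontier.append(name)
            best_energy = energy
    return frontier


# All K^N = 2^3 = 8 schedules of the example.
schedules = {"".join(s): evaluate(s) for s in product("BL", repeat=3)}
frontier = pareto_optimal(schedules)

print(schedules["LLB"])  # (90, 130)
print(frontier)          # ['BBB', 'BBL', 'LBB', 'LBL', 'LLL']
```

Under these assumptions the frontier holds schedules 1, 2, 5, 6 and 8; schedule 3 is dominated by schedule 6, which matches the "#6 > #3" remark.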

Schedule Efficiency
K Modes x N Quanta
Total Schedules: K^N (12^1000000)
Find most efficient schedule for given performance level: lowest energy for given performance level

Regions
Can we break into regions? Combine regions?
[Figure: energy vs. delay plots for regions (0..m) and [m..n), summed to give the plot for (0..n)]
Merging regions requires exponential complexity…

Approximate Regions
[Figure: energy vs. delay plots for regions (0..m) and [m..n), summed to give (0..n); each shows a Pareto-optimal region between a best-case Pareto frontier and a worst-case Pareto frontier, bounded by ≤ΔD and ≤ΔE, with "Best Energy & Delay" and "Worst Energy & Delay" corners]
Limit region error to +/- 2.5%
Best energy/delay tradeoffs! ( with bounded error )
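
One way to picture region merging with bounded error is sketched below in Python. This is an illustrative stand-in, not the algorithm from the paper: the helpers `pareto`, `thin`, and `merge` are hypothetical, the sketch bounds only the delay error (energy never gets worse when a point is dropped), and the error compounds across successive merges. The per-quantum frontiers reuse the three-quantum example, with missing values assumed from the schedule totals.

```python
def pareto(points):
    """Return the Pareto-optimal (delay, energy) points, sorted by delay."""
    frontier = []
    for delay, energy in sorted(set(points)):
        if not frontier or energy < frontier[-1][1]:
            frontier.append((delay, energy))
    return frontier


def thin(frontier, rel_err):
    """Group points whose delay is within rel_err of the group's first point
    and keep only the last (lowest-energy) point of each group.

    Every dropped point is covered by a kept point with no more energy and
    at most (1 + rel_err) times its delay.
    """
    kept = []
    group_start = None
    for delay, energy in frontier:
        if group_start is not None and delay <= group_start * (1 + rel_err):
            kept[-1] = (delay, energy)    # same group: keep the lower energy
        else:
            kept.append((delay, energy))  # start a new group
            group_start = delay
    return kept


def merge(front_a, front_b, rel_err=0.025):
    """Combine the frontiers of two adjacent regions into one thinned frontier."""
    sums = [(da + db, ea + eb) for da, ea in front_a for db, eb in front_b]
    return thin(pareto(sums), rel_err)


# Per-quantum frontiers (delay ms, energy mJ); some values are ASSUMED.
q1 = [(10, 50), (20, 20)]
q2 = [(20, 60), (40, 50)]
q3 = [(30, 60), (35, 40)]

exact = merge(merge(q1, q2, 0.0), q3, 0.0)
approx = merge(merge(q1, q2, 0.1), q3, 0.1)
print(exact)   # [(60, 170), (65, 150), (70, 140), (75, 120), (95, 110)]
print(approx)  # [(65, 150), (75, 120), (95, 110)]
```

Only Pareto-optimal points of each region need to be summed, because a combined schedule built from a dominated sub-schedule can always be improved by swapping in the dominating one. Thinning keeps the frontier size under control as regions are combined.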

Evaluation - DVFS
28nm node, Low-Power Fully Depleted Silicon-on-Insulator (FDSOI)
2000MHz @ 1.1V
600MHz @ 0.6V
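
For a rough sense of what the two operating points imply, the Python sketch below applies the textbook dynamic-power model P = C * V^2 * f. It is a back-of-the-envelope estimate under stated assumptions, not a result from the talk: it ignores leakage (which accumulates over the longer runtime), assumes runtime scales inversely with frequency, and the function name `dynamic_scaling` is hypothetical.

```python
def dynamic_scaling(v_hi, f_hi, v_lo, f_lo):
    """Compare a low operating point against a high one under P = C * V^2 * f.

    Returns (dynamic energy per operation ratio, dynamic power ratio, slowdown).
    """
    energy_ratio = (v_lo / v_hi) ** 2           # energy per operation ~ V^2
    power_ratio = energy_ratio * (f_lo / f_hi)  # power ~ V^2 * f
    slowdown = f_hi / f_lo                      # assumes compute-bound code
    return energy_ratio, power_ratio, slowdown


# Operating points from the slide: 2000MHz @ 1.1V and 600MHz @ 0.6V.
energy_ratio, power_ratio, slowdown = dynamic_scaling(1.1, 2000, 0.6, 600)
print(f"{energy_ratio:.2f} {power_ratio:.2f} {slowdown:.1f}")  # 0.30 0.09 3.3
```

Under this simplified model the lowest point spends about 30% of the dynamic energy per operation while running about 3.3x slower.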

Evaluation - HMs
HMs modeled off ARM’s big.LITTLE
Little (A7): 2-issue in-order core
Big (A15): 3-issue out-of-order core
Validation (Dhrystone):
System     Evaluation    Δ Performance   Δ Energy
Industry   big.LITTLE    1.9x            3.5x
Modeled    gem5+McPAT    2.09x           3.01x

Coarse-Grained Comparison
Normalized to highest performance core (Big @ 2GHz); 100% slowdown = 2x runtime
Pareto-optimal schedules result in best possible tradeoffs
[Figure callout: most efficient DVFS schedule for 50% slowdown]
Lower DVFS levels are less efficient than HMs
HMs provide better benefits, especially for smaller slowdowns
DVFS+HMs provides minimal benefits
DVFS+HMs allows continued scaling

Fine-Grained Architecture
DVFS: on-chip voltage regulators; neglect efficiency losses
HMs: Composite Cores architecture; shared L1 caches, frontend
HMs incur ~7% power overheads
Leakage => clock-gating (not power-gating)
Dynamic => over-provisioned hardware

Fine-Grained Comparison
Start with Coarse-Grained
Fine-Grained DVFS ≈ Coarse-Grained DVFS
HMs beat Coarse-Grained DVFS + HMs until ~50% slowdown
DVFS + HMs provides benefits, but they are not additive
~10% higher savings
~5% savings for free

Benchmarks
Energy savings for a 5% slowdown
Not always a clear winner
DVFS+HMs has different benefits

Summary
Fine-Grained DVFS + HMs provide no benefits for large slowdowns
Fine-Grained HMs provide most benefits for small slowdowns
[Figure labels: 5%, 6%]

More Details in Paper
Assumptions of fine-grained architectures
Overheads analysis for HMs: switching overheads, power overheads
Detailed benchmark analysis

Conclusions
DVFS + HMs best for large slowdowns
HMs trump DVFS for small slowdowns
[Figure labels: Coarse Grained, Fine Grained]
Questions?

Switching Overheads

Leakage Overheads

Benchmarks I: hmmer, xalancbmk, mcf

Benchmarks II: perlbench, omnetpp

Limitation: State Transfer
[Figure: Big pipeline and Little pipeline, each with its own Fetch, Decode, iCache, Branch Pred, iTLB, dTLB, dCache, and Reg File; state sizes labeled 10s of KB and <1 KB]
State transfer costs can be very high: ~20K cycles (ARM’s big.LITTLE)
Limits switching to coarse granularity: 100M instructions (Kumar’04)

Creating a Composite Core
[Figure: Big uEngine (Fetch, Decode, O3 Execute, RAT, Load/Store Queue) and Little uEngine (Fetch, Decode, inO Execute, Mem) with a Controller, sharing iCache, Branch Pred, iTLB, dTLB, and dCache; only the <1KB Reg File is transferred]
State transfer overheads: 20K to ~20 cycles
Switching granularity: 100M to 1000 instructions
Little pays ~8% energy overhead
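
The link between state transfer cost and usable quantum size can be made concrete with a little arithmetic. The Python sketch below is illustrative only: `switch_overhead` is a hypothetical helper, it assumes the worst case of a switch at the end of every quantum, and the IPC of 1.0 is an assumed value rather than a figure from the talk.

```python
def switch_overhead(switch_cycles, quantum_insts, ipc=1.0):
    """Fraction of all cycles spent on state transfer, assuming every
    quantum ends with a switch (worst case) and a fixed ASSUMED IPC."""
    quantum_cycles = quantum_insts / ipc
    return switch_cycles / (quantum_cycles + switch_cycles)


# Cycle counts and quantum sizes are the ones quoted on the slides.
coarse = switch_overhead(20_000, 100_000_000)  # ~20K cycles, 100M-instruction quanta
naive_fine = switch_overhead(20_000, 1_000)    # ~20K cycles, 1000-instruction quanta
composite = switch_overhead(20, 1_000)         # ~20 cycles, 1000-instruction quanta

print(f"{coarse:.4%} {naive_fine:.1%} {composite:.1%}")  # 0.0200% 95.2% 2.0%
```

With these assumptions, a ~20K-cycle transfer is negligible at 100M-instruction quanta but would dominate execution at 1000-instruction quanta, whereas a ~20-cycle transfer keeps the worst-case overhead near 2%.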

Low-Power Cores
Are everywhere…
1 Billion smartphones in 2014 [1,2]
[1] http://www.gartner.com/newsroom/id/2665715
[2] http://www.gartner.com/newsroom/id/2665715
http://www.dialaphone.co.uk/blog/2008/06/17/a-funny-look-back-at-some-old-cell-phones/
http://www.businesskorea.co.kr/article/1687/local-market-saturation-korea%E2%80%99s-smartphone-market-forecast-negative-growth-year

Cites
M. Horowitz, T. Indermaur, and R. Gonzalez, “Low-power digital design,” in IEEE Symp. Low Power Electron. (ISLPE’94) Digest of Tech. Papers, Oct. 1994, pp. 8-11.
R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen, “Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction,” in Proc. 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36), Dec. 2003, pp. 81-92.
[1] https://software.intel.com/sites/default/files/ftalat.pdf
[2] http://www.arm.com/files/downloads/big_LITTLE_Final_Final.pdf
[3] “A fully-integrated 3-level dc-dc converter for nanosecond-scale dvfs”
[4] “Composite cores: pushing heterogeneity into a core”