
Composite Cores: Pushing Heterogeneity into a Core
Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Faissal M. Sleiman, Ronald Dreslinski, Thomas F. Wenisch, and Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science
MICRO 45, May 8th, 2012

Slide 2: High Performance Cores
High performance cores waste energy on low-performance phases.
(Chart: performance and energy over time.)
High energy yields high performance; low performance DOES NOT yield low energy.

Slide 3: Core Energy Comparison
(Chart: out-of-order vs. in-order core energy; data from Brooks, ISCA '00 and Dally, IEEE Computer '08.)
Out-of-order cores contain performance-enhancing hardware that is not necessary for correctness.
Do we always need the extra hardware?

Slide 4: Previous Solution: Heterogeneous Multicore
2+ cores with the same ISA but different implementations:
– High performance, but more energy
– Energy efficient, but less performance
Cores share memory at a high level:
– Shared L2 cache (Kumar '04)
– Coherent L2 caches (ARM's big.LITTLE)
The operating system (or programmer) maps each application to the smallest core that provides the needed performance.

Slide 5: Current System Limitations
Migration between cores incurs high overheads:
– ~20K cycles (ARM's big.LITTLE)
Sample-based schedulers:
– Sample each core's performance, then decide whether to reassign the application
– Assume performance is stable within a phase
A phase must be long to be recognized and exploited:
– 100M-500M instructions in length
Do finer-grained phases exist? Can we exploit them?

Slide 6: Performance Change in GCC
(Charts: average IPC over a 1M-instruction window, the "quantum", and average IPC over 2K quanta.)

Slide 7: Finer Quantum
(Chart: a 20K-instruction window from GCC, showing average IPC over 100-instruction quanta.)
What if we could map these low-IPC quanta to a Little core?
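The per-quantum IPC measurement behind these plots can be sketched in a few lines (my own illustration, not tooling from the talk): slice a retired-instruction trace into fixed-size quanta and average IPC within each.

```python
# Illustrative sketch (not from the talk): compute average IPC over
# fixed-size instruction quanta, as in the GCC phase plots.
def quantum_ipcs(cycles_per_instr, quantum=100):
    """cycles_per_instr[i] = cycles spent retiring instruction i."""
    ipcs = []
    for start in range(0, len(cycles_per_instr), quantum):
        chunk = cycles_per_instr[start:start + quantum]
        total_cycles = sum(chunk)
        # IPC for this quantum: instructions retired / cycles elapsed.
        ipcs.append(len(chunk) / total_cycles if total_cycles else 0.0)
    return ipcs
```

Low-IPC quanta in the resulting list mark the fine-grained phases a Little core could absorb.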

Slide 8: Our Approach: Composite Cores
Hypothesis: exploiting fine-grained phases allows more opportunities to run on a Little core.
Problems:
I. How do we minimize switching overheads?
II. When should we switch cores?
Questions:
I. How fine-grained should we go?
II. How much energy can we save?

Slide 9: Problem I: State Transfer
(Diagram: big out-of-order and little in-order pipelines, each with fetch, decode, iCache, branch predictor, TLBs, dCache, RAT, and register file; the caches and predictors hold 10s of KB, while the architectural register state is <1 KB.)
State transfer costs can be very high: ~20K cycles (ARM's big.LITTLE).
This limits switching to coarse granularity: ~100M instructions (Kumar '04).

Slide 10: Creating a Composite Core
(Diagram: the Big out-of-order uEngine and the Little in-order uEngine share the fetch stage, iCache, branch predictor, TLBs, dCache, and load/store queue; less than 1 KB of register state must transfer between them.)
Only one uEngine is active at a time.

Slide 11: Hardware Sharing Overheads
The Big uEngine needs:
– High fetch width
– Complex branch prediction
– Multiple outstanding data cache misses
The Little uEngine wants:
– Low fetch width
– Simple branch prediction
– A single outstanding data cache miss
Shared units must be built for the Big uEngine, so they are over-provisioned for the Little uEngine.
Even assuming clock gating for the inactive uEngine, it still leaks static energy.
The Little uEngine pays an ~8% energy overhead to use the over-provisioned fetch and caches.

Slide 12: Problem II: When to Switch
Goal: maximize time on the Little uEngine, subject to a user-configurable maximum performance loss.
Traditional OS-based schedulers won't work:
– Decisions are too frequent
– They need to be made in hardware
Traditional sampling-based approaches won't work:
– Performance is not stable for long enough
– Frequent switching just to sample wastes cycles

Slide 13: What uEngine to Pick
(Chart: per-quantum performance difference between uEngines, with regions labeled "run on Big" and "run on Little" separated by a threshold.)
This threshold is hard to determine a priori and depends on the application:
– Use a controller to learn an appropriate value over time
– Let the user configure the target value

Slide 14: Reactive Online Controller
(Diagram: models of the Big and Little uEngines feed a threshold controller, which combines their estimates with the user-selected performance target; a switching controller then activates either the Big uEngine or the Little uEngine.)
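A minimal sketch of such a reactive controller (class, method, and parameter names are mine, and the threshold-update rule is a plausible stand-in, not the paper's exact controller): each quantum, it compares the modeled IPC gap between the engines against a threshold that adapts toward the user-selected performance target.

```python
# Hedged sketch of a reactive online switching controller.
class ReactiveController:
    def __init__(self, target_loss=0.05, gain=0.01):
        self.target_loss = target_loss   # allowed slowdown vs. all-Big
        self.threshold = 0.0             # IPC gap at/below which Little is OK
        self.gain = gain                 # how fast the threshold adapts
        self.big_ipc_sum = 0.0           # estimated all-Big baseline
        self.actual_ipc_sum = 0.0        # performance actually achieved

    def next_engine(self, big_ipc_est, little_ipc_est, actual_ipc):
        # Track achieved performance against the all-Big baseline.
        self.big_ipc_sum += big_ipc_est
        self.actual_ipc_sum += actual_ipc
        # Ahead of the target: raise the threshold so Little is used more;
        # behind the target: lower it so Big is used more.
        allowed = self.big_ipc_sum * (1.0 - self.target_loss)
        error = self.actual_ipc_sum - allowed
        self.threshold += self.gain * error
        gap = big_ipc_est - little_ipc_est
        return "little" if gap <= self.threshold else "big"
```

The key property this preserves from the slide: no sampling runs are needed, because both engines' performance is estimated every quantum from the active engine's metrics.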

Slide 15: uEngine Modeling
Example loop: while(flag){ foo(); flag = bar(); }
Collect metrics of the active uEngine:
– iL1 and dL1 cache misses
– L2 cache misses
– Branch mispredicts
– ILP, MLP, CPI
Use a linear model to estimate the inactive uEngine's performance (e.g., the Little uEngine observes IPC 1.66 and the model estimates the Big uEngine would achieve IPC 2.15).
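The linear model can be sketched as below; the metric names and coefficient values are illustrative placeholders of mine, not the paper's trained regression weights.

```python
# Illustrative sketch: estimate the inactive uEngine's IPC as a linear
# combination of metrics observed on the active uEngine.
def estimate_inactive_ipc(metrics, coeffs, intercept=0.0):
    """metrics/coeffs: dicts keyed by metric name, e.g. 'l2_misses',
    'branch_mispredicts', 'mlp', 'active_ipc'."""
    return intercept + sum(coeffs[name] * metrics[name] for name in coeffs)

# Hypothetical weights: the Big engine's IPC tracks the Little engine's,
# discounted by memory and branch behavior.
little_to_big = {"active_ipc": 1.2, "l2_misses": -0.05,
                 "branch_mispredicts": -0.02}
```

Two such models are needed, one per direction (Little observing, Big observing), since the engines respond to stalls differently.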

Slide 16: Evaluation
Big uEngine: 3-wide, 1.0 GHz, 12-stage pipeline, 128 ROB entries, 128-entry register file
Little uEngine: 2-wide, 1.0 GHz, 8-stage pipeline, 32-entry register file
Memory system: 32 KB L1 i/d caches (1-cycle access), 1 MB L2 cache (15-cycle access), 1 GB main memory (80-cycle access)
Controller: 5% performance loss relative to the all-Big core

Slide 17: Little Engine Utilization
3-wide O3 (Big) vs. 2-wide in-order (Little), at a 5% performance loss relative to all-Big.
(Chart: little-engine utilization under a traditional OS-based quantum vs. a fine-grained quantum.)
Fine-grained quanta yield more time on the little engine for the same performance loss.

Slide 18: Engine Switches
LOTS of switching is needed to maximize utilization: from ~1 switch per 2,800 instructions to ~1 switch per 306 instructions.

Slide 19: Performance Loss
(Chart: Composite Cores with quantum length 1000.)
Switching overheads are negligible down to quanta of ~1,000 instructions.

Slide 20: Fine-Grained vs. Coarse-Grained
The Little uEngine's average power is 8% higher, due to the shared hardware structures.
Fine-grained switching maps 41% more instructions to the Little uEngine than coarse-grained, resulting in an overall 27% decrease in average power relative to coarse-grained.

Slide 21: Decision Techniques
1. Oracle: knows both uEngines' performance for all quanta
2. Perfect Past: knows both uEngines' past performance perfectly
3. Model: knows only the active uEngine's past and models the inactive uEngine using default weights
All techniques target 95% of the all-Big uEngine's performance.

Slide 22: Little Engine Utilization
Utilization is high for memory-bound applications; issue width dominates for computation-bound ones.
Overall, 25% of the dynamic instructions map onto the Little uEngine.

Slide 23: Energy Savings
Including the overhead of the shared hardware structures, energy consumption drops by 18%.

Slide 24: User-Configured Performance
A 1% performance loss yields 4% energy savings; a 20% performance loss yields 44% energy savings.

Slide 25: More Details in the Paper
– Estimated uEngine area overheads
– uEngine model accuracy
– Switching timing diagram
– Hardware sharing overheads analysis

Slide 26: Conclusions
Even high-performance applications experience fine-grained phases of low throughput; map those to a more efficient core.
Composite Cores allow:
– Fine-grained migration between cores
– Low-overhead switching
Result: 18% energy savings by mapping 25% of the instructions to the Little uEngine, with a 5% performance loss.
Questions?


Slide 28: Backup Slides

Slide 29: The DVFS Question
Lower voltage is useful when:
– Stalled on an L2 miss at commit
The Little uArch is useful when:
– Stalled on an L2 miss at issue
– Branch mispredicts are frequent (shorter pipeline)
– Computation is dependent

Slide 30: Sharing Overheads
(Figure only.)

Slide 31: Performance
(Chart: 5% overall performance loss.)

Slide 32: Model Accuracy
(Charts: accuracy of the Little-to-Big and Big-to-Little performance models.)

Slide 33: Regression Coefficients
(Figure only.)

Slide 34: Different Than Kumar et al.
Kumar et al.: coarse-grained switching, OS managed; Composite Cores: fine-grained switching, hardware managed.
Kumar et al.: minimal shared state (L2s); Composite Cores: maximal shared state (L2s, L1s, branch predictor, TLBs).
Kumar et al.: requires sampling; Composite Cores: on-the-fly prediction.
Kumar et al.: 6-wide O3 vs. 8-wide O3 (has an in-order core, but never uses it!); Composite Cores: 3-wide O3 vs. 2-wide in-order.
In short: coarse-grained vs. fine-grained.

Slide 35: Register File Transfer
(Diagram: the RAT maps architectural register numbers to physical registers; values flow from one register file to the other while commit drains.)
A 3-stage transfer pipeline:
1. Map the architectural register to a physical register in the RAT
2. Read the physical register
3. Write the value into the new register file
If commit updates a register during the transfer, repeat for that register.
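The transfer loop above can be sketched as follows; this is a software analogy of mine for the hardware pipeline, with hypothetical data-structure names.

```python
# Illustrative sketch: copy architectural register state into the other
# uEngine's register file via the RAT, repeating for late commits.
def transfer_registers(rat, phys_regs, dest_regs, pending_commits):
    # Steps 1-3 per architectural register: RAT lookup, physical-register
    # read, write into the destination register file.
    for arch, phys in rat.items():
        dest_regs[arch] = phys_regs[phys]
    # If commit updates the RAT mid-transfer, repeat for those registers.
    while pending_commits:
        arch, phys = pending_commits.pop()
        rat[arch] = phys
        dest_regs[arch] = phys_regs[phys]
    return dest_regs
```

Because only this small architectural state moves (under 1 KB, per slide 9), the switch costs far less than the ~20K cycles of a full core migration.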

Slide 36: uEngine Model
(Figure only.)