
1 Composite Cores: Pushing Heterogeneity into a Core. Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Faissal M. Sleiman, Ronald Dreslinski, Thomas F. Wenisch, and Scott Mahlke. University of Michigan. MICRO-45, May 8th 2012.

2 High Performance Cores
High performance cores waste energy on low performance phases.
[Figure: performance and energy over time.]
High energy yields high performance, but low performance DOES NOT yield low energy.

3 Core Energy Comparison
[Chart: energy comparison of out-of-order vs. in-order cores; data from Brooks, ISCA'00 and Dally, IEEE Computer'08.]
Out-of-order cores contain performance-enhancing hardware that is not necessary for correctness. Do we always need the extra hardware?

4 Previous Solution: Heterogeneous Multicore
2+ cores with the same ISA but different implementations:
- High performance, but more energy
- Energy efficient, but less performance
Share memory at a high level:
- Shared L2 cache (Kumar '04)
- Coherent L2 caches (ARM's big.LITTLE)
The operating system (or programmer) maps each application to the smallest core that provides the needed performance.

5 Current System Limitations
Migration between cores incurs high overheads:
- ~20K cycles (ARM's big.LITTLE)
Sample-based schedulers:
- Sample the different cores' performance and then decide whether to reassign the application
- Assume stable performance within a phase
A phase must be long to be recognized and exploited:
- 100M-500M instructions in length
Do finer-grained phases exist? Can we exploit them?

6 Performance Change in GCC
[Plot: average IPC over a 1M-instruction window (quantum), shown across 2K quanta.]

7 Finer Quantum
[Plot: a 20K-instruction window from GCC, average IPC over 100-instruction quanta.]
What if we could map the low-IPC quanta to a Little core?
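
As a rough illustration of the kind of analysis behind these plots, the sketch below computes the average IPC of each fixed-size quantum from a toy trace and counts how many quanta look Little-friendly. The trace format, the 100-instruction quantum, and the 1.0 IPC cutoff are all assumptions for the example, not values taken from the paper.

# Sketch: per-quantum IPC from a toy trace (illustrative only).
# trace_cycles[i] is the number of cycles spent on committed instruction i.

def quantum_ipcs(trace_cycles, quantum=100):
    """Average IPC for each consecutive window of `quantum` instructions."""
    ipcs = []
    for start in range(0, len(trace_cycles) - quantum + 1, quantum):
        cycles = sum(trace_cycles[start:start + quantum])
        ipcs.append(quantum / cycles if cycles else 0.0)
    return ipcs

def little_candidates(ipcs, threshold=1.0):
    """Count quanta whose IPC is low enough that a Little core might keep up."""
    return sum(1 for ipc in ipcs if ipc < threshold)

if __name__ == "__main__":
    import random
    random.seed(0)
    toy_trace = [random.choice([1, 1, 2, 5]) for _ in range(20_000)]
    ipcs = quantum_ipcs(toy_trace, quantum=100)
    print(f"{little_candidates(ipcs)} of {len(ipcs)} quanta fall below 1.0 IPC")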

8 Our Approach: Composite Cores
Hypothesis: exploiting fine-grained phases allows more opportunities to run on a Little core.
Problems:
I. How to minimize switching overheads?
II. When to switch cores?
Questions:
I. How fine-grained should we go?
II. How much energy can we save?

9 Problem I: State Transfer
[Diagram: out-of-order pipeline (fetch, decode, rename, execute) and in-order pipeline (fetch, decode, execute) with their iCaches, dCaches, TLBs, branch predictors, RAT, and register files; the caches and predictors hold tens of KB, while the register state to transfer is under 1 KB.]
State transfer costs can be very high: ~20K cycles (ARM's big.LITTLE).
This limits switching to coarse granularity: ~100M instructions (Kumar '04).

10 Creating a Composite Core
[Diagram: the Big (out-of-order) and Little (in-order) uEngines share the fetch stage (iCache, branch predictor, iTLB), the dCache and dTLB, the load/store queue, and a switching controller; less than 1 KB of register state moves between them.]
Only one uEngine is active at a time.

11 Hardware Sharing Overheads
The Big uEngine needs:
- High fetch width
- Complex branch prediction
- Multiple outstanding data cache misses
The Little uEngine wants:
- Low fetch width
- Simple branch prediction
- A single outstanding data cache miss
Shared units must be built for the Big uEngine, so they are over-provisioned for the Little uEngine.
We assume clock gating for the inactive uEngine, which still leaves static leakage energy.
The Little uEngine pays ~8% energy overhead to use the over-provisioned fetch and caches.

12 Problem II: When to Switch
Goal: maximize time on the Little uEngine subject to a maximum (user-configurable) performance loss.
Traditional OS-based schedulers won't work:
- Decisions are too frequent
- The decision needs to be made in hardware
Traditional sampling-based approaches won't work:
- Performance is not stable for long enough
- Frequent switching just to sample wastes cycles

13 What uEngine to Pick
[Diagram: a threshold separates the "run on Big" region from the "run on Little" region.]
This threshold value is hard to determine a priori and depends on the application, so a controller learns the appropriate value over time. The user configures the target value.

14 Reactive Online Controller
[Diagram: a switching controller fed by a Big uEngine model, a Little uEngine model, and a threshold controller driven by the user-selected performance; its true/false output selects the Little or Big uEngine.]
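
A minimal sketch of how such a reactive controller could be structured, assuming per-quantum IPC estimates for both uEngines (one measured, one predicted by the model on the next slide) and a user-selected fraction of all-Big performance. The threshold-update rule, the gain value, and the names below are illustrative assumptions, not the paper's implementation.

# Sketch of a reactive switching controller (illustrative, not the paper's).
# Each quantum it compares the estimated Big/Little IPC gap against a
# threshold that adapts so delivered performance tracks the user's target.

class ReactiveController:
    def __init__(self, target_fraction=0.95, gain=0.1):
        self.target_fraction = target_fraction  # e.g. tolerate a 5% slowdown
        self.gain = gain                        # how fast the threshold adapts
        self.threshold = 0.0                    # allowed IPC gap (Big - Little)
        self.sum_big_ipc = 0.0                  # estimated all-Big performance
        self.sum_actual_ipc = 0.0               # performance actually delivered

    def next_engine(self, big_ipc, little_ipc, actual_ipc):
        # Accumulate the all-Big baseline and what was actually achieved.
        self.sum_big_ipc += big_ipc
        self.sum_actual_ipc += actual_ipc
        # Positive error: ahead of target, so use the Little engine more
        # aggressively; negative error: fall back toward the Big engine.
        error = self.sum_actual_ipc - self.target_fraction * self.sum_big_ipc
        self.threshold += self.gain * error
        # Run on Little whenever the predicted performance loss is small enough.
        return "little" if (big_ipc - little_ipc) <= self.threshold else "big"

# Example: after a quantum on the Little uEngine (measured 1.66 IPC, Big
# predicted at 2.15 IPC), ask which uEngine should run the next quantum.
controller = ReactiveController(target_fraction=0.95)
print(controller.next_engine(big_ipc=2.15, little_ipc=1.66, actual_ipc=1.66))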

15 uEngine Modeling
[Example: the same loop, while(flag){ foo(); flag = bar(); }, runs at IPC 1.66 on the Little uEngine; what IPC would the Big uEngine have achieved?]
Collect metrics of the active uEngine: iL1 and dL1 cache misses, L2 cache misses, branch mispredicts, ILP, MLP, CPI.
Use a linear model to estimate the inactive uEngine's performance (here, an estimated Big uEngine IPC of 2.15).
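
The model on this slide can be pictured as a linear combination of the metrics collected on the active uEngine. The sketch below uses made-up coefficients purely for illustration; the actual weights come from offline regression (see the "Regression Coefficients" backup slide).

# Sketch: estimate the inactive uEngine's IPC from metrics observed on the
# active uEngine.  Coefficient values are invented for illustration only.

LITTLE_TO_BIG_WEIGHTS = {
    "bias":            0.50,
    "active_ipc":      1.10,   # IPC measured on the engine that actually ran
    "l1i_misses":     -0.002,
    "l1d_misses":     -0.003,
    "l2_misses":      -0.010,
    "branch_mispred": -0.005,
    "ilp":             0.20,
    "mlp":             0.10,
}

def predict_inactive_ipc(metrics, weights=LITTLE_TO_BIG_WEIGHTS):
    """Linear estimate of the inactive uEngine's IPC for the last quantum."""
    ipc = weights["bias"]
    for name, value in metrics.items():
        ipc += weights.get(name, 0.0) * value
    return max(ipc, 0.0)

# Example: metrics gathered while the Little uEngine ran the last quantum.
metrics = {"active_ipc": 1.66, "l1i_misses": 3, "l1d_misses": 12,
           "l2_misses": 1, "branch_mispred": 4, "ilp": 2.1, "mlp": 1.4}
print(f"Predicted Big uEngine IPC: {predict_inactive_ipc(metrics):.2f}")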

16 Evaluation
Architectural features and parameters:
- Big uEngine: 3-wide O3 @ 1.0 GHz, 12-stage pipeline, 128 ROB entries, 128-entry register file
- Little uEngine: 2-wide in-order @ 1.0 GHz, 8-stage pipeline, 32-entry register file
- Memory system: 32 KB L1 i/d cache (1-cycle access), 1 MB L2 cache (15-cycle access), 1 GB main memory (80-cycle access)
- Controller: 5% performance loss relative to an all-Big core

17 Little Engine Utilization
[Chart: Little uEngine utilization for a 3-wide O3 (Big) vs. a 2-wide in-order (Little), at a 5% performance loss relative to all-Big, comparing a traditional OS-based quantum against a fine-grained quantum.]
Fine-grained switching spends more time on the Little engine for the same performance loss.

18 Engine Switches
Maximizing utilization requires a lot of switching.
[Chart: switch frequency, with call-outs at ~1 switch per 2800 instructions and ~1 switch per 306 instructions.]

19 Performance Loss
[Chart: performance loss vs. quantum length; Composite Cores uses a quantum length of 1000.]
Switching overheads are negligible until quanta shrink to ~1000 instructions.

20 Fine-Grained vs. Coarse-Grained
The Little uEngine's average power is 8% higher due to the shared hardware structures.
Fine-grained switching maps 41% more instructions to the Little uEngine than coarse-grained, resulting in an overall 27% decrease in average power over coarse-grained.

21 Decision Techniques
1. Oracle: knows both uEngines' performance for all quanta
2. Perfect Past: knows both uEngines' past performance perfectly
3. Model: knows only the active uEngine's past, and models the inactive uEngine using default weights
All techniques target 95% of the all-Big uEngine's performance.

22 Little Engine Utilization
Utilization is high for memory-bound applications; issue width dominates for compute-bound ones.
Overall, 25% of the dynamic instructions map onto the Little uEngine.

23 Energy Savings
Including the overhead of the shared hardware structures, the result is an 18% reduction in energy consumption.

24 User-Configured Performance
A 1% performance loss yields 4% energy savings; a 20% performance loss yields 44% energy savings.

25 More Details in the Paper
- Estimated uEngine area overheads
- uEngine model accuracy
- Switching timing diagram
- Hardware sharing overheads analysis

26 Conclusions
Even high-performance applications experience fine-grained phases of low throughput; map those to a more efficient core.
Composite Cores allows:
- Fine-grained migration between cores
- Low-overhead switching
Result: 18% energy savings by mapping 25% of the instructions to the Little uEngine with a 5% performance loss.
Questions?

27 Composite Cores: Pushing Heterogeneity into a Core. Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Faissal M. Sleiman, Ronald Dreslinski, Thomas F. Wenisch, and Scott Mahlke. University of Michigan. MICRO-45, May 8th 2012.

28 Back Up

29 The DVFS Question
Lower voltage is useful when:
- Stalled on an L2 miss at commit
A Little uArch is useful when:
- Stalled on an L2 miss at issue
- Branch mispredicts are frequent (shorter pipeline)
- Computation is dependent
http://www.arm.com/files/downloads/big_LITTLE_Final_Final.pdf

30 Sharing Overheads

31 Performance: 5% performance loss.

32 Model Accuracy
[Charts: model accuracy for Little -> Big and Big -> Little predictions.]

33 Regression Coefficients

34 Different Than Kumar et al.
- Kumar et al.: coarse-grained switching, OS managed. Composite Cores: fine-grained switching, hardware managed.
- Kumar et al.: minimal shared state (L2s). Composite Cores: maximized shared state (L2s, L1s, branch predictor, TLBs).
- Kumar et al.: requires sampling. Composite Cores: on-the-fly prediction.
- Kumar et al.: 6-wide O3 vs. 8-wide O3 (has an in-order core, but never uses it!). Composite Cores: 3-wide O3 vs. 2-wide in-order.
In short: coarse-grained vs. fine-grained.

35 Register File Transfer
[Diagram: RAT and register files (register number, value) with the commit stage.]
A 3-stage pipeline:
1. Map to the physical register in the RAT
2. Read the physical register
3. Write to the new register file
If commit updates a register, repeat.
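
A toy sketch of the three steps listed above, using plain dictionaries for the RAT and register files; all structures and names here are illustrative assumptions, not the hardware design.

# Sketch of the register-file transfer steps: RAT lookup, physical register
# read, then a write into the other uEngine's register file.

def transfer_registers(rat, phys_regfile, dest_regfile, arch_regs):
    for arch_reg in arch_regs:
        phys_reg = rat[arch_reg]           # 1. map to physical register in RAT
        value = phys_regfile[phys_reg]     # 2. read the physical register
        dest_regfile[arch_reg] = value     # 3. write to the new register file
    # If commit updates a register mid-transfer, the affected registers
    # would simply be transferred again (the "repeat" step on the slide).

# Illustrative usage with toy values.
rat = {"r1": 7, "r2": 3}
phys_regfile = {7: 0xDEAD, 3: 0xBEEF}
dest_regfile = {}
transfer_registers(rat, phys_regfile, dest_regfile, ["r1", "r2"])
print(dest_regfile)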

36 uEngine Model

