Aggregating Processor Free Time for Energy Reduction
Aviral Shrivastava¹, Eugene Earlie², Nikil Dutt¹, Alex Nicolau¹
¹ Center for Embedded Computer Systems, University of California, Irvine, CA, USA
² Strategic CAD Labs, Intel, Hudson, MA, USA

Processor Activity
Each dot denotes the time for which the Intel XScale was stalled during the execution of the qsort application.
[Scatter plot of stall events over time; annotated clusters: pipeline hazards, single misses, multiple misses, cold misses]

Processor Stall Durations
With an IPC of 0.7, the XScale is stalled for 30% of the time, but each stall duration is small:
Average stall duration = 4 cycles
Longest stall duration < 100 cycles
Each stall is an opportunity for optimization:
Temporarily switch to a different thread of execution, to improve throughput and reduce energy consumption
Temporarily switch the processor to a low-power state, to reduce energy consumption
But state switching has overhead!

Power State Machine of XScale
RUN: 450 mW
IDLE: 10 mW (transition to/from RUN: 180 cycles)
DROWSY: 1 mW (transition: 36,000 cycles)
SLEEP: 0 mW (transition: >> 36,000 cycles)
Break-even stall duration for profitable switching to IDLE: 360 cycles
Maximum processor stall: < 100 cycles
It is therefore NOT possible to switch the processor even to IDLE mode; we need to create larger stall durations.
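
The break-even reasoning above is easy to make concrete. A minimal sketch in C, assuming the 360-cycle figure is the round trip of entering and leaving IDLE at 180 cycles each (the predicate is ours, for illustration):

    /* Switching to IDLE during a stall only pays off if the stall
     * outlasts the round-trip switching overhead. */
    #include <stdbool.h>

    #define IDLE_ENTRY_CYCLES 180
    #define IDLE_EXIT_CYCLES  180
    #define BREAK_EVEN_CYCLES (IDLE_ENTRY_CYCLES + IDLE_EXIT_CYCLES) /* 360 */

    bool idle_switch_is_profitable(unsigned stall_cycles)
    {
        return stall_cycles > BREAK_EVEN_CYCLES;
    }

    /* On the XScale the longest observed stall is < 100 cycles, so this
     * predicate never holds without aggregation. */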

Motivating Example

    for (int i=0; i<1000; i++)
        c[i] = a[i] + b[i];

    L: mov ip, r1, lsl#2
       ldr r2, [r4, ip]    // r2 = a[i]
       ldr r3, [r5, ip]    // r3 = b[i]
       add r1, r1, #1
       cmp r1, r0
       add r3, r3, r2      // r3 = a[i]+b[i]
       str r3, [r6, ip]    // c[i] = r3
       ble L

[System diagram: the processor, with its load-store unit and data cache, connects through a request buffer and memory buffer to memory over a request bus and a data bus]

Machine parameters:
Computation = 1 instruction/cycle
Cache line size = 4 words
Request latency = 12 cycles
Data bandwidth = 1 word / 3 cycles

Define C (Computation): the time to execute 4 iterations of this loop, assuming no cache misses.
C = 8 instructions × 4 iterations = 32 cycles
Define ML (Memory Latency): the time to transfer all the data required by 4 iterations between memory and the caches, assuming the requests were made well in advance.
ML = 4 lines × 4 words/line × 3 cycles/word = 48 cycles
Define memory-bound loops: loops for which ML > C.
It is not possible to avoid processor stalls in memory-bound loops.
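
As a sanity check, the C and ML values can be recomputed from the stated machine parameters. A small worked example (ours), using only the constants from this slide:

    /* Recompute C and ML for 4 iterations of c[i] = a[i] + b[i]. */
    #include <stdio.h>

    int main(void)
    {
        const int insns_per_iter  = 8;  /* instructions in the loop body   */
        const int iters           = 4;  /* iterations per cache line       */
        const int lines           = 4;  /* cache lines moved (slide value) */
        const int words_per_line  = 4;
        const int cycles_per_word = 3;  /* bandwidth: 1 word / 3 cycles    */

        int C  = insns_per_iter * iters;                   /* 32 cycles */
        int ML = lines * words_per_line * cycles_per_word; /* 48 cycles */

        printf("C = %d, ML = %d, memory-bound: %s\n",
               C, ML, ML > C ? "yes" : "no");
        return 0;
    }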

Normal Execution
[Timeline for the loop above: processor activity and memory bus activity]
Processor activity is discontinuous.
Memory activity is discontinuous.

Prefetching

    for (int i=0; i<1000; i++) {
        prefetch a[i+4];
        prefetch b[i+4];
        prefetch c[i+4];
        c[i] = a[i] + b[i];
    }

[Timeline: processor activity and memory bus activity with prefetching]
Each processor activity period increases, and the total execution time reduces.
Memory activity becomes continuous, but processor activity is still discontinuous.

Aggregation
[Timeline: aggregated processor activity followed by aggregated processor free time; memory bus activity]
Total execution time remains the same.
Processor activity is continuous, and memory activity is continuous.
The processor free time is aggregated into one large chunk.

Aggregation
Aggregation: collect small stall times to create one large chunk of free time.
Traditional approach: slow down the processor (DVS, DFS, DPS).
Aggregation vs. dynamic scaling:
It is easier for hardware to implement idle states than dynamic scaling.
Idle states are also good for leakage energy.
Aggregation is counter-intuitive:
Traditional scheduling algorithms distribute load over resources; aggregation instead collects the processor activity and inactivity.
The hare in the hare-and-tortoise race!
This work focuses on aggregating memory stalls.

Related Work
Low-power states are typically implemented using clock gating, power gating, voltage scaling, and frequency scaling.
Rabaey et al. [Kluwer'96]: low-power design methodologies.
Between applications, the processor can be switched to a low-power mode: system-level dynamic power management.
Benini et al. [TVLSI]: a survey of design techniques for system-level dynamic power management.
Inside an application: microarchitecture-level dynamic switching.
Gowan et al. [DAC'98]: power considerations in the design of the Alpha 21264 microprocessor.
Prefetching can aggregate memory activity in compute-bound loops.
Vanderwiel et al. [CSUR]: data prefetch mechanisms.
But not in memory-bound loops: existing prefetching techniques can request only a few lines at a time.
For large-scale processor free time aggregation, we need a prefetch mechanism that can request large amounts of data.
There is no existing technique for aggregation of processor free time.

HW/SW Approach for Aggregation
Hardware support:
Large-scale prefetching
Processor low-power mode
Data analysis:
To find out what to prefetch
To discover memory-bound loops
Software support:
Code transformations to achieve aggregation

Aggregation Mechanism
A programmable, compiler-controlled prefetch engine is added next to the request buffer.
The processor sets up the prefetch engine: what to prefetch, and when to wake the processor up.
The prefetch engine starts prefetching, and the processor goes to sleep. Zzz…
The prefetch engine wakes up the processor at the pre-calculated time.
The processor then executes on the prefetched data: no cache misses, and no performance penalty.
[Timeline: aggregated memory activity while the processor sleeps, then aggregated processor activity]

Hardware Support for Aggregation
Instructions to control the prefetch engine:
setPrefetch a, l (prefetch l lines of array a)
setWakeup w (raise the wakeup interrupt after requesting w lines)
Prefetch engine:
Adds line requests to the request buffer, keeping it non-empty so that the data bus stays saturated
Serves the programmed arrays with a round-robin policy
Generates the wakeup interrupt after requesting w lines
After fetching all the data, disables and disengages itself
Processor:
Enters the low-power state and waits for the wakeup interrupt from the prefetch engine
(A behavioral sketch of the engine follows below.)
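
The description above maps onto a simple behavioral model. A sketch in plain C (the struct layout, names, and per-line request granularity are our assumptions; the real engine is hardware):

    /* Behavioral model of the programmable prefetch engine: round-robin
     * over the programmed streams, one line request per step, wakeup
     * interrupt after 'wakeup_after' requests. */
    #include <stdio.h>

    #define MAX_STREAMS 4

    struct stream { unsigned next_line, lines_left; };

    struct prefetch_engine {
        struct stream s[MAX_STREAMS];
        int nstreams;
        unsigned requested, wakeup_after;
    };

    void set_prefetch(struct prefetch_engine *e, unsigned first_line,
                      unsigned nlines)
    {
        e->s[e->nstreams].next_line  = first_line;
        e->s[e->nstreams].lines_left = nlines;
        e->nstreams++;
    }

    void run_engine(struct prefetch_engine *e)
    {
        int busy = 1;
        while (busy) {                      /* keep the request buffer fed */
            busy = 0;
            for (int i = 0; i < e->nstreams; i++) {   /* round-robin */
                if (e->s[i].lines_left == 0)
                    continue;
                printf("request line %u\n", e->s[i].next_line++);
                e->s[i].lines_left--;
                busy = 1;
                if (++e->requested == e->wakeup_after)
                    printf("wakeup interrupt\n");  /* processor resumes */
            }
        }
        /* all data requested: the engine disables and disengages */
    }

    int main(void)
    {
        struct prefetch_engine e = { .nstreams = 0, .wakeup_after = 4 };
        set_prefetch(&e, 0, 3);    /* e.g. array a: 3 lines */
        set_prefetch(&e, 100, 3);  /* e.g. array b: 3 lines */
        run_engine(&e);
        return 0;
    }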

Data Analysis for Aggregation
Goals: find out what data is needed, and find out whether a loop is memory-bound.
Scope of the analysis (source-code analysis to find what is needed):
Innermost for-loops with a constant step and known bounds
Address functions of the references that are affine functions of the loop iterators
Contiguous cache lines are required
Finding memory-bound loops (ML > C):
Evaluate C (Computation) by a simple analysis of the assembly code
Compute ML (Memory Latency) from the data requirements (full data analysis in the paper)
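
The memory-bound test itself is a one-line comparison once C and ML are known. A compact sketch (the loop_info struct is our illustration, not the paper's implementation):

    /* Decide whether an innermost affine loop is memory-bound (ML > C). */
    #include <stdbool.h>
    #include <stdio.h>

    struct loop_info {
        unsigned insns_per_iter;   /* from assembly-level analysis         */
        unsigned iters_per_block;  /* iterations covered by one data block */
        unsigned lines_per_block;  /* cache lines moved per such block     */
        unsigned words_per_line;
        unsigned cycles_per_word;  /* inverse of the data bus bandwidth    */
    };

    static unsigned comp_cycles(const struct loop_info *l)   /* C  */
    {
        return l->insns_per_iter * l->iters_per_block;
    }

    static unsigned mem_cycles(const struct loop_info *l)    /* ML */
    {
        return l->lines_per_block * l->words_per_line * l->cycles_per_word;
    }

    static bool is_memory_bound(const struct loop_info *l)
    {
        return mem_cycles(l) > comp_cycles(l);
    }

    int main(void)
    {
        struct loop_info l = { 8, 4, 4, 4, 3 };  /* motivating example */
        printf("memory-bound: %s\n", is_memory_bound(&l) ? "yes" : "no");
        return 0;
    }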

Code Transformations for Aggregation
We cannot request all the data at once: the processor must wake up before the prefetch engine starts to overwrite still-unused data in the cache. Loop tiling is therefore needed.

Original loop:

    for (int i=0; i<N; i++)
        c[i] = a[i] + b[i];

Transformed loop:

    // Set up the prefetch engine
    setPrefetchArray a, N/L
    setPrefetchArray b, N/L
    setPrefetchArray c, N/L
    startPrefetch
    // Tiled loop
    for (i1=0; i1<N; i1+=T)
        setProcWakeup w      // set when to wake up the processor
        procIdleMode         // put the processor to sleep
        for (i2=i1; i2<i1+T; i2++)
            c[i2] = a[i2] + b[i2]

It remains to compute the wakeup time w and the tile size T.
[Timeline: w = wakeup time, T = tile size]

Computation of w and T
The memory and the processor are modeled as a producer-consumer pair, with the cache as the buffer between them.
Speed at which memory produces data: r/ML
Speed at which the processor consumes data: r/C
Wakeup time w: the processor must wake up just before the prefetched data starts to overwrite live cache contents:
w × (r/ML) = L  ⇒  w = L × ML / r
Tile size T: the processor must finish consuming all the prefetched data of the tile:
(w + t) × (r/ML) = t × (r/C)  ⇒  t = w × C / (ML − C), so T = w + t = w × ML / (ML − C)
[Timeline: w = wakeup time, T = tile size]
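
Turning the two formulas into code is direct. A small helper (ours), using the slide's names; L and r must be in the same units (for example, words), and the loop must be memory-bound (ML > C):

    /* Compute wakeup time w and tile period T from the producer-consumer
     * model: w = L*ML/r, T = w*ML/(ML - C). */
    #include <stdio.h>

    void compute_w_and_T(double L, double r, double ML, double C,
                         double *w, double *T)
    {
        *w = L * ML / r;          /* fill the cache exactly while asleep */
        *T = *w * ML / (ML - C);  /* tile period = w + active time t     */
    }

    int main(void)
    {
        /* Illustrative numbers only, not from the paper. */
        double w, T;
        compute_w_and_T(/*L=*/32, /*r=*/16, /*ML=*/48, /*C=*/32, &w, &T);
        printf("w = %.0f cycles, T = %.0f cycles\n", w, T);
        return 0;
    }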

Complete Transformation

    for (int i=0; i<N; i++)
        c[i] = a[i] + b[i];

becomes:

    // Set up the prefetch engine
    setPrefetchArray a, N/L
    setPrefetchArray b, N/L
    setPrefetchArray c, N/L
    startPrefetch
    // Prologue
    setProcWakeup w1
    procIdleMode
    for (i1=0; i1<T1; i1++)
        c[i1] = a[i1] + b[i1]
    // Tiled kernel of the loop
    for (i1=T1; i1<T2; i1+=T)
        setProcWakeup w
        procIdleMode
        for (i2=i1; i2<i1+T; i2++)
            c[i2] = a[i2] + b[i2]
    // Epilogue
    setProcWakeup w2
    procIdleMode
    for (i1=T2; i1<N; i1++)
        c[i1] = a[i1] + b[i1]

The kernel iterates over full tiles in [T1, T2); the prologue and epilogue handle the partial tiles at the boundaries.

Experiments
Platform: Intel XScale.
Experiment 1: free time aggregation.
Benchmarks: stream kernels, used by architects to tune the memory performance to the computation power of the processor.
Metrics: sleep window and sleep time.
Experiment 2: processor energy reduction.
Benchmarks: multimedia applications, a typical application set for the Intel XScale.
Metric: energy reduction.
We also evaluate the architectural overheads: area, power, and performance.

Experiment 1: Sleep Window
Up to 50,000 processor free cycles can be aggregated.
Sleep window = L × ML / r
Unrolling does not change ML but decreases C, so the sleep window itself does not change. However, more loops become memory-bound (ML > C), which increases the scope of aggregation.

Experiment 1: Sleep Time
Sleep time: the percentage of loop execution time for which the processor can be in sleep mode.
Sleep time = (ML − C) / ML
The processor can be in low-power mode for up to 75% of execution time.
Unrolling does not change ML but decreases C, which increases the scope of aggregation and increases the sleep time.

Experiment 2: Processor Energy Savings
Power model:
P_busy = 450 mW
P_stall = 112 mW
P_idle = 10 mW
P_myIdle = 50 mW
Original energy: E_orig = (N_busy × P_busy) + (N_stall × P_stall)
Final energy: E_final = (N_busy × P_busy) + (N_stall × P_stall) + (N_myIdle × P_myIdle)
(after aggregation, the remaining stall cycles shrink and the aggregated free cycles are spent in the myIdle state)
Up to 18% savings in processor energy.
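
A small worked computation (ours) of the energy model above; the cycle counts are made-up placeholders, and only the power numbers come from the slide:

    /* Energy model from the slide.  Cycle counts are illustrative
     * placeholders, not measured values. */
    #include <stdio.h>

    int main(void)
    {
        const double P_busy = 450e-3, P_stall = 112e-3, P_myIdle = 50e-3; /* W */

        /* before: all free time is spent stalling */
        double N_busy = 70e6, N_stall = 30e6;                  /* cycles */
        double E_orig = N_busy * P_busy + N_stall * P_stall;

        /* after: most free cycles are aggregated into the myIdle state */
        double N_stall2 = 5e6, N_myIdle = 25e6;
        double E_final = N_busy * P_busy + N_stall2 * P_stall
                       + N_myIdle * P_myIdle;

        printf("savings = %.1f%%\n", 100.0 * (E_orig - E_final) / E_orig);
        return 0;
    }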

Architectural Overheads
The prefetch engine was synthesized with Synopsys Design Compiler 2001 using the lsi_10k library, and the area and power numbers were scaled linearly.
Area overhead: very small.
Power overhead: Synopsys power estimate < 1%.
Performance overhead: < 1%.

Summary & Future Work
Existing prefetching techniques cannot achieve large-scale processor free time aggregation.
We presented a hardware-software cooperative approach to aggregate processor free time:
Up to 50,000 processor free cycles can be aggregated (without aggregation, the maximum processor free time is < 100 cycles).
Up to 75% of loop execution time can be free.
The processor can be switched to low-power mode during the aggregated free time, giving up to 18% processor energy savings.
Overheads are minimal: area (< 1%), power (< 1%), and performance (< 1%).
Future work: increase the scope of application of aggregation techniques, and investigate the effect on leakage energy.