Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. Ann Gordon-Ross and Frank Vahid, Department of Computer Science and Engineering.

Presentation transcript:

1 Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. Ann Gordon-Ross and Frank Vahid*, Department of Computer Science and Engineering, University of California, Riverside (*also with the Center for Embedded Computer Systems, UC Irvine). This work was supported in part by the U.S. National Science Foundation and a U.S. Dept. of Education GAANN Fellowship. International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, 2003.

2 System Optimizations - Static: Specialization of a system for a particular application or suite of applications to improve power consumption and/or performance. Static optimizations are performed at design time by the designer. There are many static optimization approaches: critical regions can be partitioned to configurable logic; constant propagation and code specialization can be applied to statically determined invariant variables; critical regions can be locked to a specialized cache; and many more.

3 System Optimizations - Static [Figure labels: modeled input stimulus, power consumption, design time]

4 Static Optimization Drawbacks: The designer must perform the optimizations, which may disrupt standard software tool flows. The runtime environment may present optimization opportunities that are not evident during design-time simulation. Simulation is usually used, and it is costly. Framework: it may take months to set up a testing environment with realistic input stimuli. Exploration time: it may take weeks to run one of hundreds of possible configurations.

5 System Optimizations - Dynamic: Dynamic software optimizations are becoming increasingly popular for improving software performance and power. Dynamic optimizations are performed in-system during runtime. [Figure labels: designer modifies system to use dynamic optimizations; end product; runtime input; system startup; profiling and performing dynamic optimizations; optimized system; change in stimulus; re-optimization of the system; power consumption; execution time]

6 System Optimizations - Dynamic: There are many dynamic optimization approaches. Dynamo performs dynamic software optimizations on the most frequently executed regions of code. Frequently executed regions of code can be remapped to non-interfering cache locations. Dynamic binary translation methods store translation results of frequently executed regions of code for quick look-up. Value profiling can determine runtime-invariant variables for constant propagation and/or code specialization. There are many others.

7 Dynamic Optimizations - Effectiveness: To be most effective, dynamic optimizations are typically applied to the most frequently executed regions of code. For a large selection of the MediaBench benchmark suite, we observed that 90% of the execution time was spent in approximately 10% of the code. Profiling is used to determine the critical regions of code.

8 Previous Profiling Methods: Desktop: desktop-targeted profiling methods use instrumentation and sampling. These methods are unsuitable for embedded systems because they disrupt run-time behavior. Embedded: early methods used logic analyzers, which is not possible for today’s systems-on-a-chip (SoCs). The JTAG standard allows internal registers to be read; it is typically used for testing and debugging, and it interrupts the processor to write internal information to external pins.

9 Profiling Methodology Goal: The goal of our profiling approach is to design a profiling tool suitable for embedded systems to determine the most critical regions of code.

10 Critical Region Detection - Operational Requirements: Non-intrusion: important for real-time systems; minimizes the impact on current tool chains, i.e., no special compilers or binary modification tools. Low power: needed for battery-operated systems and systems with limited cooling. Small area: less significant due to the large transistor capacities of current and future chips. Accuracy: exact results are not required for the information to be useful; reasonable accuracy is acceptable.

11 Frequent Loop Detection: We analyzed the critical regions for various Powerstone and MediaBench benchmarks. Of all critical regions, 85% were small inner loops and 15% were subroutines with no inner loops. We therefore translate the problem of finding the critical regions into finding the frequently executed loops. A short backwards branch (sbb) instruction is typically the last instruction of a loop.

12 Percentage of Execution Time for Frequent Loops: In addition to detecting frequent loops, we also want to know each loop's percentage contribution to total execution time. Application X: Loop A - 10%, Loop B - 10%, Loop C - 80%. Application Y: Loop A - 32%, Loop B - 33%, Loop C - 35%. Optimization of the most frequent loop only = 80% for X (Loop C) and 35% for Y (Loop C). Optimization of all critical loops = 80% for X (Loop C) and 100% for Y (Loops C, B, and A).
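
The arithmetic on this slide can be reproduced with a short sketch. The loop shares are the ones given on the slide; the helper names and the 30% cutoff used to decide which loops count as "critical" are assumptions made purely for illustration, since the slide does not state a threshold.

```python
# Illustrative sketch only. Loop shares come from the slide; the helper
# names and the 30% "critical" threshold are assumptions, not the authors'.

def most_frequent_share(loop_shares):
    """Share of execution time covered by optimizing only the hottest loop."""
    return max(loop_shares.values())


def critical_share(loop_shares, threshold=30):
    """Share covered by optimizing every loop at or above the threshold."""
    return sum(share for share in loop_shares.values() if share >= threshold)


app_x = {"Loop A": 10, "Loop B": 10, "Loop C": 80}
app_y = {"Loop A": 32, "Loop B": 33, "Loop C": 35}

for name, shares in (("X", app_x), ("Y", app_y)):
    print(name, most_frequent_share(shares), critical_share(shares))
```

The point of the comparison is that knowing only the single hottest loop is enough for application X but leaves most of application Y's execution time untouched, which is why relative frequencies matter.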

13 Frequent Loop Detection - Cache Based Architecture [Figure: block diagram with microprocessor, frequent loop cache controller, frequent loop cache, and incrementer (++); signals rd/wr, addr, sbb, data, and saturation; connection to L1 memory]

14 Cache Operation [Figure: an sbb trace updating the address and frequency fields of the cache, ending in a conflict]

15 Cache Operation - Conflict Resolution: Most conflicts are resolved using associativity, with an LRU replacement policy for further conflicts. Further conflicts may cause frequent loops to be constantly replaced in the cache (thrashing). Our experiments did not suffer from this contention, but a victim buffer may be added if necessary.
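
As a rough software model of the behavior described here, the sketch below keeps a frequency count per sbb address in a set-associative structure with LRU replacement. It is not the authors' hardware or their C++ simulator: the class name, the method names, and the choice of indexing sets by word address modulo the number of sets are all assumptions.

```python
from collections import OrderedDict


class FrequentLoopCache:
    """Toy model of a set-associative cache of sbb addresses with LRU replacement.

    Hypothetical sketch: the set-index function and the API are assumptions.
    """

    def __init__(self, entries=32, ways=2):
        self.ways = ways
        self.num_sets = entries // ways
        # One ordered dict per set; iteration order runs from LRU to MRU.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def _set_for(self, sbb_addr):
        # Assumption: index by word address modulo the number of sets.
        return self.sets[(sbb_addr >> 2) % self.num_sets]

    def update(self, sbb_addr):
        """Record one execution of the sbb at sbb_addr."""
        entry_set = self._set_for(sbb_addr)
        if sbb_addr in entry_set:
            entry_set[sbb_addr] += 1
            entry_set.move_to_end(sbb_addr)  # now the most recently used entry
        else:
            if len(entry_set) >= self.ways:
                entry_set.popitem(last=False)  # conflict: evict the LRU entry
            entry_set[sbb_addr] = 1

    def frequency(self, sbb_addr):
        """Return the stored count, or 0 if the address is not cached."""
        return self._set_for(sbb_addr).get(sbb_addr, 0)
```

With `entries=4, ways=2`, the addresses `0x00`, `0x08`, and `0x10` all map to the same set, so a third distinct address evicts whichever of the first two was used least recently. A loop that keeps losing its entry this way is the thrashing case the slide mentions.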

16 Cache Operation – Frequency Width: Our goal is to find the smallest possible cache needed to determine the frequent loops. We keep the cache small by allowing the frequency field width to be varied. If the frequency field is too small, saturations can occur and frequency information may be lost.

17 Cache Operation - Frequency Counter Saturation: On saturation, all frequencies are divided by 2 with a shift right (built as a special feature of the cache and activated by asserting the saturation signal to the cache). [Figure: an sbb trace driving the address and 8-bit frequency fields until a counter saturates]
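
The saturation handling can be sketched in a few lines. This is a minimal model under stated assumptions: the function name is hypothetical, the counters are held in a plain dictionary rather than a cache, and the choice to still apply the pending increment after the halving is an assumption the slide does not confirm.

```python
def increment_with_saturation(freqs, addr, width=8):
    """Increment freqs[addr]; if it is already at the maximum value that fits
    in `width` bits, first halve every counter with a right shift.

    Hypothetical sketch: applying the increment after the halving is an
    assumption, not something the slides specify.
    """
    max_value = (1 << width) - 1
    if freqs.get(addr, 0) == max_value:
        # Saturation: divide all frequencies by 2, keeping their ratios.
        for key in freqs:
            freqs[key] >>= 1
    freqs[addr] = freqs.get(addr, 0) + 1
    return freqs


counters = {0x100: 255, 0x200: 100}
increment_with_saturation(counters, 0x100)
print(counters)
```

Because every counter is halved together, the relative frequencies of the loops are approximately preserved, at the cost of one bit of precision per saturation.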

18 Experimental Setup: We ran extensive experiments to determine the best frequent loop cache configuration. To determine the accuracy of each cache configuration, we wrote a trace simulator for the cache architecture in C++. Cache configurations simulated: cache sizes (16, 32, and 64 entries) × cache associativities (1, 2, 4, and 8-way) × frequency counter field widths (4 to 32 bits) = 336 configurations.

19 Experimental Setup: Benchmarks: selected Powerstone benchmarks running on a 32-bit MIPS instruction set simulator, and selected MediaBench benchmarks running on SimpleScalar. Power consumption: UMC 0.18-micron CMOS technology running at 250 MHz at 1.8 V; cache memory power consumption obtained using the Artisan memory compiler; additional logic and functionality modeled in synthesizable VHDL using Synopsys Design Compiler.

20 Accuracy - Sum of Differences: We computed the average difference between the actual loop execution time percentage and the computed loop execution time percentage for the ten most frequently executed loops.
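
One plausible reading of this metric is sketched below. The slide title says "sum of differences" while the text says "average difference", so the sketch returns both; the function name, the use of absolute differences, and the sample percentages are assumptions for illustration only.

```python
def difference_metrics(actual, computed, top_n=10):
    """Compare actual and computed execution-time percentages for the top_n
    loops (ranked by actual percentage).

    Returns (sum of absolute differences, average absolute difference).
    Hypothetical sketch: using absolute differences is an assumption.
    """
    top_loops = sorted(actual, key=actual.get, reverse=True)[:top_n]
    diffs = [abs(actual[loop] - computed.get(loop, 0)) for loop in top_loops]
    total = sum(diffs)
    return total, total / len(diffs)


# Made-up percentages, not data from the presentation.
actual_pct = {"A": 50, "B": 30, "C": 20}
computed_pct = {"A": 48, "B": 33, "C": 19}
print(difference_metrics(actual_pct, computed_pct))
```

A loop that the detector missed entirely is treated here as having a computed share of zero, which penalizes evictions of frequent loops.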

21 Results - Sum of Differences [Chart: sum of differences results averaged over all Powerstone benchmarks]

22 Results - Best Cache Configuration: We determine the smallest possible cache configuration necessary to give good results. Overall best cache configuration: a 2-way, 32-entry cache with a frequency width of 24 bits. It gives 95% accuracy for Powerstone and 90% accuracy for MediaBench, with no change to system performance.

23 Results - Base System Power and Area: MIPS32 4Kp microprocessor core, an embedded processor with a cache. Small: area of 1.7 mm². Low power: … mW.

24 Results - Frequent Loop Detector Power Overhead: For the best cache configuration, the increase in average power consumption of the total system with the frequent loop detector is 2.4%. Power consumption of operations: 142 mW per cache read and increment; 156 mW per cache write; 20.7 mW per saturation. Average frequency of operations: cache updates – 4.25%; saturations – … %.

25 Results - Frequent Loop Detector Area Overhead: The resulting area overhead is 10.5% compared to the reported size of the MIPS32 4Kp. Our numbers are pessimistic, while reported microprocessor areas are likely optimistic. Area overhead: frequent loop cache controller, incrementor, and additional control/steering logic – 1400 gates (… mm²); cache including saturation logic – 0.167 mm².

26 Reducing Power Overhead via Frequent Update Coalescing: Since frequent loops tend to iterate many times, the same entry is updated in the frequent loop cache many times in a row. Coalescing consecutive sbb executions into one cache update reduces the number of cache updates. [Figure: an sbb trace in which four consecutive frequency increments are coalesced into a single add of 4 to the frequency]
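
The coalescing idea can be illustrated on a software trace. This is only a sketch of the concept: the function name is hypothetical, and real hardware would presumably bound the size of the coalescing counter, which this model does not.

```python
def coalesce(sbb_trace):
    """Collapse runs of identical consecutive sbb addresses into
    (address, count) cache updates.

    Hypothetical sketch: no upper bound on the run length is modeled.
    """
    updates = []
    for addr in sbb_trace:
        if updates and updates[-1][0] == addr:
            # Same sbb as the previous one: grow the pending update.
            updates[-1] = (addr, updates[-1][1] + 1)
        else:
            updates.append((addr, 1))
    return updates


trace = [0x40, 0x40, 0x40, 0x40, 0x80, 0x40, 0x40]
print(coalesce(trace))
```

In this trace, seven sbb executions turn into three cache updates, the first of which adds 4 to the frequency of the loop ending at `0x40`.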

27 Reducing Power Overhead via Frequent Update Coalescing [Charts: Powerstone and MediaBench]

28 Sampling for Further Reduced Power Overhead: Instead of tallying every sbb executed, only tally sbbs that occur at a fixed sampling interval. This method does not require interrupting the processor.
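
A small sketch shows why sampling can keep relative frequencies roughly intact. It assumes the sampling interval is counted in sbb executions (the slides do not say whether it is measured in sbbs, instructions, or cycles), and the function name and trace are made up for illustration.

```python
from collections import Counter


def sample_trace(sbb_trace, interval):
    """Tally only every `interval`-th sbb in the trace.

    Hypothetical sketch: the interval is assumed to count sbb executions.
    """
    counts = Counter()
    for position, addr in enumerate(sbb_trace, start=1):
        if position % interval == 0:
            counts[addr] += 1
    return counts


# Made-up trace: one loop runs 90 iterations, another runs 10.
trace = [0x40] * 90 + [0x80] * 10
print(sample_trace(trace, interval=10))
```

Here the cache is updated 10 times instead of 100, yet the 90/10 split between the two loops is still visible in the sampled counts. Short-lived loops can be missed entirely at large intervals, which is one way accuracy can change with sampling.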

29 Sum of Differences for Sampling [Charts: Powerstone and MediaBench]

30 Sum of Differences for Sampling: When going from a sampling interval of 1 to 50, average accuracy decreases for Powerstone benchmarks by 5% and increases for MediaBench benchmarks by 2%.

31 Results for a Sampling Interval of 50: Coalescing plus sampling reduces the average system power overhead to a mere 0.02%, still with no change to system performance. Average frequency of operations: cache updates – 0.03%; coalesces – … %; no saturations.

32 Example Use - Warp Processing: The detector has been successfully incorporated into a novel prototype system-on-a-chip architecture performing what is presently known as warp processing (also being developed at UCR).

33 Warp Processing: (1) Initially execute the application in software only. (2) Profile the application to determine critical regions. (3) Partition critical regions to hardware. (4) Program the configurable logic and update the software binary. (5) After some execution, time/energy are warped; the user is unaware of the configurable logic! [Figure components: µP, Mem, SW, profiler, dynamic partitioning module, configurable logic]

34 Conclusions: We have presented a frequent loop detector that is small, power-efficient, and non-intrusive, and that accurately provides relative frequencies of loops. It uses a 2-way set-associative, 32-entry cache with a 24-bit frequency counter. Power overhead is 2.4% compared to a low-power 32-bit embedded processor, and is easily reducible to well below 0.1% using simple coalescing and sampling methods. The detector is currently being used in the profiling step of the Warp processor at UCR.