Ann Gordon-Ross and Frank Vahid*

A First Look at the Interplay of Code Reordering and Configurable Caches
Ann Gordon-Ross and Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
Nikil Dutt
Center for Embedded Computer Systems, School of Information and Computer Science, University of California, Irvine
This work was supported by the U.S. National Science Foundation and by the Semiconductor Research Corporation

Optimizations
Optimization is an important part of the design of an application or system:
- Area
- Performance
- Power and/or energy

Instruction Cache Optimizations
The instruction cache is a good candidate for optimization (Gordon-Ross '04):
- Instruction caches have predictable spatial and temporal locality; 90% of execution time is spent in 10% of the code
- Power hungry: 29% of power consumption on the ARM920T (Segars '01)

Instruction Cache Tuning - Code Reordering
Tune the instruction stream for increased cache utilization and thus increased performance:
- Reorder the code so that infrequently executed regions of code do not pollute the instruction cache
- Code reordering is typically applied at link time; runtime methods do exist, but they incur undesirable runtime overhead
(Figure: toolchain flow - the application is compiled, the object files are reordered at link time, and the executable is downloaded and executed)

Instruction Cache Tuning - Code Reordering
(Figure: control-flow example. A loop reads input, checks validity, and processes it; the "process input" path executes 100 times while the error-handling routine executes once. After code reordering, the hot read/process path is laid out as straight-line code and the rarely executed error-handling routine is moved out of line.)

Instruction Cache Tuning - Configurable Cache Tuning
Tune the cache to the instruction stream for decreased energy and/or increased performance:
- Cache tuning can be performed during application/platform design, or even in-system during runtime, incurring no runtime overhead (Zhang - DATE'04)

Instruction Cache Tuning - Configurable Cache Tuning
Tunable parameters include:
- Total cache size
- Cache associativity
- Cache line size

Motivation - Code Reordering + Cache Configuration
Code reordering tunes the instruction stream for the cache; cache configuration tunes the cache to the instruction stream
How do these optimizations affect each other? Complement? Obviate? Degrade?

Pettis and Hansen Code Reordering
Many current code reordering techniques are based heavily on the Pettis and Hansen code reordering algorithm (1990):
- Reorders basic blocks using edge profiling to increase locality
- Orders basic blocks so that the most frequently executed path through the basic blocks is placed as straight-line code

Pettis and Hansen Bottom-up Positioning Algorithm
- Process arc weights in decreasing order
- For each arc, merge the basic blocks at the source and destination of the arc to form a chain
- If one of the blocks is already in the middle of a chain, form a new chain
(Figure: a control-flow graph of basic blocks annotated with execution frequencies, and the reordered basic block chains produced from it)
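The bottom-up positioning steps above can be sketched as follows; this is a minimal illustrative sketch, and the edge-list format and block names are ours, not taken from the paper or from any particular link-time optimizer.

```python
def bottom_up_chains(edges):
    """Greedy Pettis-and-Hansen-style chain formation from edge profiles.

    edges: (source_block, dest_block, execution_count) tuples.
    Blocks joined by hot arcs end up adjacent, so the most frequently
    executed path becomes straight-line (fall-through) code.
    """
    chain_of = {}  # block name -> the chain (list) that contains it
    for src, dst, _ in sorted(edges, key=lambda e: e[2], reverse=True):
        a = chain_of.setdefault(src, [src])
        b = chain_of.setdefault(dst, [dst])
        # Merge only if the arc can become a fall-through edge: src must
        # end its chain and dst must begin its chain. Otherwise one of
        # the blocks is already in the middle of a chain, and the
        # chains stay separate.
        if a is not b and a[-1] == src and b[0] == dst:
            a.extend(b)
            for blk in b:
                chain_of[blk] = a
    unique = []
    for chain in chain_of.values():
        if not any(chain is c for c in unique):
            unique.append(chain)
    return unique
```

On the flowchart from the earlier slide, the hot read/process arc (weight 100) merges those blocks into one chain, while the error-handling block (weight 1) is left in its own chain, i.e. moved out of the hot straight-line path.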

Configurable Cache Architecture
We used the configurable cache architecture proposed by Zhang (ISCA'03)

Configurable Cache Architecture
The base cache consists of four 2-KByte banks that may individually be shut down for size configuration (way shutdown, e.g. 8 KBytes reduced to 4 KBytes)
Way concatenation allows for configurable associativity (e.g. the 8-KByte cache configured as 2-way)
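As a rough model of how those bank-level knobs determine cache geometry (an illustrative sketch: the 2-KByte bank size comes from the slide, but the function and its names are ours):

```python
BANK_BYTES = 2 * 1024  # each of the four base banks is 2 KBytes

def cache_geometry(active_banks, ways, line_bytes):
    """Size and set count for one configuration of the banked cache.

    active_banks: banks left powered on (way shutdown) -- 1, 2, or 4
    ways: associativity after way concatenation
    Returns (total_bytes, num_sets).
    """
    assert active_banks in (1, 2, 4) and ways <= active_banks
    total_bytes = active_banks * BANK_BYTES
    num_sets = total_bytes // (ways * line_bytes)
    return total_bytes, num_sets
```

For example, all four banks with each bank acting as one way gives the full 8-KByte 4-way cache; concatenating the banks in pairs gives 8 KBytes 2-way; shutting down two banks and concatenating the remaining pair gives a 4-KByte direct-mapped cache.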

Configurable Cache Heuristic
First tune cache size: 2, 4, and 8 KBytes
…then tune cache line size: 16, 32, and 64 bytes
…and finally tune cache associativity: direct-mapped, 2-way, and 4-way
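A minimal sketch of that parameter-at-a-time search, assuming a caller-supplied energy estimator (the actual heuristic explores each parameter incrementally; here each step simply takes the per-parameter minimum, which visits the same candidate values):

```python
def tune_cache(energy_of):
    """Greedy parameter-at-a-time cache tuning.

    energy_of(size_kb, line_bytes, ways) -> estimated energy of one
    configuration (e.g. computed from simulated hit/miss ratios).
    Each parameter is fixed at its best value before tuning the next,
    so only 3 + 3 + 3 configurations are evaluated instead of all 27.
    """
    size_kb = min([2, 4, 8], key=lambda s: energy_of(s, 16, 1))
    line_bytes = min([16, 32, 64], key=lambda l: energy_of(size_kb, l, 1))
    ways = min([1, 2, 4], key=lambda w: energy_of(size_kb, line_bytes, w))
    return size_kb, line_bytes, ways
```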

Evaluation Framework
Benchmarks: Powerstone, MediaBench, and EEMBC
- Instrument the executable with PLTO* (Pentium Link Time Optimizer) and execute the application to gather edge profiles
- Provide the edge profiles to PLTO to perform code reordering, producing a code-reordered executable
- Execute each application and feed the hit and miss ratios for each configuration to the cache exploration heuristic, which selects a cache configuration; an exhaustive search is run for comparison purposes
- Energy model: cache energy from Cacti, main memory energy from a Samsung memory
*Provided by the University of Arizona
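The hit/miss ratios and per-access energies can be combined along these lines (an illustrative sketch: the per-event energy values below are placeholders, not the actual Cacti or Samsung figures):

```python
def access_energy(accesses, miss_ratio, e_cache_access, e_mem_access):
    """Dynamic energy of the cache/main-memory pair for one run.

    Every access pays the cache access cost (e.g. from Cacti); each
    miss additionally pays the main-memory access cost (e.g. from a
    memory datasheet). Energies are per-event, in arbitrary units.
    """
    misses = accesses * miss_ratio
    return accesses * e_cache_access + misses * e_mem_access
```

A configuration's total energy then drops either when tuning lowers the miss ratio or when a smaller configuration lowers the per-access cost.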

Results - Energy Savings
Base cache = 2 KB, direct-mapped, 16-byte line size
(Figure: normalized energy for four configurations per benchmark - base cache without and with code reordering, configured cache without and with code reordering)
- Code reordering alone = 3.5% energy reduction
- Cache configuration alone = 15% energy reduction
- Cache configuration + code reordering = 17% energy reduction

Results - Performance Benefits
(Figure: normalized performance for the same four configurations per benchmark)
- Code reordering alone = 3.5% performance benefit
- Cache configuration alone = 17% performance benefit
- Cache configuration + code reordering = 18.5% performance benefit
On average, code reordering gives little additional benefit over cache configuration alone; however, a few benchmarks see added benefits.

Change in Cache Requirements Due to Code Reordering
(Table: per-benchmark changes across the Powerstone, MediaBench, and EEMBC suites; "x" marks a larger line size, "*" marks a smaller cache size, and highlighting marks a reduction in cache area.)

Conclusions
- We explore the interplay of two instruction cache optimization techniques: code reordering and cache configuration
- Cache configuration largely obviates the need for code reordering with respect to energy and performance
- Cache configuration applied dynamically during runtime eliminates the need for designer-applied code reordering
- Code reordering improved cache utilization in 52% of the benchmarks, reducing instruction cache size by an average of 13% and by as much as 90% - beneficial for small custom-synthesized embedded systems where area is critical

Future Work
- We plan to use a more advanced code reordering methodology that takes into account set associativity or multiple levels of cache
- We plan to study the iterative interplay of code reordering and cache configuration using a code reordering technique that takes the cache configuration into consideration