Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File. Stephen Hines, Gary Tyson, and David Whalley, Computer Science Dept., Florida State University.


Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File
Stephen Hines, Gary Tyson, and David Whalley
Computer Science Dept., Florida State University
June 8-16, 2007

Instruction Packing
- Store frequently occurring instructions, as specified by the compiler, in a small, low-power Instruction Register File (IRF)
- Allow multiple instruction fetches from the IRF by packing instruction references together
  - Tightly packed: multiple IRF references in one instruction word
  - Loosely packed: piggybacks an IRF reference onto an existing instruction
- Facilitate parameterization of some instructions using an Immediate Table (IMM)
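The tight-packing idea can be sketched as simple bit packing. The slot count and field widths below are illustrative assumptions (a 32-entry IRF needs 5 bits per reference), not the talk's exact MIPS+IRF encoding:

```python
# Illustrative sketch of tight instruction packing: with a 32-entry IRF,
# each reference needs only 5 bits, so several references fit in one
# packed instruction word. Field layout here is assumed, not the paper's.

IRF_BITS = 5   # log2(32-entry IRF)
SLOTS = 5      # references per tightly packed word (assumption)

def pack_tight(indices):
    """Pack up to SLOTS IRF indices into one integer word."""
    assert len(indices) <= SLOTS
    word = 0
    for slot, idx in enumerate(indices):
        assert 0 <= idx < (1 << IRF_BITS)
        word |= idx << (slot * IRF_BITS)
    return word

def unpack_tight(word, count):
    """Recover `count` IRF indices from a packed word."""
    mask = (1 << IRF_BITS) - 1
    return [(word >> (slot * IRF_BITS)) & mask for slot in range(count)]
```

A single fetched word thus stands in for up to five IRF-resident instructions, which is where the fetch-bandwidth and code-size savings come from.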

Execution of IRF Instructions (figure): a packed instruction fetched from the instruction cache (indexed by the PC) is held in the IF/ID latch; during the fetch stage and the first half of the decode stage, its references index the IRF (and its parameters index the IMM, via the IRWP) before the expanded instructions (insn1, insn2, insn3 + imm3, insn4) reach the instruction decoder. The example shows execution of a tightly packed param4c instruction.

Outline
- Introduction
- IRF and Instruction Packing Overview
- Integrating an IRF with an L0 I-Cache
- Decoupling Instruction Fetch
- Experimental Evaluation
- Related Work
- Conclusions & Future Work

MIPS+IRF Instruction Formats (figure): the packed T-type format carries an opcode plus five 5-bit instruction slots (inst1-inst5), where trailing slots may instead hold parameters (param); the standard MIPS R-type (opcode, rs, rt, rd, shamt, function), I-type (opcode, rs, rt, immediate), and J-type (opcode, 24-bit target) formats are extended with IRF window bits (win).

Previous Work in IRF
- Register Windowing + Loop Cache (MICRO 2005)
- Compiler Optimizations (CASES 2006)
  - Instruction Selection
  - Register Renaming
  - Instruction Scheduling

Integrating an IRF with an L0 I-Cache
- L0 (filter) caches are small and direct-mapped:
  - Fast hit time
  - Low energy per access
  - Higher miss rate than L1
- A 256B L0 I-cache with 8B lines [Kin97]:
  - Fetch energy reduced 68%
  - Cycle time increased 46%(!)
- The IRF reduces code size, while an L0 cache focuses only on energy reduction, at the cost of performance
- The IRF can alleviate the performance penalty of L0 cache misses by overlapping fetch
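The L0 trade-off can be captured with a standard average-access-time model. The numbers below are illustrative, not the talk's measurements:

```python
# Back-of-the-envelope model of the L0/filter-cache trade-off: an L0 hit
# is cheap, but every L0 miss pays the L0 probe plus the L1 access, so
# average fetch latency grows with the L0 miss rate.

def avg_fetch_cycles(l0_miss_rate, l0_hit_cycles=1, l1_hit_cycles=1):
    """AMAT-style estimate: hit time + miss rate * miss penalty."""
    return l0_hit_cycles + l0_miss_rate * l1_hit_cycles
```

Even a modest 10% L0 miss rate stretches average fetch latency to 1.1 cycles, which is the performance cost that serving some fetches from the IRF helps hide.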

L0 Cache Miss Penalty (pipeline diagram): on an L0 miss, the missing instruction spends extra cycles in the memory stage, delaying fetch of the following instructions and stalling the pipeline.

Overlapping Fetch with an IRF (pipeline diagram): while an L0 miss is serviced, the instructions of a packed word (Pack2a, Pack2b) continue to flow from the IRF, overlapping the miss latency with useful execution.

Decoupling Instruction Fetch
- Instruction bandwidth in a pipeline is usually uniform (fetch, decode, issue, commit, ...), which artificially limits the effective design space
- Front-end throttling improves energy utilization by reducing fetch bandwidth in areas of low ILP
- The IRF can provide virtual front-end throttling:
  - Fetch fewer instructions each cycle, but allow multiple issue of packed instructions
  - Areas of high ILP are often densely packed
  - Infrequently executed sections of code exhibit lower ILP
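A toy model (entirely illustrative, not the talk's simulator) shows how a narrow fetch of packed words can still feed a wider backend: each fetched word expands to several instructions, and the backend issues up to its issue width from the buffered expansion.

```python
def throttled_issue(pack_degrees, issue_width):
    """Instructions issued each cycle when fetching one packed word per
    cycle; pack_degrees[i] is how many instructions word i expands to."""
    buffered = 0
    issued = []
    for degree in pack_degrees:
        buffered += degree          # expand the fetched word
        n = min(buffered, issue_width)
        issued.append(n)            # backend issues up to its width
        buffered -= n
    return issued
```

With densely packed code such as `[4, 1, 4, 1]`, one-wide fetch sustains a two-wide backend every cycle, which is the asymmetric-bandwidth design point the IRF makes attractive.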

Out-of-order Pipeline Configurations

Experimental Evaluation
- MiBench embedded benchmark suite: 6 categories representing common tasks for various domains
- SimpleScalar MIPS/PISA architectural simulator
  - Wattch/Cacti extensions for modeling energy consumption (with cc3 clock gating, inactive portions of the pipeline dissipate only 10% of normal energy)
- VPO: Very Portable Optimizer, targeted to SimpleScalar MIPS/PISA

L0 Study Configuration Data

Parameter            | Low-Power In-order Embedded Processor
I-Fetch Queue        | 4 entries
Branch Predictor     | Bimodal, 128 entries, 3 cycle penalty
Fetch/Decode/Issue   | Single instruction
RUU size             | 8
LSQ size             | 8
L1 Data Cache        | 16 KB, 256 lines, 16B line, 4-way set assoc., 1 cycle hit
L1 Instruction Cache | 16 KB, 256 lines, 16B line, 4-way set assoc., 1/2 cycle hit
L0 Instruction Cache | 256 B, 32 lines, 8B line, direct mapped, 1 cycle hit
Memory Latency       | 32 cycles
IRF/IMM              | 4 windows, 32-entry IRF (128 total), 32-entry IMM, 1 branch/pack

Execution Efficiency for L0 I-Caches

Energy Efficiency for L0 I-Caches

Decoupled Fetch Configurations

Parameter                 | High-end Out-of-order Embedded Processor
I-Fetch Queue             | 4/8 entries
Branch Predictor          | Bimodal, 2048 entries, 3 cycle penalty
Fetch Width               | 1 / 2 / 4
Decode/Issue/Commit Width | 1 / 2 / 3 / 4
RUU size                  | 16
LSQ size                  | 8
L1 Data Cache             | 32 KB, 512 lines, 16B line, 4-way set assoc., 1 cycle hit
L1 Instruction Cache      | 32 KB, 512 lines, 16B line, 4-way set assoc., 1 cycle hit
Unified L2 Cache          | 256 KB, 1024 lines, 64B line, 4-way set assoc., 6 cycle hit
Memory Latency            | 32 cycles
IRF/IMM                   | 4 windows, 32-entry IRF (128 total), 32-entry IMM, 1 branch/pack

Execution Efficiency for Asymmetric Pipeline Bandwidth

Energy Efficiency for Asymmetric Pipeline Bandwidth

Energy-Delay^2 for Asymmetric Pipeline Bandwidth

Related Work
- L-caches: subdivide the instruction cache so that one portion contains the most frequently accessed code
- Loop caches: capture simple loop behaviors and replay instructions
- Zero Overhead Loop Buffers (ZOLB)
- Pipeline gating / front-end throttling: stall fetch in areas of low IPC

Conclusions and Future Work
- Future topics:
  - Can we pack areas where the L0 is likely to miss?
  - IRF + encrypted or compressed I-caches
  - IRF + asymmetric frequency clustering (of pipeline back-end functional units)
- The IRF can alleviate fetch bottlenecks from L0 I-cache misses or branch mispredictions:
  - Increased IPC of the L0 system by 6.75%
  - Further decreased energy of the L0 system by 5.78%
- Decoupling fetch provides a wider spectrum of energy/performance design points to evaluate

The End. Questions?


Energy Consumption

Static Code Size

Conclusions & Future Work
- Compiler optimizations targeted specifically for the IRF can further reduce energy (12.2% → 15.8%), code size (16.8% → 28.8%), and execution time
- Unique transformation opportunities exist due to the IRF, such as code duplication for code size reduction, and predication
- As processor designs become more idiosyncratic, it is increasingly important to explore evolving existing compiler optimizations
- Register targeting and loop unrolling should also be explored with instruction packing
- Enhanced parameterization techniques


Instruction Redundancy
- Profiled the largest benchmark in each of the six MiBench categories
- The 32 most frequent instructions comprise 66.5% of total dynamic instructions and 31% of total static instructions
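The redundancy measurement behind this slide can be sketched as a simple profile analysis; the trace format and names below are hypothetical, not the talk's tooling:

```python
# Sketch of measuring dynamic-instruction redundancy: what fraction of a
# dynamic trace is covered by its n most frequent distinct instruction
# words. A high fraction means a small IRF captures most fetches.
from collections import Counter

def top_n_coverage(dynamic_trace, n=32):
    """Fraction of dynamic instructions covered by the n most frequent
    distinct instructions in the trace."""
    counts = Counter(dynamic_trace)
    covered = sum(c for _, c in counts.most_common(n))
    return covered / len(dynamic_trace)
```

Running such an analysis over a profiled trace is how one would arrive at a figure like "the top 32 instructions cover 66.5% of dynamic instructions", motivating a 32-entry IRF.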

Compilation Framework
