
UPC Microarchitectural Techniques to Exploit Repetitive Computations and Values Carlos Molina Clemente Thesis Defense (Barcelona, December 14, 2005) Advisors: Antonio González and Jordi Tubella

2 Outline Motivation & Objectives Overview of Proposals To improve the memory system To speed-up the execution of instructions Non Redundant Data Cache Trace-Level Speculative Multithreaded Arch. Conclusions & Future Work

3 Outline Motivation & Objectives Overview of Proposals To improve the memory system To speed-up the execution of instructions Non Redundant Data Cache Trace-Level Speculative Multithreaded Arch. Conclusions & Future Work

4 Motivation Real-world programs and operating systems are general by design, and are often designed with future expansion and code reuse in mind. Input sets have little variation. As a result, repetition is relatively common, even with aggressive compilers.

5 Types of Repetition [diagram: Repetition splits into repetition of Computations and repetition of Values, illustrated with z = F (x, y)]

6 Repetitive Computations [bar chart, y-axis 0%-100%] Spec CPU2000, 500 million instructions

7 Types of Repetition [diagram repeated: Repetition splits into Computations and Values, z = F (x, y)]

8 Repetitive Values [bar chart, y-axis 0%-100%] Spec CPU2000, 500 million instructions, analysis of destination value
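The two notions measured above can be illustrated with a toy software model (the trace format and the numbers are invented for the example; the thesis measurements are over real Spec CPU2000 traces): a computation z = F(x, y) repeats when the same static instruction sees the same inputs again, while a value repeats when a result produced before reappears.

```python
def repetition_stats(trace):
    """trace: list of (pc, inputs, result) tuples for dynamic instructions."""
    seen_computations = set()   # (pc, inputs) pairs observed so far
    seen_values = set()         # result values observed so far
    rep_comp = rep_val = 0
    for pc, inputs, result in trace:
        if (pc, inputs) in seen_computations:
            rep_comp += 1       # same static instruction, same inputs
        seen_computations.add((pc, inputs))
        if result in seen_values:
            rep_val += 1        # destination value seen before
        seen_values.add(result)
    return rep_comp, rep_val

trace = [
    (0x400, (1, 2), 3),
    (0x404, (3, 3), 9),
    (0x400, (1, 2), 3),   # repetitive computation (and value)
    (0x408, (0, 9), 9),   # repetitive value only
]
```

Note that every repetitive computation is also a repetitive value, but not vice versa, which is why the value-repetition chart sits above the computation-repetition one.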

9 Objectives To improve the memory system: exploit value repetition of store instructions — redundant store instructions, non-redundant data cache. To speed-up the execution of instructions: exploit computation repetition of all instructions — redundant computation buffer (ILR), trace-level reuse (TLR), trace-level speculative multithreaded architecture (TLS).

10 Experimental Framework Methodology: analysis of benchmarks; definition of proposal; evaluation of proposal. Tools: Atom; Cacti 3.0; SimpleScalar Tool Set. Benchmarks: Spec CPU95; Spec CPU2000.

11 Outline Motivation & Objectives Overview of Proposals To improve the memory system To speed-up the execution of instructions Non Redundant Data Cache Trace-Level Speculative Multithreaded Arch. Conclusions & Future Work

12 Techniques to Improve Memory [diagram: Value Repetition exploited via Redundant Stores and the Non Redundant Cache]

13 Redundant Stores. Contributions: analysis of value repetition into the same storage location; redundant stores applied to reduce memory traffic. A redundant store instruction does NOT modify memory: STORE (i, Value Y) is redundant when the target location already holds Value X and Value X == Value Y. Molina, González, Tubella, “Reducing Memory Traffic via Redundant Store Instructions”, HPCN’99. Main results: 15%-25% of store instructions are redundant; 5%-20% memory traffic reduction.
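The redundant-store check can be sketched in software (a toy model with invented addresses and values; the HPCN’99 proposal is a hardware mechanism, not this code): a store whose value matches what the location already holds can be dropped without changing memory state.

```python
def filter_redundant_stores(memory, stores):
    """memory: dict addr -> value; stores: list of (addr, value).
    Returns only the stores that actually change memory, applying them."""
    needed = []
    for addr, value in stores:
        if memory.get(addr) == value:
            continue            # redundant: location already holds value
        memory[addr] = value
        needed.append((addr, value))
    return needed

mem = {0x10: 5}
kept = filter_redundant_stores(mem, [(0x10, 5), (0x10, 7), (0x20, 0)])
```

Only the non-redundant stores would reach the cache or the bus, which is where the traffic reduction comes from.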

14 Non Redundant Data Cache (NRC). Contributions: analysis of value repetition across several storage locations — e.g. two cache lines (Tag X, Value A … Value B) and (Tag Y, Value C … Value D) repeat a value if Value A == Value D (say both hold FFFF). Molina, Aliagas, García, Tubella, González, “Non Redundant Data Cache”, ISLPED’03; Aliagas, Molina, García, González, Tubella, “Value Compression to Reduce Power in Data Caches”, EUROPAR’03. Main results: on average, a value is stored 4 times at any given time; NRC: -32% area, -13% energy, -25% latency, +5% miss rate.
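The value-sharing idea behind the NRC can be modeled in a few lines of software (the class and its API are invented for illustration; the real design is a fixed-geometry pointer table plus value table, and the reference counts here stand in for its per-entry counters): repeated values occupy one shared value-table entry that several pointer-table entries reference.

```python
class NonRedundantCache:
    """Toy model: tags hold pointers into a shared value table."""

    def __init__(self):
        self.pointers = {}      # tag -> index into the value table
        self.values = []        # shared value table (unique live values)
        self.refcount = []      # how many tags point at each entry

    def write(self, tag, value):
        self.evict(tag)
        for i, v in enumerate(self.values):
            if v == value and self.refcount[i] > 0:
                self.refcount[i] += 1       # share the existing entry
                self.pointers[tag] = i
                return
        self.values.append(value)           # new unique value
        self.refcount.append(1)
        self.pointers[tag] = len(self.values) - 1

    def evict(self, tag):
        if tag in self.pointers:
            self.refcount[self.pointers.pop(tag)] -= 1

    def read(self, tag):
        return self.values[self.pointers[tag]]
```

If a value is stored 4 times on average, the value table can be a fraction of the conventional data array, which is the source of the area and latency savings.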

15 Outline Motivation & Objectives Overview of Proposals To improve the memory system To speed-up the execution of instructions Non Redundant Data Cache Trace-Level Speculative Multithreaded Arch. Conclusions & Future Work

16 Techniques to Speed-up Instruction Execution [diagram: Computation Repetition exploited via Data Value Reuse and Data Value Speculation] Target is to speed-up the execution of programs: avoid serialization caused by data dependences; determine results of instructions without executing them.

17 Data Value Reuse — NON-SPECULATIVE: buffers previous inputs and their corresponding outputs; only possible if the computation has been done in the past; inputs have to be ready at reuse-test time.

18 Data Value Speculation — SPECULATIVE: predicts values as a function of past history; needs to confirm the speculation at a later point; solves the reuse test but introduces misspeculation penalty.
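Data value speculation builds on a value predictor; the simplest textbook scheme is a last-value predictor, sketched here in software (this is a generic illustration, not necessarily the predictor evaluated in the thesis): predict the result an instruction produced last time, then verify after execution and pay a penalty on a mismatch.

```python
class LastValuePredictor:
    """Toy last-value predictor indexed by instruction PC."""

    def __init__(self):
        self.table = {}         # pc -> last observed result

    def predict(self, pc):
        return self.table.get(pc)   # None on a table miss

    def update(self, pc, result):
        """Record the true result; return whether the speculation
        made from the old table contents would have verified."""
        correct = self.table.get(pc) == result
        self.table[pc] = result
        return correct
```

This makes the trade-off on the slide concrete: unlike reuse, a prediction is available even before the inputs are ready, but a wrong guess must be detected and squashed later.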

19 Techniques to Speed-up Instruction Execution [diagram: reuse and speculation each split into Instruction Level and Trace Level] At instruction level, the technique is applied to a SINGLE instruction.

20 Techniques to Speed-up Instruction Execution [same diagram] At trace level, the technique is applied to a GROUP of instructions.

21 Techniques to Speed-up Instruction Execution [summary diagram: Computation Repetition → Data Value Reuse and Data Value Speculation, each at Instruction Level and Trace Level]

22 Redundant Computation Buffer (RCB) — Instruction-Level Reuse (ILR). [pipeline diagram: a reuse table indexed at fetch; decode & rename; out-of-order execution; commit] Molina, González, Tubella, “Dynamic Removal of Redundant Computations”, ICS’99. Contributions: performance potential of ILR; the RCB. Main results: ideal ILR speed-up of 1.5; RCB speed-up of 1.1 (outperforms previous proposals).
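The ILR reuse test amounts to a table lookup keyed by the instruction and its current inputs — a software caricature of the RCB (table organization, indexing, and replacement are all omitted or invented here): a hit returns the buffered result without executing, and is non-speculative because the inputs are compared before the result is used.

```python
def execute_with_reuse(pc, inputs, compute, reuse_table):
    """Return (result, reused). `compute` stands for the functional unit;
    `reuse_table` maps (pc, inputs) -> previously computed result."""
    key = (pc, inputs)
    if key in reuse_table:
        return reuse_table[key], True    # reuse hit: execution skipped
    result = compute(*inputs)            # miss: execute and remember
    reuse_table[key] = result
    return result, False
```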

23 Trace-Level Reuse (TLR). [diagram: instructions I1 I2 I3 I4 I5 I6 grouped into a TRACE] González, Tubella, Molina, “Trace-Level Reuse”, ICPP’99. Contributions: performance potential of TLR; initial design issues for integrating TLR. Main results: ideal TLR speed-up of 3.6; with a 4K-entry table, 25% of reuse and an average trace size of 6.
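Trace-level reuse generalizes the per-instruction test to a whole dynamic sequence: a trace is summarized by its live inputs and live outputs, and when the live inputs match, the outputs are written in one step and every instruction in the trace is skipped. A hedged software sketch (the trace-table format and register names are invented):

```python
def try_reuse_trace(trace_table, start_pc, regs):
    """trace_table: start_pc -> (live_in, live_out, next_pc), where
    live_in / live_out are dicts register -> value.
    On a reuse hit, apply live_out to regs and return the PC after the
    trace; otherwise return None and let normal execution proceed."""
    entry = trace_table.get(start_pc)
    if entry is None:
        return None
    live_in, live_out, next_pc = entry
    if all(regs.get(r) == v for r, v in live_in.items()):
        regs.update(live_out)   # whole trace skipped in one step
        return next_pc
    return None
```

The design issues on the slide follow from this sketch: live-input sets must be small enough to compare quickly, and memory values in the live set complicate the test.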

24 Trace-Level Speculation (TLS). Contributions: Trace-Level Speculative Multithreaded Architecture (TSMA); compiler analysis to support TSMA. Two orthogonal issues: microarchitecture support for trace speculation (TSMA); control and data speculation techniques (static analysis based on profiling info). Molina, González, Tubella, “Trace-Level Speculative Multithreaded Architecture (TSMA)”, ICCD’02; Molina, González, Tubella, “Compiler Analysis for TSMA”, INTERACT’05; Molina, Tubella, González, “Reducing Misspeculation Penalty in TSMA”, ISHPC’05. Main results: speed-up of 1.38 with 20% of misspeculations.

25 Objectives & Proposals To improve the memory system: redundant store instructions; non-redundant data cache. To speed-up the execution of instructions: redundant computation buffer (ILR); trace-level reuse buffer (TLR); trace-level speculative multithreaded architecture (TLS).

26 Outline Motivation & Objectives Overview of Proposals To improve the memory system To speed-up the execution of instructions Non Redundant Data Cache Trace-Level Speculative Multithreaded Arch. Conclusions & Future Work

27 Motivation Caches take close to 50% of total die area, and are responsible for a significant part of the total power dissipated by a processor.

28 Data Value Repetition [chart: percentage of repetitive values vs. percentage of time] Spec CPU2000, 1 billion instructions, 256KB data cache

29 Conventional Cache [diagram: lines (Tag X, Value A … Value B) and (Tag Y, Value C … Value D); value repetition when Value A == Value D, e.g. both hold FFFF]

30 Non Redundant Data Cache [diagram: Tag X and Tag Y entries in a Pointer Table both point to the single FFFF entry in the Value Table] Die area reduction.

31 Non Redundant Data Cache — Additional Hardware: Pointers [diagram: Pointer Table entries for Tag X and Tag Y pointing into a Value Table holding 1234, FFFF, 0000]

32 Non Redundant Data Cache — Additional Hardware: Counters [diagram: each Value Table entry (1234, FFFF, 0000) extended with a reference counter (1, 2, 1)]

33 Data Value Inlining Some values can be represented with a small number of bits (narrow values). Narrow values can be inlined into the pointer area; a simple sign extension recovers them. Benefits: enlarges the effective capacity of the VT; reduces latency; reduces power dissipation.
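The inlining decision and the sign extension can be sketched as follows (the 8-bit pointer-field width is an assumption for the example, not the evaluated configuration): a value fits inline when it survives a round trip through the narrow field.

```python
PTR_BITS = 8                   # pointer-field width (assumed for the example)

def fits_inline(value):
    """True if `value` is representable in PTR_BITS two's-complement bits."""
    lo, hi = -(1 << (PTR_BITS - 1)), (1 << (PTR_BITS - 1)) - 1
    return lo <= value <= hi

def inline_encode(value):
    """Truncate to the PTR_BITS-wide field stored in the pointer area."""
    return value & ((1 << PTR_BITS) - 1)

def inline_decode(bits):
    """Sign-extend the PTR_BITS-wide field back to a full integer."""
    sign = 1 << (PTR_BITS - 1)
    return (bits ^ sign) - sign
```

Inlined values never touch the Value Table, which is why the technique both frees VT entries and shortens the read path.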

34 Data Value Inlining [diagram: narrow values (0, F, 2) inlined directly in the Pointer Table entries for Tag X and Tag Y, while wide values remain in the Value Table]

35 Miss Rate vs Die Area [chart: miss ratio vs. die area (0.1 to 1.0 cm²) for the VT50, VT30, VT20 and CONV configurations; L2 cache sizes 256KB, 512KB, 1MB, 2MB, 4MB] Spec CPU2000, 1 billion instructions

36 Results Caches ranging from 256 KB to 4 MB

37 Outline Motivation & Objectives Overview of Proposals To improve the memory system To speed-up the execution of instructions Non Redundant Data Cache Trace-Level Speculative Multithreaded Arch. Conclusions & Future Work

38 Trace Level Speculation Avoids serialization caused by data dependences. Skips multiple instructions in a row. Predicts values based on the past. Solves the live-input test, but introduces penalties due to misspeculations.

39 Trace Level Speculation Two orthogonal issues: microarchitecture support for trace speculation — the Trace Level Speculative Multithreaded Architecture (TSMA), which does not introduce significant misspeculation penalties; and control and data speculation techniques — prediction of initial and final points, and prediction of live-output values — provided by compiler analysis based on static analysis that uses profiling data.

40 Trace Level Speculation with Live Output Test [diagram: the non-speculative thread (NST) performs instruction execution and instruction validation; the speculative thread (ST) performs instruction speculation; live-output update drives trace speculation, and a trace misspeculation triggers detection & recovery actions in the instruction flow]

41 TSMA Block Diagram [diagram: I-cache, fetch engine, branch predictor, trace speculation engine, decode & rename, functional units; per-thread structures for NST and ST — instruction windows, reorder buffers, load/store queues, architectural register files; look-ahead buffer and verification engine; L1/L2 non-speculative data caches (L1NSDC, L2NSDC) and L1 speculative data cache (L1SDC)]

42 Compiler Analysis Focuses on developing effective trace-selection schemes for TSMA, based on static analysis that uses profiling data. [diagram: graph construction (CFG & DDG) → graph analysis → trace selection]

43 Graph Analysis Two important issues: initial and final point of a trace — maximize trace length & minimize misspeculations; predictability of live-output values — prediction accuracy and utilization degree. Three basic heuristics: procedure trace heuristic; loop trace heuristic; instruction chaining trace heuristic.

44 Trace Speculation Engine Traces are communicated to the hardware at program loading time, filling a special hardware structure (the trace table). Each entry of the trace table contains: initial PC; final PC; live-output values information; branch history; frequency counter.
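The trace-table entry listed above can be sketched as a record (field types are assumed from the slide's list; sizes, encodings, and the replacement policy are not specified here). A lookup keyed by initial PC and branch history — an assumption about how the history field is used — selects the trace to speculate on.

```python
from dataclasses import dataclass

@dataclass
class TraceTableEntry:
    initial_pc: int
    final_pc: int
    live_outputs: dict          # register/location -> predicted value
    branch_history: int         # path bits that select this trace
    frequency: int = 0          # profiling/usage counter

def lookup(trace_table, pc, history):
    """trace_table: (initial_pc, branch_history) -> TraceTableEntry."""
    entry = trace_table.get((pc, history))
    if entry is not None:
        entry.frequency += 1    # track how often the trace fires
    return entry
```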

45 Simulation Parameters Base microarchitecture: out-of-order machine, 4 instructions per cycle; I-cache: 16KB, D-cache: 16KB, shared L2: 256KB; bimodal branch predictor; 64-entry ROB; FUs: 4 int, 2 div, 2 mul, 4 FP. TSMA additional structures: per-thread instruction window, reorder buffer, and register file; speculative data cache: 1KB; trace table: 128 entries, 4-way set-associative; look-ahead buffer: 128 entries; verification engine: up to 8 instructions per cycle.

46 Speedup Spec CPU2000, 250 million instructions

47 Misspeculations Spec CPU2000, 250 million instructions

48 Outline Motivation & Objectives Overview of Proposals To improve the memory system To speed-up the execution of instructions Non Redundant Data Cache Trace-Level Speculative Multithreaded Arch. Conclusions & Future Work

49 Conclusions Repetition is very common in programs, and can be exploited both to improve the memory system and to speed-up the execution of instructions. Several alternatives were investigated: novel cache organizations; an instruction-level reuse approach; the trace-level reuse concept; a trace-level speculation architecture.

50 Future Work Value repetition in instruction caches Profiling to support data value reuse schemes Traces starting at different PCs Value prediction in TSMA Multiple speculations in TSMA Multiple threads in TSMA

51 Publications Value Repetition in Cache Organizations: Reducing Memory Traffic Via Redundant Store Instructions, HPCN'99; Non Redundant Data Cache, ISLPED'03; Value Compression to Reduce Power in Data Caches, EUROPAR'03. Instruction & Trace Level Reuse: The Performance Potential of Data Value Reuse, TR-UPC-DAC'98; Dynamic Removal of Redundant Computations, ICS'99; Trace Level Reuse, ICPP'99. Trace Level Speculation: Trace-Level Speculative Multithreaded Architecture, ICCD'02; Compiler Analysis for TSMA, INTERACT'05; Reducing Misspeculation Penalty in TSMA, ISHPC'05.
