UPC Microarchitectural Techniques to Exploit Repetitive Computations and Values Carlos Molina Clemente Thesis Defense (Barcelona, December 14, 2005) Advisors: Antonio González and Jordi Tubella
2 Outline Motivation & Objectives Overview of Proposals To improve the memory system To speed-up the execution of instructions Non Redundant Data Cache Trace-Level Speculative Multithreaded Arch. Conclusions & Future Work
4 Motivation
Real-world programs and operating systems are general by design
Often designed with future expansion and code reuse in mind
Input sets have little variation
Repetition is relatively common, even with aggressive compilers
5 Types of Repetition
Repetition of computations and repetition of values, for a computation z = F(x, y)
6 Repetitive Computations
[chart omitted: percentages of repetitive computations]
Spec CPU2000, 500 million instructions
7 Types of Repetition
Repetition of values, for a computation z = F(x, y)
8 Repetitive Values
[chart omitted: percentages of repetitive values]
Spec CPU2000, 500 million instructions, analysis of destination value
9 Objectives
To improve the memory system: exploit value repetition of store instructions (redundant store instructions, non redundant data cache)
To speed-up the execution of instructions: exploit computation repetition of all instructions (redundant computation buffer (ILR), trace-level reuse (TLR), trace-level speculative multithreaded architecture (TLS))
10 Experimental Framework
Methodology: analysis of benchmarks, definition of proposal, evaluation of proposal
Tools: Atom, Cacti 3.0, SimpleScalar Tool Set
Benchmarks: Spec CPU95, Spec CPU2000
11 Outline Motivation & Objectives Overview of Proposals To improve the memory system To speed-up the execution of instructions Non Redundant Data Cache Trace-Level Speculative Multithreaded Arch. Conclusions & Future Work
12 Techniques to Improve Memory
Value repetition exploited via redundant stores and the non redundant cache
13 Contributions: Redundant Stores
Analysis of value repetition in the same storage location
Redundant stores applied to reduce memory traffic
A redundant store instruction does NOT modify memory: for STORE(i, Value Y), if the stored Value Y equals the Value X already held in memory, the store can be suppressed
Molina, González, Tubella, "Reducing Memory Traffic via Redundant Store Instructions", HPCN'99
Main results: 15%-25% of store instructions are redundant; 5%-20% memory traffic reduction
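The redundant-store check above can be sketched as follows. This is an illustrative model, not the thesis hardware: before a store writes to memory, the new value is compared with the one already held at that address, and the write (and its traffic) is suppressed when they match. The function and variable names are hypothetical.

```python
def store(memory, traffic, addr, value):
    """Perform a store, suppressing it when it is redundant."""
    if memory.get(addr) == value:
        return False          # redundant: memory is left untouched
    memory[addr] = value
    traffic.append(addr)      # only non-redundant stores generate traffic
    return True

memory, traffic = {}, []
store(memory, traffic, 0x10, 7)   # first store: performed
store(memory, traffic, 0x10, 7)   # same value: redundant, suppressed
store(memory, traffic, 0x10, 9)   # new value: performed
```

In this toy run only two of the three stores reach memory, mirroring the traffic reduction the slide reports.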
14 Non Redundant Data Cache (NRC)
Contributions: analysis of value repetition across several storage locations; a cache organization that stores each repeated value once (e.g., if Value A == Value D, lines Tag X and Tag Y share a single copy such as FFFF)
Molina, Aliagas, García, Tubella, González, "Non Redundant Data Cache", ISLPED'03
Aliagas, Molina, García, González, Tubella, "Value Compression to Reduce Power in Data Caches", EUROPAR'03
Main results: on average, a value is stored 4 times at any given time; NRC: -32% area, -13% energy, -25% latency, +5% miss rate
15 Outline Motivation & Objectives Overview of Proposals To improve the memory system To speed-up the execution of instructions Non Redundant Data Cache Trace-Level Speculative Multithreaded Arch. Conclusions & Future Work
16 Techniques to Speed-up Instruction Execution
Computation repetition exploited via data value reuse and data value speculation
Target is to speed-up the execution of programs: avoid serialization caused by data dependences, determine results of instructions without executing them
17 Data Value Reuse: NON SPECULATIVE
Buffers previous inputs and their corresponding outputs
Only possible if a computation has been done in the past
Inputs have to be ready at reuse test time
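Data value reuse amounts to memoizing computations in hardware. A minimal software sketch, with hypothetical names: a buffer maps a computation's opcode and input operands to the output it produced before, and a hit on the reuse test yields the result without re-executing.

```python
class ReuseBuffer:
    """Maps (opcode, inputs) -> previously produced output."""

    def __init__(self):
        self.table = {}

    def lookup(self, op, x, y):
        """Reuse test: succeeds only if this computation was done before."""
        return self.table.get((op, x, y))

    def record(self, op, x, y, result):
        """Buffer the inputs and their corresponding output."""
        self.table[(op, x, y)] = result

rb = ReuseBuffer()
rb.record("add", 2, 3, 5)
rb.lookup("add", 2, 3)   # hit: result obtained without executing
rb.lookup("add", 2, 4)   # miss: the instruction must execute
```

Note that the lookup needs both operand values, which is exactly why the inputs must be ready at reuse test time.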
18 Data Value Speculation: SPECULATIVE
Predicts values as a function of past history
Needs to confirm speculation at a later point
Solves the reuse test but introduces misspeculation penalty
19 Instruction level reuse and speculation: applied to a SINGLE instruction
20 Trace level reuse and speculation: applied to a GROUP of instructions
22 Redundant Computation Buffer (RCB): Instruction Level Reuse (ILR)
[diagram: a reuse table indexed between fetch and decode & rename, ahead of OOO execution and commit]
Molina, González, Tubella, "Dynamic Removal of Redundant Computations", ICS'99
Contributions: performance potential of ILR; the RCB
Main results: ideal ILR speed-up of 1.5; RCB speed-up of 1.1 (outperforms previous proposals)
23 Trace Level Reuse (TLR)
Contributions: the trace level reuse concept; initial design issues for integrating TLR; performance potential of TLR (a trace groups a dynamic sequence of instructions, e.g. I1..I6, reused as a unit)
González, Tubella, Molina, "Trace-Level Reuse", ICPP'99
Main results: ideal TLR speed-up of 3.6; with a 4K-entry table, 25% of reuse and an average trace size of 6
24 Trace Level Speculation (TLS)
Contributions: the Trace Level Speculative Multithreaded Architecture (TSMA) and compiler analysis to support it. Two orthogonal issues: microarchitecture support for trace speculation (TSMA) and control & data speculation techniques (static analysis based on profiling information)
Molina, González, Tubella, "Trace-Level Speculative Multithreaded Architecture (TSMA)", ICCD'02
Molina, González, Tubella, "Compiler Analysis for TSMA", INTERACT'05
Molina, Tubella, González, "Reducing Misspeculation Penalty in TSMA", ISHPC'05
Main results: speedup of 1.38 with 20% of misspeculations
25 Objectives & Proposals To improve the memory system To speed-up the execution of instructions Redundant store instructions Non redundant data cache Redundant computation buffer (ILR) Trace-level reuse buffer (TLR) Trace-level speculative multithreaded architecture (TLS)
26 Outline Motivation & Objectives Overview of Proposals To improve the memory system To speed-up the execution of instructions Non Redundant Data Cache Trace-Level Speculative Multithreaded Arch. Conclusions & Future Work
27 Motivation
Caches occupy close to 50% of total die area
Caches are responsible for a significant part of the total power dissipated by a processor
28 Data Value Repetition
[chart omitted: percentage of repetitive values vs percentage of time]
Spec CPU2000, 1 billion instructions, 256KB data cache
29 Conventional Cache
Value repetition: if Value A == Value D, lines Tag X and Tag Y each hold their own copy of the repeated value (e.g., FFFF stored twice, alongside 1234)
30 Non Redundant Data Cache
Tags index a pointer table that points into a shared value table, so a repeated value (e.g., FFFF) is stored once: die area reduction
31 Non Redundant Data Cache: Additional Hardware, Pointers
Each line's tag holds a pointer into the value table (e.g., entries 1234, FFFF, 0000)
32 Non Redundant Data Cache: Additional Hardware, Counters
Each value table entry keeps a counter of how many pointers reference it (e.g., 1, 2, 1)
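The pointer table, value table, and counters can be modeled together in a few lines. This is an illustrative software model of the organization, not the thesis RTL, and the class and field names are hypothetical: each line stores a pointer into a shared value table, and a reference counter per value entry tracks how many lines share it.

```python
class NRC:
    """Toy model of the Non Redundant Data Cache tables."""

    def __init__(self):
        self.tags = {}      # pointer table: tag -> index into value table
        self.values = []    # shared value table
        self.counts = []    # reference counter per value entry

    def insert(self, tag, value):
        if value in self.values:
            idx = self.values.index(value)   # linear scan; real HW would use CAM-like lookup
        else:
            self.values.append(value)        # first copy: allocate a VT entry
            self.counts.append(0)
            idx = len(self.values) - 1
        self.tags[tag] = idx
        self.counts[idx] += 1

    def read(self, tag):
        """A read indirects through the pointer into the value table."""
        return self.values[self.tags[tag]]

nrc = NRC()
nrc.insert("X", 0xFFFF)
nrc.insert("Y", 0xFFFF)   # same value: second line shares the VT entry
nrc.insert("Z", 0x1234)
```

The counters exist so an eviction can tell when the last sharer of a value entry is gone and the entry can be freed (eviction is omitted here for brevity).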
33 Data Value Inlining
Some values can be represented with a small number of bits (narrow values)
Narrow values can be inlined into the pointer area; a simple sign extension recovers them
Benefits: enlarges the effective capacity of the VT, reduces latency, reduces power dissipation
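A sketch of the inlining idea: if a value fits in the pointer field's bit width, store it there directly instead of allocating a value table entry, and recover it later by simple sign extension. The 8-bit pointer width and the function names are assumptions for illustration only.

```python
PTR_BITS = 8  # assumed pointer field width

def inline(value):
    """Return the inlined encoding, or None if the value is not narrow."""
    lo, hi = -(1 << (PTR_BITS - 1)), (1 << (PTR_BITS - 1)) - 1
    if lo <= value <= hi:
        return value & ((1 << PTR_BITS) - 1)   # truncate to the pointer field
    return None                                # too wide: goes to the value table

def sign_extend(encoded):
    """Recover the narrow value by simple sign extension."""
    if encoded & (1 << (PTR_BITS - 1)):        # top bit set: negative value
        return encoded - (1 << PTR_BITS)
    return encoded
```

Because an inlined value never touches the value table, the read avoids the indirection entirely, which is where the latency and power benefits come from.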
34 Data Value Inlining
[diagram: a narrow value is stored directly in the pointer field, while wide values (e.g., FFFF) remain in the value table]
35 Miss Rate vs Die Area
[chart omitted: miss ratio vs die area (0.1 to 1.0 cm²) for VT50, VT30, VT20, and CONV configurations at L2 cache sizes of 256KB, 512KB, 1MB, 2MB, 4MB]
Spec CPU2000, 1 billion instructions
36 Results Caches ranging from 256 KB to 4 MB
37 Outline Motivation & Objectives Overview of Proposals To improve the memory system To speed-up the execution of instructions Non Redundant Data Cache Trace-Level Speculative Multithreaded Arch. Conclusions & Future Work
38 Trace Level Speculation
Avoids serialization caused by data dependences
Skips multiple instructions in a row
Predicts values based on the past
Solves the live-input test
Introduces penalties due to misspeculations
39 Trace Level Speculation
Two orthogonal issues:
microarchitecture support for trace speculation: the Trace Level Speculative Multithreaded Architecture (TSMA), which does not introduce significant misspeculation penalties
control and data speculation techniques: prediction of initial and final points, prediction of live output values; compiler analysis based on static analysis that uses profiling data
40 Trace Level Speculation with Live Output Test
[diagram: instruction flow split between a non-speculative thread (NST) and a speculative thread (ST); a trace speculation updates the live outputs; instructions pass through execution, speculation, and validation; a trace speculation miss triggers detection & recovery actions]
41 TSMA Block Diagram
[diagram components: I-cache, fetch engine, branch predictor, trace speculation engine, decode & rename, functional units; per-thread NST and ST reorder buffers, load/store queues, instruction windows, and architectural register files; look ahead buffer; verification engine; data caches L1NSDC, L2NSDC, and L1SDC]
42 Compiler Analysis
Focuses on developing effective trace selection schemes for TSMA, based on static analysis that uses profiling data
Trace selection: graph construction (CFG & DDG), then graph analysis
43 Graph Analysis
Two important issues:
initial and final point of a trace: maximize trace length & minimize misspeculations
predictability of live output values: prediction accuracy and utilization degree
Three basic heuristics: Procedure Trace Heuristic, Loop Trace Heuristic, Instruction Chaining Trace Heuristic
44 Trace Speculation Engine
Traces are communicated to the hardware at program load time by filling a special hardware structure (the trace table)
Each entry of the trace table contains: initial PC, final PC, live-output values information, branch history, frequency counter
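A trace-table entry with the fields listed above, and a lookup keyed by the current PC, can be sketched as follows. The structure and names are illustrative assumptions, not the thesis implementation; in particular, indexing solely by initial PC is a simplification.

```python
from dataclasses import dataclass

@dataclass
class TraceEntry:
    initial_pc: int
    final_pc: int
    live_outputs: dict          # live-output values information
    branch_history: int = 0
    frequency: int = 0          # counts how often this trace was speculated

class TraceTable:
    """Filled at program load time with the traces selected by the compiler."""

    def __init__(self):
        self.entries = {}       # indexed by initial PC

    def add(self, entry):
        self.entries[entry.initial_pc] = entry

    def lookup(self, pc):
        """A hit on the current PC triggers trace speculation."""
        entry = self.entries.get(pc)
        if entry is not None:
            entry.frequency += 1
        return entry
```

On a hit, the hardware would jump the non-speculative thread past final_pc and seed the speculative thread with the predicted live outputs, as described in the TSMA slides.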
45 Simulation Parameters
Base microarchitecture: out-of-order machine, 4 instructions per cycle; I cache: 16KB, D cache: 16KB, L2 shared: 256KB; bimodal predictor; 64-entry ROB; FUs: 4 int, 2 div, 2 mul, 4 fp
TSMA additional structures: per thread, an I window, reorder buffer, and register file; speculative data cache: 1KB; trace table: 128 entries, 4-way set associative; look ahead buffer: 128 entries; verification engine: up to 8 instructions per cycle
46 Speedup Spec CPU2000, 250 million instructions
47 Misspeculations Spec CPU2000, 250 million instructions
48 Outline Motivation & Objectives Overview of Proposals To improve the memory system To speed-up the execution of instructions Non Redundant Data Cache Trace-Level Speculative Multithreaded Arch. Conclusions & Future Work
49 Conclusions Repetition is very common in programs Can be applied to improve the memory system to speed-up the execution of instructions Investigated several alternatives Novel cache organizations Instruction level reuse approach Trace level reuse concept Trace level speculation architecture
50 Future Work Value repetition in instruction caches Profiling to support data value reuse schemes Traces starting at different PCs Value prediction in TSMA Multiple speculations in TSMA Multiple threads in TSMA
51 Publications
Value Repetition in Cache Organizations: Reducing Memory Traffic Via Redundant Store Instructions, HPCN'99; Non Redundant Data Cache, ISLPED'03; Value Compression to Reduce Power in Data Caches, EUROPAR'03
Instruction & Trace Level Reuse: The Performance Potential of Data Value Reuse, TR-UPC-DAC'98; Dynamic Removal of Redundant Computations, ICS'99; Trace Level Reuse, ICPP'99
Trace Level Speculation: Trace-Level Speculative Multithreaded Architecture, ICCD'02; Compiler Analysis for TSMA, INTERACT'05; Reducing Misspeculation Penalty in TSMA, ISHPC'05