UPC Reducing Misspeculation Penalty in Trace-Level Speculative Multithreaded Architectures Carlos Molina ψ, ф Jordi Tubella ф Antonio González λ,ф ISHPC-VI,

Presentation transcript:

UPC
Reducing Misspeculation Penalty in Trace-Level Speculative Multithreaded Architectures
Carlos Molina ψ,ф, Jordi Tubella ф, Antonio González λ,ф
ISHPC-VI, Nara City (Japan) - September 7-9, 2005
λ Intel Barcelona Research Center, Intel Labs - UPC, Barcelona, Spain
ф Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain
ψ Dept. Enginyeria Informàtica, Universitat Rovira i Virgili, Tarragona, Spain

Techniques to Boost Instruction Execution
Computation Repetition: Data Value Reuse and Data Value Speculation
 - Avoid serialization caused by data dependences
 - Determine results of instructions without executing them
 - Target is to boost the execution of programs

Techniques to Boost Instruction Execution: Data Value Reuse
 - NON SPECULATIVE !!!
 - Buffers previous inputs and their corresponding outputs
 - Only possible if a computation has been done in the past
 - Inputs have to be ready at reuse test time
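The reuse test described on this slide can be pictured with a small table lookup. The following is a minimal sketch, not the mechanism from the slides: the `ReuseBuffer` class, its methods and the `execute_add` helper are hypothetical names chosen for illustration, and an unbounded dictionary stands in for a finite hardware buffer.

```python
from typing import Dict, Optional, Tuple


class ReuseBuffer:
    """Hypothetical reuse table: (pc, input values) -> previously computed output."""

    def __init__(self) -> None:
        self._table: Dict[Tuple[int, Tuple[int, ...]], int] = {}

    def record(self, pc: int, inputs: Tuple[int, ...], output: int) -> None:
        # Buffer the inputs of a finished computation with its output.
        self._table[(pc, tuple(inputs))] = output

    def reuse_test(self, pc: int, inputs: Tuple[int, ...]) -> Optional[int]:
        # The inputs must already be known here: reuse is non-speculative.
        return self._table.get((pc, tuple(inputs)))


def execute_add(buf: ReuseBuffer, pc: int, a: int, b: int) -> Tuple[int, bool]:
    """Return (result, reused). The addition is skipped when the reuse test hits."""
    hit = buf.reuse_test(pc, (a, b))
    if hit is not None:
        return hit, True
    result = a + b
    buf.record(pc, (a, b), result)
    return result, False


buf = ReuseBuffer()
first = execute_add(buf, 0x40, 2, 3)    # computed and recorded
second = execute_add(buf, 0x40, 2, 3)   # same inputs: reused
third = execute_add(buf, 0x40, 2, 4)    # different inputs: computed again
```

The second call hits only because the same computation was done before with the same inputs, which is the restriction the slide points out.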

Techniques to Boost Instruction Execution: Data Value Speculation
 - SPECULATIVE !!!
 - Predicts values as a function of the past history
 - Needs to confirm speculation at a later point
 - Solves reuse test but introduces misspeculation penalty
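To make the contrast with reuse concrete, here is a minimal sketch of value speculation using a last-value predictor. The slides do not say which predictor is used, so the last-value scheme, the `LastValuePredictor` class and the `speculate` helper are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional, Tuple


class LastValuePredictor:
    """Hypothetical predictor: guesses that an instruction repeats its last result."""

    def __init__(self) -> None:
        self._last: Dict[int, int] = {}

    def predict(self, pc: int) -> Optional[int]:
        return self._last.get(pc)

    def update(self, pc: int, actual: int) -> None:
        self._last[pc] = actual


def speculate(pred: LastValuePredictor, pc: int,
              compute: Callable[[], int]) -> Tuple[Optional[int], int, bool]:
    """Return (guess, actual, mispredicted).

    The guess is available before the inputs are ready, but it has to be
    confirmed later against the real result; a wrong guess is the
    misspeculation that costs a penalty.
    """
    guess = pred.predict(pc)
    actual = compute()              # confirmation at a later point
    pred.update(pc, actual)
    mispredicted = guess is not None and guess != actual
    return guess, actual, mispredicted


pred = LastValuePredictor()
r1 = speculate(pred, 8, lambda: 7)   # no history yet, no prediction
r2 = speculate(pred, 8, lambda: 7)   # predicted 7, confirmed
r3 = speculate(pred, 8, lambda: 9)   # predicted 7, actual 9: misspeculation
```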

Trace Level Speculation
 - Avoids serialization caused by data dependences
 - Skips multiple instructions in a row
 - Predicts values based on the past
 - Introduces penalties due to misspeculations
Two variants: Trace Level Speculation with Live Input Test and Trace Level Speculation with Live Output Test

Trace Level Speculation with Live Output Test
[Figure: NST and ST execution timelines. Labels: Live Output Update & Trace Speculation; Trace Misspeculation Detection & Recovery Actions; Buffer; Instruction Execution; Not Executed; Live Output Validation]

Motivation
Two orthogonal issues:
 - microarchitecture support for trace speculation
 - control and data speculation techniques
    - prediction of initial and final points
    - prediction of live output values
This work focuses on microarchitecture support (TSMA), specifically on reducing penalties due to misspeculations
 - Molina, González, Tubella, "Trace-Level Speculative Multithreaded Architecture (TSMA)", ICCD'02
 - Molina, González, Tubella, "Compiler Analysis for TSMA", INTERACT'05

Outline
 - TSMA (Trace-level Speculative Multithreaded Architecture)
 - Verification Engine
 - Enhanced Verification Engine
 - Experimental Framework
 - Simulation Results
 - Conclusions

TSMA Block Diagram
[Block diagram. Components: I Cache; Fetch Engine; Decode & Rename; Functional Units; Branch Predictor; Trace Speculation; NST Reorder Buffer; ST Reorder Buffer; NST Ld/St Queue; ST Ld/St Queue; NST I Window; ST I Window; Look Ahead Buffer; Verification Engine; Data Cache (L1NSDC, L2NSDC, L1SDC); NST Arch. Register File; ST Arch. Register File]

Verification Engine
ST stores its committed instructions in the LAB (Look-Ahead Buffer: I1, I2, I3, I4). For each instruction the LAB holds:
 - Program counter
 - Operation type
 - Source and destination register numbers
 - Source and destination register values
 - Effective address
NST verifies instructions from the LAB:
 - Source values are tested against the non-speculative state
 - If they match, the destination value is updated

Verification Engine
The verification engine takes instructions (I1, I2, I3, I4) from the Look-Ahead Buffer and checks them against the non-speculative register file and the non-speculative memory hierarchy:
 - BRANCHES (BRANCH Rsource1, Target): source value tested; program counter updated
 - ARITHMETIC INSTRUCTIONS (Rdest <- Rsource1 OP Rsource2): source values tested; destination register updated
 - STORES (M[Rsource1, literal] <- Rsource2): effective address verified; destination memory updated
 - LOADS (Rdest <- M[Rsource1, literal]): effective address verified; memory value checked; register updated
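The four verification rules can be sketched as a small software model. This is only an illustration under stated assumptions: `LabEntry`, `NonSpecState`, `verify` and `verify_lab` are hypothetical names, the effective address is assumed to be base register plus literal, and the first failing instruction is assumed to squash everything after it in the LAB.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class LabEntry:
    op: str                          # "branch", "arith", "store" or "load"
    src_regs: Tuple[int, ...]        # source register numbers
    src_vals: Tuple[int, ...]        # source values seen by the speculative thread (ST)
    dest_reg: Optional[int] = None
    dest_val: Optional[int] = None   # result produced by ST
    literal: int = 0
    eff_addr: Optional[int] = None   # address computed by ST (loads and stores)
    next_pc: int = 0                 # path followed by ST (branches)


@dataclass
class NonSpecState:
    regs: Dict[int, int]
    mem: Dict[int, int]
    pc: int = 0


def verify(e: LabEntry, s: NonSpecState) -> bool:
    """Apply one LAB entry to the non-speculative state; False means misspeculation."""
    # All instruction types: source values are tested first.
    if any(s.regs.get(r) != v for r, v in zip(e.src_regs, e.src_vals)):
        return False
    if e.op == "branch":
        s.pc = e.next_pc                      # program counter updated
    elif e.op == "arith":
        s.regs[e.dest_reg] = e.dest_val       # destination register updated
    elif e.op in ("store", "load"):
        # Assumed addressing mode: base register + literal.
        if s.regs[e.src_regs[0]] + e.literal != e.eff_addr:
            return False
        if e.op == "store":
            s.mem[e.eff_addr] = e.src_vals[1]  # destination memory updated
        else:
            if s.mem.get(e.eff_addr) != e.dest_val:
                return False                   # memory value checked
            s.regs[e.dest_reg] = e.dest_val    # register updated
    else:
        raise ValueError(f"unknown operation {e.op!r}")
    return True


def verify_lab(lab: List[LabEntry], s: NonSpecState) -> Tuple[int, int]:
    """Return (verified, squashed): verification stops at the first failure."""
    for i, e in enumerate(lab):
        if not verify(e, s):
            return i, len(lab) - i
    return len(lab), 0


state = NonSpecState(regs={1: 10, 2: 5}, mem={})
lab = [
    LabEntry("arith", (1, 2), (10, 5), dest_reg=3, dest_val=15),
    LabEntry("store", (1, 3), (10, 15), literal=4, eff_addr=14),
    LabEntry("load", (1,), (10,), dest_reg=4, dest_val=15, literal=4, eff_addr=14),
    LabEntry("arith", (4,), (99,), dest_reg=5, dest_val=100),   # ST saw a wrong value
    LabEntry("branch", (5,), (100,), next_pc=64),
]
verified, squashed = verify_lab(lab, state)
```

In this model the fourth entry fails its source test, so it and the branch after it are both discarded, even if the branch itself would have been correct.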

Squashed Instructions from LAB
On average, up to 85 instructions are squashed from the LAB at each thread synchronization

Correctly Executed Instructions
On average, over 20% of the squashed instructions were correctly executed by ST

Our Proposal: Enhanced Verification Engine
 - does not throw away execution results of instructions that are independent of the mispredicted point
 - reduces the number of instructions fetched and executed
 - thread synchronizations can be delayed or even aborted
 - verification of branches, loads, stores and single-cycle instructions is reconsidered

Related Work
 - Instruction reissue [Lipasti 1997, González & González 1997, Sato 1998]
 - Squash reuse [Sodani & Sohi 1997]
 - Control independence in trace processors [Rotenberg et al. 1997]
 - Dynamic control independence [Chou et al. 1999]
 - Register integration [Roth & Sohi 2000]

Enhanced Verification Engine
The enhanced verification engine takes instructions (I1, I2, I3, I4) from the Look-Ahead Buffer and, with the help of a functional unit (F.U.), checks them against the non-speculative register file and the non-speculative memory hierarchy:
 - BRANCHES (BRANCH Rsource1, Target): the branch target is validated instead of the source values.
 - ARITHMETIC INSTRUCTIONS (Rdest <- Rsource1 OP Rsource2): if the source values do not match, the instruction is re-executed.
 - STORES (M[Rsource1, literal] <- Rsource2): the effective address is re-computed if its verification fails, and memory is updated with the value obtained from the non-speculative architectural state.
 - LOADS (Rdest <- M[Rsource1, literal]): the effective address is re-computed if its verification fails, and the destination value obtained from memory is committed to the register file.
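The enhanced rules can be sketched in the same spirit. Again this is a hedged illustration rather than the hardware design: `LabEntry`, `enhanced_verify` and `run_lab` are hypothetical names, only `add` and `sub` stand in for single-cycle arithmetic, branches are assumed to be taken when their source register is non-zero, and unwritten memory is assumed to read as zero.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

ALU = {"add": operator.add, "sub": operator.sub}   # stand-ins for single-cycle ops


@dataclass
class LabEntry:
    op: str                          # "branch", "add", "sub", "store" or "load"
    src_regs: Tuple[int, ...]
    src_vals: Tuple[int, ...]        # source values seen by the speculative thread (ST)
    dest_reg: Optional[int] = None
    dest_val: Optional[int] = None
    literal: int = 0
    eff_addr: Optional[int] = None
    taken_pc: int = 0                # branch only
    fall_pc: int = 0                 # branch only
    st_next_pc: int = 0              # path ST actually followed


def enhanced_verify(e: LabEntry, regs: Dict[int, int], mem: Dict[int, int]) -> str:
    """Return "ok", "repaired" (fixed without synchronizing) or "sync"."""
    live = tuple(regs[r] for r in e.src_regs)      # non-speculative source values
    matched = live == e.src_vals
    if e.op == "branch":
        # Validate the target, not the source values (assumed: taken if non-zero).
        target = e.taken_pc if live[0] != 0 else e.fall_pc
        return "ok" if target == e.st_next_pc else "sync"
    if e.op in ALU:
        # Re-execute only when the source values do not match.
        regs[e.dest_reg] = e.dest_val if matched else ALU[e.op](*live)
        return "ok" if matched else "repaired"
    addr = live[0] + e.literal                     # address from non-speculative state
    if e.op == "store":
        mem[addr] = live[1]                        # value from non-speculative state
        return "ok" if matched and addr == e.eff_addr else "repaired"
    if e.op == "load":
        value = mem.get(addr, 0)                   # assumed: unwritten memory reads 0
        regs[e.dest_reg] = value                   # committed to the register file
        return "ok" if addr == e.eff_addr and value == e.dest_val else "repaired"
    raise ValueError(f"unknown operation {e.op!r}")


def run_lab(lab: List[LabEntry], regs: Dict[int, int], mem: Dict[int, int]) -> List[str]:
    """Verify entries in order, stopping after the first one that needs a sync."""
    statuses = []
    for e in lab:
        statuses.append(enhanced_verify(e, regs, mem))
        if statuses[-1] == "sync":
            break
    return statuses


regs = {1: 10, 2: 7, 6: 0}
mem: Dict[int, int] = {}
lab = [
    LabEntry("add", (1, 2), (10, 5), dest_reg=3, dest_val=15),      # ST saw r2 = 5
    LabEntry("store", (1, 3), (10, 15), literal=4, eff_addr=14),
    LabEntry("load", (1,), (10,), dest_reg=4, dest_val=15, literal=4, eff_addr=14),
    LabEntry("branch", (4,), (15,), taken_pc=100, fall_pc=8, st_next_pc=100),
    LabEntry("branch", (6,), (1,), taken_pc=200, fall_pc=12, st_next_pc=200),
]
statuses = run_lab(lab, regs, mem)
```

In this example the wrong value of r2 is repaired through the add, the store and the load, and the first branch still validates because its target is unchanged; only the second branch, whose direction really differs, forces a synchronization. The limit of one re-executed instruction per cycle is not modeled here.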

Incorrectly Speculated Instructions
[Chart: breakdown by category: Simple instructions, Branches, Loads, Stores, Rest]
 - On average, close to 90% of the instructions are branches, loads, stores and single-cycle instructions
 - Only 1% of the instructions inserted in the LAB are incorrectly predicted

Experimental Framework
 - Simulator: Alpha version of the SimpleScalar Toolset
 - Benchmarks: Spec2000, ref input
 - Maximum optimization level: DEC C & F77 compilers with -non_shared -O5
 - Statistics: collected for 250 million instructions, skipping an initial part of 500 million instructions

Simulation Parameters
Base microarchitecture
 - out-of-order machine, 4 instructions per cycle
 - I cache: 16KB, D cache: 16KB, L2 shared: 256KB
 - bimodal predictor
TSMA additional structures
 - each thread: instruction window, reorder buffer, register file
 - speculative data cache: 1KB
 - trace table: 128 entries, 4-way set associative
 - look ahead buffer: 128 entries
 - verification engine: up to 8 instructions per cycle
 - only one instruction re-executed per cycle
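For readers who want the parameters in one place, they can be written down as a configuration record. This is a sketch only: the `TsmaConfig` class and its field names are hypothetical and do not correspond to SimpleScalar options, and the two derived figures are plain arithmetic on the numbers listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TsmaConfig:
    """Hypothetical record of the simulation parameters listed on the slide."""
    issue_width: int = 4                 # instructions per cycle
    icache_kb: int = 16
    dcache_kb: int = 16
    l2_shared_kb: int = 256
    branch_predictor: str = "bimodal"
    spec_dcache_kb: int = 1              # speculative data cache
    trace_table_entries: int = 128
    trace_table_ways: int = 4
    lab_entries: int = 128               # look ahead buffer
    verify_per_cycle: int = 8            # verification engine bandwidth
    reexec_per_cycle: int = 1            # re-executions allowed per cycle

    def trace_table_sets(self) -> int:
        # entries / associativity
        return self.trace_table_entries // self.trace_table_ways

    def min_cycles_to_verify_full_lab(self) -> int:
        # Lower bound: a full LAB drained at peak verification bandwidth.
        return -(-self.lab_entries // self.verify_per_cycle)   # ceiling division


cfg = TsmaConfig()
sets = cfg.trace_table_sets()
cycles = cfg.min_cycles_to_verify_full_lab()
```

With these values the trace table has 32 sets, and a full look ahead buffer needs at least 16 cycles to verify when nothing has to be re-executed.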

Thread Synchronizations
[Chart: Conventional VE vs. Enhanced VE]
On average, the number of thread synchronizations is about 10 percentage points lower (from 30% to 20%)

Speedup
[Chart: Conventional VE vs. Enhanced VE]
On average, the performance improvement is around 9%

Reduction in Executed Instructions
On average, almost 8% fewer instructions are executed with the enhanced VE

Conclusions
TSMA
 - a significant number of instructions are correctly executed, but discarded when synchronizing
Novel hardware technique to enhance TSMA: Enhanced Verification Engine
 - thread synchronizations are delayed or even aborted
 - branches, loads, stores and single-cycle instructions are reconsidered
Results show
 - speedup of 38% (9% improvement)
 - misprediction rate of 20% (10% reduction)

Future Work
 - Aggressive trace level predictors
 - Generalization to multiple threads

UPC Questions & Answers ISHPC-VI, Nara City (Japan) - September 7-9, 2005