Link-Time Path-Sensitive Memory Redundancy Elimination Manel Fernández and Roger Espasa Computer Architecture Department Universitat Politècnica de Catalunya Barcelona, Spain

Presentation transcript:

Link-Time Path-Sensitive Memory Redundancy Elimination Manel Fernández and Roger Espasa Computer Architecture Department Universitat Politècnica de Catalunya Barcelona, Spain

Motivation
- The memory "gap"
  - Processor speed increases faster than memory speed
    - L1-cache latency continues to increase
  - Memory operations remain a significant bottleneck
- Memory redundancy
  - Instructions that repeatedly access the same location
    - Lots of memory operations are redundant
  - Hardware designers exploit memory redundancy
    - E.g., caches take advantage of temporal reuse
- The compiler must be very aggressive in memory optimizations

Memory redundancy
- Memory instructions that repeatedly access the same location
  - Lots of memory operations are redundant
- Sources of redundancy
  - Source code structure
    - Programmers introduce redundancy
  - Traditional compilation
    - Separate compilation units
    - Limitations in the compilation model
    - Code generation introduces redundancy
- What percentage of memory operations are redundant at run time?

Figure (code example):
  … = *p;        (redundancy source)
  if ( … ) {
    *q = …       (intervening store)
    … = *p;      (redundant load)
  }
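One way to make the question on this slide concrete is to count redundant loads in a memory trace. The sketch below is illustrative only: the trace format, the function name `load_redundancy`, and the definition of "redundant" (a load returns the value last seen at that address, whether from an earlier load or a store) are assumptions, not the metric used in the talk.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical trace record: (kind, address, value); kind is "load" or "store".
Access = Tuple[str, int, int]


def load_redundancy(trace: Iterable[Access]) -> float:
    """Fraction of dynamic loads whose value was already seen at that address."""
    last_value: Dict[int, int] = {}
    loads = 0
    redundant = 0
    for kind, addr, value in trace:
        if kind == "load":
            loads += 1
            if addr in last_value and last_value[addr] == value:
                redundant += 1
        # Both loads and stores expose the current value at this address.
        last_value[addr] = value
    return redundant / loads if loads else 0.0


trace = [
    ("load", 0x10, 7),   # first touch: not redundant
    ("store", 0x20, 1),
    ("load", 0x10, 7),   # same value as the earlier load: redundant
    ("load", 0x20, 1),   # value was just stored: redundant
    ("load", 0x30, 5),   # first touch: not redundant
]
print(load_redundancy(trace))
```

A real measurement would also bound the reuse distance and track which static instructions are involved; this version only shows the counting idea.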

Dynamic memory redundancy
Figure panels: Load redundancy; Store redundancy

Eliminating memory redundancy
- Can the compiler reduce the redundancy that appears in binary programs?
- Binary optimizations
  - New opportunities appear on executable code
    - Compiler/language independence
    - Whole program view
    - Object code oriented optimizations
    - Easy collection/use of profiling information
  - Executable code has its own problems
    - Lack of semantic information
    - "Nasty" features
- Redundancy in binary programs can be eliminated by using binary optimizers

Talk outline
- Motivation
- Memory redundancy elimination (MRE)
- Evaluation
- Summary

Memory redundancy elimination (MRE)
- Removal of memory instructions that repeatedly access the same location
- Targeted at redundancy type
  - Load redundancy elimination (LRE) in a path-sensitive fashion
    - Based on path-sensitive memory disambiguation
  - Store redundancy elimination (SRE)
- Targeted at redundancy distance
  - Eliminating close/distant redundancy
- In the context of a binary optimizer
  - Overcome limitations of traditional compilers
  - Need to deal with "executable code" problems

Load redundancy elimination (LRE)
- Fundamental problems
  - Alias analysis for disambiguation
  - Liveness analysis for register bypassing
  - Cost-benefit analysis for applying LRE
    - Profile information is needed
- Eliminating close redundancy
  - Within extended basic blocks (EBBs)
- Eliminating distant redundancy
  - Intraprocedural dataflow analysis [HorspoolHo97]
  - For fully/partially-redundant loads
    - Redundancy on all/some paths
    - Partial-LRE requires insertion of speculative loads

Figure (Hot Path): move r0, r ... I1 load (p0), r1; move r1, r0 ... I2 load (p0), r2 ...

[HorspoolHo97] R. N. Horspool and H. C. Ho. Partial redundancy elimination driven by a cost-benefit analysis, CSSE'97

Memory disambiguation
- Register use-def chains
  - Symbolic descriptors for every use
- Disambiguation by instruction inspection
  - Fails on path-sensitive redundancies
- Need to deal with path-sensitive information
  - Partial-LRE is not sufficient either

Figure: ... I0 def p0 ... I1 load (p0),r1 ... I3 add p0,8,p0 ... IØ Ø-def p0 ... I2 load (p0),r2 ... (√ ?)

Path-sensitive memory disambiguation
- Established for only a subset of all the possible paths
- Subsumes generic disambiguation

Path-sensitive LRE
- Partial-LRE is now adapted for dealing with path-sensitive redundancies
  - Availability on edge (AVEDG_ij)

Figure (path-sensitive redundancy): ... I0 def p0 ... I1 load (p0),r1; move r1, r0 ... I3 add p0,8,p0; load (p0),r0 ... IØ Ø-def p0 ... move r0, r2; I2 load (p0),r (√ x)
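The idea of tracking availability per edge can be sketched with a small dataflow pass. Everything here is an assumption made for illustration: the slides do not define AVEDG_ij formally, so the sketch simply treats the value of (p0) as available on edge i→j when it is available at the exit of block i. The block names, the diamond-shaped CFG loosely modeled on the figure, and the `gen`/`kill` sets are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def edge_availability(succs: Dict[str, List[str]], gen: Set[str],
                      kill: Set[str], entry: str) -> Dict[Edge, bool]:
    """Return AVEDG[(i, j)]: is the loaded value available on edge i -> j?

    gen  = blocks ending with the value of (p0) held in a register
    kill = blocks that redefine p0 (or may overwrite the location)
    """
    preds: Dict[str, List[str]] = {b: [] for b in succs}
    for b, targets in succs.items():
        for t in targets:
            preds[t].append(b)
    avout = {b: True for b in succs}  # optimistic start, refined to a fixed point
    changed = True
    while changed:
        changed = False
        for b in succs:
            avin = (b != entry and bool(preds[b])
                    and all(avout[p] for p in preds[b]))
            new = b in gen or (avin and b not in kill)
            if new != avout[b]:
                avout[b] = new
                changed = True
    return {(i, j): avout[i] for i in succs for j in succs[i]}


def edges_needing_load(avedg: Dict[Edge, bool], target: str) -> List[Edge]:
    """Incoming edges of `target` where a load would have to be inserted."""
    return [e for e, available in avedg.items()
            if e[1] == target and not available]


# B0: I0 def p0; I1 load (p0),r1   B1: I3 add p0,8,p0   B2: empty   B3: I2 load (p0),r2
succs = {"B0": ["B1", "B2"], "B1": ["B3"], "B2": ["B3"], "B3": []}
avedg = edge_availability(succs, gen={"B0", "B3"}, kill={"B1"}, entry="B0")
print(avedg)
print(edges_needing_load(avedg, "B3"))
```

In this toy CFG the value is available on the edge that bypasses I3 and unavailable on the edge that follows it, which matches the figure's placement of an extra load after the add so that I2 can become a register move.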

Store redundancy elimination (SRE)
- Approach similar to LRE
  - SRE on EBBs
  - Full- and Partial-SRE
    - New formulation of the analysis
    - No path-sensitive elimination!
- Elimination of dead stores
  - Other optimizations produce a lot of dead stores
  - Form of dead code elimination
  - Based on heuristics
    - Includes a basic analysis for useless stack locations

Figure (store redundancy): ... I1 store r1, (p0) ... I2 store r2, (p0)
Figure (dead store): ... I1 load (p0), r0 ... I2 store r0, (p0)
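The two store patterns in the figures can be sketched for a single straight-line block. This is a minimal illustration under stated assumptions, not the formulation used in the talk: the tuple encoding of instructions, the function names, and the very conservative alias model (any load may read any location, any store may overwrite any location, two accesses match only when they use the same unmodified address register) are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding: ("load", addr, dst) | ("store", src, addr) | ("def", reg)
Instr = Tuple[str, ...]


def drop_writeback_stores(block: List[Instr]) -> List[Instr]:
    """Forward pass: drop a store that writes back a value memory already holds."""
    known: Dict[str, str] = {}  # address register -> register equal to memory at (addr)
    out: List[Instr] = []
    for ins in block:
        op = ins[0]
        if op == "store":
            _, src, addr = ins
            if known.get(addr) == src:
                continue  # e.g. load (p0), r0 ... store r0, (p0)
            known = {addr: src} if src != addr else {}  # store may alias anything else
        elif op in ("load", "def"):
            written = ins[-1]
            known = {a: r for a, r in known.items() if written not in (a, r)}
            if op == "load" and ins[1] != written:
                known[ins[1]] = written
        out.append(ins)
    return out


def drop_overwritten_stores(block: List[Instr]) -> List[Instr]:
    """Backward pass: drop a store overwritten before any possible read."""
    overwritten: Set[str] = set()  # address registers stored to later on
    kept: List[Instr] = []
    for ins in reversed(block):
        op = ins[0]
        if op == "store":
            if ins[2] in overwritten:
                continue  # e.g. store r1, (p0) ... store r2, (p0)
            overwritten.add(ins[2])
        elif op == "load":
            overwritten.clear()  # a load may read any location
        elif op == "def":
            overwritten.discard(ins[1])  # address register changes meaning
        kept.append(ins)
    kept.reverse()
    return kept


def local_sre(block: List[Instr]) -> List[Instr]:
    return drop_overwritten_stores(drop_writeback_stores(block))


print(local_sre([("store", "r1", "p0"), ("def", "r5"), ("store", "r2", "p0")]))
print(local_sre([("load", "p0", "r0"), ("store", "r0", "p0")]))
```

Stores that reach the end of the block are always kept, since later code may read them; the slides mention heuristics and a stack-location analysis that this sketch does not attempt.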

Methodology
- Benchmark suite
  - SPECint95
    - Compiled on an AlphaServer with full optimizations
    - Instrumented using Pixie to get profiling information
    - Aggressively re-optimized using Alto
- Experimental framework
  - Alto executable optimizer
- Evaluation
  - Dynamic number of loads/stores
  - Actual execution time
    - AlphaServer GS-140, Alpha EV

Dynamic number of loads/stores

Execution time
Figure: relative execution time on an AlphaServer GS-140, Alpha EV MHz

Dynamic replay traps
Figure: relative number of replay traps on the sim-alpha simulator, modeling an Alpha EV

- A high percentage of memory operations are redundant
- Memory redundancy elimination (MRE)
  - Removal of redundant memory operations
    - Load redundancy elimination (LRE) in a path-sensitive fashion
      - Based on path-sensitive memory disambiguation
    - Store redundancy elimination (SRE)
      - Including elimination of dead stores
  - For executable code or link-time
    - Overcome limitations of traditional compilers
  - Valuable results on real execution time
- Future directions
  - Explore better alias analysis mechanism
  - Additional techniques for MRE

Backup slides

Dynamic memory redundancy

Dynamic load redundancy

Dynamic store redundancy

Load redundancy elimination (LRE)
- I1 loads a value from memory into r1
- I2 loads from the same location into r2
- Location (p0) is not modified between I1 and I2
  - r1 can be safely bypassed to r2
  - I2 can be removed!

Figure: ... I1 load (p0), r1 ... I2 load (p0), r2 ... (inserted: move r1, r0; move r0, r)
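The bypass described on this slide can be sketched for one straight-line block. The sketch rests on assumptions that are not in the talk: a hypothetical tuple encoding of instructions, a deliberately crude alias model (any store invalidates everything except the location it just wrote), and no search for a spare register. Where the slide copies r1 into a free register r0, this version reuses the original destination register directly and gives up as soon as that register is overwritten.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding:
# ("load", addr, dst) | ("store", src, addr) | ("def", reg) | ("move", src, dst)
Instr = Tuple[str, ...]


def _kill(available: Dict[str, str], reg: str) -> None:
    """Forget entries whose address register or holding register is `reg`."""
    for addr in [a for a, holder in available.items() if reg in (a, holder)]:
        del available[addr]


def local_lre(block: List[Instr]) -> List[Instr]:
    available: Dict[str, str] = {}  # address register -> register holding the value
    out: List[Instr] = []
    for ins in block:
        op = ins[0]
        if op == "load":
            _, addr, dst = ins
            holder = available.get(addr)
            # Redundant load: bypass the value through a register move.
            out.append(("move", holder, dst) if holder is not None else ins)
            _kill(available, dst)
            if addr != dst:
                available.setdefault(addr, dst)
        elif op == "store":
            _, src, addr = ins
            available.clear()  # conservative: the store may alias anything
            if src != addr:
                available[addr] = src  # the stored value is now known at (addr)
            out.append(ins)
        else:  # "def" or "move": the last operand is written
            _kill(available, ins[-1])
            out.append(ins)
    return out


print(local_lre([("load", "p0", "r1"), ("def", "r3"), ("load", "p0", "r2")]))
```

With the slide's example, the second load of (p0) becomes a move from r1 to r2. An intervening store through another pointer, or a redefinition of p0 or r1, leaves the second load in place.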

LRE on executable code
- Is (p1) at I1 the same memory location as (p2) at I2?
  - Alias analysis!
- Is there any available register between I1 and I2 that can be used to bypass r1 to r2?
  - Register liveness analysis!

Figure: ... I1 load (p1), r1 ... I2 load (p2), r2 ... (inserted: move r1, r0; move r0, r)

LRE: Eliminating close redundancy
- For extended basic blocks (EBBs)
  - Alias analysis: for disambiguation
  - Register liveness analysis: for bypassing
- Profile-guided LRE
  - There is not always a benefit in removing a redundant load
  - Need to evaluate cost-benefit of applying LRE!

Figure (Hot Path): move r0, r ... I1 load (p0), r1; move r1, r0 ... I2 load (p0), r2 ...
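The cost-benefit check can be illustrated with profile counts. The cost model below is an assumption made up for this sketch (a fixed load latency and a fixed cost per inserted move), and the function name `lre_gain` is hypothetical; the talk does not give its actual equation here.

```python
from typing import Sequence


def lre_gain(removed_load_count: int,
             inserted_move_counts: Sequence[int],
             load_latency: float = 3.0,
             move_cost: float = 1.0) -> float:
    """Estimated cycles saved (positive) or lost (negative) by one LRE candidate.

    removed_load_count   = profiled execution count of the load being removed
    inserted_move_counts = profiled execution counts of each inserted move
    """
    benefit = removed_load_count * load_latency
    cost = sum(inserted_move_counts) * move_cost
    return benefit - cost


# I1 sits on a hot path (1000 executions) but I2 is rarely reached (10):
# the move inserted after I1 costs more than removing I2 saves.
print(lre_gain(10, [1000, 10]))

# Both instructions are equally hot: removing I2 pays off.
print(lre_gain(1000, [1000, 1000]))
```

Under this toy model the first candidate would be rejected and the second accepted, which is the situation the "Hot Path" figure points at: a move added on a frequently executed path can outweigh the load it saves elsewhere.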

LRE: Eliminating distant redundancy
- For eliminating fully- and partially-redundant loads
  - Requires insertion of speculative loads
- Dataflow analysis [HorspoolHo97]
  - Extended cost equation
  - Complex search for available registers

Figure: ... I2 load (p0),r1 ... I1 store r1,(p0) ... (inserted: load (p0), r0; move r0,r; move r1,r0)

[HorspoolHo97] R. N. Horspool and H. C. Ho. Partial redundancy elimination driven by a cost-benefit analysis, CSSE'97

Path-sensitive LRE
- Path-sensitive redundancy
  - Redundancy occurs only on some execution paths
  - Partial-LRE is not sufficient
- Memory disambiguation
  - Using register use-def chains
  - Symbolic descriptors for every use
- Path-sensitive memory disambiguation is needed!

Figure: ... I0 def p0 ... I1 load (p0),r1 ... I3 add p0,8,p0 ... IØ Ø-def p0 ... I2 load (p0),r2 ...

Path-sensitive information
- Disambiguation is established for only a subset of all the possible paths
  - For detecting path-sensitive exact memory dependencies
- Partial-LRE
  - Algorithm is now adapted for dealing with path-sensitive redundancies
    - Availability on edge (AVEDG_ij)

Figure (path-sensitive memory disambiguation): ... I0 def p0 ... I1 load (p0),r1; move r1, r0 ... I3 add p0,8,p0; load (p0),r0 ... IØ Ø-def p0 ... move r0, r2; I2 load (p0),r (√ x)

A combined algorithm
- Short-distance MRE
  - Basic
    - MRE within EBBs
- Long-distance MRE
  - Full
    - Full-MRE
  - Partial
    - Partial-MRE
  - Complete
    - Path-sensitive LRE
    - Partial SRE
    - Dead store elimination

Figure (optimizer phases, in the order listed on the slide):
1. Easy optimizations (including Basic-MRE)
2. Function inlining
3. Long-distance MRE (Full/Partial/Complete)
4. Easy optimizations (including Basic-MRE)

Dynamic number of loads

Dynamic number of stores

Alpha results