Load-Reuse Analysis: Design and Evaluation
Rastislav Bodik, Rajiv Gupta, Mary Lou Soffa (PLDI'99)
Presented by Sue Ann Hong, 4/11/2006


Load-Reuse Example: Register Promotion
(This paper: find as many reuses as possible.)
1. Load-reuse analysis: identify loads and stores to the same address on a path. Is a1 = a4 on path p1? Is a2 = a4 on path p2?
2. Alias analysis: make sure the loaded value isn't changed in between. Could a0 = a4?
3. Program transformation, e.g. partial redundancy elimination: hoist 'load a4' into path p3, so the value is available on every incoming path and the original 'load a4' becomes fully redundant.
[Figure: 'load a1', 'store a2', 'store a0', and 'load a4' placed along paths p1, p2, p3]
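To make steps 1 and 2 concrete, here is a minimal Python sketch under simplifying assumptions: addresses are already reduced to canonical names, a path is a plain list of accesses, and `find_reuse_source` and `simple_may_alias` are hypothetical helpers, not the paper's algorithm or any library API. It walks back from the load and stops at the first same-address access (a reuse) or the first possibly-aliasing store (no safe reuse).

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical model (not the paper's representation): a path is a list of
# (op, address) pairs, op is "load" or "store", and each address is a
# canonical name such as the one load-reuse analysis would assign.
Access = Tuple[str, str]


def find_reuse_source(path: List[Access], load_addr: str,
                      may_alias: Callable[[str, str], bool]) -> Optional[int]:
    """Index of the access on `path` whose value a following load of
    `load_addr` could reuse, or None if this path offers no safe reuse."""
    for idx in range(len(path) - 1, -1, -1):   # walk backwards from the load
        op, addr = path[idx]
        if addr == load_addr:
            # Step 1: a prior load or store of the same address supplies the value.
            return idx
        if op == "store" and may_alias(addr, load_addr):
            # Step 2: an intervening store might overwrite the value.
            return None
    return None


def simple_may_alias(a: str, b: str) -> bool:
    # Hypothetical conservative oracle: "*p" (an access through an unknown
    # pointer) may alias anything; distinct plain names are assumed not to.
    return a == b or a.startswith("*") or b.startswith("*")


p1 = [("load", "x")]                        # x was already loaded on this path
p_alias = [("load", "x"), ("store", "*q")]  # like 'store a0': might clobber x
print(find_reuse_source(p1, "x", simple_may_alias))       # 0
print(find_reuse_source(p_alias, "x", simple_may_alias))  # None
```

A path with no access to x at all (like p3 in the figure) also yields None; that is where partial redundancy elimination would insert the load.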

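Related Work
–Lexical load-reuse analysis: only finds reuse between loads with identical names.
–Value numbering: only follows copy assignments, e.g. x = 5; t1 = x; t2 = x; (remember the hash tables…)
–Example of reuse both miss: for (i=0; i < N-2; i++) { A[i+2] = A[i]; }, where the load of A[i] in iteration i+2 reads the value stored to A[i+2] in iteration i.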
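A minimal sketch of hash-table value numbering over one basic block, assuming a hypothetical (dest, op, operands) three-address format; `local_value_numbering` is an illustrative name, not a real API. Copies propagate value numbers, so x, t1, and t2 from the slide share one number, while i+2 and i get different numbers, one reason this style of analysis cannot match A[i+2] with A[i].

```python
import itertools
from typing import Dict, List, Tuple, Union

# Hypothetical three-address form: (dest, op, operands). "copy" assigns its
# single operand to dest; any other op is treated as an opaque operator.
Operand = Union[str, int]
Instr = Tuple[str, str, Tuple[Operand, ...]]


def local_value_numbering(block: List[Instr]) -> Dict[str, int]:
    """Map each variable in a basic block to a value number (hash-table VN)."""
    var_vn: Dict[str, int] = {}     # variable -> value number
    expr_vn: Dict[tuple, int] = {}  # ("const", k) or (op, operand numbers...) -> number
    fresh = itertools.count()

    def lookup(table: dict, key) -> int:
        # Reuse the existing number for a key, or hand out a fresh one.
        if key not in table:
            table[key] = next(fresh)
        return table[key]

    def vn_of(operand: Operand) -> int:
        if isinstance(operand, int):
            return lookup(expr_vn, ("const", operand))
        return lookup(var_vn, operand)  # unseen variable: value from outside the block

    for dest, op, operands in block:
        if op == "copy":
            var_vn[dest] = vn_of(operands[0])    # copies share the source's number
        else:
            key = (op,) + tuple(vn_of(o) for o in operands)
            var_vn[dest] = lookup(expr_vn, key)  # same operator + numbers => same value
    return var_vn


vn = local_value_numbering([("x", "copy", (5,)), ("t1", "copy", ("x",)), ("t2", "copy", ("x",))])
print(vn["x"] == vn["t1"] == vn["t2"])  # True

idx = local_value_numbering([("a", "add", ("i", 2))])
print(idx["a"] == idx["i"])  # False: i+2 and i are different values
```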

What the Paper Does
–Its load-reuse algorithm
–The ideal run-time reuse finder ("ground truth")
–A "profile-based estimator"
–Compare how many reuses each finds, on SPEC95, of course…

Evaluating the Algorithm: Comparing to Ideal Reuse Analysis
Ideal reuse analysis (dynamic = run-time)
–Undecidable in general, so use simulation: (simple) remember the access history of each memory instruction and find a prior load or store of the same address.
–Want a tight upper bound, so ignore possible (input-dependent, sporadic) reuses as noise, e.g. while ( c = read() ) { … = hashtbl[ hash(c) ]; }
–Still, how input-independent is the simulation?
Identified reuse level (SPEC95)
–See p. 67. Tall bars… roughly 55% of all loads are reuses.
–Reuse that needs older history (expensive to track) tends to be a small part, ≤ 18%.
–So reuse analysis is probably worth it.
Note: they do show empirically that keeping more than one access per instruction in the history doesn't matter much.
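The simulation idea can be sketched as follows, assuming a hypothetical trace of (instruction, kind, address) events and a per-instruction history of the last `depth` addresses; `count_ideal_reuses` is an illustrative name and only approximates what the paper's simulator does. A load counts as a reuse if some instruction's history still holds the current value at that address; a store invalidates the copies held by other instructions.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Tuple

# Hypothetical trace format: (instruction_id, kind, address), kind is "load" or "store".
Event = Tuple[str, str, int]


def count_ideal_reuses(trace: Iterable[Event], depth: int = 1) -> Tuple[int, int]:
    """Return (reused_loads, total_loads) for a dynamic memory trace, keeping
    the last `depth` addresses accessed by each static memory instruction."""
    history: Dict[str, Deque[int]] = {}
    reused = total = 0
    for inst, kind, addr in trace:
        if kind == "load":
            total += 1
            # Reuse: some instruction's recent history still holds the current
            # value at this address (a prior load or store of it).
            if any(addr in h for h in history.values()):
                reused += 1
        else:
            # A store makes the copies of this address held by other
            # instructions stale, so drop them from their histories.
            for other, h in list(history.items()):
                if other != inst and addr in h:
                    history[other] = deque((a for a in h if a != addr), maxlen=depth)
        history.setdefault(inst, deque(maxlen=depth)).append(addr)
    return reused, total


trace = [("L1", "load", 100), ("L2", "load", 100),
         ("S1", "store", 200), ("L3", "load", 200), ("L4", "load", 300)]
print(count_ideal_reuses(trace))  # (2, 4)
```

Raising `depth` models a longer history; per the note above, the paper reports that going beyond one access per instruction changes little in practice.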

Load-Reuse Analysis
A must-alias analysis on a Value Name Graph (data-flow analysis): an address value flows between two address expressions if they access the same address (i.e., they are equivalent).
Three steps, each with its own payoff:
1. Symbolic interpretation: find equivalences after algebraic simplification; create synthetic names.
2. Symbolic value numbering: using the synthetic names and backward flow from temps, find equivalences due to assignments to temps (remember the hash tables…).
3. Data-flow analysis: connect the equivalences from the previous steps along specific paths.
Example:
  store(2x+12);     → '2x+12'
  y = 2x + 8;
  z = load(y+4);    → '2x+12'   (y+4 = 2x+8+4 = 2x+12)
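A small sketch of step 1 (with the temp substitution of step 2 folded in) on the slide's example, assuming address expressions are linear: a map from base variables to integer coefficients plus a constant. `lin`, `add`, `substitute`, and `synthetic_name` are hypothetical helpers; the paper's Value Name Graph and the path-sensitive step 3 are not modeled.

```python
from typing import Dict, Tuple

# Hypothetical representation of a linear address expression:
# ({base variable: coefficient}, constant offset), e.g. 2x+12 -> ({"x": 2}, 12).
Linear = Tuple[Dict[str, int], int]


def lin(coeffs: Dict[str, int], const: int = 0) -> Linear:
    # Drop zero coefficients so equal expressions have identical forms.
    return ({v: c for v, c in coeffs.items() if c != 0}, const)


def add(a: Linear, b: Linear) -> Linear:
    coeffs = dict(a[0])
    for v, c in b[0].items():
        coeffs[v] = coeffs.get(v, 0) + c
    return lin(coeffs, a[1] + b[1])


def substitute(e: Linear, env: Dict[str, Linear]) -> Linear:
    """Replace temps by their symbolic values (env values must already be
    expressed in base variables; one level of substitution only)."""
    result = lin({}, e[1])
    for v, c in e[0].items():
        base = env.get(v, lin({v: 1}))
        scaled = lin({w: c * k for w, k in base[0].items()}, c * base[1])
        result = add(result, scaled)
    return result


def synthetic_name(e: Linear) -> str:
    """Canonical string: expressions equal after simplification get the same name."""
    terms = [f"{c}{v}" for v, c in sorted(e[0].items())]
    return "+".join(terms + [str(e[1])])


env = {"y": lin({"x": 2}, 8)}                                  # y = 2x + 8
store_name = synthetic_name(lin({"x": 2}, 12))                 # store(2x+12)
load_name = synthetic_name(substitute(lin({"y": 1}, 4), env))  # load(y+4)
print(store_name, load_name)  # 2x+12 2x+12
```

Because both addresses map to the same synthetic name, a hash table keyed on names (as in value numbering) can then record the load as a reuse of the store.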

Profile-Based Estimators
Intuition:
–Reuse analysis tells which path contains how many reuses: $f(p_i) \in \mathbb{Z}$.
–Ideal analysis tells how many reuses there are overall: $n$.
–To compare the two, weight each path's reuses by how many times the path executes: $n = \sum_i f(p_i) \cdot \mathit{freq}(p_i)$.
Crazy: 5 different estimators, giving lower and upper bounds to compensate for edge-profiling errors. (Estimator: uses profiling.)
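The weighted sum can be sketched as below, assuming per-path reuse counts f(p_i) come from the static analysis and exact path frequencies from a path profile; both function names are hypothetical. The second function is one simple bound of the kind the slide alludes to, not one of the paper's five estimators: with only an edge profile, a path cannot execute more often than its least-frequent edge, so weighting by that minimum bounds n from above.

```python
from typing import Dict, Sequence, Tuple

Edge = Tuple[str, str]


def estimated_reuses(reuses_per_path: Dict[str, int], path_freq: Dict[str, int]) -> int:
    """n = sum_i f(p_i) * freq(p_i), given an exact path profile."""
    return sum(f * path_freq.get(p, 0) for p, f in reuses_per_path.items())


def upper_bound_from_edges(reuses_per_path: Dict[str, int],
                           path_edges: Dict[str, Sequence[Edge]],
                           edge_count: Dict[Edge, int]) -> int:
    """Upper bound on n from an edge profile only (assumes f(p) >= 0):
    each execution of a path traverses each of its edges, so
    freq(p) <= the smallest edge count along p."""
    return sum(f * min(edge_count[e] for e in path_edges[p])
               for p, f in reuses_per_path.items())


print(estimated_reuses({"p1": 2, "p2": 1}, {"p1": 10, "p2": 5}))  # 25
```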

Experiments
–Figure 8 on p. 75: how do you interpret that thing?? It shows how possible aliasing could make reuses useless.
–The ideal analysis found that ~55% of loads have reuse; their analysis found ~80% of those.
–Other than that, the paper doesn't really draw conclusions.
What happened after this paper (1999)? Blah… ask the next dude.
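As a rough back-of-the-envelope check, assuming both approximate percentages refer to dynamic loads, the share of all loads whose reuse the analysis finds is about:

```latex
\frac{\text{reuses found by the analysis}}{\text{all loads}}
  \approx 0.55 \times 0.80 = 0.44
```

i.e. a little under half of all loads.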

Discussions from Class
–Bodik's notion of defining and comparing to ideal performance differs from the usual approach of reporting overall optimization performance. In fact, he's famous for not giving numbers for run-time optimization.
–Is this orthogonal to cache optimization? Yes. The paper doesn't address cache/locality-related issues.
–I probably shouldn't have laughed at the authors for saying "Such an amount of registers [>34] will be soon available in general-purpose processors."
–Peter's PowerBook was able to display my presentation, unlike my Sony.