Memory Optimizations & Post-Compilation Techniques CS 671 April 3, 2008

The Problem
The placement of program text in memory matters:
Large working sets cause excessive TLB & page misses
Bad placement increases instruction cache misses
– Pettis & Hansen report that on some benchmarks, 1 of every 3 cycles was a cache miss (on PA-RISC)
Random placement leaves these effects to chance

The plan:
Discover execution-time paths
Rearrange the code to keep those paths in contiguous memory
Make heavy use of execution profiles

Does This Work?
Motivating examples from within HP:
The Pascal compiler moved frequently executed blocks to the top of each procedure
– 40% reduction in instruction cache misses
– 5% improvement in running time
The Fortran compiler rearranged object files before linking, to improve locality across calls
– 20% throughput improvement

Two Major Issues
Procedure placement:
If A calls B, we would like A & B in adjacent locations
– Being on the same page means a smaller working set
– Adjacent locations limit I-cache conflicts
Unfortunately, many procedures might call B (& A)
This is an issue for the linker

Block placement:
The same effects occur on a smaller scale
Fall-through branches create an additional incentive
Rarely executed code fills up the cache, too!
This is an issue for the compiler & optimizer

Procedure Placement
Simple principles:
Build the call graph
Annotate its edges with execution frequencies
Use "closest is best" placement
– If A calls B most often, place A next to B
– Keeps branches short (an advantage on PA-RISC)
– With a direct-mapped I-cache, A & B are unlikely to overlap in the cache

Profiling the call graph:
The linker inserts a stub for each call that bumps a counter
Counters are kept in statically initialized storage (set to zero)
Adds overhead to execution, but only in training runs

Procedure Placement
Computing an order (see the sketch below):
Combine all parallel edges between A and B into one weighted edge
Select the highest-weight edge, say (X,Y)
– Combine X & Y, merging their common edges (e.g., edges XZ & YZ become a single edge to Z)
– Place X next to Y
Repeat until the graph cannot be reduced further
When merging two chains (say W–X and Y–Z), consider the possible joinings (via edges such as X–Y or W–Z), using the weights from the original graph; the largest weight goes closest
(The slide's figure shows a three-node call graph over X, Y, and Z.)
The graph may have disconnected subgraphs; new procedures must be added at the end
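
The greedy merge is easy to sketch. Below is a minimal Python sketch of this "closest is best" ordering; the names and the call-graph encoding are illustrative (this is not the HP linker's code), and it simplifies Pettis & Hansen by ignoring which end of a chain a procedure joins.

    # A minimal sketch of greedy "closest is best" procedure placement.
    # Each procedure starts as its own chain; the two chains joined by the
    # heaviest remaining call-graph edge are merged until nothing is left.
    from collections import defaultdict

    def place_procedures(procs, call_freq):
        # call_freq: {(caller, callee): count}. Fold A->B and B->A together.
        chains = {p: [p] for p in procs}       # chain leader -> members
        leader = {p: p for p in procs}         # procedure -> chain leader
        edges = defaultdict(int)
        for (a, b), w in call_freq.items():
            if a != b:
                edges[tuple(sorted((a, b)))] += w

        def chain_edges():
            # Re-aggregate edge weights between the current chains.
            agg = defaultdict(int)
            for (a, b), w in edges.items():
                la, lb = leader[a], leader[b]
                if la != lb:
                    agg[tuple(sorted((la, lb)))] += w
            return agg

        agg = chain_edges()
        while agg:
            la, lb = max(agg, key=agg.get)     # heaviest remaining edge
            chains[la].extend(chains[lb])      # "place X next to Y"
            for p in chains.pop(lb):
                leader[p] = la
            agg = chain_edges()

        # Concatenate the chains; disconnected procedures fall at the end.
        return [p for c in chains.values() for p in c]

    # Example: main calls helper often, init rarely.
    order = place_procedures(
        ["main", "helper", "init"],
        {("main", "helper"): 90, ("main", "init"): 1})
    print(order)   # e.g. ['helper', 'main', 'init'] -- the hot pair is adjacent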

Block Placement
Targets branches with unequal execution frequencies:
Make the likely case the fall-through case
Move the unlikely case out of line & out of sight

Potential benefits:
Longer branch-free code sequences
More executed operations per cache line
A denser instruction stream means fewer cache misses
Moving unlikely code means denser page use & fewer page faults

Block Placement
Moving infrequently executed code
(The slide's figure shows a four-block CFG, B1–B4, laid out two ways.)
In the original layout, the unlikely path gets the cheap fall-through case while the likely path pays for an extra branch.
We would like the layout to become B1, B4, B3, B2:
– The branch on the likely path goes away
– The unlikely block B2 sits a long distance off, on another page
– The result is a denser instruction stream

Block Placement
Principles:
The goal is to eliminate taken branches
– Build up traces — single hot paths
Work from profile data
– Edge profiles are better than block profiles
Use a greedy, bottom-up strategy to combine blocks

Gathering profile data:
Insert code to count edges
Split critical edges (see the sketch below)
Use name mangling to separate data for different procedures
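
Splitting critical edges is what makes per-edge counters unambiguous. Here is a hedged sketch; the dictionary CFG encoding and the synthetic block names are assumptions of mine, not from the slides.

    # A sketch of critical-edge splitting for edge profiling. An edge is
    # critical when its source has several successors and its target has
    # several predecessors; a counter in either block would also count
    # other paths, so a fresh block is inserted on the edge itself.
    def split_critical_edges(cfg):
        """cfg: {block: [successor, ...]}. Returns a new CFG where every
        critical edge runs through a synthetic block named 'a->b'."""
        preds = {}
        for b, succs in cfg.items():
            for s in succs:
                preds.setdefault(s, []).append(b)
        new_cfg = {b: list(s) for b, s in cfg.items()}
        for b, succs in cfg.items():
            for s in succs:
                if len(succs) > 1 and len(preds[s]) > 1:   # critical edge
                    mid = f"{b}->{s}"                      # counter lives here
                    new_cfg[b] = [mid if x == s else x for x in new_cfg[b]]
                    new_cfg[mid] = [s]
        return new_cfg

    # Diamond with a shared join: the edge B1->B4 is critical.
    cfg = {"B1": ["B2", "B4"], "B2": ["B4"], "B4": []}
    print(split_critical_edges(cfg))
    # {'B1': ['B2', 'B1->B4'], 'B2': ['B4'], 'B4': [], 'B1->B4': ['B4']}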

Block Placement
The idea: form chains of blocks that should be placed as straight-line code.

The algorithm:
1. Make each block a degenerate chain & set its priority to the number of blocks
2. P ← 1
3. For each edge e = <x,y> in the CFG, in order of decreasing frequency:
   if x is the tail of chain a and y is the head of chain b,
   then merge a and b
   else set priority(y) to min(priority(y), P++)

The point is to place targets after their sources, so that branches go forward. (A sketch of chain formation and layout follows the next slide.)

Block Placement
Now, to lay out the code:
WorkList ← the chain containing the entry node, n0
While (WorkList ≠ Ø):
   Pick the chain c with the lowest priority(c) from WorkList
   Place it next in the code
   For each edge <c,z> leaving c, add the chain containing z to WorkList

Intuition:
The entry node goes first
Tries to make the edge from chain i to chain j a forward branch
– Predicted not-taken on the target machine
– The edge remains only if it is the lower-probability choice
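
A compact sketch of both steps — chain formation and worklist layout — under one plausible reading of the pseudocode above (chain priority taken as the minimum priority of any block in it; the edge-list encoding is mine).

    # Chain-based block placement: merge blocks along hot edges into
    # chains, then emit chains from a worklist, lowest priority first.
    def place_blocks(blocks, edges, entry):
        """blocks: list of names; edges: {(x, y): freq}; entry: start block."""
        chain = {b: [b] for b in blocks}       # block -> its chain (shared list)
        priority = {b: len(blocks) for b in blocks}
        p = 1
        for (x, y), _ in sorted(edges.items(), key=lambda e: -e[1]):
            a, b = chain[x], chain[y]
            if a is not b and a[-1] == x and b[0] == y:
                a.extend(b)                    # merge: y's chain follows x's
                for blk in b:
                    chain[blk] = a
            else:
                priority[y] = min(priority[y], p)
                p += 1

        placed, layout = set(), []
        worklist = [chain[entry]]
        while worklist:
            c = min(worklist, key=lambda ch: min(priority[b] for b in ch))
            worklist.remove(c)
            if id(c) in placed:
                continue
            placed.add(id(c))
            layout.extend(c)
            for (x, y), _ in edges.items():
                if x in c and id(chain[y]) not in placed:
                    worklist.append(chain[y])
        return layout

    # Branchy example: B1 -> B2 (hot) and B1 -> B3 (cold), both reach B4.
    edges = {("B1", "B2"): 90, ("B2", "B4"): 90,
             ("B1", "B3"): 10, ("B3", "B4"): 10}
    print(place_blocks(["B1", "B2", "B3", "B4"], edges, "B1"))
    # ['B1', 'B2', 'B4', 'B3'] -- the hot path is straight-line, B3 out of line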

Going Further — Procedure Splitting
Any code that has a zero profile count is "fluff":
Move fluff into the distance
– It rarely executes
– This gets more useful operations into the I-cache
– It increases the effective density of the I-cache
The cost is slower execution for the rarely executed code

Implementation:
Create a linkage-less procedure with an invented name
Give it a priority that the linker will sort to the end of the code
Replace the branch with a call (via a stub that performs the call)
– Branch to the call at the end of the procedure to maintain density
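
Identifying fluff is straightforward once block-level profile counts exist; a toy sketch (the stub and linker-priority machinery is elided, and the block/count structures are assumptions of mine).

    # Partition a procedure's blocks into hot code and zero-profile "fluff"
    # that a linker could sort to the end of the text segment.
    def split_hot_cold(blocks, counts):
        """blocks: ordered block names; counts: {block: execution count}."""
        hot = [b for b in blocks if counts.get(b, 0) > 0]
        cold = [b for b in blocks if counts.get(b, 0) == 0]
        return hot, cold

    hot, cold = split_hot_cold(["entry", "check", "error", "exit"],
                               {"entry": 500, "check": 500, "exit": 500})
    print(hot, cold)   # ['entry', 'check', 'exit'] ['error']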

Putting It Together
Procedure placement is done in the linker
Block placement is done in the optimizer
– This allows branch elision due to fluff, and other tailoring
Speedups averaged from 2 to 26%, depending on cache size
This idea became popular on early-1990s PCs
– Long cache lines
– Slow page faults
– Microsoft insiders suggested it was the most important optimization for codes like Office (Word, Excel). Why?

Peephole Optimization & Other Post-Compilation Techniques

The Problem
After compilation, the code still has some flaws:
Scheduling & allocation really are NP-complete
The optimizer may not implement every needed transformation

Curing the problem:
More work on scheduling and allocation
Implement more optimizations — or —
Optimize after compilation
– Peephole optimization
– Link-time optimization

Peephole Optimization
The basic idea:
Discover local improvements by looking at a window on the code
– A tiny window is good enough — a peephole
Slide the peephole over the code and examine its contents
– Pattern match against a limited set of patterns

Examples:
– storeAI r1 ⇒ r0,8 ; loadAI r0,8 ⇒ r15  becomes  storeAI r1 ⇒ r0,8 ; cp r1 ⇒ r15
– addI r2,0 ⇒ r7 ; mult r4,r7 ⇒ r10  becomes  mult r4,r2 ⇒ r10
– jumpI ⇒ l10 ; l10: jumpI ⇒ l11  becomes  jumpI ⇒ l11 ; l10: jumpI ⇒ l11
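
A toy sliding-window matcher over an assumed tuple encoding shows the flavor of the first two patterns (the jump-to-jump case needs label bookkeeping and is elided). Note that removing the addI quietly assumes r7 has no later use — the dead-effect question taken up a few slides below.

    # A toy peephole pass over ILOC-like tuples. It rewrites a load that
    # re-reads a just-stored value into a copy, and folds addI rX,0 into a
    # following mult (assuming the addI result is otherwise dead).
    def peephole(code):
        out = list(code)
        changed = True
        while changed:
            changed = False
            for i in range(len(out) - 1):
                a, b = out[i], out[i + 1]
                if a is None or b is None:
                    continue
                # storeAI r1 => r0,8 ; loadAI r0,8 => r15  -->  cp r1 => r15
                if a[0] == "storeAI" and b[0] == "loadAI" and a[2] == b[1]:
                    out[i + 1] = ("cp", a[1], b[2])
                    changed = True
                # addI r2,0 => r7 ; mult r4,r7 => r10  -->  mult r4,r2 => r10
                elif (a[0] == "addI" and a[2] == 0 and b[0] == "mult"
                      and a[3] in (b[1], b[2])):
                    other = b[2] if b[1] == a[3] else b[1]
                    out[i] = None                  # drop the useless addI
                    out[i + 1] = ("mult", other, a[1], b[3])
                    changed = True
            out = [op for op in out if op is not None]
        return out

    code = [("storeAI", "r1", ("r0", 8)),
            ("loadAI", ("r0", 8), "r15"),
            ("addI", "r2", 0, "r7"),
            ("mult", "r4", "r7", "r10")]
    print(peephole(code))
    # [('storeAI', 'r1', ('r0', 8)), ('cp', 'r1', 'r15'),
    #  ('mult', 'r4', 'r2', 'r10')]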

Peephole Optimization
Early peephole optimizers (McKeeman):
Used a limited set of hand-coded patterns
Matched with exhaustive search
A small window (2 to 5 ops) & a small pattern set made for quick execution

They proved effective at cleaning up the rough edges:
– Code generation is inherently local
– Boundaries between local regions are trouble spots

Improvements in code generation, optimization, & architecture should have let these fade into obscurity:
Allocation & scheduling are much better today than in 1965
But we have much more complex architectures

Peephole Optimization
Modern peephole optimizers (Davidson, Fraser):
Larger, more complex ISAs lead to larger pattern sets
This has produced a more systematic approach — a three-stage pipeline:
   ASM → Expander → LLIR → Simplifier → LLIR → Matcher → ASM

The Expander (ASM → LLIR):
Performs operation-by-operation expansion into LLIR
Needs no context
Captures the full effect of each operation

The Simplifier (LLIR → LLIR):
Makes a single pass over the LLIR, moving the peephole
Applies forward substitution, algebraic simplification, constant folding, & eliminates useless effects (it must know what is dead)
Eliminates as many LLIR operations as possible

The Matcher (LLIR → ASM):
Starts with the reduced LLIR program
Compares the LLIR in the peephole against the pattern library
Selects one or more ASM patterns that "cover" the LLIR

Finding Dead Effects
The simplifier must know what is useless (i.e., dead):
The expander works in a context-independent fashion, so it can process the operations in any order
– Use a backward walk to compute local LIVE information
– Tag each operation with a list of its useless values
What about non-local effects?
– Most useless effects are local — the DEF & USE are in the same block
– The simplifier can be conservative & assume a value is LIVE until proven dead

Example (expand, simplify, match):
   ASM:   mult r5,r9 ⇒ r12 ; add r12,r17 ⇒ r13
   LLIR:  r12 ← r5 × r9 ; cc ← f(r5 × r9) ; r13 ← r12 + r17 ; cc ← f(r12 + r17)
   ASM:   madd r5,r9,r17 ⇒ r13
The cc ← f(r5 × r9) effect, if live, would prevent the multiply-add from matching.
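
The backward LIVE walk itself is short. A hedged sketch over an assumed (defs, uses) encoding of a block:

    # Tag each LLIR effect (register or condition-code write) as useless
    # when nothing later in the block reads it and it is not live-out.
    # Conservatively, everything in live_out is assumed live.
    def find_dead_effects(block, live_out):
        """block: list of (defs, uses) sets per op.
        Returns, for each op, the set of its useless definitions."""
        live = set(live_out)
        dead = [None] * len(block)
        for i in range(len(block) - 1, -1, -1):
            defs, uses = block[i]
            dead[i] = defs - live          # defined here, never read later
            live = (live - defs) | uses
        return dead

    # The slide's multiply/add pair: both cc writes are dead, so the
    # matcher is free to fuse the pair into madd r5,r9,r17 => r13.
    block = [({"r12", "cc"}, {"r5", "r9"}),    # r12 <- r5 * r9 ; cc <- f(...)
             ({"r13", "cc"}, {"r12", "r17"})]  # r13 <- r12 + r17 ; cc <- f(...)
    print(find_dead_effects(block, live_out={"r13"}))
    # [{'cc'}, {'cc'}]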

Peephole Optimization
Peephole technology can also perform instruction selection:
The key issue in selection is effective pattern matching

Using a peephole system for instruction selection:
Have the front end generate LLIR directly
– This eliminates the need for the Expander
Keep the Simplifier and the Matcher
– Add a simple register assigner, and follow it with real allocation
This basic scheme is used in GCC

Peephole-Based Selection
Basic structure of compilers like GCC:
   Source → Front End → LLIR → Optimizer → LLIR → Simplifier → LLIR → Matcher → ASM → Allocator → ASM
GCC uses RTL as its IR (very low level)
Numerous optimization passes
The quick translation into RTL limits what the optimizer can do...
The Matcher is generated from a specification (a hard-coded tree-pattern matcher)

An Example
Original code: w ← x - 2 × y

Translation from the compiler's IR — or by the Expander — into LLIR:
   r10 ← 2
   r11 ← @y
   r12 ← r0 + r11
   r13 ← M(r12)
   r14 ← r10 × r13
   r15 ← @x
   r16 ← r0 + r15
   r17 ← M(r16)
   r18 ← r17 - r14
   r19 ← @w
   r20 ← r0 + r19
   M(r20) ← r18

CS 671 – Spring r 10  2 r 11 y r 12  r 0 + r 11 r 13  M (r 12 ) r 14  r 10 x r 13 r 15 x r 16  r 0 + r 15 r 17  M (r 16 ) r 18  r 17 - r 14 r 19 w r 20  r 0 + r 19 M (r 20 )  r 18 Simplification - 3 Operation Window r 10  2 r 11 y r 12  r 0 + r 11 r 10  2 r 12  r 0 y r 13  M (r 12 ) r 10  2 r 13  M (r 0 y) r 14  r 10 x r 13 r 13  M (r 0 y) r 14  2 x r 13 r 15 x No further improveme nt is found r 14  2 x r 13 r 17  M (r 0 x) r 18  r 17 - r 14 r 14  2 x r 13 r 16  r 0 x r 17  M (r 16 ) r 14  2 x r 13 r 15 x r 16  r 0 + r 15 r 17  M (r 0 x) r 18  r 17 - r 14 r 19 w r 18  r 17 - r 14 r 19 w r 20  r 0 + r 19 r 18  r 17 - r 14 r 20  r 0 w M (r 20 )  r 18 r 18  r 17 - r 14 M (r 0 w)  r 18 Original Code

Example, Continued
Simplification shrinks the code significantly. After the Simplifier, the 12 original operations become 5, using 4 registers instead of 11:
   r13 ← M(r0 + @y)
   r14 ← 2 × r13
   r17 ← M(r0 + @x)
   r18 ← r17 - r14
   M(r0 + @w) ← r18

The Matcher then covers the LLIR with ASM patterns, and we're done:
   loadAI  r0,@y ⇒ r13
   multI   r13,2 ⇒ r14
   loadAI  r0,@x ⇒ r17
   sub     r17,r14 ⇒ r18
   storeAI r18 ⇒ r0,@w

Other Considerations
Control-flow operations:
Can clear the simplifier's window at a branch or label
A more aggressive approach combines across branches
– Must account for effects on all paths
– Not clear that this pays off...
The same considerations arise with predication

Physical versus logical windows:
Can run the optimizer over a logical window
– k operations connected by DEF-USE chains (see the sketch below)
The expander can link DEFs & USEs
Logical windows (within a block) improve effectiveness
– Davidson & Fraser report 30% faster execution and 20% fewer ops with a local logical window
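
Building a logical window is just a forward scan guided by def-use connectivity; a small sketch over an assumed (defs, uses) encoding:

    # Starting from one operation, pull in up to k operations linked to it
    # by def-use chains, skipping unrelated code in between.
    def logical_window(block, start, k=3):
        """block: list of (defs, uses) pairs per op. Returns the indices of
        up to k operations connected to op `start` by def-use chains."""
        window, frontier = [start], set(block[start][0])
        for i in range(start + 1, len(block)):
            if len(window) == k:
                break
            defs, used = block[i]
            if frontier & used:            # connected by a def-use chain
                window.append(i)
                frontier |= defs
        return window

    block = [({"r1"}, {"r0"}),   # r1 <- M(r0)
             ({"r9"}, {"r8"}),   # unrelated op in between
             ({"r2"}, {"r1"}),   # r2 <- r1 + 4
             ({"r3"}, {"r2"})]   # r3 <- r2 * 2
    print(logical_window(block, 0))   # [0, 2, 3] -- skips the unrelated op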

Peephole Optimization
So...
Peephole optimization remains viable:
– Post-allocation improvements
– Cleans up the rough edges
Peephole technology works for selection:
– Description-driven matchers
– Used in several important systems
Simplification pays off late in the process:
– Low-level substitution, identities, folding, & dead effects

Other Post-Compilation Techniques
What else makes sense to do after compilation?
Profile-guided code positioning
– Allocation intact, schedule intact
Cross-jumping
– Allocation intact, schedule changed
Hoisting
– Changes allocation & schedule, needs data-flow analysis
Procedure abstraction
– Changes allocation & schedule, really needs an allocator
Register scavenging
– Changes allocation & schedule, purely local transformation
Bit-transition reduction
– Schedule & allocation intact, assignment changed

Register Scavenging
The simple idea:
Global allocation does a good job on the big-picture items
It leaves behind blocks where some registers are unused
Let's scavenge those unused registers:
Compute LIVE information
Walk each block to find underallocated regions
– Find spilled local subranges
– Opportunistically promote them to registers

A note of realism: opportunities exist, but this is a 1% to 2% improvement.
T.J. Harvey, Reducing the Impact of Spill Code, MS Thesis, Rice University, May 1998.
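
The scan for scavengeable registers is cheap once LIVE sets exist. A hedged sketch (the rewrite of the spilled subrange itself is elided, and the instruction encoding is an assumption of mine):

    # Within one block, find registers that global allocation left unused:
    # not referenced in the block and not live across it. Any such register
    # is a candidate for promoting a spilled local subrange.
    def free_registers(all_regs, block_ops, live_in, live_out):
        """block_ops: list of (defs, uses) register sets per instruction."""
        busy = set(live_in) | set(live_out)
        for defs, uses in block_ops:
            busy |= defs | uses
        return set(all_regs) - busy

    regs = {f"r{i}" for i in range(8)}
    ops = [({"r1"}, {"r0"}), ({"r2"}, {"r1", "r0"})]
    print(sorted(free_registers(regs, ops, live_in={"r0"}, live_out={"r2"})))
    # ['r3', 'r4', 'r5', 'r6', 'r7'] -- candidates for promoting a spill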

Bit-Transition Reduction
Inter-operation bit transitions relate to power consumption:
A large fraction of CMOS power is spent switching states
The same op on the same functional unit costs less power
– All other things being equal

The simple idea (from Toburen's MS thesis):
Reassign registers to minimize inter-operation bit transitions
Build some sort of weighted graph
Use a greedy algorithm to pick names by distance
This should reduce power consumption in the fetch & decode hardware
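
One plausible greedy realization — not necessarily Toburen's actual algorithm — weights pairs of register names that occupy the same operand field in consecutive operations, then hands out encodings by Hamming distance:

    # Greedy register renaming to reduce bit transitions between the
    # operand fields of consecutive instructions.
    from collections import defaultdict

    def hamming(a, b):
        return bin(a ^ b).count("1")

    def reassign(ops, num_regs):
        """ops: list of per-instruction operand tuples, e.g. ('r1', 'r2')."""
        weight = defaultdict(int)
        for prev, cur in zip(ops, ops[1:]):
            for a, b in zip(prev, cur):     # same field, consecutive ops
                if a != b:
                    weight[(a, b)] += 1
                    weight[(b, a)] += 1
        names = sorted({r for op in ops for r in op})
        assign, free = {}, set(range(num_regs))
        for r in names:
            # Pick the free encoding closest (in bits) to placed neighbors.
            cost = lambda n: sum(w * hamming(n, assign[o])
                                 for (x, o), w in weight.items()
                                 if x == r and o in assign)
            best = min(free, key=cost)
            assign[r] = best
            free.remove(best)
        return assign

    ops = [("a",), ("b",), ("a",), ("c",)]   # a and b alternate most often
    print(reassign(ops, 4))                  # e.g. {'a': 0, 'b': 1, 'c': 2}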

Bit-Transition Reduction
Other transformations:
Swap the operands of commutative operators
– More complex than it sounds
– Shoot for zero-transition pairs
Swap operations within "fetch packets"
– Works for superscalar, not VLIW
Consider bit transitions in scheduling
– Send the same ops to the same functional unit
– Nearby ops (by Hamming distance) next, and so on...
Factor bit transitions into instruction selection
– Maybe use a BURS model with dynamic costs
Again, most of this fits into a post-compilation framework...

Summary
The memory hierarchy is often the bottleneck
Memory optimizations are very important — and we've only scratched the surface
Many optimizations can be applied post-compile-time:
– Procedure placement
– Peephole optimizations