The SGI Pro64 Compiler Infrastructure

The SGI Pro64 Compiler Infrastructure - A Tutorial
Guang R. Gao (U of Delaware), J. Dehnert (SGI), J. N. Amaral (U of Alberta), R. Towle (SGI)

Acknowledgement
The SGI Compiler Development Teams
The MIPSpro/Pro64 Development Team
University of Delaware CAPSL Compiler Team
These individuals contributed directly to this tutorial: A. Douillet (Udel), F. Chow (Equator), S. Chan (Intel), W. Ho (Routefree), Z. Hu (Udel), K. Lesniak (SGI), S. Liu (HP), R. Lo (Routefree), S. Mantripragada (SGI), C. Murthy (SGI), M. Murphy (SGI), G. Pirocanac (SGI), D. Stephenson (SGI), D. Whitney (SGI), H. Yang (Udel)

What is Pro64?
A suite of optimizing compiler tools for Linux/Intel IA-64 systems
C, C++ and Fortran90/95 compilers
Conforming to the IA-64 Linux ABI and API standards
Open to all researchers/developers in the community
Compatible with the HP Native User Environment

Who Might Want to Use Pro64?
Researchers: test new compiler analysis and optimization algorithms
Developers: retarget to another architecture/system
Educators: a compiler teaching platform

Outline
Background and Motivation
Part I: An overview of the SGI Pro64 compiler infrastructure
Part II: The Pro64 code generator design
Part III: Using Pro64 in compiler research & development
SGI Pro64 support
Summary

PART I: Overview of the Pro64 Compiler

Outline
Logical compilation model and component flow
WHIRL Intermediate Representation
Inter-Procedural Analysis (IPA)
Loop Nest Optimizer (LNO) and Parallelization
Global optimization (WOPT)
Feedback
Design for debugability and testability

Logical Compilation Model
(flow diagram, summarized) The driver (sgicc/sgif90/sgiCC) forks and execs each component along the data path:
Src (.c/.C/.f) -> front end + IPA (gfec/gfecc/mfef90) -> WHIRL (.B/.I) -> back end (be, as) -> obj (.o) -> linker (ld) -> a.out/.so

Components of Pro64
Front end
Interprocedural Analysis and Optimization
Loop Nest Optimization and Parallelization
Global Optimization
Code Generation

Data Flow Relationship Between Modules
(flow diagram, summarized) gfec, gfecc, and mfef90 emit Very High WHIRL (.B and .I files); I/O lowering applies only to f90. After lowering to High WHIRL, either the inliner or the IPA path (Local IPA, then Main IPA) may be taken, followed by LNO. The WHIRL-to-C (.w2c.c/.w2c.h) and WHIRL-to-Fortran (.w2f.f) translators also read High WHIRL. The main optimizer works on Mid WHIRL (-O2/O3), which is lowered to Low WHIRL for CG; at -O0 (-phase:w=off) everything is lowered directly to CG.

Front Ends
C front end based on gcc
C++ front end based on g++
Fortran90/95 front end from MIPSpro

Intermediate Representation
IR is called WHIRL
Tree structured, with references to the symbol table
Maps used for local or sparse annotation
Common interface between components
Multiple languages, multiple targets
Same IR, 5 levels of representation
Continuous lowering during compilation
Optimization strategy tied to level

IPA Main Stage
Analysis: alias analysis, array section, code layout
Optimization (fully integrated): inlining, cloning, dead function and variable elimination, constant propagation

IPA Design Features
User transparent: no makefile changes
Handles DSOs and unanalyzed objects
Provides info (e.g. alias analysis, procedure properties) smoothly to the loop nest optimizer, the main optimizer, and the code generator

Loop Nest Optimizer/Parallelizer
All languages (including OpenMP)
Loop level dependence analysis
Uniprocessor loop level transformations
Automatic parallelization

Loop Level Transformations
Based on a unified cost model
Heuristics integrated with software pipelining
Loop vector dependency info passed to CG
Transformations: loop fission, loop fusion, loop unroll-and-jam, loop interchange, loop peeling, loop tiling, vector data prefetching (see the sketch below)
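
To make one of these transformations concrete, here is a generic loop interchange sketch in C (an illustration, not Pro64 output): interchanging the loops turns a strided column walk into a unit-stride row walk, which improves cache locality.

/* Generic illustration (not Pro64 output): loop interchange turns the
 * column-major traversal in before() into a row-major traversal in
 * after() that walks memory contiguously. */
#define N 1024
double a[N][N];

void before(void) {               /* strided accesses: a[0][j], a[1][j], ... */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] += 1.0;
}

void after(void) {                /* unit-stride accesses after interchange */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] += 1.0;
}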

Parallelization
Automatic: array privatization, doacross parallelization, array section analysis
Directive based: OpenMP, integrated with the automatic methods

Global Optimization Phase
SSA is the unifying technology
Uses only SSA as the program representation
All traditional global optimizations implemented
Every optimization preserves SSA form
Each optimization can be reapplied as needed
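
A minimal SSA illustration (generic, not specific to Pro64): the fragment x = 1; if (c) x = 2; y = x; becomes x1 = 1; if (c) x2 = 2; x3 = phi(x1, x2); y1 = x3. Because every name is defined exactly once, def-use information is explicit in the representation, which is what makes it cheap to preserve the form and rerun individual optimizations.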

Pro64 Extensions to SSA
Representing aliases and indirect memory operations (Chow et al., CC 96)
Integrated partial redundancy elimination (Chow et al., PLDI 97; Kennedy et al., CC 98, TOPLAS 99)
Support for speculative code motion
Register promotion via load and store placement (Lo et al., PLDI 98)

Feedback
Used throughout the compiler
Instrumentation can be added at any stage
Explicit instrumentation data is incorporated where inserted
Instrumentation data is maintained and checked for consistency through program transformations

Design for Debugability (DFD) and Testability (DFT)
DFD and DFT built in from the start
Can build with extra validity checks
Simple option specification used to:
substitute components known to be good
enable/disable full components or specific optimizations
invoke alternative heuristics
trace individual phases

Where to Obtain the Pro64 Compiler and its Support
SGI source download: http://oss.sgi.com/projects/Pro64/
University of Delaware Pro64 Support Group: http://www.capsl.udel.edu/~pro64, pro64@capsl.udel.edu

PART II: Overview of the Pro64 Code Generator

Outline
Code generator flow diagram
WHIRL/CGIR and TARG-INFO
Hyperblock formation and predication (HBF)
Predicate Query System (PQS)
Loop preparation (CGPREP) and software pipelining
Global and local instruction scheduling (IGLS)
Global and local register allocation (GRA, LRA)

Flowchart of Code Generator
(flow diagram, summarized) WHIRL -> WHIRL-to-TOP lowering (producing CGIR, a quad op list) -> EBO -> hyperblock formation and critical-path reduction (supported by PQS, the Predicate Query System) -> EBO -> control flow opt I -> process inner loops: unrolling, EBO, loop prep, software pipelining -> IGLS: pre-pass -> GRA, LRA, EBO -> IGLS: post-pass -> control flow opt II -> code emission.
EBO: extended basic block optimization (peephole, etc.)

From WHIRL to CGIR - An Example
(a) Source:
int *a; int i; int aa;
aa = a[i];
(b) WHIRL (tree): ST aa := LD( a + (CVTL32(i) * 4) )
(c) CGIR:
T1 = sp + &a
T2 = ld T1
T3 = sp + &i
T4 = ld T3
T5 = sxt T4
T6 = T5 << 2
T7 = T6
T8 = T2 + T7
T9 = ld T8
T10 = sp + &aa
st T10 T9

Code Generation Intermediate Representation (CGIR)
TOPs (Target Operations) are "quads"
Operands/results are TNs
Basic block nodes in a control flow graph
Load/store architecture
Supports predication
Flags on TOPs (copy ops, integer add, load, etc.)
Flags on operands (TNs)

From WHIRL to CGIR - Cont'd
Information passed: alias information, loop information, symbol table and maps

The Target Information Table (TARG_INFO)
Objective: a parameterized description of a target machine and system architecture
Separates architecture details from the compiler's algorithms
Minimizes compiler changes when targeting a new architecture

The Target Information Table (TARG_INFO) - Cont'd
Based on an extension of the Cydra tables, with major improvements
Architectures already targeted: the whole MIPS family, IA-64, IA-32, SGI graphics processors (earlier version)

Flowchart of Code Generator (roadmap slide repeated; see the summary above)

Hyperblock Formation and Predicated Execution
Hyperblock: a single-entry multiple-exit control-flow region (loop body, hammock region, etc.)
Hyperblock formation algorithm: based on Scott Mahlke's method [Mahlke96], but with less aggressive tail duplication

Hyperblock Formation Algorithm
Steps: region identification, block selection, tail duplication, if-conversion
Candidate regions: hammock regions, innermost loops, general regions (path based)
Paths sorted by priorities (frequency, size, length, etc.)
Inclusion of a path is guided by its impact on resources, scheduling height, and priority level
Internal branches are removed via predication, with predicate reuse
Objective: keep the scheduling height close to that of the highest priority path.

Hyperblock Formation - An Example
(a) Source:
aa = a[i];
bb = b[i];
switch (aa) {
case 1: if (aa < tabsiz) aa = tab[aa];
case 2: if (bb < tabsiz) bb = tab[bb];
default: ans = aa + bb;
}
(b) CFG over blocks 1-8; (c) hyperblock formation with aggressive tail duplication, producing duplicated blocks 6', 7', 8' and hyperblocks H1 and H2 (figure)

Hyperblock Formation - An Example Cont'd
(figure) (a) CFG; (b) hyperblock formation with aggressive tail duplication; (c) Pro64 hyperblock formation, which builds H1 and H2 with less duplication

Features of the Pro64 Hyperblock Formation (HBF) Algorithm
Forms "good" rather than "maximal" hyperblocks
Avoids unnecessary duplication
No reverse if-conversion
Hyperblocks are not a barrier to global code motion later in IGLS

Predicate Query System (PQS)
Purpose: gather information and provide interfaces allowing other phases to make queries regarding the relationships among predicate values
PQS functions (examples):
BOOL PQSCG_is_disjoint (PQS_TN tn1, PQS_TN tn2)
BOOL PQSCG_is_subset (PQS_TN_SET& tns1, PQS_TN_SET& tns2)

Flowchart of Code Generator (roadmap slide repeated; see the summary above)

Loop Preparation and Optimization for Software Pipelining
Loop canonicalization for SWP
Read/write removal (register aware)
Loop unrolling (resource aware)
Recurrence removal or extension
Prefetch
Forced if-conversion

Pro64 Software Pipelining Method - Overview
Test for SWP-amenable loops
Extensive loop preparation and optimization before application [DeTo93]
Uses the lifetime-sensitive SWP algorithm [Huff93]
Register allocation after scheduling, based on the Cydra 5 approach [RLTS92, DeTo93]
Handles both while and do loops
Smooth switching to normal scheduling if not successful

Pro64 Lifetime-Sensitive Modulo Scheduling for Software Pipelining
Features: tries to place an op ASAP or ALAP to minimize register pressure; slack scheduling (an op's slack is Lstart - Estart); limited backtracking; operation-driven scheduling framework
Scheduling loop: compute Estart/Lstart for all unplaced ops; choose a good op and place it into the current partial schedule within its Estart/Lstart range; once all ops are placed, register allocate; on success, done; on failure, eject the conflicting ops and continue

Flowchart of Code Generator (roadmap slide repeated; see the summary above)

Integrated Global Local Scheduling (IGLS) Method
The basic IGLS framework integrates global code motion (GCM) with local scheduling [MaJD98]
IGLS extended to hyperblock scheduling
Performs profitable code motion between hyperblock regions and normal regions

IGLS Phase Flow Diagram
Hyperblock Scheduling (HBS) -> Global Code Motion (GCM: block priority selection, motion selection, target selection) -> Local Code Scheduling (LCS)

Advantages of the Extended IGLS Method - The Example Revisited
No rigid boundaries between hyperblocks and non-hyperblocks
GCM moves code into and out of a hyperblock according to profitability
(figure) (a) Pro64 hyperblock formation (H1, H2); (b) profitable duplication (block 8 duplicated as 8', forming H3)

Software Pipelining vs Normal Scheduling
(flow diagram, summarized) For each inner loop: if it is an SWP-amenable candidate, apply software pipelining; on success, proceed to code emission; on failure, or if not profitable, fall back to IGLS. Loops that are not SWP candidates go through IGLS (and GRA/LRA) as usual.

Flowchart of Code Generator (roadmap slide repeated; see the summary above)

Global and Local Register Allocation (GRA/LRA)
From prepass IGLS, LRA-RQ provides an estimate of local register requirements (the register request)
GRA allocates global variables using a priority-based register allocator [Chow83, ChowHennessy90, Briggs92], incorporating IA-64 specific extensions, e.g. register stack usage
LRA then runs, and the result feeds postpass IGLS

Local Register Allocation (LRA)
Assign_Registers: reverse linear scan
On failure, Fix_LRA: the first time, reorder instructions (depth-first ordering on the DDG) and retry; after that, spill (global spill before local spill)

Future Research Topics for the Pro64 Code Generator
Hyperblock formation
Predicate query system
Enhanced speculation support

PART III: Using Pro64 in Compiler Research and Development Case Studies

Outline
General remarks
Case Study I: integration of a new instruction reordering algorithm to minimize register pressure [Govind, Yang, Amaral, Gao 2000]
Case Study II: design and evaluation of an induction pointer prefetching algorithm [Stoutchinin, Douillet, Amaral, Dehnert, Gao 2000]

Case I
Introduction of the Minimum Register Instruction Sequence (MRIS) problem and a proposed solution
Problem formulation
The proposed algorithm
Pro64 porting experience: where to start, how to start
Results
Summary

Researchers
R. Govindarajan (Indian Institute of Science)
Hongbo Yang (Univ. of Delaware)
Chihong Zhang (Conexant)
José Nelson Amaral (Univ. of Alberta)
Guang R. Gao (Univ. of Delaware)

The Minimum Register Instruction Sequence Problem Given a data dependence graph G, derive an instruction sequence S for G that is optimal in the sense that its register requirement is minimum.

A Motivating Example
(a) DDG: a feeds b, c, d, and e; b and c feed f; d and e feed g; f and g feed h.
(b) Instruction Sequence 1:
a: s1 = ld [x];
b: s2 = s1 + 4;
c: s3 = s1 * 8;
d: s4 = s1 - 4;
e: s5 = s1 / 2;
f: s6 = s2 * s3;
g: s7 = s4 - s5;
h: s8 = s6 * s7;
(c) Instruction Sequence 2:
a: s1 = ld [x];
d: s4 = s1 - 4;
e: s5 = s1 / 2;
g: s7 = s4 - s5;
c: s3 = s1 * 8;
b: s2 = s1 + 4;
f: s6 = s2 * s3;
h: s8 = s6 * s7;
Observation: the register requirement drops 25% from (b) to (c) - four registers versus three (see the sketch below)!
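
A self-contained C sketch (illustrative, not Pro64 code) that checks this claim by liveness simulation: for each ordering of the DDG above, it counts the maximum number of simultaneously live values, treating a value as dead after its last use.

/* A small checker that computes the register requirement of an
 * instruction sequence by simulating liveness.  Values s1..s8 are
 * numbered 1..8; each op defines one value and uses up to two others.
 * A value dies at its last use in the chosen order. */
#include <stdio.h>

typedef struct { char name; int def; int use[2]; } Op;

static const Op ops[] = {
    {'a', 1, {0, 0}},  /* s1 = ld [x]   */
    {'b', 2, {1, 0}},  /* s2 = s1 + 4   */
    {'c', 3, {1, 0}},  /* s3 = s1 * 8   */
    {'d', 4, {1, 0}},  /* s4 = s1 - 4   */
    {'e', 5, {1, 0}},  /* s5 = s1 / 2   */
    {'f', 6, {2, 3}},  /* s6 = s2 * s3  */
    {'g', 7, {4, 5}},  /* s7 = s4 - s5  */
    {'h', 8, {6, 7}},  /* s8 = s6 * s7  */
};

static int reg_requirement(const char *order) {
    int last_use[9] = {0}, live[9] = {0}, n = 0, max = 0;
    /* Pass 1: find the position of each value's last use. */
    for (int pos = 0; order[pos]; pos++) {
        const Op *op = &ops[order[pos] - 'a'];
        for (int u = 0; u < 2; u++)
            if (op->use[u]) last_use[op->use[u]] = pos;
    }
    /* Pass 2: simulate, counting simultaneously live values. */
    for (int pos = 0; order[pos]; pos++) {
        const Op *op = &ops[order[pos] - 'a'];
        for (int u = 0; u < 2; u++)    /* kill values at their last use */
            if (op->use[u] && last_use[op->use[u]] == pos && live[op->use[u]]) {
                live[op->use[u]] = 0; n--;
            }
        live[op->def] = 1; n++;        /* define the result */
        if (n > max) max = n;
    }
    return max;
}

int main(void) {
    printf("sequence 1 (abcdefgh): %d registers\n", reg_requirement("abcdefgh"));
    printf("sequence 2 (adegcbfh): %d registers\n", reg_requirement("adegcbfh"));
    return 0;  /* prints 4 and 3: the 25% reduction noted above */
}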

Motivation
IA-64 style processors: reduce spills in the local register allocation phase; reduce Local Register Allocation (LRA) requests in the Global Register Allocation (GRA) phase; reduce overall register pressure on a per-procedure basis
Out-of-order issue processors: instruction reordering buffer, register renaming

How to Solve the MRIS Problem?
(a) Concepts: register lineages, live ranges of lineages, lineage interference
(b) DDG: as in the motivating example
(c) Lineages:
L1 = (a, b, f, h); L2 = (c, f); L3 = (e, g, h); L4 = (d, g)
Questions:
Can L1 and L2 share the same register?
Can L2 and L3 share the same register?
Can L1 and L4 share the same register?
Can L2 and L4 share the same register?

Lineage Interference Graph
L1 = (a, b, f, h); L2 = (c, f); L3 = (e, g, h); L4 = (d, g)
(a) Original DDG; (b) Lineage Interference Graph (LIG) over L1-L4 (figure)
Question: is the lower bound of the required registers 3?
Challenge: derive a "Heuristic Register Bound" (HRB)!

Our Solution Method
DDG -> form Lineage Interference Graph (LIG) -> derive HRB -> extended list scheduling guided by HRB -> a good instruction sequence
A "good" construction algorithm for the LIG
An effective heuristic method to calculate the HRB
An efficient scheduling method (no backtracking)

Pro64 Porting Experience
Porting plan and design
Implementation
Debugging and validation
Evaluation

Implementation
Dependence graph construction
LIG construction and coloring
The reordering algorithm implementation

Porting Plan and Design
Understand the compiler infrastructure
Understand the register model (mainly from targ_info, e.g. ../common/targ_info/abi/ia64):
register classes (int, float, predicate, app, control)
register save/restore conventions: caller/callee save, return value, argument passing, stack pointer, etc.

Register Allocation
GRA, then LRA at the block level
Assign_Registers; on success, done; on failure, Fix_LRA_Blues: reschedule via local code motion, or spill global or local registers

Implementation
DDG construction: use native service routines, e.g. CG_DEP_Compute_Graph
LIG coloring: use the native set package (e.g. bitset.c)
Scheduler implementation: native vector package support (e.g. cg_vector.cxx)
Access the dependence graph via native service functions: ARC_succs, ARC_preds, ARC_kind

Debugging and Validation
Trace flags:
tt54:0x1 - general trace of LRA
tt45:0x4 - dependence graph building
tr53 - Target Operations (TOPs) before LRA
tr54 - TOPs after LRA

Evaluation
Static measurement: fat point (-tt54:0x40)
Dynamic measurement: hardware counters in the R12K and perfex

Evaluation
For the MIPS R12K (SPEC95fp), the lineage-based algorithm reduces the number of loads executed by 12%, the number of stores by 14%, and the execution time by 2.5% over a baseline. It is slightly better than the algorithm in the MIPSpro compiler.

Case II: Design and Evaluation of an Induction Pointer Prefetching Algorithm

Researchers
Artour Stoutchinin (STMicroelectronics)
José Nelson Amaral (Univ. of Alberta)
Guang R. Gao (Univ. of Delaware)
Jim Dehnert (Silicon Graphics Inc.)
Suneel Jain (Narus Inc.)
Alban Douillet (Univ. of Delaware)

Motivation
The important loops of many programs are pointer-chasing loops that access recursive data structures through induction pointers. Example:
max = 0;
current = head;
while (current != NULL) {
  if (current->key > max)
    max = current->key;
  current = current->next;
}

Problem Statement
How to identify pointer-chasing recurrences?
How to decide whether there are enough processor resources and memory bandwidth to profitably prefetch an induction pointer?
How to efficiently integrate induction pointer prefetching with loop scheduling, based on the profitability analysis?

Prefetching Costs
More instructions to issue
More memory traffic
Longer code (disruption in the instruction cache)
Displacement of potentially good data from the cache
Before prefetching:
t226 = lw 0x34(t228)
After prefetching:
t226 = lw 0x34(t228)
tmp = subu t226, t226s
tmp = addu tmp, tmp
tmp = addu t226, tmp
pref 0x0(tmp)
t226s = t226

What to Prefetch? When to Prefetch it?
A good optimizing compiler should only prefetch data that will actually be referenced. It should prefetch far enough in advance to prevent a cache miss when the reference occurs. But not too far in advance, because the data might be evicted from the cache before it is used, or might displace data that will be referenced again.
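
A common rule of thumb from the software prefetching literature (not stated on this slide): issue the prefetch d = ceil(L / s) iterations ahead, where L is the expected miss latency and s is the estimated cycle count of one loop iteration. For example, a 100-cycle miss over 25-cycle iterations suggests prefetching d = 4 iterations ahead.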

Prefetch Address In order to prefetch, the compiler must calculate addresses that will be referenced in future iterations of the loop. For loops that access regular data structures, such as vectors and matrices, compilers can use static analysis of the array indexes to compute the prefetching addresses. How can we predict future values of induction pointers?

Key Intuition
Recursive data structures are often allocated at regular intervals. Example:
curr = head = (item) malloc(sizeof(item));
while ((curr->key = get_key()) != NULL) {
  curr->next = curr = (item) malloc(sizeof(item));
  other_memory_allocations();
}
curr->next = NULL;
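
A runnable C experiment illustrating this intuition (not from the tutorial): print the address deltas between consecutively allocated list nodes. With many allocators, and few intervening allocations, the deltas come out constant - exactly the stride the prefetch speculates on; this is a property of the allocator, not a guarantee.

/* Observe that consecutive malloc'd nodes often sit at a constant
 * address stride. */
#include <stdio.h>
#include <stdlib.h>

typedef struct item { int key; struct item *next; } item;

int main(void) {
    item *head = malloc(sizeof(item)), *curr = head;
    for (int i = 0; i < 8; i++) {
        curr->next = malloc(sizeof(item));
        curr = curr->next;
    }
    curr->next = NULL;
    /* Print the address delta between successive nodes. */
    for (curr = head; curr->next != NULL; curr = curr->next)
        printf("stride = %td bytes\n", (char *)curr->next - (char *)curr);
    return 0;
}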

Pre-Fetching Technique
Example (the previous-node pointer tmp is refreshed each iteration so the stride stays one link wide):
max = 0;
current = head;
tmp = current;
while (current != NULL) {
  if (current->key > max)
    max = current->key;
  current = current->next;
  stride = current - tmp;
  prefetch(current + stride*k);
  tmp = current;
}

Prefetch Sequence (R10K)
In our implementation, the stride is recomputed in every iteration of the loop, making it tolerant of (infrequent) stride changes.
stride = addr - addr.prev
stride = stride * k
addr.pref = addr + stride
addr.prev = addr
pref addr.pref
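
The same transformation in compilable C, assuming a GCC-compatible compiler for __builtin_prefetch (max_key and the distance K are names invented for this sketch):

#include <stddef.h>

typedef struct item { int key; struct item *next; } item;

enum { K = 4 };  /* prefetch distance in iterations: a tuning assumption */

int max_key(item *head) {
    int max = 0;
    item *current = head;
    while (current != NULL) {
        if (current->key > max)
            max = current->key;
        item *next = current->next;
        if (next != NULL) {
            /* Recompute the stride every iteration, as in the R10K
               sequence, so infrequent stride changes are tolerated. */
            ptrdiff_t stride = (char *)next - (char *)current;
            __builtin_prefetch((char *)next + stride * K);
        }
        current = next;
    }
    return max;
}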

Identification of Pointer-Chasing Recurrences
A surprisingly simple method works well: look in the intermediate code for recurrence circuits containing only loads with constant offsets. Examples:
node = ptr->next;          r1 <- load r2, offset_next
ptr = node->ptr;           r2 <- load r1, offset_ptr
current = current->next;   r2 <- load r1; r1 <- load r2, offset_next
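
A toy C sketch of this identification step (illustrative only, not the Pro64 implementation): with the dependence arcs of the two-load example above encoded in a matrix, self-arcs and 2-cycles whose members are all constant-offset loads are reported as pointer-chasing recurrences.

#include <stdio.h>

#define N 2  /* number of ops in the loop body */

static int is_const_offset_load[N] = {1, 1};  /* both ops are loads     */
static int dep[N][N] = {                      /* dep[i][j]: i feeds j   */
    {0, 1},   /* op 0: r1 <- load r2, offset_next, feeding op 1         */
    {1, 0},   /* op 1: r2 <- load r1, offset_ptr, feeding r1 back       */
};

int main(void) {
    for (int i = 0; i < N; i++) {
        if (dep[i][i] && is_const_offset_load[i])
            printf("pointer-chasing recurrence: {op %d}\n", i);
        for (int j = i + 1; j < N; j++)
            if (dep[i][j] && dep[j][i] &&
                is_const_offset_load[i] && is_const_offset_load[j])
                printf("pointer-chasing recurrence: {op %d, op %d}\n", i, j);
    }
    return 0;
}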

Profitability Analysis Goal: Balance the gains and costs of prefetching. Although we use resource estimates analogous to those done for software pipelining, we consider loop bodies with control flow. How to estimate the resources available for prefetching in a basic block B that belongs to many data dependence recurrences?

Software Pipelining
What limits the speed of a loop?
Data dependences: recurrence initiation interval (recMII)
Processor resources: resource initiation interval (resMII)
Memory accesses: memory initiation interval (memMII)
The initiation interval II must satisfy II >= max(recMII, resMII, memMII).
(figure: a schedule of a loop body - ldf, fadds, stf, sub, cmp, bg - across a 16-cycle timeline)

Data Dependences (recMII)
The recurrence minimum initiation interval is determined by the worst recurrence circuit c in the dependence graph:
recMII = max over circuits c of ceil( (sum of latencies along c) / (sum of dependence distances along c) )
Example, with edges labeled (dist, lat), here (0,2) and (1,2):
for i = 0 to N - 1 do
  a: X[i] = X[i - 1] + R[i];
  b: Y[i] = X[i] + Z[i - 1];
  c: Z[i] = Y[i] + 1;
end;
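
Working this out for the example, assuming every dependence has latency 2 as labeled: the self-recurrence on a (X[i] depends on X[i-1], distance 1) gives ceil(2/1) = 2, while the circuit b -> c -> b (Z[i-1] feeding back into b, total distance 1, total latency 4) gives ceil(4/1) = 4, so recMII = 4.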

The recMII for Loops with Control Flow
An instruction of a basic block B can belong to many recurrences (with distinct control paths). We define the recurrence MII of a load operation L as:
recMII(L) = max over recurrences c with L in c of recMII(c)
where L in c means that the operation L is part of the recurrence c.
(figure: control flow graph with blocks B1-B8)

Processor Resources (resMII)
A basic block B may belong to multiple control paths. We define the resource constraint of a basic block B as the maximum over all control paths p that execute B:
resMII(B) = max over paths p containing B of resMII(p)
(figure: control flow graph with blocks B1-B8)

Available Memory Bandwidth
Processors with non-blocking caches can support up to k outstanding cache misses without stalling. We define the available memory bandwidth over all control paths that execute a basic block B as
S(B) = min over paths p containing B of ( k - m(p) )
where m(p) is the number of expected cache misses in each control path p.
(figure: control flow graph with blocks B1-B8)
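
For instance (hypothetical numbers, just working the definition): with k = 4 outstanding misses supported and two paths through B expecting m = 1 and m = 3 misses respectively, S(B) = min(4-1, 4-3) = 1, so one additional prefetch can be issued in B without risking stalls on any path.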

Profitability Analysis Adding prefetch code for an induction pointer L in a basic block B is profitable if both: (1) the mii due to recurrences that contain L is greater than the resMII after prefetch insertion, and (2) there is enough memory bandwidth to enable another cache miss without causing stalls.

Computing Available Memory Bandwidth To compute the available memory bandwidth of a control path we need to estimate how many cache misses are expected in that control path. We use a graph coloring technique over a cache miss interference graph to predict which memory references are likely to incur a miss.

The Miss Interference Graph
Two memory references interfere if:
1. They are both expected to miss the cache
2. They can both be issued in the same iteration of the loop
3. They do not fall into the same cache line
Miss interference graph assumptions:
1. Loop-invariant references are cache hits (global-pointer relative, stack-pointer relative, etc.)
2. Memory references on mutually exclusive control paths do not interfere
3. References relative to the same base address interfere only if their relative offset is larger than the cache line

Prefetching Algorithm
DoPrefetch(P, V, E)
1. C <- pointer-chasing recurrences
2. R <- prioritized list of induction pointer loads in C
3. N <- prioritized list of other loads (not in C)
4. O <- R + N
5. mark each L in O as a cache miss
6. for each L in O, L in B
7.   do if recMII_P(B) >= resMII_P(B) and S(B)
8.     then add prefetch for L to B
9.       mark L as cache hit
10.  endif
11. endfor

An Example*
*mcf: minimal cost flow optimizer (Konrad-Zuse Informatics Center, Berlin)
1 while (arcin) {
2   tail = arcin->tail;
3   if (tail->time + arcin->org_cost > latest) {
4     arcin = (arc_t *)tail->mark;
5     continue; }
6   arc_cost = tail->potential + head_potential;
7   if (red_cost < 0) {
8     if (new_arcs < MAX_NEW_ARCS) {
9       insert_new_arc(arcnew, new_arcs, tail, head, arc_cost, red_cost);
10      new_arcs++;
11    else if ((cost_t)arcnew[0].flow > red_cost)
12      replace_weaker_arc(arcnew, tail, head, arc_cost, red_cost);
13    arcin = (arc_t *)tail->mark;

An Example (source code repeated from the previous slide)

B1: 1. t228 = lw 0x0(t226)
    2. t229 = lw 0x14(t226)
    3. t230 = lw 0x38(t228)
    4. t231 = addu t229, t230
    5. t232 = slt t220, 0
    6. bne B3, t232, 0
B2: 7. t226 = lw 0x34(t228)
    8. b B8
B3: 9. t234 = lw 0x2c(t228)
    10. t235 = subu t225, t234
    11. t233 = addiu t235, 0x1e
    12. bgez B7, t233
B4: 13. t236 = slt t209, t175
    14. beq B6, t236, 0
B5: insert_new_arc();
B6: replace_weaker_arc();
B7: 15. t226 = lw 0x34(t228)
B8: 16. bne B1, t226, 0


(the blocks containing the pointer-chasing loads; B3-B6 elided)
B1: 1. t228 = lw 0x0(t226)
    2. t229 = lw 0x14(t226)
    3. t230 = lw 0x38(t228)
    4. t231 = addu t229, t230
    5. t232 = slt t220, 0
    6. bne B3, t232, 0
B2: 7. t226 = lw 0x34(t228)
    8. b B10
B7: 15. t226 = lw 0x34(t228)
    (to B8)

(the same blocks after prefetch insertion; B3-B6 elided)
B1: 1. t228 = lw 0x0(t226)
    1. tmp = subu t228, t228s
    1. tmp = addu tmp, tmp
    1. tmp = addu t228, tmp
    1. pref 0x34(tmp)
    1. t228s = t228
    2. t229 = lw 0x14(t226)
    3. t230 = lw 0x38(t228)
    4. t231 = addu t229, t230
    5. t232 = slt t220, 0
    6. bne B3, t232, 0
B2: 7. t226 = lw 0x34(t228)
    7. tmp = subu t226, t226s
    7. tmp = addu tmp, tmp
    7. tmp = addu t226, tmp
    7. pref 0x0(tmp)
    7. t226s = t226
    8. b B10
B7: 15. t226 = lw 0x34(t228)
    15. tmp = subu t226, t226s
    15. tmp = addu tmp, tmp
    15. tmp = addu t226, tmp
    15. pref 0x0(tmp)
    15. t226s = t226
    (to B8)

When Pointer Prefetch Works

When Pointer Prefetch Does Not Help

Summary of Attributes
Software-only implementation
Simple candidate identification
Simple code transformation
No impact on user data structures
Simple profitability analysis, local to the loop
Performance degradations are rare and minor

Open Questions
How often is the speculated stride correct? Can instrumentation feedback help?
How well does the speculative prefetch work with other recursive data structures: trees, graphs, etc.?
How well does this approach work for read/write recursive data structures?

Related Work (Software)
Luk-Mowry (ASPLOS-96): greedy prefetching; history-pointer prefetching; data linearization prefetching (changes the data structure storage)
Lipasti et al. (Micro-95): prefetching pointers at procedure call sites
Liu-Dimitri-Kaeli (Journal of Syst. Arch.-99): maintains a table of offsets for prefetching

Related Work (Hardware)
Roth-Moshovos-Sohi (ASPLOS, 1998)
Gonzalez-Gonzalez (ICS, 1997)
Mehrotra (Urbana-Champaign, 1996)
Chen-Baer (IEEE Trans. Computers, 1995)
Charney-Reeves (Trans. Comp., 1994)
Jegou-Temam (ICS, 1993)
Fu-Patel (Micro, 1992)

Execution Time Measurements

Prefetch Improvement

L1 Cache Misses

L2 Cache Misses

TLB Misses

Benchmarks
gcc - GNU C compiler
li - Lisp interpreter
mcf - minimal cost flow solver
parser - syntactic parser of English
twolf - place and route simulator
mlp - multi-layer perceptron simulator
ft - minimum spanning tree algorithm