The SGI Pro64 Compiler Infrastructure


1 The SGI Pro64 Compiler Infrastructure
A Tutorial. Guang R. Gao (U of Delaware), J. Dehnert (SGI), J. N. Amaral (U of Alberta), R. Towle (SGI)

2 Acknowledgement The SGI Compiler Development Teams
The MIPSpro/Pro64 Development Team; the University of Delaware CAPSL Compiler Team. These individuals contributed directly to this tutorial: A. Douillet (UDel), F. Chow (Equator), S. Chan (Intel), W. Ho (Routefree), Z. Hu (UDel), K. Lesniak (SGI), S. Liu (HP), R. Lo (Routefree), S. Mantripragada (SGI), C. Murthy (SGI), M. Murphy (SGI), G. Pirocanac (SGI), D. Stephenson (SGI), D. Whitney (SGI), H. Yang (UDel)

3 What is Pro64? A suite of optimizing compiler tools for Linux/Intel IA-64 systems: C, C++ and Fortran90/95 compilers, conforming to the IA-64 Linux ABI and API standards, open to all researchers/developers in the community, and compatible with the HP Native User Environment

4 Who Might Want to Use Pro64?
Researchers: test new compiler analysis and optimization algorithms. Developers: retarget to another architecture/system. Educators: a compiler teaching platform

5 Outline Background and Motivation
Part I: An overview of the SGI Pro64 compiler infrastructure Part II: The Pro64 code generator design Part III: Using Pro64 in compiler research & development SGI Pro64 support Summary

6 PART I: Overview of the Pro64 Compiler

7 Outline Logical compilation model and component flow
WHIRL Intermediate Representation Inter-Procedural Analysis (IPA) Loop Nest Optimizer (LNO) and Parallelization Global optimization (WOPT) Feedback Design for debugability and testability

8 Logical Compilation Model
Process flow (fork and exec): driver (sgicc/sgif90/sgiCC) → front end + IPA (gfec/gfecc/mfef90) → back end (be, as) → linker (ld). Data path: src (.c/.C/.f) → WHIRL (.B/.I) → obj (.o) → a.out/.so

9 Components of Pro64 Code Generation Front end
Interprocedural Analysis and Optimization Loop Nest Optimization and Parallelization Global Optimization Code Generation

10 Data Flow Relationship Between Modules
Flattened flowchart. The front ends gfec, gfecc and mfef90 (I/O lowering, f90 only) emit Very High WHIRL (.B). Either path is taken next: the inliner (.I) or IPA (Local IPA, then Main IPA). Lower to High WHIRL; LNO runs here, and WHIRL-to-C (.w2c.c/.w2c.h) or WHIRL-to-Fortran (.w2f.f) source can be emitted. Lower to Mid WHIRL; the main optimizer runs at -O2/O3 (skipped with -phase:w=off). At -O0, lower all directly. Low WHIRL then enters CG.

11 Front Ends C front end based on gcc C++ front end based on g++
Fortran90/95 front end from MIPSpro

12 Intermediate Representation
IR is called WHIRL Tree structured, with references to symbol table Maps used for local or sparse annotation Common interface between components Multiple languages, multiple targets Same IR, 5 levels of representation Continuous lowering during compilation Optimization strategy tied to level

13 IPA Main Stage
Analysis: alias analysis, array section, code layout. Optimization (fully integrated): inlining, cloning, dead function and variable elimination, constant propagation

14 IPA Design Features
User transparent: no makefile changes; handles DSOs and unanalyzed objects. Provides info (e.g. alias analysis, procedure properties) smoothly to the loop nest optimizer, the main optimizer, and the code generator

15 Loop Nest Optimizer/Parallelizer
All languages (including OpenMP) Loop level dependence analysis Uniprocessor loop level transformations Automatic parallelization

16 Loop Level Transformations
Based on a unified cost model; heuristics integrated with software pipelining; loop vector dependency info passed to CG. Transformations: loop fission, loop fusion, loop unroll and jam, loop interchange, loop peeling, loop tiling, vector data prefetching

17 Parallelization
Automatic: array privatization, doacross parallelization, array section analysis. Directive based: OpenMP, integrated with the automatic methods

18 Global Optimization Phase
SSA is unifying technology Use only SSA as program representation All traditional global optimizations implemented Every optimization preserves SSA form Can reapply each optimization as needed

19 Pro64 Extensions to SSA Representing aliases and indirect memory operations (Chow et al, CC 96) Integrated partial redundancy elimination (Chow et al, PLDI 97; Kennedy et al, CC 98, TOPLAS 99) Support for speculative code motion Register promotion via load and store placement (Lo et al, PLDI 98)

20 Feedback Used throughout the compiler
Instrumentation can be added at any stage. Instrumentation data is incorporated where it was inserted, then maintained and checked for consistency through program transformations.

21 Design for Debugability (DFD) and Testability (DFT)
DFD and DFT built-in from start Can build with extra validity checks Simple option specification used to: Substitute components known to be good Enable/disable full components or specific optimizations Invoke alternative heuristics Trace individual phases

22 Where to Obtain Pro64 Compiler and its Support
SGI Source download University of Delaware Pro64 Support Group

23 PART II: Overview of the Pro64 Code Generator

24 Outline Code generator flow diagram
WHIRL/CGIR and TARG-INFO Hyperblock formation and predication (HBF) Predicate Query System (PQS) Loop preparation (CGPREP) and software pipelining Global and local instruction scheduling (IGLS) Global and local register allocation (GRA, LRA)

25 Flowchart of Code Generator
Flattened flowchart: WHIRL → WHIRL-to-TOP lowering → CGIR (quad op list) → EBO (extended basic block optimization: peephole, etc.) → Control Flow Opt I, EBO → hyperblock formation (served by PQS, the Predicate Query System), critical-path reduction → process inner loops (unrolling, EBO; loop prep, software pipelining) → Control Flow Opt II, EBO → IGLS: pre-pass → GRA, LRA, EBO → IGLS: post-pass → code emission

26 From WHIRL to CGIR An Example
(a) Source:
int *a; int i; int aa;
aa = a[i];
(b) WHIRL: a tree, ST aa <- LD <- ADD(a, MPY(CVTL32(i), 4))
(c) CGIR:
T1 = sp + &a
T2 = ld T1
T3 = sp + &i
T4 = ld T3
T5 = sxt T4
T6 = T5 << 2
T7 = T6
T8 = T2 + T7
T9 = ld T8
T10 = sp + &aa
st T10 T9

27 Code Generation Intermediate Representation (CGIR)
TOPs (Target Operations) are “quads” Operands/results are TNs Basic block nodes in control flow graph Load/store architecture Supports predication Flags on TOPs (copy ops, integer add, load, etc.) Flags on operands (TNs)

28 From WHIRL to CGIR, Cont'd
Information passed: alias information, loop information, symbol table and maps

29 The Target Information Table (TARG_INFO)
Objective: Parameterized description of a target machine and system architecture Separates architecture details from the compiler’s algorithms Minimizes compiler changes when targeting a new architecture

30 The Target Information Table (TARG_INFO)
Cont'd. Based on an extension of the Cydra tables, with major improvements. Architectures already targeted: the whole MIPS family, IA-64, IA-32, SGI graphics processors (earlier version)

31 Flowchart of Code Generator (repeat of slide 25)

32 Hyperblock Formation and Predicated Execution
Hyperblock: a single-entry multiple-exit control-flow region (loop body, hammock region, etc.). Hyperblock formation algorithm: based on Scott Mahlke's method [Mahlke96], but with less aggressive tail duplication

33 Hyperblock Formation Algorithm
Phases: region identification → block selection → tail duplication → if-conversion. Candidate regions: hammock regions, innermost loops, general regions (path based). Paths are sorted by priority (freq., size, length, etc.); inclusion of a path is guided by its impact on resources, scheduling height, and priority level. Internal branches are removed via predication, with predicate reuse. Objective: keep the scheduling height close to that of the highest priority path.

34 Hyperblock Formation - An Example
(a) Source:
aa = a[i]; bb = b[i];
switch (aa) {
case 1: if (aa < tabsiz) aa = tab[aa];
case 2: if (bb < tabsiz) bb = tab[bb];
default: ans = aa + bb;
}
(b) CFG; (c) hyperblock formation with aggressive tail duplication. [figure: CFG over blocks 1-8, with duplicated blocks 6', 7', 8' forming hyperblocks H1 and H2]

35 Hyperblock Formation - An Example
Cont'd. [figure: (a) CFG; (b) hyperblock formation with aggressive tail duplication; (c) Pro64 hyperblock formation, which forms H1 and H2 with less duplication]

36 Features of the Pro64 Hyperblock Formation (HBF) Algorithm
Form “good” vs. “maximal” hyperblocks Avoid unnecessary duplication No reverse if-conversion Hyperblocks are not a barrier to global code motion later in IGLS

37 Predicate Query System (PQS)
Purpose: gather information and provide interfaces allowing other phases to make queries regarding the relationships among predicate values PQS functions (examples) BOOL PQSCG_is_disjoint (PQS_TN tn1, PQS_TN tn2) BOOL PQSCG_is_subset (PQS_TN_SET& tns1, PQS_TN_SET& tns2)

38 Flowchart of Code Generator (repeat of slide 25)

39 Loop Preparation and Optimization for Software Pipelining
Loop canonicalization for SWP Read/Write removal (register aware) Loop unrolling (resource aware) Recurrence removal or extension Prefetch Forced if-conversion

40 Pro64 Software Pipelining Method Overview
Test for SWP-amenable loops Extensive loop preparation and optimization before application [DeTo93] Use lifetime sensitive SWP algorithm [Huff93] Register allocation after scheduling based on Cydra 5 [RLTS92, DeTo93] Handle both while and do loops Smooth switching to normal scheduling if not successful.

41 Pro64 Lifetime-Sensitive Modulo Scheduling for Software Pipelining
Features: slack scheduling (try to place an op ASAP or ALAP to minimize register pressure), limited backtracking, and an operation-driven scheduling framework. Flow: compute Estart/Lstart for all unplaced ops; choose a good op to place into the current partial schedule within its Estart/Lstart range; once all ops are placed, register allocate; on success, done; otherwise eject the conflicting ops and continue scheduling.

42 Flowchart of Code Generator (repeat of slide 25)

43 Integrated Global Local Scheduling (IGLS) Method
The basic IGLS framework integrates global code motion (GCM) with local scheduling [MaJD98] IGLS extended to hyperblock scheduling Performs profitable code motion between hyperblock regions and normal regions

44 IGLS Phase Flow Diagram
Flattened diagram: Global Code Motion (GCM) performs block priority selection, motion selection, and target selection; it interacts with Hyperblock Scheduling (HBS) and Local Code Scheduling (LCS).

45 Advantages of the Extended IGLS Method - The Example Revisited
Advantages: no rigid boundaries between hyperblocks and non-hyperblocks; GCM moves code into and out of a hyperblock according to profitability. [figure: (a) Pro64 hyperblock; (b) profitable duplication, with blocks moving among H1, H2, H3]

46 Software Pipelining vs Normal Scheduling
Inner loop processing: is this an SWP-amenable loop candidate? If yes, software pipelining is tried; on success the loop proceeds to code emission, and on failure (or if not profitable) it falls back smoothly to normal scheduling. If no, the loop takes the normal path: IGLS, GRA/LRA, IGLS, code emission.

47 Flowchart of Code Generator (repeat of slide 25)

48 Global and Local Register Allocation (GRA/LRA)
From prepass IGLS, LRA-RQ (register request) provides an estimate of local register requirements. GRA allocates global variables using a priority-based register allocator [Chow83, ChowHennessy90, Briggs92], incorporating IA-64 specific extensions, e.g. register stack usage. LRA then runs, feeding postpass IGLS.

49 Local Register Allocation (LRA)
Assign_Registers allocates using a reverse linear scan. If it fails, Fix_LRA intervenes: the first time, instruction reordering (a depth-first ordering on the DDG); after that, spill global, then spill local; Assign_Registers is then retried until it succeeds.

50 Future Research Topics for Pro64 Code Generator
Hyperblock formation Predicate query system Enhanced speculation support

51 PART III: Using Pro64 in Compiler Research and Development
Case Studies

52 Outline General Remarks
Case Study I: Integration of new instruction reordering algorithm to minimize register pressure [Govind,Yang,Amaral,Gao2000] Case Study II: Design and evaluation of an induction pointer prefetching algorithm [Stouchinin,Douillet,Amaral,Dehnert,Gao2000]

53 Case I Introduction of the Minimum Register Instruction Sequence (MRIS) problem and a proposed solution Problem formulation The proposed algorithm Pro64 porting experience Where to start How to start Results Summary

54 Researchers R. Govindarajan (Indian Inst. of Science)
Hongbo Yang (Univ. of Delaware) Chihong Zhang (Conexant) José Nelson Amaral (Univ. of Alberta) Guang R. Gao (Univ. of Delaware)

55 The Minimum Register Instruction Sequence Problem
Given a data dependence graph G, derive an instruction sequence S for G that is optimal in the sense that its register requirement is minimum.

56 A Motivating Example
(a) DDG: a feeds b, c, d, e; b and c feed f; d and e feed g; f and g feed h.
(b) Instruction sequence 1:
a: s1 = ld [x]; b: s2 = s1 + 4; c: s3 = s1 * 8; d: s4 = s1 - 4; e: s5 = s1 / 2; f: s6 = s2 * s3; g: s7 = s4 - s5; h: s8 = s6 * s7
(c) Instruction sequence 2:
a: s1 = ld [x]; d: s4 = s1 - 4; e: s5 = s1 / 2; g: s7 = s4 - s5; c: s3 = s1 * 8; b: s2 = s1 + 4; f: s6 = s2 * s3; h: s8 = s6 * s7
Observation: register requirements drop 25% from (b) to (c)!

57 Motivation IA-64 style processors Out-of-order issue processor
Reduce spills in local register allocation phase Reduce Local Register Allocation (LRA) requests in Global Register Allocation (GRA) phase Reduce overall register pressure on a per procedure basis Out-of-order issue processor Instruction reordering buffer Register renaming

58 How to Solve the MRIS Problem?
Concepts: register lineages, live ranges of lineages, lineage interference. (b) DDG (as in slide 56). (c) Lineages: L1 = (a, b, f, h); L2 = (c, f); L3 = (e, g, h); L4 = (d, g)


61 How to Solve the MRIS Problem?
Cont'd (same DDG and lineages as slide 58). Questions: Can L1 and L2 share the same register? Can L2 and L3? Can L1 and L4? Can L2 and L4?

62 Lineage Interference Graph
L1 = (a, b, f, h); L2 = (c, f); L3 = (e, g, h); L4 = (d, g). (a) Original DDG; (b) Lineage Interference Graph (LIG) over L1-L4. Question: is the lower bound on required registers 3? Challenge: derive a "Heuristic Register Bound" (HRB)!

63 Our Solution Method
From the DDG: form the Lineage Interference Graph (LIG), derive the HRB, then run extended list scheduling guided by the HRB to produce a good instruction sequence. Ingredients: a "good" construction algorithm for the LIG; an effective heuristic method to calculate the HRB; an efficient scheduling method (no backtracking)

64 Pro64 Porting Experience
Porting plan and design Implementation Debugging and validation Evaluation

65 Implementation
Dependence graph construction; LIG construction and coloring; the reordering algorithm implementation

66 Porting Plan and Design
Understand the compiler infrastructure. Understand the register model, mainly from targ_info (e.g. ../common/targ_info/abi/ia64): register classes (int, float, predicate, app, control); register save/restore conventions (caller/callee save, return value, argument passing, stack pointer, etc.)

67 Register Allocation
GRA, then LRA at the block level: Assign_Registers; on failure, Fix_LRA_Blues (reschedule, local code motion, spill global or local registers), then retry until Assign_Registers succeeds.

68 Implementation
DDG construction: use native service routines, e.g. CG_DEP_Compute_Graph. LIG coloring: use native support for the set package (e.g. bitset.c). Scheduler implementation: native vector package support (e.g. cg_vector.cxx). Access the dependence graph using native service functions: ARC_succs, ARC_preds, ARC_kind

69 Debugging and Validation
Trace flags: tt54:0x1, general trace of LRA; tt45:0x4, dependence graph building; tr53, Target Operations (TOPs) before LRA; tr54, TOPs after LRA

70 Evaluation
Static measurement: fat point (-tt54:0x40). Dynamic measurement: hardware counters on the R12K, read with perfex.

71 Evaluation For the MIPS R12K (SPEC95fp), the lineage-based algorithm reduces the number of loads executed by 12%, the number of stores by 14%, and the execution time by 2.5% over the baseline. It is slightly better than the algorithm in the MIPSpro compiler.

72 Case II Design and Evaluation of an Induction Pointer Prefetching Algorithm

73 Researchers Artour Stoutchinin (STMicroelectronics)
José Nelson Amaral (Univ. of Alberta) Guang R. Gao (Univ. of Delaware) Jim Dehnert (Silicon Graphics Inc.) Suneel Jain (Narus Inc.) Alban Douillet (Univ. of Delaware)

74 Motivation The important loops of many programs are pointer-chasing loops that access recursive data structures through induction pointers. Example:
max = 0; current = head;
while (current != NULL) {
  if (current->key > max) max = current->key;
  current = current->next;
}

75 Problem Statement How to identify pointer-chasing recurrences?
How to decide whether there are enough processor resources and memory bandwidth to profitably prefetch an induction pointer? How to efficiently integrate induction pointer prefetching with loop scheduling based on the profitability analysis?

76 Prefetching Costs
More instructions to issue; more memory traffic; longer code (disruption in the instruction cache); displacement of potentially good data from the cache.
Before prefetching:
t226 = lw 0x34(t228)
After prefetching:
t226 = lw 0x34(t228)
tmp = subu t226, t226s
tmp = addu tmp, tmp
tmp = addu t226, tmp
pref 0x0(tmp)
t226s = t226

77 What to Prefetch? When to Prefetch it?
A good optimizing compiler should only prefetch data that will actually be referenced. It should prefetch far enough in advance to prevent a cache miss when the reference occurs. But, not too far in advance, because the data might be evicted from the cache before it is used, or might displace data that will be referenced again.

78 Prefetch Address In order to prefetch, the compiler must calculate addresses that will be referenced in future iterations of the loop. For loops that access regular data structures, such as vectors and matrices, compilers can use static analysis of the array indexes to compute the prefetching addresses. How can we predict future values of induction pointers?

79 Key Intuition
Recursive data structures are often allocated at regular intervals. Example:
curr = head = (item) malloc(sizeof(item));
while ((curr->key = get_key()) != NULL) {
  curr->next = curr = (item) malloc(sizeof(item));
  other_memory_allocations();
}
curr->next = NULL;

80 Prefetching Technique
Example:
max = 0; current = head; tmp = current;
while (current != NULL) {
  if (current->key > max) max = current->key;
  current = current->next;
  stride = current - tmp;
  prefetch(current + stride*k);
  tmp = current;  /* remember the previous address so the stride is per iteration */
}

81 Prefetch Sequence (R10K)
In our implementation, the stride is recomputed in every iteration of the loop, making it tolerant of (infrequent) stride changes:
stride    = addr - addr.prev
stride    = stride * k
addr.pref = addr + stride
addr.prev = addr
pref addr.pref

82 Identification of Pointer-Chasing Recurrences
A surprisingly simple method works well: look in the intermediate code for recurrence circuits containing only loads with constant offsets. Examples:
node = ptr->next;         r1 <- load r2, offset_next
ptr = node->ptr;          r2 <- load r1, offset_ptr
current = current->next;  r2 <- load r1; r1 <- load r2, offset_next

83 Profitability Analysis
Goal: Balance the gains and costs of prefetching. Although we use resource estimates analogous to those done for software pipelining, we consider loop bodies with control flow. How to estimate the resources available for prefetching in a basic block B that belongs to many data dependence recurrences?

84 Software Pipelining What limits the speed of a loop?
Data dependences: recurrence initiation interval (recMII). Processor resources: resource initiation interval (resMII). Memory accesses: memory initiation interval (memMII). The initiation interval (II) must satisfy all three. [figure: modulo schedule of ldf, fadds, stf, sub, cmp, bg over time steps 1-16]

85 Data Dependences(recMII)
The recurrence minimum initiation interval (recMII) is determined by the dependence cycles of the loop. Example, with edges labeled (dist, lat):
for i = 0 to N - 1 do
  a: X[i] = X[i - 1] + R[i];
  b: Y[i] = X[i] + Z[i - 1];
  c: Z[i] = Y[i] + 1;
end;
[figure: DDG over a, b, c with edge labels (0,2) and (1,2)]
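For reference, the standard modulo-scheduling recurrence bound, written to match the (dist, lat) edge labels above (our reconstruction in conventional notation, not the slide's own rendering):

```latex
\mathrm{recMII} \;=\; \max_{c \,\in\, \mathrm{cycles}(G)}
  \left\lceil
    \frac{\sum_{e \in c} \mathrm{lat}(e)}{\sum_{e \in c} \mathrm{dist}(e)}
  \right\rceil
```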

86 The recMII for Loops with Control Flow
An instruction of a basic block B can belong to many recurrences (with distinct control paths). We define the recurrence MII of a load operation L as the maximum, over the recurrence circuits c containing L (written L ∈ c), of recMII(c). [figure: control flow graph B1-B8]

87 Processor Resources(resMII)
A basic block B may belong to multiple control paths. We define the resource constraint of a basic block B as the maximum resMII over all control paths that execute B. [figure: control flow graph B1-B8]

88 Available Memory Bandwidth
Processors with non-blocking caches can support up to k outstanding cache misses without stalling. We define the available memory bandwidth at a basic block B as k minus the maximum of m(p) over all control paths p that execute B, where m(p) is the number of expected cache misses on each control path p. [figure: control flow graph B1-B8]
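A plausible symbolic form of this definition, with m(p) as above and k the number of outstanding misses the cache supports (a reconstruction, not the slide's own formula):

```latex
\mathrm{availMB}(B) \;=\; k \;-\; \max_{p \,\ni\, B} m(p)
```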

89 Profitability Analysis
Adding prefetch code for an induction pointer L in a basic block B is profitable if both: (1) the recMII due to recurrences that contain L is greater than the resMII after prefetch insertion, and (2) there is enough memory bandwidth to service another cache miss without causing stalls.

90 Computing Available Memory Bandwidth
To compute the available memory bandwidth of a control path we need to estimate how many cache misses are expected in that control path. We use a graph coloring technique over a cache miss interference graph to predict which memory references are likely to incur a miss.

91 The Miss Interference Graph
Two memory references interfere if: 1. They are both expected to miss the cache 2. They can both be issued in the same iteration of the loop 3. They do not fall into the same cache line Miss Interference Graph assumptions: 1. Loop invariant references are cache hits (global-pointer relative, stack-pointer relative, etc). 2. Memory references on mutually exclusive control paths do not interfere. 3. References relative to the same base address interfere only if their relative offset is larger than the cache line.

92 Prefetching Algorithm
DoPrefetch(P, V, E)
 1. C <- pointer-chasing recurrences
 2. R <- prioritized list of induction pointer loads in C
 3. N <- prioritized list of other loads (not in C)
 4. O <- R + N
 5. mark each L in O as a cache miss
 6. for each L in O, L ∈ B do
 7.   if recMIIP(B) > resMIIP(B) and S(B) then
 8.     add prefetch for L to B
 9.     mark L as cache hit
10.   endif
11. endfor

93 An Example*
*mcf: minimal cost flow optimizer (Konrad-Zuse Informatics Center, Berlin)
 1 while (arcin) {
 2   tail = arcin->tail;
 3   if (tail->time + arcin->org_cost > latest) {
 4     arcin = (arc_t *)tail->mark;
 5     continue;
 6   }
 7   arc_cost = tail->potential + head_potential;
 8   if (red_cost < 0) {
 9     if (new_arcs < MAX_NEW_ARCS) {
10       insert_new_arc(arcnew, new_arcs, tail, head, arc_cost, red_cost);
11       new_arcs++;
12     } else if ((cost_t)arcnew[0].flow > red_cost)
13       replace_weaker_arc(arcnew, tail, head, arc_cost, red_cost);
14   }
15   arcin = (arc_t *)tail->mark;
16 }

94 An Example (repeat of slide 93)

95 An Example: CGIR control flow graph
B1:  1. t228 = lw 0x0(t226)
     2. t229 = lw 0x14(t226)
     3. t230 = lw 0x38(t228)
     4. t231 = addu t229, t230
     5. t232 = slt t220, 0
     6. bne B3, t232, 0
B2:  7. t226 = lw 0x34(t228)
     8. b B8
B3:  9. t234 = lw 0x2c(t228)
    10. t235 = subu t225, t234
    11. t233 = addiu t235, 0x1e
    12. bgez B7, t233
B4: 13. t236 = slt t209, t175
    14. beq B6, t236, 0
B5: insert_new_arc();
B6: replace_weaker_arc();
B7: 15. t226 = lw 0x34(t228)
B8: 16. bne B1, t226, 0

96 An Example (repeat of slide 95)

97 An Example (repeat of slide 95)

98 An Example, cont'd: before prefetch insertion
B1:  1. t228 = lw 0x0(t226)
     2. t229 = lw 0x14(t226)
     3. t230 = lw 0x38(t228)
     4. t231 = addu t229, t230
     5. t232 = slt t220, 0
     6. bne B3, t232, 0
B2:  7. t226 = lw 0x34(t228)
     8. b B10
B7: 15. t226 = lw 0x34(t228)
(B3, B4, B5, B6, B8 as on slide 95)

99 An Example, cont'd: after prefetch insertion
B1:  1. t228 = lw 0x0(t226)
     1. tmp = subu t228, t228s
     1. tmp = addu tmp, tmp
     1. tmp = addu t228, tmp
     1. pref 0x34(tmp)
     1. t228s = t228
     2. t229 = lw 0x14(t226)
     3. t230 = lw 0x38(t228)
     4. t231 = addu t229, t230
     5. t232 = slt t220, 0
     6. bne B3, t232, 0
B2:  7. t226 = lw 0x34(t228)
     7. tmp = subu t226, t226s
     7. tmp = addu tmp, tmp
     7. tmp = addu t226, tmp
     7. pref 0x0(tmp)
     7. t226s = t226
     8. b B10
B7: 15. t226 = lw 0x34(t228)
    15. tmp = subu t226, t226s
    15. tmp = addu tmp, tmp
    15. tmp = addu t226, tmp
    15. pref 0x0(tmp)
    15. t226s = t226
(remaining blocks as on slide 95)

100 When Pointer Prefetch Works

101 When Pointer Prefetch Does Not Help

102 Summary of Attributes Software-only implementation
Simple candidate identification Simple code transformation No impact on user data structures Simple profitability analysis, local to loop Performance degradations are rare, minor

103 Open Questions How often is the speculated stride correct?
Can instrumentation feedback help? How well does the speculative prefetch work with other recursive data structures: trees, graphs, etc? How well does this approach work for read/write recursive data structures?

104 Related Work (Software)
Luk-Mowry (ASPLOS-96): greedy prefetching; history-pointer prefetching; data-linearization prefetching (changes the data structure storage). Lipasti et al. (Micro-95): prefetching pointers at procedure call sites. Liu-Dimitri-Kaeli (Journal of Syst. Arch.-99): maintains a table of offsets for prefetching

105 Related Work (Hardware)
Roth-Moshovos-Sohi (ASPLOS, 1998) Gonzales-Gonzales (ICS, 1997) Mehrotra (Urbana-Champaign, 1996) Chen-Baer (Trans. Computer, 1995) Charney-Reeves (Trans. Comp., 1994) Jegou-Teman (ICS, 1993) Fu-Patel (Micro, 1992)

106 Execution Time Measurements

107 Prefetch Improvement

108 L1 Cache Misses

109 L2 Cache Misses

110 TLB Misses

111 Benchmarks gcc GNU C compiler li Lisp interpreter
mcf Minimal cost flow solver parser Syntactic parser of English twolf Place and route simulator mlp Multi-layer perceptron simulator ft Minimum spanning tree algorithm

