Software Group © 2005 IBM Corporation Compiler Technology October 17, 2005 Array privatization in IBM static compilers -- technical report CASCON 2005.

Authors
Guansong Zhang, Erik Charlebois and Roch Archambault
Compiler Development Team, IBM Toronto Lab, Canada

Overview
 Introduction and motivation
 Array data flow analysis
 Array data privatization
 Performance results
 Future work
 Possible usage

Expose limitations
 Compare SPEC2000FP and SPECOMP
 SPECOMP achieves good performance and scalability
  – Compare explicit parallelization against auto-parallelization
 Expose missed opportunities
  – 10 common benchmarks
  – Compare on a loop-by-loop basis

Improved auto-parallelization performance

Array privatization example

Basic loop parallelizer

Pre-parallelization phase
 Induction variable identification
 Scalar privatization (scalars only)
 Reduction finding
 Loop transformations favoring parallelism
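As a concrete illustration of the last two items (our example, not from the slides), reduction finding recognizes accumulations through an associative operator; such a loop can later be parallelized with per-thread partial sums and a final combine:

```c
#include <stddef.h>

/* Illustrative sketch (not the compiler's code): sum is a reduction
 * because + is associative, so the iterations can be reordered or
 * split into partial sums without changing the final result. */
int sum_reduction(const int *a, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];        /* reduction: evaluation order can be relaxed */
    return sum;
}
```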

The concept of data privatization
 Data is local to each loop iteration
Do I = 1, 10
  Temp = …
  … = … Temp …
Enddo
 Purpose: eliminate loop-carried dependences
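In C terms (a hedged sketch with our own names), privatizing the scalar amounts to giving each iteration its own copy of Temp, for instance by declaring it inside the loop body:

```c
#include <stddef.h>

/* Sketch: temp is declared inside the loop, so every iteration has a
 * private copy; the anti- and output dependences that one shared temp
 * would carry between iterations disappear, making the loop parallel. */
void scale_plus_one(const int *a, int *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int temp = a[i] * 2;    /* private to this iteration */
        b[i] = temp + 1;
    }
}
```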

The concept of data privatization (cont.)
 Array as temp data
do J = 1, 10
  do I = 1, 10
    Temp(I) = …
  end do
  do I = 1, 10
    … = … Temp(I) …
  end do
end do
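The same pattern in C (our sketch, not the paper's code): within each outer iteration temp[] is fully written before it is read, so temp can be made private per outer iteration, removing the dependence the shared buffer would otherwise carry across j.

```c
/* Sketch of the array-temp pattern: temp[] is defined before every
 * use inside one j iteration, so each j iteration may receive its own
 * private copy of temp (array privatization), and the j loop becomes
 * parallelizable. */
void use_array_temp(int out[4][4]) {
    for (int j = 0; j < 4; j++) {
        int temp[4];                 /* privatizable array temp */
        for (int i = 0; i < 4; i++)
            temp[i] = i + j;         /* definition covers all uses */
        for (int i = 0; i < 4; i++)
            out[j][i] = temp[i] * 2;
    }
}
```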

Array data flow and its structure
 Similar to scalar data flow:
  – MayDef: array elements that may be written
  – MustDef: array elements that are definitely written
  – UpExpUse: array elements that may have an upward-exposed use (a use not preceded by a definition along a path from the loop header)
  – LiveOnExit: array elements that are used after the loop region
 Guarded Array Regions (GARs)
  – A GAR is a tuple (G, D), where D is a bounded Regular Section Descriptor (RSD) for the accessed array section and G is a guard specifying the condition under which D is accessed
 Note: many papers have discussed this problem
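A minimal sketch of the GAR idea (the struct and function names are ours, not the compiler's): the guard records whether the access condition is known to hold, and intersecting two unconditional stride-1 sections reduces to overlapping their bounds.

```c
#include <stdbool.h>

/* Hypothetical GAR (G, D) with a stride-1 bounded RSD. */
typedef struct {
    bool guard;   /* G: condition under which D is accessed */
    int lb, ub;   /* D: regular section [lb, ub], stride 1 */
} GAR;

/* Returns true and stores the overlap when both sections definitely
 * intersect; returns false when they are disjoint or a guard is not
 * known to hold (the conservative answer). */
bool gar_intersect(GAR x, GAR y, GAR *out) {
    if (!x.guard || !y.guard)
        return false;                   /* condition unknown */
    int lb = x.lb > y.lb ? x.lb : y.lb;
    int ub = x.ub < y.ub ? x.ub : y.ub;
    if (lb > ub)
        return false;                   /* disjoint sections */
    out->guard = true;
    out->lb = lb;
    out->ub = ub;
    return true;
}
```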

Array privatization algorithm

Loop normalization and array data flow
 Normalized loop
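A sketch of what normalization does (our example): a loop with arbitrary lower bound and step is rewritten to run k = 0 .. trip−1 with stride 1, recovering the original index as lb + k*step, so array subscripts become affine in the normalized induction variable.

```c
#include <stddef.h>

/* Enumerate the indices the original loop (i = lb; i <= ub; i += step)
 * visits, but drive the iteration with a normalized counter k that
 * starts at 0 and has stride 1. */
size_t normalized_indices(int lb, int ub, int step, int *out) {
    size_t n = 0;
    int trip = (ub - lb) / step + 1;    /* normalized trip count */
    for (int k = 0; k < trip; k++)
        out[n++] = lb + k * step;       /* original index recovered */
    return n;
}
```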

Alias analysis and array data flow
 Ideal situation: no aliases at all
  – Otherwise, the precise intersection of the two array sections involved cannot be determined
 Aliases come from:
  – Structure members, e.g. scalar replacement
  – Function parameters, where the array is a shadow (not mapped data, aliased to any global array)
 Procedure summaries may help
  – Alias information as a fallback
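A small C illustration of why aliasing matters (our example): if the two pointer parameters overlap, a write through one can feed a later read through the other, so without alias information the compiler cannot treat the two array sections as disjoint.

```c
/* Parallel-safe only if a and b do not overlap: with overlapping
 * arguments the write to a[i] cascades into later reads of b[i],
 * and sequential semantics must be preserved. */
void add_one(int *a, const int *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 1;
}
```

For instance, on x = {1, 0, 0, 0} the overlapping call add_one(x + 1, x, 3) cascades each write into the next read, leaving x = {1, 2, 3, 4}, which differs from what disjoint sections would produce.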

Possible parallelization results

Real case

NAS MG (-O3 –qhot –q64)

Summary
 Challenges
  – Compilation time: interaction with other optimizations
  – Loop unrolling: graph complexity
  – Number of branches: array section calculation accuracy
  – Memory usage: managing and reusing data structures

Summary (cont.)
 Further improvements
  – Inter-procedural array data flow: procedure summaries giving more accurate section information instead of relying on aliases
  – Symbolic range analysis: the expression simplifier has a lot of room for improvement
  – Compilation efficiency
 Possible usage
  – Auto-parallelization
  – Array contraction
  – Array coalescing