"Parallelizing and Vectorizing Compilers" Article by: Rudolf Eigenmann and Jay Hoeflinger. Presented by: Huston Bokinsky and Ying Zhang, 25 April 2013.

Questions the article wants to answer: How does one write or rewrite code so that serial algorithms execute on parallel devices? What algorithms allow a compiler to automate the design decisions and the translation of serial code to parallel code? Classic compilers are written for serial architectures; vector and parallel architectures offer clear advantages in execution time and efficiency, but how can a compiler take advantage of them?

Parallel Programming Models: vector machines, loop parallelism, parallel threads, Single Program Multiple Data (SPMD). Loop parallelism: independent iterations are loop iterations that access separate data. It is also called fork/join parallelism because the work of successive iterations is forked off to parallel processors and then joined back. Iterations can be assigned with static scheduling or dynamic scheduling.
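As a minimal sketch (not from the article), loop parallelism can be expressed with an OpenMP parallel loop in C; the function name and arrays are illustrative. Each iteration reads and writes distinct elements, so the iterations are independent and can be forked to worker threads and joined after the loop:

/* Each iteration writes only c[i] and reads only a[i] and b[i], so no two
   iterations touch the same data and the loop can run in parallel. */
void vector_add(const double *a, const double *b, double *c, long n)
{
    #pragma omp parallel for schedule(static)   /* fork workers, join at loop end */
    for (long i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

With schedule(static) the iterations are divided among threads up front; schedule(dynamic) would instead hand out chunks of iterations as threads become free, which is the static versus dynamic scheduling distinction above.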

Program Analysis. Dependence analysis: the four types of data dependence, iteration spaces, dependence direction and distance vectors, dependence tests (exact vs. inexact, runtime tests). Interprocedural analysis. Symbolic analysis, abstract interpretation, data flow analysis.

Dependence Analysis Cont'd. A data dependence between two accesses to the same location can be one of: input dependence (READ before READ), anti-dependence (READ before WRITE), flow dependence (WRITE before READ), output dependence (WRITE before WRITE). An exact dependence test is NP-complete in general; practical dependence tests use simplifying conditions and special assumptions. These give reliable determinations in many cases, but they are nonetheless inexact by themselves.
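A small illustration (not from the article) of the four dependence kinds in C; the statements and arrays are made up for clarity, and the comments name the dependence each pair of accesses creates:

void dependence_kinds(double *a, double *b, double c, long n)
{
    double x = 0.0, y;
    for (long i = 0; i < n; i++) {
        a[i] = b[i] + 1.0;  /* S1: writes a[i]                                                */
        x    = a[i] * 2.0;  /* S2: reads a[i]  -> flow dependence S1->S2 (WRITE before READ)   */
        y    = b[i] - c;    /* S3: reads b[i]  -> input dependence S1->S3 (READ before READ)   */
        b[i] = y;           /* S4: writes b[i] -> anti-dependence S3->S4 (READ before WRITE)   */
        x    = 0.0;         /* S5: writes x    -> output dependence S2->S5 (WRITE before WRITE) */
    }
    (void)x;                /* silence unused-value warnings in this sketch */
}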

Dependence Elimination: Data Privatization and Expansion. Privatization and expansion remove anti- and output dependences by keeping one iteration from overwriting a value before another iteration has used the old value. Privatization gives each parallel iteration (or processor) its own private copy of the variable in a separate storage location. Expansion turns the variable into an array with one element per iteration, so each updated value lands in storage private to that parallelized iteration.
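A minimal sketch (not from the article) contrasting privatization and expansion for a temporary t in C; the identifiers are illustrative and texp is assumed to have n elements:

void privatize_and_expand(const double *a, const double *b,
                          double *c, double *texp, long n)
{
    double t;

    /* Original: the shared scalar t carries anti- and output dependences
       between iterations, blocking parallel execution. */
    for (long i = 0; i < n; i++) { t = a[i] + b[i]; c[i] = t * t; }

    /* Privatization: each iteration uses its own copy of the temporary,
       so no iteration can see or clobber another's value. */
    for (long i = 0; i < n; i++) { double tp = a[i] + b[i]; c[i] = tp * tp; }

    /* Expansion: the scalar becomes an array indexed by the iteration,
       placing each updated value in storage private to that iteration. */
    for (long i = 0; i < n; i++) { texp[i] = a[i] + b[i]; c[i] = texp[i] * texp[i]; }
}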

Dependence Elimination: Reductions, Inductions, Recurrences. These are code transformations that eliminate places where one iteration must wait for a value produced by a previous iteration. Generally, the dependence is eliminated by expressing the computation in a different way, one that does not refer to the previous iteration's value...

Dependence Elimination: Inductions. Induction variable substitution: find variables that are modified by each iteration of a loop in a predictable way (e.g. successive incrementation, a geometric progression, ...) and replace them with closed-form expressions in the loop index.
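A sketch (not from the article) of induction variable substitution in C; the function and array names are illustrative, and a is assumed large enough for the indices used:

void induction_substitution(double *a, long n)
{
    /* Original: k depends on its value from the previous iteration,
       which serializes the loop. */
    long k = 0;
    for (long i = 0; i < n; i++) {
        k = k + 2;
        a[k] = 0.0;
    }

    /* After substitution: k is replaced by a closed-form expression in the
       loop index, so iterations no longer depend on one another. */
    for (long i = 0; i < n; i++)
        a[2 * (i + 1)] = 0.0;
}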

Dependence Elimination: Reduction. A variation on the theme of array reduction; the typical case is an array summed into a scalar variable. The elements of the array are processed by parallel processors into partial results, which are then summed by a serial loop. A serial summing step remains, but time is saved by distributing the work on the addends across the parallel structure.
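A sketch (not from the article) of this structure in C with OpenMP, assuming OpenMP is available and that at most 64 threads are used; each thread computes a partial sum in parallel, and the partial sums are then combined by a short serial loop:

#include <omp.h>

double parallel_sum(const double *a, long n)
{
    enum { MAX_THREADS = 64 };          /* assumed upper bound on thread count */
    double partial[MAX_THREADS] = { 0.0 };
    int nthreads = 1;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        double local = 0.0;

        #pragma omp for nowait
        for (long i = 0; i < n; i++)    /* each thread adds up its share of a[] */
            local += a[i];
        partial[tid] = local;

        #pragma omp master
        nthreads = omp_get_num_threads();
    }

    double sum = 0.0;
    for (int t = 0; t < nthreads; t++)  /* serial combination of the partial sums */
        sum += partial[t];
    return sum;
}

In practice OpenMP's reduction(+:sum) clause can be used instead; it takes care of the per-thread partial sums and the final combination.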

Dependence Elimination: Recurrence. Sometimes a serial computation depends on one or more values produced in preceding iterations. If the computation can be written as a recurrence relation (in the discrete-mathematics sense), the process may be parallelized by recognizing the recurrence and replacing the iteration with an equivalent form, such as a closed-form solution, that does not carry values between iterations.
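A simple sketch (not from the article) of one such case in C: a linear recurrence replaced by its closed form. This is illustrative only; general recurrences require more powerful techniques such as parallel prefix evaluation.

#include <math.h>

void recurrence_closed_form(double *x, double x0, double r, long n)
{
    /* Serial recurrence: x[i] = r * x[i-1]; each iteration waits on the last. */
    x[0] = x0;
    for (long i = 1; i < n; i++)
        x[i] = r * x[i - 1];

    /* Closed form x[i] = x0 * r^i: iterations are now independent and
       can be computed in parallel. */
    for (long i = 0; i < n; i++)
        x[i] = x0 * pow(r, (double)i);
}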

Loop Transformations: Vector Architectures. Loop distribution: when a loop body contains several tasks with dependences among them, the loop can be split so that each task gets its own loop over the full iteration space; the iterations of each resulting single-task loop can then be spread out over a vector.
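A sketch (not from the article) of loop distribution in C; the arrays are illustrative and assumed not to overlap:

void loop_distribution(double *a, double *b, const double *c, long n)
{
    /* Original: one loop body performs two tasks; the second reads a[i-1],
       which (for i > 1) was updated by the previous iteration. */
    for (long i = 1; i < n; i++) {
        a[i] = a[i] + c[i];
        b[i] = a[i - 1] * 2.0;
    }

    /* After distribution: each statement gets its own loop over the full
       iteration space, and each resulting loop is vectorizable on its own. */
    for (long i = 1; i < n; i++)
        a[i] = a[i] + c[i];
    for (long i = 1; i < n; i++)
        b[i] = a[i - 1] * 2.0;
}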

Loop Transformations: Multiprocessors. Loop Fusion (part of the increasing-granularity series...). Loop fusion combines two or more adjacent loops into one, so each parallel iteration does more work and the fork/join overhead is paid once instead of once per loop.
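A sketch (not from the article) of loop fusion in C with OpenMP; the arrays are illustrative and assumed not to overlap:

void loop_fusion(double *a, double *b, const double *c, long n)
{
    /* Before fusion: two parallel loops, two fork/join overheads. */
    #pragma omp parallel for
    for (long i = 0; i < n; i++) a[i] = c[i] + 1.0;
    #pragma omp parallel for
    for (long i = 0; i < n; i++) b[i] = c[i] * 2.0;

    /* After fusion: one loop with a larger body, one fork/join. */
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        a[i] = c[i] + 1.0;
        b[i] = c[i] * 2.0;
    }
}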

Loop Transformations: Multiprocessors. Loop Coalescing (part of the increasing-granularity series...). Loop coalescing merges two nested loops into a single loop over the combined iteration space; note that the original indices i and j are recomputed from the coalesced index in each parallel iteration.
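A sketch (not from the article) of loop coalescing in C with OpenMP, assuming a is a row-major rows x cols array; the names are illustrative:

void loop_coalescing(double *a, long rows, long cols)
{
    /* Original nest: only the outer loop is parallel. */
    #pragma omp parallel for
    for (long i = 0; i < rows; i++)
        for (long j = 0; j < cols; j++)
            a[i * cols + j] = 0.0;

    /* Coalesced loop: a single index k runs over the whole iteration space,
       and i, j are recovered from k inside each parallel iteration. */
    #pragma omp parallel for
    for (long k = 0; k < rows * cols; k++) {
        long i = k / cols;
        long j = k % cols;
        a[i * cols + j] = 0.0;
    }
}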

Loop Transformations: Multiprocessors. Loop Interchange (part of the increasing-granularity series...). By moving an inner parallel loop to the outermost position, the fork/join overhead is incurred only once for the whole loop nest instead of once per outer iteration...
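A sketch (not from the article) of loop interchange in C with OpenMP, again assuming a row-major rows x cols array a; the names are illustrative:

void loop_interchange(double *a, long rows, long cols)
{
    /* Original: the parallel loop is innermost, so a fork/join happens
       on every iteration of the outer i loop. */
    for (long i = 0; i < rows; i++) {
        #pragma omp parallel for
        for (long j = 0; j < cols; j++)
            a[i * cols + j] += 1.0;
    }

    /* After interchange: the parallel loop is outermost, so the fork/join
       overhead is paid only once for the whole nest. */
    #pragma omp parallel for
    for (long j = 0; j < cols; j++) {
        for (long i = 0; i < rows; i++)
            a[i * cols + j] += 1.0;
    }
}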

Where does it fit in our course content? (Data dependence analysis and loop structure analysis happen late in the compilation process.) Runtime organization: how do we lay out data structures and control variables so that parallel processes can eliminate data dependences? Code generation: how do we write (or rewrite) code targeted at a serial architecture for a parallel architecture?