AlphaZ: A System for Design Space Exploration in the Polyhedral Model
Tomofumi Yuki, Gautam Gupta, DaeGon Kim, Tanveer Pathan, and Sanjay Rajopadhye
Polyhedral Compilation
The polyhedral model is now a well-established approach for automatic parallelization
Based on a mathematical formalism
Works well for regular/dense computation
Many tools and compilers: PIPS, PLuTo, MMAlpha, RStream, GRAPHITE (GCC), Polly (LLVM), ...
Design Space (still a subset)
Space-time mapping + tiling: schedule + parallel loops; the primary focus of existing tools
Memory allocation: most tools for general-purpose processors do not modify the original allocation; complex interaction with the space-time mapping
Higher-level optimizations: reduction detection; simplifying reductions (complexity reduction)
Note that this is still a subset of the full design space; choosing the schedule and the memory allocation all at once is known to be difficult (cf. Thies et al.)
AlphaZ: Tool for Exploration
Provides a collection of analyses, transformations, and code generators
Unique features: memory allocation, reductions
Can also be used as a push-button system, e.g., parallelization à la PLuTo is possible, but that is not our current focus
This Paper: Case Studies
adi.c from PolyBench: re-considering the memory allocation allows the program to be fully tiled; outperforms PLuTo, which only tiles the inner loops
UNAfold (RNA folding application): complexity reduction from O(n^4) to O(n^3); application of the transformations is fully automatic
This Talk: Focus on Memory
Tiling requires more memory, e.g., for the Smith-Waterman dependence pattern
[Figure: sequential vs. tiled execution order]
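To make the memory/tiling trade-off concrete, here is a minimal sketch of a Smith-Waterman-style recurrence. The scoring function is a toy stand-in (the names `toy_sw` and `max3` are illustrative, not from the paper): in the plain row-by-row order only the previous row must stay live, so O(n) memory suffices, whereas a tiled execution order must additionally keep values live along tile boundaries.

```c
#include <string.h>

enum { N = 5 };

/* Toy recurrence with the Smith-Waterman dependence pattern:
 *   H[i][j] = max(H[i-1][j], H[i][j-1], H[i-1][j-1]) + 1
 * with zero boundaries (the real algorithm uses a scoring function). */
static int max3(int a, int b, int c) {
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Computes H[N-1][N-1] row by row, keeping only two rows live:
 * the sequential order needs O(n) memory, not the full O(n^2) table. */
int toy_sw(void) {
    int prev[N] = {0}, curr[N] = {0};
    for (int i = 1; i < N; i++) {
        curr[0] = 0;
        for (int j = 1; j < N; j++)
            curr[j] = max3(prev[j], curr[j - 1], prev[j - 1]) + 1;
        memcpy(prev, curr, sizeof prev);
    }
    return prev[N - 1];
}
```

A tiled schedule of the same recurrence cannot discard `prev` this aggressively: every tile boundary column and row must remain live until the neighboring tile executes, which is the extra memory the slide refers to.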
ADI-like Computation
Updates a 2D grid within an outer time loop
PLuTo only tiles the inner two dimensions, due to a memory-based dependence
With an extra scalar, the computation becomes tilable in all three dimensions
Note: the PolyBench implementation has a bug and does not correctly implement ADI; true ADI is not tilable in all dimensions, and the extra memory required is more than a scalar because of the other statements. We discovered this after submitting the abstract, and still use the code as-is since the story does not change.
adi.c: Original Allocation

for (t = 0; t < tsteps; t++) {
  for (i1 = 0; i1 < n; i1++) {
    for (i2 = 0; i2 < n; i2++)
      X[i1][i2] = foo(X[i1][i2], X[i1][i2-1], ...);
    ...
    for (i2 = n-1; i2 >= 1; i2--)
      X[i1][i2] = bar(X[i1][i2], X[i1][i2-1], ...);
  }
}

Not tilable because of the reversed loop: there is a memory-based dependence (i1,i2 -> i1,i2+1), and tiling requires all dependences to be non-negative. The tilability problem thus amounts to being able to (re-)reverse the second i2 loop.
adi.c: Original Allocation (statement view)

for (i2 = 0; i2 < n; i2++)
S1:   X[i1][i2] = foo(X[i1][i2], X[i1][i2-1], ...);
...
for (i2 = n-1; i2 >= 1; i2--)
S2:   X[i1][i2] = bar(X[i1][i2], X[i1][i2-1], ...);

[Figure: S1 and S2 both reading and writing row X[i1]]
adi.c: With Extra Memory

for (i2 = 0; i2 < n; i2++)
S1:   X[i1][i2] = foo(X[i1][i2], X[i1][i2-1], ...);
...
for (i2 = 1; i2 < n; i2++)
S2:   X'[i1][i2] = bar(X[i1][i2], X[i1][i2-1], ...);

Once the two loops run in the same direction, the value of X only needs to be preserved for one iteration of i2: we do not need a full array X', just a scalar. (The full 2D array X' is introduced only for simplicity of presentation.)
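The scalar version can be sketched on a 1D slice (the helper names `combine`, `update_reversed`, and `update_forward` are hypothetical stand-ins, not from adi.c): the in-place loop must run in reverse because a forward pass would overwrite X[i2-1] before it is read, but one extra scalar restores the forward, tilable order.

```c
enum { M = 6 };

/* Hypothetical stand-in for bar(). */
static int combine(int a, int b) { return a + b; }

/* Original order: must run in reverse, because a forward pass would
 * destroy X[i2-1] before the next iteration reads it (the
 * (i1,i2 -> i1,i2+1) memory-based dependence). */
void update_reversed(int X[M]) {
    for (int i2 = M - 1; i2 >= 1; i2--)
        X[i2] = combine(X[i2], X[i2 - 1]);
}

/* Rewritten order: one extra scalar carries the not-yet-overwritten
 * previous value, so the loop runs forward and can be tiled. */
void update_forward(int X[M]) {
    int prev = X[0];
    for (int i2 = 1; i2 < M; i2++) {
        int saved = X[i2];           /* old value, needed next iteration */
        X[i2] = combine(X[i2], prev);
        prev = saved;
    }
}
```

Both orders compute the same values; the rewrite trades one scalar of storage for a non-negative dependence, which is exactly what full tiling needs.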
adi.c: Performance
adi.c compiled through PLuTo with --tile --parallel (and --noprevector for the Cray)
adi.ab: extracted from adi.c, with full tiling and a simple memory allocation (the experiment actually uses 2N^2 extra memory)
Backend compilers: gcc -O3 and craycc -O3
PLuTo does not scale because the outer loop is not tiled
UNAfold
UNAfold [Markham and Zuker 2008]: an RNA secondary structure prediction algorithm
An O(n^3) algorithm was known [Lyngso et al. 1999], but it is too complicated to implement and a "good enough" workaround exists
With AlphaZ: systematically transform the O(n^4) version into O(n^3); most of the process can be automated
UNAfold: Optimization
Key: Simplifying Reductions [POPL 2006], which finds "hidden scans" in reductions; a rare case where the compiler can reduce asymptotic complexity
Almost automatic: the O(n^4) section must first be separated out (many boundary cases), and a function must be inlined to expose reuse
Transformations to perform the above are available; no manual modification of the code is required
UNAfold: Performance
original: unmodified mfold, compiled with gcc -O3
simplified: replaces one function (fill_matrices_1) with AlphaZ-generated code; the memory allocation and schedule were found in about five minutes and are not heavily optimized; no loop transformation was performed
The complexity reduction is empirically confirmed
AlphaZ System Overview
Target Mapping: specifies the schedule, memory allocation, etc.
[Figure: front-ends (Alpha, C) feed a polyhedral representation, on which analyses and transformations operate; code generators emit C, C+OpenMP, C+CUDA, or C+MPI, guided by the Target Mapping]
The main point: the user interacts with the system and has full control over the operations. Not visible in the figure, but important: the user can specify the Target Mapping directly. The adi example took the C-to-Alpha front-end path.
Human-in-the-Loop
Automatic parallelization is the "holy grail" goal, but current automatic tools are restrictive: a strategy that works well is "hard-coded", and it is difficult to pass in domain-specific knowledge
Human-in-the-loop instead provides full control to the user, helps in finding new "good" strategies, and lets domain-specific knowledge guide the transformations
Conclusions
There are more strategies worth exploring; some may currently be difficult to automate
Case studies: adi.c (memory), UNAfold (reductions)
AlphaZ: a tool for trying out new ideas
Future work: higher-level abstractions, e.g., semantic tiling and Galois-like abstractions
Acknowledgements
AlphaZ developers/users
Members of Mélange at CSU
Members of CAIRN at IRISA, Rennes
Dave Wonnacott at Haverford College and his students
Key: Simplifying Reductions
Simplifying Reductions [POPL 2006] finds "hidden scans" in reductions; a rare case where the compiler can reduce asymptotic complexity
Main idea: a reduction re-evaluated from scratch at each point costs O(n^2), but reusing the previous partial result rewrites it as an O(n) scan
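The O(n^2) to O(n) idea can be illustrated with prefix sums (a minimal sketch; the function names `prefix_naive` and `prefix_scan` are illustrative, and the real transformation operates on Alpha reduction expressions rather than loops):

```c
/* Y[i] = sum over j <= i of X[j].
 * Evaluated from scratch for every i: O(n^2) work. */
void prefix_naive(const int X[], int Y[], int n) {
    for (int i = 0; i < n; i++) {
        Y[i] = 0;
        for (int j = 0; j <= i; j++)
            Y[i] += X[j];
    }
}

/* The hidden scan: Y[i] = Y[i-1] + X[i] reuses the previous
 * partial result, so the same values cost only O(n) work. */
void prefix_scan(const int X[], int Y[], int n) {
    Y[0] = X[0];
    for (int i = 1; i < n; i++)
        Y[i] = Y[i - 1] + X[i];
}
```

Simplifying Reductions detects this kind of shared work between overlapping reduction bodies automatically, which is what drops UNAfold from O(n^4) to O(n^3).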