On the Interaction of Tiling and Automatic Parallelization. Zhelong Pan, Brian Armstrong, Hansang Bae, Rudolf Eigenmann. Purdue University, ECE. 2005.06.01.

Presentation transcript:

On the Interaction of Tiling and Automatic Parallelization. Zhelong Pan, Brian Armstrong, Hansang Bae, Rudolf Eigenmann. Purdue University, ECE.

2 Outline
- Motivation
- Tiling and Parallelism
- Tiling in concert with parallelization
- Experimental results
- Conclusion

3 Motivation
- Apply tiling in a parallelizing compiler (Polaris).
- Polaris generates parallelized programs in OpenMP; backend compilers generate the executable.
- Investigate performance on real benchmarks.

Flow: Source Code → Parallelizing Compiler → Tiled OpenMP Program → Backend Compiler → Executable (illustrated below).
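To make the flow concrete, the sketch below shows the kind of OpenMP-annotated source a parallelizing compiler emits for a simple independent loop nest. It is illustrative only, not actual Polaris output; the subroutine name and the loop body are placeholders.

    ! Illustrative only: the kind of OpenMP-annotated code a source-to-source
    ! parallelizer emits for a simple independent loop nest (not Polaris output).
    SUBROUTINE SCALE_ARRAY(A, N)
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: N
      REAL, INTENT(INOUT) :: A(N,N)
      INTEGER :: I, J
    !$OMP PARALLEL DO PRIVATE(I)
      DO J = 1, N
        DO I = 1, N
          A(I,J) = 2.0 * A(I,J)
        END DO
      END DO
    !$OMP END PARALLEL DO
    END SUBROUTINE SCALE_ARRAY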

4 Issues
- Tiling interacts with parallelization passes (data dependence test, induction, reduction, …).
- Load balancing is necessary.
- Parallelism and locality must be traded off against each other.

5 Outline
- Motivation
- Tiling and Parallelism
- Tiling in concert with parallelization
- Experimental results
- Conclusion

6 Tiling

Tiling is performed in two steps: loop strip-mining, which splits loop Li into a cross-strip loop Li' and an in-strip loop Li'', followed by loop permutation, which moves the cross-strip loops outward. (A compilable version of (b) follows this slide.)

(a) Matrix Multiply
    DO I = 1, N
      DO K = 1, N
        DO J = 1, N
          Z(J,I) = Z(J,I) + X(K,I) * Y(J,K)

(b) Tiled Matrix Multiply
    DO K2 = 1, N, B
      DO J2 = 1, N, B
        DO I = 1, N
          DO K1 = K2, MIN(K2+B-1,N)
            DO J1 = J2, MIN(J2+B-1,N)
              Z(J1,I) = Z(J1,I) + X(K1,I) * Y(J1,K1)
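A minimal compilable sketch of the tiled matrix multiply in (b), assuming single-precision arrays and a tile size passed in as an argument; the subroutine name and these choices are illustrative, not part of the slide.

    ! Compilable sketch of the tiled matrix multiply in (b). The tile size B
    ! is passed in as an argument; its value is an illustrative choice.
    SUBROUTINE TILED_MATMUL(Z, X, Y, N, B)
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: N, B
      REAL, INTENT(INOUT) :: Z(N,N)
      REAL, INTENT(IN)    :: X(N,N), Y(N,N)
      INTEGER :: I, K1, J1, K2, J2
      DO K2 = 1, N, B                       ! cross-strip loop of K
        DO J2 = 1, N, B                     ! cross-strip loop of J
          DO I = 1, N
            DO K1 = K2, MIN(K2+B-1, N)      ! in-strip loop of K
              DO J1 = J2, MIN(J2+B-1, N)    ! in-strip loop of J
                Z(J1,I) = Z(J1,I) + X(K1,I) * Y(J1,K1)
              END DO
            END DO
          END DO
        END DO
      END DO
    END SUBROUTINE TILED_MATMUL

Different values of I and J1 update disjoint elements of Z, while the K loops accumulate into the same element; the following slides examine how tiling affects which of these loops may run in parallel.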

7 Possible Approaches
- Tiling before parallelization: possible performance degradation.
- Tiling after parallelization: possible wrong result.
- Our approach: tiling in concert with parallelization.

8 Direction Vector after Strip-mining

Lemma. Strip-mining may create more direction vectors: an original "=" direction becomes "==", and an original "<" direction becomes "=<" or "<*", where the first position refers to the cross-strip loop and the second to the in-strip loop. For an original "<" dependence, "=<" is an in-strip dependence, while "<<", "<=" and "<>" are cross-strip dependences.

9 Parallelism after Tiling

Theorem. After tiling, the in-strip loops have the same parallelism as the original loops, but some cross-strip loops may change to serial: a "<" direction makes the corresponding cross-strip loop serial.

Original loop nest (L1 serial, L2 parallel):
    DO L1 = 1, 4
      DO L2 = 1, 4
        A(L1,L2) = A(L1+1,L2+1)

Tiled loop nest (LC1 serial, LC2 serial, LI1 serial, LI2 parallel):
    DO LC1 = 1, 4, 2
      DO LC2 = 1, 4, 2
        DO LI1 = LC1, LC1+1, 1
          DO LI2 = LC2, LC2+1, 1
            A(LI1,LI2) = A(LI1+1,LI2+1)

The cross-strip loop LC2 is serial even though the original loop L2 was parallel, so tiling after parallelization is unsafe. (A runnable version of this example follows.)
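A minimal runnable sketch of the tiled example above, with the OpenMP directive on the only loop that remains parallel. The 5x5 array bound is an assumption made so that the reference A(LI1+1,LI2+1) stays in bounds; it is not part of the slide.

    ! Runnable sketch of the tiled example. A is declared 5x5 so that
    ! A(LI1+1,LI2+1) stays in bounds (an assumption of this sketch).
    ! Only the in-strip loop LI2 keeps the parallelism of L2.
    PROGRAM TILED_DEP_EXAMPLE
      IMPLICIT NONE
      INTEGER :: LC1, LC2, LI1, LI2
      REAL :: A(5,5)
      A = 1.0
      DO LC1 = 1, 4, 2                ! cross-strip loop of L1: serial
        DO LC2 = 1, 4, 2              ! cross-strip loop of L2: serial after
                                      ! tiling (the dependence crosses tiles)
          DO LI1 = LC1, LC1+1         ! in-strip loop of L1: serial, like L1
    !$OMP PARALLEL DO
            DO LI2 = LC2, LC2+1       ! in-strip loop of L2: parallel, like L2
              A(LI1,LI2) = A(LI1+1,LI2+1)
            END DO
    !$OMP END PARALLEL DO
          END DO
        END DO
      END DO
      PRINT *, 'A(1,1) =', A(1,1)
    END PROGRAM TILED_DEP_EXAMPLE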

10 Outline
- Motivation
- Tiling and Parallelism
- Tiling in concert with parallelization
- Experimental results
- Conclusion

11 Trading off Parallelism and Locality
- Enhancing locality may reduce parallelism.
- Tiling may change the fork-join overhead (S denotes a serial loop, P a parallel loop, listed from the outermost level):
  - [SP] → [SSP]: increases fork-join overhead.
  - [SP] → [PSP]: decreases fork-join overhead.
  - [PS] → [SPS]: increases fork-join overhead.
  - [SS] → [SSS]: no change in fork-join overhead.
  - [PP] → [PPP]: no change in fork-join overhead.

Example [SP] loop nest, with the serial J loop outside the parallel I loop (see the sketch below):
    DO J = 1, N
      DO I = 1, N
        A(I,J) = A(I,J+1)
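A hedged sketch of the [SP] → [PSP] case for the example above: the I loop is strip-mined and its parallel cross-strip loop is permuted outermost, so the nest forks and joins once instead of N times. The subroutine name, the tile size argument B, and the A(N,N+1) declaration (which keeps A(I,J+1) in range) are assumptions made for this sketch.

    ! Sketch of the [SP] -> [PSP] case. Before tiling, the parallel I loop is
    ! inside the serial J loop, so a parallel region is forked and joined N
    ! times. After tiling the I loop and moving its cross-strip loop outward,
    ! a single fork-join covers the whole nest.
    SUBROUTINE SHIFT_TILED(A, N, B)
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: N, B
      REAL, INTENT(INOUT) :: A(N, N+1)
      INTEGER :: I, J, I2
    !$OMP PARALLEL DO PRIVATE(J, I)
      DO I2 = 1, N, B                 ! parallel cross-strip loop of I
        DO J = 1, N                   ! serial: carries the A(I,J) <- A(I,J+1) dependence
          DO I = I2, MIN(I2+B-1, N)   ! in-strip loop of I (parallelism unchanged)
            A(I,J) = A(I,J+1)
          END DO
        END DO
      END DO
    !$OMP END PARALLEL DO
    END SUBROUTINE SHIFT_TILED

Each I2 chunk touches a disjoint range of rows of A, so the outer parallel loop introduces no races, while the J dependence is still executed in order within each thread.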

12 Tile Size Selection

The data referenced in a tile should be close to the cache size. With Ref_T the memory referenced in a tile, CS the cache size, and P the number of processors:
- (a) Loop is sequential: Ref_T = CS
- (b) Cross-strip loop is parallel, distributed caches: Ref_T = CS
- (c) In-strip loop is parallel, distributed caches: Ref_T = CS * P
- (d) Cross-strip loop is parallel, shared cache: Ref_T = CS / P
- (e) In-strip loop is parallel, shared cache: Ref_T = CS
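A minimal sketch that encodes the cases above as a selector function; the integer case encoding and the function itself are illustrative assumptions, not part of the compiler.

    ! Target data footprint per tile, following cases (a)-(e) above.
    ! The integer case encoding is an assumption of this sketch.
    FUNCTION TARGET_REFS(CASE_ID, CS, P) RESULT(REF_T)
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: CASE_ID   ! 1=(a), 2=(b), 3=(c), 4=(d), 5=(e)
      REAL, INTENT(IN)    :: CS        ! cache size
      INTEGER, INTENT(IN) :: P         ! number of processors
      REAL :: REF_T
      SELECT CASE (CASE_ID)
      CASE (4)              ! (d) cross-strip loop parallel, shared cache
        REF_T = CS / P
      CASE (3)              ! (c) in-strip loop parallel, distributed caches
        REF_T = CS * P
      CASE DEFAULT          ! (a), (b), (e)
        REF_T = CS
      END SELECT
    END FUNCTION TARGET_REFS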

13 Load Balancing

Balance the parallel cross-strip loop (a sketch of one balancing computation follows).

(a) Before tiling (balanced):
    DO I = 1, 512
      DO J = 1, 512

(b) After tiling with tile size 80 (not balanced):
    DO J1 = 1, 512, 80
      DO I = 1, 512
        DO J = J1, MIN(J1+79,512)

With S the balanced tile size, T the tile size chosen by LRW, P the number of processors, and I the number of iterations:
    T = 80, P = 4, I = 512 → S = 64
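One way to compute a balanced tile size that reproduces the numbers above is to shrink the LRW tile size just enough that the number of tiles becomes a multiple of the processor count. The formula below is an assumption of this sketch, not necessarily the paper's exact computation.

    ! One possible balanced tile size computation, reproducing the slide's
    ! numbers (I = 512, T = 80, P = 4 gives S = 64). The exact formula is an
    ! assumption of this sketch.
    FUNCTION BALANCED_TILE_SIZE(I, T, P) RESULT(S)
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: I   ! number of iterations
      INTEGER, INTENT(IN) :: T   ! tile size chosen by LRW
      INTEGER, INTENT(IN) :: P   ! number of processors
      INTEGER :: S, CHUNKS
      ! Smallest number of tiles per processor such that tiles stay no larger than T.
      CHUNKS = (I + T*P - 1) / (T*P)        ! ceiling(I / (T*P))
      ! Shrink the tile so that P*CHUNKS equal-sized tiles cover all I iterations.
      S = (I + P*CHUNKS - 1) / (P*CHUNKS)   ! ceiling(I / (P*CHUNKS))
    END FUNCTION BALANCED_TILE_SIZE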

14 Impact on parallelization passes
- Tiling does not change the loop body, so it has limited effect on the parallelization passes:
  - Induction variable substitution
  - Privatization
  - Reduction variable recognition

15 Tiling in Concert with Parallelization
- Find the best tiled version, favoring parallelism first and then locality.
- Compute the tile size based on parallelism and the cache configuration.
- Tune the tile size to balance the load.
- Update reduction/private variable attributes.
- Generate two versions if the iteration count I is unknown at compile time (see the sketch below):
  - The original parallel version is used when I is small.
  - Otherwise, the tiled version is used.
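A hedged sketch of two-version code generation for the earlier shift example. The THRESHOLD parameter and the structure of the runtime test are placeholders, not the compiler's actual test.

    ! Two-version code generation sketch: choose between the original
    ! parallel version and the tiled version at run time.
    SUBROUTINE SHIFT_TWO_VERSION(A, N, B, THRESHOLD)
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: N, B, THRESHOLD
      REAL, INTENT(INOUT) :: A(N, N+1)
      INTEGER :: I, J, I2
      IF (N < THRESHOLD) THEN
        ! Small iteration count: use the original parallel version.
        DO J = 1, N
    !$OMP PARALLEL DO
          DO I = 1, N
            A(I,J) = A(I,J+1)
          END DO
    !$OMP END PARALLEL DO
        END DO
      ELSE
        ! Otherwise: use the tiled version with the parallel cross-strip
        ! loop outermost.
    !$OMP PARALLEL DO PRIVATE(J, I)
        DO I2 = 1, N, B
          DO J = 1, N
            DO I = I2, MIN(I2+B-1, N)
              A(I,J) = A(I,J+1)
            END DO
          END DO
        END DO
    !$OMP END PARALLEL DO
      END IF
    END SUBROUTINE SHIFT_TWO_VERSION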

16 Outline
- Motivation
- Tiling and Parallelism
- Tiling in concert with parallelization
- Experimental results
- Conclusion

17 Results on SPEC CPU 95

18 Results on SPEC CPU 2000

19 On the performance bound

Percentage of tilable loops, based on reuse, for each benchmark:
    APPLU     97.60%
    APSI      19.50%
    FPPPP      5.80%
    HYDRO2D   53.70%
    MGRID     86.40%
    SU2COR    14.90%
    SWIM      60.10%
    TOMCATV   95.90%
    TURB3D    22.20%
    WAVE      19.70%

20 Conclusion
- Tiling and parallelism
- Tiling in concert with parallelization
- Comprehensive evaluation