Optimal Parallelogram Selection for Hierarchical Tiling
Authors: Xing Zhou, Maria J. Garzaran, David Padua (University of Illinois)
Presenter: Wei Zuo


Motivation and Background
The importance of loop tiling:
  Optimize multiply nested loops, usually the most time-consuming parts of a program
  Improve data locality
  Expose parallelism

What Is Hierarchical Tiling?
Tile the loop nest hierarchically so that the tiling levels match the organization of the target machine (a sketch follows below).
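As a minimal illustration (mine, not the paper's; the tile sizes and the summation kernel are placeholder choices), a two-level tiled traversal of a 2D iteration space in Python, where the outer tiles could map to nodes and the inner tiles to cores of an accelerator:

```python
import numpy as np

def hierarchically_tiled_sum(A, outer=64, inner=8):
    """Sum a 2D array by traversing it with two levels of square tiles.

    Level 1 (outer) could map to nodes, level 0 (inner) to cores;
    the tile sizes here are arbitrary placeholders.
    """
    n, m = A.shape
    total = 0.0
    for I in range(0, n, outer):                           # level-1 tiles
        for J in range(0, m, outer):
            for i in range(I, min(I + outer, n), inner):   # level-0 tiles
                for j in range(J, min(J + outer, m), inner):
                    block = A[i:i + inner, j:j + inner]    # innermost tile
                    total += block.sum()
    return total

A = np.arange(100 * 100, dtype=float).reshape(100, 100)
assert np.isclose(hierarchically_tiled_sum(A), A.sum())
```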

Why Hierarchical Tiling?
Hierarchy-aware optimization unleashes the potential of hierarchically organized systems.

Challenges of (Hierarchical) Tiling
Selection of tile sizes
Selection of tile shapes:
  Shapes have a significant impact on execution time
  Shapes at different levels interact with each other
  Tile shapes cannot be selected at each level separately
  A global model that considers the whole hierarchical tiling is needed

Contributions of the Paper
An automatic system for selecting tile shapes in a hierarchical system
A model that computes the execution time for a given choice of tile shapes
A proof that the problem of optimal tile shape selection is a nonlinear bi-level programming problem

Math Concepts
  Iteration space
  Tiling representation
  Dependence vectors
  Execution time

Iteration Space Representation
The iteration space I is described by its edge matrix E, whose columns are the edge vectors of the hyperparallelepiped-shaped space; the function span(E) gives the set of iterations in I. An example is sketched below.
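A small numeric sketch of the edge-matrix notation, with a made-up 4 x 3 rectangular iteration space (the variable names are mine, not the paper's): the columns of E are the edge vectors, and span(E) is the set of iterations inside the parallelepiped they span.

```python
import numpy as np
from itertools import product

# Edge matrix of a 4 x 3 rectangular iteration space: each column of E
# is one edge vector of the hyperparallelepiped (here, axis-aligned).
E = np.array([[4, 0],
              [0, 3]])

# span(E) = { E @ x : x in [0, 1)^n }; for this axis-aligned example the
# integer iterations are simply the points (i, j) with 0 <= i < 4, 0 <= j < 3.
iterations = list(product(range(4), range(3)))

# The number of iterations equals |det(E)|.
assert len(iterations) == round(abs(np.linalg.det(E)))   # 12
```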

Tiling Transformation Representation
A tiling is described by a tiling matrix T whose columns are the edge vectors of one tile. After tiling, E' is the new edge matrix and A is the transformation matrix; we have E' = A·E, where A = T^-1 is the affine transformation of tiling with shape T.
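A small numeric sketch of this relation (the tile shape below is an arbitrary example, not one from the paper): build a parallelogram tiling matrix T, invert it to get A, and apply A to the edge matrix of the original space.

```python
import numpy as np

# Original iteration space: a 12 x 12 square.
E = np.array([[12.0, 0.0],
              [0.0, 12.0]])

# Tiling matrix: columns are the edge vectors of one parallelogram tile
# (here a 4 x 4 tile skewed along the first axis -- an arbitrary example).
T = np.array([[4.0, 4.0],
              [0.0, 4.0]])

A = np.linalg.inv(T)       # affine transformation of the tiling, A = T^-1
E_prime = A @ E            # edge matrix of the tiled (coarsened) space

print(E_prime)
# Number of tiles = |det E'| = |det E| / |det T| = 144 / 16 = 9
print(abs(np.linalg.det(E_prime)))
```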

Hierarchical Tiling
Recursively tile an iteration space I, bottom-up: T_0 is the finest level and T_n corresponds to the original space.
  I_k: the iteration space at the k-th level
  T_k: the tile shape at the k-th level
  E_k: the edge matrix at the k-th level

Dependence Vectors
The dependence matrix D collects the dependence vectors. A dependence vector d = (d_0, d_1, ..., d_{n-1}) indicates that any iteration i must finish before iteration i + d.
Assume atomic computation of tiles (no computation/communication overlap).
For a tiling to be valid, it must be possible to topologically sort all tiles:
  There can be no cycles in the inter-tile dependence graph
  The hyperplanes defining the tiles must not be crossed by dependence vectors with different directions

Dependence Vectors (continued)
No cycles => each dependence vector d must be covered by the cone spanned by the extensions of the tile edge vectors t_0, ..., t_{n-1} (equivalently, all entries of T^-1·d are nonnegative).
Tiles being large => inter-tile dependences only exist between adjacent tiles.
Combining the two, the transformation yields D_k, the dependence matrix at the k-th tiling level.
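The no-cycle condition can be checked numerically: d lies in the cone spanned by the tile edges exactly when the coordinates T^-1·d are all nonnegative. A small sketch (the dependence vectors and tile shapes below are made-up examples, not the paper's benchmarks):

```python
import numpy as np

def tiling_is_valid(T, D, eps=1e-9):
    """A tiling matrix T is valid for dependence matrix D (columns = dependence
    vectors) if every dependence lies in the cone spanned by T's columns,
    i.e. all entries of T^{-1} @ D are nonnegative."""
    return bool(np.all(np.linalg.inv(T) @ D >= -eps))

# Dependences of a 1D-stencil-like example: (1, 0), (1, 1), (1, -1).
D = np.array([[1, 1, 1],
              [0, 1, -1]], dtype=float)

square = np.array([[4.0, 0.0],     # axis-aligned 4 x 4 tile
                   [0.0, 4.0]])
skewed = np.array([[4.0, 4.0],     # parallelogram tile whose cone covers (1, -1)
                   [4.0, -4.0]])

print(tiling_is_valid(square, D))  # False: (1, -1) points outside the cone
print(tiling_is_valid(skewed, D))  # True
```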

Execution Model
The sequential execution time of a loop with iteration space I is proportional to the number of iterations in I.
Considering parallelism, the ideal execution time is the minimal execution time of an iteration space I that can be achieved by any valid schedule of iterations; it is determined by L(E, D), the length of the longest path of dependent iterations in the iteration space (E: edge matrix, D: dependence matrix).
The slide's worked example and its simplified form are illustrated by the sketch below.
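As the example referenced above, here is a minimal brute-force sketch (mine, not the paper's algorithm) of L(E, D) for a small rectangular iteration space: the longest chain of iterations connected by dependence vectors, computed by memoized recursion.

```python
import numpy as np
from itertools import product
from functools import lru_cache

def longest_dependent_path(shape, D):
    """L(E, D) for a small axis-aligned iteration space of the given shape.
    D's columns are the dependence vectors; each iteration i must precede
    i + d, so L is the length of the longest chain of dependent iterations."""
    deps = [tuple(D[:, j]) for j in range(D.shape[1])]

    @lru_cache(maxsize=None)
    def chain(pt):
        best = 1
        for d in deps:
            nxt = tuple(np.add(pt, d))
            if all(0 <= c < s for c, s in zip(nxt, shape)):
                best = max(best, 1 + chain(nxt))
        return best

    return max(chain(pt) for pt in product(*[range(s) for s in shape]))

# 8 x 8 space with dependences (1, 0) and (0, 1): the longest chain walks
# across both dimensions, so L = 8 + 8 - 1 = 15 (vs. 64 sequential iterations).
D = np.array([[1, 0],
              [0, 1]])
print(longest_dependent_path((8, 8), D))   # 15
```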

The Tile Size Selection Model
  Problem statement
  The optimization formulation
  Computing the longest dependent path
  The automated framework

Tile Size Selection Model
Problem statement: select the tile shapes for hierarchical tiling. Given an n-dimensional hyperparallelepiped-shaped iteration space I and m dependence vectors, identify the tile shapes defining an l-level hierarchical tiling that minimizes the execution time of the computation, i.e., determine the sequence of tiling matrices T_0, T_1, ..., T_{l-1}.
Assumptions:
  The model considers parallelogram tile shapes
  At a given level, all non-boundary tiles have the same shape
  Tiling is an affine transformation
  Computation within a tile is atomic
  Infinite resources for parallelism

Model Formulation
Execution time per tile at the bottom level: a bottom-level tile executes its iterations sequentially (atomically), so its time is determined by the number of iterations it contains.
Recursion for upper-level tiles: each tile at the level below is considered a single iteration, so the per-iteration execution time at level k is Time(T_{k-1}).
D_k is the dependence matrix at the k-th level, with D_0 = D.
Let t_s^k be the synchronization and communication overhead of each tile at level k.
A sketch of this recursion follows below.
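A hedged sketch of the recursion (this is my reading of the model under the stated assumptions of atomic tiles and unlimited parallelism, not the paper's exact formula; all constants are placeholders): the time of a level-k tile is the critical-path length of its sub-tiles times the cost of one sub-tile plus its synchronization overhead.

```python
def tile_time(level, n_iters_bottom, L, iter_time, t_sync):
    """Recursive execution-time model (an interpretation of the slides,
    not the paper's exact formula).

    n_iters_bottom -- iterations inside one bottom-level tile (run sequentially)
    L[k]           -- longest dependent path among the sub-tiles inside one
                      level-k tile (k >= 1)
    iter_time      -- time of a single innermost iteration
    t_sync[k]      -- sync/communication overhead per sub-tile at level k
    """
    if level == 0:
        # A bottom-level tile is atomic: its iterations run sequentially.
        return n_iters_bottom * iter_time
    # Each sub-tile acts as one "iteration" costing Time(T_{k-1}) + overhead;
    # with unlimited parallelism only the critical path of sub-tiles matters.
    inner = tile_time(level - 1, n_iters_bottom, L, iter_time, t_sync)
    return L[level] * (inner + t_sync[level])

# Two-level example with made-up numbers: 64 iterations per bottom tile,
# critical path of 7 sub-tiles inside each level-1 tile.
print(tile_time(1, n_iters_bottom=64, L=[None, 7],
                iter_time=1e-8, t_sync=[None, 1e-6]))
```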

Model Formulation (optimization)
Optimization: select the tiling matrices T_1, ..., T_{l-1} to minimize the total execution time.
Constraints: the dependence (validity) constraints above.
Question: how do we compute L?

Contributions of the Paper (revisited)
An automatic system for selecting tile shapes in a hierarchical system
A model that computes the execution time for a given choice of tile shapes
A proof that the problem of optimal tile shape selection is a nonlinear bi-level programming problem
The next slides address how L is computed:
  Computing L(T_k, I_n), 0 < k < n-1
  Computing L(E, I_n)

Computation of L
Computing L(T_k, I_n): by affine transformation. To compute L(T_k, I_n), we must find the longest dependent path P = (p_0, p_1, ..., p_{L-1}); therefore L(T_k, I_n) = max{L}.

Computation of L (continued)
Computing L(E, I_n): since dependence vectors d can point in any direction, the longest dependent path does not necessarily start from the origin (0, 0, ..., 0) of the hypercube iteration space. L is therefore estimated approximately using binary optimization.

The Automatic Framework
The resulting problem is a multidimensional nonlinear optimization problem without a known analytical solution; it is solved with the derivative-free optimizer NOMAD.
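To make the framework's structure concrete, here is a rough sketch of the search loop using SciPy's derivative-free differential evolution as a stand-in for NOMAD (the paper uses NOMAD; the objective, parameterization, and constants below are simplified placeholders of my own, not the paper's model): the decision variables parameterize a parallelogram tiling matrix, invalid shapes are penalized, and the objective is a toy modeled execution time.

```python
import numpy as np
from scipy.optimize import differential_evolution

# Made-up dependences: (1, 0), (0, 1), (1, 1), columns of D.
D = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
N = 1024          # iteration space is N x N (placeholder)

def modeled_time(x):
    """Toy objective standing in for the paper's execution-time model.
    x = (b0, b1, skew): tile extents and a skew factor defining a
    parallelogram tiling matrix T.  Invalid tilings get a large penalty."""
    b0, b1, skew = x
    T = np.array([[b0, skew * b1],
                  [0.0, b1]])
    if abs(np.linalg.det(T)) < 1.0:
        return 1e12
    if np.any(np.linalg.inv(T) @ D < -1e-9):
        return 1e12                        # a dependence leaves the tile cone
    n_tiles = (N * N) / abs(np.linalg.det(T))
    t_tile = abs(np.linalg.det(T)) * 1e-8  # atomic tile: sequential iterations
    t_sync = 1e-6
    # Crude critical-path proxy: sqrt(#tiles) wavefronts of tiles.
    return np.sqrt(n_tiles) * (t_tile + t_sync)

result = differential_evolution(modeled_time,
                                bounds=[(1, 256), (1, 256), (-2, 2)],
                                seed=0, tol=1e-8)
print(np.round(result.x, 2), result.fun)
```

In this toy cost the optimizer balances per-tile work against per-tile synchronization (the optimum lands near a tile area of about 100 iterations); the real framework replaces this objective with the full hierarchical model described above.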

Experiments
Platform: the Blue Waters supercomputer
  First level: 256 nodes
  Second level: each node has an NVIDIA Tesla GPU accelerator with 2688 CUDA cores
Tiling schemes
  Schemes 1 & 2: the common tile shapes
  The hierarchical overlapped tiling method
  Note: the tile shapes include Square, Diamond, and Skewing 1 & 2
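For orientation only (these matrices are illustrative guesses at what the named 2D shape families look like, not the exact shapes evaluated in the paper):

```python
import numpy as np

b = 32  # tile extent, arbitrary

square  = np.array([[b, 0],
                    [0, b]])            # axis-aligned rectangular tile
skewed1 = np.array([[b, b],
                    [0, b]])            # parallelogram skewed along one axis
skewed2 = np.array([[b, 0],
                    [b, b]])            # parallelogram skewed along the other axis
diamond = np.array([[b, b],
                    [b, -b]])           # "diamond" tile spanned by (1, 1), (1, -1)

for name, T in [("square", square), ("skewed1", skewed1),
                ("skewed2", skewed2), ("diamond", diamond)]:
    print(name, abs(np.linalg.det(T)))  # iterations per tile
```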

Comparing the performance

Testing the Model Accuracy
The analytical model estimates execution time to within about 15%, except for 1D-Jacobi.
Reasons for the inaccuracy:
  The variation of communication time and execution time across programs
  The hardware resources for parallelism are not actually unlimited

Conclusion
An automatic system for selecting tile shapes in a hierarchical system
A model that computes the execution time for a given choice of tile shapes
A proof that the problem of optimal tile shape selection is a nonlinear bi-level programming problem
Limitations, which can be addressed in future work: the restriction to affine, regular parallelism; adding a hardware resource model; considering different metrics, e.g., power, area, ...