A Geometric Approach for Partitioning N-Dimensional Non-Rectangular Iteration Spaces. Arun Kejariwal, Paolo D’Alberto, Alexandru Nicolau, Constantine D. Polychronopoulos.

Presentation transcript:

1 A Geometric Approach for Partitioning N-Dimensional Non-Rectangular Iteration Spaces
Arun Kejariwal, Paolo D’Alberto, Alexandru Nicolau, Constantine D. Polychronopoulos
Center for Embedded Computer Systems, University of California at Irvine
Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign

2 Outline
- Introduction
- Terminology
- Motivation
- Problem statement
  - Uniform Partitioning
  - Processor Allocation
- Our Approach
- Experimental Results
- Conclusion

3 Introduction
- Scientific and numerical applications
  - Computation intensive
  - Large amounts of parallelism
- Multiprocessor systems
  - Exploit parallelism
- Expose high-level loop parallelism
  - Loop spreading
- Minimize communication overhead
- Minimize the number of processors

4 Terminology
An index point is a point (i, j) of the iteration space Γ, e.g., (1,1), (2,5), (5,5) in the rectangular space of:

    do i = 1, N
      do j = 1, N
        H(i, j)
      enddo
    enddo

[Figure: the 2-D iteration space Γ in the i-j plane, with index points (1,1), (2,5), and (5,5) marked.]
* Notation used in “Loop Transformations for Restructuring Compilers” [Banerjee’93]
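
As a tiny illustration (Python, not from the slides), the iteration space and its index points can be enumerated directly:

    # Sketch: enumerate the iteration space Gamma of the rectangular double loop.
    N = 5
    Gamma = [(i, j) for i in range(1, N + 1) for j in range(1, N + 1)]
    print(len(Gamma))                        # N*N = 25 index points
    print((2, 5) in Gamma, (5, 5) in Gamma)  # True True: the marked points above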

5 Motivating Example

    do i1 = 1, N
      do i2 = 1, i1
        do i3 = 1, N
          H(i1, i2, i3)
        enddo
      enddo
    enddo

For N = 6: the top view (i1-i2 plane) has triangular geometry; the front view (i1-i3 plane) has rectangular geometry.

6 Motivating Example: Top View (i1-i2 plane), Assume P = 3
- Contiguous partitioning (S1, S2, S3 as adjacent bands of i1): load imbalance
- Non-contiguous partitioning: perfect load balance, but multiple loops per set and loss of locality
[Figure: the triangular top view partitioned both ways into sets S1-S3.]
The sketch below quantifies the trade-off.
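
A small sketch (illustrative, not from the slides): in the triangular top view, outer-loop row i1 carries i1 inner iterations, so for N = 6 and P = 3:

    # Sketch: per-processor load for the triangular top view (row i1 costs i1).
    N, P = 6, 3
    rows = range(1, N + 1)

    # Contiguous bands {1,2}, {3,4}, {5,6}:
    contiguous = [sum(i for i in rows if (i - 1) * P // N == p) for p in range(P)]
    # Non-contiguous (folded) pairing {1,6}, {2,5}, {3,4}:
    folded = [0] * P
    for i in rows:
        folded[(min(i, N + 1 - i) - 1) % P] += i

    print(contiguous)  # [3, 7, 11] -> load imbalance
    print(folded)      # [7, 7, 7]  -> perfect balance, multiple i1-ranges per set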

7 Motivating Example: Front View (i1-i3 plane), Assume P = 3
Loop permutation-based contiguous partitioning: perfect load balance, but it requires remapping of index expressions, and finding a permutation that enables uniform partitioning is non-trivial.

8 Motivating Example: Processor Allocation during Iteration Space Partitioning
[Figure: the triangular top view (i1-i2 plane) partitioned into sets S1-S4 for P = 4 and into S1-S5 for P = 5.]

9 Previous Work
- Cyclic partitioning: false sharing
- Balanced chunk scheduling [Haghighat92]: restricted to double loops
- Canonical loop partitioning [Sakellariou96]: non-contiguous partitioning
- Communication minimization [Dion96, Koziris97]
These do not address processor allocation.

10 Our Model: A Perfectly Nested DOALL Loop over a Non-Rectangular Iteration Space

    do i1 = 1, N, s1
      do i2 = f1(i1), g1(i1), s2
        ...
          do in = f(n-1)(i1, i2, ..., i(n-1)), g(n-1)(i1, i2, ..., i(n-1)), sn
            LOOP BODY
          enddo
        ...
      enddo
    enddo

The bounds are affine functions of the outer indices:
    f_r(i1, ..., i(r-1)) = a_r0 + a_r1*i1 + ... + a_r(r-1)*i(r-1)
    g_r(i1, ..., i(r-1)) = b_r0 + b_r1*i1 + ... + b_r(r-1)*i(r-1)
with f_r ≤ g_r.
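
A minimal sketch of this model (illustrative Python, not from the paper; the bound functions shown are those of the triangular motivating example): each loop level gets a lower/upper bound depending only on the outer indices, and the nest is enumerated recursively.

    # Sketch (assumed representation): one lower/upper bound function per
    # loop level, each a function of the outer index vector iv.
    N = 6
    lower = [lambda iv: 1, lambda iv: 1,     lambda iv: 1]
    upper = [lambda iv: N, lambda iv: iv[0], lambda iv: N]  # i2 <= i1

    def enumerate_nest(level=0, iv=()):
        """Yield every index point of the perfect nest."""
        if level == len(lower):
            yield iv
            return
        for i in range(lower[level](iv), upper[level](iv) + 1):
            yield from enumerate_nest(level + 1, iv + (i,))

    print(sum(1 for _ in enumerate_nest()))  # 126 = (sum of i1 from 1..6) * N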

11 Problem Statement
Input: an N-dimensional iteration space Γ and P processors.
Output: P partitions with “uniform” load, obtained by partitioning along the outermost loop.
[Figure: a 2-D iteration space in the i-j plane, cut along the outermost loop into contiguous sets P1, P2, ..., P.]

12 Problem Statement
I. Uniform Partitioning
Given: an iteration space Γ and P processors.
Objective: find a contiguous partition with uniform load across the different processors.
II. Processor Allocation
Given: a partition with minimum execution time.
Objective: minimize the number of processors for the given partition while maintaining its performance.

13 Our Approach: Basic Idea
- Model the iteration space as a convex polytope.
- Partition the polytope into sets of equal volume; equal volumes ≡ uniform distribution of index points.
- Map each set of the partition to a different processor.

14 Our Approach, Step 1: Compute the total volume V of Γ

    do i = 1, N
      do j = 1, i
        do k = 1, j
          LOOP BODY
        enddo
      enddo
    enddo

[Figure: the tetrahedral iteration space of this nest for N = 7, in i-j-k space from (1,1,1) to (7,7,7).]
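
For this nest the exact number of index points and the continuous volume are easy to check; a quick sketch (illustrative, not the paper's code):

    from math import comb

    # Sketch: total "volume" of the tetrahedral space 1 <= k <= j <= i <= N.
    N = 7
    exact = sum(1 for i in range(1, N + 1)
                  for j in range(1, i + 1)
                  for k in range(1, j + 1))
    print(exact, comb(N + 2, 3))  # 84 84   (exact index-point count)
    print(N**3 / 6)               # 57.17   (continuous-volume approximation)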

15 Our Approach, Steps 2 and 3
Step 2: Compute the partial volume V(x) of Γ, i.e., the volume of the slice of Γ whose outermost coordinate is at most x.
Step 3: Determine the breakpoints γ_k, for 1 ≤ k ≤ P-1, so that all P sets have equal volume, i.e., V(γ_k) = k·V/P.
[Figure: the tetrahedral space (N = 7) cut at γ1 and γ2 for P = 3; each set has equal volume.]
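
A sketch of Steps 2-3 under this reading (illustrative; V(x) = x^3/6 is the continuous partial volume of the tetrahedral example): each breakpoint solves V(γ_k) = k·V/P, found here by bisection.

    # Sketch: equal-volume breakpoints along the outermost loop i for the
    # tetrahedral space, whose partial volume is V(x) = x**3 / 6.
    def partial_volume(x):
        return x**3 / 6

    def breakpoint(target, lo, hi, iters=60):
        # Bisection: find x with partial_volume(x) == target.
        for _ in range(iters):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if partial_volume(mid) < target else (lo, mid)
        return (lo + hi) / 2

    N, P = 7, 3
    V = partial_volume(N)
    gammas = [breakpoint(k * V / P, 0.0, N) for k in range(1, P)]
    print([round(g, 2) for g in gammas])  # ~[4.85, 6.11]: cuts gamma_1, gamma_2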

16 Our Approach, Step 4: Eliminate void sets
After the breakpoints are rounded to integer loop bounds, some of the P sets can be void (empty). Eliminating them:
- minimizes the number of processors;
- leaves the size of the largest set unchanged.
[Figure: for P = 5 on the triangular top view, the void set is eliminated, leaving S1-S4.]

17 Our Approach, Step 5: Determine loop bounds
Given the breakpoints γ_k, compute the outermost-loop bounds (lb_i, ub_i) of each set. For the triangular example (N = 6, breakpoints γ1, γ2, γ3):
(lb1, ub1) = (1, 3), (lb2, ub2) = (4, 4), (lb3, ub3) = (5, 5), (lb4, ub4) = (6, 6)
[Figure: the triangular top view cut at γ1, γ2, γ3 into sets S1-S4 with these bounds.]
A sketch combining Steps 4 and 5 follows.
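
A sketch covering Steps 4-5 together (illustrative; rounding breakpoints down to integers is an assumed policy): for the triangular space with V(x) = x^2/2, N = 6 and P = 4, it reproduces the bounds on the slide.

    from math import floor, sqrt

    # Sketch: integer loop bounds from equal-volume breakpoints for the
    # triangular space 1 <= i2 <= i1 <= N, where V(x) = x**2 / 2.
    N, P = 6, 4
    V = N**2 / 2
    gammas = [sqrt(2 * k * V / P) for k in range(1, P)]  # V(gamma_k) = k*V/P
    cuts = sorted({floor(g) for g in gammas})            # round down to integers

    bounds, lb = [], 1
    for ub in cuts + [N]:
        if ub >= lb:             # skip void sets (Step 4)
            bounds.append((lb, ub))
            lb = ub + 1
    print(bounds)  # [(1, 3), (4, 4), (5, 5), (6, 6)] -- matches the slide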

18 Results: Setup
- Applications: numerical packages (LINPACK etc.) and loop nests from the literature
- Platform: 4-way shared-memory multiprocessor
- Problem size: N = 1000
- VOL: our volume-based approach; CAN: canonical loop partitioning

19 Results (contd.)
Performance comparison. Highlights:
a) yields better performance
b) a generic approach
Table: number of index points in the largest set, per loop nest (L1-L4) and number of processors, for VOL vs. CAN. The flattened table is only partially recoverable: for two of the loop nests the VOL entries read 200000, 100000, 50000, 25000 and 25000, 12500, 6250, 3150 as the processor count increases, with the corresponding CAN entries given as NA.

20 Conclusions
- A geometric approach for iteration space partitioning: load balancing and processor allocation
- More general than existing techniques
Future Work: run-time partitioning
