Compiler-Assisted Dynamic Scheduling for Effective Parallelization of Loop Nests on Multi-core Processors
Muthu Baskaran, Naga Vydyanathan, Uday Bondhugula, J. Ramanujam, Atanas Rountev, P. Sadayappan


Compiler-Assisted Dynamic Scheduling for Effective Parallelization of Loop Nests on Multi-core Processors
Muthu Baskaran¹, Naga Vydyanathan¹, Uday Bondhugula¹, J. Ramanujam², Atanas Rountev¹, P. Sadayappan¹
¹ The Ohio State University   ² Louisiana State University
PPoPP 2009

Introduction
• Ubiquity of multi-core processors
• Need to utilize parallel computing power efficiently
• Automatic parallelization of regular scientific programs on multi-core systems
  ◦ Polyhedral compiler frameworks
• Support from compile-time and run-time systems required for parallel application development

Polyhedral Model

for (i=1; i<=7; i++)
  for (j=2; j<=6; j++)
S1: a[i][j] = a[j][i] + a[i][j-1];

Iteration vector: $x_{S1} = (i,\, j)^T$, bounded by $i \ge 1$, $i \le 7$, $j \ge 2$, $j \le 6$.

Affine access function for a[i][j]: $F_{a}(x_{S1}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} x_{S1}$

Statement domain (iteration-space polytope):
$D_{S1} \cdot \begin{pmatrix} i \\ j \\ 1 \end{pmatrix} \ge 0, \qquad
D_{S1} = \begin{pmatrix} 1 & 0 & -1 \\ -1 & 0 & 7 \\ 0 & 1 & -2 \\ 0 & -1 & 6 \end{pmatrix}$
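As a concrete reading of the domain above, here is a minimal sketch (not from the paper or the Pluto tool) of how $D_{S1}$ can be stored as an integer constraint matrix and queried; each row encodes one bound as an affine inequality c_i·i + c_j·j + c_0 ≥ 0.

#include <stdio.h>

#define NROWS 4
#define NCOLS 3   /* coefficients for i, j, and the constant 1 */

static const int D_S1[NROWS][NCOLS] = {
    { 1,  0, -1},   /*  i - 1 >= 0   (i >= 1) */
    {-1,  0,  7},   /* -i + 7 >= 0   (i <= 7) */
    { 0,  1, -2},   /*  j - 2 >= 0   (j >= 2) */
    { 0, -1,  6},   /* -j + 6 >= 0   (j <= 6) */
};

/* Returns 1 if (i, j) satisfies every constraint row of D_S1. */
static int in_domain(int i, int j)
{
    for (int r = 0; r < NROWS; r++)
        if (D_S1[r][0] * i + D_S1[r][1] * j + D_S1[r][2] < 0)
            return 0;
    return 1;
}

int main(void)
{
    printf("%d %d\n", in_domain(3, 4), in_domain(0, 4));  /* prints: 1 0 */
    return 0;
}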

Dependence Abstraction

for (i=1; i<=7; i++)
  for (j=2; j<=6; j++)
S1: a[i][j] = a[j][i] + a[i][j-1];

[Figure: iteration space (i, j) with flow-dependence arrows]

Dependence polytope: an instance $i_t$ of statement t depends on an instance $i_s$ of statement s if
• $i_s$ is a valid point in $D_s$
• $i_t$ is a valid point in $D_t$
• $i_s$ is executed before $i_t$
• both instances access the same memory location

h-transform: relates the target instance to the source instance involved in the last conflicting access.

$\begin{pmatrix} D_s & 0 \\ 0 & D_t \\ -I & H \end{pmatrix}
\begin{pmatrix} i_s \\ i_t \\ n \\ 1 \end{pmatrix}
\;\begin{matrix} \ge 0 \\ \ge 0 \\ = 0 \end{matrix}$

Affine Transformations
• Loop transformations defined using affine mapping functions
  ◦ Original iteration space => transformed iteration space
  ◦ A one-dimensional affine transform Φ for a statement S is given by
    $\Phi_S(x_S) = C_S \cdot \begin{pmatrix} x_S \\ n \\ 1 \end{pmatrix}$
  ◦ Φ represents a new loop in the transformed space
  ◦ A set of linearly independent affine transforms
    ▪ defines the transformed space
    ▪ defines tiling hyperplanes for tiled code generation
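To make the formula concrete, here is a small sketch (illustrative, not the paper's code) that evaluates one Φ over an iteration vector, a single structure parameter n, and a constant term; the coefficient row and the example values are assumptions for illustration.

/* Evaluate Phi_S(x_S) = C_S . (x_S, n, 1)^T at one iteration point. */
static long phi(const long *c, const long *x, long n, int dim)
{
    long v = 0;
    for (int k = 0; k < dim; k++)
        v += c[k] * x[k];      /* iterator coefficients */
    v += c[dim] * n;           /* coefficient of the structure parameter */
    v += c[dim + 1];           /* constant term */
    return v;
}
/* Example: c = {1, 1, 0, 0} gives Phi(i, j) = i + j, a typical
 * skewing / tiling hyperplane; phi(c, (long[]){3, 4}, N, 2) == 7. */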

Tiling  Tiled iteration space ◦ Higher-dimensional polytope ◦ Supernode iterators ◦ Intra-tile iterators Compiler-Assisted Dynamic Scheduling for Effective Parallelization of Loop Nests on Multicore Processors, PPoPP 2009 for (i=0; i<N; i++) x[i]=0; for (j=0; j<N; j++) S: x[i] += a[j][i] * y[j]; for ( it =0; it<=floor(N-1,32);it++) for ( jt =0; jt<=floord(N-1,32);jt++) … for ( i=max(32it,0); i<=min(32it+31,N-1); i++) for ( j=max(32jt,1); j<=min(32jt+31,N-1);j++) S: x[i] += a[j][i] * y[j];. ijN1ijN1 Ds Ds ≥ 0. it jt i j N 1 DT s ≥ 0. iT s i s n 1 0 D s T s 0 ≥ 0

PLUTO
• State-of-the-art polyhedral-model-based automatic parallelization system
• First approach to explicitly model tiling in a polyhedral transformation framework
• Finds a set of "good" affine transforms or tiling hyperplanes that address two key issues
  ◦ effective extraction of coarse-grained parallelism
  ◦ data locality optimization
• Handles sequences of imperfectly nested loops
• Uses the state-of-the-art code generator CLooG
  ◦ Takes the original statement domains and affine transforms to generate transformed code

Affine Compiler Frameworks
• Pros
  ◦ Powerful algebraic framework for abstracting dependences and transformations
  ◦ Makes automatic parallelization feasible
    ▪ Eases the burden on programmers
• Cons
  ◦ Generated parallel code may have excessive barrier synchronization due to affine schedules
    ▪ Loss of efficiency on multi-core systems due to load imbalance and poor scalability

Aim and Approach
• Can we develop an automatic parallelization approach for asynchronous, load-balanced parallel execution?
• Utilize the powerful polyhedral model
  ◦ to generate tiling hyperplanes (and tiled code)
  ◦ to derive inter-tile dependences
• Effectively schedule the tiles for parallel execution on the cores of a multi-core system
  ◦ Dynamic (run-time) scheduling

Aim and Approach
• Each tile identified by the affine framework is a "task" that is scheduled for execution
• Compile-time generation of the following code segments
  ◦ Code executed within a tile (task)
  ◦ Code to extract inter-tile (inter-task) dependences in the form of a task dependence graph (TDG)
  ◦ Code to dynamically schedule the tasks, using critical-path analysis on the TDG to prioritize tasks
• Run-time execution of the compile-time-generated code for efficient asynchronous parallelism

System Overview

[Block diagram: the input code is processed by the Pluto optimization system, which emits tiled code, parallel task code, and meta-information (dependences, transformations). The Inter-tile Dependence Extractor and TDG Code Generator derive the inter-tile (task) dependences from the tiled code and produce the TDG generation code. At run time, the Task Dependence Graph Generator builds the TDG and the Task Scheduler executes the parallel task code. Output code: TDG, task, and scheduling code.]

Task Dependence Graph Generation
• Task Dependence Graph (TDG)
  ◦ A DAG
    ▪ Vertices: tiles (tasks)
    ▪ Edges: dependences between the corresponding tasks
  ◦ Vertices and edges may be assigned weights
    ▪ Vertex weight: based on task execution cost
    ▪ Edge weight: based on communication between tasks
  ◦ Current implementation: unit weights for vertices and zero weights for edges

Task Dependence Graph Generation
• Compile-time generation of the TDG generation code
  ◦ Code to enumerate the vertices
    ▪ Scan the iteration-space polytopes of all statements in the tiled domain, projected onto the supernode iterators
    ▪ The CLooG loop generator is used to scan the polytopes
  ◦ Code to create the edges
    ▪ Requires extraction of inter-tile dependences
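To make the generated code concrete, the following is a rough sketch of what the TDG generation code could look like for the tiled matrix-vector example from the Tiling slide. In the actual system this code is emitted by CLooG from the projected polytopes; tdg_add_vertex, tdg_add_edge, and the exact dependence pattern are assumptions made for illustration, not the paper's API.

extern void tdg_add_vertex(int it, int jt);                       /* hypothetical */
extern void tdg_add_edge(int it_s, int jt_s, int it_t, int jt_t); /* hypothetical */

void generate_tdg(int N)
{
    int tmax = (N - 1) / 32;   /* floord(N-1, 32) */

    /* Vertices: scan the tile space (only supernode iterators remain). */
    for (int it = 0; it <= tmax; it++)
        for (int jt = 0; jt <= tmax; jt++)
            tdg_add_vertex(it, jt);

    /* Edges: scan the inter-tile dependence polytope, source tile
     * iterators outermost, target tile iterators innermost.  Here the
     * accumulation into x[i] makes tile (it, jt) a predecessor of
     * (it, jt+1) in the same tile row (assumed dependence pattern). */
    for (int it_s = 0; it_s <= tmax; it_s++)
        for (int jt_s = 0; jt_s <= tmax; jt_s++)
            if (jt_s + 1 <= tmax)
                tdg_add_edge(it_s, jt_s, it_s, jt_s + 1);
}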

Inter-tile Dependence Abstraction

• Dependences are expressed in the tiled domain using a higher-dimensional dependence polytope

Original dependence polytope (iteration instances):
$\begin{pmatrix} D_s & 0 \\ 0 & D_t \\ -I & H \end{pmatrix}
\begin{pmatrix} i_s \\ i_t \\ n \\ 1 \end{pmatrix}
\;\begin{matrix} \ge 0 \\ \ge 0 \\ = 0 \end{matrix}$

Lifted to the tiled domain (supernode and intra-tile iterators):
$\begin{pmatrix} D^T_s & 0 \\ 0 & D^T_t \\ 0\;\; -I & 0\;\; H \end{pmatrix}
\begin{pmatrix} iT_s \\ i_s \\ iT_t \\ i_t \\ n \\ 1 \end{pmatrix}
\;\begin{matrix} \ge 0 \\ \ge 0 \\ = 0 \end{matrix}$

After projecting out the intra-tile iterators:
$\begin{pmatrix} D'_s & 0 \\ 0 & D'_t \end{pmatrix}
\begin{pmatrix} iT'_s \\ iT'_t \\ n \\ 1 \end{pmatrix} \ge 0$

• Project out the intra-tile iterators from the system
• The resulting system characterizes the inter-tile dependences
• Scan the system using CLooG to generate a loop structure that has
  ◦ source tile iterators as the outer loops
  ◦ target tile iterators as the inner loops
• This loop structure gives the code that generates the edges of the TDG

Static Affine Scheduling

[Figure: affine (static) schedule of the triangular tile set t(0,0) through t(4,4) on cores C1 and C2, with time on the vertical axis.]
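For contrast with the dynamic scheme on the next slides, here is an OpenMP sketch of how such a barrier-separated wavefront schedule looks; the tile-space size, the triangular shape, and execute_tile are assumptions made to mirror the figure, not the code the framework actually emits.

#include <omp.h>

#define T 5   /* assumed 5 x 5 triangular tile space, as in the figure */

extern void execute_tile(int it, int jt);   /* hypothetical tile body */

/* Static affine wavefront schedule: tiles with the same value of it + jt
 * run in parallel, but the implicit barrier at the end of each "omp for"
 * separates consecutive wavefronts, so cores idle on short wavefronts. */
void affine_wavefront_schedule(void)
{
    for (int w = 0; w <= 2 * (T - 1); w++) {
        #pragma omp parallel for schedule(static)
        for (int it = 0; it <= w; it++) {
            int jt = w - it;
            if (it < T && jt < T && jt >= it)   /* keep valid tiles t(it, jt), jt >= it */
                execute_tile(it, jt);
        }
        /* implicit barrier here */
    }
}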

Dynamic Scheduling
• Scheduling strategy: critical-path analysis to prioritize the tasks in the TDG
• Priority metrics associated with vertices
  ◦ topL(v): length of the longest path from the source vertex (the vertex with no predecessors) in G to v, excluding the vertex weight of v
  ◦ bottomL(v): length of the longest path from v to the sink (the vertex with no children), including the vertex weight of v
• Tasks are prioritized based on
  ◦ the sum of their top and bottom levels, or
  ◦ just the bottom level
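A small sketch of how these two levels can be computed with one forward and one backward sweep, under the unit-vertex-weight / zero-edge-weight setting stated earlier and the additional assumption that the vertices are numbered in a topological order; illustrative only, not the paper's implementation.

/* succ[v] lists the successors of v; nsucc[v] is their count. */
void compute_levels(int n, int **succ, const int *nsucc, int *topL, int *bottomL)
{
    for (int v = 0; v < n; v++) {
        topL[v] = 0;        /* sources: no predecessors, weight of v excluded */
        bottomL[v] = 1;     /* sinks: just the weight of v itself */
    }
    /* Forward sweep: longest path from a source to w, excluding w's weight. */
    for (int v = 0; v < n; v++)
        for (int k = 0; k < nsucc[v]; k++) {
            int w = succ[v][k];
            if (topL[v] + 1 > topL[w]) topL[w] = topL[v] + 1;
        }
    /* Backward sweep: longest path from v to a sink, including v's weight. */
    for (int v = n - 1; v >= 0; v--)
        for (int k = 0; k < nsucc[v]; k++) {
            int w = succ[v][k];
            if (bottomL[w] + 1 > bottomL[v]) bottomL[v] = bottomL[w] + 1;
        }
}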

Dynamic Scheduling

[Figure: task dependence graph over the tiles t(0,0) through t(4,4).]

• Tasks are scheduled for execution based on
  ◦ completion of predecessor tasks
  ◦ bottomL(v) priority
  ◦ availability of a processor core

Affine vs. Dynamic Scheduling

[Figure: side-by-side comparison of the affine (static) schedule and the dynamic schedule of tiles t(0,0) through t(4,4) on cores C1 and C2 over time, with the task dependence graph shown between them.]

Run-time Execution

1: Execute the TDG generation code to create a DAG G
2: Calculate topL(v) and bottomL(v) for each vertex v in G, to prioritize the vertices
3: Create a priority queue PQ
4: PQ.insert(vertices with no parents in G)
5: while not all vertices in G are processed do
6:     taskid = PQ.extract()
7:     Execute taskid   // compute code
8:     Remove all outgoing edges of taskid from G
9:     PQ.insert(vertices that now have no parents in G)
10: end while
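Below is a compact C sketch of this driver under a few assumptions spelled out in the comments (array-based ready queue, bottomL as the priority, hypothetical execute_task); it runs tasks sequentially for clarity, whereas the actual runtime dispatches ready tasks to idle cores.

#include <stdlib.h>

extern void execute_task(int taskid);   /* hypothetical: the tile's compute code */

typedef struct { int id; int prio; } Ready;

void run_tdg(int n, int **succ, const int *nsucc, int *npred, const int *bottomL)
{
    Ready *pq = malloc((size_t)n * sizeof *pq);   /* simple array as a priority queue */
    int pq_len = 0, done = 0;

    for (int v = 0; v < n; v++)                    /* step 4: initially ready tasks */
        if (npred[v] == 0) pq[pq_len++] = (Ready){ v, bottomL[v] };

    while (done < n) {                             /* step 5 */
        int best = 0;                              /* step 6: extract max priority */
        for (int k = 1; k < pq_len; k++)
            if (pq[k].prio > pq[best].prio) best = k;
        int v = pq[best].id;
        pq[best] = pq[--pq_len];

        execute_task(v);                           /* step 7 */
        done++;

        for (int k = 0; k < nsucc[v]; k++) {       /* steps 8-9: newly ready successors */
            int w = succ[v][k];
            if (--npred[w] == 0)
                pq[pq_len++] = (Ready){ w, bottomL[w] };
        }
    }
    free(pq);
}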

Experiments
• Experimental setup
  ◦ A quad-core Intel Core 2 Quad Q6600 CPU
    ▪ clocked at 2.4 GHz (1066 MHz FSB)
    ▪ 8MB L2 cache (4MB shared per core pair)
    ▪ Linux kernel (x86-64)
  ◦ A dual quad-core Intel Xeon E5345 CPU
    ▪ clocked at 2.33 GHz
    ▪ each chip with an 8MB L2 cache (4MB shared per core pair)
    ▪ Linux kernel
  ◦ ICC 10.x compiler
    ▪ Options: -fast -funroll-loops (-openmp for parallelized code)

Experiments: [results figures]

Discussion
• Absolute achieved GFLOPS performance is currently lower than the machine peak by over a factor of 2
• Single-node performance is sub-optimal
  ◦ No pre-optimized, tuned kernels
    ▪ e.g., BLAS kernels such as DGEMM in the LU code
• Work in progress
  ◦ Identification of tiles where pre-optimized kernels can be substituted

Related Work
• Dongarra et al.: PLASMA (Parallel Linear Algebra for Scalable Multi-core Architectures)
  ◦ LAPACK code optimization
    ▪ Manual rewriting of LAPACK routines
    ▪ Run-time scheduling framework
• Robert van de Geijn et al.: FLAME
• Dynamic run-time parallelization [LRPD, Mitosis, etc.]
  ◦ Basic difference: dynamic scheduling of loop computations amenable to compile-time characterization of dependences
• Plethora of work on DAG scheduling

Summary
• Developed a fully automatic approach for asynchronous, load-balanced parallel execution on multi-core systems
• Basic idea
  ◦ Automatically generate tiled code along with additional helper code at compile time
  ◦ Role of the helper code at run time
    ▪ dynamically extract inter-tile data dependences
    ▪ dynamically schedule the tiles on the processor cores
• Achieved significant improvement over programs automatically parallelized using affine compiler frameworks

Thank You