Software Group © 2005 IBM Corporation | Compilation Technology
Controlling parallelization in the IBM XL Fortran and C/C++ parallelizing compilers
Priya Unnikrishnan, IBM Toronto Lab, CASCON 2005

Overview • Parallelization in IBM XL compilers • Outlining • Automatic parallelization • Cost analysis • Controlled parallelization • Future work

Parallelization • The IBM XL compilers support Fortran 77/90/95, C, and C++ • They implement both OpenMP and automatic parallelization • Both target SMP (shared-memory parallel) machines • Non-threadsafe code is generated by default – use an _r invocation (xlf_r, xlc_r, …) to generate threadsafe code

Parallelization options
-qsmp=noopt : Parallelizes code with minimal optimization, to allow for better debugging of OpenMP applications
-qsmp=omp : Parallelizes code containing OpenMP directives
-qsmp=auto : Automatically parallelizes loops
-qsmp=noauto : No auto-parallelization; processes IBM and OpenMP parallel directives
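For example, a thread-safe build with auto-parallelization enabled might be invoked as follows (only -qsmp and the _r driver come from these slides; the file name and output name are illustrative):
    xlc_r -qsmp=auto source.c -o program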

Outlining • Parallelization transformation

Outlining
Original code:
    int main() {
      #pragma omp parallel for
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ……
      }
    }
After outlining (simplified): main initializes the SMP runtime and issues a runtime call that invokes the outlined routine; the loop body moves into that routine:
    int main() {
      long … = _xlsmpInitializeRTE();
      if (n > 0) {
        /* runtime call invoking the outlined routine */
      }
      return …;
    }
    /* outlined routine */
    void … {
      unsigned CIV1 = 0;
      do {
        a[… + CIV1] = const;
        CIV1 = CIV1 + 1;
      } while (CIV1 < …);
      return;
    }

SMP parallel runtime • The outlined function is parameterized – it can be invoked for different ranges of the iteration space
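A minimal sketch of this idea, using made-up names (outlined_loop, env_t, run_chunks) rather than the XL compiler's actual generated code or runtime interface: the outlined routine receives its iteration bounds as parameters, so the runtime can hand each thread a different slice of the iteration space.

    /* shared variables the outlined routine needs, packaged by the caller */
    typedef struct { int *a; int n; int value; } env_t;

    /* hypothetical outlined routine: executes only iterations [lower, upper) */
    static void outlined_loop(env_t *env, long lower, long upper)
    {
        for (long i = lower; i < upper; i++)
            env->a[i] = env->value;
    }

    /* hypothetical driver standing in for the SMP runtime:
       splits the iteration space into one chunk per thread */
    static void run_chunks(env_t *env, int nthreads)
    {
        long n = env->n;
        long chunk = (n + nthreads - 1) / nthreads;
        for (int t = 0; t < nthreads; t++) {
            long lo = (long)t * chunk;
            long hi = (lo + chunk < n) ? lo + chunk : n;
            if (lo < hi)
                outlined_loop(env, lo, hi);  /* in the real runtime each call runs on a worker thread */
        }
    }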

Auto-parallelization • Integrated framework for OpenMP and auto-parallelization • Auto-parallelization is restricted to loops • Auto-parallelization is done at the link step when possible • This allows various interprocedural analyses and optimizations to run before automatic parallelization
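As a hedged illustration of what a link-step build could look like (the slides only say the analysis happens at the link step; -qipa is the usual XL option for whole-program interprocedural analysis, and the file names here are made up):
    xlc_r -qsmp=auto -qipa file1.c file2.c -o program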

Auto-parallelization transformation
Original code:
    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ……
      }
    }
The compiler marks the loop it can parallelize, and the marked loop then goes through the same outlining transformation:
    int main() {
      #auto-parallel-loop
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ……
      }
    }
    → Outlining

We can auto-parallelize OpenMP applications, skipping user-parallel code – a good thing!
Original code:
    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ……
      }
      #pragma omp parallel for
      for (int j = 0; j < n; j++) {
        b[j] = a[i];
      }
    }
After auto-parallelization (the user's OpenMP loop is left untouched), followed by outlining:
    int main() {
      #auto-parallel-loop
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ……
      }
      #pragma omp parallel for
      for (int j = 0; j < n; j++) {
        b[j] = a[i];
      }
    }

Pre-parallelization phase • Loop normalization (normalize countable loops) • Scalar privatization • Array privatization • Reduction variable analysis • Loop interchange (where it helps parallelization)
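As an illustration only (this loop is not from the slides), the kind of loop these transformations expose as parallel: t is written before it is read in every iteration, so scalar privatization gives each thread its own copy, and sum matches a reduction pattern, so each thread can accumulate a partial sum that is combined afterwards.

    /* t: privatizable scalar; sum: reduction variable */
    double dot_plus(const double *a, const double *b, double *c, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double t = a[i] * b[i];   /* defined before use in each iteration */
            c[i] = t + 1.0;
            sum += t;                 /* partial sums combined at loop exit */
        }
        return sum;
    }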

Cost analysis • Automatic parallelization tests: – Dependence analysis: is it safe to parallelize? – Cost analysis: is it worthwhile to parallelize? • Cost analysis estimates the total workload of the loop • LoopCost = IterationCount * ExecTimeOfLoopBody • When the cost is known at compile time, the decision is trivial • Runtime cost analysis is more complex
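With made-up numbers for illustration: a loop of 1,000,000 iterations whose body is estimated at 20 cycles has LoopCost = 20,000,000 and would comfortably justify the overhead of running in parallel, while a 100-iteration loop with the same body has LoopCost = 2,000 and is better left serial (the actual threshold used by the compiler is not given here).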

Conditional Parallelization
Original code:
    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ……
      }
    }
The generated code adds a runtime check: the outlined loop runs in parallel only when its estimated cost exceeds a threshold, otherwise it runs serially:
    int main() {
      long … = _xlsmpInitializeRTE();
      if (n > 0) {
        if (loop_cost > threshold) {
          /* runtime call invoking the outlined routine in parallel */
        } else {
          /* run the loop serially */
        }
      }
      return …;
    }
    /* outlined routine as before */

Runtime cost analysis challenges • Runtime checks should be: – Lightweight: they should not introduce large overhead in applications that are mostly serial – Overflow-safe: overflow leads to an incorrect decision, which is costly! loopcost = (((c1*n1) + (c2*n2) + const) * n3) * … – Restricted to integer operations – Accurate • All of these factors must be balanced
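One way to keep the integer-only computation overflow-safe is to saturate instead of wrapping, so the comparison against the threshold stays valid; the helpers below are an illustrative sketch, not the XL runtime's actual check.

    #include <limits.h>

    /* multiply, clamping to ULONG_MAX instead of overflowing */
    static unsigned long sat_mul(unsigned long a, unsigned long b)
    {
        if (b != 0 && a > ULONG_MAX / b)
            return ULONG_MAX;
        return a * b;
    }

    /* add, clamping to ULONG_MAX instead of overflowing */
    static unsigned long sat_add(unsigned long a, unsigned long b)
    {
        return (a > ULONG_MAX - b) ? ULONG_MAX : a + b;
    }

    /* loop_cost = ((c1*n1 + c2*n2 + k) * n3), clamped on overflow */
    unsigned long loop_cost(unsigned long c1, unsigned long n1,
                            unsigned long c2, unsigned long n2,
                            unsigned long k,  unsigned long n3)
    {
        unsigned long inner = sat_add(sat_add(sat_mul(c1, n1), sat_mul(c2, n2)), k);
        return sat_mul(inner, n3);
    }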

Runtime dependence test
When independence cannot be proven statically, the generated check combines a runtime dependence test with the cost test (work by Peng Zhao):
    int main() {
      long … = _xlsmpInitializeRTE();
      if (n > 0) {
        if (<runtime dependence test> && loop_cost > threshold) {
          /* runtime call invoking the outlined routine in parallel */
        } else {
          /* run the loop serially */
        }
      }
      return …;
    }
    /* original loop and outlined routine as before */

Controlled parallelization • Cost analysis → selects big loops • Controlled parallelization: – Selection alone is not enough – Parallel performance depends on both the amount of work and the number of processors used – Using a large number of processors for a small loop → huge degradations!

[Chart: measured on a 64-way Power5 machine. "Small is good!"]

Controlled parallelization • Introduce another runtime parameter, IPT (minimum iterations per thread) • IPT is passed to the SMP runtime • The SMP runtime limits the number of threads working on the parallel loop based on IPT • IPT = function(loop_cost, memory access info, …)
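A hedged sketch of what such a function could look like; MIN_WORK_PER_THREAD and the per-iteration cost model are illustrative assumptions, not the XL compiler's actual heuristic.

    /* cycles of work a thread should do to amortize fork/join overhead (assumed) */
    #define MIN_WORK_PER_THREAD 100000UL

    static unsigned long compute_ipt(unsigned long cost_per_iteration)
    {
        if (cost_per_iteration == 0)
            cost_per_iteration = 1;
        unsigned long ipt = MIN_WORK_PER_THREAD / cost_per_iteration;
        return (ipt > 0) ? ipt : 1;   /* at least one iteration per thread */
    }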

Controlled Parallelization
Original code:
    int main() {
      for (int i = 0; i < n; i++) {
        a[i] = const;
        ……
      }
    }
The generated code computes IPT from the loop cost and passes it to the SMP runtime along with the outlined routine (the new runtime parameter):
    int main() {
      long … = _xlsmpInitializeRTE();
      if (n > 0) {
        if (loop_cost > threshold) {
          IPT = func(loop_cost);
          /* runtime call invoking the outlined routine, passing IPT */
        } else {
          /* run the loop serially */
        }
      }
      return …;
    }
    /* outlined routine as before */

SMP parallel runtime
    {
      /* cap the team size so each thread gets at least IPT iterations */
      threadsUsed = IterCount / IPT;
      if (threadsUsed > threadsAvailable)
        threadsUsed = threadsAvailable;
      …..
    }
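With made-up numbers: for a loop of 10,000 iterations and IPT = 2,000, the runtime uses at most 5 threads even on a 64-way machine, while a 200,000-iteration loop with the same IPT is free to use all 64.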

Controlled parallelization for OpenMP • Improves performance and scalability • Allows fine-grained control at loop-level granularity • Can be applied to OpenMP loops as well • Adjusts the number of threads when the environment variable OMP_DYNAMIC is turned on • There are issues with threadprivate data • Encouraging results on galgel
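For reference, a small standard-OpenMP example of opting in to dynamic thread adjustment; this is the portable API rather than anything specific to the XL implementation described here.

    #include <omp.h>

    void scale(double *a, int n, double s)
    {
        omp_set_dynamic(1);            /* equivalent to setting OMP_DYNAMIC=true */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] *= s;
    }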

[Chart: measured on a 64-way Power5 machine.]

Future work • Improve the cost analysis algorithm and fine-tune heuristics • Implement interprocedural cost analysis • Extend cost analysis and controlled parallelization to non-loop regions in user-parallel code, for scalability • Implement interprocedural dependence analysis