First INFN International School on Architectures, tools and methodologies for developing efficient large scale scientific computing applications – Ce.U.B., Bertinoro, Italy, 12–17 October 2009

Exercise on Parallelization

Caveat: what I suggest here is my way to proceed, but I am far from being an expert in parallelization! So, of course, it is possible to do better…

When we want to parallelize

- Reduction of the wall time: we want to achieve better performance, defined in terms of response/execution times for the results
- Memory problem: the data sample is large, so we want to split it into different sub-samples

Remember the two strategies:
- SPMD: same program, different data
- MIMD: different programs, different data

Typical problems suitable for parallelization

The problem can be broken down into subparts:
- Each subpart is independent of the others
- No communication is required, except to split up the problem and combine the final results
- Ex: Monte Carlo simulations (see the sketch below)

Regular and synchronous problems:
- Same instruction set (regular algorithm) applied to all data
- Synchronous communication (or close to it): each processor finishes its task at the same time
- Local (neighbor-to-neighbor) and collective (combine final results) communication
- Ex: algebra (matrix-vector products), Fast Fourier Transforms
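As an illustration of the first category, here is a minimal sketch of my own (not part of the exercise material): a Monte Carlo estimate of pi with OpenMP, where the trials are split among threads and the only communication is the final reduction. rand_r is POSIX-specific and is used only to keep the example short; compile with, e.g., g++ -fopenmp.

// Embarrassingly parallel Monte Carlo estimate of pi (sketch, assumptions as
// stated above). Each thread works on its own share of the trials; the only
// "communication" is the final sum of the per-thread counters (reduction).
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char** argv) {
  const long trials = (argc > 1) ? atol(argv[1]) : 10000000L;
  long hits = 0;

  #pragma omp parallel reduction(+ : hits)
  {
    unsigned int seed = 1234u + 17u * omp_get_thread_num();  // private seed per thread
    #pragma omp for
    for (long i = 0; i < trials; ++i) {
      double x = rand_r(&seed) / (double)RAND_MAX;  // rand_r: POSIX, per-thread state
      double y = rand_r(&seed) / (double)RAND_MAX;
      if (x * x + y * y <= 1.0) ++hits;             // point inside the unit circle
    }
  }

  printf("pi ~ %f\n", 4.0 * hits / (double)trials);
  return 0;
}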

Before parallelization

Parallelization is not the first solution when your program is slow:
- See whether you can improve the performance by improving the code
- Sometimes compiler optimizations can make the difference (and they do the work for you!)
- Try to understand whether there are better implementations of your problem (see our example)
- Do not re-invent the wheel: use parallel libraries and check whether similar parallel implementations already exist
- Remember that parallel implementations are more difficult to debug than serial ones

Parallelization suggestions

General law: THINK PARALLEL!
- Start writing your program thinking directly of the parallel implementation
  - Can be challenging for beginners
- Write a serial version of the code first and then move to parallelization
  - Good for beginners (use the serial version as reference)
  - Anyway, it wastes time! It can be far from straightforward to move from a serial to a parallel implementation of the code
- So, again, THINK PARALLEL

Real-life case

Usually we start to think in parallel when our serial implementation is too slow (or, in general, when we want to achieve better response/execution times for the results):
- In this case we start from a serial implementation (or, in general, we "inherit" the code from previous users)
- Worst situation: sometimes it can be useful to rewrite the code from scratch (when convenient)
  - Many complex serial codes are strictly serial and very difficult to parallelize (not thread-safe, complex data structures, …)

More practical suggestions (1)

Whether you are writing a new program or you already have a serial implementation:
- Understand which part of the code is worth parallelizing
- Understand your data structure
  - Which data you want to share, which data are private
- Consider the communications and synchronizations
  - Keep the communication-time/calculation-time ratio low
- Start with a simpler parallel implementation, for example reducing the data structure
- Each parallel implementation MUST be able to run serially (a single process)
  - Make sure that, when run in parallel, it gives the same results

More practical suggestions (2)

- Remember to balance the load between the processes
  - The final time is given by the slowest process!
- Scalability: depends on your problem, usually on the data decomposition
  - Ex.: in the parallelization of a simulation of 10 particles, you have a limit of 10 processors
- Speed-up: do not expect to run a one-week program in one second! Remember Amdahl's Law:
  - S → speedup
  - P → portion of the code which is parallelized
  - N → number of simultaneous processes
- Need to find good algorithms to be parallelized!

Amdahl's Law (figure)
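The formula itself is not in the transcript; the standard form of the law, using the symbols S, P and N defined on the previous slide, is:

% Amdahl's Law (standard form): speedup S with N simultaneous processes when
% a fraction P of the code is parallelized; for N -> infinity the speedup
% saturates at 1/(1 - P).
S \;=\; \frac{1}{(1 - P) + \dfrac{P}{N}},
\qquad
\lim_{N \to \infty} S \;=\; \frac{1}{1 - P}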

The exercise

Simulation of N interacting particles in a 1D box
- Example from the Par Lab Boot Camp
- Short-range interaction

A common simulation problem
- The same implementation can be applied in several other cases

How to proceed

- Copy the directory /nfsmaster/innocente/parallel into your area
- Inside this directory you find a README.txt file
- 3 proposed serial implementations (see the corresponding directories)
- Make copies of serial.cxx, renaming them to openmp.cxx, mpi.cxx, and thread.cxx
  - Look at the comments inside the file to understand where to apply the parallelization (essentially only these files require modifications)
  - You can find all the "solutions" in our lectures, but you can also look on the web or ask me to find better solutions

case1

Compile the code with: make serial
Run: ./serial -h

Options:
  -h  to see this help
  -d  draw the particles
  -n  to set the number of particles
  -o  to specify the output file name

case1

Basically two loops:

for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    // interaction between [i, j]

Example (serial execution), scaling as N²!!!
N = …  --->  5.49 seconds
N = …  --->  1.38 seconds  --->  x3.98
N = …  --->  0.22 seconds  --->  x24.95

The parallel implementation in this case is easy...

Example (OpenMP), scaling as the number of processors P:
P = 2, N = …  --->  2.78 seconds  --->  x1.97
P = 3, N = …  --->  1.86 seconds  --->  x2.95
P = 4, N = …  --->  1.40 seconds  --->  x3.92
P = 8, N = …  --->  0.72 seconds  --->  x7.58
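The OpenMP parallelization of this double loop can be as simple as one pragma on the outer loop. Below is a minimal sketch of my own (the Particle type, the cutoff value and the toy force are assumptions, not the actual serial.cxx code):

// Minimal OpenMP sketch for case1 (hypothetical names and toy force). Every
// pair (i, j) is visited, so the work is O(N^2). Each iteration of the outer
// loop writes only to particles[i], so the outer loop can be split over
// threads without race conditions.
#include <vector>

struct Particle { double x, vx, ax; };

void apply_force(Particle& p, const Particle& q) {
  const double cutoff = 0.01;                            // hypothetical interaction range
  double dx = q.x - p.x;
  if (dx == 0.0 || dx > cutoff || dx < -cutoff) return;  // short-range: skip distant pairs
  p.ax += dx;                                            // toy force, illustration only
}

void compute_forces(std::vector<Particle>& particles) {
  const int N = static_cast<int>(particles.size());
  #pragma omp parallel for            // distribute the outer loop over the threads
  for (int i = 0; i < N; i++) {
    particles[i].ax = 0.0;            // reset the accumulator of particle i
    for (int j = 0; j < N; j++) {
      if (j != i) apply_force(particles[i], particles[j]);
    }
  }
}

With P threads the outer loop splits into P roughly equal chunks, which is why the timings above show a gain close to a factor P (compile with -fopenmp).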

case2

Note that the interaction between B and A is the opposite of the one between A and B
- We can calculate only half of the interactions:

for (int i = 0; i < N-1; i++)
  for (int j = i+1; j < N; j++)
    // interaction between [i, j] and [j, i]

Example (serial execution), still scaling as N²!!!
N = …  --->  2.66 seconds  --->  x2.06
N = …  --->  0.70 seconds  --->  x7.84
N = …  --->  0.11 seconds  --->  x49.91

Requires some attention to avoid race conditions (see the sketch below)
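A minimal sketch of one way to handle that race condition (again my own illustration with hypothetical names, not the exercise solution): since each pair now updates both particles, concurrent writes to the same particle must be protected, here with atomic updates; dynamic scheduling helps with the load imbalance of the triangular loop.

// Minimal OpenMP sketch for case2 (hypothetical names and toy force). Using
// Newton's third law, each pair (i, j) updates BOTH particles, so two threads
// working on different i can write the same j: that is the race condition.
// Atomic updates are a simple (if not the fastest) way to make it safe.
#include <vector>

struct Particle { double x, vx, ax; };

void compute_forces_half(std::vector<Particle>& p) {
  const int N = static_cast<int>(p.size());
  const double cutoff = 0.01;                    // hypothetical interaction range

  for (int i = 0; i < N; i++) p[i].ax = 0.0;     // reset accumulators

  // schedule(dynamic): row i has N-1-i pairs, so static chunks would be unbalanced
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < N - 1; i++) {
    for (int j = i + 1; j < N; j++) {
      double dx = p[j].x - p[i].x;
      if (dx > cutoff || dx < -cutoff) continue;  // short-range: skip distant pairs
      double f = dx;                              // toy force, illustration only
      #pragma omp atomic
      p[i].ax += f;                               // interaction [i, j] ...
      #pragma omp atomic
      p[j].ax -= f;                               // ... and its opposite [j, i]
    }
  }
}

An alternative that usually scales better is to give each thread a private force array and sum the arrays at the end of the parallel region.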

case3

How can we scale as N (not N²)?

Hint: remember that we have a short-range interaction, i.e. we do not need the interaction between all particles

Do a mesh (a decomposition of the data sample), where the size of the cells is the range of the interaction

Loop over the cells: for each cell, loop over its particles and compute the interactions with the particles of the neighboring cells
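A minimal sketch of this cell (mesh) decomposition in 1D, my own illustration: the box size, the cutoff and the toy force are assumptions, not the exercise code.

// Minimal 1D cell-list sketch for case3 (hypothetical names). Particles are
// binned into cells whose size equals the interaction range, so each particle
// only looks at its own cell and the two neighboring cells: the total work
// scales as N instead of N^2 (for a fixed particle density).
#include <algorithm>
#include <vector>

struct Particle { double x, vx, ax; };

void compute_forces_cells(std::vector<Particle>& p, double box_size, double cutoff) {
  const int ncells = std::max(1, static_cast<int>(box_size / cutoff));
  const double cell_size = box_size / ncells;    // at least as large as the cutoff

  // 1) bin the particle indices into cells (cost: O(N))
  std::vector<std::vector<int>> cells(ncells);
  for (int i = 0; i < static_cast<int>(p.size()); ++i) {
    int c = std::min(ncells - 1, static_cast<int>(p[i].x / cell_size));
    cells[c].push_back(i);
  }

  // 2) loop over the cells; each particle interacts only with the particles
  //    in its own cell and in the neighboring cells
  for (int c = 0; c < ncells; ++c) {
    for (int i : cells[c]) {
      p[i].ax = 0.0;
      for (int nc = std::max(0, c - 1); nc <= std::min(ncells - 1, c + 1); ++nc) {
        for (int j : cells[nc]) {
          if (j == i) continue;
          double dx = p[j].x - p[i].x;
          if (dx > cutoff || dx < -cutoff) continue;
          p[i].ax += dx;                          // toy force, illustration only
        }
      }
    }
  }
}

The binning and the cell loop both cost O(N) for a fixed density. The loop over cells (or over particles) can then be parallelized as in case1, and keeping a similar number of particles per cell (the adaptive mesh mentioned below) keeps the load balanced.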

What do you expect as the speed-up of the serial implementation?

case3

Example (serial execution), scaling as N:
N = …  --->  0.19 seconds  --->  x28.89
N = …  --->  0.39 seconds  --->  x14.08
N = …  --->  0.79 seconds  --->  x6.95

Better performance is still possible…

Decomposition problems like this one are common:
- You can split the data over the processors
- To get a good balance you can use an adaptive mesh (or more complex adaptive mesh refinement)

Galaxy formation (example from …)
- A total of about one billion individual grid cells
- Adaptive mesh refinement

Figure: the 3D domain (2 billion light years on a side); colors represent the density of the gas.

And now enjoy the exercise!