Parallel Computing Explained: Parallel Code Tuning
Slides prepared from the CI-Tutor courses at NCSA by S. Masoud Sadjadi, School of Computing and Information Sciences, Florida International University, March 2009.

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
  5.1 Sequential Code Limitation
  5.2 Parallel Overhead
  5.3 Load Balance
      Loop Schedule Types
      Chunk Size

Parallel Code Tuning
This chapter describes several of the most common techniques for parallel tuning, the types of programs that benefit from them, and the details of implementing them. The majority of this chapter deals with improving load balance.

Sequential Code Limitation
Sequential code is the part of the program that cannot be run on multiple processors. Some reasons why a piece of code cannot be made data parallel are:
- The code is not in a do loop.
- The do loop contains a read or write.
- The do loop contains a dependency (see the sketch after this slide).
- The do loop has an ambiguous subscript.
- The do loop has a call to a subroutine or a reference to a function subprogram.

Sequential Code Fraction
As shown by Amdahl's Law, if the sequential fraction is too large, speedup is limited. If you suspect that too much sequential code is a problem, you can calculate the sequential fraction of code using the Amdahl's Law formula.
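A minimal sketch of the dependency case from the list above, using made-up array updates: in the first loop each iteration needs the value computed by the previous iteration, so it cannot be made data parallel, while the second loop touches only its own elements and can safely be parallelized.

program dependency_example
  implicit none
  integer, parameter :: n = 1000
  real(8) :: a(n), b(n)
  integer :: i
  a = 1.0d0
  b = 2.0d0

  ! Sequential: a(i) depends on a(i-1) from the previous iteration,
  ! so the iterations cannot run concurrently as written.
  do i = 2, n
     a(i) = a(i-1) + b(i)
  end do

  ! Data parallel: each iteration updates only its own element,
  ! so the loop can be split across threads.
  !$omp parallel do
  do i = 1, n
     b(i) = 2.0d0 * b(i)
  end do
  !$omp end parallel do

  print *, a(n), b(n)
end program dependency_example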

Sequential Code Limitation
Measuring the Sequential Code Fraction
- Decide how many processors to use; this is p.
- Run and time the program with 1 processor to give T(1).
- Run and time the program with p processors to give T(p).
- Form the ratio of the two timings, SP = T(1)/T(p).
- Substitute SP and p into the Amdahl's Law formula f = (1/SP - 1/p)/(1 - 1/p) and solve for f, the fraction of sequential code.

Decreasing the Sequential Code Fraction
The compiler's optimization reports list which loops could not be parallelized and why. You can use these reports as a guide to improve performance on do loops by:
- Removing dependencies
- Removing I/O
- Removing calls to subroutines and function subprograms
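A minimal worked example of the formula above, with made-up timings (100 seconds on 1 processor and 40 seconds on 4 processors), just to show the arithmetic:

program amdahl_fraction
  implicit none
  real(8) :: t1, tp, sp, f
  integer :: p
  t1 = 100.0d0        ! assumed T(1): time on 1 processor
  tp = 40.0d0         ! assumed T(p): time on p processors
  p  = 4
  sp = t1 / tp        ! measured speedup SP = 2.5
  f  = (1.0d0/sp - 1.0d0/real(p,8)) / (1.0d0 - 1.0d0/real(p,8))
  print *, 'sequential fraction f =', f   ! prints 0.2 for these timings
end program amdahl_fraction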

Parallel Overhead
Parallel overhead is the processing time spent:
- creating threads
- spinning/blocking threads
- starting and ending parallel regions
- synchronizing at the end of parallel regions
When the computational work done by the parallel processes is too small, the overhead time needed to create and control the parallel processes can be disproportionately large, limiting the savings due to parallelism.

Measuring Parallel Overhead
To get a rough under-estimate of parallel overhead:
- Run and time the code using 1 processor.
- Parallelize the code.
- Run and time the parallel code using only 1 processor.
- Subtract the two timings.
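A minimal sketch of that measurement, assuming a simple loop body: it times the same work serially and as an OpenMP parallel loop forced onto a single thread, and the difference between the two timings is a rough under-estimate of the parallel overhead.

program overhead_estimate
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  real(8) :: a(n), t0, t_serial, t_parallel
  integer :: i

  ! Time the serial loop.
  t0 = omp_get_wtime()
  do i = 1, n
     a(i) = sqrt(real(i,8))
  end do
  t_serial = omp_get_wtime() - t0

  ! Time the same loop as a parallel loop run on only 1 thread.
  call omp_set_num_threads(1)
  t0 = omp_get_wtime()
  !$omp parallel do
  do i = 1, n
     a(i) = sqrt(real(i,8))
  end do
  !$omp end parallel do
  t_parallel = omp_get_wtime() - t0

  print *, 'estimated parallel overhead (s):', t_parallel - t_serial
end program overhead_estimate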

Parallel Overhead
Reducing Parallel Overhead
To reduce parallel overhead:
- Don't parallelize all the loops.
- Don't parallelize small loops. To benefit from parallelization, a loop needs roughly 1000 floating point operations or 500 statements in the loop. You can use the IF clause in the OpenMP directive to control when loops are parallelized:
  !$OMP PARALLEL DO IF(n > 500)
  do i=1,n
     ... body of loop ...
  end do
  !$OMP END PARALLEL DO
- Use task parallelism instead of data parallelism. It doesn't generate as much parallel overhead, and often more code runs in parallel (see the sketch after this slide).
- Don't use more threads than you need.
- Parallelize at the highest level possible.
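A minimal sketch of task parallelism with OpenMP sections, assuming two independent pieces of work (the subroutines here are invented for illustration); each section is executed by a different thread, which is one way to get more of the code running in parallel without parallelizing every individual loop.

program task_parallel_sketch
  implicit none
  real(8) :: s1, s2

  ! Each section is an independent task handed to a different thread.
  !$omp parallel sections
  !$omp section
  call work_a(s1)
  !$omp section
  call work_b(s2)
  !$omp end parallel sections

  print *, s1, s2

contains

  subroutine work_a(s)
    real(8), intent(out) :: s
    integer :: i
    s = 0.0d0
    do i = 1, 1000000
       s = s + sqrt(real(i,8))
    end do
  end subroutine work_a

  subroutine work_b(s)
    real(8), intent(out) :: s
    integer :: i
    s = 0.0d0
    do i = 1, 1000000
       s = s + log(real(i,8))
    end do
  end subroutine work_b

end program task_parallel_sketch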

Load Balance
Load balance is the even assignment of subtasks to processors so as to keep each processor busy doing useful work for as long as possible. Load balance is important for speedup because the end of a do loop is a synchronization point where threads need to catch up with each other. If the processors have different workloads, some of them will sit idle while the others are still working.

Measuring Load Balance
On the SGI Origin, to measure load balance, use the perfex tool, a command-line interface to the R10000 hardware counters. The command
  perfex -e16 -mp a.out > results
reports per-thread cycle counts. Compare the cycle counts to determine load balance problems. The master thread (thread 0) always uses more cycles than the slave threads, but if the counts are vastly different, it indicates load imbalance.

Load Balance
On Linux systems, the per-thread CPU times can be compared with ps. A thread with unusually high or low time compared to the others may not be working efficiently (a high CPU time could be the result of a thread spinning while waiting for other threads to catch up):
  ps uH

Improving Load Balance
To improve load balance, try changing the way that loop iterations are allocated to threads by:
- changing the loop schedule type
- changing the chunk size
These methods are discussed in the following sections.

Loop Schedule Types
On the SGI Origin2000 computer, 4 different loop schedule types can be specified by an OpenMP directive:
- Static
- Dynamic
- Guided
- Runtime
If you don't specify a schedule type, the default is used.

Default Schedule Type
The default schedule type allocates 20 iterations on 4 threads as one contiguous block per thread: thread 0 gets iterations 1-5, thread 1 gets 6-10, thread 2 gets 11-15, and thread 3 gets 16-20.
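A minimal sketch of how each type is requested on a parallel do directive; the loop body is just filler, and with the runtime type the actual schedule is taken from the OMP_SCHEDULE environment variable when the program runs.

program schedule_types
  implicit none
  integer, parameter :: n = 20
  real(8) :: a(n)
  integer :: i
  a = 1.0d0

  !$omp parallel do schedule(static)    ! fixed assignment decided up front
  do i = 1, n
     a(i) = a(i) + 1.0d0
  end do
  !$omp end parallel do

  !$omp parallel do schedule(dynamic)   ! threads grab chunks as they finish
  do i = 1, n
     a(i) = a(i) + 1.0d0
  end do
  !$omp end parallel do

  !$omp parallel do schedule(guided)    ! chunk size shrinks as work remains
  do i = 1, n
     a(i) = a(i) + 1.0d0
  end do
  !$omp end parallel do

  !$omp parallel do schedule(runtime)   ! schedule taken from OMP_SCHEDULE
  do i = 1, n
     a(i) = a(i) + 1.0d0
  end do
  !$omp end parallel do

  print *, sum(a)
end program schedule_types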

Loop Schedule Types
Static Schedule Type
The static schedule type is used when some of the iterations do more work than others. With the static schedule type, iterations are allocated in a round-robin fashion to the threads.

An Example
Suppose you are computing on the upper triangle of a 100 x 100 matrix using 2 threads, named t0 and t1. With default scheduling, the workloads are uneven: t0 gets the first 50 columns, which are short, while t1 gets the last 50 columns, which are long, so t1 does most of the work.

Loop Schedule Types
With static scheduling, in contrast, the columns of the matrix are given to the threads in a round-robin fashion, so each thread gets a mix of short and long columns, resulting in better load balance.
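A minimal sketch of that example, assuming a simple update of the upper triangle of a 100 x 100 matrix; the round-robin static schedule is written here as SCHEDULE(STATIC,1), which deals the columns to the threads in turn so that each thread gets both short and long columns.

program triangle_static
  implicit none
  integer, parameter :: n = 100
  real(8) :: a(n,n)
  integer :: i, j
  a = 1.0d0

  ! Column j of the upper triangle has j entries, so later columns
  ! are more expensive; dealing columns round-robin balances the work.
  !$omp parallel do schedule(static,1) private(i)
  do j = 1, n
     do i = 1, j
        a(i,j) = 2.0d0 * a(i,j)
     end do
  end do
  !$omp end parallel do

  print *, a(1,1), a(n,n)
end program triangle_static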

Loop Schedule Types
Dynamic Schedule Type
With the dynamic schedule type, iterations are allocated to threads at runtime. Each thread is given a chunk of iterations; when a thread finishes its work, it enters a critical section where it is given another chunk of iterations to work on. This type is useful when you don't know the iteration count or the work pattern ahead of time. Dynamic scheduling gives good load balance, but at a high overhead cost.

Guided Schedule Type
The guided schedule type is dynamic scheduling that starts with large chunks of iterations and ends with small chunks. That is, the number of iterations given to each thread depends on the number of iterations remaining. The guided schedule type reduces the number of entries into the critical section compared to the dynamic schedule type, giving good load balance at a lower overhead cost.
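A minimal sketch with an invented irregular workload (iteration i does i units of work), showing where the dynamic schedule is requested; swapping schedule(dynamic,4) for schedule(guided) would start with large chunks and shrink them as the remaining iterations run out.

program irregular_work
  implicit none
  integer, parameter :: n = 400
  real(8) :: total
  integer :: i, j

  total = 0.0d0
  ! Later iterations do much more work, so a fixed block split would
  ! leave the thread holding the early iterations idle; dynamic hands
  ! out chunks of 4 iterations to whichever thread is free next.
  !$omp parallel do schedule(dynamic,4) private(j) reduction(+:total)
  do i = 1, n
     do j = 1, i
        total = total + sqrt(real(j,8))
     end do
  end do
  !$omp end parallel do

  print *, 'total =', total
end program irregular_work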

Chunk Size
The word chunk refers to a grouping of iterations; the chunk size is the number of iterations in that grouping. The static and dynamic schedule types can be used with a chunk size. If a chunk size is not specified, then the chunk size is 1.

Suppose you specify a chunk size of 2 with the static schedule type. Then 20 iterations are allocated on 4 threads in round-robin chunks of 2: thread 0 gets iterations 1-2, 9-10, and 17-18; thread 1 gets 3-4, 11-12, and 19-20; thread 2 gets 5-6 and 13-14; thread 3 gets 7-8 and 15-16.

The schedule type and chunk size are specified as follows:
  !$OMP PARALLEL DO SCHEDULE(type, chunk)
  …
  !$OMP END PARALLEL DO
where type is STATIC, DYNAMIC, or GUIDED and chunk is any positive integer.
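A minimal sketch that makes the assignment above visible by recording, for each of 20 iterations, which of 4 threads executed it under SCHEDULE(STATIC,2); printing the owner array reproduces the round-robin chunk pattern described in this slide.

program chunk_demo
  use omp_lib
  implicit none
  integer :: i, owner(20)

  call omp_set_num_threads(4)

  ! Each iteration records the thread that executed it.
  !$omp parallel do schedule(static,2)
  do i = 1, 20
     owner(i) = omp_get_thread_num()
  end do
  !$omp end parallel do

  do i = 1, 20
     print '(a,i2,a,i1)', 'iteration ', i, ' -> thread ', owner(i)
  end do
end program chunk_demo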