Design Issues
Jaruloj Chongstitvatana, Parallel Programming: Parallelization

How to parallelize
- Task decomposition
- Data decomposition
- Dataflow decomposition

Task Decomposition

Task decomposition
- Identify tasks: decompose the serial code into parts that can be run in parallel.
  - These parts need to be completely independent.
  - Dependency: an interaction between tasks.
- Sequential consistency property
  - The parallel code gives the same result, for the same input, as the serial code does.
- Test for a parallelizable loop
  - If running the loop in reverse order gives the same result as the original loop, it is possibly parallelizable (see the sketch below).
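
A minimal sketch of the loop-reversal test, using hypothetical arrays a, b and c: the first loop gives the same result whether it runs forward or backward, so it is a candidate for parallelization; the second carries a value from one iteration to the next and fails the test.

    /* Candidate: each iteration touches only its own elements,
       so forward and reverse execution give identical results. */
    for (i = 0; i < N; i++)
        c[i] = a[i] * b[i];

    /* Not a candidate: iteration i reads the value written by
       iteration i-1, so reversing the loop changes the result. */
    for (i = 1; i < N; i++)
        a[i] = a[i-1] + 1;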

Design considerations
- What are the tasks, and how are they defined?
- What are the dependencies between tasks, and how can they be satisfied?
- How are tasks assigned to threads/processors?

Task Definition
- Tasks usually correspond to activities, as in GUI applications.
- Example: a multimedia web application (see the sketch below)
  - Play background music.
  - Display animation.
  - Read input from the user.
- Which part of the code should be parallelized?
  - Hotspot: the part of the code that is executed most often.
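
A minimal sketch of this kind of task decomposition using OpenMP sections; the three functions (play_music, display_animation, read_user_input) are hypothetical placeholders for the activities listed above, not part of the original slides.

    #include <omp.h>

    /* Hypothetical activity functions for the multimedia example. */
    void play_music(void);
    void display_animation(void);
    void read_user_input(void);

    void run_app(void)
    {
        /* Each section is an independent task executed by one thread. */
        #pragma omp parallel sections
        {
            #pragma omp section
            play_music();

            #pragma omp section
            display_animation();

            #pragma omp section
            read_user_input();
        }
    }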

Criteria for decomposition
- More tasks than threads (or cores). Why?
- Granularity (fine-grained vs. coarse-grained decomposition)
  - The amount of computation per task, or the time between synchronizations.
  - Tasks should be large enough compared to the overhead of handling tasks and threads.
  - Overhead includes thread management, synchronization, etc.

(Figure: fine-grained vs. coarse-grained decomposition)

Dependency between tasks
- Order dependency: one task must execute before another. It can be enforced by
  - putting the dependent tasks in the same thread, or
  - adding synchronization.
- Data dependency: variables shared between tasks. It can be handled with
  - shared and private variables, or
  - locks and critical regions.

(Figure: task dependency graphs)

Serial code with a single shared accumulator:

    sum = 0;
    for (i = 0; i < m; i++)
        sum = sum + a[i];
    for (i = 0; i < n; i++)
        sum = sum + b[i];

The same computation rewritten with independent partial sums, removing the data dependency between the two loops:

    sum = 0;
    suma = 0;
    for (i = 0; i < m; i++)
        suma = suma + a[i];
    sumb = 0;
    for (j = 0; j < n; j++)
        sumb = sumb + b[j];
    sum = suma + sumb;
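
A hedged sketch of how the rewritten version maps onto OpenMP: the two independent partial sums can be computed in parallel sections (a reduction clause would be the more idiomatic choice); names follow the snippet above.

    #include <omp.h>

    double parallel_sum(const double *a, int m, const double *b, int n)
    {
        double suma = 0.0, sumb = 0.0;

        /* Each partial sum is written by exactly one section, so the
           sections can run concurrently without synchronization. */
        #pragma omp parallel sections
        {
            #pragma omp section
            for (int i = 0; i < m; i++)
                suma = suma + a[i];

            #pragma omp section
            for (int j = 0; j < n; j++)
                sumb = sumb + b[j];
        }
        return suma + sumb;
    }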

Task scheduling
- Static scheduling
  - Simple.
  - Works well if the amount of work can be estimated before execution.
- Dynamic scheduling
  - Create more tasks than processing elements.
  - Assign a task to a processing element whenever it becomes free.
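
In OpenMP the two policies correspond to the schedule clause on a worksharing loop; a minimal sketch, assuming a hypothetical work(i) function whose cost may vary between iterations.

    #include <omp.h>

    void work(int i);   /* hypothetical per-iteration workload */

    void process(int n)
    {
        /* Static: iterations are divided among threads before the loop starts. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
            work(i);

        /* Dynamic: a free thread grabs the next chunk of 16 iterations,
           which balances the load when iteration costs vary. */
        #pragma omp parallel for schedule(dynamic, 16)
        for (int i = 0; i < n; i++)
            work(i);
    }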

Data Decomposition

Data decomposition
- Divide the data into chunks; each task works on one chunk.
- Considerations
  - How to divide the data.
  - Make sure each task has access to the data it requires.
  - Where each chunk goes.

How to divide data
- Roughly equally, except when the amount of computation is not the same for all data.
- Shape of the chunks: the number of neighboring chunks determines the amount of data exchanged.
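
A minimal sketch of dividing an array of n elements roughly equally among nthreads workers: the remainder n % nthreads is spread over the first few chunks, so chunk sizes differ by at most one. The function name and parameters are illustrative.

    /* Compute the half-open range [start, end) owned by thread tid. */
    void chunk_bounds(int n, int nthreads, int tid, int *start, int *end)
    {
        int base = n / nthreads;     /* minimum chunk size            */
        int rem  = n % nthreads;     /* the first 'rem' chunks get +1 */

        *start = tid * base + (tid < rem ? tid : rem);
        *end   = *start + base + (tid < rem ? 1 : 0);
    }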

Data access for each task
- Make a local copy of the data for each task.
  - Data duplication wastes memory and requires synchronization to keep the copies consistent.
  - No synchronization is needed if the data are read-only.
  - Not worthwhile if the data are used only a few times.
- No duplication is needed in the shared-memory model.
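
In OpenMP, a per-thread copy can be expressed with the firstprivate clause, which gives each thread its own copy initialized from the original values; a minimal sketch with a small hypothetical coefficient table.

    #include <omp.h>

    void scale(double *x, int n, double factor)
    {
        double coeff[4] = { 1.0, factor, factor * factor, 0.5 };

        /* firstprivate gives every thread its own copy of 'coeff',
           so the threads read it without any synchronization. */
        #pragma omp parallel for firstprivate(coeff)
        for (int i = 0; i < n; i++)
            x[i] = x[i] * coeff[i % 4];
    }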

Assigning chunks to threads/cores
- Static scheduling
  - In the distributed-memory model, shared data must be taken into account to reduce synchronization.
- Dynamic scheduling
  - Used when the workload is not known ahead of time.

Example (one generation of the Game of Life)

    void computeNextGen(Grid curr, Grid next, int N, int M)
    {
        int count;
        for (int i = 1; i <= N; i++) {
            for (int j = 1; j <= M; j++) {
                count = 0;
                if (curr[i-1][j-1] == ALIVE) count++;
                if (curr[i-1][j]   == ALIVE) count++;
                /* ... remaining neighbor checks ... */
                if (curr[i+1][j+1] == ALIVE) count++;
                if (count >= 4)
                    next[i][j] = DEAD;
                else if (curr[i][j] == ALIVE && (count == 2 || count == 3))
                    next[i][j] = ALIVE;
                else if (curr[i][j] == DEAD && count == 3)
                    next[i][j] = ALIVE;
                else
                    next[i][j] = DEAD;
            }
        }
        return;
    }
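
A hedged sketch of a data decomposition of this loop nest with OpenMP: rows of the grid are the chunks, and each iteration reads curr and writes only next[i][j], so the outer loop can be parallelized. The count variable is declared inside the loop so every thread has its own copy; the body is elided and refers to the code above.

    #include <omp.h>

    void computeNextGenParallel(Grid curr, Grid next, int N, int M)
    {
        /* Rows are the data chunks: each thread processes a block of i values. */
        #pragma omp parallel for
        for (int i = 1; i <= N; i++) {
            for (int j = 1; j <= M; j++) {
                int count = 0;
                /* ... the same neighbor counting and update rules
                       as in computeNextGen above ... */
            }
        }
    }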

Dataflow decomposition
- Break up the problem based on how data flows between tasks.
- Typical pattern: producer/consumer.
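
A minimal producer/consumer sketch in C with POSIX threads and a bounded buffer; it illustrates the pattern and is not code from the original slides. The buffer size and item count are arbitrary.

    #include <pthread.h>
    #include <stdio.h>

    #define BUF_SIZE 8
    #define N_ITEMS  32

    /* Bounded buffer shared by the producer and the consumer. */
    static int buffer[BUF_SIZE];
    static int count = 0, in = 0, out = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < N_ITEMS; i++) {
            pthread_mutex_lock(&lock);
            while (count == BUF_SIZE)          /* wait for free space */
                pthread_cond_wait(&not_full, &lock);
            buffer[in] = i;                    /* produce item i */
            in = (in + 1) % BUF_SIZE;
            count++;
            pthread_cond_signal(&not_empty);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < N_ITEMS; i++) {
            pthread_mutex_lock(&lock);
            while (count == 0)                 /* wait for data */
                pthread_cond_wait(&not_empty, &lock);
            int item = buffer[out];
            out = (out + 1) % BUF_SIZE;
            count--;
            pthread_cond_signal(&not_full);
            pthread_mutex_unlock(&lock);
            printf("consumed %d\n", item);     /* consume the item */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }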

What not to parallelize
- Algorithms with state
  - Example: finite state machine simulation.
- Recurrence relations
  - Examples: a convergence loop, computing Fibonacci numbers.
- Induction variables
  - Variables incremented exactly once in each iteration of a loop.
- Reduction
  - Combining a collection of data into a single value, e.g. a sum.
- Loop-carried dependence
  - Results of a previous iteration are used in the current iteration.

Algorithms with state: possible workarounds
- Add some form of synchronization, which serializes the concurrent executions.
- Write the code to be reentrant (i.e., it can be re-entered without detrimental side effects while it is already running).
  - This may not be possible if updating global variables is part of the code.
- Use thread-local storage if the variables holding the state do not have to be shared between threads.
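
A minimal sketch of the thread-local-storage option, assuming the state is a per-thread call counter that does not need to be shared; C11's _Thread_local (OpenMP's threadprivate is the directive-based equivalent) gives every thread its own copy, so no locking is needed.

    #include <stdio.h>
    #include <omp.h>

    /* Each thread gets its own copy of this state variable. */
    static _Thread_local long calls = 0;

    long count_call(void)
    {
        return ++calls;    /* no lock needed: the counter is per-thread */
    }

    int main(void)
    {
        #pragma omp parallel
        {
            for (int i = 0; i < 1000; i++)
                count_call();
            printf("thread %d made %ld calls\n", omp_get_thread_num(), calls);
        }
        return 0;
    }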

Recurrence Relations
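
An illustrative example (not from the original slides), assuming a simple first-order recurrence x[i] = a*x[i-1] + b[i]: each iteration needs the value produced by the previous one, which is why such loops resist straightforward parallelization.

    /* Iteration i cannot start until iteration i-1 has produced x[i-1]. */
    x[0] = x0;
    for (i = 1; i < N; i++)
        x[i] = a * x[i-1] + b[i];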

Induction Variables

The original loop updates the induction variables i1 and i2 in every iteration, so each iteration depends on the previous one:

    i1 = 4; i2 = 0;
    for (k = 1; k < N; k++) {
        B[i1++] = function1(k,q,r);
        i2 += k;
        A[i2] = function2(k,r,q);
    }

Rewritten so that each index is computed directly from k (i1 equals k+3 and i2 equals k(k+1)/2 in iteration k), which removes the cross-iteration dependency:

    for (k = 1; k < N; k++) {
        B[k+3] = function1(k,q,r);
        i2 = (k*k + k)/2;
        A[i2] = function2(k,r,q);
    }
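
Once the induction variables are replaced by closed-form expressions, the iterations no longer depend on each other and the loop can be parallelized. A minimal OpenMP sketch, reusing the hypothetical function1/function2, q and r from the slide above (array element types assumed to be double):

    #include <omp.h>

    void fill_arrays(double *A, double *B, int N, double q, double r)
    {
        #pragma omp parallel for
        for (int k = 1; k < N; k++) {
            int i2 = (k*k + k) / 2;          /* closed form of i2 */
            B[k+3] = function1(k, q, r);     /* closed form of i1 */
            A[i2]  = function2(k, r, q);
        }
    }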

Reduction
- Combines a collection of data into a single scalar value.
- To remove the dependency, the combining operation must be associative and commutative.

    sum = 0;
    big = c[0];
    for (i = 0; i < N; i++) {
        sum += c[i];
        big = (c[i] > big ? c[i] : big);
    }
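
OpenMP expresses this directly with the reduction clause, which gives each thread a private partial result and combines the partial results when the loop ends; a minimal sketch for the sum and maximum above (the max operator requires OpenMP 3.1 or later).

    #include <omp.h>

    void reduce(const double *c, int N, double *sum_out, double *big_out)
    {
        double sum = 0.0;
        double big = c[0];

        /* Each thread accumulates private partials; OpenMP combines
           them with + and max after the loop. */
        #pragma omp parallel for reduction(+:sum) reduction(max:big)
        for (int i = 0; i < N; i++) {
            sum += c[i];
            big = (c[i] > big ? c[i] : big);
        }
        *sum_out = sum;
        *big_out = big;
    }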

Loop-carried Dependence
- The same array appears on both the LHS and the RHS of assignments, with a backward reference (to an earlier iteration) in some RHS use of the array.
- The general case of recurrence relations.
- Cannot be removed easily.

    for (k = 5; k < N; k++) {
        b[k] = DoSomething(k);
        a[k] = b[k-5] + MoreStuff(k);
    }

Example: removing a loop-carried dependence

The value of wrap is carried from one iteration to the next:

    wrap = a[0] * b[0];
    for (i = 1; i < N; i++) {
        c[i] = wrap;
        wrap = a[i] * b[i];
        d[i] = 2 * wrap;
    }

Recomputing wrap at the start of each iteration removes the dependence, so the iterations become independent:

    for (i = 1; i < N; i++) {
        wrap = a[i-1] * b[i-1];
        c[i] = wrap;
        wrap = a[i] * b[i];
        d[i] = 2 * wrap;
    }
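
With the dependence removed, the loop can be parallelized as long as wrap is private to each thread; a minimal OpenMP sketch:

    #include <omp.h>

    void compute(const double *a, const double *b, double *c, double *d, int N)
    {
        #pragma omp parallel for
        for (int i = 1; i < N; i++) {
            double wrap = a[i-1] * b[i-1];   /* recomputed, hence private */
            c[i] = wrap;
            wrap = a[i] * b[i];
            d[i] = 2 * wrap;
        }
    }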