CS4961 Parallel Programming Lecture 2: Introduction to Parallel Algorithms Mary Hall August 25, 2011 08/25/2011 CS4961

Homework 1: Parallel Programming Basics Turn in electronically on the CADE machines using the handin program: “handin cs4961 hw1 <probfile>” Problem 1: (#1.3 in textbook): Try to write pseudo-code for the tree-structured global sum illustrated in Figure 1.1. Assume the number of cores is a power of two (1, 2, 4, 8, …). Hints: Use a variable divisor to determine whether a core should send its sum or receive and add. The divisor should start with the value 2 and be doubled after each iteration. Also use a variable core_difference to determine which core should be partnered with the current core. It should start with the value 1 and also be doubled after each iteration. For example, in the first iteration 0 % divisor = 0 and 1 % divisor = 1, so 0 receives and adds, while 1 sends. Also in the first iteration 0 + core_difference = 1 and 1 – core_difference = 0, so 0 and 1 are paired in the first iteration. 08/23/2011 CS4961

Homework 1: Parallel Programming Basics Problem 2: I recently had to tabulate results from a written survey that had four categories of respondents: (I) students; (II) academic professionals; (III) industry professionals; and, (IV) other. The number of respondents in each category was very different; for example, there were far more students than other categories. The respondents selected to which category they belonged and then answered 32 questions with five possible responses: (i) strongly agree; (ii) agree; (iii) neutral; (iv) disagree; and, (v) strongly disagree. My family members and I tabulated the results “in parallel” (assume there were four of us). (a) Identify how data parallelism can be used to tabulate the results of the survey. Keep in mind that each individual survey is on a separate sheet of paper that only one “processor” can examine at a time. Identify scenarios that might lead to load imbalance with a purely data parallel scheme. (b) Identify how task parallelism and combined task and data parallelism can be used to tabulate the results of the survey to improve upon the load imbalance you have identified. 08/23/2011 CS4961

Homework 1, cont. Problem 3: What are your goals after this year and how do you anticipate this class is going to help you with that? Some possible answers, but please feel free to add to them. Also, please write at least one sentence of explanation. A job in the computing industry A job in some other industry that uses computing As preparation for graduate studies To satisfy intellectual curiosity about the future of the computing field Other 08/23/2011 CS4961

Today’s Lecture Aspects of parallel algorithms (and a hint at complexity!) Derive parallel algorithms Discussion Sources for this lecture: Slides accompanying textbook 08/25/2011 CS4961

Reasoning about a Parallel Algorithm Ignore architectural details for now (next time) Assume we are starting with a sequential algorithm and trying to modify it to execute in parallel Not always the best strategy, as sometimes the best parallel algorithms are NOTHING like their sequential counterparts But useful since you are accustomed to sequential algorithms 08/25/2011 CS4961

Reasoning about a parallel algorithm, cont. Computation Decomposition How to divide the sequential computation among parallel threads/processors/computations? Aside: Also, Data Partitioning (ignore today) Preserving Dependences Keeping the data values consistent with respect to the sequential execution. Overhead We’ll talk about some different kinds of overhead 08/25/2011 CS4961

Race Condition or Data Dependence A race condition exists when the result of an execution depends on the timing of two or more events. A data dependence is an ordering on a pair of memory operations that must be preserved to maintain correctness. (More on data dependences in a subsequent lecture.) Synchronization is used to sequence control among threads or to sequence accesses to data in parallel code. 08/25/2011 CS4961

Simple Example (p. 4 of text) Compute n values and add them together. Serial solution: Parallel formulation?
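The serial solution referenced on the slide is a figure; a minimal sketch of the serial loop in the style of the textbook, assuming Compute_next_value(…) stands in for the per-element computation as on the later slides:

/* Serial sum: compute each value and accumulate into a single variable */
sum = 0;
for (i = 0; i < n; i++) {
    x = Compute_next_value(…);   /* the per-element computation */
    sum += x;
}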

Version 1: Computation Partitioning
Suppose each core computes a partial sum on n/t consecutive elements (t is the number of threads or processors)
Example: n = 24 and t = 8; threads are numbered from 0 to 7, and each gets three consecutive values:
t0: 1 4 3    t1: 9 2 8    t2: 5 1 1    t3: 6 2 7    t4: 2 5 0    t5: 4 1 8    t6: 6 5 1    t7: 2 3 9
int block_length_per_thread = n/t;
int start = id * block_length_per_thread;
for (i = start; i < start + block_length_per_thread; i++) {
    x = Compute_next_value(…);
    sum += x;
}
08/25/2011 CS4961
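The block partitioning above assumes n is evenly divisible by t. As a hedged variant (not from the slides), one common way to split a remainder so that no thread gets more than one extra element, which also foreshadows the load-imbalance overhead discussed later:

/* Block partitioning when n may not be divisible by t: the first (n % t)
   threads each take one extra element. Hypothetical variant, not from the slides. */
int base  = n / t;
int extra = n % t;
int my_count = base + (id < extra ? 1 : 0);
int my_start = id * base + (id < extra ? id : extra);
for (i = my_start; i < my_start + my_count; i++) {
    x = Compute_next_value(…);
    sum += x;    /* still a data race on the shared sum, exactly as in Version 1 */
}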

Data Race on Sum Variable
Two threads may interfere on memory writes
Example interleaving (same data as the previous slide): Thread 1 and Thread 3 each execute load sum / update sum / store sum on the shared variable. Both load sum = 0, so one thread stores sum = 9 and the other stores sum = 6; whichever store happens last overwrites the other, and one partial result is lost.
08/25/2011 CS4961

What Happened? There is a dependence on sum across iterations/threads, but reordering is OK since the operations on sum are associative. Load/increment/store must be done atomically to preserve the sequential meaning. Definitions: Atomicity: a set of operations is atomic if either they all execute or none executes; thus, there is no way to see the results of a partial execution. Mutual exclusion: at most one thread can execute the code at any time. 08/25/2011 CS4961

Version 2: Add Locks
Insert mutual exclusion (mutex) so that only one thread at a time is loading/incrementing/storing sum atomically
int block_length_per_thread = n/t;
mutex m;
int start = id * block_length_per_thread;
for (i = start; i < start + block_length_per_thread; i++) {
    my_x = Compute_next_value(…);
    mutex_lock(m);
    sum += my_x;
    mutex_unlock(m);
}
Correct now. Done? 08/25/2011 CS4961
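The mutex, mutex_lock, and mutex_unlock on the slide are generic pseudocode; as one concrete, hedged rendering, here is a self-contained POSIX threads sketch of Version 2, with Compute_next_value defined trivially just so the program compiles:

/* Version 2 as a runnable pthreads sketch (an assumed rendering, not from the slides).
   Compile with: cc -pthread version2.c */
#include <pthread.h>
#include <stdio.h>

#define N 24
#define T 8

static int sum = 0;                                     /* shared result */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;   /* protects sum */

static int Compute_next_value(int i) { return i % 10; } /* placeholder computation */

static void *worker(void *arg) {
    int id = *(int *)arg;
    int block = N / T;                 /* assumes N divisible by T, as on the slide */
    int start = id * block;
    for (int i = start; i < start + block; i++) {
        int my_x = Compute_next_value(i);
        pthread_mutex_lock(&m);        /* every update serialized: correct but slow */
        sum += my_x;
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[T];
    int ids[T];
    for (int i = 0; i < T; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < T; i++)
        pthread_join(threads[i], NULL);
    printf("sum = %d\n", sum);
    return 0;
}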

Lock Contention and Poor Granularity To acquire the lock, a thread must go through at least a few levels of cache (locality); a local copy of sum held in a register is not going to be correct. There is not a lot of parallel work outside of acquiring/releasing the lock. Slide source: Larry Snyder, “http://www.cs.washington.edu/education/courses/524/08wi/” 08/25/2011 CS4961

Increase Parallelism Granularity Private copy of sum for each core Accumulate later – how? Each core uses its own private variables and executes this block of code independently of the other cores (sketched below). Copyright © 2010, Elsevier Inc. All rights Reserved
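The code block the slide refers to is an image; a sketch in the same style as the other versions, assuming the same id and block_length_per_thread variables used throughout:

/* Each thread accumulates into its own private my_sum; no lock is needed
   inside the loop because nothing here is shared. */
int my_sum = 0;
int start = id * block_length_per_thread;
for (i = start; i < start + block_length_per_thread; i++) {
    my_x = Compute_next_value(…);
    my_sum += my_x;
}
/* The partial sums still have to be combined; that is the "accumulate later"
   question answered by Versions 3 through 5 below. */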

Accumulating result
Version 3: Lock only to update final sum from private copy
int block_length_per_thread = n/t;
mutex m;
int my_sum = 0;
int start = id * block_length_per_thread;
for (i = start; i < start + block_length_per_thread; i++) {
    my_x = Compute_next_value(…);
    my_sum += my_x;
}
mutex_lock(m);
sum += my_sum;
mutex_unlock(m);
08/25/2011 CS4961

Accumulating result
Version 4 (bottom of page 4 in textbook): “Master” processor accumulates result
int block_length_per_thread = n/t;
mutex m;
shared my_sum[t];
int start = id * block_length_per_thread;
for (i = start; i < start + block_length_per_thread; i++) {
    my_x = Compute_next_value(…);
    my_sum[id] += my_x;
}
if (id == 0) {  // master thread
    sum = my_sum[0];
    for (i = 1; i < t; i++)
        sum += my_sum[i];
}
Correct? Why not? 08/25/2011 CS4961

More Synchronization: Barriers Incorrect if the master thread begins accumulating the final result before the other threads are done How can we force the master to wait until the threads are ready? Definition: A barrier is used to block threads from proceeding beyond a program point until all of the participating threads have reached the barrier. Implementation of barriers? 08/25/2011 CS4961
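On the question "Implementation of barriers?": POSIX provides pthread_barrier_init and pthread_barrier_wait directly, and a counting barrier can also be built from a mutex and a condition variable. A hedged sketch of the latter (barrier_t, barrier_init, and barrier_wait are names made up for this example):

#include <pthread.h>

/* A counting barrier built from a mutex and a condition variable.
   The generation counter lets the barrier be reused across rounds. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_here;
    int count;       /* threads that have arrived in the current round */
    int nthreads;    /* threads participating */
    int generation;  /* incremented each time the barrier opens */
} barrier_t;

void barrier_init(barrier_t *b, int nthreads) {
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->all_here, NULL);
    b->count = 0;
    b->nthreads = nthreads;
    b->generation = 0;
}

void barrier_wait(barrier_t *b) {
    pthread_mutex_lock(&b->lock);
    int my_gen = b->generation;
    if (++b->count == b->nthreads) {   /* last thread to arrive */
        b->count = 0;
        b->generation++;               /* release everyone waiting on this round */
        pthread_cond_broadcast(&b->all_here);
    } else {
        while (my_gen == b->generation)   /* loop guards against spurious wakeups */
            pthread_cond_wait(&b->all_here, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}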

Accumulating result
Version 5 (bottom of page 4 in textbook): “Master” processor accumulates result
int block_length_per_thread = n/t;
mutex m;
shared my_sum[t];
int start = id * block_length_per_thread;
for (i = start; i < start + block_length_per_thread; i++) {
    my_x = Compute_next_value(…);
    my_sum[id] += my_x;
}
Synchronize_cores();  // barrier for all participating threads
if (id == 0) {  // master thread
    sum = my_sum[0];
    for (i = 1; i < t; i++)
        sum += my_sum[i];
}
Now it’s correct! 08/25/2011 CS4961

Discussion: Overheads What were the overheads we saw with this example? Extra code to determine portion of computation Locking overhead: inherent cost plus contention Load imbalance 08/25/2011 CS4961

Problem #1 on homework (see page 5) A strategy to improve load imbalance Suppose number of threads is large and master is spending a lot of time in calculating final value Have threads combine partial results with their “neighbors” Tree-structured computation sums partial results 08/25/2011 CS4961

Version 6 (homework): Multiple cores forming a global sum Copyright © 2010, Elsevier Inc. All rights Reserved

How do we write parallel programs? Task parallelism Partition the various tasks carried out in solving the problem among the cores. Data parallelism Partition the data used in solving the problem among the cores. Each core carries out similar operations on its part of the data. Copyright © 2010, Elsevier Inc. All rights Reserved

Professor P: 15 questions, 300 exams. Copyright © 2010, Elsevier Inc. All rights Reserved

Professor P’s grading assistants: TA#1, TA#2, TA#3. Copyright © 2010, Elsevier Inc. All rights Reserved

Division of work – data parallelism: TA#1 grades 100 exams, TA#2 grades 100 exams, TA#3 grades 100 exams. Copyright © 2010, Elsevier Inc. All rights Reserved

Division of work – task parallelism: TA#1 grades questions 1-5, TA#2 grades questions 6-10, TA#3 grades questions 11-15. Copyright © 2010, Elsevier Inc. All rights Reserved

Generalizing from this example Interestingly, this code represents a common pattern in parallel algorithms A reduction computation From a large amount of input data, compute a smaller result that represents a reduction in the dimensionality of the input In this case, a reduction from an array input to a scalar result (the sum) Reduction computations exhibit dependences that must be preserved Looks like “result = result op …” Operation op must be associative so that it is safe to reorder them Aside: Floating point arithmetic is not truly associative, but usually ok to reorder 08/25/2011 CS4961
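Later lectures in this course introduce OpenMP, where this whole reduction pattern is expressed with a single clause. Purely as a hedged preview (the placeholder Compute_next_value and the value of n are made up for the example):

/* The sum example as an OpenMP reduction: each thread gets a private copy of
   sum and the runtime combines the copies at the end. Compile with: cc -fopenmp */
#include <stdio.h>

static int Compute_next_value(int i) { return i % 10; }  /* placeholder computation */

int main(void) {
    const int n = 24;
    int sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += Compute_next_value(i);
    printf("sum = %d\n", sum);
    return 0;
}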

Examples of Reduction Computations Count all the occurrences of a word in news articles (how many times does Libya appear in articles from Tuesday?) What is the minimum id # of all students in this class? What is the highest price paid for a gallon of gas in Salt Lake City over the past 10 years? 08/25/2011 CS4961

Summary of Lecture How to Derive Parallel Versions of Sequential Algorithms Computation Partitioning Preserving Dependences using Synchronization Reduction Computations Overheads 08/25/2011 CS4961

Next Time A Discussion of parallel computing platforms Go over first written homework assignment 08/25/2011 CS4961