CS4230 Parallel Programming Lecture 2: Introduction to Parallel Algorithms. Mary Hall, August 23, 2012.


Homework 1: Parallel Programming Basics
Due before class, Thursday, August 30. Turn in electronically on the CADE machines using the handin program: “handin cs4230 hw1 ”
Problem 1 (from today’s lecture): We can develop a model of the performance behavior of the versions of parallel sum in today’s lecture based on the sequential execution time S, the number of threads T, the parallelization overhead O (fixed for all versions), and the cost B for the barrier or M for each invocation of the mutex. Let N be the number of elements in the list. For version 5, there is some additional work for thread 0 that you should also model using the variables above.
-(a) Using these variables, what is the execution time of the valid parallel versions 2, 3 and 5?
-(b) Present a model of when parallelization is profitable for version 3.
-(c) Discuss how varying T and N affects the relative profitability of versions 3 and 5.

Homework 1: Parallel Programming Basics, cont.
Problem 2 (#1.3 in textbook): Try to write pseudo-code for the tree-structured global sum illustrated in Figure 1.1. Assume the number of cores is a power of two (1, 2, 4, 8, …).
Hints: Use a variable divisor to determine whether a core should send its sum or receive and add. The divisor should start with the value 2 and be doubled after each iteration. Also use a variable core_difference to determine which core should be partnered with the current core. It should start with the value 1 and also be doubled after each iteration. For example, in the first iteration 0 % divisor = 0 and 1 % divisor = 1, so 0 receives and adds while 1 sends. Also in the first iteration 0 + core_difference = 1 and 1 - core_difference = 0, so 0 and 1 are paired in the first iteration.

Homework 1, cont.
Problem 3: What are your goals after this year, and how do you anticipate this class is going to help you with that? Some possible answers are listed below, but please feel free to add to them. Also, please write at least one sentence of explanation.
-A job in the computing industry
-A job in some other industry that uses computing
-As preparation for graduate studies
-To satisfy intellectual curiosity about the future of the computing field
-Other

Today’s Lecture
-Aspects of parallel algorithms (and a hint at complexity!)
-Derive parallel algorithms
-Discussion
Sources for this lecture:
-Slides accompanying the textbook

Reasoning about a Parallel Algorithm
Ignore architectural details for now (next time).
Assume we are starting with a sequential algorithm and trying to modify it to execute in parallel.
-Not always the best strategy, as sometimes the best parallel algorithms are NOTHING like their sequential counterparts
-But useful since you are accustomed to sequential algorithms

Reasoning about a Parallel Algorithm, cont.
Computation Decomposition
-How to divide the sequential computation among parallel threads/processors/computations?
-Aside: also Data Partitioning (ignored today)
Preserving Dependences
-Keeping the data values consistent with respect to the sequential execution.
Overhead
-We’ll talk about some different kinds of overhead.

Race Condition or Data Dependence
A race condition exists when the result of an execution depends on the timing of two or more events.
A data dependence is an ordering on a pair of memory operations that must be preserved to maintain correctness. (More on data dependences in a subsequent lecture.)
Synchronization is used to sequence control among threads or to sequence accesses to data in parallel code.
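To make the definition concrete, here is a minimal illustration (not from the slides) of a race condition in C with POSIX threads: two threads increment a shared counter without synchronization, so the final value depends on how their load/increment/store sequences interleave.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                 /* shared, unprotected */

    static void *work(void *arg) {
        for (int i = 0; i < 1000000; i++)
            counter++;                       /* race: load/increment/store is not atomic */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, NULL);
        pthread_create(&t2, NULL, work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected 2000000)\n", counter);
        return 0;
    }

Compiled with -pthread, runs of this program typically print different values below 2000000; protecting the increment with a mutex (as in Version 2 later in this lecture) removes the race.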

Simple Example (p. 4 of text)
Compute n values and add them together.
Serial solution: see the sketch below.
Parallel formulation?
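For reference, a sketch of the serial solution in the textbook’s style, where Compute_next_value stands in for whatever work produces each value:

    sum = 0;
    for (i = 0; i < n; i++) {
        x = Compute_next_value(…);
        sum += x;
    }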

Version 1: Computation Partitioning
Suppose each core computes a partial sum on n/t consecutive elements (t is the number of threads or processors).
Example: n = 24 and t = 8; threads t0 through t7 each handle 3 consecutive elements.
[Figure: the 24 input values split into 8 blocks, one block per thread t0-t7, with each block's partial sum shown.]

    int block_length_per_thread = n/t;
    int start = id * block_length_per_thread;
    for (i = start; i < start + block_length_per_thread; i++) {
        x = Compute_next_value(…);
        sum += x;                    // every thread updates the shared sum
    }

What Happened?
Dependence on sum across iterations/threads
-But reordering is OK since the operations on sum are associative
Load/increment/store must be done atomically to preserve the sequential meaning
Definitions:
-Atomicity: a set of operations is atomic if either they all execute or none executes. Thus, there is no way to see the results of a partial execution.
-Mutual exclusion: at most one thread can execute the code at any time
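As an illustration (not on the original slide), here is one possible interleaving of two threads executing sum += my_x, written as the load/add/store steps the hardware actually performs; assume sum starts at 0, thread 0 adds 4, and thread 1 adds 3:

    Thread 0                         Thread 1
    load  r0 = sum    (r0 = 0)
                                     load  r1 = sum    (r1 = 0)
    add   r0 = r0 + 4 (r0 = 4)
                                     add   r1 = r1 + 3 (r1 = 3)
    store sum = r0    (sum = 4)
                                     store sum = r1    (sum = 3)

The final value is 3 rather than the sequential result 7: thread 0’s update is lost. Mutual exclusion around the load/add/store sequence rules out such interleavings.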

Version 2: Add Locks
Insert mutual exclusion (a mutex) so that only one thread at a time is loading/incrementing/storing sum, making the update atomic.

    int block_length_per_thread = n/t;
    mutex m;
    int start = id * block_length_per_thread;
    for (i = start; i < start + block_length_per_thread; i++) {
        my_x = Compute_next_value(…);
        mutex_lock(m);               // only one thread at a time performs the update
        sum += my_x;
        mutex_unlock(m);
    }

Correct now. Done?
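The slide’s mutex operations are pseudocode; under POSIX threads (one possible realization, not prescribed by the slide, with a hypothetical helper add_to_sum) the critical section would look roughly like this:

    #include <pthread.h>

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;   /* shared by all threads */
    int sum = 0;                                     /* shared accumulator */

    void add_to_sum(int my_x) {
        pthread_mutex_lock(&m);                      /* critical section: one thread at a time */
        sum += my_x;
        pthread_mutex_unlock(&m);
    }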

Version 3: Increase Granularity
-Lock only to update the final sum from each thread’s private copy.

    int block_length_per_thread = n/t;
    mutex m;
    int my_sum = 0;                  // private partial sum
    int start = id * block_length_per_thread;
    for (i = start; i < start + block_length_per_thread; i++) {
        my_x = Compute_next_value(…);
        my_sum += my_x;
    }
    mutex_lock(m);                   // one lock/unlock per thread, not per element
    sum += my_sum;
    mutex_unlock(m);

Version 4: Eliminate lock
Version 4 (bottom of page 4 in the textbook):
-“Master” processor accumulates the result.

    int block_length_per_thread = n/t;
    shared my_sum[t];                // one slot per thread, no mutex needed
    int start = id * block_length_per_thread;
    for (i = start; i < start + block_length_per_thread; i++) {
        my_x = Compute_next_value(…);
        my_sum[id] += my_x;
    }
    if (id == 0) {                   // master thread
        sum = my_sum[0];
        for (i = 1; i < t; i++)
            sum += my_sum[i];
    }

Correct? Why not?

More Synchronization: Barriers
Version 4 is incorrect if the master thread begins accumulating the final result before the other threads are done.
How can we force the master to wait until the threads are ready?
Definition:
-A barrier is used to block threads from proceeding beyond a program point until all of the participating threads have reached the barrier.
-Implementation of barriers? (One possible sketch appears below.)
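One possible answer to the implementation question, given here as a sketch rather than as the lecture’s answer: a counter-based barrier built from a mutex and a condition variable. (POSIX threads also provide pthread_barrier_t directly.)

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  all_here;
        int count;        /* threads that have arrived in the current round */
        int nthreads;     /* threads participating in the barrier */
        int generation;   /* distinguishes successive uses of the barrier */
    } barrier_t;

    void barrier_init(barrier_t *b, int nthreads) {
        pthread_mutex_init(&b->lock, NULL);
        pthread_cond_init(&b->all_here, NULL);
        b->count = 0;
        b->nthreads = nthreads;
        b->generation = 0;
    }

    void barrier_wait(barrier_t *b) {
        pthread_mutex_lock(&b->lock);
        int my_generation = b->generation;
        if (++b->count == b->nthreads) {            /* last thread to arrive */
            b->count = 0;
            b->generation++;
            pthread_cond_broadcast(&b->all_here);   /* release everyone else */
        } else {
            while (my_generation == b->generation)  /* loop guards against spurious wakeups */
                pthread_cond_wait(&b->all_here, &b->lock);
        }
        pthread_mutex_unlock(&b->lock);
    }

With such a barrier, every thread calls barrier_wait before the master accumulates, which is exactly what Version 5 on the next slide does with Synchronize_cores().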

Version 5: Eliminate lock, but add barrier
Version 5 (bottom of page 4 in the textbook):
-“Master” processor accumulates the result.

    int block_length_per_thread = n/t;
    shared my_sum[t];
    int start = id * block_length_per_thread;
    for (i = start; i < start + block_length_per_thread; i++) {
        my_x = Compute_next_value(…);
        my_sum[id] += my_x;          // each thread updates its own slot
    }
    Synchronize_cores();             // barrier for all participating threads
    if (id == 0) {                   // master thread
        sum = my_sum[0];
        for (i = 1; i < t; i++)
            sum += my_sum[i];
    }

Now it’s correct!

Version 6 (homework): Multiple cores forming a global sum

How do we write parallel programs?
Task parallelism
-Partition the various tasks carried out in solving the problem among the cores.
Data parallelism
-Partition the data used in solving the problem among the cores.
-Each core carries out similar operations on its part of the data.

Professor P: 15 questions, 300 exams.

Professor P’s grading assistants: TA#1, TA#2, TA#3.

Division of work – data parallelism: each of TA#1, TA#2, and TA#3 grades 100 of the 300 exams.

Division of work – task parallelism: each of TA#1, TA#2, and TA#3 grades a different subset of the questions across all of the exams.
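As a rough code analogue of the grading example (the names here, such as grade_question, are hypothetical and not from the slides), the two decompositions differ only in which loop is split across the t graders:

    void grade_question(int exam, int question);   /* hypothetical helper */

    /* Data parallelism: each grader handles a contiguous block of whole exams. */
    void grade_data_parallel(int id, int t, int n_exams, int n_questions) {
        int block = n_exams / t;                   /* assume t divides n_exams */
        for (int e = id * block; e < (id + 1) * block; e++)
            for (int q = 0; q < n_questions; q++)
                grade_question(e, q);
    }

    /* Task parallelism: each grader handles a distinct subset of the questions
       on every exam. */
    void grade_task_parallel(int id, int t, int n_exams, int n_questions) {
        for (int e = 0; e < n_exams; e++)
            for (int q = id; q < n_questions; q += t)
                grade_question(e, q);
    }

In the lecture’s sum example, Versions 1 through 5 are data parallel: each thread applies the same operation to its own block of the input.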

Summary of Lessons from the Sum Computation
The sum computation had a race condition or data dependence. We used mutex and barrier synchronization to guarantee correct execution. We performed mostly local computation to increase parallelism granularity across threads.
What were the overheads we saw with this example?
-Extra code to determine portion of computation
-Locking overhead: inherent cost plus contention
-Load imbalance

Data and Task Parallelism: Discussion Problem 1
Group Members:
Problem 1: Recall the example of building a house from the first lecture.
-(a) Identify a portion of home building that can employ data parallelism, where “data” in this context is any object used as an input to the home-building process, as opposed to tools that can be thought of as processing resources.
-(b) Identify task parallelism in home building by defining a set of tasks. Work out a schedule that shows when the various tasks can be performed.
-(c) Describe how task and data parallelism can be combined in building a home. What computations can be reassigned to different workers to balance the load?
Describe solution:
-(a)
-(b)
-(c)

Data and Task Parallelism: Discussion Problem 2
Group Members:
Problem 2: I recently had to tabulate results from a written survey that had four categories of respondents: (I) students; (II) academic professionals; (III) industry professionals; and (IV) other. The number of respondents in each category was very different; for example, there were far more students than any other category. The respondents selected the category to which they belonged and then answered 32 questions with five possible responses: (i) strongly agree; (ii) agree; (iii) neutral; (iv) disagree; and (v) strongly disagree. My family members and I tabulated the results “in parallel” (assume there were four of us).
-(a) Identify how data parallelism can be used to tabulate the results of the survey. Keep in mind that each individual survey is on a separate sheet of paper that only one “processor” can examine at a time. Identify scenarios that might lead to load imbalance with a purely data parallel scheme.
-(b) Identify how task parallelism and combined task and data parallelism can be used to tabulate the results of the survey to improve upon the load imbalance you have identified.
Describe solution:
-(a)
-(b)