High-Performance Computing 12.2: Parallel Algorithms.

Some Problems Are "Embarrassingly Parallel"

Embarrassingly Parallel (C. Moler): a problem in which the same operation is applied to all elements (e.g., of a grid), with little or no communication among elements.

Example: in Matlab, grid(rand(n) < probAnts) is inherently parallel in this way (and Matlab has begun to exploit this).
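
A rough Python/NumPy equivalent of the Matlab one-liner above (the names n and prob_ants are just placeholders carried over from the slide): the whole grid is produced by one element-wise operation, with no loops and no communication between elements.

```python
import numpy as np

n = 100          # grid dimension
prob_ants = 0.1  # probability that a cell starts out occupied

# One element-wise operation over the whole grid: every cell is
# computed independently, which is what makes it embarrassingly parallel.
grid = np.random.rand(n, n) < prob_ants
```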

The Parallel Data Partition (Master/Slave) Approach

Master process communicates with the user and with the slaves:
1. Partition the data into p chunks and send each chunk to one of the p slave processes.
2. Receive partial answers from the slaves.
3. Put the partial answers together into a single answer (grid, sum, etc.) and report it to the user.

The Parallel Data Partition (Master/Slave) Approach

Slave processes communicate with the master:
1. Receive one data chunk from the master.
2. Run the algorithm (grid initialization, update, sum, etc.) on the chunk.
3. Send the result back to the master.
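
A minimal sketch of this pattern in Python, using the standard multiprocessing module in place of a real message-passing library such as MPI; the data set, the chunking, and the per-chunk sum are illustrative assumptions, not the lecture's actual code.

```python
from multiprocessing import Pool

def slave(chunk):
    """Slave: receive one chunk, run the algorithm on it, return the result."""
    return sum(chunk)  # stand-in for grid initialization, update, etc.

if __name__ == "__main__":
    data = list(range(256))  # the full data set held by the master
    p = 4                    # number of slave processes

    # Master step 1: partition the data into p contiguous chunks.
    k = (len(data) + p - 1) // p                       # chunk size, ceil(n / p)
    chunks = [data[i:i + k] for i in range(0, len(data), k)]

    # Master step 2: send the chunks out and receive partial answers.
    with Pool(processes=p) as pool:
        partials = pool.map(slave, chunks)

    # Master step 3: combine the partial answers into a single answer.
    print("total =", sum(partials))  # 32640
```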

Parallel Data Partition: Speedup

Ignoring communication time, the theoretical speedup for n data items running on p slave processes is

S(n, p) = n / ⌈n/p⌉

since each slave handles a chunk of ⌈n/p⌉ items. As n grows large, this value approaches p (homework problem): speedup is linear in the number of slaves.
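
A quick numeric check of this formula (assuming the reconstruction above; the function name and sample values are illustrative):

```python
from math import ceil

def speedup(n, p):
    """Ideal speedup for n items on p slaves, ignoring communication."""
    return n / ceil(n / p)

print(speedup(100, 8))    # 7.69...: each slave gets 13 items, close to p = 8
print(speedup(10**6, 8))  # 8.0: the speedup approaches p as n grows
```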

Simple Parallel Data Partition Is Inefficient

Communication time: the master has to send out p messages and receive p replies one at a time, like dealing a deck of cards.

Idle time: the master and some slaves sit idle until all slaves have computed their results.

The Divide-and-Conquer Approach

We can arrange things so that each process sends/receives only two messages: less communication, less idling.

Think of dividing a pile of coins in half, then dividing each pile in half, until each pile has only one coin (or some fixed minimum number of coins).

Example: summing 256 values with 8 processors...

"Divide" Phase

Level 0: p0: x0…x255
Level 1: p0: x0…x127 | p4: x128…x255
Level 2: p0: x0…x63 | p2: x64…x127 | p4: x128…x191 | p6: x192…x255
Level 3: p0: x0…x31 | p1: x32…x63 | p2: x64…x95 | p3: x96…x127 | p4: x128…x159 | p5: x160…x191 | p6: x192…x223 | p7: x224…x255

"Conquer" Phase (partial sums combine pairwise back up the tree)

Level 3: p0: x0…x31 | p1: x32…x63 | p2: x64…x95 | p3: x96…x127 | p4: x128…x159 | p5: x160…x191 | p6: x192…x223 | p7: x224…x255
Level 2: p0: x0…x63 | p2: x64…x127 | p4: x128…x191 | p6: x192…x255
Level 1: p0: x0…x127 | p4: x128…x255
Level 0: p0: x0…x255
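
A minimal sketch of this recursive-halving pattern, written here as a serial Python simulation; in a real message-passing program each split and each combine would be a single message between a pair of processes.

```python
def dc_sum(values, min_chunk=32):
    """Divide-and-conquer sum: split in half until chunks are small,
    then combine the partial results pairwise on the way back up."""
    if len(values) <= min_chunk:               # pile has few enough coins
        return sum(values)                     # leaf: compute directly
    mid = len(values) // 2
    left = dc_sum(values[:mid], min_chunk)     # would run on process i
    right = dc_sum(values[mid:], min_chunk)    # would run on process i + stride
    return left + right                        # conquer: one message upward

print(dc_sum(list(range(256))))  # 32640, matching the direct sum
```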

Parallel Random Number Generator

Recall our linear congruential formula for generating random numbers; e.g., with multiplier 7, modulus 11, and seed 1:

r(i+1) = (7 * r(i)) mod 11

which yields the sequence r = 1, 7, 5, 2, 3, 10, 4, 6, 9, 8, ...

We'd like to have each of our p processors generate its share of the numbers; then the master can string them together.

Problem: each processor will produce the same sequence!

Parallel Random Number Generator

"Interleaving" Trick: for p processors,
1. Generate the first p numbers in the sequence: e.g., for p = 2, get 1, 7. These become the seeds, one for each processor.
2. To get the new multiplier, raise the old multiplier to the power p and mod the result with the modulus: e.g., for p = 2, 7^2 = 49; 49 mod 11 = 5. This becomes the multiplier for all processors.
3. The master interleaves the results from the slaves.

Parallel Random Number Generator

p0: 1, 5, 3, 4, 9
p1: 7, 2, 10, 6, 8

Interleaved: 1, 7, 5, 2, 3, 10, 4, 6, 9, 8, ... which is exactly the original sequence.
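
A small Python sketch of this trick using the toy generator above (multiplier 7, modulus 11): each processor steps with the new multiplier 7^p mod 11 = 5, and the master interleaves the streams.

```python
def lcg(seed, multiplier, modulus, count):
    """Generate `count` values of r(i+1) = multiplier * r(i) mod modulus."""
    out, r = [], seed
    for _ in range(count):
        out.append(r)
        r = (r * multiplier) % modulus
    return out

m, a, p = 11, 7, 2               # modulus, multiplier, processor count
seeds = lcg(1, a, m, p)          # first p numbers: [1, 7], one seed each
a_p = pow(a, p, m)               # new multiplier: 7**2 % 11 = 5

streams = [lcg(s, a_p, m, 5) for s in seeds]
print(streams[0])                # [1, 5, 3, 4, 9]
print(streams[1])                # [7, 2, 10, 6, 8]

# Master interleaves the slave streams back into the original sequence.
interleaved = [r for pair in zip(*streams) for r in pair]
print(interleaved)               # [1, 7, 5, 2, 3, 10, 4, 6, 9, 8]
```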

The N-Body Problem

Consider a large number N of bodies interacting with each other through gravity (e.g., a galaxy of stars).

As time progresses, each body moves based on the gravitational forces acting on it from all the others:

F = G * m1 * m2 / r^2

where m1, m2 are the masses of two objects, r is the distance between them, and G is Newton's Gravitational Constant.

The N-Body Problem

Problem: for each of the N bodies, we must compute the force between it and the other N - 1 bodies. By symmetry this is N*(N-1)/2 = (N^2 - N)/2 force computations, which is proportional to N^2 as N grows large.

Even with perfect parallelism, each processor still performs (1/p) * N^2 computations; i.e., the work is still proportional to N^2.
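
A sketch of the brute-force O(N^2) computation in Python (2D, with unit masses, G = 1, and random positions purely for illustration), exploiting the symmetry F(i,j) = -F(j,i) so each pair is visited once:

```python
import random

G = 1.0  # gravitational constant in natural units (illustrative)

def brute_force(pos, mass):
    """Net gravitational force on each body; visits all N*(N-1)/2 pairs."""
    n = len(pos)
    fx, fy = [0.0] * n, [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):              # each pair counted once
            dx, dy = pos[j][0] - pos[i][0], pos[j][1] - pos[i][1]
            r2 = dx * dx + dy * dy
            r = r2 ** 0.5
            f = G * mass[i] * mass[j] / r2     # F = G m1 m2 / r^2
            fx[i] += f * dx / r; fy[i] += f * dy / r
            fx[j] -= f * dx / r; fy[j] -= f * dy / r  # Newton's third law
    return list(zip(fx, fy))

bodies = [(random.random(), random.random()) for _ in range(100)]
masses = [1.0] * len(bodies)
forces = brute_force(bodies, masses)
```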

The N-Body Problem: Barnes-Hut Solution

Division by r^2 means that bodies distant from each other exert relatively little force on one another. So we can apply the formula exactly only within small clusters of stars, and then treat each cluster as a single body acting on the other clusters.

This is another instance of divide-and-conquer. We will treat space as 2D for illustration purposes.

Divide

[Two figure slides: the 2D region of space is recursively subdivided into quadrants until each cell contains at most one body.]

Build Labelled Hierarchy of Clusters

[Three figure slides: the clusters produced by the subdivision are labelled a, then b, c, d, then e, building up the hierarchy level by level.]

Produces a Quad-Tree

[Figure: the quad-tree, with internal nodes a, b, c, d, e and the individual bodies at the leaves.]

Each node (circle) stores the total mass and center-of-mass coordinates for its members. If two nodes (e.g., bodies 1 and 5) are more than some predetermined distance apart, we use their clusters instead (1 and e): the whole cluster is approximated by a single point mass at its center of mass.
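
A compact Python sketch of the idea (2D, with illustrative names throughout; a real implementation would also handle bodies at identical positions and choose the opening threshold THETA more carefully):

```python
import random

G, THETA = 1.0, 0.5  # gravitational constant and opening threshold (assumed)

class Node:
    """One square cell of the quad-tree."""
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half   # center and half-width
        self.mass, self.mx, self.my = 0.0, 0.0, 0.0  # total mass, weighted position
        self.body = None                             # single (x, y, m) if leaf
        self.kids = None                             # four children if subdivided

    def insert(self, x, y, m):
        if self.kids is None and self.body is None:  # empty leaf
            self.body = (x, y, m)
        else:
            if self.kids is None:                    # occupied leaf: subdivide
                self.kids = [Node(self.cx + dx * self.half / 2,
                                  self.cy + dy * self.half / 2,
                                  self.half / 2)
                             for dx in (-1, 1) for dy in (-1, 1)]
                bx, by, bm = self.body
                self.body = None
                self._child(bx, by).insert(bx, by, bm)
            self._child(x, y).insert(x, y, m)
        self.mass += m                               # update cluster totals
        self.mx += m * x
        self.my += m * y

    def _child(self, x, y):
        return self.kids[2 * (x >= self.cx) + (y >= self.cy)]

    def force(self, x, y, m):
        """Approximate force on body (x, y, m) from this cell."""
        if self.mass == 0:
            return (0.0, 0.0)
        dx = self.mx / self.mass - x                 # offset to center of mass
        dy = self.my / self.mass - y
        r = (dx * dx + dy * dy) ** 0.5
        if r == 0:                                   # the body's own leaf
            return (0.0, 0.0)
        # If the cell is far away (or holds a single body), treat it as one
        # point mass at its center of mass; otherwise descend into children.
        if self.kids is None or self.half * 2 / r < THETA:
            f = G * m * self.mass / (r * r)
            return (f * dx / r, f * dy / r)
        fx = fy = 0.0
        for kid in self.kids:
            kfx, kfy = kid.force(x, y, m)
            fx += kfx; fy += kfy
        return (fx, fy)

root = Node(0.5, 0.5, 0.5)                           # unit square
bodies = [(random.random(), random.random(), 1.0) for _ in range(1000)]
for x, y, m in bodies:
    root.insert(x, y, m)
fx, fy = root.force(*bodies[0])  # visits ~log N cells instead of N bodies
```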

Barnes-Hut Solution: Speedup

Each of the N bodies is on average compared with log N other nodes, so instead of N^2 we perform on the order of N * log N force computations.

Each processor can compute the movement of its own cluster in parallel with the others (no communication needed).