Lecture 4 TTH 03:30AM-04:45PM Dr. Jianjun Hu CSCE569 Parallel Computing University of South Carolina Department of Computer Science and Engineering

Outline: parallel programming model and algorithm design; the task/channel model; algorithm design methodology; case studies

Task/Channel Model: a parallel computation is a set of tasks. A task consists of a program, local memory, and a collection of I/O ports. Tasks interact by sending messages through channels.

Task/Channel Model (figure): tasks connected by channels.

Foster’s Design Methodology: partitioning, communication, agglomeration, mapping

Foster’s Methodology

Partitioning: dividing the computation and data into pieces. Domain decomposition: divide the data into pieces, then determine how to associate computations with the data. Functional decomposition: divide the computation into pieces, then determine how to associate data with the computations.
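
As a concrete illustration (not part of the original slides), here is a minimal C sketch of a block domain decomposition: each task owns a contiguous range of an n-element data domain, and the computation is associated with the data it owns. The constants N and P and the printed ranges are illustrative assumptions.

#include <stdio.h>

#define N 100   /* total number of data items (assumed for illustration) */
#define P 4     /* number of primitive tasks (assumed for illustration)  */

int main(void) {
    /* Block domain decomposition: task `id` owns indices [lo, hi). */
    for (int id = 0; id < P; id++) {
        int lo = id * N / P;          /* first index owned by task id  */
        int hi = (id + 1) * N / P;    /* one past the last owned index */
        printf("task %d owns data elements [%d, %d)\n", id, lo, hi);
    }
    return 0;
}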

Example Domain Decompositions

Example Functional Decomposition

Partitioning Checklist: there are at least 10x more primitive tasks than processors in the target computer; redundant computations and redundant data storage are minimized; primitive tasks are roughly the same size; the number of tasks is an increasing function of problem size.

Communication: determine the values passed among tasks. Local communication: a task needs values from a small number of other tasks; create channels illustrating the data flow. Global communication: a significant number of tasks contribute data to perform a computation; don’t create channels for them early in the design.
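
As an aside (not from the slides), a local-communication channel maps naturally onto a matched MPI send/receive pair. A minimal sketch, assuming each process simply forwards its rank to its right-hand neighbor:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, value = -1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* "channel" out to the right-hand neighbor, if there is one */
    if (rank + 1 < size)
        MPI_Send(&rank, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
    /* "channel" in from the left-hand neighbor, if there is one */
    if (rank > 0)
        MPI_Recv(&value, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    printf("process %d received %d from its neighbor\n", rank, value);
    MPI_Finalize();
    return 0;
}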

Communication Checklist: communication operations are balanced among tasks; each task communicates with only a small group of neighbors; tasks can perform their communications concurrently; tasks can perform their computations concurrently.

Agglomeration: grouping tasks into larger tasks. Goals: improve performance, maintain scalability of the program, simplify programming. In MPI programming, the goal is often to create one agglomerated task per processor.

Agglomeration Can Improve Performance: eliminate communication between primitive tasks agglomerated into a consolidated task; combine groups of sending and receiving tasks.

Agglomeration Checklist: locality of the parallel algorithm has increased; replicated computations take less time than the communications they replace; data replication doesn’t affect scalability; agglomerated tasks have similar computational and communication costs; the number of tasks increases with problem size; the number of tasks is suitable for likely target systems; the tradeoff between agglomeration and code modification costs is reasonable.

Mapping: the process of assigning tasks to processors. On a centralized multiprocessor, mapping is done by the operating system; on a distributed memory system, mapping is done by the user. The conflicting goals of mapping: maximize processor utilization and minimize interprocessor communication.
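
For illustration (not on the slide), two common static mappings of n data items onto p processes, written as small C helpers. The function names are mine, and the block formula follows the style of the block-decomposition macros that appear later in Quinn’s text; treat this as a sketch rather than the lecture’s definition.

#include <stdio.h>

static int block_owner(int i, int p, int n)  { return (p * (i + 1) - 1) / n; }
static int cyclic_owner(int i, int p)        { return i % p; }

int main(void) {
    int n = 10, p = 3;   /* assumed sizes for illustration */
    for (int i = 0; i < n; i++)
        printf("item %2d -> block owner %d, cyclic owner %d\n",
               i, block_owner(i, p, n), cyclic_owner(i, p));
    return 0;
}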

Mapping Example

Optimal Mapping Finding optimal mapping is NP-hard Must rely on heuristics

Mapping Decision Tree
  Static number of tasks
    Structured communication
      Constant computation time per task: agglomerate tasks to minimize communication and create one task per processor
      Variable computation time per task: cyclically map tasks to processors
    Unstructured communication: use a static load-balancing algorithm
  Dynamic number of tasks: see the next slide

Mapping Strategy
  Static number of tasks: as in the decision tree above
  Dynamic number of tasks
    Frequent communication between tasks: use a dynamic load-balancing algorithm
    Many short-lived tasks: use a run-time task-scheduling algorithm

Mapping Checklist: designs based on one task per processor and on multiple tasks per processor have been considered; static and dynamic task allocation have been evaluated; if dynamic task allocation is chosen, the task allocator is not a performance bottleneck; if static task allocation is chosen, the ratio of tasks to processors is at least 10:1.

Case Studies: boundary value problem, finding the maximum, the n-body problem, adding data input

Boundary Value Problem (figure): a thin rod surrounded by insulation, with both ends held in ice water.

Rod Cools as Time Progresses

Finite Difference Approximation
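
The finite-difference formula itself is in the slide’s figure, which this transcript does not reproduce. For reference, and stated here as an assumption, the standard explicit scheme for the 1-D heat equation behind this rod example, with grid spacing $h$, time step $k$, and $u_{i,j}$ the temperature at grid point $i$ and time step $j$, is

$$u_{i,j+1} = r\,u_{i-1,j} + (1 - 2r)\,u_{i,j} + r\,u_{i+1,j}, \qquad r = \frac{k}{h^2}.$$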

Partitioning: one data item per grid point; associate one primitive task with each grid point; this gives a two-dimensional domain decomposition.

Communication: identify the communication pattern between primitive tasks. Each interior primitive task has three incoming and three outgoing channels.
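
A hedged sketch (not from the lecture) of how these channels typically look in MPI after agglomeration: each process keeps ghost cells for its left and right neighbors and exchanges them once per time step before applying the local update. The array sizes, the constant r, and the variable names are illustrative assumptions.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, size, local_n = 100;        /* points per process (assumed) */
    double r = 0.25;                      /* k/h^2, assumed stable value  */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* u[1..local_n] are owned points; u[0] and u[local_n+1] are ghosts */
    double *u    = calloc(local_n + 2, sizeof(double));
    double *unew = calloc(local_n + 2, sizeof(double));
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* one time step: exchange ghost cells over the neighbor channels ... */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[local_n], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... then apply the local finite-difference update */
    for (int i = 1; i <= local_n; i++)
        unew[i] = r * u[i - 1] + (1 - 2 * r) * u[i] + r * u[i + 1];

    free(u); free(unew);
    MPI_Finalize();
    return 0;
}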

Agglomeration and Mapping (figure)

Sequential Execution Time: let $\chi$ be the time to update one element, $n$ the number of elements, and $m$ the number of iterations. Sequential execution time: $m(n-1)\chi$.

Parallel Execution Time: let $p$ be the number of processors and $\lambda$ the message latency. Parallel execution time: $m\left(\chi\lceil (n-1)/p \rceil + 2\lambda\right)$.
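
A worked example with purely illustrative numbers (not from the lecture): take $n-1 = 10^4$ elements, $m = 10^3$ iterations, $\chi = 1\,\mu\text{s}$, $\lambda = 100\,\mu\text{s}$, and $p = 10$. Then the sequential time is $m(n-1)\chi = 10^3 \cdot 10^4 \cdot 1\,\mu\text{s} = 10\ \text{s}$, while the parallel time is $m\left(\chi\lceil (n-1)/p \rceil + 2\lambda\right) = 10^3 \cdot (1000 + 200)\,\mu\text{s} = 1.2\ \text{s}$, a speedup of roughly 8.3; the $2\lambda$ latency term paid every iteration is what keeps the speedup below $p$.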

Finding the Maximum Error (figure): 6.25%

Reduction: given an associative operator $\oplus$, compute $a_0 \oplus a_1 \oplus a_2 \oplus \dots \oplus a_{n-1}$. Examples: add; multiply; and, or; maximum, minimum.
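
In MPI this pattern is provided directly by MPI_Reduce; a minimal sketch follows, with the contributed values made up for illustration. Other predefined operators include MPI_PROD, MPI_MAX, MPI_MIN, MPI_LAND, and MPI_LOR.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    value = rank + 1;                         /* local value contributed by this task */

    /* combine all contributions with the associative operator MPI_SUM */
    MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %d\n", sum);     /* 1 + 2 + ... + p */
    MPI_Finalize();
    return 0;
}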

Parallel Reduction Evolution

Binomial Trees: a binomial tree is a subgraph of the hypercube.
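
To make the tree concrete, here is a hand-rolled sketch (not from the lecture) of a binomial-tree global sum: at each step a process either sends its partial sum to the partner whose rank differs in one bit and drops out, or receives from that partner and accumulates. The guard on the partner rank lets it work when the process count is not a power of two; each process contributes rank+1 purely for illustration.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, sum;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    sum = rank + 1;                            /* local contribution */

    /* At step `mask`, a process whose rank has that bit set sends its
       partial sum to the partner with the bit cleared and drops out;
       otherwise it receives from that partner, if the partner exists. */
    for (int mask = 1; mask < size; mask <<= 1) {
        if (rank & mask) {
            MPI_Send(&sum, 1, MPI_INT, rank ^ mask, 0, MPI_COMM_WORLD);
            break;
        } else if ((rank ^ mask) < size) {
            int other;
            MPI_Recv(&other, 1, MPI_INT, rank ^ mask, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += other;
        }
    }
    if (rank == 0)
        printf("global sum = %d\n", sum);      /* 1 + 2 + ... + size */
    MPI_Finalize();
    return 0;
}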

Finding Global Sum (sequence of figure slides): partial sums are combined pairwise along a binomial tree, halving the number of active tasks at each step until a single task holds the global sum.

Agglomeration (figure slides): each task first sums its own block of values; the partial sums are then combined with a parallel reduction.

The n-body Problem

Partitioning: domain partitioning; assume one task per particle; each task holds its particle’s position and velocity vector. Each iteration: get the positions of all the other particles, then compute the new position and velocity.
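
After agglomeration (several particles per process), the "get the positions of all other particles" step is exactly an all-gather. A minimal MPI sketch; the particle count per process and the placeholder coordinates are illustrative assumptions.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, size, local_n = 8;             /* particles per process (assumed) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *mine = malloc(3 * local_n * sizeof(double));
    double *all  = malloc(3 * local_n * size * sizeof(double));
    for (int i = 0; i < 3 * local_n; i++)
        mine[i] = rank + 0.001 * i;          /* placeholder coordinates */

    /* every process ends up with every particle's position */
    MPI_Allgather(mine, 3 * local_n, MPI_DOUBLE,
                  all,  3 * local_n, MPI_DOUBLE, MPI_COMM_WORLD);

    /* ... compute new velocities and positions from `all` here ... */
    free(mine); free(all);
    MPI_Finalize();
    return 0;
}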

Gather

All-gather

Complete Graph for All-gather

Hypercube for All-gather

Communication Time (figure): timing expressions for the all-gather on a hypercube versus a complete graph.
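
The expressions themselves are in the figure and are not reproduced in the transcript. Under the usual model, with message latency $\lambda$, bandwidth $\beta$, and $n$ data items spread over $p$ processes, the standard estimates (stated here as an assumption, in the spirit of the analysis in Quinn’s text) are: hypercube all-gather $\approx \lambda \log p + \frac{n(p-1)}{\beta p}$; complete-graph all-gather $\approx (p-1)\left(\lambda + \frac{n}{\beta p}\right)$.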

Adding Data Input

Scatter

Scatter in log p Steps
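
For the data-input step, MPI_Scatter deals one block of the input to each process, and tree-based implementations typically do so in about log p communication steps, matching the slide title. A minimal sketch; the chunk size and the fabricated input values are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, size, chunk = 4;               /* elements per process (assumed) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *full = NULL;
    int *part = malloc(chunk * sizeof(int));
    if (rank == 0) {                          /* only the root holds all input */
        full = malloc(chunk * size * sizeof(int));
        for (int i = 0; i < chunk * size; i++)
            full[i] = i;                      /* stand-in for file input */
    }

    /* deal one chunk-sized block of the input to each process */
    MPI_Scatter(full, chunk, MPI_INT, part, chunk, MPI_INT, 0, MPI_COMM_WORLD);
    printf("process %d got elements %d..%d\n", rank, part[0], part[chunk - 1]);

    free(part);
    if (rank == 0) free(full);
    MPI_Finalize();
    return 0;
}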

Summary: Task/Channel Model. A parallel computation is a set of tasks interacting through channels. Good designs maximize local computations, minimize communications, and scale up.

Summary: Design Steps. Partition the computation, agglomerate tasks, and map tasks to processors. Goals: maximize processor utilization, minimize inter-processor communication.

Summary: Fundamental Algorithms. Reduction; gather and scatter; all-gather.