Department of Computer Science, Johns Hopkins University
Lecture 7: Finding Concurrency
EN 600.320/420
Instructor: Randal Burns
26 February 2014

Lecture 7: Finding Concurrency
Big Picture
Recipes (patterns) for turning a problem into a parallel program in four steps
– Find concurrency
– Choose algorithmic structure
– Identify data structures
– Implement
Based on an analysis of the problem domain
– And a comparison of the effectiveness of different patterns

Lecture 7: Finding Concurrency
First Questions
Should I solve this problem with a parallel program?
– Implementation time versus solution time
– Does the problem parallelize?
Identify the computationally expensive parts
– Parallelism is for a reason: to improve performance
– Only optimize the expensive parts
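
A rough guide for "only optimize the expensive parts" (Amdahl's law; not stated on the slide, but the standard way to quantify this): if a fraction p of the runtime parallelizes over N PEs, then speedup = 1 / ((1 - p) + p/N). For example, with p = 0.8 and N = 8 PEs, speedup = 1 / (0.2 + 0.1) ≈ 3.3, and no number of PEs can beat 1 / 0.2 = 5x; the serial 20% dominates, so the expensive, parallelizable parts are where the effort pays off.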

Lecture 7: Finding Concurrency
Three Steps to Finding Concurrency
– Decomposition (into tasks and/or data)
– Dependency analysis (group tasks, order tasks, data sharing)
– Design evaluation
Consider all options at each stage
And iterate among the steps

Lecture 7: Finding Concurrency
What's a good decomposition?
Flexible
– Independent of architecture, programming language, problem size, etc.
Efficient
– Scalable: generates enough tasks to keep PEs busy
– Each task represents enough work to amortize startup
– Tasks are minimally dependent (often competes with scalability)
Simple
– Can be implemented, debugged, maintained, and reused

Lecture 7: Finding Concurrency
Task Decomposition
Divide the problem into groups of operations/instructions
– Most natural/intuitive approach
Identify tasks
– Tasks should be independent or nearly independent
– Recall, tasks are groups of operations/instructions
– Consider all tasks sequentially
Idioms for identifying tasks
– Loops
– Functions (with no side effects) = functional decomposition
– Higher-level concepts (not software-derived), e.g. trajectories in medical imaging
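
A minimal sketch of the loop idiom: when iterations are independent, each iteration is a task. The kernel (a hypothetical brightness adjustment) and names are illustrative, not from the lecture.

```c
/* Sketch of loop-based task decomposition: each loop iteration is an
 * independent task, so the loop parallelizes directly. */
#include <stdio.h>

void brighten(float *pixels, long n, float gain) {
    /* Iteration i touches only pixels[i], so iterations are independent tasks. */
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        pixels[i] *= gain;
}

int main(void) {
    float img[8] = { 0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f };
    brighten(img, 8, 1.5f);
    printf("%f\n", img[0]);   /* 0.150000 */
    return 0;
}
```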

Lecture 7: Finding Concurrency
Data Decomposition
Divide the problem based on partitioning, distributing, or replicating the data
Works well when:
– Compute-intensive work manipulates a large data structure
– Similar operations are applied to different parts of the data (SPMD)
Idioms for data decomposition
– Sequential (arrays) or spatial division
– Recursive division of data
– Clusters
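
As a sketch of the array-division idiom, a 1D block partition assigns each worker a contiguous index range; the helper and worker count here are illustrative, not from the lecture.

```c
/* Sketch: 1D block partition of an n-element array across P workers.
 * Worker p owns indices [lo, hi); the remainder is spread over the first workers. */
#include <stdio.h>

void block_range(long n, int P, int p, long *lo, long *hi) {
    long base = n / P, rem = n % P;
    *lo = p * base + (p < rem ? p : rem);
    *hi = *lo + base + (p < rem ? 1 : 0);
}

int main(void) {
    long lo, hi;
    for (int p = 0; p < 4; p++) {          /* e.g., n = 10 elements over P = 4 workers */
        block_range(10, 4, p, &lo, &hi);
        printf("worker %d owns [%ld, %ld)\n", p, lo, hi);
    }
    return 0;
}
```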

Lecture 7: Finding Concurrency
Tasks or Data or Both
Decompositions are not independent
– A task decomposition derives a data decomposition
– A data decomposition implies a task decomposition
Iteration leads to hybrid designs
– Not purely either
– For embarrassingly parallel problems, task and data decompositions are identical. Why?

Lecture 7: Finding Concurrency
Example Problem
Streaming surface reconstruction
Tasks = solve the Poisson equation in each cell

Lecture 7: Finding Concurrency
Example Problem: Tasks
Multiple streaming passes
– Could decompose by pass
– Limited parallelism and sequential data dependencies

Lecture 7: Finding Concurrency
Example Problem: Tasks II
Iterations of the solver
– Same problems

Lecture 7: Finding Concurrency
Example Problem: Data
Quad tree
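
For intuition only, a toy recursive quad-tree subdivision of a square 2D domain, where each leaf becomes one data partition; this is an illustration, not the lecture's actual streaming quad-tree.

```c
/* Sketch: recursive quad-tree decomposition of a unit square into leaf cells. */
#include <stdio.h>

typedef struct { double x, y, size; } Cell;

/* Subdivide until cells reach a target size; each leaf is a data partition. */
void decompose(Cell c, double leaf_size, int depth) {
    if (c.size <= leaf_size) {
        printf("%*sleaf at (%.2f, %.2f), size %.2f\n", 2 * depth, "", c.x, c.y, c.size);
        return;
    }
    double h = c.size / 2.0;
    Cell kids[4] = {
        { c.x,     c.y,     h }, { c.x + h, c.y,     h },
        { c.x,     c.y + h, h }, { c.x + h, c.y + h, h }
    };
    for (int i = 0; i < 4; i++)
        decompose(kids[i], leaf_size, depth + 1);
}

int main(void) {
    Cell root = { 0.0, 0.0, 1.0 };
    decompose(root, 0.25, 0);   /* 16 leaves: each becomes a task's data */
    return 0;
}
```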

Lecture 7: Finding Concurrency
Example Problem: Decomposition
Data decomposition: a hierarchy of streams
– Replicate the highest-level streams
– Partition the lower-level streams

Lecture 7: Finding Concurrency
Example Problem: Comments
End up with tasks defined by data
– Update the solution in each partition
Multiple parallel programs
– One for each phase in the processing pipeline

Lecture 7: Finding Concurrency
Dependency Analysis
Help! I decomposed my problem and the tasks are not independent.
How does the decomposed data depend on each other?
How do tasks depend on each other?

Lecture 7: Finding Concurrency
Dependency Analysis
Help! I decomposed my problem and the tasks are not independent.
How does the decomposed data depend on each other?
Data are used by multiple tasks
– Replication (overlap)
– Read/write dependencies
How do tasks depend on each other?
Shared data
– Sequential/ordering constraints
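
A small sketch of both kinds of dependency, using a 1D stencil as a stand-in problem (not from the slides): within a sweep the read/write dependency on neighbors is handled by double buffering, and between sweeps there is an ordering constraint.

```c
/* Sketch: a 1D relaxation sweep. out[i] reads its neighbors in `in`
 * (read/write dependency, resolved by double buffering); sweep s must
 * finish before sweep s+1 starts (ordering constraint). */
#include <string.h>

#define N 1024

void relax(double *in, double *out, int steps) {
    for (int s = 0; s < steps; s++) {          /* ordering constraint between sweeps */
        #pragma omp parallel for               /* iterations within a sweep are independent */
        for (int i = 1; i < N - 1; i++)
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
        memcpy(in, out, N * sizeof(double));   /* after the loop's implicit barrier */
    }
}

int main(void) {
    static double in[N], out[N];
    in[N / 2] = 1.0;          /* a single spike to diffuse */
    relax(in, out, 10);
    return 0;
}
```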

Lecture 7: Finding Concurrency
Group Tasks
For complex problems, not all tasks are the same
Natural groups
– Share a temporal constraint (satisfy the constraint once)
– Share a data dependency (meet the dependency once)
– Independent tasks: non-intuitive, but this is a group; they share no dependencies and allow for maximum concurrency
Grouping results in:
– Simplified dependency analysis
– Identification of concurrency

Lecture 7: Finding Concurrency
Order Tasks
Find and account for dependencies
– Temporal
– Concurrent
– Independence
Build an execution graph
Design principles
– Must meet all constraints (correct)
– Minimally (to not interfere with concurrency)
Example: merge sort (recursive parallelism)

Lecture 7: Finding Concurrency
Merge Sort
[Figure: dependencies (left) and parallelism (right) in merge sort]
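
A minimal sketch of merge sort's recursive parallelism using OpenMP tasks (an assumption; the lecture does not specify a programming model): the two recursive sorts are independent and can run concurrently, while the merge is ordered after both halves complete.

```c
/* Sketch: recursive parallelism in merge sort. The task/taskwait pair
 * encodes the execution graph: sort(left) || sort(right), then merge. */
#include <stdlib.h>
#include <string.h>

static void merge(int *a, int *tmp, long lo, long mid, long hi) {
    long i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));
}

static void msort(int *a, int *tmp, long lo, long hi) {
    if (hi - lo < 2) return;
    long mid = lo + (hi - lo) / 2;
    #pragma omp task shared(a, tmp) if (hi - lo > 1000)  /* spawn only for big halves */
    msort(a, tmp, lo, mid);
    msort(a, tmp, mid, hi);
    #pragma omp taskwait                                 /* ordering: both halves before merge */
    merge(a, tmp, lo, mid, hi);
}

void parallel_sort(int *a, long n) {
    int *tmp = malloc(n * sizeof(int));
    #pragma omp parallel
    #pragma omp single
    msort(a, tmp, 0, n);
    free(tmp);
}
```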

Lecture 7: Finding Concurrency
Data Sharing
What group and order are to task decomposition, data sharing is to data decomposition
Several types of shared data
– Local data partitioned to tasks (no dependencies)
– Local data transferred from task to task (associated with dependencies)
– Global read-only data (can be replicated, no dependencies)
– Global shared data structures
Map data-sharing dependencies onto group/order
Principles:
– Minimize data sharing associated with dependencies
– Think about sharing frequency and granularity
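
As an illustration of "local data transferred from task to task" (the kind of sharing that carries a dependency), a ghost-cell exchange on a 1D block-partitioned grid with MPI ranks as tasks; the buffer names and sizes are hypothetical, not from the lecture.

```c
/* Sketch: u[1..LOCAL_N] is this task's partition (no dependency);
 * u[0] and u[LOCAL_N+1] are ghost cells filled from neighboring tasks,
 * i.e., data sharing associated with a dependency. */
#include <mpi.h>

#define LOCAL_N 1000

void exchange_halos(double *u, int rank, int nprocs) {
    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Shift left: send my leftmost owned cell, receive my right ghost. */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Shift right: send my rightmost owned cell, receive my left ghost. */
    MPI_Sendrecv(&u[LOCAL_N],     1, MPI_DOUBLE, right, 1,
                 &u[0],           1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    double u[LOCAL_N + 2] = { 0 };     /* one ghost cell on each side */
    exchange_halos(u, rank, nprocs);
    MPI_Finalize();
    return 0;
}
```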

Lecture 7: Finding Concurrency
Example Problem
Dependency analysis: all neighbors in the tree, at all levels in the hierarchy!
– Replicate or share the root partition
Group tasks
– By data partitions
– Into pipeline phases
Order tasks: in sweep order? top to bottom?
– We relaxed these constraints
– But ordered by iteration

Lecture 7: Finding Concurrency
Example Problem
Data sharing
– Neighbors across each partition boundary (read/write)
– Replicated root partition (read/write)
– Replicated quad-tree structure (read-only)
Separate replicated data by access type

Lecture 7: Finding Concurrency
Design Evaluation
Revisiting flexibility, efficiency, and simplicity
Suitability for the target platform?
– #PEs available and #UEs produced by the problem design
– How are data structures shared? Is the granularity suitable for the target architecture? (cache alignment, #messages, message size)
How much concurrency did the design produce?
– Compare useful work to interference/synchronization
Can the design be parameterized?
– To different problem sizes
– To different numbers of processors
Is the design architecture independent?
– Rarely does one answer yes
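
One way to make "can the design be parameterized?" concrete is to compute the decomposition from the problem size n and worker count P at run time rather than hard-coding it; the helper below is an illustrative sketch, not part of the lecture.

```c
/* Sketch: a decomposition parameterized by problem size n and worker count P,
 * so the same design scales to different sizes and machines. */
#include <math.h>
#include <stdio.h>

/* Pick a near-square Px x Py worker grid and the tile each worker owns. */
void choose_grid(long n, int P, int *Px, int *Py, long *tile_x, long *tile_y) {
    int px = (int)sqrt((double)P);
    while (P % px != 0) px--;          /* largest divisor of P that is <= sqrt(P) */
    *Px = px;
    *Py = P / px;
    *tile_x = (n + *Px - 1) / *Px;     /* ceiling division */
    *tile_y = (n + *Py - 1) / *Py;
}

int main(void) {
    int Px, Py; long tx, ty;
    choose_grid(4096, 12, &Px, &Py, &tx, &ty);
    printf("grid %dx%d, tile %ldx%ld\n", Px, Py, tx, ty);   /* grid 3x4, tile 1366x1024 */
    return 0;
}
```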

Lecture 7: Finding Concurrency
Design Evaluation II
More specific questions:
– Do the task decomposition and ordering allow for load balancing? Or are there too many temporal constraints?
– Are the tasks regular, or heterogeneous in size?
– Can the tasks run asynchronously? (How many barriers?)
– Does the decomposition allow computation to overlap with communication and I/O?
And then redo the whole thing
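
One common answer to heterogeneous task sizes is dynamic scheduling: assign tasks to workers at run time instead of statically. A small OpenMP sketch, with process_item as a hypothetical irregular-cost task (not from the lecture):

```c
/* Sketch: dynamic scheduling rebalances irregular tasks at run time. */
#include <stdio.h>

static double process_item(int i) {       /* stand-in task whose cost varies with i */
    double x = 0.0;
    for (int k = 0; k < (i % 100) * 1000; k++)
        x += 1e-6;
    return x;
}

int main(void) {
    double total = 0.0;
    /* schedule(dynamic, 8): idle threads grab the next chunk of 8 iterations,
     * so a few expensive items do not leave most of the team waiting. */
    #pragma omp parallel for schedule(dynamic, 8) reduction(+ : total)
    for (int i = 0; i < 10000; i++)
        total += process_item(i);
    printf("total = %f\n", total);
    return 0;
}
```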