EE382N (20): Computer Architecture - Parallelism and Locality, Fall 2011
Lecture 10 – Parallelism in Software I
Mattan Erez, The University of Texas at Austin

Outline
- Parallel programming
- Parallelizing a program
  - Start from scratch
  - Reengineering for parallelism
- Parallelizing a program
  - Decomposition (finding concurrency)
  - Assignment (algorithm structure)
  - Orchestration (supporting structures)
  - Mapping (implementation mechanisms)
- Patterns for Parallel Programming

Credits
- Most of the slides courtesy of Dr. Rodric Rabbah (IBM)
- Taken from 6.189 IAP taught at MIT in 2007

Parallel programming from scratch
- Start with an algorithm: a formal representation of the problem solution as a sequence of steps
- Make sure there is parallelism in each algorithm step
  - Minimize synchronization points
- Don't forget locality: communication is costly (performance, energy, system cost)
- More often, you start with existing sequential code

Reengineering for Parallelism
- Define a testing protocol
- Identify program hot spots: where is most of the time spent?
  - Look at the code
  - Use profiling tools (or a quick timing harness; see the sketch below)
- Parallelization
  - Start with the hot spots first
  - Make sequences of small changes, each followed by testing
  - Patterns provide guidance
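Profiling tools do most of this work, but a quick way to confirm a suspected hot spot, and to re-test after each small change, is to bracket it with wall-clock timers. A minimal C sketch, assuming a POSIX system and a hypothetical process_item routine standing in for the real application work:

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the real work at the suspected hot spot. */
static double process_item(int i) {
    double x = 0.0;
    for (int k = 0; k < 1000; ++k)
        x += (double)i * k;
    return x;
}

int main(void) {
    struct timespec t0, t1;
    double sum = 0.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 100000; ++i)      /* suspected hot loop */
        sum += process_item(i);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("hot loop: %.3f s (sum=%g)\n", secs, sum);
    return 0;
}
```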

4 Common Steps to Creating a Parallel Program
(Figure: a sequential computation is partitioned in four steps. Decomposition breaks it into tasks; assignment groups the tasks into units of execution UE0..UE3; orchestration organizes their interaction; mapping places the units of execution onto processors to form the parallel program.)
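To make the four steps concrete, here is a minimal OpenMP sketch (an illustration, not taken from the slides) that sums an array; the comments note where each step shows up:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; ++i)
        a[i] = 1.0;

    /* Decomposition: each loop iteration (or chunk of iterations) is a task.
       Assignment: schedule(static) gives each thread a contiguous chunk.
       Orchestration: reduction(+:sum) supplies the synchronization needed to
       combine the per-thread partial sums safely.
       Mapping: the OpenMP runtime and the OS place threads onto processors. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}
```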

Decomposition
- Identify concurrency and decide at what level to exploit it
- Break up the computation into tasks to be divided among processes
  - Tasks may become available dynamically (see the task sketch below)
  - The number of tasks may vary with time
- Provide enough tasks to keep processors busy
  - The number of tasks available at a time is an upper bound on the achievable speedup
- Main consideration: coverage and Amdahl's Law
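For the dynamic case, a minimal sketch (an illustration, not from the slides) using OpenMP tasks, where a hypothetical recursive count_below routine creates new tasks as the recursion unfolds, so the number of available tasks varies over time:

```c
#include <stdio.h>
#include <omp.h>

/* Hypothetical recursive work: tasks appear dynamically as the recursion unfolds. */
static long count_below(long lo, long hi, long threshold) {
    if (hi - lo <= threshold) {          /* small enough: run sequentially */
        long n = 0;
        for (long i = lo; i < hi; ++i)
            n += (i % 3 == 0);
        return n;
    }
    long mid = lo + (hi - lo) / 2, left = 0, right = 0;

    #pragma omp task shared(left)        /* new task created at runtime */
    left = count_below(lo, mid, threshold);
    #pragma omp task shared(right)
    right = count_below(mid, hi, threshold);
    #pragma omp taskwait                 /* wait for both child tasks */
    return left + right;
}

int main(void) {
    long result;
    #pragma omp parallel
    #pragma omp single                   /* one thread seeds the task tree */
    result = count_below(0, 10000000, 100000);
    printf("count = %ld\n", result);
    return 0;
}
```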

Coverage
- Amdahl's Law: the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used
- A demonstration of the law of diminishing returns

Amdahl's Law
- Potential program speedup is defined by the fraction of code that can be parallelized
- Example (timeline figure): 25 seconds of sequential work, then 50 seconds of parallelizable work, then 25 seconds of sequential work, for 100 seconds in total. Using 5 processors for the parallel work cuts it to 10 seconds, so the total drops to 25 + 10 + 25 = 60 seconds.

Amdahl's Law
- Speedup = old running time / new running time
- For the example above (5 processors for the parallel work): speedup = 100 seconds / 60 seconds = 1.67, i.e., the parallel version is 1.67 times faster

Amdahl's Law
- p = fraction of the work that can be parallelized
- n = the number of processors
- speedup = 1 / ((1 - p) + p/n), where (1 - p) is the fraction of time spent on sequential work and p/n is the fraction of time needed to complete the parallel work
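A small C sketch (not from the slides) that evaluates this formula; the first call reproduces the 100-second / 60-second example, where 50 of the 100 seconds are parallelizable, so p = 0.5 and n = 5:

```c
#include <stdio.h>

/* Amdahl's law: speedup with n processors when a fraction p of the work
   can be parallelized. */
static double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}

int main(void) {
    /* Slide example: p = 0.5 on 5 processors, expect about 1.67. */
    printf("speedup(p=0.5, n=5)    = %.2f\n", amdahl_speedup(0.5, 5));

    /* Diminishing returns: the sequential fraction dominates for large n. */
    printf("speedup(p=0.5, n=1000) = %.2f\n", amdahl_speedup(0.5, 1000));
    return 0;
}
```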

Implications of Amdahl's Law
- Speedup tends to 1 / (1 - p) as the number of processors tends to infinity
- Super-linear speedups are possible due to registers and caches
- Typical speedup is less than linear
- (Figure: speedup versus number of processors, with the linear-speedup line marking 100% efficiency)
- Parallelism is only worthwhile when it dominates execution
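The asymptote follows directly from the formula above; a short check (standard algebra, not on the slide):

```latex
\lim_{n \to \infty} \frac{1}{(1-p) + p/n} \;=\; \frac{1}{1-p}
```

So with p = 0.5 the speedup can never exceed 2, and with p = 0.95 it can never exceed 20, no matter how many processors are added.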

Assignment
- Specify the mechanism used to divide work among PEs
  - Balance the work and reduce communication
- Structured approaches usually work well
  - Code inspection or understanding of the application
  - Well-known design patterns
- As programmers, we worry about partitioning first
  - Is it independent of architecture or programming model? Complexity often affects decisions, and the architectural model affects decisions
- Main considerations: granularity and locality

Fine vs. Coarse Granularity
- Coarse-grain parallelism
  - High computation-to-communication ratio
  - Large amounts of computational work between communication events
  - Harder to load balance efficiently
- Fine-grain parallelism
  - Low computation-to-communication ratio
  - Small amounts of computational work between communication stages
  - High communication overhead
  - Potential for HW assist
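Chunk size in a parallel loop is one concrete granularity knob (an illustration, not from the slides): large static chunks give coarse grains with little scheduling overhead, while small dynamically scheduled chunks give fine grains that balance better but pay more per-chunk overhead. A sketch, with a hypothetical work function whose cost grows with the iteration index:

```c
#include <stdio.h>
#include <omp.h>

#define N (1 << 20)

/* Hypothetical per-iteration work whose cost grows with i, so equal-sized
   contiguous chunks do unequal amounts of work. */
static double work(int i) {
    double x = 0.0;
    for (int k = 0; k < i / 1024; ++k)
        x += (double)k;
    return x;
}

int main(void) {
    double sum;

    /* Coarse grain: one big contiguous chunk per thread; minimal scheduling
       overhead, but the thread holding the last chunk does the most work. */
    sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < N; ++i) sum += work(i);
    printf("coarse: %f\n", sum);

    /* Fine grain: many small chunks handed out on demand; better load
       balance, but more scheduling overhead per chunk. */
    sum = 0.0;
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:sum)
    for (int i = 0; i < N; ++i) sum += work(i);
    printf("fine:   %f\n", sum);
    return 0;
}
```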

Load Balancing vs. Synchronization
(Figure: execution timelines on PE0 and PE1 under fine-grain and coarse-grain task assignment.)

Load Balancing vs. Synchronization
(Same figure, with the trade-off annotated.)
- Expensive synchronization favors coarse granularity
- Few units of execution plus disparity in their execution times favors fine granularity
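A minimal sketch of the synchronization side of this trade-off (an illustration, not from the slides): synchronizing on every iteration versus once per thread.

```c
#include <stdio.h>
#include <omp.h>

#define N (1 << 22)

int main(void) {
    double t, sum;

    /* Fine-grain synchronization: every iteration synchronizes on the shared
       accumulator. Correct, but the atomic update becomes the bottleneck. */
    sum = 0.0;
    t = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        #pragma omp atomic
        sum += 1.0;
    }
    printf("per-iteration sync: %.3f s (sum=%.0f)\n", omp_get_wtime() - t, sum);

    /* Coarse-grain synchronization: each thread accumulates privately and
       synchronizes once at the end (what reduction(+:sum) does for us). */
    sum = 0.0;
    t = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum += 1.0;
    printf("one sync per thread: %.3f s (sum=%.0f)\n", omp_get_wtime() - t, sum);
    return 0;
}
```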

Orchestration and Mapping
- Computation and communication concurrency
- Preserve locality of data
- Schedule tasks to satisfy dependences early
- Survey the available mechanisms on the target system
- Main considerations: locality, parallelism, mechanisms (efficiency and dangers)
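Mapping mechanisms are platform specific. As one example of explicit mapping (a Linux-specific sketch, not from the slides; the portable OpenMP route is the OMP_PROC_BIND and OMP_PLACES environment variables), each thread can pin itself to a core:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main(void) {
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();

        /* Map: pin this thread to core (tid mod ncores). */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)(tid % ncores), &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0)   /* 0 = calling thread */
            perror("sched_setaffinity");

        printf("thread %d running on core %d\n", tid, sched_getcpu());
    }
    return 0;
}
```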

Parallel Programming by Pattern
- Provides a cookbook to systematically guide programmers
  - Decompose, Assign, Orchestrate, Map
  - Can lead to high-quality solutions in some domains
- Provides a common vocabulary for the programming community
  - Each pattern has a name, giving a vocabulary for discussing solutions
- Helps with software reusability, malleability, and modularity
- Written in a prescribed format to allow the reader to quickly understand the solution and its context
- Otherwise, parallel programming is too difficult for programmers, and software will not fully exploit parallel hardware