EE382N (20): Computer Architecture - Parallelism and Locality
Fall 2011
Lecture 11 – Parallelism in Software II
Mattan Erez, The University of Texas at Austin
(c) Rodric Rabbah, Mattan Erez

Credits
Most of the slides are courtesy of Dr. Rodric Rabbah (IBM), taken from the 6.189 IAP course taught at MIT in 2007.

4 Common Steps to Creating a Parallel Program
[Figure: a sequential computation is partitioned into tasks (decomposition); the tasks are assigned to units of execution UE0-UE3 (assignment); the units of execution are arranged into a parallel program (orchestration); and the program is mapped onto processors (mapping).]

Decomposition
- Identify concurrency and decide at what level to exploit it
- Break up the computation into tasks to be divided among processes
  - Tasks may become available dynamically
  - The number of tasks may vary with time
- Provide enough tasks to keep processors busy
  - The number of tasks available at a time is an upper bound on the achievable speedup
Main consideration: coverage and Amdahl's Law
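As a quick worked example (not on the original slide), Amdahl's Law quantifies the coverage bound: if a fraction p of the computation can be parallelized across n processors,

S(n) = \frac{1}{(1-p) + p/n}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1-p}

With p = 0.9 and n = 16, S = 1/(0.1 + 0.9/16) ≈ 6.4, and even infinitely many processors cannot push the speedup past 10.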

Assignment
- Specify the mechanism for dividing work among PEs
  - Balance work and reduce communication
- Structured approaches usually work well
  - Code inspection or understanding of the application
  - Well-known design patterns
- As programmers, we worry about partitioning first
  - Independent of architecture or programming model?
  - Complexity often affects decisions
  - The architectural model affects decisions
Main considerations: granularity and locality

Orchestration and Mapping
- Computation and communication concurrency
- Preserve locality of data
- Schedule tasks to satisfy dependences early
- Survey the available mechanisms on the target system
Main considerations: locality, parallelism, mechanisms (efficiency and dangers)

Parallel Programming by Pattern
- Provides a cookbook to systematically guide programmers
  - Decompose, Assign, Orchestrate, Map
  - Can lead to high-quality solutions in some domains
- Provides a common vocabulary for the programming community
  - Each pattern has a name, providing a vocabulary for discussing solutions
- Helps with software reusability, malleability, and modularity
  - Written in a prescribed format to allow the reader to quickly understand the solution and its context
- Otherwise, parallel programming is too difficult for most programmers, and software will not fully exploit parallel hardware

History
- Berkeley architecture professor Christopher Alexander
- In 1977, published patterns for city planning, landscaping, and architecture in an attempt to capture principles for "living" design

Example 167 (p. 783): 6ft Balcony
- Not just a collection of patterns, but a pattern language
  - Patterns lead to other patterns
  - Patterns are hierarchical and compositional
  - Embodies a design methodology and vocabulary

Patterns in Object-Oriented Programming
- Design Patterns: Elements of Reusable Object-Oriented Software (1995)
  - Gang of Four (GoF): Gamma, Helm, Johnson, Vlissides
  - Catalogue of patterns: creational, structural, behavioral
- "Design patterns describe successful solutions to recurring design problems, and are organized hierarchically into a pattern language. Their form has proven to be a useful tool to model design experience, in architecture where they were originally conceived" [1, 2]

Patterns for Parallelizing Programs
4 Design Spaces
- Algorithm Expression
  - Finding Concurrency: expose concurrent tasks
  - Algorithm Structure: map tasks to processes to exploit parallel architecture
- Software Construction
  - Supporting Structures: code and data structuring patterns
  - Implementation Mechanisms: low-level mechanisms used to write parallel programs
Patterns for Parallel Programming. Mattson, Sanders, and Massingill (2005).

Here's my algorithm. Where's the concurrency?
[Figure: MPEG decoder block diagram. The MPEG bit stream enters VLD, which produces macroblocks and motion vectors; the flow splits into frequency-encoded macroblocks (ZigZag, IQuantization, IDCT, Saturation) and differentially coded motion vectors (Motion Vector Decode, Repeat); the spatially encoded macroblocks and motion vectors join at Motion Compensation, and the recovered picture flows through Picture Reorder, Color Conversion, and Display.]

Here's my algorithm. Where's the concurrency?
- Task decomposition
  - Independent coarse-grained computation
  - Inherent to the algorithm
  - A sequence of statements (instructions) that operate together as a group
  - Corresponds to some logical part of the program
  - Usually follows from the way the programmer thinks about a problem
[Same MPEG decoder figure as above.]

Here's my algorithm. Where's the concurrency?
- Task decomposition
  - Parallelism in the application
- Pipeline task decomposition
  - Data assembly lines
  - Producer-consumer chains
[Same MPEG decoder figure as above.]

Here's my algorithm. Where's the concurrency?
- Task decomposition
  - Parallelism in the application
- Pipeline task decomposition
  - Data assembly lines
  - Producer-consumer chains
- Data decomposition
  - The same computation is applied to small data chunks derived from a large data set
[Same MPEG decoder figure as above.]

Guidelines for Task Decomposition
- Algorithms start with a good understanding of the problem being solved
- Programs often naturally decompose into tasks; two common decompositions are
  - function calls
  - distinct loop iterations (see the sketch below)
- It is easier to start with many tasks and later fuse them than to start with too few and later try to split them
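A minimal sketch (not from the slides) of the loop-iteration decomposition named above, written in C with OpenMP; the array names and sizes are illustrative:

#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void) {
    /* Each iteration is an independent task: no iteration reads or
     * writes data touched by another, so OpenMP is free to divide
     * the iterations among threads. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[0] = %f\n", c[0]);
    return 0;
}

Compile with -fopenmp (GCC/Clang); without it the pragma is ignored and the loop runs sequentially, which is exactly the flexibility the guidelines below ask for.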

Guidelines for Task Decomposition
- Flexibility
  - The program design should afford flexibility in the number and size of tasks generated
  - Tasks should not be tied to a specific architecture
  - Fixed tasks vs. parameterized tasks
- Efficiency
  - Tasks should have enough work to amortize the cost of creating and managing them
  - Tasks should be sufficiently independent that managing dependencies doesn't become the bottleneck
- Simplicity
  - The code has to remain readable and easy to understand and debug

Case for Pipeline Decomposition
- Data flows through a sequence of stages
  - An assembly line is a good analogy
- What's a prime example of pipeline decomposition in computer architecture?
  - The instruction pipeline in modern CPUs
- What's an example pipeline you may use in your UNIX shell?
  - Pipes in UNIX: cat foobar.c | grep bar | wc
- Other examples
  - Signal processing
  - Graphics
[Figure: ZigZag → IQuantization → IDCT → Saturation pipeline stages.]
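To make the producer-consumer structure concrete, here is a hedged two-stage pipeline sketch in C with POSIX threads; the single-slot buffer and the stage bodies are illustrative simplifications, not from the slides:

#include <pthread.h>
#include <stdio.h>

#define COUNT 10

static int slot;            /* one-item buffer between the stages */
static int full = 0;        /* is the slot occupied? */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *stage1(void *arg) {        /* producer */
    (void)arg;
    for (int i = 0; i < COUNT; i++) {
        pthread_mutex_lock(&lock);
        while (full)                    /* wait for stage 2 to drain */
            pthread_cond_wait(&cond, &lock);
        slot = i * i;                   /* stage 1 computation */
        full = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *stage2(void *arg) {        /* consumer */
    (void)arg;
    for (int i = 0; i < COUNT; i++) {
        pthread_mutex_lock(&lock);
        while (!full)                   /* wait for stage 1 to fill */
            pthread_cond_wait(&cond, &lock);
        printf("stage 2 received %d\n", slot);
        full = 0;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

A real pipeline would use a deeper queue so the stages overlap rather than hand off in lock step, but the synchronization pattern is the same.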

Guidelines for Data Decomposition
- Data decomposition is often implied by task decomposition
- Programmers need to address both task and data decomposition to create a parallel program
  - Which decomposition to start with?
- Data decomposition is a good starting point when
  - The main computation is organized around manipulation of a large data structure
  - Similar operations are applied to different parts of the data structure

Common Data Decompositions
- Geometric data structures
  - Decomposition of arrays along rows, columns, or blocks
  - Decomposition of meshes into domains
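A minimal sketch (array name and dimensions are illustrative) of a row-wise array decomposition in C with OpenMP; schedule(static) hands each thread one contiguous band of rows, which also keeps each thread's data local:

#include <stdio.h>

#define ROWS 1024
#define COLS 1024

static double grid[ROWS][COLS];

int main(void) {
    /* Row decomposition: each thread gets a contiguous chunk of rows. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            grid[i][j] = (double)(i + j);

    printf("grid[0][0] = %f\n", grid[0][0]);
    return 0;
}

A column or block decomposition only changes which index range each thread owns; the trade-off is communication at the chunk boundaries in stencil-style computations.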

Common Data Decompositions
- Geometric data structures
  - Decomposition of arrays along rows, columns, or blocks
  - Decomposition of meshes into domains
- Recursive data structures
  - Example: decomposition of trees into sub-trees
[Figure: a problem is recursively split into subproblems, the leaf subproblems are computed in parallel, and the results are merged back into the solution.]

Guidelines for Data Decomposition
- Flexibility
  - The size and number of data chunks should support a wide range of executions
- Efficiency
  - Data chunks should generate comparable amounts of work (for load balancing)
- Simplicity
  - Complex data decompositions can become difficult to manage and debug

Data Decomposition Examples
- Molecular dynamics
  - Compute forces
  - Update accelerations and velocities
  - Update positions
- Decomposition
  - The baseline algorithm is O(N^2), with all-to-all communication
  - The best decomposition is to treat the molecules as a set
  - Some advantages of a geometric decomposition are discussed in a future lecture
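A hedged sketch of the O(N^2) baseline named above, in C with OpenMP; the 1-D positions and the force law are placeholders, not the course's actual code:

#define N 4096

static double pos[N], force[N];

void compute_forces(void) {
    /* Treat the molecules as a set: each thread owns a subset of i
     * and computes all N-1 pairwise interactions for each one. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        double f = 0.0;
        for (int j = 0; j < N; j++) {
            if (j == i) continue;
            double d = pos[i] - pos[j];
            f += 1.0 / (d * d + 1e-9);  /* placeholder force law */
        }
        force[i] = f;
    }
}

Because every molecule needs every other molecule's position, this set decomposition implies all-to-all communication on a distributed machine, which is what motivates the geometric alternative.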

Data Decomposition Examples
- Molecular dynamics
  - Geometric decomposition
- Merge sort
  - Recursive decomposition
[Same split/compute/merge figure as above.]
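A minimal recursive-decomposition sketch of merge sort in C with OpenMP tasks; CUTOFF and the array contents are illustrative choices, not from the slides:

#include <stdio.h>
#include <string.h>

#define CUTOFF 1024

static void merge(int *a, int *tmp, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof(int));
}

static void msort(int *a, int *tmp, int lo, int hi) {
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    /* Split: the two halves are independent subproblems. Spawn a task
     * only while the subproblem is large enough to amortize overhead. */
    #pragma omp task if (hi - lo > CUTOFF)
    msort(a, tmp, lo, mid);
    msort(a, tmp, mid, hi);
    #pragma omp taskwait            /* both halves done before merging */
    merge(a, tmp, lo, mid, hi);
}

int main(void) {
    enum { N = 1 << 20 };
    static int a[N], tmp[N];
    for (int i = 0; i < N; i++) a[i] = N - i;

    #pragma omp parallel
    #pragma omp single              /* one thread starts the recursion */
    msort(a, tmp, 0, N);

    printf("a[0]=%d a[N-1]=%d\n", a[0], a[N - 1]);
    return 0;
}

The structure mirrors the figure: split, compute the leaf subproblems in parallel, then merge on the way back up.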

Dependence Analysis
- Given two tasks, how do we determine whether they can safely run in parallel?

Bernstein's Condition
- Ri: the set of memory locations read (input) by task Ti
- Wj: the set of memory locations written (output) by task Tj
- Two tasks T1 and T2 can run in parallel if
  - the input of T1 is not part of the output of T2
  - the input of T2 is not part of the output of T1
  - the outputs of T1 and T2 do not overlap
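Restated in set notation (an equivalent formulation, not verbatim from the slide), T1 and T2 may execute in parallel when

R_1 \cap W_2 = \varnothing, \qquad R_2 \cap W_1 = \varnothing, \qquad W_1 \cap W_2 = \varnothing

The first two conditions rule out flow (read-after-write) and anti (write-after-read) dependences; the third rules out output (write-after-write) dependences.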

Example
T1: a = x + y
T2: b = x + z

R1 = { x, y }    W1 = { a }
R2 = { x, z }    W2 = { b }

The sets overlap only in x, which both tasks merely read, so all three conditions hold and T1 and T2 can safely run in parallel.
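A minimal illustration (not from the slides): since the example satisfies Bernstein's condition, the two statements can be run as independent OpenMP sections:

#include <stdio.h>

int main(void) {
    int x = 1, y = 2, z = 3, a = 0, b = 0;

    /* T1 and T2 satisfy Bernstein's condition, so the sections
     * may execute concurrently without synchronization. */
    #pragma omp parallel sections
    {
        #pragma omp section
        a = x + y;                  /* T1 */
        #pragma omp section
        b = x + z;                  /* T2 */
    }

    printf("a = %d, b = %d\n", a, b);
    return 0;
}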

Patterns for Parallelizing Programs
4 Design Spaces
- Algorithm Expression
  - Finding Concurrency: expose concurrent tasks
  - Algorithm Structure: map tasks to processes to exploit parallel architecture
- Software Construction
  - Supporting Structures: code and data structuring patterns
  - Implementation Mechanisms: low-level mechanisms used to write parallel programs
Patterns for Parallel Programming. Mattson, Sanders, and Massingill (2005).

Algorithm Structure Design Space
- Given a collection of concurrent tasks, what's the next step?
- Map tasks to units of execution (e.g., threads)
- Important considerations
  - The number of execution units the platform will support
  - The cost of sharing information among execution units
  - Avoid the tendency to over-constrain the implementation
    - Work well on the intended platform
    - Be flexible enough to easily adapt to different architectures

Major Organizing Principle
- How do we determine the algorithm structure that represents the mapping of tasks to units of execution?
- The concurrency usually implies a major organizing principle
  - Organize by tasks
  - Organize by data decomposition
  - Organize by flow of data

Organize by Tasks?
[Decision diagram: Recursive? yes → Divide and Conquer; no → Task Parallelism]