COMP60621 Concurrent Programming for Numerical Applications
Lecture 1 – The Nature of Parallelism: The Data-Parallel Algorithmic Core
Len Freeman, Graham Riley
Centre for Novel Computing, School of Computer Science, University of Manchester

Overview
–Generic Properties of Applications
–Task-Parallelism vs. Data-Parallelism
–Four Data-Parallel 'Kernel' Algorithms
   Element-wise Vector Addition
   Vector Sum Reduction
   Matrix-Vector Multiplication
   Matrix-Matrix Multiplication
–Summary

Generic Properties of Applications
–From extensive studies of HPC simulations, we conclude that there are many potential applications for HPC, drawn from diverse disciplines and with quite different intrinsic characteristics.
–On the other hand, there are some characteristics in common between the applications. For example, simulations often require the use of discrete approximations to continuous domains.
–Each application needs an underpinning mathematical model and an algorithmic procedure which 'animates' the model in a fashion suitable for digital computation.
–Is it possible to classify the nature of the parallelism that occurs in applications?

Task-Parallelism vs. Data-Parallelism
–Perhaps the most practically interesting thing to emerge from our examples so far is the following distinction in styles of parallelism:
   Task-parallelism – in which different functions are performed simultaneously, possibly using (part of) the same data; the different functions may take very different times to execute.
   Data-parallelism – in which the same function is performed simultaneously, but on different (sets of) data; often, but not always, the function executes in the same time, even though the data values vary.
–There is further substructure in data-parallelism: experience points to three generic forms which are conveniently introduced using the examples below.

Data Parallelism
–Data-parallel algorithms are epitomised by four very simple examples:
   Element-wise vector addition;
   Vector sum reduction;
   Matrix-vector multiplication;
   Matrix-matrix multiplication.
–On their own, these are simple tasks, which we would normally expect to find embedded as subtasks of some more complex computation. Nevertheless, taken together, they are complex enough to illustrate most of the major issues in parallel computing (task-parallelism is readily included by chaining two or more of these examples together). They certainly deserve to be treated as a core part of our algorithmic presentation.

Introduction to Kernel Data-Parallel Algorithms
For each example, we shall investigate:
   The work that needs to be done.
   The ways in which the necessary work might be done in parallel.
   Any inherent constraints associated with the resulting parallelism.
   How performance might be affected as a result of any choices made.
Remember that we are dealing with abstract parallelism (finding opportunities), so our discussion of concepts such as work and performance will be necessarily vague.

Element-wise Vector Addition
–At Algorithm Level, a vector is best thought of as an abstract data type representing a one-dimensional array of elements, all of the same data type. For simplicity, we will use arrays of integer values (this can be generalised with little effort).
–The whole vector is normally identified by a user-defined name, while the individual elements of the vector are identified by use of a supplementary integer value, known as the index. The precise semantics of an index value can vary, but a convenient way of viewing it is as an offset, indicating how far away the element is from the first element (or base) of the vector. (In our examples, and using Fortran convention, an index of 1 corresponds to the first element.)

Element-wise Vector Addition
For our purposes, it is convenient to look at vectors in a diagrammatic form, as follows:
[Diagram: a vector, identified by its name (A), drawn as a row of integer elements.]

Element-wise Vector Addition
The task of adding together the elements of two vectors can be drawn as follows:
[Diagram: input vectors A and B are combined element-wise (+) to produce an output vector.]
A and B are input vectors. The result is an output vector.

Element-wise Vector Addition
–A simple, sequential algorithm for (element-wise) addition is to form the output vector one element at a time, by running through the elements of the two input vectors, in index order, computing the sum of the pair of input elements at each index point.
–The work that has to be done comes in two forms:
   Accessing the elements of the vectors (two input vectors and one output vector); and
   Computing the sum of each pair of elements.
–How might this work be done in parallel?
–What range of options are there?
–How do these affect performance?
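A minimal sketch of this algorithm, in the Fortran style used for the later examples; the subroutine wrapper and the OpenMP directive are assumptions, added only to mark where the independent iterations could run in parallel:

SUBROUTINE VEC_ADD(N, A, B, C)
  INTEGER, INTENT(IN)  :: N
  INTEGER, INTENT(IN)  :: A(N), B(N)
  INTEGER, INTENT(OUT) :: C(N)
  INTEGER :: I
  ! Every index point is independent of every other, so the loop
  ! iterations may execute in any order, or all at once.
!$OMP PARALLEL DO
  DO I = 1, N
     C(I) = A(I) + B(I)
  END DO
!$OMP END PARALLEL DO
END SUBROUTINE VEC_ADD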

Element-wise Vector Addition
–This has been a particularly easy case to study. The work is spread naturally over all the elements of the vectors, each parcel of work is independent of every other parcel of work, and the amount of work in each parcel is the same.
–Unfortunately, this kind of parallel work seldom appears on its own, but it is so convenient for parallel systems that it has become known as embarrassingly parallel. Luckily, parallelism in this form frequently does appear as a subtask in algorithms with much more complex structure.
–Related examples of this kind of parallel work are scalar multiplication of a vector (or matrix) and general matrix addition (a matrix is a generalisation of the array, used to model phenomena in two or more dimensions).

Vector Sum
–Next, we look at the reduction of a vector into a scalar by summing its elements – this is a simplified case of the more general vector inner (dot) product. For simplicity, we continue to assume integer-valued elements.
–This reduction is implicit in the matrix-vector multiplication example since it is required to compute each inner product.
–The following diagram shows what needs to be done:
[Diagram: the elements of an input vector are summed to produce a single scalar output.]

Vector Sum
–The standard sequential algorithm for this task is to set the output scalar value to zero, and then add the values of the successive elements of the input vector into this 'running total', one-at-a-time.
SUM = 0
DO I = 1, N
   SUM = SUM + A(I)
END DO
–What scope is there for doing any of this work in parallel?
–What range of options are there?
–How do these affect performance?
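One way of expressing the parallel options is sketched below: each task accumulates a private partial sum, and the partial sums are combined at the end. The function wrapper and the OpenMP REDUCTION clause are assumptions; the slides themselves stay at the level of abstract parallelism.

INTEGER FUNCTION VEC_SUM(N, A)
  INTEGER, INTENT(IN) :: N
  INTEGER, INTENT(IN) :: A(N)
  INTEGER :: I, S
  S = 0
  ! Each task accumulates its own private partial sum; the partial
  ! sums are combined into S when the loop completes.
!$OMP PARALLEL DO REDUCTION(+:S)
  DO I = 1, N
     S = S + A(I)
  END DO
!$OMP END PARALLEL DO
  VEC_SUM = S
END FUNCTION VEC_SUM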

Vector Sum
–This example illustrates how parallelism can be found even in tasks whose output is clearly scalar (at least at the level of integers). Because the output is non-parallel, the amount of work that can be done in parallel decreases during the computation.
–The standard way of describing this kind of parallel work is divide-and-conquer. In its purest form, this leads to exponentially decreasing parallelism.
–Although it is perhaps the simplest of our examples, the presence of a data write conflict leads to the most difficult problems in implementation, as we shall see later.
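The divide-and-conquer form can be made concrete as a pairwise (tree) summation, sketched below. The function is hypothetical and, for brevity, assumes N is a power of two and overwrites its input with partial sums; note how the number of additions that can proceed in parallel halves at each step.

INTEGER FUNCTION TREE_SUM(N, A)
  INTEGER, INTENT(IN)    :: N        ! assumed to be a power of two
  INTEGER, INTENT(INOUT) :: A(N)     ! overwritten with partial sums
  INTEGER :: I, STEP
  STEP = 1
  DO WHILE (STEP < N)
     ! All the additions within one step are independent of each other,
     ! but each step must complete before the next one starts.
     DO I = 1, N, 2*STEP
        A(I) = A(I) + A(I+STEP)
     END DO
     STEP = 2*STEP
  END DO
  TREE_SUM = A(1)
END FUNCTION TREE_SUM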

Matrix-Vector Multiplication
–Now suppose we wish to multiply a vector by a two-dimensional matrix. This is not so straightforward, because the pattern of work is a little more complex.
–For the moment, suppose that the matrix A is dense (i.e. almost all of its elements are non-zero).
–For simplicity, assume the elements of both the matrix and the vector are integers, although generalisation is readily achieved.

Matrix-Vector Multiplication
The following diagram shows what needs to be done:
[Diagram: the matrix A is multiplied by the input vector X to produce the result vector B.]
–How might this work be done in parallel?
–What range of options are there?
–How do these affect performance?

Matrix-Vector Multiplication
Two loop orderings for this problem:
–Row-based algorithm:
DO I = 1, N
   B(I) = 0
   DO J = 1, N
      B(I) = B(I) + A(I,J)*X(J)
   END DO
END DO
–Column-based algorithm:
DO I = 1, N
   B(I) = 0
END DO
DO J = 1, N
   DO I = 1, N
      B(I) = B(I) + A(I,J)*X(J)
   END DO
END DO

Matrix-Vector Multiplication
–In the dense case, and the row-based algorithm, the outer I-loop can be parallelised; the work is split similarly to that in vector addition, i.e. by grouping together the elements of the output vector. The independent operation to be performed for each element of the output vector is to compute the inner product of the appropriate row of the matrix with the input vector. Since all rows of the matrix have the same number of elements, this gives the same amount of work for each element of the output vector.
–However, the work at each point is not entirely independent of the work at other points, since the whole input vector is required for the computation of each inner product (and therefore each component of the output vector) – shared reads. This is an important matter at the Program Level, as we shall see later.
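A minimal sketch of this row-based parallelisation; the subroutine wrapper and the OpenMP directive are assumptions, used only to mark the independent outer iterations and to make the shared reads of X visible:

SUBROUTINE MATVEC_ROW(N, A, X, B)
  INTEGER, INTENT(IN)  :: N
  INTEGER, INTENT(IN)  :: A(N,N), X(N)
  INTEGER, INTENT(OUT) :: B(N)
  INTEGER :: I, J
!$OMP PARALLEL DO PRIVATE(J)
  DO I = 1, N                        ! rows are independent of one another
     B(I) = 0
     DO J = 1, N
        B(I) = B(I) + A(I,J)*X(J)    ! every task reads the whole of X (shared reads)
     END DO
  END DO
!$OMP END PARALLEL DO
END SUBROUTINE MATVEC_ROW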

Matrix-Vector Multiplication
–In the dense case, and the column-based algorithm, the outer J-loop can be parallelised; the work is split similarly to that in the vector sum operation, i.e. by grouping together columns of the array A. This results in a reduction operation for each element of the result vector B – a vector-result reduction operation.
–Now there are dependencies amongst the tasks (the output data), but there are no shared reads – no dependencies amongst the input data.
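A sketch of the column-based form. Treating the whole result vector as a reduction variable is one way to resolve the output dependencies; the subroutine wrapper, OpenMP, and the array REDUCTION clause are all assumptions here, not part of the original slides:

SUBROUTINE MATVEC_COL(N, A, X, B)
  INTEGER, INTENT(IN)  :: N
  INTEGER, INTENT(IN)  :: A(N,N), X(N)
  INTEGER, INTENT(OUT) :: B(N)
  INTEGER :: I, J
  B = 0                              ! whole-array assignment
  ! Every task updates all of B (shared writes), so B is treated as a
  ! vector-valued reduction; each task reads only its own columns of A.
!$OMP PARALLEL DO PRIVATE(I) REDUCTION(+:B)
  DO J = 1, N
     DO I = 1, N
        B(I) = B(I) + A(I,J)*X(J)
     END DO
  END DO
!$OMP END PARALLEL DO
END SUBROUTINE MATVEC_COL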

Matrix-Vector Multiplication
–Now let's consider what happens if the input matrix is sparse and structured (i.e. a well-defined, substantial number of its elements have a value of zero). We'll restrict consideration to the row-based algorithm.
–For example, what happens if the matrix is (upper) triangular?
–A 'smart' sequential algorithm will avoid doing unnecessary work (multiplies by zero) in this case. What are the implications for parallel work?

Matrix-Vector Multiplication
Two effects emerge.
–Firstly, the amount of work per output vector element becomes different, but predictable. More work is needed to compute the 'earlier' elements. This can lead to an 'unbalanced' workload in the parallel realisation.
–Secondly, the dependence between the computations of the output vector elements changes, since different parts of the input vector are required for the different-length inner product calculations (the number of shared reads varies).
–Overall, this example shows how data read conflicts can affect achievable performance if unwise implementation options are chosen.
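For an upper-triangular matrix, the 'smart' row-based loop starts each inner product at the diagonal, which makes both effects visible; the subroutine wrapper is an assumption, as before:

SUBROUTINE MATVEC_UPPER(N, A, X, B)
  INTEGER, INTENT(IN)  :: N
  INTEGER, INTENT(IN)  :: A(N,N), X(N)    ! A(I,J) = 0 whenever J < I
  INTEGER, INTENT(OUT) :: B(N)
  INTEGER :: I, J
  DO I = 1, N
     B(I) = 0
     DO J = I, N                     ! row I performs N-I+1 multiply-adds:
        B(I) = B(I) + A(I,J)*X(J)    ! earlier rows do more work and read
     END DO                          ! more of X, so the shared reads vary
  END DO
END SUBROUTINE MATVEC_UPPER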

Matrix-Matrix Multiplication
–Triply-nested loop:
DO I = 1, N
   DO J = 1, N
      C(I,J) = 0
      DO K = 1, N
         C(I,J) = C(I,J) + A(I,K)*B(K,J)
      END DO
   END DO
END DO

Matrix-Matrix Multiplication
–One opportunity for parallelism is based on the observation that the computations of disjoint blocks of the result matrix are independent, although they will depend on (some of) the same data – lots of parallelism, but also lots of shared reads.
–Could partition the result matrix into either
   (block) columns;
   (block) rows;
   blocks.
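The first of these partitionings, (block) columns, corresponds to parallelising the J-loop of the triply-nested form; as before, the subroutine wrapper and the OpenMP directive are assumptions used only to mark the independent units of work:

SUBROUTINE MATMUL_COLS(N, A, B, C)
  INTEGER, INTENT(IN)  :: N
  INTEGER, INTENT(IN)  :: A(N,N), B(N,N)
  INTEGER, INTENT(OUT) :: C(N,N)
  INTEGER :: I, J, K
  ! Each task computes a disjoint set of columns of C (independent writes),
  ! but all tasks share reads of the whole of A.
!$OMP PARALLEL DO PRIVATE(I, K)
  DO J = 1, N
     DO I = 1, N
        C(I,J) = 0
        DO K = 1, N
           C(I,J) = C(I,J) + A(I,K)*B(K,J)
        END DO
     END DO
  END DO
!$OMP END PARALLEL DO
END SUBROUTINE MATMUL_COLS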

Algorithmic Core: Summary
–Parallel algorithms as a whole (i.e. including task-parallelism) boil down to one or more of the following three categories:
   Complete independence across the data elements (no sharing); embarrassingly parallel.
   Shared reads on abstract data elements; implement either by replicating the shared data (then we have independence and it becomes easy!), or by arranging for non-contending memory access (not always easy to achieve).
   Shared writes to data elements; in some special cases, we may be able to replicate the shared data (to an extent, but never completely); in the general case, the data must be protected (e.g. using locks) so that access to it is mutually exclusive.
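To make the shared-write category concrete, here is the vector sum again with the shared total protected so that updates are mutually exclusive; the function wrapper and the use of an OpenMP ATOMIC update (standing in for 'e.g. using locks') are assumptions:

INTEGER FUNCTION SUM_SHARED(N, A)
  INTEGER, INTENT(IN) :: N
  INTEGER, INTENT(IN) :: A(N)
  INTEGER :: I, S
  S = 0
!$OMP PARALLEL DO
  DO I = 1, N
     ! S is shared by all tasks, so each update is a shared write and
     ! must be mutually exclusive; this serialises the updates of S.
!$OMP ATOMIC
     S = S + A(I)
  END DO
!$OMP END PARALLEL DO
  SUM_SHARED = S
END FUNCTION SUM_SHARED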

Recap
–At Specification Level, a mathematical model of the application is developed; at Algorithm Level, this specification is converted into an appropriate algorithm. Abstract parallelism emerges at both Levels, in both task-parallel and data-parallel forms.
–An algorithm is an abstract procedure for solving (an approximation to) the problem at hand; it is based on a discrete data domain that represents (an approximation to) the data domain of the specification. In HPC simulations, where the data domain of the specification is often continuous, it is necessary to develop a 'point-wise' discretisation for the algorithm to work on. Normally, parallelism is then exploited across the elements of the discretised data domain.
–The resulting abstract data-parallelism appears in three forms: independent, shared reads and shared writes (in increasing order of difficulty to implement).