
1 Parallel and Distributed Algorithms Spring 2007
Johnnie W. Baker

2 Introduction and General Concepts
Chapter 1

3 References
Selim Akl, Parallel Computation: Models and Methods, Prentice Hall, 1997. Updated online version available through the author's website.
Selim Akl, "The Design of Efficient Parallel Algorithms," Chapter 2 in Handbook on Parallel and Distributed Processing, edited by J. Blazewicz, K. Ecker, B. Plateau, and D. Trystram, Springer Verlag, 2000.
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Introduction to Parallel Computing, 2nd Edition, Addison Wesley, 2003.
Harry Jordan and Gita Alaghband, Fundamentals of Parallel Processing: Algorithms, Architectures, Languages, Prentice Hall, 2003.
Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, 2004.
Michael Quinn, Parallel Computing: Theory and Practice, McGraw Hill, 1994.
Barry Wilkinson and Michael Allen, Parallel Programming, 2nd Edition, Prentice Hall, 2005.

4 Outline
Need for Parallel & Distributed Computing
Flynn's Taxonomy of Parallel Computers
Two Main Types of MIMD Computers
Examples of Computational Models
Data Parallel & Functional/Control/Job Parallel
Granularity
Analysis of Parallel Algorithms
Elementary Steps: computational and routing steps
Running Time & Time Optimal
Parallel Speedup
Speedup
Cost and Work
Efficiency
Linear and Superlinear Speedup
Speedup and Slowdown Folklore Theorems
Amdahl's and Gustafson's Laws

5 Reasons to Study Parallel & Distributed Computing
Sequential computers have severe limits on memory size. Significant slowdowns occur when accessing data that is stored in external devices. Sequential computation times for most large problems are unacceptable. Sequential computers cannot meet the deadlines of many real-time problems. Some problems are distributed in nature and are natural candidates for distributed computation.

6 Grand Challenge to Computational Science Categories (1989)
Quantum chemistry, statistical mechanics, and relativistic physics
Cosmology and astrophysics
Computational fluid dynamics and turbulence
Materials design and superconductivity
Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling
Medicine, and modeling of human organs and bones
Global weather and environmental modeling

7 Weather Prediction
The atmosphere is divided into 3D cells. Data include temperature, pressure, humidity, wind speed and direction, etc., recorded at regular time intervals in each cell. There are about 5×10³ cells of 1-mile cubes. The calculations needed for a 10-day forecast would take a modern computer over 100 days to perform. Details are in Ian Foster's 1995 online textbook Designing and Building Parallel Programs; see the author's online copy for further information.

8 Flynn's Taxonomy
The best-known classification scheme for parallel computers. A computer is classified by the parallelism it exhibits in its instruction stream and its data stream. A sequence of instructions (the instruction stream) manipulates a sequence of operands (the data stream). The instruction stream (I) and the data stream (D) can each be either single (S) or multiple (M), giving four combinations: SISD, SIMD, MISD, MIMD.

9 SISD
Single Instruction, Single Data: single-CPU systems, i.e., uniprocessors (note: co-processors don't count). Concurrent processing is still allowed: instruction prefetching, pipelined execution of instructions, and functional (but not data) parallel execution. Example: PCs.

10 SIMD Single instruction, multiple data
One instruction stream is broadcast to all processors. Each processor (also called a processing element or PE) is very simple, essentially an ALU; PEs do not store a copy of the program nor have a program control unit. Individual processors can be inhibited from participating in an instruction (based on a data test).
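A small sketch in C (written for these notes, not vendor code for any particular SIMD machine) of the masking idea: conceptually there is one PE per index, every PE evaluates the same data test in the same cycle, and only the PEs that pass the test store a result.

/* Illustrative sketch only: models the SIMD rule that all PEs execute the
   same instruction, while PEs failing a data test are inhibited from
   storing a result.                                                       */
void simd_style_update(float a[], const float b[], int n) {
    for (int i = 0; i < n; i++)      /* conceptually: one PE per index i    */
        if (b[i] > 0.0f)             /* data test that masks off some PEs   */
            a[i] = a[i] + b[i];      /* only unmasked PEs perform the store */
}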

11 SIMD (cont.)
All active processors execute the same instruction synchronously, but on different data. On a memory access, all active processors must access the same location in their local memory. The data items form an array, and an instruction can act on the complete array in one cycle.

12 SIMD (cont.)
Quinn calls this architecture a processor array. Two examples are Thinking Machines' Connection Machine CM2 and Connection Machine CM200; MasPar's MP1 and MP2 are also examples. Quinn also considers a pipelined vector processor to be a SIMD, which is somewhat non-standard; an example is the Cray-1.

13 MISD Multiple instruction, single data
Quinn argues that a systolic array is an example of a MISD structure (pg 55-57) Some authors include pipelined architecture in this category This category does not receive much attention from most authors

14 MIMD Multiple instruction, multiple data
Processors are asynchronous, since they can independently execute different programs on different data sets. Communication is handled either by message passing or through shared memory. Considered by most researchers to contain the most powerful, least restricted computers.

15 MIMD (cont.)
MIMDs have major communication costs that are usually ignored when computing algorithmic complexity and when comparing them to SIMDs. A common way to program MIMDs is for all processors to execute the same program; this is called the SPMD method (for single program, multiple data) and is the usual method when the number of processors is large.

16 Multiprocessors (Shared Memory MIMDs)
All processors have access to all memory locations, and they access memory through some type of interconnection network. When all processors have equal access time to all memory locations, this is called uniform memory access (UMA). Most multiprocessors have hierarchical or distributed memory systems in order to provide fast access to all memory locations; this type of memory access is called nonuniform memory access (NUMA).

17 Multiprocessors (cont.)
Normally, fast cache is used with NUMA systems to reduce the problem of differing memory access times for PEs. This creates the problem of ensuring that all copies of the same data in different memory locations are identical (cache coherence). Symmetric multiprocessors (SMPs) are currently a popular example of shared memory multiprocessors.

18 Multicomputers (Message-Passing MIMDs)
Processors are connected by an interconnection network. Each processor has a local memory and can only access its own local memory. Data is passed between processors using messages, as dictated by the program. A common approach to programming multicomputers is to use a message-passing library (e.g., MPI, PVM). The problem is divided into processes that can be executed concurrently on individual processors, and each processor is normally assigned multiple processes.
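As a concrete illustration of the message-passing style, here is a minimal MPI sketch (the ring-exchange pattern and variable names are illustrative assumptions, not an example from the texts above); each process owns its local data, and all sharing happens through explicit send/receive calls.

/* Minimal MPI sketch: each process sends a value to its right neighbor and
   receives one from its left neighbor (a ring exchange).
   Build with an MPI compiler wrapper, e.g.:  mpicc ring.c -o ring          */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id             */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes     */

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    int send_val = rank, recv_val;

    /* Data moves between local memories only through explicit messages.    */
    MPI_Sendrecv(&send_val, 1, MPI_INT, right, 0,
                 &recv_val, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("process %d received %d from process %d\n", rank, recv_val, left);
    MPI_Finalize();
    return 0;
}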

19 Multicomputers (cont. 2/3)
Programming disadvantages of message-passing: Programmers must make explicit message-passing calls in the code; this is low-level programming and is error prone. Data is not shared but copied, which increases the total data size. Data integrity: it is difficult to maintain the correctness of multiple copies of a data item.

20 Multicomputers (cont. 3/3)
Programming advantages of message-passing: No problem with simultaneous access to data. Allows different PCs to operate on the same data independently. Allows PCs on a network to be easily upgraded when faster processors become available. Mixed "distributed shared memory" systems: there is much current interest in clusters of SMPs. See Dr. David Bader's or Dr. Joseph JaJa's website.

21 Data Parallel Computation
All tasks (or processors) apply the same set of operations to different data. Data parallel is the usual mode of computation for SIMDs: all active processors execute the operations synchronously. Example:
for i ← 0 to 99 do
    a[i] ← b[i] + c[i]
endfor
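The same data-parallel loop can also be expressed for a shared-memory MIMD machine. Below is a short C/OpenMP sketch (an illustration written for these notes, not code from the referenced texts): each thread applies the identical operation to a different part of the index range.

/* Data-parallel loop in C with OpenMP; compile with -fopenmp.             */
#include <omp.h>

void vector_add(double a[], const double b[], const double c[], int n) {
    #pragma omp parallel for        /* iterations divided among the threads */
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];         /* same operation, different data       */
}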

22 Data Parallel Computation for MIMDs
A common way to support data parallel programming on MIMDs is to use SPMD (single program, multiple data) programming, as follows: processors (or tasks) execute the same block of instructions concurrently but asynchronously. No communication or synchronization occurs within these concurrent instruction blocks. Each instruction block is normally followed by synchronization and communication steps. If processors have identical tasks (with different data), this method allows them to be executed in a data parallel fashion (a sketch of the pattern follows below).
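A minimal SPMD sketch in C with MPI (an illustrative assumption, not code from the texts): every process runs this same function asynchronously on its own block of data, and the collective call at the end provides the communication and synchronization step that follows the concurrent block.

/* SPMD pattern: concurrent local computation, then a collective step.     */
#include <mpi.h>

double spmd_block_sum(const double *local, int local_n) {
    double local_sum = 0.0, global_sum;

    /* Concurrent instruction block: no communication or synchronization.  */
    for (int i = 0; i < local_n; i++)
        local_sum += local[i];

    /* Communication + synchronization step that follows the block.        */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);
    return global_sum;              /* every process holds the same result  */
}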

23 Functional/Control/Job Parallelism
Involves executing different sets of operations on different data sets; this is typical MIMD control parallelism. The problem is divided into different, non-identical tasks, and the tasks are distributed among the processors. Tasks are usually divided between processors so that their workload is roughly balanced. Each processor follows some scheme for executing its tasks; one scheme is to use a task-scheduling algorithm. Alternately, a processor may follow some scheme for executing its tasks concurrently.

24 Grain Size
Defn: Grain size is the average number of computations performed between communication or synchronization steps (see the Quinn textbook, page 411). Data parallel programming usually results in smaller grain size computations. SIMD computation is considered to be fine-grain, MIMD data parallelism is usually considered to be medium-grain, and parallelism at the task level is considered to be coarse-grain. Generally, increasing the grain size increases the performance of MIMD programs, as the sketch below illustrates.
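The sketch below (a hypothetical illustration written for these notes) contrasts two grain sizes for the same global sum: the fine-grain version performs a communication step after every element, while the coarse-grain version does all local work first and communicates once.

#include <mpi.h>

/* Fine grain: one communication step per computation.                     */
double fine_grain_sum(const double *x, int n) {
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        double v = x[i], partial;
        MPI_Allreduce(&v, &partial, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        total += partial;           /* n communication steps in all         */
    }
    return total;
}

/* Coarse grain: n local computations, then a single communication step.   */
double coarse_grain_sum(const double *x, int n) {
    double local = 0.0, total;
    for (int i = 0; i < n; i++)
        local += x[i];              /* all work done locally                */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return total;                   /* one communication step in all        */
}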

25 Examples of Parallel Models
Synchronous PRAM (Parallel RAM): Each processor stores a copy of the program, and all execute instructions simultaneously on different data. Generally, the active processors execute the same instruction. The full power of the PRAM allows processors to execute different instructions simultaneously; here, "simultaneously" means they execute using the same clock. All processors share the same memory, and data exchanges occur through the shared memory. All processors have I/O capabilities. See Akl, Figure 1.2.

26 Examples of Parallel Models (cont)
Binary Tree of Processors (Akl, Fig. 1.3): No shared memory; memory is distributed equally among the PEs. PEs communicate through the binary tree network. Each PE stores a copy of the program, and PEs may execute different instructions. Operation is normally synchronous, but asynchronous operation is also possible. Only the leaf and root PEs have I/O capabilities.

27 Examples of Parallel Models (cont)
Symmetric Multiprocessors (SMPs): Asynchronous processors of the same size with shared memory. Memory locations are equally distant from each PE in smaller machines. Formerly called "tightly coupled multiprocessors".

28 Analysis of Parallel Algorithms
Elementary Steps: A computational step is a basic arithmetic or logic operation A routing step is used to move data of constant size between PEs using either shared memory or a link connecting the two PEs.

29 Communication Time Analysis
Worst case: usually the default for synchronous computation; almost impossible to use for asynchronous computation when executing multiple tasks; can be used for asynchronous SPMD programming when simulating synchronous computation. Average case: used occasionally in synchronous computation, in the same manner as in sequential computation; the usual default for asynchronous computation. It is important to distinguish which is being used, since "worst case" synchronous timings are often compared to "average case" asynchronous timings. The default in Akl's text is worst case; the default in Quinn's 2004 text is average case.

30 Shared Memory Communication Time
Memory accesses: a communication involves two memory accesses: P(i) writes a datum to a memory cell, and P(j) reads the datum from that memory cell. Uniform analysis: charge constant time for a memory access; this is unrealistic, as shown for the PRAM in Chapter 2 of Akl's book. Non-uniform analysis: charge O(lg M) cost for a memory access, where M is the number of memory cells; this is more realistic, based on how memory is actually accessed. It is traditional to use the uniform analysis, since considering a second variable M makes the analysis complex, and most non-constant time algorithms have the same complexity under either the uniform or the non-uniform analysis.

31 Communication through links
Assumes constant time to communicate along one link. An earlier paper (mid-1990s?) estimates a communication step along a link to be many times slower than a computational step. When two processors are not connected by a link, routing a datum between them requires O(m) time units, where m is the number of links along the path used.

32 Running Time
Running time of an algorithm: the time between when the first processor starts executing and when the last processor terminates; based on the worst case, unless stated otherwise. Running time bounds for a problem: an upper bound is given by the number of steps of the fastest algorithm known for the problem; the lower bound gives the fewest steps possible to solve a worst case of the problem. Time optimal algorithm: an algorithm whose number of steps has the same complexity as the lower bound for the problem.

33 Speedup
Speedup is a measure of the decrease in running time due to parallelism. Let t1 denote the running time of the fastest (possible/known) sequential algorithm for the problem. If the parallel algorithm uses p PEs, let tp denote its running time. The speedup of the parallel algorithm using p processors is defined by S(1,p) = t1/tp. The goal is to obtain the largest speedup possible. For worst (average) case speedup, both t1 and tp should be worst (average) case, respectively. There is a tendency to ignore these details for asynchronous computing involving multiple concurrent tasks. (A sketch of measuring speedup empirically follows.)
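A small sketch of measuring speedup empirically (an illustration; work() is a placeholder for any parallelizable computation, and OpenMP wall-clock timers are assumed): time the computation with one thread to estimate t1, time it again with p threads to get tp, and report S(1,p) = t1/tp.

/* Compile with -fopenmp.  The measured ratio approximates S(1,p).         */
#include <omp.h>
#include <stdio.h>

static double work(long n, int nthreads) {      /* placeholder computation  */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) num_threads(nthreads)
    for (long i = 0; i < n; i++)
        sum += (double)i * 0.5;
    return sum;
}

int main(void) {
    long n = 100000000L;
    int  p = omp_get_max_threads();

    double start = omp_get_wtime();
    double r1 = work(n, 1);                     /* sequential timing: t1     */
    double t1 = omp_get_wtime() - start;

    start = omp_get_wtime();
    double rp = work(n, p);                     /* parallel timing: tp       */
    double tp = omp_get_wtime() - start;

    printf("checksum: %g %g\n", r1, rp);        /* keep results live         */
    printf("p = %d   speedup S(1,p) = %.2f\n", p, t1 / tp);
    return 0;
}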

34 Optimal Speedup
Theorem: The maximum possible speedup for parallel computers with n PEs on "traditional" problems is n. Proof (for traditional problems): Assume a computation is partitioned perfectly into n processes of equal duration, and assume no overhead is incurred as a result of this partitioning of the computation (e.g., partitioning process, information passing, coordination of processes, etc.). Under these ideal conditions, the parallel computation executes n times faster than the sequential computation: the parallel running time is ts/n, so the parallel speedup of this computation is S(n,1) = ts/(ts/n) = n.

35 Linear Speedup
The preceding slide argues that S(n,1) ≤ n, and optimally S(n,1) = n. This speedup is called linear since S(n,1) = Θ(n). The next slide formalizes this statement and provides an alternate proof.

36 Speedup Folklore Theorem
Statement: For a given problem, the maximum speedup possible for a parallel algorithm using p processors is at most p; that is, S(1,p) ≤ p. An alternate proof (for traditional problems): Let t1 and tp be as above, and assume t1/tp > p. This parallel algorithm can be simulated by a sequential computer which executes each parallel step with p serial steps. The simulating sequential algorithm runs in p·tp time, which is faster than the fastest sequential algorithm since p·tp < t1. This contradiction establishes the "theorem". Note: We will see that this result is not valid for some non-traditional problems.

37 Speedup Usually Suboptimal
Unfortunately, the best speedup possible for most applications is much less than n, since the assumptions in the proof are usually invalid. Usually some portions of the algorithm are sequential, with only one PE active. Other portions of the algorithm may allow a non-constant number of PEs to be idle for more than a constant number of steps; e.g., during parts of the execution, many PEs may be waiting to receive or to send data.

38 Speedup Example
Example: Compute the sum of n numbers using a binary tree with n leaves. It takes tp = lg n steps to compute the sum in the root node, and t1 = n - 1 steps to compute the sum sequentially. The number of processors is p = 2n - 1, so for n > 1,
S(1,p) = n/(lg n) < n < 2n - 1 = p
The solution is not optimal.

39 Optimal Speedup Example
Example: Compute the sum of n numbers optimally using a binary tree with n/(lg n) leaves. Each leaf is given lg n numbers and computes the sum of these numbers in the first step. The overall sum of the values in the n/(lg n) leaves is found in the next step, as in the previous example, in lg(n/lg n) or Θ(lg n) time. Then tp = Θ(lg n) and S(1,p) = Θ(n/lg n) = cp for some constant c, since p = 2(n/lg n) - 1. This parallel algorithm and speedup are optimal, as the Speedup Folklore Theorem is valid for traditional problems like this one. (A message-passing sketch of this idea follows.)
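A message-passing sketch of the same idea (illustrative only, not Akl's tree algorithm verbatim): each of the p processes first sums its own block of roughly n/p values, and the p partial sums are then combined by MPI_Reduce, whose internal combining pattern is tree-like with depth about lg p.

#include <mpi.h>

/* Local phase: about n/p additions per process.
   Combining phase: a tree-like reduction of the p partial sums.           */
double block_tree_sum(const double *block, int block_len) {
    double partial = 0.0, total = 0.0;

    for (int i = 0; i < block_len; i++)
        partial += block[i];                    /* local sums, in parallel  */

    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    return total;    /* valid only on rank 0, the "root" of the reduction   */
}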

40 Superlinear Speedup
Superlinear speedup occurs when S(n) > n. Most texts besides Akl's and Quinn's argue that linear speedup is the maximum speedup obtainable; the preceding "optimal speedup proof" is used to argue that superlinearity is impossible since linear speedup is optimal. Those not accepting superlinearity claim that any 'superlinear behavior' can be explained by reasons of the following type: the extra memory in the parallel system; a sub-optimal sequential algorithm was used; random luck, e.g., in the case of an algorithm that has a random aspect in its design (e.g., random selection).

41 Superlinearity (cont)
Some problems cannot be solved without the use of parallel computation, and it seems reasonable to consider parallel solutions to these to be "superlinear". Examples include many large software systems; data too large to fit in the memory of a sequential computer; and problems with deadlines that must be met, where the required computation is too great for a sequential computer to meet the deadlines. Those who do not accept superlinearity would discount these, as indicated on the previous slide.

42 Some Non-Traditional Problems
Problems with any of the following characteristics are "non-traditional": Input may arrive in real time. Input may vary over time; for example, find the convex hull of a set of points, where points can be added or removed over time. Output may affect future input. Output may have to meet certain deadlines. See the discussion in Akl's text, and Chapter 12 of Akl's text for additional discussion.

43 Other Superlinear Examples
Selim Akl has given many examples of different types of superlinear algorithms: Example 1.17 in Akl's online textbook (pg 17), the last chapter (Ch 12) of his online text, and a number of conference and journal publications.

44 A Superlinearity Example
Example: Tracking. 6,000 streams of radar data arrive during each 0.5 second, and each corresponds to an aircraft. Each must be checked against the "projected position box" for each of 4,000 tracked aircraft. The position of each plane that is matched is updated to correspond with its radar location. This process must be repeated every 0.5 seconds. A sequential processor cannot process data this fast.

45 Slowdown Folklore Theorem
Statement: If a computation can be performed with p processors in time tp, and with q processors in time tq, where q < p, then tp ≤ tq ≤ (1 + p/q)·tp. Comments: This result puts a tight bound on the amount of extra time required to execute an algorithm with fewer PEs. It establishes a smooth degradation of running time as the number of processors decreases. For q = 1, this is essentially the Speedup Folklore Theorem, but with a less sharp constant.

46 Proof of the Slowdown Folklore Theorem
Proof (for traditional problems): Let Wi denote the number of elementary steps performed collectively during the i-th time period of the algorithm (using p PEs). Then W = W1 + W2 + ... + Wk (where k = tp) is the total number of elementary steps performed collectively by the processors. Since all p processors are not necessarily busy all the time, W ≤ p·tp. With q processors and q < p, we can simulate the Wi elementary steps of the i-th time period by having each of the q processors do at most ⌈Wi/q⌉ elementary steps. The time taken for the simulation is
tq ≤ ⌈W1/q⌉ + ⌈W2/q⌉ + ... + ⌈Wk/q⌉ ≤ (W1/q + 1) + (W2/q + 1) + ... + (Wk/q + 1) = W/q + k ≤ p·tp/q + tp = (1 + p/q)·tp

47 A Non-Traditional Example for “Folklore Theorems”
Problem statement (from Akl, Example 1.17): n sensors are each used to repeat the same set {s1, s2, ..., sn} of measurements n times at an environmental site. It takes each sensor n time units to make each measurement. Each si is measured in turn by a different sensor, producing n distinct cyclic permutations of the sequence S = (s1, s2, ..., sn).

48 Nontraditional Example (cont.)
For example, the sequences produced by 3 sensors are (s1, s2, s3), (s3, s1, s2), and (s2, s3, s1). The stream of data from each sensor is sent to a different processor, with all corresponding si values in each stream arriving at the same time. Each si must be read and stored in unit time, or its stream is terminated. Additionally, the minimum of the n new data values that arrive during each n steps must be computed, validated, and stored.

49 Sequential Processor Solution
A single processor can only monitor the values in one data stream. After the first value of a selected stream is stored, it is too late to process the other n - 1 items that arrived at the same time in the other streams. A stream remains active only if its preceding value was read; note this is the best we can do to solve the problem sequentially. We assume that a processor can read a value, compare it to the smallest previous value received, and update its "smallest value" in one unit of time. Sequentially, the computation takes n computational steps and n(n - 1) time units of delay between arrivals, or exactly n² time units. Summary: calculating the first minimum sequentially takes n² time units.

50 Binary Tree Solution using n Leaves
Assume a complete binary tree with n leaf processors and a total of 2n - 1 processors. Each leaf receives the input from a different stream, and the inputs from each stream are sent up the tree. Each interior PE receives two inputs, computes the minimum, and sends the result to its parent. It takes lg n time to compute the minimum and store the answer at the root. The running time of the parallel solution for the first round (i.e., the first minimum) is lg n, using 2n - 1 = Θ(n) processors.

51 Counterexample to Speedup "Folklore Theorem"
The sequential solution to the previous example required n² steps. The parallel solution took lg n steps, using a binary tree with n leaves and p = 2n - 1 = Θ(n) processors. The speedup is S(1,p) = t1/tp = n²/lg n, which is asymptotically larger than p, the number of processors. Akl calls this speedup parallel synergy.

52 Binary Tree Solution using √n Leaves
Assume a complete binary tree with √n leaf processors and a total of q = 2√n - 1 processors. The √n leaf processors monitor √n streams whose first √n values are distinct from each other; this requires that the order of data arrival in the streams be known in advance. Note that this is the best we can do to solve the original problem with q processors. Consecutive data in each stream are separated by n time units, so each leaf PE takes (√n - 1)n time in the first phase to see the first √n values in its stream and to compute their minimum. The minimum of the leaf "minimum values" is then computed by the binary tree in lg(√n) = ½(lg n) steps. The running time of the parallel solution using q = 2√n - 1 PEs for the first round is (√n - 1)n + ½(lg n).

53 Counterexample to Slowdown "Folklore Theorem"
The parallel solution to the previous example with q = 2√n - 1 = Θ(√n) PEs had tq = (√n - 1)n + ½(lg n). The parallel solution using a binary tree with n leaves and p = 2n - 1 = Θ(n) PEs took tp = lg n. The Slowdown Folklore Theorem states that tp ≤ tq ≤ (1 + p/q)·tp. But
(p/q)·tp = [(2n - 1)/(2√n - 1)]·(lg n) = Θ(√n lg n)
tq = (√n - 1)n + ½(lg n) = Θ(n^(3/2))
so tq/tp is asymptotically greater than p/q, which contradicts the Slowdown "Folklore Theorem". Akl also calls this behavior parallel synergy.

54 Cost
The cost of a parallel algorithm is defined by Cost = (running time) × (number of PEs) = tp × n. Cost allows the performance of parallel algorithms to be compared to that of sequential algorithms; the cost of a sequential algorithm is simply its running time. The advantage that parallel algorithms gain from using multiple processors is removed by multiplying their running time by the number n of processors used. If a parallel algorithm requires exactly 1/n of the running time of a sequential algorithm, then its cost equals the sequential running time.

55 Cost Optimal
A parallel algorithm for a problem is cost-optimal if its cost is proportional to the running time of an optimal sequential algorithm for the same problem. By proportional, we mean that cost = tp × n = k × ts for some constant k. Equivalently, a parallel algorithm with cost C is cost optimal if there is an optimal sequential algorithm with running time ts and C = Θ(ts). If no optimal sequential algorithm is known, then the cost of a parallel algorithm is usually compared to the running time of the fastest known sequential algorithm instead.

56 Cost Optimal Example
Recall the speedup example of a binary tree with n/(lg n) leaf PEs used to compute the sum of n values in Θ(lg n) time.
C(n) = p(n) × t(n) = [2(n/lg n) - 1] × Θ(lg n) = Θ((n/lg n) × lg n) = Θ(n)
An optimal sequential algorithm requires n - 1 = Θ(n) steps, so this parallel algorithm is cost optimal.

57 Another Cost Optimal Example
Sorting has a sequential lower bound of Ω(n lg n) steps; this assumes the algorithm is general purpose (comparison-based) and not tailored to a specific machine. A parallel sorting algorithm using Θ(n) PEs will therefore require at least Ω(lg n) time, so a parallel algorithm using n PEs and O(lg n) time is cost optimal. Similarly, a parallel sorting algorithm using Θ(lg n) PEs will require at least Ω(n) time.

58 Work
Defn: The work of a parallel algorithm is the sum of the individual steps executed by all the processors. While idle processor time is included in cost, only active processor time is included in work. Work indicates the actual steps that a sequential computer would have to take in order to simulate the actions of the parallel computer. Observe that the cost is an upper bound for the work. While this definition of work is fairly standard, a few authors use our definition of "cost" to define "work".

59 Algorithm Efficiency
Efficiency is defined by E(1,p) = S(1,p)/p = t1/(p·tp). Efficiency gives the fraction of full utilization of the parallel processors achieved by the computation. Note that for traditional problems, the maximum speedup when using n processors is n, so the maximum efficiency is 1; that is, for traditional problems, efficiency ≤ 1. For non-traditional problems, algorithms may have speedup > n and efficiency > 1.

60 Amdahl's Law
Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup S(n) achievable by a parallel computer with n processors is
S(n) ≤ 1/(f + (1 - f)/n)
Note: Amdahl's law holds only for traditional problems.

61 Proof: If the fraction of the computation that cannot be divided into concurrent tasks is f, and no overhead is incurred when the computation is divided into concurrent parts, then the time to perform the computation with n processors is given by
tp = f·ts + [(1 - f)·ts]/n

62 Proof of Amdahl's Law (cont.)
Using the preceding expression for tp,
S(n) = ts/tp = ts/(f·ts + (1 - f)·ts/n) = 1/(f + (1 - f)/n)
The last expression is obtained by dividing the numerator and denominator by ts, which establishes Amdahl's law. Multiplying the numerator and denominator by n produces the following alternate version of this formula:
S(n) ≤ n/(n·f + 1 - f)

63 Comments on Amdahl's Law
The preceding proof assumes that speedup cannot be superlinear, i.e., S(n) = ts/tp ≤ n. Question: Where was this assumption used? Amdahl's law is not valid for non-traditional problems where superlinearity can occur. Note that S(n) never exceeds 1/f, but approaches 1/f as n increases. The conclusion of Amdahl's law is sometimes stated as S(n) ≤ 1/f.

64 Limits on Parallelism
Note that Amdahl's law places a very strict limit on parallelism. Example: If 5% of the computation is serial (f = 0.05), then its maximum speedup is at most 20, no matter how many processors are used, since S(n) ≤ 1/f = 1/0.05 = 20. Initially, Amdahl's law was viewed as a fatal flaw to any significant future for parallelism.
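A small helper (written for these notes, not taken from the texts) that evaluates the Amdahl bound makes this limit visible: with f = 0.05 the speedup creeps toward, but never exceeds, 20 as n grows.

#include <stdio.h>

double amdahl_speedup(double f, int n) {        /* f = serial fraction       */
    return 1.0 / (f + (1.0 - f) / n);           /* S(n) <= 1/(f + (1-f)/n)   */
}

int main(void) {
    double f = 0.05;                            /* 5% of the code is serial  */
    for (int n = 10; n <= 10000; n *= 10)
        printf("n = %5d   S(n) <= %6.2f\n", n, amdahl_speedup(f, n));
    printf("limit as n grows: 1/f = %.1f\n", 1.0 / f);
    return 0;
}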

65 Applications of Amdahl’s Law
Amdahl's law shows that efforts to further reduce the fraction of the code that is sequential may pay off in large performance gains. It also shows that hardware that achieves even a small decrease in the percentage of work executed sequentially may be considerably more efficient. Amdahl's law can sometimes be used to increase the efficiency of parallel algorithms; e.g., see Jordan & Alaghband's textbook.

66 Amdahl's & Gustafson's Laws
A key flaw in past arguments that Amdahl's law is a fatal flaw to the future of parallelism is addressed by Gustafson's Law: the proportion of the computation that is sequential normally decreases as the problem size increases. Gustafson's law is a "rule of thumb" or general principle, not a theorem that has been proved. Other limitations in applying Amdahl's law: its proof focuses on the steps in a particular algorithm, and does not consider that other algorithms with a higher percentage of parallelism may exist; and Amdahl's law applies only to 'traditional problems', where superlinearity cannot occur.
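For reference, a standard way to quantify Gustafson's observation (this formula is not stated on the slides): if s is the fraction of the parallel execution time spent in serial code when n processors are used, the scaled speedup is S_scaled(n) = n - s·(n - 1), which stays close to n whenever s is small. This is consistent with the point above that the sequential proportion shrinks as the problem size grows.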

67 Introduction and General Concepts
End of Chapter One Introduction and General Concepts

