Introduction to Parallel Architectures
Dr. Laurence Boxer, Niagara University
CIS, December 1999
Parallel Computers
Purpose: speed. Divide a problem among processors and let each processor work on its portion of the problem in parallel (simultaneously) with the other processors.
Ideal: if p is the number of processors, get the solution in 1/p of the time used by a computer with 1 processor.
Actual: we rarely get that much speedup, due to delays for interprocessor communications.
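The ideal and actual cases are commonly summarized by the speedup measure; a minimal statement (the symbols T_1 and T_p are introduced here for illustration):

\[
S(p) \;=\; \frac{T_1}{T_p} \;\le\; p
\]

where T_1 is the time used by a single processor and T_p is the time used by p processors; communication delays generally keep S(p) below the ideal value p.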
Graphs of relevant functions (figure)
Architectural issues
Communication diameter: how many communication steps are necessary to send data from the processor that has it to the processor that needs it; large is bad.
Bisection width: how many wires must be cut to cut the network in half; a measure of how fast massive amounts of data can be moved through the network; large is good.
Degree of network: important to scalability (the ability to expand the number of processors); large is bad.
Limitations on speed: communication diameter and bisection width. Limitation on expansion: degree of network.
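For quick reference, these measures for the three networks discussed below can be gathered in a short sketch (the mesh and hypercube values appear on later slides; assuming n is a perfect square for the mesh and a power of 2 for the hypercube):

import math

# Communication diameter, bisection width, and degree of the three
# interconnection networks discussed in these slides, for n processors.
def network_measures(n):
    return {
        "linear array": {"diameter": n - 1,
                         "bisection": 1,
                         "degree": 2},
        "mesh":         {"diameter": 2 * (math.isqrt(n) - 1),
                         "bisection": math.isqrt(n),
                         "degree": 4},
        "hypercube":    {"diameter": int(math.log2(n)),
                         "bisection": n // 2,
                         "degree": int(math.log2(n))},
    }

for name, measures in network_measures(16).items():
    print(name, measures)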
PRAM - Parallel Random Access Machine
Shared memory yields fast communications: the source processor writes data to memory, and the destination processor reads the data from memory.
Fast communications make this model the theoretical ideal for the fastest possible parallel algorithms for a given number of processors.
Impractical: too many wires if there are lots of processors.
Any processor can send data to any other processor in Θ(1) time.
Notice the tree structure of the previous algorithm (figure).
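That tree structure amounts to pairwise combining in rounds; a minimal sketch of the pattern (sequential Python, used only to illustrate the communication structure, not an actual parallel implementation):

# Tree-structured total: in each round, pairs of values are combined,
# so n values are reduced to a single total in ceil(log2 n) rounds.
def tree_total(values):
    level = list(values)
    rounds = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])   # combine a pair
        if len(level) % 2 == 1:                   # odd element passes through
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds

print(tree_total(range(8)))   # (28, 3): 3 rounds = log2(8)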
Linear array architecture
Degree of network: 2 - easily expanded.
Bisection width: 1 - can't move large amounts of data efficiently across the network.
Communication diameter: n-1 - won't perform global communication operations efficiently.
Total on a linear array
Assume 1 item per processor. The communication diameter implies a running time of Ω(n), and Θ(n) is achieved by passing a running sum down the array. Since Θ(n) is the time required to total n items on a RAM, there is no asymptotic benefit to using a linear array for this problem.
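A minimal simulation of that running-sum pattern (sequential Python; each loop iteration stands for one communication step along the array):

# Total on a linear array: processor 0 starts a running sum and passes it
# rightward; each processor adds its own item.  n-1 communication steps.
def linear_array_total(items):
    running = items[0]
    steps = 0
    for i in range(1, len(items)):
        running += items[i]   # processor i receives the sum and adds its item
        steps += 1            # one link traversed
    return running, steps

print(linear_array_total([3, 1, 4, 1, 5]))   # (14, 4): n-1 = 4 steps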
Input-based sorting on a linear array
The algorithm illustrated is a version of Selection Sort: each processor keeps the smallest value it sees and passes the others to the right. The time is proportional to the communication diameter, Θ(n).
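A sketch of that keep-the-smallest rule (items are fed into the left end of the array; sequential Python that compresses the step-by-step rippling, shown only to illustrate the idea):

# Each processor holds the smallest value it has seen and passes the
# larger value to its right neighbor.
def linear_array_selection_sort(items):
    n = len(items)
    held = [None] * n              # value currently held by each processor
    for x in items:                # feed items into the left end
        carry = x
        for p in range(n):         # value ripples right until it settles
            if held[p] is None:
                held[p] = carry
                break
            if carry < held[p]:    # processor keeps the smaller value
                held[p], carry = carry, held[p]
    return held

print(linear_array_selection_sort([5, 2, 7, 1]))   # [1, 2, 5, 7]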
Mesh architecture
Square grid of processors. Each processor is connected by a communication link to its N, S, E, W neighbors.
Degree of network: 4 - makes expansion easy: can introduce adjacent meshes and connect border processors.
Application: sorting
We could have the initial data all in the "wrong half" of the mesh, as shown. In 1 time unit, the amount of data that can cross into the correct half of the mesh is only Θ(√n) items (one per wire cut by the bisection). Since all n items must get to the correct half-mesh, the time required to sort is Ω(√n).
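The counting argument in one line (a sketch; it uses the fact that roughly n/2 items must cross a bisection of width √n):

\[
T_{\mathrm{sort}}(n) \;\ge\; \frac{n/2}{\sqrt{n}} \;=\; \frac{\sqrt{n}}{2} \;=\; \Omega\!\left(\sqrt{n}\right)
\]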
Broadcast in a mesh: send the value across the source processor's row, then, in parallel, down every column. In a mesh, each of these steps takes Θ(√n) time. Hence, the time for a broadcast is Θ(√n).
Semigroup operation (e.g., total) in a mesh
1. "Roll up" the columns in parallel, totaling each column in the last row by sending data downward.
2. Roll up the last row to get the total in a corner.
3. Broadcast the total from the corner to all processors.
Time: Θ(√n).
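A minimal simulation of the three phases (sequential Python; each loop iteration stands for one parallel communication step, so steps counts mesh time; the broadcast phase is only counted, not simulated):

# Semigroup (total) on a rows x cols mesh, following the phases above.
def mesh_total(grid):
    rows, cols = len(grid), len(grid[0])
    g = [row[:] for row in grid]
    steps = 0
    # Phase 1: roll every column downward into the last row (in parallel).
    for r in range(rows - 1):
        for c in range(cols):
            g[r + 1][c] += g[r][c]
        steps += 1
    # Phase 2: roll the last row rightward into the corner processor.
    for c in range(cols - 1):
        g[rows - 1][c + 1] += g[rows - 1][c]
        steps += 1
    total = g[rows - 1][cols - 1]
    # Phase 3: broadcast from the corner (a row pass, then column passes).
    steps += (cols - 1) + (rows - 1)
    return total, steps

print(mesh_total([[1, 2], [3, 4]]))   # (10, 4): Theta(sqrt(n)) steps for n = 4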
Mesh total algorithm - continued
The previous algorithm could run in approximately half the time by gathering the total in a center processor rather than a corner processor. However, the running time is still Θ(√n), i.e., still approximately proportional to the communication diameter (with a smaller constant of proportionality).
Hypercube
The number n of processors is a power of 2. Processors are numbered from 0 to n-1. Connected processors are those whose binary labels differ in exactly 1 bit.
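The connection rule can be written directly as a bitwise XOR; a small sketch:

# Neighbors of a hypercube processor: flip each of the log2(n) label bits.
def hypercube_neighbors(label, n):
    d = n.bit_length() - 1                       # dimension: log2(n), n a power of 2
    return [label ^ (1 << i) for i in range(d)]

print(hypercube_neighbors(5, 8))   # processor 101 -> [4, 7, 1] = 100, 111, 001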
Illustration of the total operation in a hypercube (figure). Reverse the direction of the arrows to broadcast the result. Time: Θ(log n).
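One common way to sketch this pattern is to combine across one dimension per round; the version below exchanges in both directions, so every processor ends up with the total and no separate broadcast phase is needed (sequential Python, for illustration only):

# Total on a hypercube: one communication round per dimension; after
# d = log2(n) rounds, every processor holds the total.
def hypercube_total(values):
    n = len(values)                    # n must be a power of 2
    d = n.bit_length() - 1
    vals = list(values)
    for i in range(d):
        nxt = vals[:]
        for p in range(n):
            nxt[p] = vals[p] + vals[p ^ (1 << i)]   # exchange with dimension-i neighbor
        vals = nxt
    return vals

print(hypercube_total([1, 2, 3, 4, 5, 6, 7, 8]))   # [36, 36, ..., 36] after 3 rounds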
Coarse-grained parallelism
Most of the previous discussion was of fine-grained parallelism, in which the number of processors is comparable to the number of data items. Realistically, few budgets accommodate such expensive computers; it is more likely that we use coarse-grained parallelism, with relatively few processors compared with the number of data items.
Coarse-grained algorithms are often based on each processor boiling its share of the data down to a single partial result, then using a fine-grained algorithm to combine these partial results.
Example: coarse-grained total
Suppose the n data items are distributed evenly (n/p per processor) among the p processors.
1. In parallel, each processor totals its share of the data. Time: Θ(n/p).
2. Use a fine-grained algorithm to add the partial sums (the total residing in one processor) and broadcast the result to all processors. In the case of a mesh, time: Θ(√p).
Total time for a mesh: Θ(n/p + √p). Since n/p ≥ √p whenever each processor holds at least √p items, this is Θ(n/p), which is optimal.
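A minimal sketch of the two steps (sequential Python; a tree combine stands in for whatever fine-grained total the architecture provides, such as the mesh algorithm above):

# Coarse-grained total: p processors, n/p items each.
def coarse_grained_total(data, p):
    share = len(data) // p                     # assume p divides n evenly
    # Step 1: local totals, Theta(n/p) time in parallel.
    partial = [sum(data[i * share:(i + 1) * share]) for i in range(p)]
    # Step 2: combine the p partial sums with a fine-grained algorithm.
    rounds = 0
    while len(partial) > 1:
        partial = [sum(partial[i:i + 2]) for i in range(0, len(partial), 2)]
        rounds += 1
    return partial[0], share, rounds           # total, local work, combine rounds

print(coarse_grained_total(list(range(16)), 4))   # (120, 4, 2)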
More info:
Algorithms Sequential and Parallel, by Russ Miller and Laurence Boxer, Prentice-Hall, 2000 (available December 1999).