Decomposition Data Decomposition Functional Decomposition

Decomposition
Data Decomposition
    Dividing the data into subgroups and assigning each piece to a different processor
    Example: Embarrassingly parallel applications
Functional Decomposition
    Dividing an algorithm into its functional pieces and executing the pieces on separate processors
    Example: Pipelining

Pipelined Computations
Divide a problem into a series of tasks.
A processor completes a task sequentially and pipes the results to the next processor.
Example of Summing Groups of Numbers
[Figure: pipeline P0 → P1 → P2 → P3 → P4 → P5. A total of zero enters P0; the stages add their group sums ∑A[i0], ∑A[i1], ..., ∑A[i5] in turn.]
Question: Is this data decomposition or functional decomposition?

Where is Pipelining Applicable?
Type 1: More than one instance of a problem
    Example: Multiple simulations with different parameter settings
Type 2: Series of data items with multiple operations
    Example: Signal filter or Sieve of Eratosthenes
Type 3: Partial results passed on while processing continues
    Example: Solving sets of linear equations
Considerations
    Is there a series of sequential tasks?
    Is the processing of each task approximately equal?
    Can items be grouped to minimize communication cost?
    If stages exceed processors
        Group stages
        Wrap last stage back to the first
    Determine where the result will be at the end of the process
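The two options for handling more stages than processors can be illustrated with a small mapping sketch. The function names `group_stages` and `wrap_stages` are hypothetical, and the block size used for grouping (ceiling of stages over processors) is an assumption; the slides do not prescribe a particular grouping.

```python
def group_stages(num_stages, num_procs):
    """Block mapping: consecutive stages share one processor."""
    per_proc = -(-num_stages // num_procs)  # ceiling division
    return [stage // per_proc for stage in range(num_stages)]


def wrap_stages(num_stages, num_procs):
    """Wrapped mapping: the stage after the last processor returns to the first."""
    return [stage % num_procs for stage in range(num_stages)]


if __name__ == "__main__":
    print(group_stages(8, 3))
    print(wrap_stages(8, 3))
```

Grouping keeps neighbouring stages on the same processor, so fewer messages cross processor boundaries; wrapping sends every stage hand-off over the network but spreads the stages evenly.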

Summing Numbers Example
Process P0
    send(&number, P1);
Process Pi, 0 < i < N-1
    recv(&sum, Pi-1);
    sum += number;
    send(&sum, Pi+1);
Process PN-1
    recv(&sum, PN-2);
    sum += number;
    Save or display the result
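The send/receive pattern above can be tried out on one machine. The following is a minimal sketch, assuming Python threads stand in for processors and `queue.Queue` objects stand in for the links between them; `pipeline_sum` is a hypothetical name, and a real implementation would use a message-passing library instead.

```python
import queue
import threading


def pipeline_sum(numbers):
    """Simulate the summing pipeline; stage i owns numbers[i]."""
    n = len(numbers)
    if n == 0:
        return 0
    # links[i] carries the running sum from stage i to stage i + 1.
    # The last link hands the final total back to the caller.
    links = [queue.Queue() for _ in range(n)]

    def stage(i):
        if i == 0:
            total = numbers[0]                        # P0 has nothing to receive
        else:
            total = links[i - 1].get() + numbers[i]   # recv, then add own number
        links[i].put(total)                           # send to the next stage

    threads = [threading.Thread(target=stage, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return links[n - 1].get()


if __name__ == "__main__":
    print(pipeline_sum([3, 1, 4, 1, 5, 9]))  # prints 23
```

Each stage blocks on its incoming queue until its predecessor has sent, which mirrors the blocking `recv` in the pseudocode.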

Application Remove frequencies from a signal
Sequential Algorithm: Fourier analysis, O(N lg N)
Parallel: Apply filters to the signal with convolution, O(N*FilterLength)
Filter Examples: Chebyshev, Butterworth, etc.
Derive filter: Set z-domain poles and zeroes, perform the inverse transformation.
Filters can be used to manipulate signals, detect patterns, etc.
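The O(N*FilterLength) cost of filtering by convolution comes from one multiply-add per filter coefficient per sample. The sketch below shows direct convolution only; the two-tap averaging filter in the usage lines is a hypothetical stand-in, not a Chebyshev or Butterworth design, and `fir_filter` is an illustrative name.

```python
def fir_filter(signal, taps):
    """Direct convolution: out[n] = sum over k of taps[k] * signal[n - k]."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:            # samples before the start count as zero
                acc += tap * signal[n - k]
        out.append(acc)
    return out


if __name__ == "__main__":
    # Two-tap moving average smooths the ramp 1, 2, 3, 4.
    print(fir_filter([1, 2, 3, 4], [0.5, 0.5]))
```

In a pipeline, each processor would run one such filter and pass its output samples to the next processor.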

Chebyshev Filter Design
[Figures: Chebyshev in the z-domain; Chebyshev frequency response]
Note: Depending on the placement of the poles (+) and zeroes (0), the filter will affect a signal differently.

Type 1: Multiple Instances
[Figure: space-time diagram. Instances 1 through 5 move through processors P0 to P5 as time advances.]
Notation
    m = instances, p = processors
    tstart = message startup time (latency)
    tdata = transmission time per data item (set by the bandwidth)
    n = data items transmitted per instance
    tm = total time to process an instance
Total pipeline cycles = m + p – 1
Assume: Equal processing per stage
Sequential execution: t1 = m*tm
Parallel processing: (m + p – 1)*tm/p
Parallel communication: (m + p – 1)*(tstart + n*tdata)
Parallel execution: tp = (m + p – 1)*(tm/p + tstart + n*tdata)
Speed-up: t1/tp = m*tm/((m + p – 1)*(tm/p + tstart + n*tdata))
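To get a feel for the timing model above, the speed-up expression can be evaluated directly. This is a sketch under the same assumption of equal processing per stage; the function name and the sample values are illustrative, not measurements.

```python
def pipeline_speedup(m, p, t_m, t_start, n, t_data):
    """Speed-up of a type 1 pipeline: sequential time over pipelined time."""
    sequential = m * t_m
    cycles = m + p - 1                        # total pipeline cycles
    per_cycle = t_m / p + t_start + n * t_data
    return sequential / (cycles * per_cycle)


if __name__ == "__main__":
    # Hypothetical numbers: 100 instances, 6 processors, no communication cost.
    print(pipeline_speedup(100, 6, 6.0, 0.0, 0, 0.0))
    # Same workload with a communication cost per cycle.
    print(pipeline_speedup(100, 6, 6.0, 0.5, 10, 0.01))
```

With negligible communication the expression reduces to m*p/(m + p – 1), which approaches p as the number of instances grows; a single instance gives no speed-up at all.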

Type 2: Multiple Data Elements
[Figure: data items d9, d8, ..., d0 stream into a pipeline of six processors.]
Example: Signal Filter
Each process removes one or more frequencies from a digitized signal.
[Figure: unfiltered signal → P0 (filter f0) → P1 (filter f1) → P2 (filter f2) → P3 (filter f3) → P4 (filter f4) → P5 (filter f5) → filtered signal]

Type 2 Timing Diagram

Type 3: Partial Processing
The next stage receives information to continue processing.
Additional processing continues at the source processor.
[Figure: space-time diagrams for P0 to P5, with each slot marked idle or executing. Panels: Linear Equations; A More Balanced Load.]
Question: How do we determine speed-up?

Operation at each processor Types 1 and 2
Processor with rank r = 0
    Generate the instance (type 1) or the data (type 2) to process
    Process appropriately
    Send message to the processor with rank 1
Processors with rank r = 1, 2, ..., p-2
    Receive message from the processor with rank r-1
    Process appropriately
    Send message to the processor with rank r+1
Processor with rank r = p-1
    Receive message from the processor with rank r-1
    Process appropriately
    Output final results
Examples
    Adding numbers: n1 -> n1+n2 -> n1+n2+n3 -> . . .
    Frequency removal: f(t) -> f0; f(t-f0) -> f1; f(t-f0-f1) -> . . .

Parallel Pipeline Sort
[Figure: space-time diagram, steps 1 to 10. The numbers 4, 3, 1, 2, 5 enter a pipeline of processors P0 to P4 one at a time.]
Pseudo code
    Receive xi
    IF xi < xmax
        Send xi
    ELSE
        Send xmax
        xmax = xi
Note: Processors can hold blocks of numbers for better efficiency
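The keep-the-larger, forward-the-smaller rule in the pseudo code can be checked with a sequential simulation. This sketch assumes one number per stage and as many stages as numbers; it runs the stages one after another rather than in parallel, and `pipeline_sort` is a hypothetical name.

```python
def pipeline_sort(values):
    """Simulate the pipeline sort; stage k ends up holding the k-th largest value."""
    held = []  # held[k] is the value currently kept by stage k
    for x in values:
        # Pass x down the pipeline one stage at a time.
        for k in range(len(held)):
            if x > held[k]:
                # Stage k keeps the larger value and forwards its old one.
                held[k], x = x, held[k]
        # The first stage that holds nothing yet keeps whatever arrives.
        held.append(x)
    return held


if __name__ == "__main__":
    print(pipeline_sort([4, 3, 1, 2, 5]))  # prints [5, 4, 3, 2, 1]
```

In the real pipeline the inner loop is spread across processors, so a new number can enter stage 0 while earlier numbers are still moving through later stages.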

Bi-Directional Pipeline
Use the pipeline to return results to the master.
Useful for line, ring, or hypercube topologies.
[Figure: space-time diagram of processors P0 to P5 showing a sorting phase followed by a gather phase.]
Example: Sorting
Sort phase
    If (myid == 0) generate number
    Else receive(&number, P[myid-1])
    If (number > max and myid < P-1) { save min(number, max); send(max(number, max)) }
Gather phase
    If (myid < P-1) receive sorted numbers from P[myid+1]
    If (myid > 0) send sorted numbers to P[myid-1]
Phases: N (generate steps) + N-1 (propagate steps) + N-1 (return steps) = 3N-2

Sieve of Eratosthenes

Prime Number Generation Sieve of Eratosthenes (Type 2 pipeline)
Concept
    Each processor filters blocks of non-primes from the flow of data.
    The “potential” prime numbers pass through to the next processor.
Pseudo-code
    The master processor generates an array of n odd numbers
    In a loop, after receiving a group of numbers
        Filter the group; pass the unfiltered numbers down the pipeline
    Gather all of the primes
Notes
    Wrapping the pipeline in a ring could help maintain load balance.
    A termination message determines when the pipeline empties.
Question: What range of numbers should each processor get?

Implementation
Sequential code
    for (i = 2; i < n; i++)
        prime[i] = 1;
    for (i = 2; i <= sqrt_n; i++)
        if (prime[i] == 1)
            for (j = i + i; j < n; j = j + i)
                prime[j] = 0;
Parallel code, processor pi > 0
    recv(number, rank-1);
    PRIME = TRUE;
    FOR (int x = MIN; x < MAX; x++)    // try every divisor in this processor's block [MIN, MAX)
        IF ((number % x) == 0) PRIME = FALSE and BREAK
    IF (PRIME) send(number, rank+1);
Termination
    recv(number, rank-1);
    send(number, rank+1);
    IF (number == terminator) break;
Sequential Time: O(n^2) (a loose upper bound)
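The filtering pipeline can be imitated with chained Python generators, where each generator plays the role of one stage. This sketch assumes each stage filters the multiples of a single prime instead of a block of divisors from MIN to MAX, and that exhausting the stream stands in for the terminator message; the function names are hypothetical. Because the generators nest one level per prime, it is only suitable for small n.

```python
def candidates(n):
    """Master stage: emit 2, then the odd numbers up to n."""
    if n >= 2:
        yield 2
    yield from range(3, n + 1, 2)


def drop_multiples(prime, stream):
    """One pipeline stage: forward only numbers not divisible by this stage's prime."""
    for number in stream:
        if number % prime != 0:
            yield number


def pipeline_primes(n):
    """Collect primes up to n by adding one filter stage per prime found."""
    primes = []
    stream = candidates(n)
    while True:
        prime = next(stream, None)   # None plays the role of the terminator
        if prime is None:
            break
        primes.append(prime)
        stream = drop_multiples(prime, stream)
    return primes


if __name__ == "__main__":
    print(pipeline_primes(50))
```

The first number to survive every existing stage must be prime, so it becomes the divisor for a new stage at the end of the chain.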

Upper Triangular Matrix
All entries below the diagonal are zero.
Useful for solving N equations in N unknowns.

Solving Sets of Linear Equations
This is a type 3 pipeline example.
General solution for x_i
    x_i = (b_i – ∑_{j=0}^{i-1} a_{i,j} x_j) / a_{i,i}
Sequential code
    x[0] = b[0]/a[0][0];
    FOR (i = 1; i < n; i++) {
        sum = 0;
        FOR (j = 0; j < i; j++)
            sum += a[i][j]*x[j];
        x[i] = (b[i] - sum)/a[i][i];
    }
Upper Triangular Form
    a_{n-1,0} x_0 + a_{n-1,1} x_1 + … + a_{n-1,n-1} x_{n-1} = b_{n-1}
    a_{n-2,0} x_0 + a_{n-2,1} x_1 + … + a_{n-2,n-2} x_{n-2} = b_{n-2}
    …
    a_{1,0} x_0 + a_{1,1} x_1 = b_1
    a_{0,0} x_0 = b_0
Back Substitution
    x_0 = b_0 / a_{0,0}
    x_1 = (b_1 – a_{1,0} x_0) / a_{1,1}
    x_2 = (b_2 – a_{2,0} x_0 – a_{2,1} x_1) / a_{2,2}
Parallel code for p_i where 1 <= i < n (the last processor skips the sends)
    sum = 0;
    FOR (j = 0; j < i; j++) {
        receive(&x[j], p_{i-1});
        send(&x[j], p_{i+1});
        sum += a[i][j]*x[j];
    }
    x[i] = (b[i] - sum)/a[i][i];
    send(&x[i], p_{i+1});
Note: a_{i,j} and b_i are constants
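The substitution scheme can be checked on a small system. The sketch below gives the sequential version and a single-process simulation of the pipeline in which stage i receives x[0] through x[i-1] in order, forwards each one, and then appends its own x[i]. The function names and the 3x3 example are hypothetical, and the stages run one after another here rather than overlapping in time.

```python
def back_substitute(a, b):
    """Sequential substitution for a system whose row i uses only x[0..i]."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        total = sum(a[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - total) / a[i][i]
    return x


def pipelined_solve(a, b):
    """Simulate the pipeline: each stage forwards what it receives, then adds x[i]."""
    incoming = []                      # values arriving at the current stage
    for i in range(len(b)):
        total = 0.0
        outgoing = []
        for j, xj in enumerate(incoming):
            outgoing.append(xj)        # forward first so later stages can proceed
            total += a[i][j] * xj
        xi = (b[i] - total) / a[i][i]
        outgoing.append(xi)            # send the new unknown down the pipeline
        incoming = outgoing
    return incoming


if __name__ == "__main__":
    a = [[2, 0, 0], [1, 3, 0], [4, -1, 5]]
    b = [4, 11, 20]
    print(back_substitute(a, b))   # prints [2.0, 3.0, 3.0]
    print(pipelined_solve(a, b))   # prints [2.0, 3.0, 3.0]
```

Stage i does i multiply-adds, so later stages carry more work than earlier ones, which is the load-balance concern this part of the deck raises.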

Pipeline Solution
DO
    IF p ≠ master, receive xj from previous processor
    IF p ≠ P-1, send xj to next processor
    back substitute xj
UNTIL xi evaluated
IF p ≠ P-1, send xi to the next processor
Notes: Processing continues after sending values down the pipeline.
Is the load imbalanced?

Illustration of Type 3 Solution
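[Figure: P0 computes x0 and passes it to P1, which computes x1, then P2 computes x2 and P3 computes x3, each forwarding the values it has received. A space-time diagram shows the activity of P0 to P5 over time.]
Question: How balanced is this load?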
