Embarrassingly Parallel (or pleasantly parallel)

Presentation transcript:

Embarrassingly Parallel (or pleasantly parallel)
Definition: problems that scale well to thousands of processors.
Characteristics:
- The domain is divisible into a large number of independent parts.
- Little or no communication between processors.
- Each processor performs the same calculation independently.
"Nearly embarrassingly parallel":
- Communication is limited to distributing and gathering the data.
- Computation dominates the communication.

Embarrassingly Parallel Examples
[Diagram: an embarrassingly parallel application in which processes P0-P3 work independently, and a nearly embarrassingly parallel application in which a master sends data to and receives data from processes P1-P3.]

Low-Level Image Processing
Note: this does not include communication to a graphics adapter.
Storage:
- A two-dimensional array of pixels; one bit, one byte, or three bytes may represent each pixel.
- Operations may involve only local data.
Image operations:
- Shift: newX = x + deltaX; newY = y + deltaY
- Scale: newX = x * scale; newY = y * scale
- Rotate about the origin: newX = x cos φ + y sin φ; newY = -x sin φ + y cos φ
- Clip: newX = x if minX <= x <= maxX, 0 otherwise; newY = y if minY <= y <= maxY, 0 otherwise
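
A minimal C sketch of the shift, scale, and rotate operations above, applied to a single coordinate; the Point type and the function names are illustrative, not from the slides.

#include <math.h>

/* Illustrative point transforms for low-level image processing. */
typedef struct { double x, y; } Point;

Point shift_point(Point p, double dx, double dy) {
    return (Point){ p.x + dx, p.y + dy };
}

Point scale_point(Point p, double s) {
    return (Point){ p.x * s, p.y * s };
}

/* Rotate about the origin by angle phi (radians), matching the slide's formula. */
Point rotate_point(Point p, double phi) {
    return (Point){ p.x * cos(phi) + p.y * sin(phi),
                   -p.x * sin(phi) + p.y * cos(phi) };
}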

Non-trivial Image Processing
Smoothing:
- A function that preserves important patterns while eliminating noise and artifacts.
- Linear smoothing: apply a linear transformation to the picture.
- Convolution: Pnew(x, y) = ∑j=0..m-1 ∑k=0..n-1 Pold(x + j, y + k) · f(j, k)
Edge detection:
- A function that searches for discontinuities or variations in depth, surface, or color.
- Purpose: significantly reduce the follow-up processing.
- Uses: pattern recognition and computer vision.
- One approach: differentiate the image to identify large changes.
Pattern matching:
- Match an image against a template or a group of features.
- Example: ∑i=0..X ∑j=0..Y (Picture(x + i, y + j) − Template(i, j))
Note: these are all digital signal processing applications.
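
A sketch of the convolution sum above in C, assuming a grayscale image stored row-major in a flat float array; the function name, the kernel layout, and the border handling (output only where the kernel fits entirely inside the image) are assumptions for illustration.

/* 2D convolution of an H x W image with an n x m kernel. */
void convolve(const float *in, float *out, int W, int H,
              const float *kernel, int m, int n) {
    for (int y = 0; y <= H - n; y++) {
        for (int x = 0; x <= W - m; x++) {
            float sum = 0.0f;
            for (int j = 0; j < n; j++)          /* kernel rows    */
                for (int k = 0; k < m; k++)      /* kernel columns */
                    sum += in[(y + j) * W + (x + k)] * kernel[j * m + k];
            out[y * W + x] = sum;
        }
    }
}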

Array Storage
- Row-major order stores the array row by row (the rightmost index varies fastest); column-major order stores it column by column (the leftmost index varies fastest).
- The C language stores arrays in row-major order; MATLAB and Fortran use column-major order.
- Loops can be extremely slow in C if the outer loop iterates over columns, because the inner loop then strides through memory and defeats the cache.
Examples:
  int A[2][3] = { {1, 2, 3}, {4, 5, 6} };
  In memory: 1 2 3 4 5 6
  int A[2][3][2] = {{{1,2}, {3,4}, {5,6}}, {{7,8}, {9,10}, {11,12}}};
  In memory: 1 2 3 4 5 6 7 8 9 10 11 12
Translating multi-dimensional indices to a single-dimensional offset:
- Two dimensions: offset = row*COLS + column
- Three dimensions: offset = i*DIM2*DIM3 + j*DIM3 + k
- What is the formula for four dimensions?
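
A short C check of the row-major offset formulas, including a four-dimensional version that answers the question above; the array sizes and indices are arbitrary examples.

#include <assert.h>

int main(void) {
    enum { D1 = 2, D2 = 3, D3 = 2, D4 = 5 };
    int A[D1][D2][D3][D4];
    int *flat = &A[0][0][0][0];      /* view the array as one flat block */

    int i = 1, j = 2, k = 1, l = 3;
    /* Four dimensions: offset = i*D2*D3*D4 + j*D3*D4 + k*D4 + l */
    int offset = i*D2*D3*D4 + j*D3*D4 + k*D4 + l;

    A[i][j][k][l] = 42;
    assert(flat[offset] == 42);      /* same element either way */
    return 0;
}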

Process Partitioning
[Diagram: a 1024-column by 768-row image drawn as cells of 128 rows by 128 columns. With 1024 columns, pixel offset 2053 falls at row 2, column 5 (row 0 holds offsets 0-1023, row 2 holds 2048-3071); in a smaller 8-column example, offset 21 falls at row 2, column 5 (row 0: 0-7, row 1: 8-15, row 2: 16-23).]
Partitioning might assign groups of rows or columns to processors.

Typical Static Partitioning
Master:
- Scatter or broadcast the image along with each processor's assigned rows.
- Gather the updated data back and perform any final updates.
Slave:
- Receive its data.
- Compute the translated coordinates.
- Take part in the collective gather operation.
Questions:
- How does the master decide how much to assign to each processor?
- Is the load balanced (are all processors working equally)?
Notes on the text's shift example:
- It employs individual sends and receives, which is much slower.
- However, if coordinate positions change or the results do not fall in contiguous pixel positions, this might be required.
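
A minimal MPI sketch of the scatter/compute/gather pattern described above, assuming the row count divides evenly among the processes; the image size and the placeholder per-pixel computation are illustrative.

#include <mpi.h>
#include <stdlib.h>

#define ROWS 768
#define COLS 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int myRows = ROWS / nprocs;                            /* assumes an even split */
    unsigned char *image = NULL;
    if (rank == 0) image = malloc((size_t)ROWS * COLS);    /* master holds the full image */
    unsigned char *block = malloc((size_t)myRows * COLS);  /* this process's rows */

    /* Master scatters contiguous groups of rows to every process. */
    MPI_Scatter(image, myRows * COLS, MPI_UNSIGNED_CHAR,
                block, myRows * COLS, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

    for (int i = 0; i < myRows * COLS; i++)
        block[i] = 255 - block[i];                         /* placeholder local computation */

    /* Master gathers the updated rows back into the full image. */
    MPI_Gather(block, myRows * COLS, MPI_UNSIGNED_CHAR,
               image, myRows * COLS, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

    free(block);
    if (rank == 0) free(image);
    MPI_Finalize();
    return 0;
}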

Mandelbrot Set
Definition: the points c = (x, y) = x + iy in the complex plane for which the iteration (normally z_{n+1} = z_n² + c) remains bounded.
Implementation:
- z_0 = 0 + 0i
- For each point (x, y) with coordinates in [-2, +2], iterate z_n until either
  - the iteration count reaches a limit (the point is taken to be in the set), or
  - z_n goes out of bounds, |z_n| > 2 (the point is not in the set).
- Save the iteration count, which maps to a display color.
[Figure: the complex plane mapped to the display; horizontal axis: real values, vertical axis: imaginary values.]

Scaling and Zooming
Range of points to display: from cmin = xmin + i·ymin to cmax = xmax + i·ymax.
Range of pixels to display: from the pixel at (0, 0) to the pixel at (ROWS-1, COLUMNS-1).
Pseudo code:
  For row = 0 to ROWS-1
    For col = 0 to COLUMNS-1
      cy = ymin + (ymax - ymin) * row / ROWS
      cx = xmin + (xmax - xmin) * col / COLUMNS
      color = mandelbrot(cx, cy)
      picture[COLUMNS*row + col] = color
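
The scaling loop rendered as C; ROWS, COLUMNS, and the function name render() are illustrative, and mandelbrot() is the iteration routine sketched after the next slide.

#define ROWS    768
#define COLUMNS 1024

int mandelbrot(double cx, double cy);   /* defined in the next sketch */

/* Map each pixel to a point in the complex plane and color it by iteration count. */
void render(int *picture,
            double xmin, double xmax, double ymin, double ymax) {
    for (int row = 0; row < ROWS; row++) {
        for (int col = 0; col < COLUMNS; col++) {
            double cy = ymin + (ymax - ymin) * row / ROWS;
            double cx = xmin + (xmax - xmin) * col / COLUMNS;
            picture[COLUMNS * row + col] = mandelbrot(cx, cy);
        }
    }
}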

Pseudo code: mandelbrot(cx, cy)
  SET z = zreal + i*zimaginary = 0 + 0i
  SET iterations = 0
  DO
    SET z = z² + c    // temp = zreal; zreal = zreal² - zimaginary² + cx
                      // zimaginary = 2 * temp * zimaginary + cy
    SET value = zreal² + zimaginary²
    iterations++
  WHILE value <= 4 AND iterations < MAX
  RETURN iterations
Notes:
- The final iteration count determines each point's color.
- Some points escape quickly, others slowly, and others not at all; the points that never escape are in the Mandelbrot set (black on the previous slide).
- Since 4^(1/2) = 2, comparing value = zreal² + zimaginary² against 4 is the same as comparing |z| against 2, so no square root is needed.
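
The same pseudocode as a C function; MAX_ITERATIONS is an illustrative limit.

#define MAX_ITERATIONS 256

int mandelbrot(double cx, double cy) {
    double zreal = 0.0, zimag = 0.0;
    int iterations = 0;
    double value;
    do {
        double temp = zreal;
        zreal = zreal * zreal - zimag * zimag + cx;   /* z = z^2 + c, real part */
        zimag = 2.0 * temp * zimag + cy;              /* imaginary part */
        value = zreal * zreal + zimag * zimag;        /* |z|^2, compared to 4 to skip sqrt */
        iterations++;
    } while (value <= 4.0 && iterations < MAX_ITERATIONS);
    return iterations;                                /* maps to a display color */
}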

Parallel Implementation
Both the static and dynamic algorithms are examples of load balancing.
Load balancing: techniques for keeping processors from becoming idle.
Note: a balanced load does NOT require identical work assignments, only that processors do not sit idle.
Static approach:
- The load is assigned once, at the start of the run.
- Mandelbrot: assign each processor a group of rows.
- Deficiency: not load balanced, since some rows take far longer to compute than others.
Dynamic approach:
- The load is assigned dynamically during the run.
- Mandelbrot: slaves ask for more work when they complete a section.

The Dynamic Approach
The master's work increases somewhat:
- It must send out rows as it receives requests from slaves.
- It must be responsive to slave requests; a separate thread might help, or the master can use MPI's asynchronous receive calls.
Termination:
- Slaves terminate when they receive a "no work" indication in a message.
- The master must not terminate until all of the slaves have completed.
Partitioning of the load:
- The master receives back blocks of pixels; the slaves receive ranges of (x, y) coordinates.
- Partitions can be by columns or by rows. Which is better?
Refinement: ask for more work before finishing the current section (double buffering).
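
A sketch of the dynamic work pool in MPI: the master hands out one row at a time and slaves request more by returning results. The message tags, buffer layout, and coordinate bounds are assumptions, and it presumes there are at least as many rows as slaves.

#include <mpi.h>

#define ROWS       768
#define COLUMNS    1024
#define TAG_WORK   1      /* message carries a row number to compute */
#define TAG_DONE   2      /* "no work" indication: slave should stop */
#define TAG_RESULT 3

int mandelbrot(double cx, double cy);   /* as sketched earlier */

void master(int nslaves) {
    int nextRow = 0, active = 0, results[COLUMNS + 1];
    MPI_Status st;
    /* Prime every slave with one row. */
    for (int s = 1; s <= nslaves && nextRow < ROWS; s++, nextRow++, active++)
        MPI_Send(&nextRow, 1, MPI_INT, s, TAG_WORK, MPI_COMM_WORLD);
    while (active > 0) {
        /* results[0] holds the row number, results[1..] the colors. */
        MPI_Recv(results, COLUMNS + 1, MPI_INT, MPI_ANY_SOURCE,
                 TAG_RESULT, MPI_COMM_WORLD, &st);
        active--;
        /* ... store results[1..COLUMNS] into the picture ... */
        if (nextRow < ROWS) {   /* hand the idle slave another row */
            MPI_Send(&nextRow, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
            nextRow++; active++;
        } else {                /* no work left: tell the slave to stop */
            MPI_Send(&nextRow, 1, MPI_INT, st.MPI_SOURCE, TAG_DONE, MPI_COMM_WORLD);
        }
    }
}

void slave(void) {
    int row, results[COLUMNS + 1];
    MPI_Status st;
    for (;;) {
        MPI_Recv(&row, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_DONE) break;
        results[0] = row;
        for (int col = 0; col < COLUMNS; col++) {
            double cy = -2.0 + 4.0 * row / ROWS;      /* illustrative bounds [-2, 2] */
            double cx = -2.0 + 4.0 * col / COLUMNS;
            results[col + 1] = mandelbrot(cx, cy);
        }
        MPI_Send(results, COLUMNS + 1, MPI_INT, 0, TAG_RESULT, MPI_COMM_WORLD);
    }
}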

Monte Carlo Methods
Pseudo-code (throw darts to converge on a solution):
Computing a definite integral:
  While more iterations are needed
    pick a random point x
    total += f(x)
  result = (xmax - xmin) * total / iterations
  (that is, ∫ f(x) dx ≈ (xmax - xmin) · (1/N) ∑i=1..N f(x_i))
Calculation of π:
  While more iterations are needed
    randomly pick a point (x, y)
    if the point is inside the circle (x² + y² <= 1), within++
  Compute π = 4 * within / iterations
Note: whether the darts fall in the full square or only in the upper-right quadrant, the ratio within/iterations estimates π/4, so the factor of 4 remains.
Note: parallel programs shouldn't use the standard random number generator (see the parallel generator slide below).
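
A sequential C sketch of the dart-throwing estimate of π using the upper-right quadrant; it uses the standard rand() only to keep the sketch short, which, per the note above, is exactly what a parallel program should not do.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long iterations = 10000000L;
    long within = 0;
    for (long i = 0; i < iterations; i++) {
        double x = (double)rand() / RAND_MAX;   /* random point in [0,1] x [0,1] */
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)               /* inside the quarter circle */
            within++;
    }
    printf("pi ~= %f\n", 4.0 * within / iterations);
    return 0;
}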

Computation of π
2 · ∫ (1 - x²)^(1/2) dx = π for -1 <= x <= 1 (the integral alone is the area of the upper half disc, π/2).
A point is within the circle if point.x² + point.y² <= 1.
Total points / points within = total area / area in the shape.
Questions:
- How should the boundary condition (points exactly on the circle) be handled?
- What is the best accuracy that we can achieve?

Parallel Random Number Generator
Numbers of a pseudo-random sequence should be uniformly distributed, have a large period, be repeatable, and be statistically independent.
Each processor must generate a unique sequence; accuracy depends upon the precision of the random sequence.
Sequential linear congruential generator (typically m is prime and c = 0):
  x_{i+1} = (a * x_i + c) mod m    (example: a = 16807, m = 2^31 - 1, c = 0)
Many other generators are possible.
Parallel linear generator with unique sequences (the "leapfrog" scheme):
  x_{i+k} = (A * x_i + C) mod m, where k is the "jump" constant,
  A = a^k and C = c (a^(k-1) + a^(k-2) + … + a^1 + a^0).
If k = P, we can compute A and C and generate the first k random numbers sequentially to get started; after that, each processor jumps through its own subsequence.
[Diagram: a parallel random sequence in which processor p takes x_p, x_{p+P}, x_{p+2P}, … from the single sequence x_1, x_2, …, x_{2P-1}, …]
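
A sequential C sketch of the leapfrog scheme above: compute A = a^k, seed each of P streams with the first P numbers, then let each stream jump ahead by P per step. P, the starting seed, and the printed output are illustrative; 64-bit arithmetic keeps the products from overflowing.

#include <stdio.h>
#include <stdint.h>

#define A_SEQ 16807ULL          /* a, as in the slide's example */
#define M     2147483647ULL     /* m = 2^31 - 1 */

/* Compute A = a^k mod m for the jump constant k. */
static uint64_t jump_multiplier(uint64_t a, int k) {
    uint64_t A = 1;
    for (int i = 0; i < k; i++)
        A = (A * a) % M;
    return A;
}

int main(void) {
    int P = 4;                              /* number of processes (illustrative) */
    uint64_t A = jump_multiplier(A_SEQ, P); /* c = 0, so C = 0 as well */

    /* Generate the first P numbers sequentially to seed each process. */
    uint64_t x = 1;                         /* x_0, illustrative seed */
    uint64_t seed[4];
    for (int p = 0; p < P; p++) {
        x = (A_SEQ * x) % M;
        seed[p] = x;                        /* process p starts from x_{p+1} */
    }

    /* Each process then leapfrogs: x_{i+P} = (A * x_i) mod m. */
    for (int p = 0; p < P; p++) {
        uint64_t xi = seed[p];
        printf("process %d:", p);
        for (int n = 0; n < 3; n++) {
            printf(" %llu", (unsigned long long)xi);
            xi = (A * xi) % M;
        }
        printf("\n");
    }
    return 0;
}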