18.337: Image Median Filter
Rafael Palacios
Aeronautics and Astronautics Department. Visiting professor (IIT - Institute for Research in Technology, University Pontificia Comillas, Madrid, Spain)

MEDIAN FILTER

Median Filter (example image)

Median filter algorithm
The median filter is a nonlinear operation for noise reduction (dust or spikes). It eliminates noise while preserving edges by assigning to each pixel the median value of its neighborhood. The cost is roughly n*ns*log(ns), where n is the number of pixels and ns is the neighborhood size.
Matlab function:
– C=medfilt2(cn); % 3x3 neighborhood
– C=medfilt2(cn,[r c]); % rxc neighborhood
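As a reference for the cost estimate above, a minimal single-core sketch of the per-pixel median is shown below (the function name mymedfilt and the use of padarray are illustrative assumptions, not the author's code; odd r and c are assumed):

function C = mymedfilt(A, r, c)
%Naive median filter: sort the r-by-c neighborhood of every pixel and keep
%the middle value, so the cost is about n*ns*log(ns) with ns = r*c.
A = double(A);
P = padarray(A, [(r-1)/2 (c-1)/2], 'symmetric');  %pad borders (Image Processing Toolbox)
[n, m] = size(A);
C = zeros(n, m);
for i = 1:n
    for j = 1:m
        w = P(i:i+r-1, j:j+c-1);        %r-by-c neighborhood window
        s = sort(w(:));                 %ns*log(ns) sort
        C(i,j) = s(ceil(numel(s)/2));   %median element
    end
end
end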

MATRIX PREPARATION

Size adjustment
The original image (1024x1600x3, 5 MB) is enlarged to 2048x3200x3 (20 MB) to form the base test matrix.

Noise added
cn=imnoise(c,'salt & pepper');
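Putting the built-ins together, a minimal sequential pipeline might look like the sketch below (the file name is an assumption; each color channel is filtered separately because medfilt2 operates on 2-D planes):

c = imread('input.jpg');                    %original RGB image (uint8)
cn = imnoise(c, 'salt & pepper');           %add salt-and-pepper noise
C = cn;
for k = 1:size(cn, 3)
    C(:,:,k) = medfilt2(cn(:,:,k), [3 3]);  %3x3 median filter per channel
end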

EXPERIMENTAL RESULTS

Sensitivity to image size
Running time grows approximately linearly with image size, ~O(n).

Sensitivity to neighborhood size
The dependence on neighborhood size was unexpected.

Basic experiments
Original matrix size: 2048x3200x3 = 20M
Matrix sizes: n = [20M, 80M, 320M, 1280M] (x4 steps)
Neighborhood sizes: nn = [ ] (2^n + 1 neighborhoods)
Partitioning strategies: by rows or by columns
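A sketch of the kind of timing loop behind these experiments (the neighborhood values and the repmat-based enlargement are assumptions, not the author's script; cn is the noisy test image from the earlier slide):

reps  = [1 2 4 8];                 %tiling factor per dimension: x1, x4, x16, x64 pixels
neigh = [3 5 9 17 33 65];          %2^k + 1 neighborhood sizes (assumed values)
t = zeros(numel(reps), numel(neigh));
for i = 1:numel(reps)
    big = repmat(cn, [reps(i) reps(i) 1]);          %enlarge the test matrix
    for j = 1:numel(neigh)
        tic
        for k = 1:size(big, 3)
            medfilt2(big(:,:,k), [neigh(j) neigh(j)]);
        end
        t(i, j) = toc;                              %seconds per size/neighborhood pair
    end
end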

Computer systems
Dell (Xeon 2.67 GHz, 8 MB L3, 12 GB DDR3 1066 MHz)
– Matlab single core
– Matlab parallel toolbox
– Matlab with pMatlab
Cluster (beagle, beowulf)
– MPI

SINGLE-CORE RESULTS

Matlab Single-Core (timing figure)

PARALLEL COMPUTING TOOLBOX

Matlab Multi-Core
Parallel Computing Toolbox using 'spmd'. Image size = 80 MB, neighborhood = 65. Worker time matches the prediction.

Matlab Multi-Core with spmd
There is an overhead of 1.5 s for the 80 MB matrix (transfer rate 200 MB/s). There are no memory conflicts because each lab works on its own copy of the image. Parallelization by rows or by columns is equivalent (see the sketch below).
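A minimal sketch of this spmd approach, in which the image is broadcast to every lab and each lab filters its own block of columns (the 65x65 neighborhood follows the slide; boundary handling between blocks is omitted, and myfilterP from the backup slides is replaced by medfilt2):

%Assumes a parallel pool is already open, e.g. matlabpool(4)
spmd
    cols = size(cn, 2);
    lo = floor((labindex-1)*cols/numlabs) + 1;  %first column of this lab's block
    hi = floor(labindex*cols/numlabs);          %last column of this lab's block
    block = cn(:, lo:hi, :);                    %each lab works on its own copy of cn
    for k = 1:size(block, 3)
        block(:,:,k) = medfilt2(block(:,:,k), [65 65]);
    end
end
C = [];
for w = 1:length(block)                         %block is a Composite on the client
    C = cat(2, C, block{w});                    %concatenate the filtered blocks
end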

Matlab Multi-Core
8-core computer with slower memory (2x Xeon Quad 2.26 GHz, 8 GB 667 MHz): more overhead.

pMATLAB

pMatlab
pMatlab runs Matlab in parallel by launching several Matlab processes that communicate using MPI. Communications are transparent to the user, since pMatlab uses a distributed matrix approach.

How it works
1. Several Matlab processes are started.
2. The leader process loads the image into a shared matrix.
3. Each subprocess receives its corresponding section of the image in X.
4. Each subprocess applies the median filter and stores its results in Y.
5. The leader process aggregates the results.
(A condensed code sketch follows; the full code is in the backup slides.)
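A condensed sketch of those steps using the pMatlab calls from the backup slides (Np, Pid, map, local, put_local and agg come with pMatlab; the 5x5 filtering line and the agg call are illustrative assumptions, and n, m and the image a are defined as in the backup slides):

mapL = map([1 1], {}, 0);          %the leader (Pid 0) owns all of XL
mapM = map([1 Np], {}, 0:Np-1);    %X and Y are split by columns over the Np processes
XL = zeros(n, m, mapL);            %staging matrix on the leader
X  = zeros(n, m, mapM);            %distributed input
Y  = zeros(n, m, mapM);            %distributed output
if Pid == 0
    XL(:,:) = a;                   %leader stores the loaded image
end
X(:,:) = XL;                       %column chunks are shipped to their owners
Xloc = local(X);                   %each process extracts its local columns
res = medfilt2(Xloc, [5 5]);       %filter the local block (2-D case for brevity)
Y = put_local(Y, res);             %write the local result back into Y
Yfull = agg(Y);                    %leader aggregates the filtered image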

Results
Computing time does not decrease significantly using double. It scales well using uint8, since less data has to be moved. (Charts: double vs. uint8.)

Testing remarks
Initially the pMatlab algorithm was implemented using 2D double matrices:
– Filtering was performed in three steps (R, G, B).
– The conversion to double multiplied the size of the matrices by 8, which hurt communications (see the sketch below).
The final implementation used 3D uint8 matrices.
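A quick way to see the factor-of-8 difference the remark refers to (variable names are illustrative):

cn8 = imnoise(c, 'salt & pepper');   %3D uint8 image: 1 byte per element
cnd = double(cn8);                   %same image as double: 8 bytes per element
whos cn8 cnd                         %for the 80 MB test image, roughly 80 MB vs 630 MB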

CONCLUSION

Conclusion
Performance may depend on the algorithm more than on the parallelization (5x5 neighborhood case).
Matlab's Parallel Computing Toolbox does not use shared memory. The Parallel Computing Toolbox uses a lot of memory and communication, because the whole matrix is propagated to all clients.
– The algorithm was implemented with spmd.
– It is possible to use distributed matrices to improve this.
– It is possible to use sliced variables in parfor loops (see the sketch below).
pMatlab uses memory efficiently.
An MPI version was not developed.
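A minimal sketch of the sliced-variable alternative mentioned above: with parfor and sliced indexing, each worker only receives the slice it works on instead of the whole matrix (filtering one color channel per iteration is an illustration, not the author's implementation):

C = cn;
parfor k = 1:3                              %one iteration per color channel
    C(:,:,k) = medfilt2(cn(:,:,k), [5 5]);  %cn and C are sliced variables
end
%Only three-way parallelism here; slicing by row or column blocks would need
%explicit handling of the overlap at block boundaries.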

Conclusion: speedup comparison (chart).

Conclusion: speedup charts for pMatlab using double and pMatlab using uint8.

pMatlab (3D uint8), 320 MB
For larger sizes the impact of latencies is reduced (computing time and transmission time are linear with size). Speedup is almost perfect in pMatlab, but worse with the Toolbox. The amount of memory that needs to be sent increases asymptotically to 320 MB in the case of pMatlab, whereas it increases linearly with the number of processors in the case of the Parallel Computing Toolbox.
(Table: total time and speedup with 1, 2 and 4 cores, pMatlab vs. Parallel Computing Toolbox, for the 320 MB image matrix.)
This slide shows the effect of data transfer.

BACKUP SLIDES

Parallel computing toolbox: memory issues

%Activate parallel computing
%matlabpool(4)
tic
%Create threads
spmd
    c = myfilterP(a, labindex, numlabs);
end
toc
%Gather results from threads (inefficient memory allocation)
result = [];
for ii = 1:length(c)
    result = [result, c{ii}];
end
toc
%Close parallel computing
%matlabpool close

…

spmd(4)
    if labindex==1, c = myfilterP(a1); end
    if labindex==2, c = myfilterP(a2); end
    if labindex==3, c = myfilterP(a3); end
    if labindex==4, c = myfilterP(a4); end
end

Same result. All 4 matrices (a1 to a4) are sent to all threads.

pMatlab: sending initial data to clients

PARALLEL = 1;
if (PARALLEL)
    %Create map for XL. The leader process owns all data.
    mapL = map([1 1],{},0);
    %Create map for distributed matrices X and Y. Each processor gets a set of columns.
    mapM = map([1 Np],{},0:Np-1);
else
    mapL = 1;
    mapM = 1;
end

%Create matrices XL, X and Y
XL = zeros(n,m,mapL);   %owned by Pid 0
X  = zeros(n,m,mapM);   %distributed input
Y  = zeros(n,m,mapM);   %distributed output

if Pid==0   %only the main process performs the initialization
    load input_matrix
    XL(:,:) = a;        %all data stored in Pid 0
end

…

%Only the leader process has a non-empty XL, so only the leader writes something to X.
%Writing to X involves sending data to the subprocesses, since different chunks of X
%belong to different Pids.
X(:,:) = XL;

%Get the local part as a standard matrix. It is faster to work with local matrices.
Xloc = local(X);

%...filtering code...

%After obtaining the resulting matrix res, store it in the distributed matrix Y.
Y = put_local(Y,res);

pMatlab (double)
(Table: computing and communication percentages, total time and speedup for 1, 2 and 4 cores.)
More data transfer occurs with 4 cores (75% of the matrix) than with 2 cores (50% of the matrix is copied back and forth). The results are consistent. The conversion from uint8 to double penalizes the pMatlab tests: the 80 MB image matrix is in fact 630 MB in double format.

pMatlab (3D uint8)
Times are smaller, and speedup is better because communication delays are less of a penalty.
With 2 cores, computing takes 17 s (90.9%) and communication 1.7 s (9.1%); with 4 cores, computing takes 9 s (81.8%) and communication 2 s (18.2%), for a total of 11 s and a speedup of 3.0. With 1 core, communication accounts for only 0.4 s (1.2%).