
Computations with Big Image Data
Phuong Nguyen
Sponsor: NIST

Computations with Big Image Data

Motivation:
– Live cell image processing application: the microscope generates a large number of spatial image tiles, with several measurements at each pixel, per time slice.
– Analyze these images, including computations that calibrate, segment, and visualize image channels, as well as extract image features for further analyses.
– Using a desktop is slow: e.g. image segmentation on stitched images using Matlab, 954 files * 8 min = 127 hours (stitched TIFF: ~0.578 TB per experiment); or 161 files * 8 min = 21.5 hours (1 GB per file).

Goals:
– Computational scalability of cell image processing: distributed data partitioning strategies, parallel algorithms, and analysis and evaluation of different algorithms/approaches.
– Generalize as libraries/benchmarks/tools for image processing.

Computations with Big Image Data (cont.)

Processing these images:
– Operates either on thousands of mega-pixel images (image tiles) or on hundreds of half-gigapixel to gigapixel images (stitched images)
– Ranges from computationally intensive to data intensive

Approaches:
– Develop distributed data partitioning strategies and parallel processing algorithms
– Implement and run benchmarks on distributed/parallel frameworks and platforms
– Use the Hadoop MapReduce framework and compare with other frameworks or parallel scripts (PBS) using network file system storage

Image segmentation using Java/Hadoop

Segmentation method consisting of five linear workflow steps:
1. Sobel-based image gradient computation
2. Connectivity analysis to group 4-connected pixels, thresholded by a value to remove small objects
3. Morphological open (dilation of erosion) using a 3x3 convolution kernel to remove small holes and islands
4. Connectivity analysis and thresholding by a value to remove small objects again
5. Connectivity analysis to assign the same label to each contiguous group of 4-connected pixels

Sobel gradient magnitude: G(x,y) = sqrt(Gx^2 + Gy^2), where Gx and Gy are the horizontal and vertical Sobel convolutions of the image.
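As a sketch of step 1, the per-pixel Sobel gradient magnitude might look like the following single-threaded Java; the array representation and class name are illustrative, not the deck's implementation.

    // Minimal sketch of the Sobel gradient step, assuming the image tile is
    // already in memory as a float[rows][cols] array.
    public final class SobelGradient {

        // 3x3 Sobel kernels for horizontal (Gx) and vertical (Gy) derivatives.
        private static final int[][] KX = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
        private static final int[][] KY = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};

        /** Returns the gradient magnitude G = sqrt(Gx^2 + Gy^2); borders are left 0. */
        public static float[][] gradient(float[][] img) {
            int rows = img.length, cols = img[0].length;
            float[][] out = new float[rows][cols];
            for (int r = 1; r < rows - 1; r++) {
                for (int c = 1; c < cols - 1; c++) {
                    float gx = 0f, gy = 0f;
                    for (int i = -1; i <= 1; i++) {
                        for (int j = -1; j <= 1; j++) {
                            float v = img[r + i][c + j];
                            gx += KX[i + 1][j + 1] * v;
                            gy += KY[i + 1][j + 1] * v;
                        }
                    }
                    out[r][c] = (float) Math.sqrt(gx * gx + gy * gy);
                }
            }
            return out;
        }
    }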

Flat Field Correction

Correct spatial shading of each tile image:

    I_FFC(x,y) = (I(x,y) - DI(x,y)) / (WI(x,y) - DI(x,y))

where I_FFC(x,y) is the flat-field corrected image intensity, I(x,y) is the raw uncorrected image intensity, DI(x,y) is the dark image acquired with the camera shutter closed, and WI(x,y) is the flat-field intensity acquired without any object.
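A minimal single-threaded sketch of this correction, assuming the raw, dark, and white images are pre-loaded float arrays of equal size (names are illustrative). Its cost matches the characterization later in the deck: two subtractions and one division per pixel.

    // Flat-field correction sketch: I_FFC = (I - DI) / (WI - DI), per pixel.
    public final class FlatFieldCorrection {

        public static float[][] correct(float[][] raw, float[][] dark, float[][] white) {
            int rows = raw.length, cols = raw[0].length;
            float[][] out = new float[rows][cols];
            for (int r = 0; r < rows; r++) {
                for (int c = 0; c < cols; c++) {
                    float denom = white[r][c] - dark[r][c];
                    // Guard against a zero denominator on dead pixels.
                    out[r][c] = denom != 0f ? (raw[r][c] - dark[r][c]) / denom : 0f;
                }
            }
            return out;
        }
    }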

Characteristics of selected cell image processing computations:

Flat Field Correction
– Spatial extent: local
– Input & output files: tens of thousands of files of a few MB each
– Computational complexity: low (two subtractions and one division per pixel)
– Data-access pattern during computation: medium (accessing three files and creating one file); data skew

Segmentation based on convolution kernels
– Spatial extent: global with a fixed kernel
– Input & output files: hundreds of files of half a GB each
– Computational complexity: medium (tens of subtractions, multiplications, and comparisons per pixel)
– Data-access pattern during computation: low (accessing one file and creating one file)

Summary of computations and input/output image data files:

Flat Field Correction
– Input (TIFF): large number of raw image tiles (98,169 GFP channel tiles, ~531 GB); 2 bytes per pixel, 2.83 MB per file
– Output (mostly TIFF): large number of corrected tiles (98,169 GFP channel corrected tiles, ~531 GB); 4 bytes per pixel, ~5.6 MB per file

Segmentation based on convolution kernels
– Input (TIFF): small number of phase contrast channel stitched images (388 time frames, ~219 GB); 2 bytes per pixel, 593 MB per file
– Output (mostly TIFF): small number of mask images (388 time frames, ~86 GB); 2 bytes per pixel, 71 MB-331 MB per file

Hadoop MapReduce approach

– Image files are uploaded to HDFS
– Changes to input formats (an image-reading input format and serialization)
– Splitting of the input: currently no split; a mapper processes a whole stitched image
– Map-only jobs: only mappers are used, and output files are written directly to HDFS
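The map-only pattern above can be sketched with standard Hadoop MapReduce APIs. This is a minimal sketch under stated assumptions, not the project's code: WholeImageInputFormat is a hypothetical custom input format that emits one (filename, image bytes) record per image with splitting disabled, and segment() is a stand-in for the five-step pipeline.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    // Map-only sketch: each input record is (filename, whole image bytes); the
    // mapper segments the image and writes the mask straight back to HDFS.
    public class SegmentationDriver {

        public static class SegmentMapper
                extends Mapper<Text, BytesWritable, NullWritable, NullWritable> {
            @Override
            protected void map(Text filename, BytesWritable image, Context ctx)
                    throws IOException, InterruptedException {
                byte[] mask = segment(image.copyBytes()); // stand-in for the pipeline
                Path out = new Path(ctx.getConfiguration().get("output.dir"),
                                    filename + ".mask.tif");
                FileSystem fs = out.getFileSystem(ctx.getConfiguration());
                try (FSDataOutputStream os = fs.create(out, true)) {
                    os.write(mask);
                }
            }

            private byte[] segment(byte[] tiff) { return tiff; } // placeholder
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("output.dir", args[1]);
            Job job = Job.getInstance(conf, "image-segmentation");
            job.setJarByClass(SegmentationDriver.class);
            job.setMapperClass(SegmentMapper.class);
            job.setNumReduceTasks(0);                         // map-only, as on the slide
            // Hypothetical custom format: one (filename, bytes) record per image,
            // with isSplitable() returning false -- the "no split" choice above.
            job.setInputFormatClass(WholeImageInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            job.setOutputFormatClass(NullOutputFormat.class); // mapper writes its own files
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }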

Hadoop MapReduce approach (cont.)

Advantages of using Hadoop:
– Data is local to the node, avoiding network file system bottlenecks when running at scale
– Manages task execution and automatically reruns failed tasks
– With big images, more work is lost when a task fails
– For small images, use e.g. Hadoop SequenceFiles consisting of binary key/value pairs (key: image filename, value: image data); Apache Avro (a data serialization system) is an alternative (see the sketch below)

Running on the NIST HPC cluster (Raritan cluster):
– HPC queue system
– Data must be moved in and out
– Not possible to share data in HDFS
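A minimal sketch of the SequenceFile idea using standard Hadoop APIs: pack a directory of small tile files into one SequenceFile of (filename, image bytes) records. Paths and the class name are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Packs a directory of small image tiles into a single SequenceFile of
    // (filename, image bytes) records, so HDFS stores one large file instead
    // of tens of thousands of small ones.
    public class TilePacker {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path tileDir = new Path(args[0]);  // e.g. hdfs:///data/tiles
            Path seqFile = new Path(args[1]);  // e.g. hdfs:///data/tiles.seq

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(seqFile),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (FileStatus status : fs.listStatus(tileDir)) {
                    byte[] buf = new byte[(int) status.getLen()];
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        in.readFully(0, buf);  // read the whole tile into memory
                    }
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(buf));
                }
            }
        }
    }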

Image segmentation benchmark using Hadoop: results

– Single-node, single-threaded Java takes ~10 hours; Matlab on a desktop machine takes ~21.5 hours.
– The workload is both I/O and computation intensive.
– Image segmentation scales well using Hadoop; efficiency decreases as the number of nodes increases.

Flat Field Correction benchmark using Hadoop: results

– I/O-intensive tasks, primarily writing output data to the HDFS file system.

Hadoop MapReduce approach (cont.)

Future work considers these techniques:
– Achieve pixel-level parallelism by breaking each image into smaller images, running the algorithms (segmentation, flat field correction, ...) on the pieces, and joining the results upon completion, before downloading files from HDFS to the network file system.
– This method can also be extended to overlapping blocks, by providing a method that splits the input image along boundaries between an atomic number of rows/columns and defines the number of overlapping pixels along each side (see the sketch after this list).
– Compare no split vs. split vs. split with overlapping pixels.
– Reduce tasks in the MapReduce framework can be useful for some image processing algorithms, e.g. feature extraction.
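The overlapping-block split in the second item might be sketched as follows; the Block record and the blockSize and halo parameters are illustrative, not the project's API. Each block carries its core (write) region plus a halo-extended read region clamped at the image edges, so convolution and connectivity steps near block boundaries see enough context.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of splitting an image into fixed-size blocks with an overlapping
    // halo of extra pixels on each side.
    public final class BlockPartitioner {

        /** A block: the core region to write, plus the halo-extended region to read. */
        public record Block(int coreRow, int coreCol, int coreH, int coreW,
                            int readRow, int readCol, int readH, int readW) {}

        public static List<Block> partition(int rows, int cols, int blockSize, int halo) {
            List<Block> blocks = new ArrayList<>();
            for (int r = 0; r < rows; r += blockSize) {
                for (int c = 0; c < cols; c += blockSize) {
                    int h = Math.min(blockSize, rows - r);
                    int w = Math.min(blockSize, cols - c);
                    int rr = Math.max(0, r - halo);           // clamp halo at image edges
                    int cc = Math.max(0, c - halo);
                    int rh = Math.min(rows, r + h + halo) - rr;
                    int rw = Math.min(cols, c + w + halo) - cc;
                    blocks.add(new Block(r, c, h, w, rr, cc, rh, rw));
                }
            }
            return blocks;
        }
    }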

Summary

We have developed image processing algorithms and characterized their computations as potential contributions to:
– scaling the cell image analysis application, and
– providing image processing benchmarks using Hadoop.

Future work considers:
– optimizing and tuning these image processing computations using Hadoop
– generalizing them into libraries/benchmarks/tools for image processing