Computations with Big Image Data
Phuong Nguyen
Sponsor: NIST
Computations with Big Image Data
Motivation:
– Live cell image processing application: a microscope generates a large number of spatial image tiles, with several measurements at each pixel per time slice.
– Analyzing these images involves computations that calibrate, segment, and visualize image channels, and that extract image features for further analyses.
– On a desktop, e.g. image segmentation of stitched images using Matlab:
  954 files × 8 min = 127 hours (stitched TIFF: ~0.578 TB per experiment)
  161 files × 8 min = 21.5 hours (1 GB per file)
Goals:
– Computational scalability of cell image processing
  Distributed data partitioning strategies, parallel algorithms
  Analysis and evaluation of different algorithms/approaches
– Generalize as libraries/benchmarks/tools for image processing
Computations with Big Image Data cont.
Processing these images:
– Computations operate either on thousands of megapixel images (image tiles) or on hundreds of half-gigapixel or gigapixel images (stitched images)
– They range from computationally intensive to data intensive
Approaches:
– Develop distributed data partitioning strategies and parallel processing algorithms
– Implement and run benchmarks on distributed/parallel frameworks and platforms
– Use the Hadoop MapReduce framework and compare it with other frameworks or with parallel scripts (PBS) that use network file system storage
Image segmentation using Java/Hadoop
The segmentation method consists of the following linear workflow steps:
1. Sobel-based image gradient computation
2. Connectivity analysis to group 4-connected pixels, with a threshold value to remove small objects
3. Morphological open (dilation of erosion) using a 3x3 convolution kernel to remove small holes and islands
4. Connectivity analysis and thresholding by a value to remove small objects again
5. Connectivity analysis to assign the same label to each contiguous group of 4-connected pixels
[Equation: Sobel gradient]
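The first two kinds of step in the workflow above (Sobel gradient, then 4-connected grouping with a size threshold) can be sketched as follows. This is a minimal pure-Python illustration, not the Java/Hadoop implementation the slides describe. It assumes the standard 3x3 Sobel kernels and the magnitude sqrt(Gx^2 + Gy^2), since the slide's own equation did not survive. The function names, the zeroed border, and the `min_size` parameter are choices made for this sketch.

```python
import math
from collections import deque

# Standard 3x3 Sobel kernels (assumed; the slide's equation is not available).
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_magnitude(img):
    """Return sqrt(Gx^2 + Gy^2) per pixel; border pixels are left at 0.0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0
            for j in range(3):
                for i in range(3):
                    p = img[y + j - 1][x + i - 1]
                    gx += SOBEL_X[j][i] * p
                    gy += SOBEL_Y[j][i] * p
            out[y][x] = math.hypot(gx, gy)
    return out


def label_4_connected(mask, min_size=1):
    """Label 4-connected foreground groups; groups below min_size keep label 0."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    visited = [[False] * w for _ in range(h)]
    next_label = 1
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or visited[sy][sx]:
                continue
            # Breadth-first flood fill over up/down/left/right neighbours only.
            visited[sy][sx] = True
            queue = deque([(sy, sx)])
            group = []
            while queue:
                y, x = queue.popleft()
                group.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not visited[ny][nx]:
                        visited[ny][nx] = True
                        queue.append((ny, nx))
            # Small objects are dropped by never assigning them a label.
            if len(group) >= min_size:
                for y, x in group:
                    labels[y][x] = next_label
                next_label += 1
    return labels
```

Thresholding the output of `sobel_magnitude` gives a binary mask that `label_4_connected(mask, min_size=...)` can then label. The threshold and minimum size are left to the caller because the slides do not state the values used.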
Flat Field Correction
Corrects the spatial shading of a tile image:
I_FFC(x,y) = (I(x,y) - DI(x,y)) / (WI(x,y) - DI(x,y))
where
– I_FFC(x,y) is the flat-field corrected image intensity,
– I(x,y) is the raw uncorrected image intensity,
– DI(x,y) is the dark image acquired by closing the camera shutter,
– WI(x,y) is the flat field intensity acquired without any object.
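A per-pixel sketch of this correction is shown below. It assumes the common form I_FFC = (I - DI) / (WI - DI), which matches the "two subtractions and one division per pixel" noted elsewhere in the slides. The function name, the list-of-lists image representation, and the handling of a zero denominator (output set to 0.0) are assumptions of this sketch, not details from the source.

```python
def flat_field_correct(raw, dark, white, eps=1e-12):
    """Apply I_FFC = (I - DI) / (WI - DI) pixel by pixel.

    raw, dark and white are equally sized 2-D lists of numbers.
    Pixels whose denominator magnitude is not above eps are set to 0.0;
    that fallback is an assumption of this sketch.
    """
    if not (len(raw) == len(dark) == len(white)):
        raise ValueError("images must have the same number of rows")
    out = []
    for r_row, d_row, w_row in zip(raw, dark, white):
        if not (len(r_row) == len(d_row) == len(w_row)):
            raise ValueError("rows must have the same length")
        row = []
        for i, d, w in zip(r_row, d_row, w_row):
            denom = w - d
            row.append((i - d) / denom if abs(denom) > eps else 0.0)
        out.append(row)
    return out
```

The result is floating point even for integer inputs, which would be consistent with corrected tiles needing more bytes per pixel than raw tiles.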
Characteristics of selected cell image processing computations

| Image processing computation type | Spatial extent | Input & output file characteristics | Computational complexity | Data-access pattern during computations |
|---|---|---|---|---|
| Flat Field Correction | Local | Tens of thousands of files of a few MB each | Low (two subtractions and one division per pixel) | Medium (accessing three files and creating one file); data skew |
| Segmentation based on convolution kernels | Global with fixed kernel | Hundreds of files of about half a GB each | Medium (tens of subtractions, multiplications, comparisons per pixel) | Low (accessing one file and creating one file) |

Summary of computations, and input and output image data files

| Image computation type | Input (TIFF): number of files | Input: size per file | Output (mostly TIFF): number of files | Output: size per file |
|---|---|---|---|---|
| Flat Field Correction | Large number of raw image tiles (98,169 GFP channel tiles, ~531 GB) | 2 bytes per pixel: 2.83 MB | Large number of corrected image tiles (98,169 GFP channel corrected tiles, ~531 GB) | 4 bytes per pixel: ~5.6 MB |
| Segmentation based on convolution kernels | Small number of phase contrast channel stitched images (388 time frames, ~219 GB) | 2 bytes per pixel: 593 MB | Small number of mask images (388 time frames, ~86 GB) | 2 bytes per pixel: 71 MB to 331 MB |
Hadoop MapReduce approach
– Image files are uploaded to HDFS
– Input formats are changed (image-reading input format and serialization)
– Splitting of the input: currently no split; each mapper processes a whole stitched image …
– Only mappers are used; output is written directly to HDFS as files
[Figure: diagram labeled "Output Files"; source citation not recoverable]
Hadoop MapReduce approach cont.
Advantages of using Hadoop:
– Data is stored at the local node, which avoids network file system bottlenecks when running at scale
– Hadoop manages task execution and automatically reruns failed tasks
– Big images: more work is lost when a task fails
– Small images: e.g. use Hadoop SequenceFiles, which consist of binary key/value pairs (key: image filename, value: image data). An alternative is Apache Avro (a data serialization system)
Run on the NIST HPC cluster (Raritan cluster):
– HPC queue system
– Data must be moved in and out
– Not possible to share data in HDFS
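The idea of packing many small images into one container of (filename, image data) pairs can be illustrated with a simple length-prefixed record layout. This is only a sketch of the key/value concept in plain Python. It is not the Hadoop SequenceFile on-disk format and does not use any Hadoop or Avro API. The record layout, function names, and file names below are hypothetical.

```python
import io
import struct


def write_records(stream, records):
    """Append (filename, image_bytes) pairs as length-prefixed binary records."""
    for name, data in records:
        key = name.encode("utf-8")
        # Two big-endian unsigned 32-bit lengths, then the key and value bytes.
        stream.write(struct.pack(">II", len(key), len(data)))
        stream.write(key)
        stream.write(data)


def read_records(stream):
    """Yield (filename, image_bytes) pairs until the stream is exhausted."""
    while True:
        header = stream.read(8)
        if not header:
            return
        if len(header) != 8:
            raise ValueError("truncated record header")
        key_len, val_len = struct.unpack(">II", header)
        key = stream.read(key_len)
        value = stream.read(val_len)
        if len(key) != key_len or len(value) != val_len:
            raise ValueError("truncated record body")
        yield key.decode("utf-8"), value


def pack_to_bytes(records):
    """Convenience helper: pack records into an in-memory bytes object."""
    buf = io.BytesIO()
    write_records(buf, records)
    return buf.getvalue()
```

Reading back with `list(read_records(io.BytesIO(blob)))` returns the original pairs in order, so a single large container can stand in for thousands of small tile files.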
Image segmentation benchmark using Hadoop: results
– Single node, single threaded, using Java: 10 hours
– Using Matlab on a desktop machine: ~21.5 hours
– The task is both I/O and computation intensive
– Image segmentation scales well using Hadoop
– Efficiency decreases as the number of nodes increases
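The scaling statements above are usually quantified with speedup and parallel efficiency. The helper below uses the conventional definitions, speedup = T_single / T_parallel and efficiency = speedup / number of nodes. Whether the benchmark charts used exactly these definitions or this baseline is not stated in the slides, so treat this as an assumption.

```python
def speedup(t_single, t_parallel):
    """Conventional speedup: single-node time divided by parallel time."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("times must be positive")
    return t_single / t_parallel


def efficiency(t_single, t_parallel, n_nodes):
    """Parallel efficiency: speedup divided by the number of nodes used."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    return speedup(t_single, t_parallel) / n_nodes
```

If the 10-hour single-node Java run is taken as `t_single`, a run finishing in `t_parallel` hours on `n_nodes` nodes has efficiency `10 / (n_nodes * t_parallel)`. Efficiency below 1.0 that falls as nodes are added is the pattern the slide describes.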
Flat Field Correction benchmark using Hadoop: results
– The task is I/O intensive, primarily in writing output data to the HDFS file system
Hadoop MapReduce approach cont.
Future work: techniques under consideration
– Achieve pixel-level parallelism by breaking each image into smaller images, running the algorithms (segmentation, flat field correction, …) on them, and joining the results upon completion (before downloading files from HDFS to the network file system).
– This method can also be extended to overlapping blocks, by providing a method that splits the input image along boundaries between an atomic number of rows/cols and defines the number of overlapping pixels along each side.
– Compare no split, split, and split with overlapping pixels.
– Reduce tasks in the MapReduce framework can be useful for some image processing algorithms, e.g. feature extraction.
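The split, process, and join idea with overlapping pixels can be sketched as follows. For brevity the sketch splits along rows only and uses a 3x3 erosion (minimum filter) as a stand-in neighbourhood operation. The function names and the `block_rows` and `overlap` parameters are hypothetical and not taken from the slides.

```python
def erode3x3(img):
    """3x3 minimum filter; positions outside the image are ignored."""
    h, w = len(img), len(img[0])
    return [[min(img[ny][nx]
                 for ny in range(max(0, y - 1), min(h, y + 2))
                 for nx in range(max(0, x - 1), min(w, x + 2)))
             for x in range(w)]
            for y in range(h)]


def split_rows(img, block_rows, overlap):
    """Split an image into row blocks with up to `overlap` extra rows per side.

    Returns (start, owned, top_halo, block) tuples: start is the first row the
    block owns, owned is how many rows it owns, and top_halo is the number of
    extra rows prepended above the owned rows.
    """
    h = len(img)
    blocks = []
    for start in range(0, h, block_rows):
        stop = min(start + block_rows, h)
        lo = max(0, start - overlap)
        hi = min(h, stop + overlap)
        blocks.append((start, stop - start, start - lo, img[lo:hi]))
    return blocks


def process_in_blocks(img, block_rows, overlap, op):
    """Run op on each block independently, drop halo rows, and join the results."""
    out = []
    for _start, owned, top_halo, block in split_rows(img, block_rows, overlap):
        result = op(block)
        out.extend(result[top_halo:top_halo + owned])
    return out
```

With a 3x3 kernel, one overlapping row per side is enough for the joined result to equal the whole-image result. With `overlap=0` the rows at block seams can differ, which is why a comparison between split and split-with-overlap is worth running.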
Summary
We have developed image processing algorithms and characterized their computations as potential contributions to
– scaling cell image analysis applications and
– providing image processing benchmarks using Hadoop
Future work:
– Optimize and tune these image processing computations using Hadoop
– Generalize them as libraries/benchmarks/tools for image processing