Project Matsu: Large Scale On-Demand Image Processing for Disaster Relief
Collin Bennett, Robert Grossman, Yunhong Gu, and Andrew Levine
Open Cloud Consortium
June 21,
Project Matsu Goals
Provide persistent data resources and elastic computing to assist in disasters:
– Make imagery available for disaster relief workers
– Elastic computing for large scale image processing
– Change detection for temporally different and geospatially identical image sets
Provide a resource to test standards and support interoperability studies for large data clouds
Part 1: Open Cloud Consortium
501(c)(3) not-for-profit corporation
– Supports the development of standards, interoperability frameworks, and reference implementations
– Manages testbeds: Open Cloud Testbed and Intercloud Testbed
– Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud
– Develops benchmarks
OCC Members
– Companies: Aerospace, Booz Allen Hamilton, Cisco, InfoBlox, Open Data Group, Raytheon, Yahoo
– Universities: CalIT2, Johns Hopkins, Northwestern Univ., University of Illinois at Chicago, University of Chicago
– Government agencies: NASA
– Open Source Projects: Sector Project
Operates Clouds
– 500 nodes, 3,000 cores, 1.5+ PB across four data centers, connected at 10 Gbps
– Target to refresh 1/3 of the hardware each year
– Open Cloud Testbed
– Open Science Data Cloud
– Intercloud Testbed
– Project Matsu: Cloud-based Disaster Relief Services
Open Science Data Cloud
– Astronomical data
– Biological data (Bionimbus)
– Networking data
– Image processing for disaster relief
Focus of OCC Large Data Cloud Working Group
– Cloud Storage Services
– Cloud Compute Services (MapReduce, UDF, & other programming frameworks)
– Table-based Data Services
– Relational-like Data Services
– App
Developing APIs for this framework.
Tools and Standards
– Apache Hadoop/MapReduce
– Sector/Sphere large data cloud
– Open Geospatial Consortium Web Map Service (WMS)
– OCC tools are open source (matsu-project)
Part 2: Technical Approach
– Hadoop – Lead: Andrew Levine
– Hadoop with Python Streams – Lead: Collin Bennett
– Sector/Sphere – Lead: Yunhong Gu
Implementation 1: Hadoop & MapReduce
Andrew Levine
Image Processing in the Cloud - Mapper
Step 1: Input to Mapper – the input key is a bounding box and the input value is the image data plus a timestamp.
Step 2: Processing in Mapper – the mapper resizes and/or cuts up the original image into pieces, each covering a smaller bounding box (e.g. minx = miny = 45.0, maxx = maxy = 67.5).
Step 3: Mapper Output – one key/value pair per piece: the output key is the piece's bounding box and the output value is the image tile plus its timestamp.
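The slide describes the mapper only as a diagram. Below is a minimal illustrative sketch of the same tiling idea in Python, not the project's actual Java mapper; the BBox helper, the quadrant split, and the PIL-style image object (with size and crop) are assumptions made for the example.

from collections import namedtuple

BBox = namedtuple("BBox", "minx miny maxx maxy")

def split_bbox(bbox):
    # Split a bounding box into four equal quadrants.
    midx = (bbox.minx + bbox.maxx) / 2.0
    midy = (bbox.miny + bbox.maxy) / 2.0
    return [
        BBox(bbox.minx, bbox.miny, midx, midy),
        BBox(midx, bbox.miny, bbox.maxx, midy),
        BBox(bbox.minx, midy, midx, bbox.maxy),
        BBox(midx, midy, bbox.maxx, bbox.maxy),
    ]

def map_image(bbox, image, timestamp):
    # Emit (sub-bounding-box, (image tile, timestamp)) pairs.
    # The y-axis inversion between geographic and pixel space is ignored for brevity.
    w, h = image.size
    for quad in split_bbox(bbox):
        x0 = int((quad.minx - bbox.minx) / (bbox.maxx - bbox.minx) * w)
        x1 = int((quad.maxx - bbox.minx) / (bbox.maxx - bbox.minx) * w)
        y0 = int((quad.miny - bbox.miny) / (bbox.maxy - bbox.miny) * h)
        y1 = int((quad.maxy - bbox.miny) / (bbox.maxy - bbox.miny) * h)
        tile = image.crop((x0, y0, x1, y1))
        yield quad, (tile, timestamp)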
Image Processing in the Cloud - Reducer
Step 1: Input to Reducer – the input key is a bounding box (minx, miny, maxx, maxy) and the input values are the image tiles for that box, each with a timestamp.
Step 2: Processing in Reducer – assemble the images based on their timestamps and compare them; the result is a delta of the two images.
Step 3: Reducer Output – the images go to different map layers for display in WMS: the timestamp 1 set, the timestamp 2 set, and the delta set.
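As with the mapper, a minimal Python sketch of the change-detection reducer is shown below; it assumes each value is a (timestamp, pixel array) pair and that tiles for a bounding box have the same shape once grouped by timestamp. The layer names are illustrative.

from collections import defaultdict
import numpy as np

def reduce_bbox(bbox, values):
    # Group tiles by timestamp and emit the per-pixel delta between two epochs.
    by_time = defaultdict(list)
    for timestamp, pixels in values:
        by_time[timestamp].append(np.asarray(pixels, dtype=np.int16))

    # Assemble one image per timestamp (here simply averaging overlapping tiles).
    epochs = sorted(by_time)
    assembled = {t: np.mean(by_time[t], axis=0) for t in epochs}

    if len(epochs) < 2:
        return  # nothing to compare for this bounding box

    t1, t2 = epochs[0], epochs[-1]
    delta = np.abs(assembled[t2] - assembled[t1])

    # Three WMS layers per bounding box: each epoch plus the difference image.
    yield bbox, ("timestamp_1_set", assembled[t1])
    yield bbox, ("timestamp_2_set", assembled[t2])
    yield bbox, ("delta_set", delta)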
Implementation 2: Hadoop & Python Streams
Collin Bennett
Preprocessing Step
All images in a batch to be processed are combined into a single file. Each line contains the image's byte array transformed to pixels (raw bytes don't seem to work well with the one-line-at-a-time Hadoop streaming paradigm). The record format is:
geolocation \t timestamp | tuple size ; image width ; image height ; comma-separated list of pixels
The tuple size, image width, and image height fields are the metadata needed to process the image in the reducer.
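A minimal sketch of this preprocessing step is shown below, assuming the record layout above and using Pillow to decode images; the command-line convention (geolocation and timestamp passed as arguments) is an assumption for the example, not the project's actual tooling.

import sys
from PIL import Image

def image_to_record(path, geolocation, timestamp):
    # Emit one line: geolocation \t timestamp|tuple_size;width;height;p1,p2,...
    img = Image.open(path).convert("RGB")
    width, height = img.size
    tuple_size = 3  # RGB
    pixels = ",".join(str(v) for px in img.getdata() for v in px)
    metadata = "%d;%d;%d;%s" % (tuple_size, width, height, pixels)
    return "%s\t%s|%s" % (geolocation, timestamp, metadata)

if __name__ == "__main__":
    # Hypothetical usage: python preprocess.py GEOLOC TIMESTAMP img1.tif img2.tif > batch.txt
    geoloc, timestamp = sys.argv[1], sys.argv[2]
    for path in sys.argv[3:]:
        print(image_to_record(path, geoloc, timestamp))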
Map and Shuffle
– We can use the identity mapper; all of the work for mapping was done in the preprocessing step.
– The map/shuffle key is the geolocation.
– In the reducer, the timestamp will be the 1st field of each record when splitting on '|' (see the sketch below).
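The following is a minimal Hadoop Streaming reducer sketch that consumes the record format from the preprocessing step; the change-detection step (differencing the earliest and latest image per geolocation) is an illustrative assumption, not the project's exact logic.

import sys

def parse(line):
    geolocation, payload = line.rstrip("\n").split("\t", 1)
    timestamp, rest = payload.split("|", 1)
    tuple_size, width, height, pixel_csv = rest.split(";", 3)
    pixels = [int(p) for p in pixel_csv.split(",")]
    return geolocation, timestamp, int(width), int(height), pixels

def emit_delta(geo, images):
    # Hypothetical change detection: difference the earliest and latest image.
    if len(images) < 2:
        return
    images.sort()  # timestamp is the 1st element of each pair
    (_, first), (_, last) = images[0], images[-1]
    delta = [abs(a - b) for a, b in zip(first, last)]
    print("%s\t%s" % (geo, ",".join(str(d) for d in delta)))

def main():
    # Streaming guarantees records for the same key arrive contiguously on stdin.
    current_geo, images = None, []
    for line in sys.stdin:
        geo, timestamp, width, height, pixels = parse(line)
        if current_geo is not None and geo != current_geo:
            emit_delta(current_geo, images)
            images = []
        current_geo = geo
        images.append((timestamp, pixels))
    if current_geo is not None:
        emit_delta(current_geo, images)

if __name__ == "__main__":
    main()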
Implementation 3: Sector/Sphere
Yunhong Gu
Sector Distributed File System
– Sector aggregates hard disk storage across commodity computers, with a single namespace, file-system-level reliability (using replication), and high availability.
– Sector does not split files: a single image will not be split, so when it is being processed the application does not need to read data from other nodes over the network.
– As an option, a directory can also be kept together on a single node.
Sphere UDF
– Sphere allows a User Defined Function (UDF) to be applied to each file, whether it holds a single image or multiple images.
– Existing applications can be wrapped up in a Sphere UDF.
– In many situations, the Sphere streaming utility accepts a data directory and an application binary as inputs:
./stream -i haiti -c ossim_foo -o results
For More Information