Project Matsu: Large Scale On-Demand Image Processing for Disaster Relief
Collin Bennett, Robert Grossman, Yunhong Gu, and Andrew Levine
Open Cloud Consortium, June 21

Project Matsu Goals
Provide persistent data resources and elastic computing to assist in disasters:
– Make imagery available for disaster relief workers
– Elastic computing for large scale image processing
– Change detection for temporally different and geospatially identical image sets
Provide a resource for standards testing and interoperability studies for large data clouds

Part 1: Open Cloud Consortium

501(c)(3) not-for-profit corporation
– Supports the development of standards, interoperability frameworks, and reference implementations
– Manages testbeds: Open Cloud Testbed and Intercloud Testbed
– Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud
– Develops benchmarks

OCC Members
– Companies: Aerospace, Booz Allen Hamilton, Cisco, InfoBlox, Open Data Group, Raytheon, Yahoo
– Universities: CalIT2, Johns Hopkins, Northwestern Univ., University of Illinois at Chicago, University of Chicago
– Government agencies: NASA
– Open Source Projects: Sector Project

Operates Clouds
– 500 nodes, 3000 cores, 1.5+ PB
– Four data centers, 10 Gbps
– Target to refresh 1/3 each year
Clouds and testbeds:
– Open Cloud Testbed
– Open Science Data Cloud
– Intercloud Testbed
– Project Matsu: Cloud-based Disaster Relief Services

Open Science Data Cloud
– Astronomical data
– Biological data (Bionimbus)
– Networking data
– Image processing for disaster relief

Focus of OCC Large Data Cloud Working Group
– Cloud Storage Services
– Cloud Compute Services (MapReduce, UDF, & other programming frameworks)
– Table-based Data Services
– Relational-like Data Services
– App
Developing APIs for this framework.

Tools and Standards
– Apache Hadoop/MapReduce
– Sector/Sphere large data cloud
– Open Geospatial Consortium Web Map Service (WMS)
– OCC tools are open source (matsu-project)

Part 2: Technical Approach
– Hadoop (lead: Andrew Levine)
– Hadoop with Python Streams (lead: Collin Bennett)
– Sector/Sphere (lead: Yunhong Gu)

Implementation 1: Hadoop & MapReduce (Andrew Levine)

Image Processing in the Cloud - Mapper
– Step 1: Input to Mapper. Key: bounding box; value: image + timestamp
– Step 2: Processing in Mapper. The mapper resizes and/or cuts up the original image into pieces, each covering its own bounding box (e.g., miny = 45.0, maxy = 67.5)
– Step 3: Mapper Output. One output record per piece; key: the piece's bounding box; value: image piece + timestamp
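A minimal Python sketch of the mapper's key/value flow described above, assuming a bounding box is a (minx, miny, maxx, maxy) tuple and using a hypothetical cut_tile helper to crop the image; the project's actual mapper is a Hadoop MapReduce job and may differ.

# Conceptual sketch only, not Project Matsu's actual code.
def split_bounding_box(box, n=2):
    """Cut a bounding box (minx, miny, maxx, maxy) into an n x n grid."""
    minx, miny, maxx, maxy = box
    dx, dy = (maxx - minx) / n, (maxy - miny) / n
    for i in range(n):
        for j in range(n):
            yield (minx + i * dx, miny + j * dy,
                   minx + (i + 1) * dx, miny + (j + 1) * dy)

def map_image(box, image, timestamp, cut_tile):
    """Mapper: input key = bounding box, input value = image + timestamp.
    cut_tile is a hypothetical helper that crops the image to a sub-box."""
    for tile_box in split_bounding_box(box):
        # Output key: the piece's bounding box.
        # Output value: the resized/cut image piece plus the original timestamp.
        yield tile_box, (cut_tile(image, box, tile_box), timestamp)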

Image Processing in the Cloud - Reducer
– Step 1: Input to Reducer. Key: bounding box; values: the image pieces collected for that bounding box
– Step 2: Process differences in Reducer. Assemble images based on timestamps and compare; the result is a delta of the two images
– Step 3: Reducer Output. The Timestamp 1 set, the Timestamp 2 set, and the Delta set go to different map layers for display in WMS
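A minimal Python sketch of the reducer logic described above, assuming each incoming value is a (timestamp, image piece) pair and using a hypothetical image_diff helper for the comparison; illustrative only, not the project's implementation.

# Conceptual sketch only, not Project Matsu's actual code.
def reduce_tiles(bounding_box, values, image_diff):
    """Reducer: key = bounding box, values = all (timestamp, piece) pairs for it."""
    # Group the incoming pieces by timestamp.
    by_timestamp = {}
    for timestamp, piece in values:
        by_timestamp.setdefault(timestamp, []).append(piece)

    # Assemble the two temporally different sets and compare them
    # (assumes exactly two timestamps per bounding box).
    t1, t2 = sorted(by_timestamp)[:2]
    delta = image_diff(by_timestamp[t1], by_timestamp[t2])

    # Emit three map layers for display via WMS.
    yield bounding_box, ("timestamp_1_set", by_timestamp[t1])
    yield bounding_box, ("timestamp_2_set", by_timestamp[t2])
    yield bounding_box, ("delta_set", delta)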

Implementation 2: Hadoop & Python Streams Collin Bennett

Preprocessing Step
– All images (in a batch to be processed) are combined into a single file
– Each line contains the image’s byte array transformed to pixels (raw bytes don’t seem to work well with the one-line-at-a-time Hadoop streaming paradigm)
– Record format: geolocation \t timestamp | tuple size ; image width ; image height ; comma-separated list of pixels
– The fields in red are the metadata needed to process the image in the reducer
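A minimal Python sketch of this preprocessing step, assuming Pillow for decoding and that the geolocation and timestamp for each image are supplied alongside its path; the project's actual batch tool may differ.

# Conceptual sketch only, not Project Matsu's actual code.
import sys
from PIL import Image

def image_to_record(path, geolocation, timestamp):
    """Serialize one image as a single line:
    geolocation \\t timestamp | tuple size ; width ; height ; pixel list"""
    im = Image.open(path)
    width, height = im.size
    tuple_size = len(im.getbands())          # e.g. 3 for an RGB image
    # Flatten every pixel (tuple or single value) into one comma-separated list.
    pixels = ",".join(str(v) for px in im.getdata()
                      for v in (px if isinstance(px, tuple) else (px,)))
    return "%s\t%s|%s;%s;%s;%s" % (geolocation, timestamp,
                                   tuple_size, width, height, pixels)

if __name__ == "__main__":
    # Combine a batch into one file: stdin lines of "path geolocation timestamp".
    for line in sys.stdin:
        path, geolocation, timestamp = line.split()
        print(image_to_record(path, geolocation, timestamp))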

Map and Shuffle
– We can use the identity mapper; all of the work for mapping was done in the preprocess step
– The map/shuffle key is the geolocation
– In the reducer, the timestamp will be the 1st field of each record when splitting on ‘|’
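A minimal Hadoop Streaming sketch of this step in Python, assuming the record format from the preprocessing slide and a hypothetical compare helper for the per-geolocation work; the project's actual scripts may differ.

# Conceptual sketch only, not Project Matsu's actual code.
import sys

def identity_mapper():
    # Each preprocessed line already starts with the geolocation key,
    # so the mapper passes it through and Hadoop shuffles on that key.
    for line in sys.stdin:
        sys.stdout.write(line)

def reducer(compare):
    # Records for the same geolocation arrive together; the timestamp is the
    # 1st field of the value once it is split on '|'. `compare` is a
    # hypothetical helper that does the per-geolocation image comparison.
    current_geo, records = None, []
    for line in sys.stdin:
        geolocation, value = line.rstrip("\n").split("\t", 1)
        timestamp, image_data = value.split("|", 1)
        if current_geo is not None and geolocation != current_geo:
            compare(current_geo, records)
            records = []
        current_geo = geolocation
        records.append((timestamp, image_data))
    if current_geo is not None:
        compare(current_geo, records)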

Implementation 3: Sector/Sphere Yunhong Gu

Sector Distributed File System
– Sector aggregates hard disk storage across commodity computers, with a single namespace, file-system-level reliability (using replication), and high availability
– Sector does not split files: a single image will not be split, so when it is being processed the application does not need to read data from other nodes over the network
– A directory can be kept together on a single node as well, as an option

Sphere UDF
– Sphere allows a User Defined Function (UDF) to be applied to each file (whether it contains a single image or multiple images)
– Existing applications can be wrapped up in a Sphere UDF
– In many situations, the Sphere streaming utility accepts a data directory and an application binary as inputs:
./stream -i haiti -c ossim_foo -o results

For More Information