Satellite Image Processing and Production with Apache Hadoop
David V. Hill, Information Dynamics, Contractor to USGS/EROS
U.S. Department of the Interior / U.S. Geological Survey
12/08/2011
Overview
- Apache Hadoop: applications, environment and use case
- Log processing example
- EROS Science Processing Architecture (ESPA) and Hadoop
- ESPA processing example
- ESPA implementation strategy
- Performance results
- Thoughts, notes and takeaways
- Questions
Apache Hadoop – What is it?
- Open source distributed processing system
- Designed to run on commodity hardware
- Widely used for solving “Big Data” challenges
- Has been deployed in clusters with thousands of machines and petabytes of storage
- Two primary subsystems: the Hadoop Distributed File System (HDFS) and the MapReduce engine
Hadoop’s Applications
- Web content indexing
- Data mining
- Machine learning
- Statistical analysis and modeling
- Trend analysis
- Search optimization
- … and of course, satellite image processing!
Hadoop’s Environment
- Runs on Linux and Unix
- Java based, but relies on ssh for job distribution
- Jobs can be written in any language executable from a shell prompt: Java, C/C++, Perl, Python, Ruby, R, Bash, et al.
Hadoop’s Use Case
- A cluster of machines is configured into a Hadoop cluster; each machine contributes:
  - Local compute resources to MapReduce
  - Local storage resources to HDFS
- Files are stored in HDFS; file size is typically measured in gigabytes and terabytes
- A job is run against an input file in HDFS (see the submission sketch below):
  - The target input file is specified
  - The code to run against the input is also specified
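The deck does not show how such a job is actually launched, so here is a minimal sketch of submitting a Hadoop Streaming job from Python. The streaming jar location, HDFS paths and script names are illustrative assumptions, not details from the presentation:

import subprocess

# Illustrative path: the streaming jar location varies by installation
# (this matches a 2011-era $HADOOP_HOME/contrib layout).
STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"

def submit_streaming_job(input_path, output_path, mapper, reducer=None):
    # Build the 'hadoop jar' command: the target input file in HDFS and
    # the code to run against it are specified, as the slide describes.
    cmd = ["hadoop", "jar", STREAMING_JAR]
    if reducer is None:
        # Generic -D options must precede the streaming options;
        # zero reduce tasks makes this a map-only job.
        cmd += ["-D", "mapred.reduce.tasks=0"]
    cmd += ["-input", input_path,       # input file in HDFS
            "-output", output_path,     # HDFS directory for results
            "-mapper", mapper, "-file", mapper]
    if reducer is not None:
        cmd += ["-reducer", reducer, "-file", reducer]
    subprocess.check_call(cmd)

# Example: submit_streaming_job("/user/espa/input.txt", "/user/espa/out",
#                               "mapper.py", "reducer.py")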
Hadoop’s Use Case
- Unlike traditional systems, which move data to the code, Hadoop flips this and moves code to the data
- Two software functions comprise a MapReduce job: a Map operation and a Reduce operation
- Upon execution:
  - The “Map”: Hadoop identifies the input file’s chunk locations, moves the algorithm to them and executes the code
  - The “Reduce”: sorts the Map results and aggregates the final answer (single thread)
Log Processing Example
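The original slide is a diagram that does not survive in this transcript. As a stand-in, here is a minimal, hypothetical mapper/reducer pair in the Hadoop Streaming style, counting requests per host in a web server log; the log layout (host as the first whitespace-delimited field) is an assumption.

mapper.py:

#!/usr/bin/env python
import sys

# Map: emit one "host<TAB>1" record per log line.
for line in sys.stdin:
    fields = line.split()
    if fields:
        print("%s\t1" % fields[0])

reducer.py:

#!/usr/bin/env python
import sys

# Reduce: streaming delivers mapper output sorted by key, so all counts
# for a given host arrive as adjacent lines.
current_host, count = None, 0
for line in sys.stdin:
    host, _, value = line.rstrip("\n").partition("\t")
    if host != current_host:
        if current_host is not None:
            print("%s\t%d" % (current_host, count))
        current_host, count = host, 0
    count += int(value)
if current_host is not None:
    print("%s\t%d" % (current_host, count))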
ESPA and Hadoop
- In the log example, Hadoop map code runs in parallel on the input log file, processing a single input file as quickly as possible; reduce code then runs on the mapper output
- ESPA processes satellite images, not text:
  - The algorithms cannot run in parallel within an image
  - Satellite images cannot be used as the input
- Solution: use a text file listing the image locations as the input, and skip the reduce step (a mapper sketch follows)
- Rather than parallelizing within an image, ESPA handles many images at once
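To make the map-only design concrete, here is a hedged sketch of an ESPA-style mapper, where each input line names one scene. The fetch and processing commands (wget and a do_ledaps wrapper) are hypothetical placeholders, not actual ESPA code.

#!/usr/bin/env python
import subprocess
import sys

# Map-only job: each line of the input text file is the location of one
# satellite image; there is no reduce step and no parallelism within a scene.
for line in sys.stdin:
    scene_url = line.strip()
    if not scene_url:
        continue
    local_file = scene_url.split("/")[-1]
    # Hypothetical stand-ins for the real ESPA processing chain:
    subprocess.check_call(["wget", "-q", scene_url])    # pull the scene to local disk
    subprocess.check_call(["do_ledaps", local_file])    # hypothetical algorithm wrapper
    # Emit a status record so per-scene results appear in the job output.
    print("%s\tDONE" % scene_url)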
ESPA Processing Example
Implementation Strategy
- LSRD is budget constrained for hardware
- Other projects regularly excess old hardware upon warranty expiration
- Take ownership of these systems… if they fail, they fail
- Also ‘borrow’ compute and storage from other projects; only network connectivity is necessary
- Current cluster is 102 cores, at minimal expense (cables, switches, etc.)
Performance Results
- Original throughput requirement was 455 atmospherically corrected Landsat scenes per day
- Currently able to process ~4,800!
- Biggest bottleneck is local machine storage input/output, because files are ftp’d to the nodes instead of using HDFS as intended
- Attempted to solve this with a RAM disk, but there was not enough memory
- Currently evaluating solid state disks
Thoughts and Notes
- The number of splits on an input file can be controlled via the dfs.block.size parameter, and therefore so can the number of map tasks run against it (see the sketch below)
- Unlike other Hadoop instances, an ESPA-like implementation does not require massive storage: the input files are very small
- Robust internal job monitoring mechanisms are usually custom-built
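As a hedged illustration of that knob (using the 2011-era parameter name dfs.block.size; later Hadoop releases call it dfs.blocksize): writing the scene list into HDFS with a small per-file block size produces more splits, and therefore more concurrent map tasks. The paths and the 1 MB figure are illustrative.

import subprocess

# The number of splits is roughly file_size / dfs.block.size, and each
# split becomes one map task.  Storing the (tiny) scene list with a small
# block size fans the scenes out across more mappers.
subprocess.check_call([
    "hadoop", "fs",
    "-D", "dfs.block.size=1048576",   # 1 MB blocks for this file only
    "-put", "scene_list.txt", "/user/espa/scene_list.txt",
])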
Thoughts and Notes
- Jobs written for Hadoop Streaming may be tested and run without Hadoop:
  cat inputfile.txt | ./mapper.py | sort | ./reducer.py > out.txt
- Projects can share resources: Hadoop is tunable to restrict resource utilization on a per-machine basis
- Hadoop provides instant productivity gains versus internal development:
  - LSRD is all about science and science algorithms
  - Minimal time and budget for building internal systems
Takeaways
- Hadoop is proven and tested, and massively scalable out of the box
- Cloud-based instances are available from Amazon and others
- Shortest path to processing massive amounts of data
- Extremely tolerant of hardware failures; no specialized hardware or software needed
- Flexible job API allows existing software skills to be leveraged
- Industry adoption means support skills are available
Questions