Models for scientific exploitation of EO data * ESRIN * 12.10.2012.

Calvalus: full mission EO cal/val processing and exploitation services

Outline  Objectives and achievements  Apache Hadoop in five slides  Calvalus = Hadoop for EO  Calvalus bulk processing
* Jeffrey Dean and Sanjay Ghemawat, Google, 2004: “MapReduce: Simplified Data Processing on Large Clusters”, Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, 2004

There was a dream …  easily exploit full mission EO archives  have a powerful and affordable multi-mission processing infrastructure  generate products using full mission datasets, with new algorithms and algorithm versions  aggregate results in the temporal and spatial dimensions  test new ideas in a rapid prototyping approach  have a tool to perform calibration and validation on full mission archives as the basis for reliable scientific conclusions  robust production

Calvalus for Land Cover CCI Pre-processing Generation of 7-day composites of surface reflectance from full mission MERIS FRS and RR for CCI Land Cover is a data- and computing-intensive automated job that runs for 3 months on a 72-node Calvalus/Hadoop cluster. Quicklook generation for full mission MERIS FRS and RR reads and processes 150 TB of input data in 10 hours, about 33 Gbit/s sustained. Other full mission processes fall between these two runtimes.
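The throughput figure can be verified with a quick back-of-the-envelope calculation (a sketch, assuming decimal terabytes and the stated 10-hour runtime):

```python
# Back-of-the-envelope check of the quicklook throughput:
# 150 TB read and processed in 10 hours, expressed in Gbit/s.
input_bytes = 150e12            # 150 TB (decimal terabytes assumed)
duration_s = 10 * 3600          # 10 hours in seconds
gbit_per_s = input_bytes * 8 / duration_s / 1e9
print(f"{gbit_per_s:.1f} Gbit/s")  # → 33.3 Gbit/s
```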

Projects using Calvalus  ESA CoastColour: 6 years MERIS FR, 27 regions  ESA Land Cover CCI: pre-processing, full mission weekly L3 from MERIS and SPOT VGT  ESA Ocean Colour CCI: algorithm improvement cycle, MODIS, SeaWiFS, MERIS  GlobVeg: global FAPAR and LAI from MERIS  Prevue: MERIS full mission subset extraction  Fronts: MERIS detection of fronts  Diversity II: bio-diversity of lakes and drylands

Hadoop = HDFS + jobs/tasks + MapReduce Archive-centric approach: network storage; data are transferred over the network; risk of a network bottleneck. Hadoop approach: direct, data-local processing on the compute cluster; tasks, not data, are transferred over the network; good scalability.

Cluster hardware and network  standard hardware  Calvalus additions for I/O and development [Diagram: compute nodes 1…n, each with a local disk; a master node; a feeder node connected to an external data source or destination; a test server running VMs.]

Hadoop Distributed File System  distributed file system HDFS on the local disks of the compute nodes  transparent, optimised data-local access  data replication  automated recovery  continued service

Hadoop Job Scheduling  flexible granularity of inputs defined by split functions (for EO: one file – one split)  massively parallel processing, task pull  takes failure into account: automated re-attempts, optional speculative execution  job queues, priorities, fair sharing among projects [Diagram: a job's input set is divided into input splits, one task per split, processed data-locally.]
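The one-file-one-split policy and the automated re-attempt behaviour can be sketched in a few lines (an illustration only, not the Calvalus or Hadoop API; the `max_attempts` limit is a hypothetical choice):

```python
def make_splits(input_files):
    """One EO input file per split -- the granularity described above."""
    return [[path] for path in input_files]

def run_with_reattempts(task, split, max_attempts=3):
    """Run a task on its split; on failure, re-attempt automatically,
    as the Hadoop scheduler does for failed tasks."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(split)
        except RuntimeError:
            if attempt == max_attempts:
                raise

splits = make_splits(["MER_RR_A.N1", "MER_RR_B.N1"])
# splits == [["MER_RR_A.N1"], ["MER_RR_B.N1"]]
```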

Parallel aggregation with MapReduce  data-local access to inputs  a well-selected sorting and partitioning function  generation of the output in parts that can simply be concatenated
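These three ingredients can be illustrated with a minimal in-memory MapReduce (a sketch, not the Hadoop API): the partition function routes keys to reducers, each reducer emits a key-sorted output part, and the parts concatenate to the full result.

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn, partition_fn, num_parts):
    # Map phase: emit (key, value) pairs and route each key
    # to a partition via the partitioning function.
    partitions = [defaultdict(list) for _ in range(num_parts)]
    for item in inputs:
        for key, value in map_fn(item):
            partitions[partition_fn(key, num_parts)][key].append(value)
    # Reduce phase: each partition is sorted by key and reduced
    # independently; the parts can simply be concatenated.
    return [[(k, reduce_fn(k, part[k])) for k in sorted(part)]
            for part in partitions]

# Toy example: count observations per geographic bin.
records = [("binA", 1), ("binB", 1), ("binA", 1), ("binC", 1)]
parts = map_reduce(
    records,
    map_fn=lambda r: [r],
    reduce_fn=lambda k, vs: sum(vs),
    partition_fn=lambda k, n: hash(k) % n,
    num_parts=2,
)
merged = dict(kv for part in parts for kv in part)
# merged == {"binA": 2, "binB": 1, "binC": 1}
```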

L2 Bulk Processing Realisation: MERIS RR L1, North Sea, 3 days; CoastColour NN L2 processor; 6 minutes (22 nodes); output: L2 files. [Diagram: L1 files processed in parallel by L2 processor mapper tasks, each producing an L2 file.]
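A mapper-only job like this is embarrassingly parallel: one task per L1 file and no reduce step. A minimal sketch (the `l2_process` function is a stand-in for the CoastColour processor, which is not shown here):

```python
from concurrent.futures import ThreadPoolExecutor

def l2_process(l1_file):
    """Placeholder for the L2 processor run inside one mapper task:
    each L1 input yields one L2 output, independently of the others."""
    return l1_file.replace("L1", "L2")

l1_files = [f"MER_RR_L1_{i:03d}.N1" for i in range(5)]
# One task per input file, executed in parallel -- no reducer needed.
with ThreadPoolExecutor(max_workers=4) as pool:
    l2_files = list(pool.map(l2_process, l1_files))
# l2_files[0] == "MER_RR_L2_000.N1"
```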

Match-up Analysis Realisation: MERIS RR L1, global, 3 months; CoastColour C2W processor; NOMAD in-situ dataset; 6 minutes (22 nodes); output: scatter-plots and pixel extraction. [Diagram: mapper tasks run the L2 processor and matcher on L1 files and emit output records; a reducer task generates the match-up (MA) report.]
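The mapper-side matching step can be sketched as pairing each in-situ record with the closest satellite pixel in time (an illustration only; the 3-hour window and time-only criterion are hypothetical, not the CoastColour match-up protocol):

```python
def match_ups(pixels, in_situ, max_dt=3 * 3600):
    """Pair each in-situ record with the closest pixel in time,
    if one exists within the time window (mapper side)."""
    records = []
    for station_time, station_value in in_situ:
        candidates = [p for p in pixels
                      if abs(p[0] - station_time) <= max_dt]
        if candidates:
            best = min(candidates, key=lambda p: abs(p[0] - station_time))
            records.append((station_value, best[1]))
    return records

pixels = [(1000, 0.11), (7200, 0.15), (90000, 0.30)]  # (time [s], value)
in_situ = [(3600, 0.12), (50000, 0.40)]
# The record at 50000 s is more than 3 h from every pixel,
# so only one match-up record results:
print(match_ups(pixels, in_situ))  # → [(0.12, 0.11)]
```

A reducer would then collect such records from all mappers into one report.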

L2/L3 Processing Realisation: MERIS RR L1, global, 10-day; CoastColour C2W processor; 1.5 hours (22 nodes); output: 1 L3 product. [Diagram: mapper tasks perform L2 processing and spatial binning of L1 files into spatial bins; reducer tasks perform L3 temporal binning into temporal bins; an L3 formatting (staging) step writes the L3 file(s).]
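The two binning stages can be sketched in a few lines (a simplified illustration; the 1-degree integer grid and plain averaging are hypothetical stand-ins for the real L3 binning scheme):

```python
from collections import defaultdict

def spatial_bin(pixels):
    """Mapper side: average L2 pixel values into 1-degree spatial bins."""
    bins = defaultdict(list)
    for lat, lon, value in pixels:
        bins[(int(lat), int(lon))].append(value)
    return {b: sum(v) / len(v) for b, v in bins.items()}

def temporal_bin(daily_bins):
    """Reducer side: average the spatial bins over the binning period."""
    merged = defaultdict(list)
    for day in daily_bins:
        for b, v in day.items():
            merged[b].append(v)
    return {b: sum(v) / len(v) for b, v in merged.items()}

day1 = spatial_bin([(45.2, 8.7, 0.10), (45.9, 8.1, 0.20)])
day2 = spatial_bin([(45.5, 8.5, 0.40)])
l3 = temporal_bin([day1, day2])
# l3[(45, 8)] ≈ 0.275
```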

Trend Analysis Realisation: MERIS RR L1, South Pacific Gyre, first 4 days of each month; CoastColour C2W processor; 30 minutes (22 nodes); output: time-series plots and data. [Diagram: per period, mapper tasks perform L2 processing and spatial binning; reducer tasks perform L3 temporal binning; a TA formatting (staging) step produces the TA report.]

Processor integration  Adapter for Unix executables (C++, Fortran, Python, …)  Adapter for BEAM GPF operators  Concurrent processor versions in the system  Automated deployment of processor bundles at runtime
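An adapter for a Unix executable might look like the following (a minimal sketch, not the Calvalus adapter; it assumes a processor that reads its input from stdin and writes its product to stdout):

```python
import subprocess

def run_processor(command, input_path, output_path):
    """Run an external executable on one input file, mapper-style:
    feed the input on stdin and capture stdout as the product file."""
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        result = subprocess.run(command, stdin=src, stdout=dst)
    if result.returncode != 0:
        raise RuntimeError(f"{command[0]} failed on {input_path}")
    return output_path

# Demonstrated with 'sort' standing in for a real processor:
# run_processor(["sort"], "l1_input.txt", "l2_output.txt")
```

The same wrapper works for any language the executable is written in, which is the point of the adapter approach.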

Calvalus + BEAM for data streaming  supported by the BEAM Graph Processing Framework  access to data via reader/writer objects instead of files  operator chaining to build processors from modules  tile cache and pull principle for in-memory processing  Hadoop MapReduce for partitioning and streaming

Quality check in bulk processing workflows [Diagram: quicklook (QL) generation, visual QC, blacklisting, automated QC and inventory are applied at two stages, to the daily L1B inputs and to the 7-day SR composites; FRS/RR L1B inputs are geocoded with AMORGOS (ORB/ATT data from GET ASSE) into FRG/RRG L1B, processed to L2 SDR, then composited to 7-day SR by the L3 processor; error reports and feedback on inputs with issues identified in MERIS L1B are returned to GET ASSE.]

Bulk production control for full mission reprocessing [Diagram: a request queue feeds a workflow engine (parameters, sequencing) and resource management (resources, constraints); a processing monitor provides progress observation, reports and status; bulk production is started over the mission years, increasing two months at a time, with concurrent processing steps driven by the processing workflow and processor versions.]

Jobs and tasks to be managed [Table: workflow steps with bulks, jobs, tasks, inputs and outputs — input MERIS FRS+RR (TB-scale); auto-QA + inventory; QL daily; QL scenes visual QA screening (7300+ inputs); AMORGOS geocoding; Level 2 SDR processing; Level 3 SR 7-day composites; QL SR visual QA screening (1040+ outputs); SR result export ((10), 60 TB); sum.]

Calvalus portal for on-demand processing  input set selection  processor versions  processing parameters  in-situ data for match-up analysis  variables for aggregation  trend analysis

Summary  Calvalus is a multi-mission, full-mission data processing system for bulk (re)processing, data analysis and algorithm validation  Calvalus is based on the open-source middleware Apache Hadoop and implements massively parallel, data-local processing  Calvalus integrates processors of the BEAM GPF processing framework, and Unix executables in any programming language  Calvalus is successfully used by various projects and will be developed further  Acknowledgement: the initial Calvalus idea was developed and its realisation funded by the European Space Agency under the SME-LET programme.

Reflection points  The adequate hardware infrastructure for Hadoop differs from the current trend towards virtualisation and network storage (transparency vs. knowledge of data location).  Adapted, optimised solutions may have a shorter life cycle than generic, standardised ones (processor interfaces that support data streaming vs. a file interface).  Historical missions (ENVISAT) are not the problem. Are we prepared for Sentinel data?