High-throughput parallel pipelined data processing system for remote Earth sensing big data in the clouds

Presentation transcript:

Slide 1: High-throughput parallel pipelined data processing system for remote Earth sensing big data in the clouds. A.M. Novikov, V.A. Aulov, A.A. Poyda, NRC "Kurchatov Institute". GRID'2014, 2/07/2014, Dubna.

Slide 2: Problem overview
Satellites: Terra, Aqua (launched in 1999 and 2002).
Sensor: 36 spectral bands, covering the entire Earth's surface at >= 1 km² resolution.
Data: available within 1-2 days, from many FTP sites; 3.9 TB for 10 years (atmosphere and land products).

Slide 3: Problem overview
Satellite: Suomi NPP (launched 2011).
Sensor: VIIRS only, ~20 spectral bands, covering the entire Earth's surface.
Data: twice per day; the online FTP site keeps 85 days; several TB daily; images of X px (0.65 km²?).

Slide 4: Pipeline
(Dataflow diagram.) Processing stages: Terrain_correction (applied to DNB geo files and SVM geo files, producing SVM TerrainCorrected geo files), Reprojection, Mosaic_stiching, Fires detection. Data files: DNB data files, SVM data files, SVM10 data files. Edge multiplicities in the diagram: N:1; 10 optional, 1:1; 1 optional, 1:1.
1. Simple chains with aggregation.
2. Nested chains are supported up to 2 levels (more if output files are input files for a higher-level chain).
3. A new version of a data or optional file restarts (or repeats) the matching chain instance (see the sketch below).
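The aggregation and restart rules above can be illustrated with a minimal Python sketch. Everything here (ChainInstance, on_new_file, the role names and file names) is illustrative and assumed, not the system's actual API; the point is only the behaviour described in notes 1 and 3:

    # Minimal sketch of chain matching and restart on new file versions.
    # All names are illustrative; the real system keeps this state elsewhere (e.g. MySQL).

    class ChainInstance:
        def __init__(self, key):
            self.key = key        # match parameters, e.g. (date, granule)
            self.inputs = {}      # role -> (filename, version)
            self.done = False

        def ready(self, required_roles):
            return all(role in self.inputs for role in required_roles)

    instances = {}                # key -> ChainInstance
    REQUIRED = ("geo", "data")    # roles a chain needs before it can run

    def run_chain(inst):
        print("running chain %s on %s" % (inst.key, sorted(inst.inputs.values())))
        inst.done = True

    def on_new_file(key, role, filename, version):
        """Register a file; restart the matching chain if a newer version arrives."""
        inst = instances.setdefault(key, ChainInstance(key))
        old = inst.inputs.get(role)
        inst.inputs[role] = (filename, version)
        if old is not None and version > old[1] and inst.done:
            inst.done = False     # note 3: a new version restarts the chain instance
            print("restarting chain %s: %s updated to v%d" % (key, role, version))
        if inst.ready(REQUIRED) and not inst.done:
            run_chain(inst)

    # Example: geo and data files arrive, then the data file is re-delivered as v2.
    on_new_file(("2014-07-02", "granule42"), "geo", "GDNBO_t0001.h5", 1)
    on_new_file(("2014-07-02", "granule42"), "data", "SVDNB_t0001.h5", 1)
    on_new_file(("2014-07-02", "granule42"), "data", "SVDNB_t0001.h5", 2)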

Slide 5: Pipeline

                DNBg     DNBd    SVMg    SVMd     Tcg      Repr      Mos     Det
    in (all)    124 GB   15 GB   78 GB   128 GB   124 GB   93 GB     0.6*N   299 GB
    out (all)   -        -       -       -        78 GB    254 GB?   <1*N    0.5 GB
    number      < N < 254

The solvers (or «binaries») are:
Terrain_correction and Reprojection: Java classes, single-core, memory use 1500 MB and 2500+ MB.
Detection and Mosaic_stiching: Matlab (MRE12b), single-core, memory use ~3000 MB and ? (depends on window size).
Supported are system pipelines (which simply process data sets as they arrive and run indefinitely) and user pipelines (finite or temporary).
Daily, the system must process at least around 2700 files (~0.4 TB), see the table above.
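A quick back-of-the-envelope check of that daily requirement (my own arithmetic, not from the slides):

    # Rough daily throughput implied by "2700 files (~0.4 TB) per day".
    files_per_day = 2700
    data_per_day_gb = 400                       # ~0.4 TB
    print(files_per_day / 24.0)                 # ~112 files per hour
    print(data_per_day_gb / 24.0)               # ~17 GB per hour
    print(data_per_day_gb * 1024.0 / 86400)     # ~4.7 MB/s sustained input rate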

Slide 6: Modular architecture (architecture diagram).

Slide 7: Choice of software stack
Use ready-made solutions; do not write the same code twice.
Cluster resources from a cloud provider (pay-free): OpenStack (fast development, a de-facto cloud standard, open and easy Python API); StarCluster is a similar project.
Workflow system: no candidate found to be quite appropriate: Apache Storm (again Java), Celery (not flexible enough for our task), Heroku (?). So we write our own.
Job scheduling subsystem: any PBS-like scheduler (ready to use and fail-proof): HTCondor, Torque, SLURM.
Programming language: something script-like yet flexible enough for the system core: Python (yes, not Ruby, though Ruby possibly has many good libraries).
Database: MySQL (easy and common, supports complex SQL queries).
Message service: RabbitMQ (would be very nice... sometimes); see the sketch below.
Data storage and file I/O: plain NFS (later anything else for performance).
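To make the role of the message service concrete, here is a minimal sketch of how a central node might publish a task for a worker, assuming RabbitMQ with the standard pika client; the queue name, message fields and file path are illustrative, not the system's actual ones:

    # Publish a processing task to RabbitMQ (illustrative queue and message names).
    import json
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="viirs_tasks", durable=True)

    task = {
        "taskname": "terrain_correction",
        "geo": "/data/VIIRS/in/GDNBO_example.h5",   # hypothetical input file
        "output": "./out",
    }
    channel.basic_publish(
        exchange="",
        routing_key="viirs_tasks",
        body=json.dumps(task),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
    connection.close()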

Slide 8: Pipeline description language
JSON is used to define a task's input file set and some of its interaction with the system:

    {"pipeline": {"tasks": [
        {"taskname": "terrain_correction",
         "input": {
             "param": {"d": [" | "], "t": [], "e": [], "b": [], "end": ["h5"]},
             "files": {
                 "geo":  {"s": ["GMODO", "GDNBO"]},
                 "geo2": {"s": ["GDTCD"]}
             },
             "optional": ["geo2"]
         },
         "output": "./out",
         "requirements": {
             "retr": 3, "disk_space": "100m", "cpu": 1,
             "memory": 1400, "time": "15", "priority": "3"
         },
         "runline": "bash /data/VIIRS/in/sh/viirs/task2.sh %geo% %output%"
        },
        ...
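A minimal sketch of how such a description might be consumed on the worker side; the interpretation of the "s" (filename substrings) and "end" (extension) fields, and the example file names, are assumptions based on the snippet above rather than documented behaviour:

    # Sketch: read a pipeline description and expand one task's runline.
    import json

    def match_files(available, substrings, extension):
        """Pick files whose name contains one of the substrings and has the extension."""
        return [f for f in available
                if f.endswith("." + extension) and any(s in f for s in substrings)]

    def build_runline(task, files_by_role, output_dir):
        cmd = task["runline"]
        for role, files in files_by_role.items():
            cmd = cmd.replace("%%%s%%" % role, " ".join(files))
        return cmd.replace("%output%", output_dir)

    pipeline = json.load(open("pipeline.json"))          # the description shown above
    task = pipeline["pipeline"]["tasks"][0]

    available = ["GDNBO_npp_d20140702.h5", "SVDNB_npp_d20140702.h5"]  # hypothetical names
    geo = match_files(available,
                      task["input"]["files"]["geo"]["s"],
                      task["input"]["param"]["end"][0])
    print(build_runline(task, {"geo": geo}, task["output"]))
    # -> bash /data/VIIRS/in/sh/viirs/task2.sh GDNBO_npp_d20140702.h5 ./out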

Slide 9: Test run. Memory load (percent): total available and requested (chart).

Slide 10: Test run. CPU load (percent): total available and requested (chart).
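The per-node utilisation data behind these two charts could be collected with a small monitor like the following sketch; psutil, the sampling interval and the output format are assumptions, not what the authors actually used:

    # Minimal node monitor: samples total CPU and memory utilisation once a minute.
    import time
    import psutil

    def sample():
        return {
            "ts": time.time(),
            "cpu_percent": psutil.cpu_percent(interval=1),   # averaged over 1 s
            "mem_percent": psutil.virtual_memory().percent,  # used RAM, percent
        }

    if __name__ == "__main__":
        while True:
            s = sample()
            print("%(ts).0f cpu=%(cpu_percent).1f%% mem=%(mem_percent).1f%%" % s)
            time.sleep(60)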

Slide 11: Conclusion and summary
Cons:
1) We found out that we need a feature like «pilot» jobs (at least for catching user, admin, developer and configuration bugs).
2) There is a lack of feedback from job failures and from the binaries/scripts/pipelines/nodes configuration.
3) A more powerful network FS is possibly needed (wall-clock time of identical jobs varies a lot), as is more testing.
4) The system modules (database, messaging service, statistics, etc.) need more mature development.

Slide 12: Conclusion and summary
Pros:
1) The system can function automatically :)
2) The user can provide a flexible pipeline JSON description that matches sets of files quite cleverly (i.e. any and|or|between logic over file and task parameters is available).
3) It can be used for problems in other branches of science, or simply for other tasks and models.
4) The system uses job retries and can overcome some failures in its components (worker-node or central-node failures, bugs, etc.); see the sketch below.
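Point 4's retry behaviour can be illustrated with a minimal sketch; the retries value mirrors the "retr": 3 field from slide 8, while the wrapper itself is illustrative rather than the system's actual code:

    # Run a solver command, retrying a few times on failure (cf. "retr": 3 on slide 8).
    import subprocess
    import time

    def run_with_retries(cmd, retries=3, delay=30):
        for attempt in range(1, retries + 1):
            result = subprocess.run(cmd, shell=True)
            if result.returncode == 0:
                return True
            print("attempt %d/%d failed (exit %d)" % (attempt, retries, result.returncode))
            time.sleep(delay)
        return False

    # Example with the runline from slide 8 (paths taken from that example):
    ok = run_with_retries("bash /data/VIIRS/in/sh/viirs/task2.sh geo.h5 ./out")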

Slide 13: Thank you for your attention! Questions?

Slide 14: Problem overview. Algorithm accuracy: ~400 m, for spots of 0.5-1000 m² with temperatures of ... K.