Presentation on theme: "High-throughput parallel pipelined data processing system for remote Earth sensing big data in the clouds" — Presentation transcript:

1 High-throughput parallel pipelined data processing system for remote Earth sensing big data in the clouds (Russian title: Высокопроизводительная параллельно-конвейерная система обработки больших данных дистанционного зондирования Земли в облаках) Novikov A.M., Aulov V.A., Poyda A.A., NRC "Kurchatov Institute", GRID 2014, 2/07/2014, Dubna

2 Problem overview Satellites: Terra, Aqua (launched 1999, 2002). Sensor: MODIS, 36 spectral bands, covers the entire Earth's surface at >=1 km² resolution. Data: delivered within 1-2 days, many FTP sites, 3.9 TB for 10 years (atmosphere, land).

3 Problem overview Satellite: Suomi NPP (launched 2011). Sensor: VIIRS (only VIIRS is used), ~20 spectral bands, covers the entire Earth's surface. Data: twice per day, 85 days kept on the online FTP site, several TB daily, images of 86400 x 33601 px (0.65 km²?).

4 Pipeline Data flow (diagram): DNB geo files → Terrain_correction → terrain-corrected geo files (1:1); SVM geo files → Terrain_correction → SVM TerrainCorrected geo files (1:1); terrain-corrected geo files + DNB data files + SVM10 data files + SVM06-16 data files (10 optional) → Reprojection (1:1) → Mosaic_stitching (N:1) → Fires detection (1 optional, 1:1). Notes: 1. Simple chains with aggregation. 2. Nested chains are supported up to 2 levels (more if output files are inputs for a higher-level chain). 3. A new version of a data file or optional file restarts (or repeats) the matching chain instance.
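The chain structure above can be sketched as a small dependency graph in Python (the task and file-type names follow the slide; the representation and the `run_order` helper are a hypothetical illustration, not the system's actual code):

```python
# Hypothetical sketch of the VIIRS pipeline as a dependency graph;
# cardinality "N:1" marks the aggregating stage (many granules -> one
# mosaic), "1:1" marks per-granule stages.
PIPELINE = {
    "terrain_correction": {
        "inputs": ["DNB geo files", "SVM geo files"],
        "outputs": ["TerrainCorrected geo files"],
        "cardinality": "1:1",
    },
    "reprojection": {
        "inputs": ["TerrainCorrected geo files", "DNB data files",
                   "SVM10 data files", "SVM06-16 data files"],
        "outputs": ["reprojected granules"],
        "cardinality": "1:1",
    },
    "mosaic_stitching": {
        "inputs": ["reprojected granules"],
        "outputs": ["mosaic"],
        "cardinality": "N:1",
    },
    "fires_detection": {
        "inputs": ["mosaic"],
        "outputs": ["fire spots"],
        "cardinality": "1:1",
    },
}

def run_order(graph):
    """Order tasks so that each runs only after its inputs exist."""
    all_outputs = {o for t in graph.values() for o in t["outputs"]}
    produced, order, pending = set(), [], dict(graph)
    while pending:
        for name, task in list(pending.items()):
            # ready if every input is external or already produced
            if all(i in produced or i not in all_outputs
                   for i in task["inputs"]):
                order.append(name)
                produced.update(task["outputs"])
                del pending[name]
                break
        else:
            raise ValueError("cycle in pipeline description")
    return order

print(run_order(PIPELINE))
# -> ['terrain_correction', 'reprojection', 'mosaic_stitching', 'fires_detection']
```

Representing stages by the file types they consume and produce is what makes note 2 (nested chains via outputs feeding a higher-level chain) fall out naturally.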

5 Pipeline Data volumes per stage:

             DNBg    DNBd    SVMg    SVMd    Tcg     Repr    Mos      Det
  in (all)   124Gb   15Gb    78Gb    128Gb   124Gb   93Gb    0.6*N    299Gb
  out (all)  -       -       -       -       78Gb    254Gb   ?<1*N    0.5Gb
  number     254     254     0       254     -       -       2<N<254  -

Solvers (the "binaries") are: Terrain_correction & Reprojection — Java classes, single-core, memory use 1500M and 2500M+; Detection and Mosaic_stitching — Matlab (MRE12b), single-core, memory use ~3000M and ? (depends on window size). The system supports system pipelines (which must simply process data sets as they arrive — infinite) and user pipelines (finite or temporary). Daily the system must process at least around 2700 files (~0.4 TB) — see table.

6 Modular architecture

7 Choice of software stack Use ready solutions, don't write code twice.
- Cluster resources — cloud provider (free): OpenStack (fast development, de-facto cloud standard, open and easy Python API); StarCluster is a similar project.
- Workflow system candidates — didn't find any fully appropriate one: Storm by Apache (Java again), Celery (not flexible enough for our task), Heroku (?). So we write our own.
- Job scheduling subsystem — any PBS-like system (ready to use and fail-proof): HTCondor, Torque, SLURM.
- Programming language — something bash-like yet flexible enough for the system core: Python (yes, not Ruby, though Ruby possibly has many good libraries).
- Database — MySQL (quite easy and common, supports complex SQL queries).
- Message service — RabbitMQ (would be very nice... sometimes).
- Data storage and file I/O — just NFS (later anything else, for performance).

8 Pipeline description language JSON is used to define a task's input file set and some interaction:

{"pipeline": {"tasks": [
  {"taskname": "terrain_correction",
   "input": {
     "param": {"d": ["20130201|20130507"], "t": [], "e": [], "b": [], "end": ["h5"]},
     "files": {"geo": {"s": ["GMODO", "GDNBO"]}, "geo2": {"s": ["GDTCD"]}},
     "optional": ["geo2"]
   },
   "output": "./out",
   "requirements": {"retr": 3, "disk_space": "100m", "cpu": 1, "memory": 1400, "time": "15", "priority": "3"},
   "runline": "bash /data/VIIRS/in/sh/viirs/task2.sh %geo% %output%"
  }
]}}
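As an illustration, a minimal matcher for such a description might treat `d` as an inclusive date range and `s` as a set of file-name prefixes (the `match` function and the granule-name pattern are assumptions for the sketch, not the system's actual logic):

```python
import json
import re

# Hypothetical matcher for the JSON task description above: "d" is read
# as an inclusive date range "YYYYMMDD|YYYYMMDD", "s" lists acceptable
# file-name prefixes, and "end" lists acceptable extensions.
TASK = json.loads("""
{"taskname": "terrain_correction",
 "input": {"param": {"d": ["20130201|20130507"], "end": ["h5"]},
           "files": {"geo": {"s": ["GMODO", "GDNBO"]}}}}
""")

def match(filename, task):
    """Return True if a granule file name satisfies the task's input spec."""
    param = task["input"]["param"]
    lo, hi = param["d"][0].split("|")
    # Assumed VIIRS-like naming: <prefix>_npp_d<YYYYMMDD>_..., e.g.
    # GMODO_npp_d20130315_t1230_e1236_b07154.h5 (illustrative only).
    m = re.match(r"(\w+?)_npp_d(\d{8})_.*\.(\w+)$", filename)
    if not m:
        return False
    prefix, date, ext = m.groups()
    prefixes = {s for f in task["input"]["files"].values() for s in f["s"]}
    return prefix in prefixes and lo <= date <= hi and ext in param["end"]

print(match("GMODO_npp_d20130315_t1230_e1236_b07154.h5", TASK))  # True
print(match("SVM10_npp_d20130315_t1230_e1236_b07154.h5", TASK))  # False
```

Extending this to the slide's and|or|between logic would mean evaluating each `param` key (`t`, `e`, `b`, ...) the same way and AND-ing the results.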

9 Test run Memory load (percent): total, available and requested

10 Test run CPU load (percent): total, available and requested

11 Conclusion and summary Cons: 1) We found we need a feature like «pilot» jobs (at least for catching user, admin, developer and config bugs). 2) Lack of feedback from job failures and from binaries/scripts/pipelines/node configuration. 3) A more powerful network FS is possibly needed (wall time of identical jobs differs a lot), as is more testing. 4) More mature development of the system modules is needed (database, messaging service, statistics, etc.).

12 Conclusion and summary Pros: 1) The system can function automatically :) 2) The user can provide a flexible JSON pipeline description that matches sets of files quite cleverly (i.e. any possible and|or|between logic over file and task parameters is available). 3) It can be used for problems in other branches of science, or just for other tasks and models. 4) The system uses job retries and can overcome some failures of its components (worker-node or central-node failures, bugs, etc.).
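The retry behaviour in point 4 can be sketched as follows, assuming the `retr` field of a task's requirements (slide 8) bounds the number of attempts; `with_retries` and `run_job` are hypothetical names, not the system's actual API:

```python
import time

# Minimal sketch of the job-retry behaviour described above: run_job
# stands in for submitting the real solver to the batch system, and a
# failed attempt (lost worker node, timeout, ...) is retried up to
# "retr" times before the chain instance is marked failed.
def with_retries(run_job, retr=3, delay=0.0):
    last_exc = None
    for attempt in range(1, retr + 1):
        try:
            return run_job(attempt)
        except Exception as exc:   # lost worker node, timeout, etc.
            last_exc = exc
            time.sleep(delay)      # back off before resubmitting
    raise RuntimeError("job failed after %d attempts" % retr) from last_exc

# Demo: a job that fails once, then succeeds on the second attempt.
def flaky(attempt):
    if attempt < 2:
        raise IOError("worker node lost")
    return "ok"

print(with_retries(flaky, retr=3))  # -> ok
```

Raising only after the attempt budget is exhausted is what lets a central scheduler distinguish transient node failures from genuine binary or config bugs.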

13 Thank you for your attention! Questions?

14 Problem overview Algorithm accuracy: ~400 m, for spots of 0.5-1000 m² with temperatures of 400-3000 K.

