1
Data Pipelines: Real Life, Fully Automated, Fault-Tolerant Data Movement and Processing
George Kola
Computer Sciences Department, University of Wisconsin-Madison
kola@cs.wisc.edu
http://www.cs.wisc.edu/condor
2
Outline
› What users want?
› Data pipeline overview
› Real-life data pipelines: NCSA and WCER
› Conclusions
3
What users want?
› Make data available at different sites
› Process data and make results available at different sites
› Use distributed computing resources for processing
› Full automation and fault tolerance
4
What users want?
› Can we press a button and expect it to complete?
› Can we stop worrying about failures?
› Can we get acceptable throughput?
› Yes… the data pipeline is the solution!
5
Data Pipeline Overview
› Fully automated framework for data movement and processing
› Fault tolerant & resilient to failures: understands failures and handles them
› Self-tuning
› Rich statistics
› Dynamic visualization of system state
6
Data Pipeline Design
› View data placement and computation as full-fledged jobs
› Data placement handled by Stork
› Computation handled by Condor/Condor-G
› Dependencies between jobs handled by DAGMan (see the sketch below)
› Tunable statistics generation/collection tool
› Visualization handled by DEVise
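The deck does not show a concrete DAG, so here is a minimal sketch of the design: a generator that wires a Stork stage-in, a Condor processing job, and a Stork stage-out for each image. File names are hypothetical, and it assumes the Stork-era DAGMan extension that adds a DATA keyword for data-placement jobs.

```python
# Sketch: generate a DAGMan input file that interleaves Stork data-placement
# jobs with Condor processing jobs. File names are hypothetical; the DATA
# keyword for Stork jobs is assumed from the Stork-extended DAGMan of this era.

def write_dag(images, dag_path="pipeline.dag"):
    with open(dag_path, "w") as dag:
        for i, img in enumerate(images):
            dag.write(f"# pipeline stage for {img}\n")
            # Stage-in and stage-out are data-placement jobs run by Stork.
            dag.write(f"DATA stage_in_{i} stage_in_{i}.stork\n")
            dag.write(f"JOB process_{i} process_{i}.submit\n")  # Condor job
            dag.write(f"DATA stage_out_{i} stage_out_{i}.stork\n")
            # Dependencies: transfer -> compute -> transfer.
            dag.write(f"PARENT stage_in_{i} CHILD process_{i}\n")
            dag.write(f"PARENT process_{i} CHILD stage_out_{i}\n")
            # DAGMan-level fault tolerance: re-run a failed node up to 3 times.
            dag.write(f"RETRY process_{i} 3\n")

write_dag(["plate_0001.fits", "plate_0002.fits"])
```

DAGMan then executes the nodes in dependency order, retrying failed nodes up to the RETRY limit.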
7
Fault Tolerance
› Failure makes automation difficult
› A variety of failures happen in real life: network, software, hardware
› System designed with failure in mind
› Hierarchical fault tolerance: Stork/Condor, DAGMan
› Understands failures: Stork switches protocols (toy illustration below)
› Persistent logging: recovers from machine crashes
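Stork's protocol switching happens inside Stork itself; the sketch below is only a toy illustration of the idea, with a hypothetical DiskRouter client command standing in alongside the real globus-url-copy.

```python
# Toy illustration of the protocol-switching idea: try the preferred transfer
# method first, fall back to alternatives on failure, and retry. This is a
# stand-in for what Stork does internally, not its actual implementation.
import subprocess

# Ordered preference list; "diskrouter-transfer" is a hypothetical client name.
PROTOCOLS = [
    ["diskrouter-transfer"],  # hypothetical DiskRouter client command
    ["globus-url-copy"],      # GridFTP transfer tool
]

def transfer(src, dest, retries=3):
    for attempt in range(retries):
        for cmd in PROTOCOLS:
            try:
                subprocess.run(cmd + [src, dest], check=True, timeout=3600)
                return True
            except (subprocess.CalledProcessError, subprocess.TimeoutExpired,
                    FileNotFoundError):
                continue  # switch to the next protocol, or retry the list
    return False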
8
Self-Tuning
› Users are domain experts, not necessarily computer experts
› Data movement tuned using storage-system characteristics and dynamic network characteristics (see the buffer-sizing sketch below)
› Computation scheduled on data availability
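One concrete self-tuning computation is sizing TCP buffers to the bandwidth-delay product. The formula is standard; the numbers below are illustrative, not measurements from these pipelines.

```python
# Minimal sketch of one self-tuning calculation: size the TCP buffer to the
# bandwidth-delay product (BDP) and split it across parallel streams.
# The measured values below are illustrative examples.

def tcp_buffer_bytes(bandwidth_mbps, rtt_ms):
    """BDP = bandwidth (bytes/s) * round-trip time (s)."""
    return int(bandwidth_mbps * 1e6 / 8 * rtt_ms / 1e3)

bdp = tcp_buffer_bytes(bandwidth_mbps=622, rtt_ms=30)  # e.g. an OC-12 path
streams = 4                                            # parallel TCP streams
per_stream = bdp // streams
print(f"BDP: {bdp} bytes; per-stream buffer: {per_stream} bytes")
```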
9
Statistics/Visualization
› Network statistics
› Job run times, data transfer times
› Tunable statistics collection
› Statistics entered into a Postgres database (example query below)
› Interesting facts can be derived from the data
› Dynamic system visualization using DEVise
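The slides do not give the database schema; as an illustration, assuming a hypothetical transfers table, per-protocol throughput could be derived like this:

```python
# Hedged sketch: query transfer statistics from Postgres to derive facts such
# as average throughput per protocol. The schema (table and column names) is
# hypothetical; the slides do not specify it.
import psycopg2

conn = psycopg2.connect(dbname="pipeline_stats")  # connection details assumed
cur = conn.cursor()
cur.execute("""
    SELECT protocol,
           AVG(bytes / EXTRACT(EPOCH FROM (end_time - start_time))) AS avg_bps
    FROM transfers
    WHERE succeeded
    GROUP BY protocol
""")
for protocol, avg_bps in cur.fetchall():
    print(f"{protocol}: {avg_bps / 1e6:.1f} MB/s average")
cur.close()
conn.close()
```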
10
Real-Life Data Pipelines
› Astronomy data processing pipeline: ~3 TB (2611 x 1.1 GB files); joint work with Robert Brunner, Michael Remijan et al. at NCSA
› WCER educational video pipeline: ~6 TB (13 GB files); joint work with Chris Thorn et al. at WCER
11
DPOSS Data
› The Palomar Digital Sky Survey (DPOSS)
› Palomar-Oschin photographic plates used to map one half of the celestial sphere
› Each photographic plate digitized into a single image
› Calibration done by a software pipeline at Caltech
› Want to run SExtractor on the images
12
www.cs.wisc.edu/condor 12 Unitree @NCSA Input Data flow Output Data flow Processing Staging Node @UW Condor Pool @UW Condor Pool @Starlight Staging Node @NCSA NCSA Pipeline
13
NCSA Pipeline
› Moved & processed 3 TB of DPOSS image data in under 6 days: the most powerful astronomy data processing facility!
› Adapting for other (petabyte-scale) datasets: Quest2, CARMA, NOAO, NRAO, LSST
› Key component in the future astronomy cyberinfrastructure
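For scale: 2611 files x 1.1 GB is roughly 2.9 TB, so finishing in under 6 days (about 518,400 s) implies a sustained end-to-end rate of roughly 5.6 MB/s, covering staging in, processing, and staging out.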
14
WCER Pipeline
› Need to convert DV videos to MPEG-1, MPEG-2, and MPEG-4
› Each 1-hour video is 13 GB
› Videos accessible through the 'Transana' software
› Need to stage the original and processed videos to SDSC
15
WCER Pipeline
› First attempt at such large-scale distributed video processing
› Decoder problems with large 13 GB files
› Uses bleeding-edge technology

Encoding   Resolution         File Size   Average Time
MPEG-1     Half (320 x 240)   600 MB      2 hours
MPEG-2     Full (720 x 480)   2 GB        8 hours
MPEG-4     Half (320 x 480)   250 MB      4 hours
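For scale: each hour of video takes roughly 2 + 8 + 4 = 14 CPU-hours across the three encodings, and ~6 TB at 13 GB per video is about 460 videos, i.e. on the order of 6,400 CPU-hours (roughly nine CPU-months) of transcoding, hence the need for a distributed Condor pool.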
16
www.cs.wisc.edu/condor 16 SRB Server @SDSC Staging Node @UW Condor Pool @UW Input Data flow Output Data flow Processing WCER Pipeline WCER
17
Conclusion
› Large-scale data movement and processing can be fully automated!
› Successfully processed terabytes of data
› Data pipelines are useful for diverse fields
› We have shown two working case studies, in astronomy and educational research
› We are working with our collaborators to make this production-quality
[Diagram: end-to-end NCSA pipeline: a user submits a DAG at the management site @UW; numbered steps span the UniTree server @NCSA (MSScmd), staging sites @NCSA and @UW, DiskRouter and globus-url-copy transfers, the Condor file transfer mechanism, and the Condor pools @UW and @StarLight, with control flow, input data flow, output data flow, and processing marked.]
18
Questions
› Thanks for listening
› Contact information
George Kola: kola@cs.wisc.edu
Tevfik Kosar: kosart@cs.wisc.edu
Office: 3361 Computer Science
› Collaborators
NCSA: Robert Brunner (rb@astro.uiuc.edu)
NCSA: Michael Remijan (remijan@ncsa.uiuc.edu)
WCER: Chris Thorn (cathorn@wisc.edu)