1
Data Pipelines: Real Life, Fully Automated, Fault-Tolerant Data Movement and Processing
George Kola
Computer Sciences Department, University of Wisconsin-Madison
kola@cs.wisc.edu
http://www.cs.wisc.edu/condor
2
Outline
› What users want?
› Data pipeline overview
› Real-life data pipelines: NCSA and WCER
› Conclusions
3
What users want?
› Make data available at different sites
› Process data and make results available at different sites
› Use distributed computing resources for processing
› Full automation and fault tolerance
4
What users want?
› Can we press a button and expect it to complete?
› Can we stop worrying about failures?
› Can we get acceptable throughput?
› Yes… the data pipeline is the solution!
5
Data Pipeline Overview
› Fully automated framework for data movement and processing
› Fault tolerant & resilient to failures: understands failures and handles them
› Self-tuning
› Rich statistics
› Dynamic visualization of system state
6
Data Pipeline Design
› View data placement and computation as full-fledged jobs
› Data placement handled by Stork
› Computation handled by Condor/Condor-G
› Dependencies between jobs handled by DAGMan (see the sketch below)
› Tunable statistics generation/collection tool
› Visualization handled by DEVise
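The deck does not show a concrete DAG, so here is a minimal sketch of the design: a generator that wires a Stork stage-in, a Condor processing job, and a Stork stage-out for each image. File names are hypothetical, and it assumes the Stork-era DAGMan extension that adds a DATA keyword for data-placement jobs.

```python
# Sketch: generate a DAGMan input file that interleaves Stork data-placement
# jobs with Condor processing jobs. File names are hypothetical; the DATA
# keyword for Stork jobs is assumed from the Stork-extended DAGMan of this era.

def write_dag(images, dag_path="pipeline.dag"):
    with open(dag_path, "w") as dag:
        for i, img in enumerate(images):
            dag.write(f"# pipeline stage for {img}\n")
            # Stage-in and stage-out are data-placement jobs run by Stork.
            dag.write(f"DATA stage_in_{i} stage_in_{i}.stork\n")
            dag.write(f"JOB process_{i} process_{i}.submit\n")  # Condor job
            dag.write(f"DATA stage_out_{i} stage_out_{i}.stork\n")
            # Dependencies: transfer -> compute -> transfer.
            dag.write(f"PARENT stage_in_{i} CHILD process_{i}\n")
            dag.write(f"PARENT process_{i} CHILD stage_out_{i}\n")
            # DAGMan-level fault tolerance: re-run a failed node up to 3 times.
            dag.write(f"RETRY process_{i} 3\n")

write_dag(["plate_0001.fits", "plate_0002.fits"])
```

DAGMan then executes the nodes in dependency order, retrying failed nodes up to the RETRY limit.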
7
Fault Tolerance
› Failure makes automation difficult
› A variety of failures happen in real life: network, software, hardware
› System designed with failure in mind
› Hierarchical fault tolerance: Stork/Condor, DAGMan
› Understands failures: Stork switches protocols (toy illustration below)
› Persistent logging: recovers from machine crashes
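Stork's protocol switching happens inside Stork itself; the sketch below is only a toy illustration of the idea, with a hypothetical DiskRouter client command standing in alongside the real globus-url-copy.

```python
# Toy illustration of the protocol-switching idea: try the preferred transfer
# method first, fall back to alternatives on failure, and retry. This is a
# stand-in for what Stork does internally, not its actual implementation.
import subprocess

# Ordered preference list; "diskrouter-transfer" is a hypothetical client name.
PROTOCOLS = [
    ["diskrouter-transfer"],  # hypothetical DiskRouter client command
    ["globus-url-copy"],      # GridFTP transfer tool
]

def transfer(src, dest, retries=3):
    for attempt in range(retries):
        for cmd in PROTOCOLS:
            try:
                subprocess.run(cmd + [src, dest], check=True, timeout=3600)
                return True
            except (subprocess.CalledProcessError, subprocess.TimeoutExpired,
                    FileNotFoundError):
                continue  # switch to the next protocol, or retry the list
    return False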
8
Self-Tuning
› Users are domain experts, not necessarily computer experts
› Data movement tuned using storage-system characteristics and dynamic network characteristics (see the buffer-sizing sketch below)
› Computation scheduled on data availability
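One concrete self-tuning computation is sizing TCP buffers to the bandwidth-delay product. The formula is standard; the numbers below are illustrative, not measurements from these pipelines.

```python
# Minimal sketch of one self-tuning calculation: size the TCP buffer to the
# bandwidth-delay product (BDP) and split it across parallel streams.
# The measured values below are illustrative examples.

def tcp_buffer_bytes(bandwidth_mbps, rtt_ms):
    """BDP = bandwidth (bytes/s) * round-trip time (s)."""
    return int(bandwidth_mbps * 1e6 / 8 * rtt_ms / 1e3)

bdp = tcp_buffer_bytes(bandwidth_mbps=622, rtt_ms=30)  # e.g. an OC-12 path
streams = 4                                            # parallel TCP streams
per_stream = bdp // streams
print(f"BDP: {bdp} bytes; per-stream buffer: {per_stream} bytes")
```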
9
Statistics/Visualization
› Network statistics
› Job run times, data transfer times
› Tunable statistics collection
› Statistics entered into a Postgres database (example query below)
› Interesting facts can be derived from the data
› Dynamic system visualization using DEVise
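The slides do not give the database schema; as an illustration, assuming a hypothetical transfers table, per-protocol throughput could be derived like this:

```python
# Hedged sketch: query transfer statistics from Postgres to derive facts such
# as average throughput per protocol. The schema (table and column names) is
# hypothetical; the slides do not specify it.
import psycopg2

conn = psycopg2.connect(dbname="pipeline_stats")  # connection details assumed
cur = conn.cursor()
cur.execute("""
    SELECT protocol,
           AVG(bytes / EXTRACT(EPOCH FROM (end_time - start_time))) AS avg_bps
    FROM transfers
    WHERE succeeded
    GROUP BY protocol
""")
for protocol, avg_bps in cur.fetchall():
    print(f"{protocol}: {avg_bps / 1e6:.1f} MB/s average")
cur.close()
conn.close()
```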
10
Real-Life Data Pipelines
› Astronomy data processing pipeline: ~3 TB (2611 x 1.1 GB files); joint work with Robert Brunner, Michael Remijan et al. at NCSA
› WCER educational video pipeline: ~6 TB (13 GB files); joint work with Chris Thorn et al. at WCER
11
DPOSS Data
› The Palomar Digital Sky Survey (DPOSS)
› Palomar-Oschin photographic plates used to map one half of the celestial sphere
› Each photographic plate digitized into a single image
› Calibration done by a software pipeline at Caltech
› Want to run SExtractor on the images
12
www.cs.wisc.edu/condor 12 Unitree @NCSA Input Data flow Output Data flow Processing Staging Node @UW Condor Pool @UW Condor Pool @Starlight Staging Node @NCSA NCSA Pipeline
13
NCSA Pipeline
› Moved & processed 3 TB of DPOSS image data in under 6 days: the most powerful astronomy data processing facility!
› Adapting for other (petabyte-scale) datasets: Quest2, CARMA, NOAO, NRAO, LSST
› Key component in the future astronomy cyberinfrastructure
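For scale: 2611 files x 1.1 GB is roughly 2.9 TB, so finishing in under 6 days (about 518,400 s) implies a sustained end-to-end rate of roughly 5.6 MB/s, covering staging in, processing, and staging out.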
14
WCER Pipeline
› Need to convert DV videos to MPEG-1, MPEG-2, and MPEG-4
› Each 1-hour video is 13 GB
› Videos accessible through the 'Transana' software
› Need to stage the original and processed videos to SDSC
15
WCER Pipeline
› First attempt at such large-scale distributed video processing
› Decoder problems with large 13 GB files
› Uses bleeding-edge technology

Encoding   Resolution         File Size   Average Time
MPEG-1     Half (320 x 240)   600 MB      2 hours
MPEG-2     Full (720 x 480)   2 GB        8 hours
MPEG-4     Half (320 x 480)   250 MB      4 hours
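For scale: each hour of video takes roughly 2 + 8 + 4 = 14 CPU-hours across the three encodings, and ~6 TB at 13 GB per video is about 460 videos, i.e. on the order of 6,400 CPU-hours (roughly nine CPU-months) of transcoding, hence the need for a distributed Condor pool.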
16
www.cs.wisc.edu/condor 16 SRB Server @SDSC Staging Node @UW Condor Pool @UW Input Data flow Output Data flow Processing WCER Pipeline WCER
17
Conclusion
› Large-scale data movement and processing can be fully automated!
› Successfully processed terabytes of data
› Data pipelines are useful for diverse fields
› We have shown two working case studies, in astronomy and educational research
› We are working with our collaborators to make this production-quality
[Diagram: end-to-end NCSA pipeline: a user submits a DAG at the management site @UW; numbered steps span the UniTree server @NCSA (MSScmd), staging sites @NCSA and @UW, DiskRouter and globus-url-copy transfers, the Condor file transfer mechanism, and the Condor pools @UW and @StarLight, with control flow, input data flow, output data flow, and processing marked.]
18
Questions
› Thanks for listening
› Contact information
George Kola: kola@cs.wisc.edu
Tevfik Kosar: kosart@cs.wisc.edu
Office: 3361 Computer Science
› Collaborators
NCSA: Robert Brunner (rb@astro.uiuc.edu)
NCSA: Michael Remijan (remijan@ncsa.uiuc.edu)
WCER: Chris Thorn (cathorn@wisc.edu)