1
The GriPhyN Virtual Data System
Ian Foster, for the VDS team
2
Science as “Workflow”: E.g., Galaxy Cluster Search
Jim Annis, Steve Kent, Vijay Sehkri (Fermilab); Michael Milligan, Yong Zhao (University of Chicago)
[Figure: Sloan Data feeding a galaxy cluster search DAG, yielding a galaxy cluster size distribution]
3
Requirements
- Express complex multi-step “workflows”
  - Perhaps 100,000s of individual tasks
- Operate on heterogeneous distributed data
  - Different formats & access protocols
- Harness many computing resources
  - Parallel computers &/or distributed Grids
- Execute workflows reliably
  - Despite diverse failure conditions
- Enable reuse of data & workflows
  - Discovery & composition
- Support many users, workflows, resources
  - Policy specification & enforcement
4
Virtual Data System
- Express complex multi-step “workflows”  [VDL, XDTM]
  - Perhaps 100,000s of individual tasks
- Operate on heterogeneous distributed data  [VDL, XDTM]
  - Different formats & access protocols
- Harness many computing resources  [Pegasus, DAGman, Globus]
  - Parallel computers &/or distributed Grids
- Execute workflows reliably & efficiently  [Pegasus, DAGman, Globus]
  - Despite diverse failure conditions
- Enable reuse of data & workflows  [VDC]
  - Discovery & composition
- Support many users, workflows, resources  [TBD]
  - Policy specification & enforcement
5
Virtual Data System
[Architecture diagram: a VDL program goes through the Virtual Data Workflow Generator and Virtual Data Catalog to produce a workflow spec / abstract workflow; the Create Execution Plan step (local planner) yields either a statically partitioned DAG run under DAGman, or a dynamically planned DAG (with job planner and job cleanup) run under DAGman & Condor-G for Grid workflow execution]
6
Genome Analysis & DB Update (GADU): 600-1000+ CPUs
7
The Rest of the Talk
- Express complex multi-step “workflows”  [VDL, XDTM]
  - Perhaps 100,000s of individual tasks
- Operate on heterogeneous distributed data  [VDL, XDTM]
  - Different formats & access protocols
- Harness many computing resources  [Pegasus, DAGman, Globus]
  - Parallel computers &/or distributed Grids
- Execute workflows reliably & efficiently  [Pegasus, DAGman, Globus]
  - Despite diverse failure conditions
- Enable reuse of data & workflows  [VDC]
  - Discovery & composition
- Support many users, workflows, resources  [TBD]
  - Policy specification & enforcement
[Ewa]
8
“Messy” Scientific Data
- Diverse storage formats & access protocols
  - A logically identical dataset can be stored as a text file (e.g. CSV), a binary file, or a spreadsheet
  - Data available from filesystem, database, HTTP, WebDAV, etc.
- Metadata encoded in directory & file names
  - E.g.: “an fMRI volume is composed of an image file & a header file with the same prefix”
- Format dependency hinders program and workflow reuse
9
But... Data Is Often Logically Structured
- Scientific data often have a hierarchical structure
- A common practice is to select a set of data items and apply a transformation to each individual item
- Nesting such iterations can scale up to millions of objects
10
Introducing a Typing System
- Describe logical data structures as types...
  - ...and physical representations as mappings
- Define procedures in terms of typed datasets
  - ...and apply those procedures to different physical data
- Compose workflows from typed procedures
- Benefits
  - Type checking
  - Dataset selection and iteration
  - Discovery by types
  - Dynamic binding
  - Type conversion
11
XDTM (Moreau, Zhao, Wilde, Foster)
- XML Dataset Typing and Mapping
- Separates logical structure from physical representation
- Logical structure described by XML Schema
  - Primitive scalar types: int, float, string, date, ...
  - Complex types (structs and arrays)
- Mapping descriptor
  - How logical elements map to physical storage
  - External parameters (e.g. location)
- XPath for dataset selection (see the sketch below)
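To make the XPath selection concrete: once a run is exposed as a logical XML view, picking out volumes is a query over that view. The sketch below is illustrative only; it assumes Python with lxml, and the element names and layout are invented for the example, not taken from XDTM.

from lxml import etree

# Hypothetical logical view of an fMRI run: one <volume> element per volume.
# Element and attribute names are illustrative, not defined by XDTM.
logical_view = etree.XML("""
<run id="bold1">
  <volume index="001"/>
  <volume index="002"/>
  <volume index="003"/>
</run>
""")

# XPath selects a subset of the dataset; a mapper would then resolve each
# selected logical element to its physical files (e.g. .img/.hdr pairs).
for vol in logical_view.xpath("/run/volume[@index >= 2]"):
    print(vol.get("index"))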
12
Mapping
- Define a common mapping interface (sketched below)
  - initialize, read, create, write, close
- Data providers implement the interface
  - Responsible for data-access details
- XView maintains cached logical datasets
[Diagram: VDS accesses a data source through a mapper; with caching, VDS goes through the XViewMgr, which maintains an XView backed by the mapper and data source]
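A minimal sketch of what the common mapping interface might look like, assuming Python; the method names follow the bullet list above, while the class name, signatures, and docstrings are invented for illustration.

from abc import ABC, abstractmethod
from typing import Any


class Mapper(ABC):
    """Hypothetical common mapping interface: each data provider
    (filesystem, database, HTTP, ...) supplies its own implementation."""

    @abstractmethod
    def initialize(self, location: str, params: dict) -> None:
        """Bind the mapper to a physical data source (e.g. a directory or URL)."""

    @abstractmethod
    def read(self, logical_path: str) -> Any:
        """Return the logical dataset element addressed by logical_path."""

    @abstractmethod
    def create(self, logical_path: str) -> Any:
        """Create a new (empty) logical element in the underlying store."""

    @abstractmethod
    def write(self, logical_path: str, value: Any) -> None:
        """Write a logical element back to its physical representation."""

    @abstractmethod
    def close(self) -> None:
        """Release any handles held on the physical data source."""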
13
Use Case: Functional MRI
Logical structure (DBIC Archive):
  Study #1 > Group #1 > Subject #1
    Anatomy: high-res volume
    Functional runs: run #1 (volume #001 ... volume #275), ..., run #5
    snrun #...
  ... Group #5 ... Study #...
Physical representation (DBIC Archive):
  Study_2004.0521.hgd/Group_1/Subject_2004.e024/
    volume_anat.img, volume_anat.hdr
    bold1_001.img, bold1_001.hdr, ..., bold1_275.img, bold1_275.hdr, ...
    bold5_001.img, ...
    snrbold*_*, air*, ...
  ... Group_5 ... Study...
(see the mapper sketch below)
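As an illustration of how a mapper bridges the two views, the sketch below groups the physical .img/.hdr file pairs of one functional run into logical volumes. It assumes Python and the file-naming convention shown in the example paths (bold1_001.img / bold1_001.hdr); the function name and return type are invented for the example.

import glob
import os

def map_run_volumes(subject_dir: str, run: int) -> list:
    """Illustrative filesystem mapper: pair bold<run>_NNN.img with its
    matching .hdr under a subject directory, yielding one (img, hdr)
    tuple per logical Volume."""
    volumes = []
    for img in sorted(glob.glob(os.path.join(subject_dir, f"bold{run}_*.img"))):
        hdr = img[:-4] + ".hdr"           # header shares the file-name prefix
        if os.path.exists(hdr):
            volumes.append((img, hdr))
    return volumes

# e.g. map_run_volumes("Study_2004.0521.hgd/Group_1/Subject_2004.e024", run=1)
# would yield one (img, hdr) pair per volume of the first functional run.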
14
Type Definitions in VDL (part of the fMRI AIRSN spatial-normalization workflow)

type Image {};
type Header {};
type Volume { Image img; Header hdr; }
type Anat Volume;
type Warp {};
type NormAnat { Anat aVol; Warp aWarp; Volume nHires; }
type Run { Volume v[ ]; }
type Subject { Anat anat; Run run[ ]; Run snrun[ ]; }
type Group { Subject s[ ]; }
type Study { Group g[ ]; }
15
Type Definitions in XML Schema
<xs:schema targetNamespace="http://www.fmri.org/schema/airsn.xsd"
           xmlns="http://www.fmri.org/schema/airsn.xsd"
           xmlns:xs="http://www.w3.org/2001/XMLSchema">
16
Procedure Definition in VDL

(Run snr) functional( Run r, NormAnat a, Air shrink ) {
  Run yroRun = reorientRun( r, "y" );
  Run roRun = reorientRun( yroRun, "x" );
  Volume std = roRun[0];
  Run rndr = random_select( roRun, .1 );   // 10% sample
  AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, [81,3,3] );
  Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
  Volume meanRand = softmean( reslicedRndr, "y", null );
  Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, [81,3,3] );
  Volume mnQA = reslice( meanRand, mnQAAir, "o", "k" );
  Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
  Run nr = reslice_warp_run( boldNormWarp, roRun );
  Volume meanAll = strictmean( nr, "y", null );
  Volume boldMask = binarize( meanAll, "y" );
  snr = gsmoothRun( nr, boldMask, 6, 6, 6 );
}
17
Dataset Iteration
- Functional analysis expressed over typed datasets
- Iterate over each volume in a run (see the sketch below)
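The iteration is essentially a typed foreach: apply a per-volume transformation to every element of a Run and collect the results into a new Run, so each application can become an independent task in the generated DAG. Below is a minimal Python sketch of that pattern, with class names echoing the VDL types above; the reorient step is a placeholder, not one of the actual AIRSN procedures.

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Volume:
    img: str   # path to the .img file
    hdr: str   # path to the .hdr file


@dataclass
class Run:
    volumes: list = field(default_factory=list)


def map_run(run: Run, transform: Callable) -> Run:
    """Apply a per-volume transformation to every volume in a run.
    In VDS each application could become an independent Grid task,
    so the iterations may execute in parallel."""
    return Run(volumes=[transform(v) for v in run.volumes])


# Placeholder per-volume step applied to a 3-volume run.
def reorient(v: Volume) -> Volume:
    return Volume(img=v.img.replace(".img", ".reoriented.img"),
                  hdr=v.hdr.replace(".hdr", ".reoriented.hdr"))

run = Run(volumes=[Volume(f"bold1_{i:03d}.img", f"bold1_{i:03d}.hdr") for i in (1, 2, 3)])
reoriented = map_run(run, reorient)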
18
Expanded Execution Plan
- Datasets dynamically instantiated from data sources by mappers
19
Functional MRI Execution
20
Code Size Comparison
Lines of code with different workflow encodings:

Workflow     Script   Generator   VDL
GENATLAS1        49          72     6
GENATLAS2        97         135    10
FILM1            63         134    17
FEAT             84         191    13
AIRSN           215        ~400    37
21
The Rest of the Talk
- Express complex multi-step “workflows”  [VDL, XDTM]
  - Perhaps 100,000s of individual tasks
- Operate on heterogeneous distributed data  [VDL, XDTM]
  - Different formats & access protocols
- Harness many computing resources  [Pegasus, DAGman, Globus]
  - Parallel computers &/or distributed Grids
- Execute workflows reliably & efficiently  [Pegasus, DAGman, Globus]
  - Despite diverse failure conditions
- Enable reuse of data & workflows  [VDC]
  - Discovery & composition
- Support many users, workflows, resources  [TBD]
  - Policy specification & enforcement
22
Virtual Data Schema
23
fMRI Virtual Data Queries

Which transformations can process a “subject image”?
  Q: xsearchvdc -q tr_meta dataType subject_image input
  A: fMRIDC.AIR::align_warp

List anonymized subject-images for young subjects:
  Q: xsearchvdc -q lfn_meta dataType subject_image privacy anonymized subjectType young
  A: 3472-4_anonymized.img

Show files that were derived from patient image 3472-3:
  Q: xsearchvdc -q lfn_tree 3472-3_anonymized.img
  A: 3472-3_anonymized.img, 3472-3_anonymized.sliced.hdr, atlas.hdr, atlas.img, ..., atlas_z.jpg, 3472-3_anonymized.sliced.img
24
Provenance for ATLAS DC2 (High Energy Physics)

How much compute time was delivered? (see the aggregation sketch below)

  CPU-years   Month   Year
       0.45       6   2004
         20       7   2004
         34       8   2004
         40       9   2004
         15      10   2004
         15      11   2004
        8.9      12   2004

Selected statistics for one of these jobs:
  start: 2004-09-30 18:33:56
  duration: 76103.33
  pid: 6123
  exitcode: 0
  args: 8.0.5 JobTransforms-08-00-05-09/share/dc2.g4sim.filter.trf CPE_6785_556... -6 6 2000 4000 8923 dc2_B4_filter_frag.txt
  utime: 75335.86
  stime: 28.88
  minflt: 862341
  majflt: 96386

Other questions the provenance records can answer: Which Linux kernel releases were used? How many jobs were run on a Linux 2.4.28 kernel?
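The monthly figures are simply per-job durations summed and expressed in CPU-years. Below is a sketch of that aggregation, assuming Python and a plain list of (start, duration-in-seconds) records rather than the actual VDS provenance schema; the sample records other than the first are made up for illustration.

from collections import defaultdict
from datetime import datetime

SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Hypothetical provenance records: (start timestamp, duration in seconds),
# mirroring the "start" and "duration" fields in the job statistics above.
jobs = [
    ("2004-09-30 18:33:56", 76103.33),
    ("2004-09-12 02:10:00", 54000.00),
    ("2004-10-03 11:45:12", 43000.00),
]

delivered = defaultdict(float)   # (year, month) -> CPU-seconds
for start, duration in jobs:
    t = datetime.strptime(start, "%Y-%m-%d %H:%M:%S")
    delivered[(t.year, t.month)] += duration

for (year, month), seconds in sorted(delivered.items()):
    print(f"{year}-{month:02d}: {seconds / SECONDS_PER_YEAR:.2f} CPU-years")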
25
LIGO Inspiral Search Application
- Describe…
The Inspiral workflow application is the work of Duncan Brown (Caltech), Scott Koranda (UW Milwaukee), and the LSC Inspiral group.
26
FOAM: Fast Ocean/Atmosphere Model
250-member ensemble run on TeraGrid under VDS
[Workflow diagram: for each ensemble member 1...N, remote directory creation, then a FOAM run, followed by atmosphere, ocean, and coupled post-processing; results are transferred to archival storage]
Work of Rob Jacob (FOAM) and Veronica Nefedova (workflow design and execution)
27
FOAM and VDS
- Climate supercomputer and grad student: 160 ensemble members in 75 days
- TeraGrid and VDS: 250 ensemble members in 4 days
Visualization courtesy of Pat Behling and Yun Liu, UW Madison
28
Summary: Science as Workflow
[Diagram: a workflow progresses from not yet executable, to executable, to executing, to executed (from “what I want to do”, through “what I am doing”, to “what I did”), with query, edit, and schedule operations against the execution environment]
29
Acknowledgements
- The Virtual Data System group is:
  - ISI/USC: Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi
  - U of Chicago: Ben Clifford, Ian Foster, Mike Wilde, Yong Zhao
- GriPhyN is supported by the NSF
- Many research efforts involved in this work are supported by the US Department of Energy, Office of Science