The GriPhyN Virtual Data System Ian Foster for the VDS team.


1 The GriPhyN Virtual Data System Ian Foster for the VDS team

2 Science as “Workflow”: E.g., Galaxy Cluster Search
[Figure labels: Sloan Data; DAG; Galaxy cluster size distribution]
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago

3 Requirements
- Express complex multi-step “workflows”
  - Perhaps 100,000s of individual tasks
- Operate on heterogeneous distributed data
  - Different formats & access protocols
- Harness many computing resources
  - Parallel computers &/or distributed Grids
- Execute workflows reliably
  - Despite diverse failure conditions
- Enable reuse of data & workflows
  - Discovery & composition
- Support many users, workflows, resources
  - Policy specification & enforcement

4 Virtual Data System
- Express complex multi-step “workflows”
  - Perhaps 100,000s of individual tasks
- Operate on heterogeneous distributed data
  - Different formats & access protocols
- Harness many computing resources
  - Parallel computers &/or distributed Grids
- Execute workflows reliably & efficiently
  - Despite diverse failure conditions
- Enable reuse of data & workflows
  - Discovery & composition
- Support many users, workflows, resources
  - Policy specification & enforcement
[Component labels on the slide: VDL, XDTM; Pegasus, DAGman, Globus; VDC; TBD]

5 Virtual Data System
[Architecture diagram. Labels: VDL Program; Virtual Data catalog; Virtual Data Workflow Generator; Workflow spec; Abstract workflow; Create Execution Plan; Local planner; Job Planner; Job Cleanup; Statically Partitioned DAG; Dynamically Planned DAG; DAGman DAG; DAGman & Condor-G; Grid Workflow Execution]
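As a rough illustration of what "workflow as DAG" means on this slide, the sketch below orders a small abstract workflow so that every task runs after the tasks it depends on. The task names and the `execution_order` helper are hypothetical, and this is not how Pegasus or DAGman plan or schedule jobs; it only shows the dependency-ordering idea using Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical abstract workflow: each task maps to the set of tasks it depends on.
workflow = {
    "reorient": set(),
    "align": {"reorient"},
    "reslice": {"align"},
    "mean": {"reslice"},
}

def execution_order(dag):
    """Return one valid order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())

order = execution_order(workflow)
print(order)
```

A real planner would additionally bind each task to a site and insert data transfer and cleanup jobs; none of that is modeled here.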

6 Genome Analysis & DB Update (GADU) 600-1000+ CPUs

7 The Rest of the Talk
- Express complex multi-step “workflows”
  - Perhaps 100,000s of individual tasks
- Operate on heterogeneous distributed data
  - Different formats & access protocols
- Harness many computing resources
  - Parallel computers &/or distributed Grids
- Execute workflows reliably & efficiently
  - Despite diverse failure conditions
- Enable reuse of data & workflows
  - Discovery & composition
- Support many users, workflows, resources
  - Policy specification & enforcement
[Component labels on the slide: VDL, XDTM; Pegasus, DAGman, Globus; VDC; TBD; Ewa]

8 “Messy” Scientific Data
- Diverse storage formats & access protocols
  - Logically identical dataset can be stored in text file (e.g. CSV), binary file, spreadsheet
  - Data available from filesystem, database, HTTP, WebDAV, etc.
- Metadata encoded in directory & file names
  - E.g.: “fMRI volume is composed of an image file & header file with same prefix”
- Format dependency hinders program and workflow reuse

9 But... Data is Often Logically Structured
- Scientific data often has a hierarchical structure
- A common practice is to select a set of data items and apply a transformation to each individual item
- Nesting such iterations can scale up to millions of objects
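The nested "select items, transform each one" pattern described on this slide can be sketched in a few lines. The hierarchy, the `transform_all` helper, and the use of `str.upper` as a stand-in transformation are all assumptions made for illustration; they are not part of VDS.

```python
def transform_all(node, transform):
    """Apply transform to every leaf of a nested dict/list hierarchy."""
    if isinstance(node, dict):
        return {key: transform_all(value, transform) for key, value in node.items()}
    if isinstance(node, list):
        return [transform_all(value, transform) for value in node]
    return transform(node)  # leaf: an individual data item

# Hypothetical study -> group -> subject -> volumes hierarchy.
study = {"group1": {"subject1": ["vol001", "vol002"]}}
result = transform_all(study, str.upper)
print(result)
```

Because the traversal follows the logical structure, the same call works whether a study holds two volumes or millions.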

10 Introducing a Typing System
- Describe logical data structures as types …
  - … & physical representations as mappings
- Define procedures in terms of typed datasets
  - … & apply procedures to different physical data
- Compose workflows from typed procedures
- Benefits
  - Type checking
  - Dataset selection and iteration
  - Discovery by types
  - Dynamic binding
  - Type conversion

11 XDTM (Moreau, Zhao, Wilde, Foster)
- XML Dataset Typing and Mapping
- Separates logical structure from physical representations
- Logical structure described by XML Schema
  - Primitive scalar types: int, float, string, date …
  - Complex types (structs and arrays)
- Mapping descriptor
  - How logical elements map to physical
  - External parameters (e.g. location)
- XPath for dataset selection

12 Mapping
- Define a common mapping interface
  - Initialize, read, create, write, close
- Data providers implement the interface
  - Responsible for data access details
- XView maintains cached logical datasets
[Diagram labels: VDS; XViewMgr; XView; Mapper; Data Source]
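The five operations named on this slide suggest an interface along the following lines. This is a minimal sketch under stated assumptions: the class names, method signatures, and the in-memory provider are hypothetical and do not reproduce the actual VDS mapper API.

```python
from abc import ABC, abstractmethod

class Mapper(ABC):
    """Hypothetical common mapping interface: initialize, read, create, write, close."""

    @abstractmethod
    def initialize(self, params): ...

    @abstractmethod
    def read(self, name): ...

    @abstractmethod
    def create(self, name): ...

    @abstractmethod
    def write(self, name, value): ...

    @abstractmethod
    def close(self): ...

class InMemoryMapper(Mapper):
    """Toy data provider that keeps logical elements in a dict."""

    def initialize(self, params):
        self.store = dict(params.get("initial", {}))
        self.is_open = True

    def read(self, name):
        return self.store[name]

    def create(self, name):
        self.store.setdefault(name, None)

    def write(self, name, value):
        self.store[name] = value

    def close(self):
        self.is_open = False

mapper = InMemoryMapper()
mapper.initialize({"initial": {"anat": "volume_anat"}})
mapper.create("mean")
mapper.write("mean", "mean_volume")
```

A file-system or database provider would implement the same five methods, so procedures written against `Mapper` would not need to change.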

13 Use Case: Functional MRI

Logical Structure
DBIC Archive
  Study #1
    Group #1
      Subject #1
        Anatomy
          high-res volume
        Functional Runs
          run #1
            volume #001
            ...
            volume #275
          ...
          run #5
            volume #001
            ...
          snrun #...
          …
    Group #5
    ...
  Study #...

Physical Representation
DBIC Archive
  Study_2004.0521.hgd
    Group_1
      Subject_2004.e024
        volume_anat.img
        volume_anat.hdr
        bold1_001.img
        bold1_001.hdr
        ...
        bold1_275.img
        bold1_275.hdr
        ...
        bold5_001.img
        ...
        snrbold*_*
        air*
        ...
    Group_5
    ...
  Study...
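To make the logical/physical split concrete, here is a small sketch that maps file names of the form shown on this slide (`bold1_001.img`, `bold1_001.hdr`) onto logical volumes, pairing each image with the header that shares its prefix. The `Volume` class, the `map_run` function, and the regular expression are illustrative assumptions, not the VDS mapper itself.

```python
import re
from dataclasses import dataclass

@dataclass
class Volume:
    img: str  # image file name
    hdr: str  # header file name with the same prefix

def map_run(filenames, run_number):
    """Group bold<run>_<index>.img/.hdr files into an ordered list of Volumes."""
    pattern = re.compile(rf"bold{run_number}_(\d+)\.(img|hdr)$")
    parts = {}
    for name in filenames:
        match = pattern.match(name)
        if match:
            index, extension = match.groups()
            parts.setdefault(index, {})[extension] = name
    # Keep only complete pairs, ordered by volume index.
    return [
        Volume(files["img"], files["hdr"])
        for _, files in sorted(parts.items())
        if "img" in files and "hdr" in files
    ]

files = ["bold1_001.img", "bold1_001.hdr", "bold1_002.hdr",
         "bold1_002.img", "bold5_001.img", "volume_anat.img"]
run1 = map_run(files, 1)
```

Procedures can then be written against `Volume` and lists of volumes, without any knowledge of the naming scheme.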

14 Type Definitions in VDL
Part of fMRI AIRSN (Spatial Normalization) Workflow

    type Image {};
    type Header {};
    type Volume { Image img; Header hdr; }
    type Anat Volume;
    type Warp {};
    type NormAnat { Anat aVol; Warp aWarp; Volume nHires; }
    type Run { Volume v[ ]; }
    type Subject { Anat anat; Run run[ ]; Run snrun[ ]; }
    type Group { Subject s[ ]; }
    type Study { Group g[ ]; }

15 Type Definitions in XML Schema <xs:schema targetNamespace="http://www.fmri.org/schema/airsn.xsd" xmlns="http://www.fmri.org/schema/airsn.xsd" xmlns:xs="http://www.w3.org/2001/XMLSchema">

16 Procedure Definition in VDL

    (Run snr) functional( Run r, NormAnat a, Air shrink ) {
        Run yroRun = reorientRun( r, "y" );
        Run roRun = reorientRun( yroRun, "x" );
        Volume std = roRun[0];
        Run rndr = random_select( roRun, .1 ); // 10% sample
        AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, [81,3,3] );
        Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
        Volume meanRand = softmean( reslicedRndr, "y", null );
        Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, [81,3,3] );
        Volume mnQA = reslice( meanRand, mnQAAir, "o", "k" );
        Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
        Run nr = reslice_warp_run( boldNormWarp, roRun );
        Volume meanAll = strictmean( nr, "y", null );
        Volume boldMask = binarize( meanAll, "y" );
        snr = gsmoothRun( nr, boldMask, 6, 6, 6 );
    }

17 Dataset Iteration
- Functional analysis expressed in typed datasets
- Iterate over each volume in a run

18 Expanded Execution Plan
- Datasets dynamically instantiated from data sources by mappers

19 Functional MRI Execution

20 Code Size Comparison
Lines of code with different workflow encodings

Workflow    Script  Generator  VDL
GENATLAS1   49      72         6
GENATLAS2   97      135        10
FILM1       63      134        17
FEAT        84      191        13
AIRSN       215     ~400       37
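A quick calculation puts the size difference in perspective. The sketch below assumes the table columns read Workflow, Script, Generator, VDL (so that, for example, GENATLAS1 is 49 script lines versus 6 VDL lines) and computes how many script lines each VDL line replaces. The dictionary and variable names are only for illustration.

```python
# (script lines, VDL lines) per workflow, as read from the slide's table.
lines_of_code = {
    "GENATLAS1": (49, 6),
    "GENATLAS2": (97, 10),
    "FILM1": (63, 17),
    "FEAT": (84, 13),
    "AIRSN": (215, 37),
}

# Ratio of script size to VDL size, rounded to one decimal place.
ratios = {
    name: round(script / vdl, 1)
    for name, (script, vdl) in lines_of_code.items()
}

for name, ratio in ratios.items():
    print(f"{name}: script is {ratio}x the size of the VDL version")
```

Under that reading of the table, the VDL encodings are roughly 4 to 10 times shorter than the hand-written scripts.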

21 The Rest of the Talk
- Express complex multi-step “workflows”
  - Perhaps 100,000s of individual tasks
- Operate on heterogeneous distributed data
  - Different formats & access protocols
- Harness many computing resources
  - Parallel computers &/or distributed Grids
- Execute workflows reliably & efficiently
  - Despite diverse failure conditions
- Enable reuse of data & workflows
  - Discovery & composition
- Support many users, workflows, resources
  - Policy specification & enforcement
[Component labels on the slide: VDL, XDTM; Pegasus, DAGman, Globus; VDC; TBD]

22 Virtual Data Schema

23 fMRI Virtual Data Queries

Which transformations can process a “subject image”?
- Q: xsearchvdc -q tr_meta dataType subject_image input
- A: fMRIDC.AIR::align_warp

List anonymized subject-images for young subjects:
- Q: xsearchvdc -q lfn_meta dataType subject_image privacy anonymized subjectType young
- A: 3472-4_anonymized.img

Show files that were derived from patient image 3472-3:
- Q: xsearchvdc -q lfn_tree 3472-3_anonymized.img
- A: 3472-3_anonymized.img 3472-3_anonymized.sliced.hdr atlas.hdr atlas.img … atlas_z.jpg 3472-3_anonymized.sliced.img
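The third query asks for everything derived from a given file. One way to picture that lookup is a breadth-first walk over derivation records, sketched below. The `derived_from` records are hypothetical (the slide lists the resulting files but not which file came from which), and the `descendants` function is an illustration, not the implementation behind `xsearchvdc`.

```python
from collections import deque

# Hypothetical derivation records: each derived file maps to the files it was computed from.
derived_from = {
    "3472-3_anonymized.sliced.img": ["3472-3_anonymized.img"],
    "3472-3_anonymized.sliced.hdr": ["3472-3_anonymized.img"],
    "atlas.img": ["3472-3_anonymized.sliced.img"],
    "atlas_z.jpg": ["atlas.img"],
}

def descendants(source, records):
    """Return every file derived, directly or transitively, from source."""
    # Invert the records so each input points at the outputs computed from it.
    children = {}
    for output, inputs in records.items():
        for name in inputs:
            children.setdefault(name, []).append(output)
    found, seen, queue = [], {source}, deque([source])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found

lineage = descendants("3472-3_anonymized.img", derived_from)
print(lineage)
```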

24 Provenance for ATLAS DC2 (High Energy Physics)

How much compute time was delivered?

+-------+------+------+
| years | mon  | year |
+-------+------+------+
| .45   | 6    | 2004 |
| 20    | 7    | 2004 |
| 34    | 8    | 2004 |
| 40    | 9    | 2004 |
| 15    | 10   | 2004 |
| 15    | 11   | 2004 |
| 8.9   | 12   | 2004 |
+-------+------+------+

Selected statistics for one of these jobs:
  start: 2004-09-30 18:33:56
  duration: 76103.33
  pid: 6123
  exitcode: 0
  args: 8.0.5 JobTransforms-08-00-05-09/share/dc2.g4sim.filter.trf CPE_6785_556... -6 6 2000 4000 8923 dc2_B4_filter_frag.txt
  utime: 75335.86
  stime: 28.88
  minflt: 862341
  majflt: 96386

Which Linux kernel releases were used?
How many jobs were run on a Linux 2.4.28 kernel?
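The monthly figures in the table can be totaled directly. The sketch below assumes the "years" column means years of compute time delivered in that month of 2004; the variable names are illustrative only.

```python
# (month of 2004, compute-years delivered), copied from the slide's table.
compute_years_by_month = [
    (6, 0.45), (7, 20), (8, 34), (9, 40), (10, 15), (11, 15), (12, 8.9),
]

total_years = sum(years for _, years in compute_years_by_month)
peak_month, peak_years = max(compute_years_by_month, key=lambda row: row[1])

print(f"total: {total_years:.2f} compute-years, peak month: {peak_month}")
```

Under that assumption, the seven months add up to a little over 133 years of compute time, with September as the busiest month.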

25 LIGO Inspiral Search Application
- Describe…
Inspiral workflow application is the work of Duncan Brown (Caltech), Scott Koranda (UW Milwaukee), and the LSC Inspiral group

26 FOAM: Fast Ocean/Atmosphere Model
250-Member Ensemble Run on TeraGrid under VDS
[Workflow diagram. Stages shown for ensemble members 1, 2, … N: Remote Directory Creation; FOAM run; Atmos Postprocessing; Ocean Postprocessing; Coupl Postprocessing; Results transferred to archival storage]
Work of: Rob Jacob (FOAM), Veronica Nefedova (workflow design and execution)

27 FOAM and VDS
- Climate Supercomputer and Grad student: 160 ensemble members in 75 days
- TeraGrid and VDS: 250 ensemble members in 4 days
Visualization courtesy Pat Behling and Yun Liu, UW Madison

28 Summary: Science as Workflow
[Diagram labels: What I Did; What I Am Doing; What I Want to Do; Executed; Executing; Executable; Not yet executable; Query; Edit; Schedule; Execution environment; …]

29 Acknowledgements
- The Virtual Data System group is:
  - ISI/USC: Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi
  - U of Chicago: Ben Clifford, Ian Foster, Mike Wilde, Yong Zhao
- GriPhyN is supported by the NSF
- Many research efforts involved in this work are supported by the US Department of Energy, Office of Science

