1
1 Yolanda Gil, PhD
Information Sciences Institute and Department of Computer Science
University of Southern California
gil@isi.edu
http://www.isi.edu/~gil
Scientific Reproducibility through Semantic Workflows and Shared Provenance Representations
2
2 NSF Workshop on Challenges of Scientific Workflows [Gil et al IEEE Computer 2007]
Despite investments in CyberInfrastructure as an enabler of a significant paradigm change in science:
  Reproducibility, key to the scientific method, is threatened
  Exponential growth in compute, sensors, data storage, and network, BUT the growth of science is not similarly exponential
What is missing:
  Perceived importance of capturing and sharing process in accelerating the pace of scientific advances
  Process (method/protocol) is increasingly complex and highly distributed
Workflows are emerging as a paradigm for process-model driven science that captures the analysis itself
Workflows need to be first-class citizens in science CyberInfrastructure
  Enable reproducibility
  Accelerate scientific progress by automating processes
Interdisciplinary and intradisciplinary research challenges
Report available at http://www.isi.edu/nsf-workflows06
3
3 Benefits of Workflow Systems [Taylor et al 07]
Managing execution
  Remote job submission
  Dependencies among steps
  Failure recovery
Managing distributed computation
  Move data when needed
Managing large data sets
  Efficiency, reliability
Security and access control
  Access to shared resources
Provenance recording
  Low-cost high-fidelity reproducibility
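To make "dependencies among steps" and "failure recovery" concrete, here is a minimal Python sketch of a runner that executes steps in dependency order and retries a failed step. It is an illustration only, not how Pegasus or DAGMan are implemented; the step names, the `run_workflow` helper, and the retry policy are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(steps, deps, max_retries=2):
    """Run steps in dependency order, retrying a failed step a few times.

    steps: dict mapping step name -> zero-argument callable (hypothetical jobs)
    deps:  dict mapping step name -> set of step names it depends on
    Returns the step names in the order they completed.
    """
    completed = []
    # static_order() yields a step only after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(max_retries + 1):
            try:
                steps[name]()
                break  # step succeeded, stop retrying
            except RuntimeError:
                if attempt == max_retries:
                    raise  # give up after the last allowed retry
        completed.append(name)
    return completed


# Hypothetical three-step workflow whose middle step fails once.
attempts = {"simulate": 0}


def flaky_simulate():
    attempts["simulate"] += 1
    if attempts["simulate"] < 2:
        raise RuntimeError("transient failure")


steps = {
    "stage_in": lambda: None,
    "simulate": flaky_simulate,
    "stage_out": lambda: None,
}
deps = {
    "stage_in": set(),
    "simulate": {"stage_in"},
    "stage_out": {"simulate"},
}
order = run_workflow(steps, deps)
```

A production system would also record which steps finished so that a restarted run can skip them; this sketch only retries in place.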
4
4 Capabilities Available Today: Wings/Pegasus Workflows for Seismic Hazard Analysis [Gil et al 07] (see also [Maechlin et al 05] [Deelman et al 06])
Input data: a site and an earthquake forecast model
  Thousands of possible fault ruptures and rupture variations, each a file, unevenly distributed
  ~110,000 rupture variations to be simulated for that site
High-level template combines 11 application codes
8048 application nodes in the workflow instance generated by Wings
Provenance records kept for 100,000 workflow data products
  Generated more than 2M triples of metadata
24,135 nodes in the executable workflow generated by Pegasus, including data stage-in jobs, data stage-out jobs, and data registration jobs
Executed on the USC HPCC cluster (1820 nodes w/ dual processors, but only < 144 available)
  Including MPI jobs, each running on hundreds of processors for 25-33 hours
  Runtime was 1.9 CPU years
5
5 The Wings/Pegasus Workflow System [Gil et al 07; Deelman et al 03; Deelman et al 05; Kim et al 08; Gil et al forthcoming]
Grid services: condor.uwisc.edu, www.globus.org
Pegasus: Automated workflow refinement and execution (pegasus.isi.edu)
WINGS: Semantic workflow environment (wings.isi.edu)
Knowledge-based reasoning on workflows and data (W3C’s OWL)
Semantic workflow catalogs
Automation and assistance
Execution-independent workflows
Optimize for performance, cost, reliability
Assign execution resources
Manage execution through DAGMan
Daily operational use in many domains
Secure and controlled sharing of distributed services, computing, data
Scalable service-oriented architecture
Commercial quality, open source
6
6 Semantic Workflows in WINGS [Gil et al IEEE IS 2010; Gil et al JETAI 2010; Gil et al eScience 2009; Kim et al JCCPE 2008; Gil et al 2007]
Semantic workflows: more than a dataflow graph
  Workflow variables: each constituent (node, link, component, dataset) has a corresponding variable
  Semantic constraints on workflow variables, both within and across variables
  Semantic descriptions of collections of data and components are concisely represented
[modelerInput_not_equal_to_classifierInput:
  (:modelerInput wflow:hasDataBinding ?ds1)
  (:classifierInput wflow:hasDataBinding ?ds2)
  equal(?ds1, ?ds2)
  (?t rdf:type wflow:WorkflowTemplate)
  -> (?t wflow:isInvalid "true"^^xsd:boolean)]
(TestData dcdom:isDiscrete false)
(TrainingData dcdom:isDiscrete false)
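The modelerInput_not_equal_to_classifierInput rule above marks a workflow template invalid when the modeler and the classifier are bound to the same dataset. The Python sketch below mirrors that check outside of any rule engine. It is an illustration only, not WINGS code: the `validate_template` function, the plain-dict representation of bindings, and the dataset names are assumptions made for the example.

```python
def validate_template(bindings):
    """Check a workflow template's data bindings against one constraint.

    bindings: dict mapping a workflow variable name to the identifier of the
    dataset bound to it (a hypothetical, simplified stand-in for the
    wflow:hasDataBinding triples in the rule).
    Returns (is_valid, reasons).
    """
    reasons = []
    ds1 = bindings.get("modelerInput")
    ds2 = bindings.get("classifierInput")
    # Same condition as equal(?ds1, ?ds2) in the rule: training and test
    # data must not be the same dataset.
    if ds1 is not None and ds1 == ds2:
        reasons.append(
            "modelerInput and classifierInput are bound to the same dataset"
        )
    return (not reasons, reasons)


# Hypothetical dataset identifiers.
ok, _ = validate_template(
    {"modelerInput": "training-set-A", "classifierInput": "test-set-B"}
)
bad, why = validate_template(
    {"modelerInput": "training-set-A", "classifierInput": "training-set-A"}
)
```

Expressing the constraint as a declarative rule, as the slide does, lets the same reasoner apply it to any template; the function form here only shows what the rule tests.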
7
7 Workflow Portal for Genetic Studies of Mental Disorders (with E. Deelman and C. Mason)
Existing repository of genotypic and phenotypic information
Goal: develop workflows useful for data in the repository
8
8 Designing a Workflow Collection for Population Genomics
Designed workflows for common analysis types
  Association tests
  CNV detection
  Variant discovery
  Family-based association analysis (TDT)
Developed workflow components by encapsulating widely used, heterogeneous open software
  Plink (Purcell, Harvard)
  R (Chambers et al)
  PennCNV (Penn) -- Hidden Markov Models
  Gnosis (State, Yale) -- sliding windows
  Allegro (Decode, Iceland) -- Multiterminal Binary Decision Diagrams
  Structure (Pritchard, Chicago) -- structured association
  FastLink (Schaffer, NCBI)
  Burrows-Wheeler Aligner (BWA) (Li & Durbin)
  SAMTools
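For readers unfamiliar with the family-based analysis named above, the standard transmission disequilibrium test compares how often heterozygous parents transmit a given allele to affected offspring (b) against how often they do not (c), using the statistic (b - c)^2 / (b + c), which is approximately chi-square distributed with one degree of freedom under the null hypothesis. The Python sketch below computes only that statistic. It is not taken from the workflows or from any of the tools listed, and the function name and the counts are made up for illustration.

```python
def tdt_statistic(transmitted, untransmitted):
    """Return the TDT chi-square statistic (b - c)^2 / (b + c).

    transmitted:   count of heterozygous parents who transmitted the allele (b)
    untransmitted: count of heterozygous parents who did not transmit it (c)
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - untransmitted) ** 2 / total


# Illustrative counts only: 60 transmissions versus 40 non-transmissions.
chi2 = tdt_statistic(60, 40)
```

In practice the statistic is compared against a chi-square distribution with one degree of freedom to obtain a p-value.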
9
9 Wings Workflows for Genetic Studies of Mental Disorders [Gil et al, forthcoming]
CNV Detection
Variant Discovery from Resequencing
Transmission Disequilibrium Test (TDT)
Association Tests
10
10 Major Features
Workflow system manages set up and execution
  Wings: set up
  Pegasus: execution
Initial collection of workflows captures common genomic analyses
Users can upload their own datasets
  Including collections of datasets
User data is secure
  Not accessible by others
11
11 Wings Replication of Crohn’s Disease Association Study from [Duerr et al, Science 06]
12
12 Wings Replication of Early-Onset Parkinson’s Disease Study from [Bayrakli et al, Human Mutation 07]
13
13 Observations about Reproducibility with Workflows [Gil et al, forthcoming]
Effort involved in reproducing results is minor
  30 seconds to set up a workflow
A catalog of carefully crafted workflows of select state-of-the-art methods will cover a wide range of genomic analyses
  Our workflows were independently developed and used “as is”
Semantic representations abstract the analysis method from the software that implements it
  Our workflows used different analytic tools than the original studies
  Many implementations of the same algorithm, some proprietary
Semantic constraints can be added to workflows to avoid analysis errors
  E.g., in the association analysis workflow, we added a constraint to remove duplicate individuals at the start to avoid problems downstream
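The duplicate-individuals constraint mentioned above can be pictured as a filtering step that runs before the association analysis. The following Python sketch is an illustration under assumptions, not the workflow's actual component: the record layout, the `individual_id` field, and the keep-first policy are all hypothetical.

```python
def remove_duplicate_individuals(records, id_key="individual_id"):
    """Return records with repeated individuals dropped, keeping the first.

    records: list of dicts, one per genotyped sample (hypothetical layout)
    id_key:  name of the field that identifies an individual (an assumption)
    """
    seen = set()
    unique = []
    for rec in records:
        ident = rec[id_key]
        if ident in seen:
            continue  # a later copy of an individual already kept
        seen.add(ident)
        unique.append(rec)
    return unique


# Made-up sample records; "P001" appears twice.
samples = [
    {"individual_id": "P001", "status": "case"},
    {"individual_id": "P002", "status": "control"},
    {"individual_id": "P001", "status": "case"},
]
cleaned = remove_duplicate_individuals(samples)
```

Running the filter first means every downstream step can assume each individual appears once, which is the kind of guarantee a semantic constraint is meant to enforce.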
14
14 Benefits of Semantic Workflows [Gil JSP-09]
Execution management:
  Automation of workflow execution
  Managing distributed computation
  Managing large data sets
  Security and access control
  Provenance recording
  Low-cost high-fidelity reproducibility
Semantics and reasoning:
  User assistance to correctly explore analysis “design space”
  Validation of analyses
  Automated generation of metadata
  Workflow retrieval and discovery
  “Conceptual” reproducibility
15
15 W3C Provenance Group (Y. Gil, chair): Goals
Provide state-of-the-art understanding and develop a roadmap for development and possible standardization
Articulate requirements for accessing and reasoning about provenance information
Develop use cases
Identify issues in provenance that are of direct concern to the Semantic Web
Articulate relationships with other aspects of Web architecture
Report on state-of-the-art work on provenance
Report on a roadmap for provenance in the Semantic Web
Identify starting points for provenance representations
Identify elements of a provenance architecture that would benefit from standardization
16
16 W3C Provenance Group: Products of the Group to Date
Group formed in September 2009, open to new members
All information is public: http://www.w3.org/2005/Incubator/prov/wiki/
Developed a set of key dimensions for provenance (11/09)
  Grouped into three major categories: content, management, use
Developed use cases for provenance (12/09)
  More than 30 use cases, including ~10 in science, but others are relevant
Developed requirements for provenance from use cases (1/10)
  User requirements: what is the purpose of the provenance information
  Technical requirements: derived from the user requirements
  Report on “Requirements for Provenance on the Web”
Currently developing state-of-the-art report (expected 6/10)
Started to develop recommendations (expected 9/10)
  Mappings across provenance vocabularies (e.g., DC, OPM, SWAN, …)