Published by Jasmin Stone. Modified over 9 years ago.
1
Yolanda Gil (gil@isi.edu), USC Information Sciences Institute, February 4, 2010
Metadata Meets Semantic Workflows
Yolanda Gil, PhD
Information Sciences Institute and Department of Computer Science, University of Southern California
http://www.isi.edu/~gil
With Ewa Deelman, Jihie Kim, Varun Ratanakar, Christian Fritz, Paul Groth, Gonzalo Florez, Pedro Gonzalez, Joshua Moody
2
Outline
Brief introduction to computational workflows
Brief overview of semantic workflows
The Wings/Pegasus workflow system
Five benefits of semantic workflows
  – Reproducibility
  – Validation
  – Metadata generation
  – Data discovery
  – Workflow discovery
3
Scientific Data Analysis
Complex processes involving a variety of algorithms/software
4
NSF Workshop on Challenges of Scientific Workflows [Gil et al, IEEE Computer 2007]
Despite investments in CyberInfrastructure as an enabler of a significant paradigm change in science:
  – Reproducibility, key to the scientific method, is threatened
  – Exponential growth in compute, sensors, data storage, and network, BUT the growth of science is not similarly exponential
What is missing:
  – Perceived importance of capturing and sharing process in accelerating the pace of scientific advances
  – Process (method/protocol) is increasingly complex and highly distributed
Workflows are emerging as a paradigm for process-model driven science that captures the analysis itself
Workflows need to be first-class citizens in science CyberInfrastructure
  – Enable reproducibility
  – Accelerate scientific progress by automating processes
Interdisciplinary and intradisciplinary research challenges
Report available at http://www.isi.edu/nsf-workflows06
5
Benefits of Workflow Systems [Taylor et al 07]
Managing execution
Dependencies among steps
Failure recovery
Managing distributed computation
Move data when needed
Managing large data sets
Efficiency, reliability
Security and access control
Remote job submission
Provenance recording
Low-cost high-fidelity reproducibility
6
Wings/Pegasus Workflows for Seismic Hazard Analysis [Gil et al 07] (see also [Maechlin et al 05] [Deelman et al 06])
Input data: a site and an earthquake forecast model
Thousands of possible fault ruptures and rupture variations, each a file, unevenly distributed
~110,000 rupture variations to be simulated for that site
High-level template combines 11 application codes
8048 application nodes in the workflow instance generated by Wings
Provenance records kept for 100,000 workflow data products
Generated more than 2M triples of metadata
24,135 nodes in the executable workflow generated by Pegasus, including data stage-in jobs, data stage-out jobs, and data registration jobs
Executed on the USC HPCC cluster (1820 nodes w/ dual processors, but only < 144 available)
Including MPI jobs, each running on hundreds of processors for 25-33 hours
Runtime was 1.9 CPU years
7
Semantic Workflows in WINGS
Workflow templates
  – Dataflow diagram
  – Each constituent (node, link, component, dataset) has a corresponding variable
Semantic properties
  – Constraints on workflow variables
(TestData dcdom:isDiscrete false)
(TrainingData dcdom:isDiscrete false)
8
Semantic Constraints as Metadata Properties
Constraints on reusable template (shown below)
Constraints on current user request (shown above)
[modelerInput_not_equal_to_classifierInput:
  (:modelerInput wflow:hasDataBinding ?ds1)
  (:classifierInput wflow:hasDataBinding ?ds2)
  equal(?ds1, ?ds2)
  (?t rdf:type wflow:WorkflowTemplate)
  -> (?t wflow:isInvalid "true"^^xsd:boolean)]
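The rule on this slide marks a template as invalid when the modeler input and the classifier input are bound to the same dataset. A minimal Python sketch of that check follows; the `WorkflowTemplate` class and its `data_bindings` dictionary are hypothetical stand-ins for illustration, not the Wings data model.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowTemplate:
    """Hypothetical stand-in for a template: variable name -> bound dataset id."""
    name: str
    data_bindings: dict = field(default_factory=dict)


def is_invalid(template: WorkflowTemplate) -> bool:
    """Mirror the slide's rule: invalid if both inputs bind to the same dataset."""
    ds1 = template.data_bindings.get("modelerInput")
    ds2 = template.data_bindings.get("classifierInput")
    # The rule only fires when both bindings exist and are equal.
    return ds1 is not None and ds1 == ds2


same = WorkflowTemplate("t1", {"modelerInput": "ds_a", "classifierInput": "ds_a"})
different = WorkflowTemplate("t2", {"modelerInput": "ds_a", "classifierInput": "ds_b"})
print(is_invalid(same), is_invalid(different))
```

Training and testing on the same dataset is the analysis error this constraint is meant to catch.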
9
Why Semantic Workflows: 1) Easily Replicate Previously Published Results
A catalog of carefully crafted workflows of select state-of-the-art methods to cover a wide range of common analyses
Many implementations of same algorithm, some proprietary
Same implementation but new versions and bug fixes
Semantic workflows abstract from software implementation
Representing abstract classes of software components
  – Instances are the implemented codes
  – Workflow steps refer to component classes
Representing abstract kinds of data (e.g., exclude format)
Semantic reasoning needed to specialize workflow
To map the abstract workflow into an execution-ready workflow
To insert lower-level steps (e.g., data transformations)
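The specialization step described above can be sketched roughly as follows, assuming a simple lookup table in place of semantic reasoning. The catalog contents, the code name `assoc_tool_v2`, and the format labels are all hypothetical; the slides do not show how Wings performs this mapping.

```python
# Hypothetical catalog: abstract component class -> implemented codes,
# each with the data format it accepts.
IMPLEMENTATIONS = {
    "AssociationTest": [{"code": "assoc_tool_v2", "input_format": "binary_ped"}],
}


def specialize(abstract_steps, data_format, implementations):
    """Map abstract steps to concrete codes, inserting format conversions as needed."""
    plan = []
    current_format = data_format
    for step in abstract_steps:
        impl = implementations[step][0]  # naive choice: first registered code
        target_format = impl["input_format"]
        if target_format != current_format:
            # Lower-level step that the abstract workflow does not mention.
            plan.append("convert:" + current_format + "->" + target_format)
            current_format = target_format
        plan.append(impl["code"])
    return plan


print(specialize(["AssociationTest"], "ped", IMPLEMENTATIONS))
```

Because workflow steps name component classes rather than codes, swapping in a new version or a different implementation only changes the catalog, not the workflow.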
10
The Importance of Reproducibility
11
Difficulties in Replication
Some software is proprietary
Effort must be invested in data conversions
Software installation
Managing new versions
12
Wings Workflows for Genetic Studies of Mental Disorders [Gil et al, forthcoming]
Work with Christopher Mason from Cornell University
CNV Detection
Variant Discovery from Resequencing
Transmission Disequilibrium Test (TDT)
Association Tests
13
Wings Replication of Crohn’s Disease Association Study from [Duerr et al, Science 06]
14
Wings Replication of Early-Onset Parkinson’s Disease Study from [Bayrakli et al, Human Mutation 07]
15
Observations [Gil et al, forthcoming]
Effort involved in reproducing results is minor
  – 30 seconds to set up a workflow
A catalog of carefully crafted workflows of select state-of-the-art methods will cover a wide range of genomic analyses
  – Our workflows were independently developed and used “as is”
Semantic representations abstract the analysis method from the software that implements it
  – Our workflows used different analytic tools than the original studies
Semantic constraints can be added to workflows to avoid analysis errors
  – Our workflow removes duplicate individuals that would cause problems in the association analysis
16
Why Semantic Workflows: 2) Ensure Correct Use of State-of-the-Art Methods
Analytic software and methods are well documented, but all of it is text (papers, manuals, etc.)
Time consuming, hard to spot interdependencies, no validation
Semantic workflows can check constraints and guide users
Representing requirements of software components
  – Constraints on input data
  – Constraints on parameter settings given properties of input data
Representing metadata properties of datasets
Semantic reasoning needed:
  – To check constraints of each workflow step
  – To propagate constraints across the workflow
17
User’s Difficulties: Choosing Parameters
How do I set up the workflow parameters?
Association Test
Max individuals per cluster (“mc”) and merge distance p-value constraint (“ppc”)
Max Population
If Affymetrix data, set cutoff (“miss”) to 94%; if Illumina, 98%
18
Wings Workflow System Assists Users to Set Up Parameters Based on Characteristics of Datasets
Component Catalog
[MissingnessPerIndividual1:
  (?c rdf:type pcdom:Create_Binary_PEDFile_Class)
  (?c pc:hasInput ?idv1)
  (?idv1 pc:hasArgumentID "PEDFile")
  (?c pc:hasInput ?idv2)
  (?idv2 pc:hasArgumentID "MissingnessPerIndividual")
  (?idv1 dcdom:hasGenotypingRate ?v1)
  equal(?v1, "0.95"^^xsd:float)
  -> (?idv2 pc:hasValue "0.06"^^xsd:float)]
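Read procedurally, the catalog rule above sets the MissingnessPerIndividual parameter to 0.06 when the PED file's genotyping rate equals 0.95. A small Python sketch of that single rule is given here; the function name is hypothetical, and returning `None` when the rule does not fire is an assumption made for illustration.

```python
import math
from typing import Optional


def missingness_per_individual(genotyping_rate: float) -> Optional[float]:
    """Return the suggested parameter value, or None when the rule does not fire."""
    # Mirrors: equal(?v1, "0.95"^^xsd:float) -> (?idv2 pc:hasValue "0.06"^^xsd:float)
    if math.isclose(genotyping_rate, 0.95):
        return 0.06
    return None


print(missingness_per_individual(0.95))
print(missingness_per_individual(0.98))
```

In a component catalog, one such rule would exist per dataset characteristic, so the user never has to look the value up in a manual.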
19
Why Semantic Workflows: 3) Automatic Generation of Metadata
Metadata annotations are tedious and involved
Often not done, an obstacle to sharing and to reuse
Semantic workflows can automate the generation of metadata for analysis data products
Representing expected characteristics of output dataset for each software component given the input metadata
Representing metadata properties of input datasets
Semantic reasoning needed:
  – To propagate metadata for each workflow step
  – To propagate metadata across the workflow
20
Wings Metadata Generation: An Example in a Seismic Hazard Workflow [Kim et al 06; Gil et al 07]
[Figure: SeismogramGeneration step with the metadata of its input and output files]
RVM: 127_6.rvm
- source_id: 127
- rupture_id: 6
Rupture_variation: 127_6.txt.variation-s0000-h0000
- source_id: 127
- rupture_id: 6
- slip_realization_#: 0
- hypo_center_#: 1
Rupture_variation: 127_6.txt.variation-s0000-h0001
- source_id: 127
- rupture_id: 6
- slip_realization_#: 0
- hypo_center_#: 1
SGT: FD_SGT/PAS_1/A/SGT161
- site_name: PAS
- tensor_direction: 1
- time_period: A
- xyz_volumn_id: 161
Seismogram: Seismogram_PAS_127_6.grm
- site_name: PAS
- source_id: 127
- rupture_id: 6
- …
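In the example above, the seismogram's metadata can be derived from the metadata of its inputs: the site name comes from the SGT file, and the source and rupture identifiers come from the rupture variation. The sketch below illustrates that derivation in Python. The function and the file-naming pattern are inferred from the one example on the slide and are assumptions, not the actual Wings rules.

```python
def seismogram_metadata(rupture_variation: dict, sgt: dict) -> dict:
    """Derive output metadata for a seismogram from its two inputs' metadata."""
    meta = {
        "site_name": sgt["site_name"],
        "source_id": rupture_variation["source_id"],
        "rupture_id": rupture_variation["rupture_id"],
    }
    # Naming pattern inferred from the slide's example file name (assumption).
    meta["file_name"] = "Seismogram_{site_name}_{source_id}_{rupture_id}.grm".format(**meta)
    return meta


# Input metadata copied from the slide.
variation = {"source_id": 127, "rupture_id": 6, "slip_realization_#": 0, "hypo_center_#": 1}
sgt = {"site_name": "PAS", "tensor_direction": 1, "time_period": "A", "xyz_volumn_id": 161}
print(seismogram_metadata(variation, sgt))
```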
21
Wings Workflows for Accuracy/Quality Tradeoffs in Biomedical Image Analysis [Kumar et al 09]
PIQ: Pixel Intensity Quantification (from National Center for Microscopy and Imaging Research [Chow et al 06])
Terabyte-sized out-of-core image data
Need to minimize execution time while preserving highest output quality
Some operations are parallelizable, others must operate on entire images
For efficiency, the image is decomposed (layers, tiles, and chunks), but quality is affected
From a workflow template, Wings can automatically generate descriptions of each individual piece of the image to manage the computations over each one
22
Why Semantic Workflows: 4) Discovery of Relevant Data
I need a dataset of updated common (known) loci to annotate findings. Where can I find one?
23
Why Semantic Workflows: 5) Retrieval of Workflows
Hard to find workflows for the type of analysis a user wants
Semantic information is not provided when creating the workflow
  – e.g., when a user adds a NaiveBayesModeler, he wouldn’t be expected to define that its output would be a NaiveBayesModel, or a Bayes Model (superclass), or not human readable
However, retrieval queries are often based on metadata properties of data
  – e.g., “Find workflows that can normalize data which is continuous and has missing values [constraints on inputs] to create a decision tree model [constraint on intermediate data products]”
Semantic representations are needed
For workflow constituents
  – Metadata properties of input, intermediate and final data products
  – Metadata properties of workflow and component function
For user queries
  – Express workflow sketches containing partial data descriptions (constraints)
Reasoning capabilities
Automatic creation of metadata for expected workflow data products
Workflow matching to queries (exact and partial)
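Exact and partial matching of workflows to a query can be illustrated with a simple scoring scheme: count how many of the query's constraints a workflow's metadata satisfies. This is only a sketch under that assumption. The catalog entries, property names, and scoring are hypothetical, and real matching would also reason over class hierarchies such as NaiveBayesModel being a kind of Bayes Model.

```python
# Hypothetical workflow catalog: metadata on inputs and intermediate products.
CATALOG = {
    "NormalizeThenDecisionTree": {
        "input": {"isContinuous": True, "hasMissingValues": True},
        "intermediate": {"modelType": "DecisionTreeModel"},
    },
    "NormalizeThenNaiveBayes": {
        "input": {"isContinuous": True, "hasMissingValues": True},
        "intermediate": {"modelType": "NaiveBayesModel"},
    },
}

# The slide's example query, expressed as constraints per data role.
QUERY = {
    "input": {"isContinuous": True, "hasMissingValues": True},
    "intermediate": {"modelType": "DecisionTreeModel"},
}


def match_score(workflow_meta: dict, query: dict) -> float:
    """Fraction of query constraints satisfied (1.0 means an exact match)."""
    total = sum(len(props) for props in query.values())
    if total == 0:
        return 1.0
    hits = 0
    for role, props in query.items():
        described = workflow_meta.get(role, {})
        for key, wanted in props.items():
            if described.get(key) == wanted:
                hits += 1
    return hits / total


def retrieve(catalog: dict, query: dict) -> list:
    """Return (name, score) pairs with a non-zero score, best match first."""
    scored = [(name, match_score(meta, query)) for name, meta in catalog.items()]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)


for name, score in retrieve(CATALOG, QUERY):
    print(name, round(score, 2))
```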
24
User’s Difficulties: Choosing an Analysis
What type of analysis is appropriate for my data?
CNV Detection
Variant Discovery from Resequencing
Transmission Disequilibrium Test (TDT)
Association Test
TDT analysis requires no less than 100 families
Variant discovery is used for genomic data from the same individual
Association tests are best for large datasets that are not within a family
25
User’s Difficulties: Choosing a Workflow
What workflow is appropriate for my goals?
Transmission Disequilibrium Test (TDT)
Association Test
Applies population stratification to remove outliers
Assumes outliers have been removed
Uses structured association
Uses a standard test
Incorporates parental phenotype information
Uses CMH association
26
An Algorithm for Semantic Enrichment of Workflow Templates [Gil et al K-CAP 09]
[Figure: workflow template annotated with propagated constraints; ?Model5, ?Model6, ?Model7, ?TestData, ?Dataset4, ?Dataset3 and ?TrainingData each carry dcdom:isDiscrete true]
Problem addressed: Semantic information is not provided when creating the workflow, but retrieval queries use it
Key idea: Constraints can be available in a component catalog and propagated through the workflow
Phase 1: Goal Regression
  – Starting from final products, traverse workflow backwards
  – For each node, query component catalog for metadata constraints on inputs
Phase 2: Forward Projection
  – Starting from input datasets, traverse workflow forwards
  – For each node, query component catalog for metadata constraints on outputs
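The two phases can be sketched in Python on a tiny linear workflow. This is an illustrative simplification, not the published algorithm. The component names, the catalog fields (`requires`, `asserts`, `preserves`) and the two-step workflow are all hypothetical; only the variable names and the `isDiscrete` property are borrowed from the slide.

```python
from collections import defaultdict

# Hypothetical component catalog. "requires" constrains a component's inputs,
# "asserts" describes its outputs, and "preserves" lists properties that pass
# through the component unchanged.
CATALOG = {
    "Sampler": {"requires": {}, "asserts": {}, "preserves": ["isDiscrete"]},
    "DiscreteModeler": {
        "requires": {"isDiscrete": True},
        "asserts": {"isDiscrete": True},
        "preserves": [],
    },
}

# (component, input variables, output variables), listed in dataflow order.
WORKFLOW = [
    ("Sampler", ["TrainingData"], ["Dataset3"]),
    ("DiscreteModeler", ["Dataset3"], ["Model5"]),
]


def enrich(workflow, catalog):
    """Propagate metadata constraints backwards, then forwards, over the workflow."""
    constraints = defaultdict(dict)

    # Phase 1: goal regression, from final products back to the inputs.
    for component, inputs, outputs in reversed(workflow):
        entry = catalog[component]
        for var in inputs:
            constraints[var].update(entry["requires"])
            for out in outputs:
                for prop in entry["preserves"]:
                    if prop in constraints[out]:
                        constraints[var][prop] = constraints[out][prop]

    # Phase 2: forward projection, from input datasets to the final products.
    for component, inputs, outputs in workflow:
        entry = catalog[component]
        for out in outputs:
            constraints[out].update(entry["asserts"])
            for var in inputs:
                for prop in entry["preserves"]:
                    if prop in constraints[var]:
                        constraints[out][prop] = constraints[var][prop]

    return dict(constraints)


for variable, props in enrich(WORKFLOW, CATALOG).items():
    print(variable, props)
```

In this sketch the modeler's input requirement reaches TrainingData during the backward pass, and Model5 receives its property during the forward pass, so every variable ends up annotated even though the user supplied no metadata.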
27
Conclusions: Benefits of Semantic Workflows [Gil JSP-09]
Execution management:
  – Automation of workflow execution
  – Managing distributed computation
  – Managing large data sets
  – Security and access control
  – Provenance recording
  – Low-cost high-fidelity reproducibility
Semantics and reasoning:
  – “Conceptual” reproducibility
  – User assistance to explore analysis “design space”
  – Validation of analyses
  – Automated generation of metadata
  – Workflow retrieval and discovery