Metadata Meets Semantic Workflows. Yolanda Gil, PhD, Information Sciences Institute. February 4, 2010.

Presentation transcript:

1. Metadata Meets Semantic Workflows
Yolanda Gil, PhD
Information Sciences Institute and Department of Computer Science, University of Southern California
February 4, 2010
With Ewa Deelman, Jihie Kim, Varun Ratnakar, Christian Fritz, Paul Groth, Gonzalo Florez, Pedro Gonzalez, Joshua Moody

2. Outline
- Brief introduction to computational workflows
- Brief overview of semantic workflows
- The Wings/Pegasus workflow system
- Five benefits of semantic workflows
  - Reproducibility
  - Validation
  - Metadata generation
  - Data discovery
  - Workflow discovery

3. Scientific Data Analysis
Complex processes involving a variety of algorithms/software

4. NSF Workshop on Challenges of Scientific Workflows [Gil et al, IEEE Computer 2007]
Despite investments in CyberInfrastructure as an enabler of a significant paradigm change in science:
- Reproducibility, key to the scientific method, is threatened
- Exponential growth in compute, sensors, data storage, and network, BUT the growth of science is not similarly exponential
What is missing:
- Perceived importance of capturing and sharing process in accelerating the pace of scientific advances
- Process (method/protocol) is increasingly complex and highly distributed
Workflows are emerging as a paradigm for process-model driven science that captures the analysis itself
Workflows need to be first class citizens in science CyberInfrastructure
- Enable reproducibility
- Accelerate scientific progress by automating processes
Interdisciplinary and intradisciplinary research challenges
Report available at

5. Benefits of Workflow Systems [Taylor et al 07]
- Managing execution
  - Dependencies among steps
  - Failure recovery
- Managing distributed computation
  - Move data when needed
- Managing large data sets
  - Efficiency, reliability
- Security and access control
  - Remote job submission
- Provenance recording
  - Low-cost high-fidelity reproducibility

6. Wings/Pegasus Workflows for Seismic Hazard Analysis [Gil et al 07] (see also [Maechlin et al 05] [Deelman et al 06])
- Input data: a site and an earthquake forecast model
  - Thousands of possible fault ruptures and rupture variations, each a file, unevenly distributed
  - ~110,000 rupture variations to be simulated for that site
- High-level template combines 11 application codes
- 8048 application nodes in the workflow instance generated by Wings
- Provenance records kept for 100,000 workflow data products
  - Generated more than 2M triples of metadata
- 24,135 nodes in the executable workflow generated by Pegasus, including data stage-in jobs, data stage-out jobs, and data registration jobs
- Executed on the USC HPCC cluster (1820 nodes w/ dual processors, but only < 144 available)
  - Including MPI jobs, each of which runs on hundreds of processors for hours
  - Runtime was 1.9 CPU years

7. Semantic Workflows in WINGS
- Workflow templates
  - Dataflow diagram
  - Each constituent (node, link, component, dataset) has a corresponding variable
- Semantic properties
  - Constraints on workflow variables
    (TestData dcdom:isDiscrete false)
    (TrainingData dcdom:isDiscrete false)

8. Semantic Constraints as Metadata Properties
- Constraints on reusable template (shown below)
- Constraints on current user request (shown above)
    [modelerInput_not_equal_to_classifierInput:
      (:modelerInput wflow:hasDataBinding ?ds1)
      (:classifierInput wflow:hasDataBinding ?ds2)
      equal(?ds1, ?ds2)
      (?t rdf:type wflow:WorkflowTemplate)
      -> (?t wflow:isInvalid "true"^^xsd:boolean)]
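The rule on this slide marks a template invalid when the modeler and the classifier are bound to the same dataset. As a rough illustration only (not the Wings rule engine), the same check can be sketched in Python over plain (subject, predicate, object) tuples. The helper names and the dataset file names are hypothetical.

```python
# Hypothetical illustration: triples are plain (subject, predicate, object) tuples,
# not the RDF store or rule engine that Wings actually uses.

def data_bindings(triples, variable):
    """Return the set of datasets bound to a workflow variable."""
    return {o for s, p, o in triples
            if s == variable and p == "wflow:hasDataBinding"}

def template_is_invalid(triples):
    """Mirror the slide's rule: invalid if both inputs share a dataset binding."""
    modeler = data_bindings(triples, ":modelerInput")
    classifier = data_bindings(triples, ":classifierInput")
    return bool(modeler & classifier)

# Made-up dataset names, used only to exercise the check.
same = [
    (":modelerInput", "wflow:hasDataBinding", "weather.arff"),
    (":classifierInput", "wflow:hasDataBinding", "weather.arff"),
]
different = [
    (":modelerInput", "wflow:hasDataBinding", "train.arff"),
    (":classifierInput", "wflow:hasDataBinding", "test.arff"),
]

print(template_is_invalid(same))       # True
print(template_is_invalid(different))  # False
```

A request that trains and tests on the same file is flagged before anything is executed, which is the point of attaching the constraint to the reusable template.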

9. Why Semantic Workflows: 1) Easily Replicate Previously Published Results
- A catalog of carefully crafted workflows of select state-of-the-art methods to cover a wide range of common analyses
  - Many implementations of same algorithm, some proprietary
  - Same implementation but new versions and bug fixes
- Semantic workflows abstract from software implementation
  - Representing abstract classes of software components
    - Instances are the implemented codes
    - Workflow steps refer to component classes
  - Representing abstract kinds of data (e.g., exclude format)
- Semantic reasoning needed to specialize workflow
  - To map the abstract workflow into an execution-ready workflow
  - To insert lower level steps (e.g., data transformations)

10. The Importance of Reproducibility

11. Difficulties in Replication
- Some software is proprietary
- Effort must be invested in data conversions
- Software installation
- Managing new versions

12. Wings Workflows for Genetic Studies of Mental Disorders [Gil et al, forthcoming]
Work with Christopher Mason from Cornell University
- CNV Detection
- Variant Discovery from Resequencing
- Transmission Disequilibrium Test (TDT)
- Association Tests

13. Wings Replication of Crohn’s Disease Association Study from [Duerr et al, Science 06]

14. Wings Replication of Early-Onset Parkinson’s Disease Study from [Bayrakli et al, Human Mutation 07]

15. Observations [Gil et al, forthcoming]
- Effort involved in reproducing results is minor
  - 30 seconds to set up a workflow
- A catalog of carefully crafted workflows of select state-of-the-art methods will cover a wide range of genomic analyses
  - Our workflows were independently developed and used “as is”
- Semantic representations abstract the analysis method from the software that implements it
  - Our workflows used different analytic tools than the original studies
- Semantic constraints can be added to workflows to avoid analysis errors
  - Our workflow removes duplicate individuals that would cause problems in the association analysis

16. Why Semantic Workflows: 2) Ensure Correct Use of State-of-the-Art Methods
- Analytic software and methods are well documented, but only as text (papers, manuals, etc)
  - Time consuming, hard to spot interdependencies, no validation
- Semantic workflows can check constraints and guide users
  - Representing requirements of software components
    - Constraints on input data
    - Constraints on parameter settings given properties of input data
  - Representing metadata properties of datasets
- Semantic reasoning needed:
  - To check constraints of each workflow step
  - To propagate constraints across the workflow

17. User’s Difficulties: Choosing Parameters
How do I set up the workflow parameters?
- Association Test
- Max individuals per cluster (“mc”) and merge distance p-value constraint (“ppc”)
- Max Population
- If Affymetrix data, set cutoff (“miss”) to 94%, if Illumina 98%

18. Wings Workflow System Assists Users to Set Up Parameters Based on Characteristics of Datasets
Component Catalog
    [MissingnessPerIndividual1:
      (?c rdf:type pcdom:Create_Binary_PEDFile_Class)
      (?c pc:hasInput ?idv1)
      (?idv1 pc:hasArgumentID "PEDFile")
      (?c pc:hasInput ?idv2)
      (?idv2 pc:hasArgumentID "MissingnessPerIndividual")
      (?idv1 dcdom:hasGenotypingRate ?v1)
      equal(?v1, "0.95"^^xsd:float)
      -> (?idv2 pc:hasValue "0.06"^^xsd:float)]
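The component catalog rule reads as a lookup from a metadata property of the input dataset (its genotyping rate) to a parameter value. A minimal Python sketch of that idea follows; the function and table names are hypothetical, and only the single 0.95 to 0.06 mapping is taken from the slide.

```python
import math
from typing import Optional

# Only this one mapping appears on the slide; any further entries would be assumptions.
GENOTYPING_RATE_TO_MISSINGNESS = [(0.95, 0.06)]

def missingness_per_individual(genotyping_rate: float) -> Optional[float]:
    """Return the MissingnessPerIndividual value for a PED file, or None if no rule fires."""
    for rate, value in GENOTYPING_RATE_TO_MISSINGNESS:
        if math.isclose(genotyping_rate, rate):
            return value
    return None

print(missingness_per_individual(0.95))  # 0.06
print(missingness_per_individual(0.98))  # None
```

Returning None when no rule matches leaves the parameter for the user to set, rather than guessing a default.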

19. Why Semantic Workflows: 3) Automatic Generation of Metadata
- Metadata annotations are tedious and involved
  - Often not done, an obstacle to sharing and to reuse
- Semantic workflows can automate the generation of metadata for analysis data products
  - Representing expected characteristics of output dataset for each software component given the input metadata
  - Representing metadata properties of input datasets
- Semantic reasoning needed:
  - To propagate metadata for each workflow step
  - To propagate metadata across the workflow

20. Wings Metadata Generation: An Example in a Seismic Hazard Workflow [Kim et al 06; Gil et al 07]
[Figure: datasets and their metadata around the SeismogramGration step. Labels recovered from the diagram:]
- RVM: 127_6.rvm (source_id, rupture_id: 6)
- Rupture_variation: 127_6.txt.variation -s0000-h (source_id, rupture_id: 6, slip_realization_#: 0, hypo_center_#: 1), repeated for each rupture variation
- SGT: FD_SGT/PAS_1/A/SGT161 (site_name: PAS, tensor_direction: 1, time_period: A, xyz_volume_id)
- Seismogram: Seismogram_PAS_127_6.grm (site_name: PAS, source_id, rupture_id: 6, …)

21. Wings Workflows for Accuracy/Quality Tradeoffs in Biomedical Image Analysis [Kumar et al 09]
- PIQ: Pixel Intensity Quantification (from National Center for Microscopy and Imaging Research [Chow et al 06])
- Terabyte-sized out-of-core image data
- Need to minimize execution time while preserving highest output quality
- Some operations are parallelizable, others must operate on entire images
- For efficiency, the image is decomposed (layers, tiles, and chunks), but quality is affected
- From a workflow template, Wings can automatically generate descriptions of each individual piece of the image to manage the computations over each one

22. Why Semantic Workflows: 4) Discovery of Relevant Data
I need a dataset of updated common (known) loci to annotate findings. Where can I find one?

23. Why Semantic Workflows: 5) Retrieval of Workflows
- Hard to find workflows for the type of analysis a user wants
  - Semantic information is not provided when creating the workflow
    - e.g., a user who adds a NaiveBayesModeler would not be expected to state that its output is a NaiveBayesModel, or a Bayes Model (superclass), or that it is not human readable
  - However, retrieval queries are often based on metadata properties of data
    - e.g., “Find workflows that can normalize data which is continuous and has missing values [<- constraints on inputs] to create a decision tree model [constraint on intermediate data products]”
- Semantic representations are needed
  - For workflow constituents
    - Metadata properties of input, intermediate and final data products
    - Metadata properties of workflow and component function
  - For user queries
    - Express workflow sketches containing partial data descriptions (constraints)
- Reasoning capabilities
  - Automatic creation of metadata for expected workflow data products
  - Workflow matching to queries (exact and partial)
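Exact and partial matching of workflows to a query can be pictured as comparing the query's constraints with the metadata recorded for each workflow. The Python sketch below is only an illustration under stated assumptions: the property keys, the workflow names, and the "fraction of satisfied constraints" score are all hypothetical and are not taken from Wings.

```python
def match_score(query, workflow_metadata):
    """Fraction of query constraints that the workflow's metadata satisfies."""
    if not query:
        return 1.0  # an empty query is trivially matched
    satisfied = sum(1 for key, value in query.items()
                    if workflow_metadata.get(key) == value)
    return satisfied / len(query)

# Query modeled on the slide's example: normalize continuous data with
# missing values to create a decision tree model.
QUERY = {
    "input.isContinuous": True,
    "input.hasMissingValues": True,
    "intermediate.type": "DecisionTreeModel",
}

# Hypothetical catalog of workflows and their (already generated) metadata.
WORKFLOWS = {
    "normalize_then_tree": {
        "input.isContinuous": True,
        "input.hasMissingValues": True,
        "intermediate.type": "DecisionTreeModel",
    },
    "normalize_then_bayes": {
        "input.isContinuous": True,
        "input.hasMissingValues": True,
        "intermediate.type": "NaiveBayesModel",
    },
}

# Rank workflows from best to worst match.
ranked = sorted(WORKFLOWS,
                key=lambda name: match_score(QUERY, WORKFLOWS[name]),
                reverse=True)
print(ranked)
```

A score of 1.0 corresponds to an exact match, and anything lower to a partial match. The sketch ignores class hierarchies (for example, a NaiveBayesModel also being a Bayes Model), which the slide indicates a real matcher would need to reason about.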

24. User’s Difficulties: Choosing an Analysis
What type of analysis is appropriate for my data?
- Candidate analyses: CNV Detection, Variant Discovery from Resequencing, Transmission Disequilibrium Test (TDT), Association Test
- TDT analysis requires no less than 100 families
- Variant discovery is used for genomic data from the same individual
- Association tests are best for large datasets that are not within a family

25. User’s Difficulties: Choosing a Workflow
What workflow is appropriate for my goals?
- Candidate analyses: Transmission Disequilibrium Test (TDT), Association Test
- Applies population stratification to remove outliers
- Assumes outliers have been removed
- Uses structured association
- Uses a standard test
- Incorporates parental phenotype information
- Uses CMH association

26. An Algorithm for Semantic Enrichment of Workflow Templates [Gil et al K-CAP 09]
- Problem addressed: Semantic information is not provided when creating the workflow, but retrieval queries use it
- Key idea: Constraints can be available in a component catalog and propagated through the workflow
- Phase 1: Goal Regression
  - Starting from final products, traverse workflow backwards
  - For each node, query component catalog for metadata constraints on inputs
- Phase 2: Forward Projection
  - Starting from input datasets, traverse workflow forwards
  - For each node, query component catalog for metadata constraints on outputs
[Figure: workflow template with data products Model5, Model6, Model7, annotated with the propagated constraint dcdom:isDiscrete true on ?Model5, ?Model6, ?Model7, ?TestData, ?Dataset4, ?Dataset3 and ?TrainingData]
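The two phases can be sketched in a few lines of Python. This is a toy illustration, not the published algorithm: the workflow is a hypothetical two-step chain, and the component catalog is reduced to three assumed fields per component ("requires" for input constraints, "asserts" for output constraints, and "preserves" for properties carried between inputs and outputs). Conflicting constraints are not handled.

```python
from collections import defaultdict

# Hypothetical component catalog; the field names are assumptions for this sketch.
CATALOG = {
    "Sampler": {"requires": {}, "asserts": {}, "preserves": ["isDiscrete"]},
    "Modeler": {"requires": {"isDiscrete": True}, "asserts": {},
                "preserves": ["isDiscrete"]},
}

# Hypothetical template, listed in execution order.
WORKFLOW = [
    {"component": "Sampler", "inputs": ["TrainingData"], "outputs": ["Dataset3"]},
    {"component": "Modeler", "inputs": ["Dataset3"], "outputs": ["Model5"]},
]

def enrich(workflow, catalog):
    """Propagate metadata constraints backwards, then forwards, through the workflow."""
    props = defaultdict(dict)

    # Phase 1: goal regression, from final products back to the inputs.
    for step in reversed(workflow):
        entry = catalog[step["component"]]
        for var in step["inputs"]:
            props[var].update(entry["requires"])
            for out in step["outputs"]:
                for name in entry["preserves"]:
                    if name in props[out]:
                        props[var][name] = props[out][name]

    # Phase 2: forward projection, from input datasets to the final products.
    for step in workflow:
        entry = catalog[step["component"]]
        for var in step["outputs"]:
            props[var].update(entry["asserts"])
            for inp in step["inputs"]:
                for name in entry["preserves"]:
                    if name in props[inp]:
                        props[var][name] = props[inp][name]

    return dict(props)

result = enrich(WORKFLOW, CATALOG)
for variable in sorted(result):
    print(variable, result[variable])
```

In this toy run the modeler's requirement on its input reaches TrainingData during the backward pass, and the forward pass then carries the same property onto Model5, so every variable ends up annotated with isDiscrete true, as in the slide's figure.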

27. Conclusions: Benefits of Semantic Workflows [Gil JSP-09]
- Execution management:
  - Automation of workflow execution
  - Managing distributed computation
  - Managing large data sets
  - Security and access control
  - Provenance recording
  - Low-cost high-fidelity reproducibility
- Semantics and reasoning:
  - “Conceptual” reproducibility
  - User assistance to explore analysis “design space”
  - Validation of analyses
  - Automated generation of metadata
  - Workflow retrieval and discovery