1
1 Improving the Reuse of Scientific Workflows and their By-products. Xiaorong Xiang, National Evolutionary Synthesis Center (NESCent), Duke University, University of North Carolina - Chapel Hill, and North Carolina State University. Gregory Madey, Department of Computer Science and Engineering, University of Notre Dame. 2007 IEEE International Conference on Web Services (ICWS 2007), Salt Lake City, Utah, July 2007. Supported in part by the Indiana Center for Insect Genomics (ICIG) & the Indiana 21st Century Fund.
2
2 Collaborators: Xiaorong Xiang & Jeanne Romero-Severson
3
3 Outline: two parts. Part 1: production system (MoGServ) for bioinformatics workflow (bioinformatics application, productivity improvement). Part 2: prototype system exploring ideas for end-user composition (workflow reuse, knowledge management/discovery).
4
4 Bioinformatics today. Rapidly accumulating data: DNA sequences, contigs, expression data, annotations, etc. Non-standard, independently developed, heterogeneous data sources. Data sharing and security. Productivity problem! Source: the article “Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade” by Folker Meyer, CTWatch Quarterly, August 2006, volume 2, number 3.
5
5 SOA in Bioinformatics. Recent exposure of data & analysis tools as services: large public databases and bioinformatics tools. Middleware projects: provide infrastructure to compose, manage, execute, and connect the distributed services. More community efforts needed to provide more shared and reliable services. More demonstration projects needed => best practices, measured utility, feedback to middleware projects, etc.
6
6 Mother of Green (MoG) project. Biological science: in collaboration with Prof. Jeanne Romero-Severson, Biological Sciences, University of Notre Dame; study the deep phylogeny of plastids. Computer science: provide an environment to support scientists’ investigations; a case study of using SOA for data and application integration; a prototype for future research in the service-oriented architecture domain.
7
7 Mother of Green. Malaria causes 1.5 - 2.7 million deaths every year. 3,000 children under age five die of malaria every day. Plasmodium falciparum (a protozoan parasite) causes human malaria. Drug resistance is a world-wide problem. Targeted drug design through phylogenomics. Figure: P. falciparum.
8
8 Mother of Green. P. falciparum has three genomes: nuclear, mitochondrial, plastid. Animals and insects have only two. Target the third genome. No harm to animals. New antimalarial drug. High risk, high tech, high payoff. J. Romero-Severson, Department of Biological Sciences; Greg Madey & Xiaorong Xiang, Department of Computer Science & Engineering.
9
9 Mother of Green. Plastids are the third genome. Intracellular organelles. Found in terrestrial plants, algae, apicomplexans. Functions in plants and algae: photosynthesis; oxidation of water; reduction of NADP; synthesis of ATP; fatty acid biosynthesis; aromatic amino acid biosynthesis. Functions in apicomplexans: ? Figures: chloroplast in plant cell; plastid in Toxoplasma sp.; apicoplast (plastid) in P. falciparum.
10
10 Mother of Green. The apicoplast appears to code for <30 proteins: repair, replication and transcription proteins. Why is the apicoplast essential?
11
11 Mother of Green Phylogenomics. Find the ancestors of the apicoplast. Identify genes in the ancestors. Determine gene function. Look for these genes in the P. falciparum nucleus. Then study regulatory mechanisms in candidate genes.
12
12 Phylogenomics of plastids. Very old lineage (> 2.5 billion years). Cyanobacterial ancestor. Three main plastid lineages. Glaucophytes: group of freshwater algae; chloroplast resembles intact cyanobacteria. Chlorophytes: green plant lineage; chloroplast genome reduced; many chloroplast genes now in nuclear genome. Rhodophytes: red algal lineage; chloroplast genome bigger than in green plants. Oomycetes. Apicomplexans.
13
13 Phylogenomics of plastids. One cyanobacterial ancestor? Many? Lineages are not linear. Figure panels: one plastid origin; multiple plastid origins.
14
14 The process of endosymbiosis. Horizontal gene transfer (arrows) from the plastid to the nucleus. The nucleomorph is a remnant of the original endosymbiont nucleus. Figure labels: cyanobacteria; primitive eukaryote; endosymbiont plastid; nucleus; second eukaryote; secondary endosymbionts; nucleomorph; secondary nonphotosynthetic endosymbiont (plastid disappears).
15
15 Tertiary endosymbiosis. Horizontal gene transfer. Figure labels: secondary endosymbiont; third eukaryote; tertiary endosymbionts; tertiary nonphotosynthetic endosymbiont (plastid disappears); P. falciparum.
16
16 The information gathering problem. Rapid accumulation of raw sequence information: ~100 sequenced chloroplast genomes; ~57 sequenced cyanobacterial genomes; rate of accumulation is increasing. Information accumulates faster than analyses finish. Information in forms not readily accessible. Solution: semi-automated web services; “smart” web services; semantic web.
17
17 A typical in-silico investigation: data-driven research. A: Query complete genome sequences given a taxon. B: Query protein-coding genes for each genome sequence. C: Eliminate vector sequences. D: Sequence alignment. E: Phylogenetic analysis.
18
18 Time-consuming manual web-based operations. Data collection: copy & paste! Analysis tool usage: copy & paste! Experiment data recording: copy & paste! Repetitive experiments for scientific discovery: copy & paste! Repeat as new data becomes available: copy & paste!
19
19 MoGServ system architecture. MoGServ interface: web interface; application interface. MoGServ middle layer: data access storage; data and analysis services; service and workflow registry; indexing and querying metadata; service and workflow enactment. Acting in two roles: service requester and service provider.
20
20 MoGServ System Architecture (diagram). Components: Web Interface; Applications; Application Server; MoGServ Middle Layer (Data Access Services, Data Analysis Services, Job Manager, Job Launcher, Service/Workflow Registry, Metadata Search, Local Data Storage, Workflow/SOAP Engines); Services Access Client; Data/Services Providers (NCBI, DDBJ, EMBL, others).
21
21 Data storage and access services. Local database: integrates data from multiple data sources according to scientists’ interests; supports repetitive investigations against several subsets of sequences; avoids network traffic and service failures when retrieving data on the fly from public data sources. Data in the local database is accessed through services.
22
22 Service and workflow registry. A table-based description with the necessary properties: text description; service location; input/output; provider; version; algorithm; invocation method. Not intended for supporting service discovery or composition. Used to answer end users’ questions about their results (provenance): “Which algorithm was used to generate the data and what is the source of the input data?” A repository of services and workflows used by local application developers.
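A minimal sketch of how a table-based registry row with the properties listed above could answer the provenance question. The class names, field names, and sample values are illustrative assumptions, not the MoGServ schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryEntry:
    """One row of a table-based service/workflow registry (hypothetical schema)."""
    name: str
    description: str
    location: str
    inputs: tuple
    outputs: tuple
    provider: str
    version: str
    algorithm: str
    invocation: str


class Registry:
    """In-memory stand-in for the registry table, keyed by entry name."""

    def __init__(self):
        self._rows = {}

    def register(self, entry):
        self._rows[entry.name] = entry

    def provenance(self, name, input_source):
        """Which algorithm produced the data, and where did the input come from?"""
        entry = self._rows[name]
        return {
            "algorithm": entry.algorithm,
            "version": entry.version,
            "provider": entry.provider,
            "input_source": input_source,
        }


registry = Registry()
registry.register(RegistryEntry(
    name="align",
    description="multiple sequence alignment",
    location="http://example.org/services/align",  # placeholder URL
    inputs=("setid",),
    outputs=("alignment_report",),
    provider="example provider",
    version="1.0",
    algorithm="clustalw",
    invocation="SOAP",
))
answer = registry.provenance("align", input_source="local database")
```

Here `answer` carries the algorithm, version, and provider recorded at registration time, plus the input source supplied by the caller.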
23
23 Indexing and querying metadata. Metadata: service and workflow descriptions; descriptions of sequence data, in order to track the origination of data; experimental data (output, input, and intermediate data). Indexing and querying with keywords: Lucene. Implemented as services.
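To illustrate keyword indexing and querying over metadata, here is a toy inverted index in plain Python. It is not the Lucene API (Lucene is a Java library); the class, document ids, and sample texts are hypothetical.

```python
import re
from collections import defaultdict


class KeywordIndex:
    """Toy inverted index: token -> set of document ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _tokens(text):
        # Lowercase and split on anything that is not a letter, digit, or underscore.
        return re.findall(r"[a-z0-9_]+", text.lower())

    def add(self, doc_id, text):
        for token in self._tokens(text):
            self._postings[token].add(doc_id)

    def search(self, query):
        """Return sorted ids of documents that contain every query token."""
        tokens = self._tokens(query)
        if not tokens:
            return []
        hits = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            hits &= self._postings.get(token, set())
        return sorted(hits)


index = KeywordIndex()
index.add("service:ClustalW", "Multiple sequence alignment service")
index.add("job:42", "Alignment of ATP gene sequences")
index.add("set:7", "Complete genome sequences for a taxon")

by_one_keyword = index.search("alignment")
by_two_keywords = index.search("sequences alignment")
```

A real deployment would add stemming, ranking, and persistence, which a library such as Lucene provides.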
24
24 Service and workflow enactment. Input: parameters, task name, timer. Job Manager: find the service/workflow definition in the Service/Workflow Registry using the task name; form a job description; output a job ID. Job Launcher, instances of workflow/service engines, and job information complete the diagram.
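The enactment flow above can be sketched as follows. This is an assumption-laden illustration: the registry is a plain dictionary mapping task names to Python callables, the launcher runs jobs immediately and ignores the timer, and all names are hypothetical.

```python
import itertools


class JobManager:
    """Looks up a definition by task name, forms a job description, returns a job ID."""

    def __init__(self, registry, launcher):
        self._registry = registry   # task name -> callable standing in for a service/workflow
        self._launcher = launcher
        self._ids = itertools.count(1)
        self.jobs = {}

    def submit(self, task_name, parameters, timer=None):
        definition = self._registry[task_name]  # raises KeyError for unknown tasks
        job_id = next(self._ids)
        job = {
            "id": job_id,
            "task": task_name,
            "definition": definition,
            "parameters": parameters,
            "timer": timer,          # kept in the description; not used by launch_now
            "status": "queued",
        }
        self.jobs[job_id] = job
        self._launcher(job)
        return job_id


def launch_now(job):
    """Stand-in for the Job Launcher: run at once and record job information."""
    job["result"] = job["definition"](**job["parameters"])
    job["status"] = "done"


manager = JobManager({"reverse": lambda sequence: sequence[::-1]}, launch_now)
job_id = manager.submit("reverse", {"sequence": "ATGC"})
```

Separating the manager from the launcher keeps scheduling decisions (such as honoring the timer) out of the lookup logic.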
25
25 Implementation. Development and deployment: J2EE, JSP, XSLT; Tomcat 5.0.18 / Axis 1.2. Database: PostgreSQL 8.1. Index and search of metadata: Apache Lucene library. Service implementation: Java2WSDL; command-line applications wrapped with the JLaunch library. Workflow: Taverna workbench, part of the myGrid project; Freefluo workflow engine.
26
26 Data and services. Services and workflows: data collection from remote databases; querying the local database; data analysis tools (blast, clustalw); data format conversion (readseq); management of data sets and jobs; download and upload. Data: complete genome sequences; ATP gene sequences; sequence sets; saved jobs.
27
27 Taverna workbench
28
28 A workflow created using the Taverna workbench tool
29
29 Improvement opportunities. Use existing domain ontologies in the bioinformatics community to describe services, workflows, and data. Integrate semantic web technology to support end-user workflow creation based on users' knowledge of the scientific domain. Support users with limited knowledge of scientific processes. Record various workflow representations. Facilitate the discovery and reuse of prior workflows. Knowledge management. Knowledge discovery.
30
30 Service composition and workflows. Service composition: ad-hoc; semi-automated (semantic annotation + reasoning); automated (semantic annotation + planning). Scientific workflows: workflows composed on a service-oriented architecture to assist scientists in accessing and analyzing data.
31
31 Current workflow management systems. Existing workflow management systems and bioinformatics middleware: Taverna, Kepler, Triana, Pegasus. Design, execute, monitor, re-run. Support ad-hoc, semi-automated and automated service discovery and composition from scratch.
32
32 Our approach. Reuse the verified knowledge and workflows in the community. Increase the correctness of composed workflows over time. Provide more accurate guidelines for users. A four-level hierarchical workflow structure. An enhanced workflow system.
33
33 Three user-defined workflows from different views. Question: “Are gene genealogies for ATP subunits α, β, and γ different?” Workflow A, defined by a less experienced user using the functional definition of services: Retrieving, then Aligning. Workflow B, defined by an intermediate user with executable services: queryGene, then clustalW. Workflow C, defined by an expert user with two extra executable services to ensure the accurate output of the biological process: queryGene, setIds, setFilter, then clustalW.
34
34 Enhanced workflow system (diagram). Service annotator: annotate services using ontology; semantics-enabled service registry. User: create abstract workflow using ontology. Ontology and OWL DL reasoner. Semantics-enabled service discovery and service matchmaking: find appropriate service. Workflow composer (software agent/experienced users): abstract workflow to concrete workflow. Workflow execution engine. Data provenance management: collect and manage information about data origination. Knowledge base management. Knowledge discovery. MoGServ.
35
35 Our hierarchical workflow structure (shown alongside the Pegasus workflow structure). Four levels: abstract workflow, concrete workflow, optimal workflow, workflow instance. Abstract to concrete: encode and convert the high-level definition to a low-level executable. Concrete to optimal: replace individual services with their optimal alternatives. Optimal to instance: invoke a workflow with specific input data and record the data provenance and the performance of services and workflows. Diagram labels: Task A, Task B; Service A, Service B, Service C, Service D; Service C’ (alternative to Service C); input, output.
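The four levels can be sketched as three small transformations. This is an illustration under assumptions: the selection rule (first known service per task), the quality scores, and the `clustalW_alt` service are hypothetical and not taken from the slides.

```python
def to_concrete(abstract_tasks, task_to_services):
    """Abstract -> concrete: pick the first known service for each task."""
    return [task_to_services[task][0] for task in abstract_tasks]


def to_optimal(concrete, alternatives, quality):
    """Concrete -> optimal: swap each service for its best-scoring alternative."""
    optimal = []
    for service in concrete:
        candidates = [service] + alternatives.get(service, [])
        # max() keeps the original service when scores tie, since it is listed first.
        optimal.append(max(candidates, key=lambda s: quality.get(s, 0.0)))
    return optimal


def to_instance(workflow, input_data):
    """Optimal -> instance: bind the workflow to specific input data."""
    return {"services": list(workflow), "input": input_data, "provenance": []}


abstract = ["retrieving", "aligning"]
task_to_services = {"retrieving": ["queryGene"], "aligning": ["clustalW"]}
alternatives = {"clustalW": ["clustalW_alt"]}          # hypothetical alternative
quality = {"queryGene": 0.8, "clustalW": 0.7, "clustalW_alt": 0.9}  # made-up scores

concrete = to_concrete(abstract, task_to_services)
optimal = to_optimal(concrete, alternatives, quality)
instance = to_instance(optimal, {"query_term": "ATP"})
```

Keeping each level as a separate value means an abstract workflow can be reused with different concrete bindings.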
36
36 Reusable knowledge. Connectivity: helps to convert from abstract workflow to concrete workflow. Alternative services: helps to convert from concrete workflow to optimal workflow. Quality profile of services: helps discover optimal workflows. Mapping of abstract workflow and concrete workflow: helps to choose reusable workflows.
37
37 Connectivity identification (match detection). Parameters are written as name(data type, semantic type).
Service: QueryLocal. Operation: createSet. performTask: mygrid:retrieving. inputPara: Settype(String, mog:gene), Queryterm(String, null). outputPara: Setid(string, mog:geneset). useResource: MoG.
Service: ClustalW. Operation: runClustalWdf. performTask: mygrid:aligning. inputPara: Setid(String, mog:set), Sequencetype(String, mog:sequence). outputPara: filen(string, mygrid:sequence_alignment_report). useResource: EBI.
Service: FormatConversion. Operation: convert. performTask: mygrid:translating. inputPara: filen(String, mygrid:sequence_alignment_report). outputPara: Out(String, mygrid:nexus_paup_format). useResource: MoG.
Matching rule: operation ij → operation mn if there exists a parameter k that is an output parameter of operation ij, and there exists a parameter o that is an input parameter of operation mn, such that data type(parameter o) = data type(parameter k) and semantic type(parameter o) = semantic type(parameter k).
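A minimal sketch of the matching rule applied to the three annotated operations on this slide. Two choices here are assumptions rather than stated behavior: data types are compared case-insensitively (the slide mixes `string` and `String`), and a missing semantic type (`null`) never matches.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Parameter:
    name: str
    data_type: str
    semantic_type: Optional[str]


@dataclass(frozen=True)
class Operation:
    service: str
    name: str
    inputs: tuple
    outputs: tuple


def matches(producer, consumer):
    """Return (output name, input name) pairs that satisfy the matching rule."""
    return [
        (out.name, inp.name)
        for out in producer.outputs
        for inp in consumer.inputs
        if out.semantic_type is not None
        and out.data_type.lower() == inp.data_type.lower()
        and out.semantic_type == inp.semantic_type
    ]


create_set = Operation(
    "QueryLocal", "createSet",
    inputs=(Parameter("Settype", "String", "mog:gene"),
            Parameter("Queryterm", "String", None)),
    outputs=(Parameter("Setid", "string", "mog:geneset"),),
)
clustalw = Operation(
    "ClustalW", "runClustalWdf",
    inputs=(Parameter("Setid", "String", "mog:set"),
            Parameter("Sequencetype", "String", "mog:sequence")),
    outputs=(Parameter("filen", "string", "mygrid:sequence_alignment_report"),),
)
convert = Operation(
    "FormatConversion", "convert",
    inputs=(Parameter("filen", "String", "mygrid:sequence_alignment_report"),),
    outputs=(Parameter("Out", "String", "mygrid:nexus_paup_format"),),
)

clustalw_to_convert = matches(clustalw, convert)
create_set_to_clustalw = matches(create_set, clustalw)
```

Under strict equality of semantic types, `mog:geneset` and `mog:set` do not match, so `create_set_to_clustalw` is empty; whether the original system treats one as compatible with the other is not stated on the slide.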
38
38 Need for verified service connectivity: the mismatching problem. Match detection output (yes/no) against real match (yes/no) gives four cases. TP: accurate annotation. FP: inaccurate annotation, lack of semantic annotation, inaccurate reasoning. FN: inaccurate annotation, lack of semantic annotation, inaccurate reasoning. TN: accurate annotation. Diagram examples: GenBankService (out: GenBank record) to Blastp (in: protein sequence), marked X; DDBJ-XML (out: sequence data record) to NCBI blast (in: sequence data record), fasta format versus self-defined format; mediator, adaptor, shim. Some mismatches may be detected by experts at design time or after a run; others can be detected automatically.
39
39 Connectivity graph implementation. Registration process: automatically identify the connectivity against the registry, identifyConnect(single service, RDF repository); store the connectivity in the knowledge base as connect(service a, operation ai, parameter c, service b, operation bi, parameter d). Workflow translation / service composition process: refine, update, decompose the workflow. Search at the syntactic level: search a path between two nodes; search the next available service; automatic composition based on input and output. Connectivity between services is converted to finding a path between two nodes in a graph. Implementation: Dijkstra's shortest path algorithm.
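Finding a path between two operations in a connectivity graph can be sketched with a standard Dijkstra search. The graph layout (a dictionary of weighted neighbors), the unit edge weights, and the specific edges are assumptions made for illustration.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra over {node: {neighbour: cost}}; returns a node list or None."""
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return None


# Edges stand for connections assumed to be already verified (illustrative only).
connectivity = {
    "QueryLocal.createSet": {"ClustalW.runClustalWdf": 1},
    "ClustalW.runClustalWdf": {"FormatConversion.convert": 1},
    "FormatConversion.convert": {},
}

path = shortest_path(connectivity, "QueryLocal.createSet", "FormatConversion.convert")
no_path = shortest_path(connectivity, "FormatConversion.convert", "QueryLocal.createSet")
```

With unit weights a breadth-first search would give the same result; weights become useful if edges carry a cost such as a quality score.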
40
40 Ontological modules used for semantic description of data, services & workflows. Generic Service Description Ontology (myGrid/Feta model). Service Domain Ontology (myGrid). Application Domain Ontology (MoGServ). Diagram labels: data, services, workflows; MoGServ application; software components for annotation; RDF store.
41
41 MoGServ Application Domain Ontology. To better track the data origination. To support the automation of workflow creation. To better share the data on the web in the future.
Example concepts and properties defined in MoGServ:
properties | domain | range
invokedby | Job | User
isParentOf | Set
isInstanceOf | Job | Service
hasSetName | Set | XML:String
Ontological modules (columns: number of concepts; number of properties, object and datatype): MoGServ 1297; myGrid 4198; myGrid/Feta model 261117.
42
42 Sample service/workflow annotation. Question: Which service has an operation that accepts nucleotide_sequence as a parameter? Answer: Uri: http://www.ebi.ac.uk …/alignment:blastn_ncbi. OperationName: Run. Displayed by Rdf-Gravity.
43
43 Implementation of annotation and query components for data, services & workflows. Sesame 1.2.6 library: supports files, RDBMS, SeRQL. Components: annotation templates (data), annotation templates (service), annotation components; query templates, query components; Sesame RDF store.
SeRQL query: Select Y, W, X from {Y} mg:hasOperation {W} mg:inputParameter {X} rdf:type {mog:set} using namespace rdf =, mg =, mog =
Result: Service: http:host.cse.nd.edu/axis/services/ClustalW?wsdl. Operation: runClustalWdf. inputParameter: setid.
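To show what the SeRQL path expression on this slide selects, here is an equivalent lookup over a plain list of triples in Python. It does not use Sesame; the triples and short identifiers are hypothetical stand-ins for the RDF store's contents.

```python
def find_operations_accepting(triples, semantic_type):
    """Return (service, operation, parameter) where the parameter has the given rdf:type."""
    has_operation = [(s, o) for s, p, o in triples if p == "mg:hasOperation"]
    input_parameter = [(s, o) for s, p, o in triples if p == "mg:inputParameter"]
    typed = {s for s, p, o in triples if p == "rdf:type" and o == semantic_type}
    return [
        (service, operation, parameter)
        for service, operation in has_operation
        for op, parameter in input_parameter
        if op == operation and parameter in typed
    ]


triples = [
    ("ClustalW", "mg:hasOperation", "runClustalWdf"),
    ("runClustalWdf", "mg:inputParameter", "setid"),
    ("setid", "rdf:type", "mog:set"),
    ("runClustalWdf", "mg:inputParameter", "sequencetype"),
    ("sequencetype", "rdf:type", "mog:sequence"),
]

rows = find_operations_accepting(triples, "mog:set")
```

The three list and set comprehensions correspond to the three chained edges of the query: `mg:hasOperation`, `mg:inputParameter`, and `rdf:type`.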
44
44 Experiment. Used 418 concepts from the domain ontology for semantic type; defined 10 concepts for data type. Randomly generated service annotations, 1 input and 1 output per service. Intel Pentium mobile, 1.5 GHz.
Number of services | Number of matched pairs | Load RDF repository (milliseconds) | Average time of match detection per single service (milliseconds)
200 | 10 | 1547 | 12.02
400 | 34 | 2346 | 13.01
600 | 84 | 2600 | 12.31
800 | 138 | 3015 | 12.35
1000 | 225 | 3325 | 12.51
Connectivity graph for 1000 services (right side of the slide): number of nodes 724; number of arcs 587; average path search time less than 1 millisecond; connectivity graph load time 220 milliseconds.
Paths by length: length 0 = 724, length 1 = 587, length 2 = 448, length 3 = 281, length 4 = 114, length 5 = 71, length 6 = 28, length 7 = 16, length 8 = 4, length 9 = 2.
Conclusion: feasible solution.
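As a rough sanity check on the matched-pair counts, one can estimate the expected number of matches under a simple model. The model is an assumption, not something the slide states: each service's single input and single output independently draws one of 10 data types and one of 418 semantic types uniformly at random, and a service is not paired with itself. The reported counts are as read from the slide's table.

```python
def expected_matched_pairs(n_services, n_semantic=418, n_data=10):
    """Expected ordered (producer, consumer) matches under uniform random annotation."""
    combinations = n_semantic * n_data          # distinct (data type, semantic type) pairs
    ordered_pairs = n_services * (n_services - 1)
    return ordered_pairs / combinations


reported = {200: 10, 400: 34, 600: 84, 800: 138, 1000: 225}
for n_services, observed in reported.items():
    estimate = expected_matched_pairs(n_services)
    print(f"{n_services} services: reported {observed}, model estimate {estimate:.1f}")
```

Both the reported counts and the estimate grow roughly with the square of the number of services, which is what the uniform model predicts.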
45
45 Reuse of workflows. Reuse of abstract workflows. Reuse of concrete workflows. Compare the structural similarity of two workflows. Implementation: SUBDUE algorithm. SUBDUE has a graph match utility that is part of its data mining system. A given workflow is converted to a graph and fed to the SUBDUE match algorithm.
Abstract example, graph view: input hasParameter query_term; task hasInput input; task performTask retrieving; task hasNext task; task performTask aligning; task hasOutput output; output hasParameter multiple_alignment_report.
Abstract example, SUBDUE input format:
v 1 input
v 2 output
v 3 task
v 4 task
v 5 query_term
v 6 retrieving
v 7 aligning
v 8 multiple_alignment_report
e 3 4 hasNext
e 3 1 hasInput
e 4 2 hasOutput
e 3 6 performTask
e 4 7 performTask
e 1 5 hasParameter
e 2 8 hasParameter
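Converting a workflow graph into the `v`/`e` line format shown on this slide can be sketched as below. The function name and the input representation (a list of vertex labels plus 1-based edge triples) are assumptions for illustration; only the line format follows the slide.

```python
def to_subdue(vertex_labels, edges):
    """Serialize a labeled graph as 'v <id> <label>' and 'e <src> <dst> <label>' lines."""
    lines = [f"v {i} {label}" for i, label in enumerate(vertex_labels, start=1)]
    lines += [f"e {src} {dst} {label}" for src, dst, label in edges]
    return "\n".join(lines)


# Vertex labels may repeat ("task" appears twice), so edges refer to 1-based positions.
vertex_labels = [
    "input", "output", "task", "task",
    "query_term", "retrieving", "aligning", "multiple_alignment_report",
]
edges = [
    (3, 4, "hasNext"),
    (3, 1, "hasInput"),
    (4, 2, "hasOutput"),
    (3, 6, "performTask"),
    (4, 7, "performTask"),
    (1, 5, "hasParameter"),
    (2, 8, "hasParameter"),
]

subdue_text = to_subdue(vertex_labels, edges)
```

Two workflows serialized this way can then be handed to a graph matcher to compare their structure.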
46
46 Conclusion. Pro: increase the correctness of the formed workflow over time; avoid incorrect or inaccurate semantic annotations; take advantage of verified knowledge; avoid the ontological reasoning process; better support for semi-automated and automated service composition over time; provide more accurate guidelines to users over time. Con: the connectivity graph can be big (number of parameters, number of services); searching the connectivity of a service when it is registered in the system may take a relatively long time (more complex matching rule, number of parameters); may not have high accuracy at the beginning.
47
47 Future work. Integrate GridSAM into MoGServ for execution and monitoring. Integrate Grid computing technology for resource allocation. Refine the MoGServ application domain ontology. Create an interface for end-user workflow creation. Create an interface for individual workspaces. Evaluate the scalability and accuracy of the connectivity graph approach and the graph matching approach with a large number of real workflows and services.
48
48 Thank you Questions?