Semantic Mediation in myGrid Chris Wroe Manchester University.



UK e-Science Pilot Project. Oct 2001 – April £3.4 million. £0.4 million studentships. Newcastle Nottingham Manchester Southampton Hinxton Sheffield

Data-intensive bioinformatics
ID   MURA_BACSU STANDARD; PRT; 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC ) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE BINDS PEP (BY SIMILARITY).
FT   CONFLICT S -> A (IN REF. 3).
SQ   SEQUENCE 429 AA; MW; 02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI

Diagram labels: Web Service (Grid Service) communication fabric; AMBIT Text Extraction Service; Provenance; Personalisation; Event Notification; Gateway; Service and Workflow Discovery; myGrid Information Repository; Ontology Mgt; Metadata Mgt; Workbench; Taverna; Talisman; Native Web Services; SoapLab; Web Portal; Legacy apps; Registries; Ontologies; FreeFluo Workflow Enactment Engine; OGSA-DQP Distributed Query Processor; Bioinformaticians; Tool Providers; Service Providers; Applications; Core services; External services; Service Stack Views; GowLab

Workflow approach: Graves' disease

Workflow approach II

Issues
Connecting web services together
–Shim services
Connecting data to web services
–Data provenance delivered by LSIDs
Connecting data to data
–Distributed Query Processing

Technology
–Resource Description Framework (RDF): representing metadata about data and services
–Web Ontology Language (OWL): representing concepts and classifications

myGrid & Bioinformatics world
Automating mainstream, well known tasks
Well known mature data formats
Often no formal description of formats
Lots of code to manipulate formats already exists (BioPerl, BioJava …)
Semantic mediation work in progress

Williams-Beuren Syndrome Workflow Main Bioinformatics Applications Explore gaps regions within the W-B Critical Region Main Bioinformatics Services Main Bioinformatics Application SHIM Services

Williams Example (simple) GenBank retrieval service Genscan Gene prediction service GenBank record has_part genomic sequence genomic sequence in GenBank record FASTA sequence Semantic level Syntactic level

Sample GenBank Record
LOCUS       AY bp mRNA linear VRT 07-MAY-2004
DEFINITION  Oncorhynchus nerka RH1 opsin mRNA, complete cds.
ACCESSION   AY
VERSION     AY GI:
KEYWORDS    .
SOURCE      Oncorhynchus nerka (sockeye salmon)
  ORGANISM  Oncorhynchus nerka
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Actinopterygii; Neopterygii; Teleostei; Euteleostei;
            Protacanthopterygii; Salmoniformes; Salmonidae; Oncorhynchus.
REFERENCE   1 (bases 1 to 1065)
  AUTHORS   Dann,S.G., Allison,W.T., Levin,D.B., Taylor,J.S. and Hawryshyn,C.W.
  TITLE     Salmonid opsin sequences undergo positive selection and indicate an
            alternate evolutionary relationship in oncorhynchus
  JOURNAL   J. Mol. Evol. 58 (4), (2004)
  PUBMED
REFERENCE   2 (bases 1 to 1065)
  AUTHORS   Dann,S.G., William,A.E., David,L.B. and Craig,H.W.
  TITLE     Direct Submission
  JOURNAL   Submitted (08-JAN-2003) Biology, University of Victoria, PO Box 3020
            Stn CSC, Victoria, British Columbia V8W 3N5, Canada
FEATURES             Location/Qualifiers
     source
                     /organism="Oncorhynchus nerka"
                     /mol_type="mRNA"
                     /db_xref="taxon:8023"
     CDS
                     /codon_start=1
                     /product="RH1 opsin"
                     /protein_id="AAP "
                     /db_xref="GI: "
                     /translation="MNGTEGPDFYVPMSNATGIVRNPYEYPQYYLVSPAAYSLMAAYM
                     FFLILTGFPINFLTLYVTIEHKKLRTALNYILLNLAVADLFMVIGGFTTTMYTSMHGY
                     FVFGRTGCNIEGFCATHGGEIALWSLVVLAIERWLVVCKPISNFRFSETHAIIGVAFT
                     WVMAAACSVPPLLGWSRYIPEGMQCSCGIDYYTRAPDINNESFVIHMFVVHFMIPLFI
                     ISFCYGNLLCAVKAAAAAQQESETTQRAEREVTRMVIMMVVSFLVCWVPYASVAWYIF
                     CNQGTEFGPVFMTIPAFFAKSSSLYNPLIYVLMNKQFRNCMITTLCCGKNPFEEEEGA
                     STTASKTEASSVSSSSVAPA"
ORIGIN
        1 atgaacggca cagagggacc agatttctac gtccctatgt ccaatgctac tggcattgtt
       61 aggaacccct atgaataccc ccagtactac cttgtcagcc cagcggcgta ctcactcatg
      121 gctgcctaca tgttcttcct catcctcacc ggcttcccca tcaacttcct cacactctat
      181 gtcaccatcg agcacaaaaa gctgaggacc gccctgaact acatcctgct gaacctggct
      241 gtggccgatc tcttcatggt aatcggaggc ttcaccacta cgatgtacac ctccatgcat
      301 ggctatttcg tctttggaag aacgggctgc aacatcgagg gattctgtgc tacccatggt
      361 ggtgagattg ccctatggtc cctggttgtc ctggctattg agaggtggtt ggtcgtctgc
      421 aaacctatta gcaacttccg cttcagtgag acccatgcca tcataggcgt ggcctttacc
      481 tgggtcatgg ctgctgcttg ctccgtcccc cctctgcttg ggtggtcccg ctatatcccc
      541 gaaggcatgc agtgctcatg tggaattgac tactacacgc gcgcccctga catcaacaat
      601 gagtcctttg tcatccacat gttcgttgtc cactttatga ttcccctgtt catcatctcc
      661 ttctgctacg gcaacctgct ctgcgctgtc aaggcagctg ccgccgccca gcaggagtct
      721 gagaccaccc agagggctga gagggaagtg acccgcatgg tcatcatgat ggtcgtctcc
      781 ttcctagtgt gctgggtgcc ctacgccagc gtggcctggt atatcttctg caaccaggga
      841 acagagttcg gccccgtctt catgacaatt ccggcattct ttgccaagag ttcgtccctg
      901 tacaaccctc tcatctacgt gttgatgaac aagcagttcc gcaactgcat gatcaccacc
      961 ctgtgctgtg ggaagaaccc cttcgaggag gaggagggag cctccaccac tgcctccaag
     1021 accgaggcct cctccgtgtc ctccagctcc gtggctcctg cataa
//

FASTA
>gi| |gb|AY | Oncorhynchus nerka RH1 opsin mRNA, complete cds
ATGAACGGCACAGAGGGACCAGATTTCTACGTCCCTATGTCCAATGCTACTGGCATTGTTAGGAACCCCT
ATGAATACCCCCAGTACTACCTTGTCAGCCCAGCGGCGTACTCACTCATGGCTGCCTACATGTTCTTCCT
CATCCTCACCGGCTTCCCCATCAACTTCCTCACACTCTATGTCACCATCGAGCACAAAAAGCTGAGGACC
GCCCTGAACTACATCCTGCTGAACCTGGCTGTGGCCGATCTCTTCATGGTAATCGGAGGCTTCACCACTA
CGATGTACACCTCCATGCATGGCTATTTCGTCTTTGGAAGAACGGGCTGCAACATCGAGGGATTCTGTGC
TACCCATGGTGGTGAGATTGCCCTATGGTCCCTGGTTGTCCTGGCTATTGAGAGGTGGTTGGTCGTCTGC
AAACCTATTAGCAACTTCCGCTTCAGTGAGACCCATGCCATCATAGGCGTGGCCTTTACCTGGGTCATGG
CTGCTGCTTGCTCCGTCCCCCCTCTGCTTGGGTGGTCCCGCTATATCCCCGAAGGCATGCAGTGCTCATG
TGGAATTGACTACTACACGCGCGCCCCTGACATCAACAATGAGTCCTTTGTCATCCACATGTTCGTTGTC
CACTTTATGATTCCCCTGTTCATCATCTCCTTCTGCTACGGCAACCTGCTCTGCGCTGTCAAGGCAGCTG
CCGCCGCCCAGCAGGAGTCTGAGACCACCCAGAGGGCTGAGAGGGAAGTGACCCGCATGGTCATCATGAT
GGTCGTCTCCTTCCTAGTGTGCTGGGTGCCCTACGCCAGCGTGGCCTGGTATATCTTCTGCAACCAGGGA
ACAGAGTTCGGCCCCGTCTTCATGACAATTCCGGCATTCTTTGCCAAGAGTTCGTCCCTGTACAACCCTC
TCATCTACGTGTTGATGAACAAGCAGTTCCGCAACTGCATGATCACCACCCTGTGCTGTGGGAAGAACCC
CTTCGAGGAGGAGGAGGGAGCCTCCACCACTGCCTCCAAGACCGAGGCCTCCTCCGTGTCCTCCAGCTCC
GTGGCTCCTGCATAA

Williams Example (simple) GenBank retrieval service Genscan Gene prediction service GenBank record has_part genomic sequence genomic sequence in GenBank record FASTA sequence Semantic level Syntactic level EMBOSS seqret service GenBank service
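The syntactic mismatch on this slide (a GenBank record coming out, a FASTA sequence going in) is the job the slide assigns to the EMBOSS seqret service. The Python below is only a minimal sketch of what such a format shim does, not the seqret implementation: the function name `genbank_to_fasta` is hypothetical, and it assumes a single-line DEFINITION field and a standard ORIGIN block terminated by `//`.

```python
import re


def genbank_to_fasta(record: str, line_width: int = 70) -> str:
    """Pull the ORIGIN sequence out of a GenBank flat-file record as FASTA.

    Simplified sketch: multi-line DEFINITION fields and multi-record
    files are not handled.
    """
    header = "unknown"
    seq_parts = []
    in_origin = False
    for line in record.splitlines():
        if line.startswith("DEFINITION"):
            header = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break  # end-of-record marker
        elif in_origin:
            # Sequence lines look like "   61 aggaacccct atgaataccc";
            # keep the letters, drop the base counter and the spaces.
            seq_parts.append(re.sub(r"[^A-Za-z]", "", line))
    sequence = "".join(seq_parts).upper()
    wrapped = [sequence[i:i + line_width]
               for i in range(0, len(sequence), line_width)]
    return "\n".join([">" + header] + wrapped)
```

Because the shim changes only the syntax and leaves the sequence itself untouched, it fits the "experimentally neutral" description used for shim services in this talk.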

Graves' disease ArrayExpress Gene clustering service Microarray expression data out Microarray expression data in Affymetrix CEL file Treeview format Semantic level Syntactic level

Example data CellHeader=X Y MEAN STDV NPIXELS CEL format Probe_Id Sample _at _at _at -59 Treeview format Template Cell header Probe ID _at _at _at

Graves' disease ArrayExpress Gene clustering service Microarray expression data out Microarray expression data in Affymetrix CEL file Treeview format Semantic level Syntactic level AffyR service Template file

Classification of shims
Defn: experimentally neutral service used to connect domain services that don’t quite fit
Shim service: FILTER; MAPPER; DEREFERENCER; TRANSLATOR; syntax (e.g. GenBank to EMBL); data (e.g. DNA to protein); TRANSFORMER; SIFTER (SQL SELECT type operation); PARSER (SQL PROJECT type operation), also known as SPLITTER or DECOMPOSER; COMPARER; SORTER
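The slide likens a SIFTER to an SQL SELECT and a PARSER to an SQL PROJECT. A minimal Python sketch of that analogy follows, assuming records are plain dictionaries; the function names and the field names used with them are illustrative, not part of myGrid.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def sifter(records: List[Record], keep: Callable[[Record], bool]) -> List[Record]:
    """SELECT-like shim: pass through only the records that satisfy `keep`."""
    return [r for r in records if keep(r)]


def parser(records: List[Record], fields: List[str]) -> List[Record]:
    """PROJECT-like shim: keep only the named fields of every record."""
    return [{f: r[f] for f in fields} for r in records]
```

Chaining the two, for example `parser(sifter(records, is_human), ["sequence"])` with a hypothetical `is_human` predicate, trims a service's output to what the next service expects without changing any values.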

Providing more assistance Taverna workbench 1. Register Taverna workbench 3. Query Pedro 2. Annotate

operation name, description input output task method resource application workflow bioMoby service WSDL operation Soaplab service service name, description author organisation WSDL service parameter name, description semantic type format transport type collection type collection format myGrid’s model of services

Service Description Flow Discovery Client Semantic Indexing Component Registry XML document describing service Extract service descriptions to reason over Pedro Jena RDF repository Instance Store FACT DL reasoner

execute ….. Pedro XML

RDF
Queries possible within RDF repository:
–Find me an operation called “exec*”
–Find me a service provided by groups working on Williams disease
–Find me an operation which performs aligning?
Graph labels: a1234, a2, “execute”, a3, #service, #local_pairwise_aligning, #operation, #aligning; properties: published_by, type, subclass, name, task, hasOperation

RDF
Queries not possible:
–Find me an operation which performs aligning which is local?
–Where does this service fit into a classification
Graph labels: a1234, a2, “execute”, a3, #service, #local_pairwise_aligning, #operation, #aligning; properties: published_by, type, subclass, name, task, hasOperation
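The queries the RDF repository can answer are pattern matches over triples, plus at most a walk along subclass edges. The Python sketch below uses plain tuples rather than Jena, and the exact wiring of the graph is an assumption read off the slide's labels (for example that a1234 is the service, a2 its operation and a3 its publisher).

```python
import fnmatch

# (subject, property, object) triples; the wiring is an assumed reading of the slide.
TRIPLES = {
    ("a1234", "type", "#service"),
    ("a1234", "published_by", "a3"),
    ("a1234", "hasOperation", "a2"),
    ("a2", "type", "#operation"),
    ("a2", "name", "execute"),
    ("a2", "task", "#local_pairwise_aligning"),
    ("#local_pairwise_aligning", "subclass", "#aligning"),
}


def objects(subject, prop):
    return {o for s, p, o in TRIPLES if s == subject and p == prop}


def subjects(prop, obj):
    return {s for s, p, o in TRIPLES if p == prop and o == obj}


def operations_named(pattern):
    """'Find me an operation called exec*'."""
    return sorted(op for op in subjects("type", "#operation")
                  if any(fnmatch.fnmatchcase(n, pattern) for n in objects(op, "name")))


def with_superclasses(cls):
    """The class itself plus everything reachable over subclass edges."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(objects(current, "subclass"))
    return seen


def operations_performing(task):
    """'Find me an operation which performs aligning'."""
    return sorted(op for op in subjects("type", "#operation")
                  if any(task in with_superclasses(t) for t in objects(op, "task")))
```

Nothing in these triples says what makes `#local_pairwise_aligning` local, so a query for "aligning which is local" has nothing to match against. That gap is what the property based OWL definitions on the following slides fill.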

OWL classes
Classes: #service, #local_pairwise_aligning, #operation
OWL property restriction: hasOperation
OWL property restriction: performsTask
Most specific class expression extracted
Definition: Service which has an operation which performs the task local pairwise aligning

OWL classes
Class hierarchy: service; aligning service; local aligning service; pairwise local aligning service
Each service class has its own property based OWL definition
Instance store indexes our service instance (a1234) in the appropriate place
Classification calculated by the FACT reasoner using property based definitions
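Property based classification can be illustrated with a toy subsumption check. This is not a description logic reasoner and not how FACT or the instance store work internally: it assumes each service class is defined solely by the task its operation must perform, and the intermediate task `local_aligning` is a hypothetical name added to complete the hierarchy.

```python
# Task hierarchy: child -> parent (None marks the root).
TASK_PARENT = {
    "local_pairwise_aligning": "local_aligning",  # "local_aligning" is assumed
    "local_aligning": "aligning",
    "aligning": None,
}

# Each service class is defined by the task its operation must perform.
SERVICE_CLASSES = {
    "service": None,  # no restriction on the task
    "aligning service": "aligning",
    "local aligning service": "local_aligning",
    "pairwise local aligning service": "local_pairwise_aligning",
}


def task_ancestors(task):
    """The task followed by its ancestors, most specific first."""
    chain = []
    while task is not None:
        chain.append(task)
        task = TASK_PARENT[task]
    return chain


def classify(instance_task):
    """Every service class whose definition the instance satisfies."""
    ancestors = set(task_ancestors(instance_task))
    return [name for name, required in SERVICE_CLASSES.items()
            if required is None or required in ancestors]


def most_specific(instance_task):
    """The class defined by the task closest to the instance's own task."""
    for task in task_ancestors(instance_task):
        for name, required in SERVICE_CLASSES.items():
            if required == task:
                return name
    return "service"
```

The instance is never told which class it belongs to. Its position follows from the definitions, which is why a new service class can be added without re-annotating existing services.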

Query by navigation Service browser Service classified by task

Use of ontologies
Property based classification requires property based modelling
Advantages
–Explicit, machine interpretable, easier to maintain large ontologies with polyhierarchies
Disadvantages
–Complex definitions take time/skill to author, require expert domain knowledge
–Difficult to present back to the user

Property based classification on steroids RNA sequence data DNA sequence data nucleic acid sequence data Data

Property based classification on steroids RNA sequence DNA sequence nucleic acid sequence RNA sequence data DNA sequence data nucleic acid sequence data encodes Data Feature

Property based classification on steroids RNA DNA nucleic acid RNA sequence DNA sequence nucleic acid sequence RNA sequence data DNA sequence data nucleic acid sequence data encodes sequence_of Data Feature Biological Concept

Property based classification on steroids ribonucleotide deoxyribonucleotide nucleotide RNA DNA nucleic acid RNA sequence DNA sequence nucleic acid sequence RNA sequence data DNA sequence data nucleic acid sequence data encodes sequence_of polymer_of Data Feature Biological Concept


Human readable ontologies GROWL parser OWL API Reasoner OWL API GROWL renderer

Only data to hand
Metadata associated with data items.
Life science identifier (LSID) protocol used to retrieve metadata.
Metadata model similar to service parameter
Data item: name, description, semantic type, format, collection type, collection format
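An LSID is a URN of the form `urn:lsid:authority:namespace:object`, optionally followed by `:revision`. The sketch below splits one into its parts, using `urn:lsid:taverna:datathing:15` from the provenance slide later in this deck as the example. The `LSID` type and `parse_lsid` function are illustrative names, not part of any LSID client library, and no resolution or metadata retrieval is attempted.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> LSID:
    """Split an LSID URN into its components; raise ValueError if malformed."""
    parts = text.split(":")
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError("not an LSID: " + repr(text))
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)
```

The authority component tells an LSID aware client where to look for the resolution service, which then hands back the data and its metadata separately.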

Provenance (1) Workflow run Workflow design Experiment design Project Person Organisation Process Service Event Data item data derivation e.g. output data derived from input data knowledge statements e.g. similar protein sequence to instanceOf partOf componentProcess e.g. web service invocation of NCBI componentEvent e.g. completion of a web service invocation at 12.04pm runBy e.g. NCBI run for Organisation level provenance Process level provenance Data/knowledge level provenance User can add templates to each workflow process to determine links between data items.

Provenance tracking
AC Homo sapiens BAC clone CTA-315H11 from 7, complete sequence
AC Homo sapiens BAC clone RP11-622P13 from 7, complete sequence
AL Human DNA sequence from clone RP11-553N16 on chromosome 1, complete sequence
AL Homo sapiens chromosome 21 segment HS21C
AL Human chromosome 14 DNA sequence BAC R-775G15 of library RPCI-11 from chromosome 14 of Homo sapiens (Human), complete sequence
BX Homo sapiens mRNA; cDNA DKFZp686G08119 (from clone DKFZp686G08119)
AC Homo sapiens 12q22 BAC RPCI11-256L6 (Roswell Park Cancer Institute Human BAC Library) complete sequence
AK Homo sapiens cDNA FLJ45040 fis, clone BRAWH
AC Homo sapiens chromosome 17, clone RP11-104J23, complete sequence
AL Human DNA sequence from clone RP4-715N11 on chromosome 20q Contains two putative novel genes, ESTs, STSs and GSSs, complete sequence
AC Homo sapiens BAC clone RP11-731I19 from 2, complete sequence
AC Homo sapiens chromosome 15, clone RP11-342M21, complete sequence
AL Human DNA sequence from clone RP11-461K13 on chromosome 10, complete sequence
AC Homo sapiens PAC clone RP3-368G6 from X, complete sequence
AC Homo sapiens chromosome 4 clone B200N5 map 4q25, complete sequence
AF Homo sapiens chromosome 21q22.3 PAC 171F15, complete sequence
>gi| |gb|AC | Homo sapiens BAC clone CTA-315H11 from 7, complete sequence
AAGCTTTTCTGGCACTGTTTCCTTCTT CCTGATAACCAGAGAAGGAAAAGATC TCCATTTTACAGATGAG
GAAACAGGCTCAGAGAGGTCAAGGCT CTGGCTCAAGGTCACACAGCCTGGGA ACGGCAAAGCTGATATTC
AAACCCAAGCATCTTGGCTCCAAAGC CCTGGTTTCTGTTCCCACTACTGTCAG TGACCTTGGCAAGCCCT
GTCCTCCTCCGGGCTTCACTCTGCAC ACCTGTAACCTGGGGTTAAATGGGCT CACCTGGACTGTTGAGCG
urn:lsid:taverna:datathing:15..BLAST_Report rdf:type urn:lsid:taverna:datathing:13..similar_sequences_to.. nucleotide_sequence rdf:type service invocation..created_by workflow invocation workflow definition experiment definition project person group service description organisation..described_by..run_during..invocation_of..part_of..works_for..part_of..author..run_for AB..masked_sequence_of..filtered_version_of
Relationship BLAST report has with other items in the repository
Other classes of information related to BLAST report

Using IBM’s Haystack GenBank record Portion of the Web of provenance Managing collection of sequences for review

Storage
LSID has no protocol for storage
Taverna/Freefluo implements its own data/metadata storage protocol
Diagram labels: Taverna/Freefluo, Metadata store, Data store, Publish interface, data, metadata

Retrieval
LSID protocol used to retrieve data and metadata
Query handled separately
Diagram labels: Metadata store, Data store, LSID interface, LSID aware client, Query, RDF aware client

Queries within Workflows
Grid Data Service: query, query result
Semantic content of result depends on query and data source schema
SELECT GO_ID FROM GO WHERE GO.term LIKE 'enzyme activity';
SELECT GO_Annotation_ID FROM GOA WHERE GO.term LIKE 'enzyme activity';
Gene ontology term ID, protein ID

Distributed Query Processing
DQP linked with the OGSA-DAI activity
Built within myGrid project
Plans execution of a query over multiple Grid Data Services
Each Grid Data Service provides schema metadata
Currently no semantic mediation

Example query
select p.proteinId, blast(p.sequence)
from p in protein, t in proteinTerm
where t.termId = 'GO: ' and p.proteinId = t.proteinId
“Select proteins and homologous proteins from SWISS-PROT which have been annotated with GO:008372”
Sources: Gene ontology database, SWISS-PROT protein database
t.proteinId, p.proteinId: Data encoding the identity of a protein in SWISS-PROT namespace
= DQP Plan
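What the example query asks of the query processor can be sketched as a filter on one source, a join on `proteinId` and a service call per result row. In the Python below the two sources are in-memory lists with made-up rows, `blast` is a stub standing in for the real service, and `run_query` is a hypothetical name. A real DQP plan would partition this work across Grid Data Services instead of running it in one process.

```python
# Stand-ins for the two sources; every row here is invented for illustration.
PROTEIN_TERM = [  # Gene ontology annotations: which protein carries which term
    {"proteinId": "P1", "termId": "GO:008372"},
    {"proteinId": "P2", "termId": "GO:000001"},
]
PROTEIN = [  # SWISS-PROT-like protein table
    {"proteinId": "P1", "sequence": "MEKLNIAGGD"},
    {"proteinId": "P2", "sequence": "SLNGTVHISG"},
]


def blast(sequence):
    """Stub for the BLAST service call made once per selected protein."""
    return "BLAST hits for " + sequence


def run_query(term_id, blast_fn=blast):
    """select p.proteinId, blast(p.sequence) ... where t.termId = term_id."""
    # Filter the annotation source first, then join on the shared proteinId.
    wanted = {t["proteinId"] for t in PROTEIN_TERM if t["termId"] == term_id}
    return [(p["proteinId"], blast_fn(p["sequence"]))
            for p in PROTEIN if p["proteinId"] in wanted]
```

The join is only valid because both sources use `proteinId` to mean the same thing, an identity in the SWISS-PROT namespace. Recording that shared meaning explicitly is the semantic description the slide points at.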

TAMBIS I
Query 1: Select motifs for antigenic human proteins that participate in apoptosis and are homologous to the lymphocyte associated receptor of death (also known as lard).
Translation: Select patterns in the proteins that invoke an immunological response and participate in programmed cell death that are similar in their sequence of amino acids to the protein that is associated with triggering cell death in the white cells of the immune system.
(A) Ontology expression: Motif which <isComponentOf (Protein which <hasOrganismClassification Species functionsInProcess Apoptosis hasFunction Antigen isHomologousTo Protein which )>)>
Species: Is instantiated by value “human”
ProteinName: Is instantiated by value “lard”

TAMBIS II
Informal query plan:
–Select proteins with protein name “lard” from SWISS-PROT
–Execute a BLAST sequence alignment process against SWISS-PROT results
–Check the entries for apoptosis process and antigen function
–Pass the resultant sequences to PROSITE to scan for their motifs
CPL expression:
set-unique {(#motif1:motif1) | \protein3 <- get-sp-entries-by-de("lard"), \protein2 <- do-blastp-by-sq-in-entry(protein3), check-sp-entries-by-kwd("apoptosis",protein2), check-sp-entries-by-de("antigen",protein2), check-sp-entry-for-species("human",protein2), \motif1 <- do-ps-scan-by-sq-in-entry(protein2)}

select p.proteinId, blast(p.sequence) from p in protein, t in proteinTerm where t.termId = 'GO: ' and p.proteinId = t.proteinId

How we did it in the past
–Service type directory
How we currently plan to do it
–Shims, GenBank, microarray
How we may want to do it in the future
–DQP & TAMBIS

Overview
We’re not attacking the same problem
When would your problem become our problem?
Common descriptions of the core entities involved
–Data items, Datasets, Services