Metadata in myGrid: Finding Services for in silico Science
Dr Katy Wolstencroft, myGrid, University of Manchester

…or how to use metadata and semantics to add value in a ‘standards free’ environment

Outline
–Introduction to Taverna, myGrid and myExperiment
–Bioinformatics – use of Web services and other services
–Semantic Service Discovery in myGrid
–myGrid ontology
–Our experiences
–BioCatalogue – bioinformatics service registry

Taverna Workflow Workbench Design and execution of workflows Access to local and remote resources and analysis tools Automation of data flow Iteration over large data sets Part of the my Grid project

[Figure: myGrid architecture diagram. Labels: Taverna Workbench GUI, Taverna Workflow Enactor, myExperiment Web Interface, Client Applications, 3rd Party Resources (Web Services, Grid Services), Workflow Warehouse, Service / Component Catalogue, Custom Datasets, Default Results, Provenance Warehouse, Resources, Feta, Information Services, LogBook, Provenance Management, Service Management, Service Ontology, Provenance Ontology, myGrid]

Lots of Resources NAR 2008 – over 1000 databases

Where From? Over 3500 services available Major Service Providers –European Bioinformatics Institute –DNA DataBank of Japan –NCBI – USA ‘Boutique’ Services –Individual research labs producing public data sets –Specialist tools for niche experiments

What types of services? HTML WSDL Web Services BioMart R-processor BioMoby Soaplab Local Java services Beanshell Workflows Variable or non-existent documentation or help

Taverna in an ‘open’ world
Advantages
–Connection to lots of resources
–Flexible system
–Can adapt to new technologies
Disadvantages
–Services are developed for other purposes
–We can’t control how they work
–We have to deal with the heterogeneity

Taverna Use Users worldwide Over 48500 downloads Bioinformatics – largest group of users Other users from –astronomy, –chemoinformatics, –health informatics –Systems Biology –Social sciences

Sleeping Sickness in African Cattle
Andy Brass, Steve Kemp, Paul Fisher
http://www.genomics.liv.ac.uk/tryps/trypsindex.html
–Caused by infection with a parasite (Trypanosoma brucei)
–Some cattle breeds are more resistant than others
–Differences between resistant and susceptible cattle?
–Can we breed cattle resistant to infection?
–High throughput experiments: microarray, QTL analysis
Fisher et al. (2007). Nucleic Acids Res. 35(16):5625-33

Bioinformatics Workflows Workflows allow high throughput experiments and automation Workflows are encapsulations of experiments Workflows developed for one experiment can be reused for others Easier to share, reuse and repurpose The METHODS section of a scientific publication

Workflow Reuse Downloaded 836 times Viewed 799 times Jo Pennock, lab biologist with no bioinformatics experience – Mouse whipworm infection Identified no candidate genes in 2 years with manual analysis Identified candidate genes in several hours using Paul’s workflow

Workflows are combinations of different services Locations and descriptions of services required at the design phase Reusing workflows – need to understand what they do In Silico Science Life Cycle

Finding Services When using services, scientists need to: Find them – in distributed locations, produced by different host institutions Interpret them – what do the services do - what experiments can they perform using them? Know how to invoke them – what data and initial parameters do they need to supply?

We could Google for them…
If a service is called by the name you expect, you’ll find it
–Search for ‘clustalw’ and ‘web service’
What if it’s not?
–The clustalw program from EMBOSS is called ‘emma’
–What if it’s the only web service version of clustalw?
–Does it stop you designing your workflow?

Metadata from a WSDL Pathport Web service from the Virginia Bioinformatics Institute http://pathport.vbi.vt.edu/services/wsdls/beta/glimmer.wsd Name of the service Uninformative names for parameters What kind of string?

Semantics and Web Services SAWSDL – Semantic Annotations for WSDL working group Virtually no uptake by bioinformatics service providers Doesn’t address non-WSDL services

Adding Semantics – Annotating Services Find services by their function instead of their name The services might be distributed, but a registry of service descriptions can be central and queried We need to annotate services with semantics In my Grid, we use the Feta Semantic Discovery tool and a semantic annotation tool – and expert curation
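The difference between finding a service by name and finding it by function can be sketched as follows. This is only an illustration, assuming a central registry held as an in-memory list: the `ServiceDescription` class, the registry contents and the task labels are hypothetical, not the Feta data model or actual myGrid ontology terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    """Hypothetical registry entry: a service name plus annotated task terms."""
    name: str
    provider: str
    tasks: frozenset


# Illustrative registry contents; the task labels are assumptions.
REGISTRY = [
    ServiceDescription("emma", "EMBOSS", frozenset({"aligning"})),
    ServiceDescription("glimmer", "Pathport", frozenset({"gene_prediction"})),
]


def find_by_name(registry, text):
    """Keyword search over service names, similar to a web search."""
    return [s.name for s in registry if text.lower() in s.name.lower()]


def find_by_task(registry, task):
    """Search over the annotated function of each service."""
    return [s.name for s in registry if task in s.tasks]


print(find_by_name(REGISTRY, "clustalw"))  # the name search misses 'emma'
print(find_by_task(REGISTRY, "aligning"))  # the function search finds it
```

The name search returns nothing for 'clustalw' because the only matching service is called 'emma', while the query on the annotated task still returns it.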

my Grid Ontology Logically separated into two parts: Service ontology Physical and operational features of (web) services Domain ontology Annotation vocabulary for core bioinformatics data, data types and their relationships

Service Ontology Models services from the point of view of the scientist –Where is it? – How many inputs/outputs? – Who hosts it? Invocation details are hidden by the Taverna workbench Differs from related initiatives in this respect

Domain Ontology Informatics: captures the key concepts of data, data structures, databases and metadata. Bioinformatics: The domain-specific data sources (e.g. the model organism sequencing databases), and domain-specific algorithms for searching and analyzing data (e.g. the sequence alignment algorithm, clustalw). Molecular biology: Concepts include examples such as, protein sequence, and nucleic acid sequence. Formats: A hierarchy describing bioinformatics file formats. For example, fasta format for sequence data, or phylip format for phylogenetic data Tasks: A hierarchy describing the generic tasks a service operation can perform. Examples include retrieving, displaying, and aligning.

[Figure: structure of the myGrid Ontology. Component ontologies: Web Service ontology, Task ontology, Informatics ontology, Molecular Biology ontology, Bioinformatics ontology. Edge labels: "Specialises", "Contributes to". Example data classes: sequence, biological_sequence, protein_sequence, nucleotide_sequence, DNA_sequence, protein_structure_feature. Example service classes: Similarity Search Service, BLAST service, BLASTp service, InterProScan service]

Example Service Annotation Example : BLAST from the DDBJ –Performs task: Alignment –Uses Method: Similarity Search Algorithm –Uses Resources: DNA/Protein sequence databases –Inputs: biological sequence (and format) database name (and format) blast program (and format) –Outputs: Blast Report Minimum Information model
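The BLAST annotation above can be written down as a simple record. This is a minimal sketch, assuming a flat structure with one field per property on the slide: the `ServiceAnnotation` class and the `missing_fields` helper are hypothetical and are not the MIOAWS model itself. The "(and format)" parts of the inputs are left out because the slide does not give the format values.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ServiceAnnotation:
    """Hypothetical record with one field per property named on the slide."""
    service: str
    performs_task: str = ""
    uses_method: str = ""
    uses_resources: list = field(default_factory=list)
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def missing_fields(annotation):
    """Return the names of properties that are still empty."""
    return [f.name for f in fields(annotation) if not getattr(annotation, f.name)]


# Values copied from the slide.
ddbj_blast = ServiceAnnotation(
    service="BLAST (DDBJ)",
    performs_task="Alignment",
    uses_method="Similarity Search Algorithm",
    uses_resources=["DNA/Protein sequence databases"],
    inputs=["biological sequence", "database name", "blast program"],
    outputs=["Blast Report"],
)

# A partial description of a made-up service, to show what is still missing.
partial = ServiceAnnotation(service="hypothetical_service", performs_task="Alignment")

print(missing_fields(ddbj_blast))
print(missing_fields(partial))
```

A check like `missing_fields` is one way to see how far a partial description is from a chosen minimum; which fields count as the minimum is an assumption here.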

Minimum Models in Biology MIBBI – Minimum Information about Biomedical and Biological Investigations –MIAME – Microarray experiments –MIAPE - Proteomics –MIRIAM – Biochemical models (SBML models) –Etc –MIOAWS – Minimum Information About the Operation of the Web Service

myGrid Ontology
–First version of the ontology ~2002
–Originally developed in DAML+OIL
–Now developed in OWL, with a version exported to RDFS
–Number of classes in the ontology ~750
–Domain and service ontology used by myGrid users and developers of myGrid-related plugins
–Service ontology also used by BioMoby
–W3C compliant with respect to ontology modelling

How do we use the ontology?
Two methods of service description
1. Decision Making – reasoning
–Single description – whole service model
–Ontology used to build a single, complete service description, and annotations are classified
–Enables automated composition of workflows
2. Decision Support – querying
–Composite matches to ontology terms
–Multiple terms are used to query the annotations

Originally – Decision Making
–Difficult and time consuming to produce the detailed service descriptions
–Assumption that people would want automated workflow composition
[Figure: example workflow. Labels: Sequence, Repeat Masker Web service, Gene Prediction Web Service, Blast Web Service, Predicted Genes out. Callouts: "Only 1 exists"; "Many different algorithms – effective with different organisms etc"; "Works over underlying databases"]

Resource Compatibility
Difference? Scientist’s choice – can they be sure the experiments are equivalent?
Example: Nucleotide sequence databases
–GenBank – USA
–EMBL – Europe
–DDBJ – Japan
Nightly updates – mirrored data, BUT the sequence annotation could be different

myGrid – Decision Support
–Reducing the list of known services from thousands to several
–Scientist makes the final decision about which of a selection of services to use
–Services are ‘tagged’ with terms from the ontology – very simple!
–No requirement for OWL-DL reasoning
–Generating service annotations is much easier
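Decision support of this kind needs nothing more than set matching over tags. The sketch below assumes each service is tagged with a flat set of ontology terms and that a query is a set of terms that must all be present. The service names, the tag labels and the `shortlist` function are hypothetical and do not reflect the Feta query interface.

```python
# Illustrative tag sets; the labels are assumptions, not real ontology terms.
ANNOTATIONS = {
    "ddbj_blast": {"aligning", "similarity_search_algorithm", "biological_sequence"},
    "emma": {"aligning", "protein_sequence"},
    "glimmer": {"gene_prediction", "nucleotide_sequence"},
}


def shortlist(annotations, required_terms):
    """Return the services whose tags include every required term.

    No reasoning is involved: this is plain subset matching, and the
    scientist still chooses among the services returned.
    """
    required = set(required_terms)
    return sorted(name for name, tags in annotations.items() if required <= tags)


print(shortlist(ANNOTATIONS, ["aligning"]))
print(shortlist(ANNOTATIONS, ["aligning", "similarity_search_algorithm"]))
```

Adding terms to the query narrows the shortlist, which matches the aim of reducing thousands of services to several rather than picking one automatically.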

So why do we need OWL?
Building workflows is a two-stage process
1. Assembly – identifying services that perform the scientific functions needed for the experiment
2. Gluing – identifying how (or more usually, if) these services are compatible
If they are incompatible, we need services that convert data formats and act as connectors. We call these services Shims

Cases for using the OWL version Automatic shim integration –Shims don’t do anything scientific, so choosing one over another makes no difference Detecting mismatches –A scientist has built a workflow and the output of processor 1 is incompatible with processor 2
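Both cases come down to comparing what one processor produces with what the next one accepts. The sketch below is an assumption-laden illustration, not OWL reasoning: it replaces the ontology with a small parent-pointer table, and the `Port` class, the class hierarchy, the 'genbank' format label and the `genbank_to_fasta` shim name are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical fragment of a data type hierarchy (child -> parent).
PARENT = {
    "protein_sequence": "biological_sequence",
    "nucleotide_sequence": "biological_sequence",
    "DNA_sequence": "nucleotide_sequence",
}

# Hypothetical shim table: (from_format, to_format) -> shim service name.
SHIMS = {("genbank", "fasta"): "genbank_to_fasta"}


@dataclass(frozen=True)
class Port:
    """An annotated input or output: a data type plus a file format."""
    data_type: str
    data_format: str


def is_a(term, ancestor):
    """True if term equals ancestor or is one of its descendants."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


def check_link(output, input_, shims):
    """Classify a link from one processor's output to the next one's input."""
    if not is_a(output.data_type, input_.data_type):
        return ("mismatch", None)      # the data types do not fit at all
    if output.data_format == input_.data_format:
        return ("compatible", None)    # nothing to do
    shim = shims.get((output.data_format, input_.data_format))
    return ("needs_shim", shim) if shim else ("mismatch", None)


print(check_link(Port("DNA_sequence", "genbank"),
                 Port("nucleotide_sequence", "fasta"), SHIMS))
```

Because a shim only converts formats, any entry that matches the format pair can be inserted automatically, while a type mismatch is reported back to the scientist.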

Limitations of the Current Model Feta discovery tool is only accessible from the Taverna Workbench Only pertinent to Taverna users – other people need to find and use web services Focuses on finding services, but not workflows. For reuse, we need to do both Closed annotation system - myGrid curator provides service descriptions – only 700 so far!

BioCatalogue: Public Bioinformatics Service Registry Collaboration between University of Manchester and EBI Expanding from a service for Taverna users to a service for anyone using bio web services Combine service and workflow discovery Accelerating the process of gathering service descriptions/annotations by engaging the scientific community Combines the myGrid initiative with BioMoby etc

Combining Service and Workflow Discovery myExperiment – social networking – Web 2.0 Workflows tagged No formal model No control Services – semantically described, ontology terms Access each through the same interface Exchanging metadata objects

‘Shopping’ for Services and Workflows
[Figure: screenshot of a bio service shopping site]

Getting the Minimum Community annotation Must be easy and quick Must allow partial descriptions Multiple annotations of the same service What is the minimum information to enable –service discovery –service invocation Tagging terms to formal models – OWL, SKOS intermediate?

Grading Services
–Bronze – enough to locate the service. Example of service invocation
–Silver
–Gold
–Platinum – full description. All properties annotated, including dependencies between them, reliability metrics etc

Annotation Provenance Who said what about what? Harvesting community annotation Verifying and augmenting by a curator ‘Trust’ Models Annotation versions –In a workflow context –As stand alone services

Annotation Process

Open Issues
–‘Open’ world means we cannot impose metadata standards
–Lots of heterogeneity
–Ontology modelling – stable standards to build upon
–Web services – shifting standards – need flexibility for future-proofing
–Other services as well as web services
–Combining and exchanging metadata objects behind interfaces
–Can we adopt something from the digital library community? e.g. OAI and ORE (Open Archives Initiative Object Reuse and Exchange)

myGrid acknowledgements
Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer
OMII-UK: Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble, Alan Williams, Ian Dunlop
Research: Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan
Current contributors: Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people
User Advocates and their bosses: Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell, Marco Roos, Matthew Pocock, Mark Wilkinson
Past Contributors: Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe
Industrial: Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica
Funding: EPSRC, Wellcome Trust
http://www.mygrid.org.uk
http://www.myexperiment.org
