An Identity Crisis in the Life Sciences Jun Zhao, Carole Goble and Robert Stevens The University of Manchester, UK Thanks to: Tom Oinn, Matthew Pocock,

Presentation transcript:

An Identity Crisis in the Life Sciences Jun Zhao, Carole Goble and Robert Stevens The University of Manchester, UK Thanks to: Tom Oinn, Matthew Pocock, Daniele Turi And our users And the EPSRC

UK e-Science project Middleware for in silico experiments by individual life scientists, stuck in under-resourced labs, who use other people’s applications acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt

Bioinformatics workflows Taverna workflow workbench collected metabolic pathway computed BLAST report Data pipelines Collect data Compute data Frequently updated public resources Open world Get the same data product in different experiment context Bioinformatician users

Workflow outcomes A record of outcome data and its provenance. Store data outcomes with a unique id, linked together in a typed graph. In fact, store all provenance as a graph.

urn:data:f2 urn:data1 urn:data2 urn:compareinvocation3 urn:data12 Blast_report [input] [output] [input] [distantlyDerivedFrom] SwissProt_seq [instanceOf] Sequence_hit [hasHits] urn:hit2…. urn:hit1… urn:hit50….. [instanceOf] [similar_sequence_to] Data generated by services/workflows Concepts [ ] [performsTask] Find similar sequence [contains] Services urn:data:3 urn:hit8…. urn:hit5… urn:hit10….. [contains] [instanceOf] urn:BlastNInvocation3 urn:invocation5 urn:data:f1 [output] New sequence Missed sequence [hasName] literals DatumCollection [type] LSDatum [type] Properties [instanceOf] [output] [directlyDerivedFrom] Concept Data

Fusion between different data models using shared concepts and shared data outputOf createdFrom contains_similiar_seq_to urn:genbank2 … urn:genbank1 … urn:genbank50 … Blast_report DNA_sequence urn:BlastNInvocation3 urn:data:3 urn:data2 inputOf Blast_service instanceOf urn:williamsA urn:run5 urn:data2 urn:run7 urn:williamsB GenBank UniProt runOf inputOf runOf createdBy LSID createdBy urn:data:f2 urn:data1 urn:data2 urn:compareinvocation3 urn:data12 Blast_report [input] [output] [input] [distantlyDerivedFrom] SwissProt_seq [instanceOf] Sequence_hit [hasHits] urn:hit2…. urn:hit1… urn:hit50….. [instanceOf] [similar_sequence_to] Data generated by services/workflows Concepts [ ] [performsTask] Find similar sequence [contains] Services urn:data:3 urn:hit8…. urn:hit5… urn:hit10….. [contains] [instanceOf] urn:BlastNInvocation3 urn:invocation5 urn:data:f1 [output] New sequence Missed sequence [hasName] literals DatumCollection [type] LSDatum [type] Properties [instanceOf] [output] [directlyDerivedFrom] Add assertions, Add rules Reason over assertions

Putting Provenance to Use Single workflow –audit trail –recipe Multiple workflow runs (versions) –Aggregation - gathering –Integration - merging –Comparison - differencing

Any idea? gi: Life Science Identifier A ruddy great lump of RDF

URIs for Data urn:lsid:mygrid.ac.uk:data:49841:1 Life Science Identifier Protocol for allocation and resolution Adopted by a range of data providers LSIDs in the data providers databases we collect during workflow execution LSIDs for the data products we computed during the workflow execution acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
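
The identifier on this slide follows the general LSID layout urn:lsid:authority:namespace:object with an optional revision. As a rough illustration only, the sketch below splits such a string into its parts. The names `LSID` and `parse_lsid` are made up for this example and are not part of Taverna or of any LSID library; the shortened identifiers used on later slides (for example urn:lsid:tav:brpt1) are abbreviations and would be rejected by this strict parser.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """Hypothetical container for the parts of a Life Science Identifier."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> LSID:
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts: {text!r}")
    # The "urn" and "lsid" prefixes are compared case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2], parts[3], parts[4]
    if not all((authority, namespace, object_id)):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)


# The identifier shown on the slide.
example = parse_lsid("urn:lsid:mygrid.ac.uk:data:49841:1")
print(example)
```

Keeping the revision separate makes it possible to ask whether two identifiers name the same object at different revisions, which is one of the simpler identity questions raised later in the talk.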

RDF provenance graph urn:lsid:tav:brpt1 urn:lsid:tav:seqcollection1 urn:lsid:tav:seq1 my:derivedFrom my:hasElement A graph, with URIs for resources as the nodes, and their provenance relationships as the edges my:derivedFrom my = tav = taverna.sourceforge.net
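
A graph like this can be sketched as a plain set of (subject, predicate, object) triples. The example below is a minimal stand-in written without any RDF library. The edge directions are an assumption, since the flattened slide does not show them: here the sequence collection is taken to be derived from the BLAST report and to have the sequence as an element. The helper names `derived_from` and `elements` are hypothetical.

```python
# Triples (subject, predicate, object); edge directions are assumed.
PROVENANCE = {
    ("urn:lsid:tav:seqcollection1", "my:derivedFrom", "urn:lsid:tav:brpt1"),
    ("urn:lsid:tav:seqcollection1", "my:hasElement", "urn:lsid:tav:seq1"),
}


def derived_from(node, triples):
    """Return every resource reachable from node via my:derivedFrom edges."""
    found = set()
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if (subject == current and predicate == "my:derivedFrom"
                    and obj not in found):
                found.add(obj)
                frontier.append(obj)
    return found


def elements(collection, triples):
    """Return the direct members of a collection via my:hasElement edges."""
    return {obj for subject, predicate, obj in triples
            if subject == collection and predicate == "my:hasElement"}


sources = derived_from("urn:lsid:tav:seqcollection1", PROVENANCE)
members = elements("urn:lsid:tav:seqcollection1", PROVENANCE)
print(sources, members)
```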

Having a BLAST in every workflow! Seq GenBank Report database score BLAST BLAST_simplifer GenBank_retrieve BlastReport A list of Sequences

Alignment of sequence AC005089

Data products in each run Computed data product –BLAST report –GenBank report Collected data product –Sequence contained within the content of a BLAST report –Sequence extracted by the simplifier service Collection and Atomic BLAST Report Sequence contains 1..m SEQ aListOf 1..m 1 1 Computed data Collected data

Computed Collections and Collected data items BLAST Report Sequence 1 Sequence 2 Sequence 3 BLAST Report Sequence 1 Sequence 2 Sequence 3 BLAST Report Sequence 1 Sequence 2 Sequence 4 SEQ listOf BLAST simplifer SEQ listOf BLAST simplifer SEQ listOf BLAST simplifer

BLAST Report Sequence 1 Sequence 2 Sequence 3 BLAST Report Sequence 1 Sequence 2 Sequence 4 SEQ listOf BLAST simplifer SEQ listOf BLAST simplifer Equivalent data Corresponding data Data Co-references Context of the workflow

Run2 Run1 Aggregation of repeated run AC BLASTReport urn:lsid:tav:ic531 urn:lsid:tav:ic537 urn:lsid:tav:ic538 urn:lsid:tav:57b6 urn:lsid:tav:57b13 urn:lsid:tav:57b14 refersTo derivedFrom DNASeq derivedFrom refersTo rdf:type

External Duplicates gi: ac urn:lsid:myg:ac mmu:11423 Different providers A replica Different tool providers Sequence

LSID Assignment Process Workflow enactor Provenance service Data service External domain service Data storage group wfEvents Taverna LSID Authority MySQL relational stores KAVE BAKLAVA Customized DB Customized DB Jena/Sesame RDF store Equivalent data in repeated runs Duplicate ids for these data

Provenance from two repeated runs my:derivedFrom my:hasElement my:derivedFrom my:hasElement Run1 Run2 No convergence urn:lsid:tav:brpt1 urn:lsid:tav:brpt2 urn:lsid:tav:seqcollection1 urn:lsid:tav:seqcollection2 urn:lsid:tav:seq1 urn:lsid:tav:seq2 my:derivedFrom

Duplicated identities in these two runs Run2 Run1 b1 c1 s1 b2 c2 s2 SEQ BLAST Report Sequence

urn:lsid:tav:brpt1 urn:lsid:tav:brpt2 urn:gb:seq1 Sequence 1 Execution duplicates BLAST BLAST_simplifer BlastReport A list of Seq GenBank_retrieve But hidden!! urn:gb:seq1 BLAST report

BLAST BLAST_simplifer BlastReport A list of Seq GenBank_retrieve SEQ1 Sequence 1 Sequence 2 Sequence 3 listOf urn:tav:seqc1 urn:tav:seq1 urn:gb:seq1 SEQ1 listOf urn:tav:seqc2 urn:tav:seq2 Sequence 1 Sequence 2 Sequence 3 Execution duplicates

SEQ aListOf 1..m 1 GBRPT GBrpt Seq aListOf 1 1..m Species isOf 1 1 collection data computed by iterations, e.g. a list of GenBank reports from GenBank_retrieve nested collected data products, e.g. the species data object in the sequence data Execution duplicates 3

Migration duplicates C:/MyDocuments/ WBS/Run2/ Seq 1

Managing identity co-reference Identity co-reference: –Identifying duplicate identities that refer to the same object while keeping their context An approach: –An IDSet entity Identity equivalence for collected data Identity correspondence for computed data –An identity service –Identity normalisation and cleansing activity

IDSet entity IDSet(BLASTrpt) = {{urn:tav:brpt1}, {urn:tav:brpt3}} urn:gb:seq1 Sequence Query by its identity Query by its content IDSet 1 merge IDSet created by another organization IDSet 3 urn:lsid:tav:brpt1 BLASTreport
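
One way to picture the IDSet entity is as a set of identities believed to denote the same data object, reachable both by identity and by content, and mergeable with an IDSet built elsewhere. The sketch below is an illustration under several assumptions: it flattens the nested sets shown on the slide into a single set, it uses a SHA-256 digest of the content as the content key, and the names `IDSet`, `IDSetIndex` and `digest` are invented for this example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class IDSet:
    """Identities believed to denote the same data object (simplified)."""
    ids: set = field(default_factory=set)

    def merge(self, other: "IDSet") -> "IDSet":
        """Combine with an IDSet created elsewhere, e.g. by another organization."""
        return IDSet(self.ids | other.ids)


def digest(content: str) -> str:
    """Content key; hashing the text is an assumption of this sketch."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class IDSetIndex:
    """Supports the two queries on the slide: by identity and by content."""

    def __init__(self):
        self.by_identity = {}
        self.by_content = {}

    def put(self, idset: IDSet, content: str) -> None:
        for identity in idset.ids:
            self.by_identity[identity] = idset
        self.by_content[digest(content)] = idset


local = IDSet({"urn:lsid:tav:brpt1"})
remote = IDSet({"urn:lsid:tav:brpt3"})  # stands in for an external IDSet
merged = local.merge(remote)

index = IDSetIndex()
index.put(merged, "BLAST report text (placeholder)")
print(sorted(index.by_identity["urn:lsid:tav:brpt3"].ids))
```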

Extended Architecture Workflow enactor Provenance service Data service External domain service Data storage group wfEvents Taverna LSID Authority MySQL relational stores BAKLAVA Customized DB Customized DB Identity service KAVE + Jena/Sesame RDF store MySQL relational store Identity store KAVE

Identifying collected product Identity service urn:gb:seq1 Identity store Receive an identity Look for or create Its IDSet KAVE urn:gb:seq1 Store the id and the IDSet 1 urn:gb:seq1
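
The steps on this slide (receive an identity, look for or create its IDSet, store both) can be sketched as a small get-or-create routine. This is only a toy model: an in-memory dictionary stands in for the identity store, IDSets are numbered from 1 as on the slide, and the class and method names are hypothetical. The identifier urn:gb:seq2 is made up for the example.

```python
class IdentityService:
    """Toy identity service for collected data products."""

    def __init__(self):
        self._idset_of = {}  # identity -> IDSet number (the identity store)
        self._members = {}   # IDSet number -> identities in that IDSet

    def identify(self, identity: str) -> int:
        """Look for the IDSet of an identity, creating one if none exists."""
        if identity not in self._idset_of:
            number = len(self._members) + 1
            self._members[number] = {identity}
            self._idset_of[identity] = number
        return self._idset_of[identity]


service = IdentityService()
# The same GenBank sequence is collected in two workflow runs.
run1 = service.identify("urn:gb:seq1")
run2 = service.identify("urn:gb:seq1")
other = service.identify("urn:gb:seq2")  # hypothetical second sequence
print(run1, run2, other)
```

Because a collected product already carries an identity from its provider, equivalence here reduces to an exact match on that identity.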

Identifying a collection product Identity service Identity store Receive an identity Look for or create Its IDSet KAVE Store the id and the IDSet urn:lsid:seqc1 Seq 1 Seq 2 Seq 3 SEQ2 listOf urn:lsid:seqc2 Look for equivalent collection urn:lsid:seqc1 urn:lsid:seqc2
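
For a collection, "look for equivalent collection" could mean finding an earlier collection whose members resolve to the same IDSets in the same order. The sketch below takes that reading, which is an assumption rather than the documented behaviour of the identity service. The member identifiers urn:gb:seq2 to urn:gb:seq4 and all class and method names are hypothetical.

```python
from itertools import count


class IdentityService:
    """Toy identity service that also recognises equivalent collections."""

    def __init__(self):
        self._numbers = count(1)
        self._member_to_idset = {}  # member identity -> IDSet number
        self._key_to_idset = {}     # tuple of member IDSets -> IDSet number
        self.idsets = {}            # IDSet number -> set of identities

    def _idset_for_member(self, identity: str) -> int:
        if identity not in self._member_to_idset:
            number = next(self._numbers)
            self._member_to_idset[identity] = number
            self.idsets[number] = {identity}
        return self._member_to_idset[identity]

    def identify_collection(self, collection_id: str, member_ids) -> int:
        """Return the IDSet of a collection, reusing that of an equivalent one."""
        # A collection is keyed by the ordered IDSets of its members.
        key = tuple(self._idset_for_member(m) for m in member_ids)
        if key not in self._key_to_idset:
            number = next(self._numbers)
            self._key_to_idset[key] = number
            self.idsets[number] = set()
        number = self._key_to_idset[key]
        self.idsets[number].add(collection_id)
        return number


service = IdentityService()
members = ["urn:gb:seq1", "urn:gb:seq2", "urn:gb:seq3"]
first = service.identify_collection("urn:lsid:seqc1", members)
second = service.identify_collection("urn:lsid:seqc2", members)
third = service.identify_collection(
    "urn:lsid:seqc3", ["urn:gb:seq1", "urn:gb:seq2", "urn:gb:seq4"])
print(first, second, third)
```

Keying on a tuple treats the collection as an ordered list, in line with the listOf label; a frozenset would be the alternative if order should not matter.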

Putting the Identity Service to Use Provenance Integration Provenance Aggregation Identity Management Provenance Normalization Run2 Run1 b1 c1 s1 b2 c2 s2
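
Provenance normalization can be pictured as rewriting every identity in a graph to one representative of its IDSet, so that the graphs of repeated runs converge. The sketch below uses the run identifiers from the earlier slides and picks the lexicographically smallest member of each IDSet as the representative. That choice, the assumed edge directions, and the function name `normalise` are illustrative only.

```python
# Provenance of two repeated runs as (subject, predicate, object) triples.
RUN1 = {
    ("urn:lsid:tav:seqcollection1", "my:derivedFrom", "urn:lsid:tav:brpt1"),
    ("urn:lsid:tav:seqcollection1", "my:hasElement", "urn:lsid:tav:seq1"),
}
RUN2 = {
    ("urn:lsid:tav:seqcollection2", "my:derivedFrom", "urn:lsid:tav:brpt2"),
    ("urn:lsid:tav:seqcollection2", "my:hasElement", "urn:lsid:tav:seq2"),
}

# IDSets as an identity service might report them (assumed for this example).
IDSETS = [
    {"urn:lsid:tav:brpt1", "urn:lsid:tav:brpt2"},
    {"urn:lsid:tav:seqcollection1", "urn:lsid:tav:seqcollection2"},
    {"urn:lsid:tav:seq1", "urn:lsid:tav:seq2"},
]

# Map every identity to one representative of its IDSet.
CANONICAL = {identity: min(idset) for idset in IDSETS for identity in idset}


def normalise(triples, canonical):
    """Rewrite subjects and objects to their canonical identities."""
    return {(canonical.get(s, s), p, canonical.get(o, o))
            for s, p, o in triples}


aggregated = RUN1 | RUN2
normalised = normalise(aggregated, CANONICAL)
print(len(aggregated), len(normalised))
```

Before normalization the aggregated graph holds four triples with no shared nodes; afterwards the two runs collapse onto the same two triples. A real service would presumably record which run each triple came from, since the approach described here aims to keep that context.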

Discussion Scalability issues: –Normalizing provenance graphs –Building IDSet for collections with multiple hierarchies Open world data type-free context Use experimental context more effectively – workflows are not independently executed. Granularity of identity Identity aware operations in workflow Multiple naming schemes Migration duplicates Compacting data results

Conclusion Combining provenance depends on finding points of commonality, such as data identity. Duplicate identities will occur in an open world. It is hard to achieve uniqueness without community commitment. There are different types of equivalent objects. How much can be avoided? And how much has to be repaired?