Published by Brooke Kelly. Modified over 9 years ago.
1
An Identity Crisis in the Life Sciences
Jun Zhao, Carole Goble and Robert Stevens
The University of Manchester, UK
Thanks to: Tom Oinn, Matthew Pocock, Daniele Turi, our users, and the EPSRC
2
UK e-Science project
Middleware for in silico experiments by individual life scientists, stuck in under-resourced labs, who use other people’s applications.
[Figure: excerpt of a raw nucleotide sequence listing, rows 12181 to 12481.]
5
Bioinformatics workflows
–Taverna workflow workbench
–Data pipelines: collect data (e.g. a collected metabolic pathway), compute data (e.g. a computed BLAST report)
–Frequently updated public resources
–Open world: the same data product can turn up in different experiment contexts
–Bioinformatician users
6
Workflow outcomes
A record of outcome data and its provenance. Store each data outcome under a unique id and link the outcomes together in a typed graph. In fact, store all provenance as a graph.
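The idea of storing outcomes under unique ids and linking them in a typed graph can be sketched as follows. This is a minimal illustration only, not the project's actual store: the class name `ProvenanceGraph`, the `urn:example:data:` id prefix and the counter-based id allocation are all assumptions made for the example; the edge labels are borrowed from the graph slide.

```python
from collections import defaultdict
from itertools import count


class ProvenanceGraph:
    """Toy typed graph: nodes are ids, edges are (subject, predicate, object)."""

    def __init__(self, prefix="urn:example:data:"):
        self._prefix = prefix        # hypothetical id scheme, not a real LSID
        self._counter = count(1)     # hypothetical allocation strategy
        self.edges = set()
        self._by_subject = defaultdict(set)

    def new_id(self):
        """Allocate a fresh, unique id for a data outcome."""
        return f"{self._prefix}{next(self._counter)}"

    def link(self, subject, predicate, obj):
        """Add one typed edge; the predicate is the edge type."""
        self.edges.add((subject, predicate, obj))
        self._by_subject[subject].add((predicate, obj))

    def objects(self, subject, predicate):
        """Return the sorted targets of all edges of one type from a node."""
        return sorted(o for p, o in self._by_subject[subject] if p == predicate)


graph = ProvenanceGraph()
sequence = graph.new_id()
report = graph.new_id()
graph.link(report, "instanceOf", "Blast_report")
graph.link(report, "directlyDerivedFrom", sequence)
print(graph.objects(report, "directlyDerivedFrom"))
```

Because every outcome and every relationship is just a node or a typed edge, data, services and concepts can all live in the same structure.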
7
[Figure: typed provenance graph. Legend: Data generated by services/workflows, Services, Concepts, Properties, literals. Data nodes include urn:data1, urn:data2, urn:data12, urn:data:3, urn:data:f1, urn:data:f2 and the hits urn:hit1…, urn:hit2…, urn:hit5…, urn:hit8…, urn:hit10….., urn:hit50…... Service invocations include urn:BlastNInvocation3, urn:compareinvocation3 and urn:invocation5. Concepts include Blast_report, SwissProt_seq, Sequence_hit, New sequence, Missed sequence, DatumCollection and LSDatum. Edge types: [input], [output], [instanceOf], [type], [hasHits], [contains], [hasName], [performsTask] (Find similar sequence), [similar_sequence_to], [directlyDerivedFrom], [distantlyDerivedFrom].]
8
Fusion between different data models using shared concepts and shared data
[Figure: two provenance graphs joined at shared nodes. The first uses the properties outputOf, createdFrom, inputOf, instanceOf, runOf, createdBy and contains_similiar_seq_to over nodes such as urn:genbank1…, urn:genbank2…, urn:genbank50…, urn:BlastNInvocation3, urn:data:3, urn:data2, urn:run5, urn:run7, urn:williamsA and urn:williamsB, with the labels Blast_report, DNA_sequence, Blast_service, GenBank, UniProt and LSID. The second is the typed provenance graph of the previous slide.]
Add assertions, add rules, reason over assertions
9
Putting Provenance to Use
Single workflow
–audit trail
–recipe
Multiple workflow runs (versions)
–Aggregation - gathering
–Integration - merging
–Comparison - differencing
10
Any idea? 30350027 gi:30350027 Life Science Identifier A ruddy great lump of RDF
11
URIs for Data
urn:lsid:mygrid.ac.uk:data:49841:1
Life Science Identifier (LSID)
–Protocol for allocation and resolution
–Adopted by a range of data providers
–LSIDs from the data providers' databases, for the data we collect during workflow execution
–LSIDs for the data products we compute during workflow execution
[Figure: excerpt of a raw nucleotide sequence listing, rows 12181 to 12481.]
http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02
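As a rough illustration of what the example identifier carries, the sketch below splits an LSID into its parts, following the general LSID layout of `urn:lsid:authority:namespace:object` with an optional trailing revision. The names `LSID` and `parse_lsid` are made up for this example, and the parser does not attempt the full validation rules of the specification.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str           # who issued the id, e.g. mygrid.ac.uk
    namespace: str           # issuer-defined grouping, e.g. data
    object_id: str           # the object within that namespace
    revision: Optional[str]  # optional version of the object


def parse_lsid(text):
    """Split 'urn:lsid:authority:namespace:object[:revision]' into fields."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not a full LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"missing urn:lsid prefix: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty component in: {text!r}")
    revision = parts[5] or None if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


example = parse_lsid("urn:lsid:mygrid.ac.uk:data:49841:1")
print(example.authority, example.namespace, example.object_id, example.revision)
```

The shortened ids used on later slides, such as urn:lsid:tav:brpt1, are slide shorthand and would be rejected by this parser because they lack a separate namespace and object part.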
12
RDF provenance graph
A graph, with URIs for resources as the nodes, and their provenance relationships as the edges
Nodes: urn:lsid:tav:brpt1, urn:lsid:tav:seqcollection1, urn:lsid:tav:seq1
Edges: my:derivedFrom, my:hasElement
my = http://www.mygrid.org.uk/provenance#
tav = taverna.sourceforge.net
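A small sketch of how such a graph might be written down as triples and serialised in N-Triples style. The node and property names come from the slide, but which node each edge connects is an assumption here (the slide text does not preserve the arrows), and `expand` and `to_ntriples` are hypothetical helper names.

```python
MY = "http://www.mygrid.org.uk/provenance#"  # the 'my' prefix from the slide


def expand(term):
    """Expand the 'my:' prefix to its full URI; leave other terms untouched."""
    if term.startswith("my:"):
        return MY + term[len("my:"):]
    return term


def to_ntriples(triples):
    """Render (subject, predicate, object) URI triples as N-Triples lines."""
    return [f"<{expand(s)}> <{expand(p)}> <{expand(o)}> ." for s, p, o in triples]


# Assumed edge directions: the collection is derived from the BLAST report
# and has the sequence as an element.
triples = [
    ("urn:lsid:tav:seqcollection1", "my:derivedFrom", "urn:lsid:tav:brpt1"),
    ("urn:lsid:tav:seqcollection1", "my:hasElement", "urn:lsid:tav:seq1"),
]

for line in to_ntriples(triples):
    print(line)
```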
13
Having a BLAST in every workflow!
[Figure: workflow diagram. Labels: Seq, database, score, BLAST, BlastReport, BLAST_simplifer, A list of Sequences, GenBank_retrieve, GenBank Report.]
14
Alignment of sequence AC005089
15
Data products in each run
Computed data product
–BLAST report
–GenBank report
Collected data product
–Sequence contained within the content of a BLAST report
–Sequence extracted by the simplifier service
Collection and Atomic
[Figure: a BLAST Report (computed data) contains 1..m Sequence (collected data); a SEQ collection is aListOf 1..m Sequence.]
16
Computed Collections and Collected data items
[Figure: three runs, each with a BLAST Report and a SEQ list (listOf) produced by the BLAST simplifier. Two reports list Sequence 1, Sequence 2 and Sequence 3; the third lists Sequence 1, Sequence 2 and Sequence 4.]
17
BLAST Report Sequence 1 Sequence 2 Sequence 3 BLAST Report Sequence 1 Sequence 2 Sequence 4 SEQ listOf BLAST simplifer SEQ listOf BLAST simplifer Equivalent data Corresponding data Data Co-references Context of the workflow
18
Run2 Run1 Aggregation of repeated run AC005089 BLASTReport urn:lsid:tav:ic531 urn:lsid:tav:ic537 urn:lsid:tav:ic538 urn:lsid:tav:57b6 urn:lsid:tav:57b13 urn:lsid:tav:57b14 refersTo derivedFrom DNASeq derivedFrom refersTo rdf:type
20
External Duplicates gi:15145617 ac073846 urn:lsid:myg:ac073846 mmu:11423 Different providers A replica Different tool providers Sequence
21
LSID Assignment Process
[Figure: architecture diagram. Labels: Workflow enactor, Taverna, wfEvents, LSID Authority, Provenance service, Data service, External domain service, Data storage group, KAVE, BAKLAVA, MySQL relational stores, Jena/Sesame RDF store, Customized DB (twice).]
Equivalent data in repeated runs: duplicate ids for these data
22
Provenance from two repeated runs
Run1: urn:lsid:tav:brpt1, urn:lsid:tav:seqcollection1, urn:lsid:tav:seq1
Run2: urn:lsid:tav:brpt2, urn:lsid:tav:seqcollection2, urn:lsid:tav:seq2
Edges in each run: my:derivedFrom, my:hasElement
No convergence: every node has a different id in each run, so the two graphs do not meet.
23
Duplicated identities in these two runs Run2 Run1 b1 c1 s1 b2 c2 s2 SEQ BLAST Report Sequence
24
urn:lsid:tav:brpt1 urn:lsid:tav:brpt2 urn:gb:seq1 Sequence 1 Execution duplicates BLASTBLAST_simplifer BlastReport A list of Seq GenBank_retrieve But hidden!! urn:gb:seq1 BLAST report
25
BLASTBLAST_simplifer BlastReport A list of Seq GenBank_retrieve SEQ1 Sequence 1 Sequence 2 Sequence 3 listOf urn:tav:seqc1 urn:tav:seq1 urn:gb:seq1 SEQ1 listOf urn:tav:seqc2 urn:tav:seq2 Sequence 1 Sequence 2 Sequence 3 Execution duplicates
26
SEQ aListOf 1..m 1 GBRPT GBrpt Seq aListOf 1 1..m Species isOf 1 1 collection data computed by iterations, e.g. a list of GenBank reports from GenBank_retrieve nested collected data products, e.g. the species data object in the sequence data Execution duplicates 3
27
Migration duplicates C:/MyDocuments/ WBS/Run2/ Seq 1
28
Managing identity co-reference
Identity co-reference:
–Identifying duplicate identities that refer to the same object while keeping their context
An approach:
–An IDSet entity: identity equivalence for collected data, identity correspondence for computed data
–An identity service
–Identity normalisation and cleansing activity
29
IDSet entity IDSet(BLASTrpt) = {{urn:tav:brpt1}, {urn:tav:brpt3}} urn:gb:seq1 Sequence Query by its identity Query by its content IDSet 1 merge IDSet created by another organization IDSet 3 urn:lsid:tav:brpt1 BLASTreport
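The merge step on this slide can be sketched as a union of identifier sets that overlap. This is only an illustration under assumptions: `merge_idsets` is a hypothetical helper, an IDSet is modelled as a plain Python set of identifier strings, and `urn:example:other:brpt9` is an invented id standing in for one issued by another organization.

```python
def merge_idsets(idsets):
    """Merge every pair of IDSets that shares at least one identifier."""
    merged = []
    for ids in idsets:
        ids = set(ids)
        # Absorb each existing group that overlaps the incoming set.
        overlapping = [group for group in merged if group & ids]
        for group in overlapping:
            ids |= group
            merged.remove(group)
        merged.append(ids)
    return merged


local = {"urn:tav:brpt1", "urn:tav:brpt3"}
other_org = {"urn:tav:brpt3", "urn:example:other:brpt9"}  # invented foreign id
unrelated = {"urn:gb:seq1"}

for idset in merge_idsets([local, unrelated, other_org]):
    print(sorted(idset))
```

Keeping the original identifiers inside the merged set, rather than replacing them, is what lets each id retain the context in which it was created.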
30
Extended Architecture
[Figure: the LSID assignment architecture extended with an identity service. Labels: Workflow enactor, Taverna, wfEvents, LSID Authority, Provenance service, Data service, External domain service, Identity service, Data storage group, KAVE, KAVE+, BAKLAVA, Identity store, MySQL relational stores, MySQL relational store, Jena/Sesame RDF store, Customized DB (twice).]
31
Identifying collected product
Example identity: urn:gb:seq1
Steps: (1) receive an identity; (2) look for or create its IDSet; (3) store the id and the IDSet.
[Figure labels: Identity service, Identity store, KAVE+.]
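The receive, look-for-or-create, and store steps might look roughly like this in code. It is a sketch under assumptions, not the actual identity service: the class and method names are invented, the identity store is a plain in-memory dictionary, and `urn:example:replica:seq1` is a made-up second name for the same sequence.

```python
class IdentityService:
    """Toy identity service backed by an in-memory 'identity store'."""

    def __init__(self):
        self._store = {}  # identity -> the IDSet (a shared set) it belongs to

    def look_for_or_create(self, identity):
        """Return the IDSet of a received identity, creating it if unknown."""
        idset = self._store.get(identity)
        if idset is None:
            idset = {identity}
            self._store[identity] = idset  # store the id with its IDSet
        return idset

    def declare_equivalent(self, known, new):
        """Record that two identities name the same collected object."""
        target = self.look_for_or_create(known)
        other = self.look_for_or_create(new)
        if other is not target:
            target |= other
            for member in other:
                self._store[member] = target
        return target


service = IdentityService()
first = service.look_for_or_create("urn:gb:seq1")
second = service.look_for_or_create("urn:gb:seq1")  # same id seen in a later run
service.declare_equivalent("urn:gb:seq1", "urn:example:replica:seq1")
print(first is second, sorted(first))
```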
32
Identifying a collection product
Example: urn:lsid:seqc2, a SEQ2 list (listOf) of Seq 1, Seq 2 and Seq 3; the equivalent collection already stored is urn:lsid:seqc1.
Steps: (1) receive an identity; (2) look for an equivalent collection, then look for or create its IDSet; (3) store the id and the IDSet.
[Figure labels: Identity service, Identity store, KAVE+.]
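One way to look for an equivalent collection is to key each collection by the identities of its members. The sketch below assumes that two collections are equivalent when they list the same member ids in the same order; that rule, the class name `CollectionIdentityStore`, and the member ids beyond urn:gb:seq1 are all assumptions for illustration.

```python
class CollectionIdentityStore:
    """Toy store that groups collection ids by their member identities."""

    def __init__(self):
        self._by_members = {}  # tuple of member ids -> set of collection ids

    def register(self, collection_id, members):
        """Add a collection and return the ids of all equivalent collections."""
        key = tuple(members)  # assumption: member order matters for a list
        idset = self._by_members.setdefault(key, set())
        idset.add(collection_id)
        return sorted(idset)


store = CollectionIdentityStore()
members = ["urn:gb:seq1", "urn:gb:seq2", "urn:gb:seq3"]  # seq2, seq3 invented

print(store.register("urn:lsid:seqc1", members))
print(store.register("urn:lsid:seqc2", members))
print(store.register("urn:lsid:seqc3", ["urn:gb:seq1", "urn:gb:seq2", "urn:gb:seq4"]))
```

Comparing by member identity only works once the members themselves have been resolved to agreed ids, which is why collected items are identified before the collections that contain them.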
33
Putting the Identity Service to Use Provenance Integration Provenance Aggregation Identity Management Provenance Normalization Run2 Run1 b1 c1 s1 b2 c2 s2
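Provenance normalization can be pictured as rewriting every id in a graph to one representative of its IDSet, so that the graphs of repeated runs converge. The sketch below is illustrative only: `normalise` is a hypothetical helper, the representative is simply the alphabetically smallest id, and the edge directions between b, c and s are assumed.

```python
def normalise(triples, idsets):
    """Rewrite subjects and objects to one representative id per IDSet."""
    canonical = {}
    for ids in idsets:
        representative = min(ids)  # assumption: any stable choice would do
        for identity in ids:
            canonical[identity] = representative
    return {(canonical.get(s, s), p, canonical.get(o, o)) for s, p, o in triples}


# b = BLAST report, c = SEQ collection, s = sequence, as on the slide.
run1 = {("c1", "derivedFrom", "b1"), ("c1", "hasElement", "s1")}
run2 = {("c2", "derivedFrom", "b2"), ("c2", "hasElement", "s2")}
idsets = [{"b1", "b2"}, {"c1", "c2"}, {"s1", "s2"}]

aggregated = run1 | run2
normalised = normalise(aggregated, idsets)
print(len(aggregated), len(normalised))
```

Before normalization the aggregated graph holds four edges over six unrelated nodes; afterwards the two runs collapse onto the same two edges.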
34
Discussion
Scalability issues:
–Normalizing provenance graphs
–Building IDSet for collections with multiple hierarchies
Open world data type-free context
Use experimental context more effectively: workflows are not independently executed.
Granularity of identity
Identity aware operations in workflow
Multiple naming schemes
Migration duplicates
Compacting data results
35
Conclusion
Combining provenance depends on finding points of commonality, such as data identity.
Duplicate identities will occur in an open world.
Uniqueness is hard to achieve without community commitment.
There are different types of equivalent objects.
How much can be avoided, and how much has to be repaired?