An Identity Crisis in the Life Sciences Jun Zhao, Carole Goble and Robert Stevens The University of Manchester, UK Thanks to: Tom Oinn, Matthew Pocock,


1 An Identity Crisis in the Life Sciences Jun Zhao, Carole Goble and Robert Stevens The University of Manchester, UK Thanks to: Tom Oinn, Matthew Pocock, Daniele Turi And our users And the EPSRC

2 UK e-Science project Middleware for in silico experiments by individual life scientists, stuck in under-resourced labs, who use other people’s applications. 12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt 12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt


5 Bioinformatics workflows Taverna workflow workbench. Data pipelines: collect data (e.g. a collected metabolic pathway) and compute data (e.g. a computed BLAST report). Frequently updated public resources. Open world: the same data product can be obtained in different experiment contexts. Bioinformatician users.

6 Workflow outcomes A record of outcome data and its provenance. Store data outcomes with a unique id and link them together in a typed graph. In fact, store all provenance as a graph.

7 [Figure: example provenance graph. Data generated by services/workflows: urn:data1, urn:data2, urn:data12, urn:data:3, urn:data:f1, urn:data:f2, urn:hit1…, urn:hit2…, urn:hit5…, urn:hit8…, urn:hit10…, urn:hit50…. Services: urn:BlastNInvocation3, urn:compareinvocation3, urn:invocation5. Concepts: Blast_report, SwissProt_seq, Sequence_hit, DatumCollection, LSDatum, Find similar sequence. Properties: [input], [output], [directlyDerivedFrom], [distantlyDerivedFrom], [hasHits], [contains], [similar_sequence_to], [instanceOf], [type], [performsTask], [hasName]. Literals: New sequence, Missed sequence.]

8 Fusion between different data models using shared concepts and shared data. Add assertions, add rules, reason over assertions. [Figure: the provenance graph from the previous slide joined to a second graph through shared nodes such as urn:data2, urn:data:3 and urn:BlastNInvocation3. Second graph nodes: urn:genbank1…, urn:genbank2…, urn:genbank50…, urn:run5, urn:run7, urn:williamsA, urn:williamsB. Concepts: Blast_report, DNA_sequence, Blast_service, GenBank, UniProt, LSID. Properties: outputOf, createdFrom, contains_similiar_seq_to, inputOf, instanceOf, runOf, createdBy.]

9 Putting Provenance to Use Single workflow –audit trail –recipe Multiple workflow runs (versions) –Aggregation - gathering –Integration - merging –Comparison - differencing

10 Any idea? 30350027 gi:30350027 Life Science Identifier A ruddy great lump of RDF

11 URIs for Data urn:lsid:mygrid.ac.uk:data:49841:1 Life Science Identifier (LSID): a protocol for allocation and resolution, adopted by a range of data providers. LSIDs from the data providers' databases for the data we collect during workflow execution. LSIDs for the data products we compute during workflow execution. [Figure: sample DNA sequence listing, as on slide 2] http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02
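
To make the identifier on this slide concrete, here is a minimal sketch that splits an LSID into its parts, assuming the usual layout urn:lsid:authority:namespace:object with an optional trailing revision. The names `LSID` and `parse_lsid` are hypothetical and are not taken from Taverna or myGrid.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """Components of a Life Science Identifier (hypothetical helper type)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID URN into authority, namespace, object id and revision."""
    parts = text.split(":")
    # Expected shape: urn:lsid:<authority>:<namespace>:<object>[:<revision>]
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty LSID component in {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


example = parse_lsid("urn:lsid:mygrid.ac.uk:data:49841:1")
```

The abbreviated identifiers used on later slides, such as urn:lsid:tav:brpt1, have only four colon-separated parts, so this sketch rejects them; they read as shorthand for full LSIDs.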

12 RDF provenance graph urn:lsid:tav:brpt1 urn:lsid:tav:seqcollection1 urn:lsid:tav:seq1 my:derivedFrom my:hasElement my:derivedFrom A graph, with URIs for resources as the nodes, and their provenance relationships as the edges. my = http://www.mygrid.org.uk/provenance# tav = taverna.sourceforge.net
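
A graph like this can be sketched as a set of (subject, predicate, object) triples in plain Python. The slide text does not preserve which node each edge leaves from, so the edge directions below are an assumption, and `objects` is a hypothetical helper rather than part of any RDF library.

```python
# Namespace from the slide; the triples below are an assumed reading of the figure.
MY = "http://www.mygrid.org.uk/provenance#"

BRPT1 = "urn:lsid:tav:brpt1"           # BLAST report
SEQC1 = "urn:lsid:tav:seqcollection1"  # collection of sequences
SEQ1 = "urn:lsid:tav:seq1"             # one sequence in the collection

# Each triple is (subject, predicate, object); nodes are URIs, edges are properties.
triples = {
    (SEQC1, MY + "derivedFrom", BRPT1),  # assumed: the collection is derived from the report
    (SEQC1, MY + "hasElement", SEQ1),    # assumed: the collection contains the sequence
    (SEQ1, MY + "derivedFrom", BRPT1),   # assumed: the sequence is derived from the report
}


def objects(graph, subject, predicate):
    """Return every object reachable from subject over one predicate edge."""
    return {o for s, p, o in graph if s == subject and p == predicate}


elements = objects(triples, SEQC1, MY + "hasElement")
```

A real deployment would hold these triples in an RDF store such as Jena or Sesame, as the architecture slides indicate; the set of tuples only illustrates the node-and-edge shape.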

13 Having a BLAST in every workflow! Seq GenBank Report database score BLAST BLAST_simplifer GenBank_retrieve BlastReport A list of Sequences

14 Alignment of sequence AC005089

15 Data products in each run Computed data product –BLAST report –GenBank report Collected data product –Sequence contained within the content of a BLAST report –Sequence extracted by the simplifier service Collection and Atomic BLAST Report Sequence contains 1..m SEQ aListOf 1..m 1 1 Computed data Collected data

16 Computed Collections and Collected data items BLAST Report Sequence 1 Sequence 2 Sequence 3 BLAST Report Sequence 1 Sequence 2 Sequence 3 BLAST Report Sequence 1 Sequence 2 Sequence 4 SEQ listOf BLAST simplifer SEQ listOf BLAST simplifer SEQ listOf BLAST simplifer

17 BLAST Report Sequence 1 Sequence 2 Sequence 3 BLAST Report Sequence 1 Sequence 2 Sequence 4 SEQ listOf BLAST simplifer SEQ listOf BLAST simplifer Equivalent data Corresponding data Data Co-references Context of the workflow

18 Run2 Run1 Aggregation of repeated run AC005089 BLASTReport urn:lsid:tav:ic531 urn:lsid:tav:ic537 urn:lsid:tav:ic538 urn:lsid:tav:57b6 urn:lsid:tav:57b13 urn:lsid:tav:57b14 refersTo derivedFrom DNASeq derivedFrom refersTo rdf:type


20 External Duplicates gi:15145617 ac073846 urn:lsid:myg:ac073846 mmu:11423 Different providers A replica Different tool providers Sequence

21 LSID Assignment Process Workflow enactor Provenance service Data service External domain service Data storage group wfEvents Taverna LSID Authority MySQL relational stores KAVE BAKLAVA Customized DB Customized DB Jena/Sesame RDF store Equivalent data in repeated runs Duplicate ids for these data

22 Provenance from two repeated runs my:derivedFrom my:hasElement my:derivedFrom my:hasElement Run1 Run2 No convergence urn:lsid:tav:brpt1 urn:lsid:tav:brpt2 urn:lsid:tav:seqcollection1 urn:lsid:tav:seqcollection2 urn:lsid:tav:seq1 urn:lsid:tav:seq2 my:derivedFrom
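
The "no convergence" point can be illustrated with a small sketch: when every run mints fresh identifiers, the two provenance graphs share no nodes, so their union is just two disconnected pieces. The edge directions and the helpers `run_graph` and `nodes` are assumptions made for illustration.

```python
MY = "http://www.mygrid.org.uk/provenance#"


def run_graph(n):
    """Build the provenance triples of run n, with identifiers minted per run."""
    brpt = f"urn:lsid:tav:brpt{n}"
    seqc = f"urn:lsid:tav:seqcollection{n}"
    seq = f"urn:lsid:tav:seq{n}"
    return {
        (seqc, MY + "derivedFrom", brpt),  # assumed direction
        (seqc, MY + "hasElement", seq),    # assumed direction
        (seq, MY + "derivedFrom", brpt),   # assumed direction
    }


def nodes(graph):
    """Collect every subject and object that appears in the graph."""
    return {s for s, _, _ in graph} | {o for _, _, o in graph}


run1, run2 = run_graph(1), run_graph(2)
shared = nodes(run1) & nodes(run2)  # empty: the runs have nothing in common
combined = run1 | run2              # union keeps both runs as separate islands
```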

23 Duplicated identities in these two runs Run2 Run1 b1 c1 s1 b2 c2 s2 SEQ BLAST Report Sequence

24 urn:lsid:tav:brpt1 urn:lsid:tav:brpt2 urn:gb:seq1 Sequence 1 Execution duplicates BLAST BLAST_simplifer BlastReport A list of Seq GenBank_retrieve But hidden!! urn:gb:seq1 BLAST report

25 BLAST BLAST_simplifer BlastReport A list of Seq GenBank_retrieve SEQ1 Sequence 1 Sequence 2 Sequence 3 listOf urn:tav:seqc1 urn:tav:seq1 urn:gb:seq1 SEQ1 listOf urn:tav:seqc2 urn:tav:seq2 Sequence 1 Sequence 2 Sequence 3 Execution duplicates

26 SEQ aListOf 1..m 1 GBRPT GBrpt Seq aListOf 1 1..m Species isOf 1 1 collection data computed by iterations, e.g. a list of GenBank reports from GenBank_retrieve nested collected data products, e.g. the species data object in the sequence data Execution duplicates 3

27 Migration duplicates C:/MyDocuments/ WBS/Run2/ Seq 1

28 Managing identity co-reference Identity co-reference: –Identifying duplicate identities that refer to the same object while keeping their context An approach: –An IDSet entity Identity equivalence for collected data Identity correspondence for computed data –An identity service –Identity normalisation and cleansing activity

29 IDSet entity IDSet(BLASTrpt) = {{urn:tav:brpt1}, {urn:tav:brpt3}} urn:gb:seq1 Sequence Query by its identity Query by its content IDSet 1 merge IDSet created by another organization IDSet 3 urn:lsid:tav:brpt1 BLASTreport
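
One way to read the "merge" step on this slide is as a union of identity sets that overlap. The sketch below assumes an IDSet is simply a set of identifiers known to denote the same object; `merge_idsets` and the identifier urn:lsid:other:seqA are hypothetical and stand in for an IDSet created by another organization.

```python
def merge_idsets(idsets):
    """Merge identity sets that share at least one identifier."""
    merged = []
    for ids in idsets:
        current = set(ids)
        # Absorb every group collected so far that overlaps the current one.
        overlapping = [group for group in merged if group & current]
        for group in overlapping:
            current |= group
            merged.remove(group)
        merged.append(current)
    return merged


local_idset = {"urn:gb:seq1", "urn:lsid:tav:seq1"}
foreign_idset = {"urn:gb:seq1", "urn:lsid:other:seqA"}  # hypothetical foreign IDSet
unrelated_idset = {"urn:lsid:tav:brpt1"}

result = merge_idsets([local_idset, unrelated_idset, foreign_idset])
```

The two sets that mention urn:gb:seq1 collapse into one, while the BLAST report keeps its own set. Whether computed data such as reports should ever be merged this way is a separate question, since the slides treat them as corresponding rather than equivalent.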

30 Extended Architecture Workflow enactor Provenance service Data service External domain service Data storage group wfEvents Taverna LSID Authority MySQL relational stores BAKLAVA Customized DB Customized DB Identity service KAVE + Jena/Sesame RDF store MySQL relational store Identity store KAVE

31 Identifying collected product Identity service urn:gb:seq1 Identity store Receive an identity Look for or create Its IDSet KAVE+ 1 2 3 3 urn:gb:seq1 Store the id and the IDSet 1 urn:gb:seq1
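
The three steps on this slide (receive an identity, look for or create its IDSet, store both) can be sketched with an in-memory dictionary standing in for the identity store. The class and method names are hypothetical; the real service and its MySQL-backed store are not described in enough detail here to reproduce.

```python
class IdentityService:
    """Minimal in-memory sketch of a look-up-or-create identity service."""

    def __init__(self):
        # Stand-in for the identity store: identity -> the IDSet that holds it.
        self._idset_of = {}

    def idset_for(self, identity):
        """Steps 1-3: receive an identity, find or create its IDSet, store it."""
        idset = self._idset_of.get(identity)
        if idset is None:
            idset = {identity}
            self._idset_of[identity] = idset
        return idset

    def declare_equivalent(self, first, second):
        """Record that two identities denote the same collected object."""
        kept, absorbed = self.idset_for(first), self.idset_for(second)
        if kept is not absorbed:
            kept |= absorbed
            for member in absorbed:
                self._idset_of[member] = kept
        return kept


service = IdentityService()
first_lookup = service.idset_for("urn:gb:seq1")
second_lookup = service.idset_for("urn:gb:seq1")  # found this time, not re-created
service.declare_equivalent("urn:gb:seq1", "urn:tav:seq1")
```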

32 Identifying a collection product Identity service Identity store Receive an identity Look for or create Its IDSet KAVE+ 1 2 3 3 Store the id and the IDSet urn:lsid:seqc1 Seq 1 Seq 2 Seq 3 SEQ2 listOf urn:lsid:seqc2 Look for equivalent collection urn:lsid:seqc1 urn:lsid:seqc2
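
"Look for equivalent collection" could be sketched by comparing collections on the canonical identities of their members. This assumes that two collections are equivalent when they hold the same set of members, ignoring order and repetition, which the slide does not state; all function names and the alias table are hypothetical.

```python
def canonical(identity, aliases):
    """Map an identity to its canonical form, or leave it unchanged."""
    return aliases.get(identity, identity)


def collection_fingerprint(members, aliases):
    """Order-insensitive summary of a collection's canonical members."""
    return frozenset(canonical(member, aliases) for member in members)


def group_equivalent_collections(collections, aliases):
    """Group collection ids whose members resolve to the same identities."""
    groups = {}
    for collection_id, members in collections.items():
        key = collection_fingerprint(members, aliases)
        groups.setdefault(key, set()).add(collection_id)
    return list(groups.values())


# Hypothetical alias table: run-local sequence ids mapped to the provider's id.
aliases = {"urn:tav:seq1": "urn:gb:seq1", "urn:tav:seq2": "urn:gb:seq1"}
collections = {
    "urn:lsid:seqc1": ["urn:tav:seq1"],
    "urn:lsid:seqc2": ["urn:tav:seq2"],
    "urn:lsid:seqc3": ["urn:gb:seq9"],  # hypothetical unrelated collection
}
groups = group_equivalent_collections(collections, aliases)
```

Nested collections, which the discussion slide flags as a scalability concern, would need the fingerprint to be computed recursively; this sketch handles one level only.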

33 Putting the Identity Service to Use Provenance Integration Provenance Aggregation Identity Management Provenance Normalization Run2 Run1 b1 c1 s1 b2 c2 s2
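
Provenance normalization can be sketched as rewriting node identifiers to a canonical form before two run graphs are combined. Following the distinction drawn earlier between equivalence for collected data and correspondence for computed data, only the collected sequence s2 is mapped onto s1 here; the reports and collections keep their own identities. The edge directions and the `normalise` helper are assumptions.

```python
def normalise(graph, canonical):
    """Rewrite every subject and object through the canonical-identity map."""
    return {
        (canonical.get(s, s), p, canonical.get(o, o))
        for s, p, o in graph
    }


def nodes(graph):
    """Collect every subject and object that appears in the graph."""
    return {s for s, _, _ in graph} | {o for _, _, o in graph}


# b = BLAST report, c = sequence collection, s = sequence, as on the slide.
run1 = {("c1", "derivedFrom", "b1"), ("c1", "hasElement", "s1")}
run2 = {("c2", "derivedFrom", "b2"), ("c2", "hasElement", "s2")}

# Assumed output of the identity service: s2 is a duplicate identity of s1.
canonical = {"s2": "s1"}

before = nodes(run1) & nodes(run2)
norm1, norm2 = normalise(run1, canonical), normalise(run2, canonical)
after = nodes(norm1) & nodes(norm2)
integrated = norm1 | norm2
```

Before normalization the runs share no nodes; afterwards they meet at s1, which gives aggregation and integration queries a point of commonality to traverse.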

34 Discussion Scalability issues: –Normalizing provenance graphs –Building IDSet for collections with multiple hierarchies Open world data type-free context Use experimental context more effectively – workflows are not independently executed. Granularity of identity Identity aware operations in workflow Multiple naming schemes Migration duplicates Compacting data results

35 Conclusion Combining provenance depends on finding points of commonality, such as data identity. Duplicate identities will occur in an open world. It is hard to achieve uniqueness without community commitment. There are different types of equivalent objects. How much can be avoided? And how much has to be repaired?

