
1 e-SI Theme: Exploiting Diverse Sources of Scientific Data: the vision, what has been achieved and what next… Prof. Jessie Kennedy

2 Science & Scientific Data
• Science and scientific data are complex…

3 Biochemistry Climatology Taxonomy Meteorology Nomenclature Paleontology Genomics Proteomics Hydrology Morphology Geology Oceanography Geography Ecology

4 Biochemistry Climatology Taxonomy Meteorology Nomenclature Paleontology Genomics Proteomics Hydrology Morphology Geology Oceanography Ecology Geography Organism Name Taxon concept Gene sequence Pathway Protein Location Temperature Depth

5 Scientific Community: complex. Individual Scientist, Small Scientific Community, Large Scientific Community, Scientific Laboratory

6 [Figure: the discipline and concept labels of slide 4, repeated four times] Biochemistry Climatology Taxonomy Meteorology Nomenclature Paleontology Genomics Proteomics Hydrology Morphology Geology Oceanography Ecology Geography Organism Name Taxon concept Gene sequence Pathway Protein Location Temperature Depth

7 Science & Scientific Data
• Are continually changing
– Conclusions become foundations for new hypotheses
– New experiments invalidate existing knowledge
• Knowledge is open to interpretation
– Different opinions
• World continually changing
[Diagram labels: observation, experiment, hypothesis, conclusion]

8 Exploiting Diverse Sources of Scientific Data: the vision
• To provide scientists with technological solutions to exploit the wealth and diversity of scientific data
– Discovery
– Access
– Sharing
– Integration/Linking
– Analysis
• Which would thereby improve the potential for new scientific discovery

9 Projects in most sciences: ESG

10 SEEK (Scientific Environment for Ecological Knowledge): Vision Research, develop, and capitalize upon advances in information technology to radically improve the type and scale of ecological science that can be addressed –Scalable synthesis Michener

11 Data Dispersion Challenges Data are massively dispersed –Ecological field stations and research centers (100’s) –Natural history museums and biocollection facilities (100’s) –Agency data collections (10’s to 100’s) –Individual scientists (1000’s) –Maintenance must be local Michener

12 Data Integration Challenges Data are heterogeneous –Syntax (format) –Schema (model) –Semantics (meaning) Jones

13 Ecological Modeling Challenges Analysis and modeling tools are: –Specialized –Disconnected –Proprietary It is: –Difficult to revise analyses –Hard to document analyses –Impossible to reliably publish models to share with colleagues –Hard to re-use models and analyses from colleagues –Difficult to use grid-computing for demanding computations –Labor-intensive to manage data in popular analysis software Michener

14 Exploiting Diverse Sources of Scientific Data: the approaches
• Data Discovery/Access
– Metadata: to describe the data sets
– Ontologies: to define the terminology used
– Standardisation of formats: for the exchange of data
– Life Science Identifiers (LSIDs): to uniquely identify and resolve data objects
– Provenance of data: to record where the data has come from and what has happened to it en route
– GRID/Web technology: distributed data management

15 Exploiting Diverse Sources of Scientific Data: the approaches
• Data Integration/Linking
– Metadata: to know how to interpret the data sets
– Ontologies: to know how data in the data sets might be related; to aid automatic transformation of the data
– Standardisation of formats: to ease integration
– Life Science Identifiers (LSIDs): to know when 2 things are the same
– Workflows: to enable refinement and repetition of integration

16 Exploiting Diverse Sources of Scientific Data: the approaches
• Data Analysis
– Metadata: to know how to interpret the data sets
– Ontologies: to know which analytical/transformation processes are appropriate
– Workflow tools: to ease analytical processes; recording/reuse of analytical processes
– Provenance: recording the life history of data, to enable validation

17 Exploiting Diverse Sources of Scientific Data: the technologies
• Standardisation of formats
• Metadata
• Ontologies
• Life Science Identifiers (LSIDs)
• Provenance
• Workflow Tools
• GRID/Web technology


19 Meta Data: the vision
• Metadata: "data about data"
– keywords, title, creator…
• If scientists marked up their data with the agreed metadata, it would be trivial to find highly relevant data (sub-)sets for analysis…
• Meta-utopia…

20 Meta-utopia
• A world of complete, reliable metadata.
• In meta-utopia,
– Everyone uses the same language and means the same thing…
– The guardians of epistemology have rationally mapped out a schema or hierarchy of ideas that everyone adheres to…
– Scientists accurately describe their methods, processes and results, so anyone can do anything with them in the future…
Cory Doctorow

21 Meta Data: the approach
• Common language
– XML Schemas to describe data/metadata
• Domain specific exchange schemas
– Explosion of these in every domain
– Exchanging data
– Archiving data

22 knb.ecoinformatics.org Ecological Metadata Language A look inside the meta-utopia of ecology

23 knb.ecoinformatics.org Identification: dataset elements

24 knb.ecoinformatics.org Identification: resource elements

25 knb.ecoinformatics.org Identification: party elements

26 knb.ecoinformatics.org Discovery: coverage elements Geographic Temporal Taxonomic

27 knb.ecoinformatics.org Evaluation Level Information

28 knb.ecoinformatics.org Evaluation: Method Information

29 knb.ecoinformatics.org Evaluation: Project Information L3

30 knb.ecoinformatics.org Access: Permissions Information L4

31 knb.ecoinformatics.org Access: Physical Information

32 knb.ecoinformatics.org Access: Physical formatting details

33 knb.ecoinformatics.org Access: Distribution Information L4

34 knb.ecoinformatics.org Integration Level Information

35 knb.ecoinformatics.org Integration Level: Attribute structure

36 knb.ecoinformatics.org Integration Level: attribute domains

37 knb.ecoinformatics.org Integration Level: attribute domains

38 knb.ecoinformatics.org Integration Level: measurementScale

39 Meta Data: the approach
• Common language
– XML Schemas to describe data/metadata
• Domain specific exchange schemas
– Explosion of these in every domain
– Exchanging data
– Archiving data
• Turned into extensive specifications
– Difficult to know where to stop…

40 • But even this wasn’t enough…
• It’s not good enough to have metadata; we need to know what the terms in the metadata (schema or data values) mean.

41 Ontologies – the vision
• If we understood the meaning of the schema and the terms used in the metadata or databases we would be able to:
– find things more reliably,
– integrate things more easily,
– reason about what things are comparable…
• because we have support for automatic inference

42 Ontologies – the approach
• Common language…
– OWL?
– RDF, OWL Lite, OWL DL, OWL Full…
• Domain specific ontologies
– or project specific?
• Map different ontologies
• Modularise the ontologies
– Reuse…
• Build upper ontologies that domain ontologies extend or link to

43 Biodiversity Base Ontology

44 Core Layer

45 BDI Core Taxon Name

46 BDI Core Taxon Concept

47 BDI Core BioSpecimen

48 BDI Core BioObservation Similar to…

49 SEEK Observation ontology Josh Madin

50 An extension point for domain-specific terms entity Josh Madin

51 Characteristic Josh Madin

52 All the units, scales, indices, classifications, and lists used for ‘measuring’ a characteristic Measurement standard Similar to… Josh Madin

53 Semantic Web for Earth and Environmental Terminology (SWEET). Ontologies revised and validated Jan 26, 2006: Biosphere, Data, Data Center, Human Activity, Material Thing, Numerics, Sensor, Space, Time, Units, Earth Realm, Physical Phenomena, Physical Process, Physical Property, Physical Substance, Sun Realm. Takes us back to…

54 BDI Taxon Concept Ontology …is really just a schema for representing …

55 Biological Taxonomy
• Classify and name all organisms in the world
– So we can talk about them, experiment with them
– Do life science…
• The longest running attempt at building an ontology?
– Linnaeus's binomial system of nomenclature started in 1758
• An attempt to resolve a long standing problem in biology
– Many ways to classify things
– Understanding continually changes with new discoveries & technologies
– Classifications continually being redone
– New things defined, new definitions given for things in existence
– Lots of classifications over time
– Many compete at any one point in time

56 [Figure: Taxonomic history of imaginary genus Aus L. 1758: 5 revisions of Aus, 1 name spelling change. Additional label: Pyle 1990]
• Linnaeus 1758: Aus L.1758 containing Aus aus L.1758
• Archer 1965: Aus L.1758 containing Aus aus L.1758 and Aus bea Archer 1965
• Fry 1989: Aus L.1758 containing Aus aus L.1758, Aus bea Archer 1965 and Aus cea BFry 1989
• Tucker 1991: Aus L.1758 containing Aus aus L.1758 and Aus cea BFry 1989
• Pargiter 2003: Aus L.1758 containing Aus aus L. 1758 and Aus ceus BFry 1989; Xus Pargiter 2003 containing Xus beus (Archer) Pargiter 2003. Aus bea and Aus cea noted as invalid names and replaced with Aus beus and Aus ceus.

57 [Same figure as slide 56: Taxonomic history of imaginary genus Aus L. 1758] 8 names: 2 genus, 6 species

58 8 Names, 17 Concepts: results in many concepts for each name
N0 - Aus L.1758: C0.1 sec. Linnaeus 1758; C0.2 sec. Archer 1965; C0.3 sec. Fry 1989; C0.4 sec. Tucker 1991; C0.5 sec. Pargiter 2003
N1 - Aus aus L.1758: C1.1 sec. Linnaeus 1758; C1.2 sec. Archer 1965; C1.3 sec. Fry 1989; C1.4 sec. Tucker 1991; C1.5 sec. Pargiter 2003
N2 - Aus bea Archer 1965: C2.2 sec. Archer 1965; C2.3 sec. Fry 1989
N3 - Aus cea Fry 1989: C3.3 sec. Fry 1989; C3.4 sec. Tucker 1991
N4 - Aus beus Archer 1965
N5 - Aus ceus Fry 1989: C5.5 sec. Fry 1989
N6 - Xus beus Pargiter 2003: C6.5 sec. Pargiter 2003
N7 - Xus Pargiter 2003: C7.5 sec. Pargiter 2003
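The split between names and concepts on this slide can be sketched as a small data model. This is only an illustrative Python sketch: the class names `TaxonName` and `TaxonConcept` and the helper `concepts_by_name` are assumptions made for the example, not part of any schema named in the talk. The records are a subset of those listed on the slide.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonName:
    """A published name, independent of how anyone circumscribes the taxon."""
    name_id: str
    text: str


@dataclass(frozen=True)
class TaxonConcept:
    """A name as used 'sec.' (according to) one particular revision."""
    concept_id: str
    name: TaxonName
    sec: str


N1 = TaxonName("N1", "Aus aus L.1758")
N2 = TaxonName("N2", "Aus bea Archer 1965")

CONCEPTS = [
    TaxonConcept("C1.1", N1, "Linnaeus 1758"),
    TaxonConcept("C1.2", N1, "Archer 1965"),
    TaxonConcept("C1.3", N1, "Fry 1989"),
    TaxonConcept("C1.4", N1, "Tucker 1991"),
    TaxonConcept("C1.5", N1, "Pargiter 2003"),
    TaxonConcept("C2.2", N2, "Archer 1965"),
    TaxonConcept("C2.3", N2, "Fry 1989"),
]


def concepts_by_name(concepts):
    """Group concept ids under the id of the name they use."""
    index = defaultdict(list)
    for concept in concepts:
        index[concept.name.name_id].append(concept.concept_id)
    return dict(index)


INDEX = concepts_by_name(CONCEPTS)
print(INDEX["N1"])  # one name, five concepts
```

Keeping the name and the "sec." reference as separate fields is what lets a system tell the five uses of N1 apart instead of treating them as one thing.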

59 Possible interpretations of Aus aus L. 1758
• Request data sets about Aus aus (N1): what’s returned?
– Original concept: C1.1
– Most recent concept: C1.5
– Preferred authority (e.g. Fry 1989): C1.3
– Everything ever named N1: Union(C1.1, C1.2, C1.3, C1.4, C1.5)
– Best fit according to some matching algorithm: Best(C1.1, C1.2, C1.3, C1.4, C1.5)
– New concept containing only those features common to all concepts with the name N1: Intersection(C1.1, C1.2, C1.3, C1.4, C1.5)
• Is it appropriate to link or merge data on this?
– Depends on the user’s purpose
– Level of precision required
[Diagram: name N1 (Aus aus L.1758) linked to concepts C1.1 to C1.5]
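The interpretations listed on this slide can be sketched as query strategies over sets. This is a minimal Python sketch under stated assumptions: the specimen ids `s1` to `s4`, the membership of each concept and the function name `resolve` are all invented for illustration, and the "best fit" option is left out because the slide does not say which matching algorithm it would use.

```python
# Hypothetical circumscriptions of the five concepts that carry name N1.
# The specimen ids (s1..s4) are invented purely for illustration.
N1_CONCEPTS = {
    "C1.1": {"sec": "Linnaeus 1758", "members": {"s1", "s2"}},
    "C1.2": {"sec": "Archer 1965", "members": {"s1", "s2", "s3"}},
    "C1.3": {"sec": "Fry 1989", "members": {"s1", "s2", "s3", "s4"}},
    "C1.4": {"sec": "Tucker 1991", "members": {"s1", "s2", "s4"}},
    "C1.5": {"sec": "Pargiter 2003", "members": {"s1", "s4"}},
}


def resolve(concepts, strategy, authority=None):
    """Return the set of specimens a query for the name should match."""
    # The concept ids used here happen to sort oldest first.
    ordered = [concepts[key] for key in sorted(concepts)]
    member_sets = [c["members"] for c in ordered]
    if strategy == "original":
        return set(member_sets[0])
    if strategy == "most_recent":
        return set(member_sets[-1])
    if strategy == "preferred":
        for c in ordered:
            if c["sec"] == authority:
                return set(c["members"])
        raise KeyError(f"no concept sec. {authority}")
    if strategy == "union":
        return set().union(*member_sets)
    if strategy == "intersection":
        return set.intersection(*member_sets)
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("original", "most_recent", "union", "intersection"):
    print(name, sorted(resolve(N1_CONCEPTS, name)))
print("preferred", sorted(resolve(N1_CONCEPTS, "preferred", "Fry 1989")))
```

With these made-up memberships the same name returns anything from one specimen (intersection) to four (union), which is the ambiguity the slide is pointing at.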

60 [Diagram: concepts C0.1 to C7.5 and names N0 to N7] Classifications: parent-child relationships in 5 revisions. Names for each of the concepts. Synonymy relationships between concepts and names: in the literature, taxonomists tell us names that are synonymous with their concepts.

61 [Diagram: concepts C0.1 to C7.5 and names N0 to N7] Classifications: synonymy relationships between concepts and names, which can result in anything being returned for Aus aus by traversing the synonymy links.

62 [Diagram: concepts C0.1 to C7.5 and names N0 to N7, with set-relationship symbols such as == between concepts] Classifications with set relationships between concepts. What we need are the set relationships from concepts in a revision to earlier concepts, and name changes related to earlier names. We can then build systems to return data fit for purpose.
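One way to picture a set relationship between two concepts is to compare their circumscriptions as plain sets. The Python sketch below is an assumption-laden illustration: the function name `set_relationship`, the five relationship labels and the example specimen ids are chosen for this example and are not taken from the slides.

```python
def set_relationship(a, b):
    """Classify how circumscription a relates to circumscription b."""
    a, b = set(a), set(b)
    if a == b:
        return "congruent"        # same members
    if a > b:
        return "includes"         # a strictly contains b
    if a < b:
        return "is included in"   # b strictly contains a
    if a & b:
        return "overlaps"         # some shared members, neither contains the other
    return "excludes"             # nothing in common


# Hypothetical memberships for an earlier concept and three later ones.
EARLIER = {"s1", "s2", "s3"}
LATER_NARROWER = {"s1", "s2"}
LATER_SHIFTED = {"s3", "s4"}
LATER_UNRELATED = {"s9"}

print(set_relationship(EARLIER, LATER_NARROWER))
print(set_relationship(EARLIER, LATER_SHIFTED))
print(set_relationship(EARLIER, LATER_UNRELATED))
```

Recording a label like this for each pair of earlier and later concepts is what would let a system decide whether data attached to one concept can safely be merged with data attached to another.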

63 Real Taxonomic Revisions
• German mosses
– 14 classifications in 73 years
– covering 1548 taxa
– only 35% thought to be stable concepts
– 65% of names used in legacy data sets are ambiguous, and we don’t know which ones
– we need computers to help understand this…
• Smaller classifications are combined into large classifications
– ITIS: integrated taxonomy (also changing), approx. 250,000 taxa
• Taxonomic revision of genus Alteromonas
– 34 years: from 1972 to 2006
– Thanks to George Garrity, Michigan State Univ.

64-85 [Figure, built up one revision per slide: taxonomic revision of genus Alteromonas, 1972 to 2005. Alteromonas starts in 1972 with macleodii (T), communis and vaga; later slides add haloplanktis (1973), rubra (1976), citrea (1977), esperjiana and undina (1978), aurantia (1979), putrifaciens and hanedai (1981) and luteoviolaceae (1982). From 1984 species labels also appear under Marinomonas and Oceanosprillum, from 1986 under Shewanella, and from 1995 under Pseudoalteromonas; further revisions are dated 1987, 1988, 1990, 1992, 1997, 2000, 2001, 2002, 2004 and 2005. The complete label set is on slide 86.]

86 [Figure labels: taxonomic revision of genus Alteromonas, state after the 2006 revision] vagabenthica hanedai colwelliana algae Marinomonas Shewanella communis vaga haloplanktis rubra citrea esperjiana undina aurantia putrifaciens hanedai luteoviolaceae denitrificans tetradonis atlantica carageenovora Alteromonas colwelliana 1972 1973 1976 1977 1978 1979 1981 1982 1984 1986 1987 1988 1990 1992 1995 1997 2000 2001 2002 2004 2005 2006 japonicum minutium biejerinckii maris hiroshimense multiglobiferum pelagicum pusillum commune jannaschii kreigii vagum biejerinckii pelagicum maris hiroshimense Oceanosprillum maris williamsae distincta fuliginea Pseudoalteromonas elyakoviii fridgidimarina geldimarina woodyii amazonensis baltica oneidensis pealeana violacea japonica denitrificans livingstonensis alleyanna atlantica aurantia carrageenovora citrea esperjiana luteoviolacea nigrifaciens pisicida rubra undina antartica bacteriolytica prydzensis tunicata distincta elyakovii peptidolytica tetrodonis haloplanktis tetradonis 14 others mariniintestina saire schlegeliana gaetbuli mediterannea primoryensis haloplanktis haloplanktis (T) putrifaciens (T) communis (T) linum (T) macleodii (T) stellipolaris litorea 13 others 2 others

87 [Figure: classification around Alteromonas, May 2004 versus November 2004. Higher-level labels: Gammaproteobacteria, Alteromonadales, Alteromonadaceae, Colwelliaceae, Ferrimonadaceae, Idiomarinaceae, Moritellaceae, Pseudoalteromonadaceae, Psychromonadaceae, Shewanellaceae, Incertae sedis. Genus labels: Aestuariibacter, Agarivorans, Algicola, Alishewanella, Alteromonas, Colwellia, Ferrimonas, Glaciecola, Idiomarina, Marinobacter, Marinobacterium, Microbulbifer, Moritella, Pseudoalteromonas, Psychromonas, Salinomonas, Shewanella, Teredinibacter, Thalassomonas]
At the species level
• 18 “emendations”
• 21 new species
• 19 species reassigned to 4 genera
• 3 new combinations
• 6 synonyms
• 2 species to subspecies
• 2 subspecies to species
• 50 names, five genera, five families, and two classes, but only 5 validly published species
At the higher level
• 1 family, 16 genera -> 8 families, 12 genera
• 1 unclassified genus -> 7 unclassified genera
Which is correct? Which is supported/recorded in the data? What is the impact on analysis?

88 Meta-utopia - a pipe dream?
• What is metadata?
– Your metadata is my data…
• Depends on your perspective
– How you see the world
– What’s important to you
– What you want to do with the “data”
• It’s all data anyway…
• But it’s useful to differentiate for certain purposes
[Figure labels: Ecological data set, Metadata, Taxonomic data, META DATA, DATA; Higher Taxon / Taxon pairs: Pinaceae / Picea, Picea / Picea rubens, Picea / Picea abies; Name: Linnaeus, Year: 1758]

89 Meta-utopia - a pipe dream?
• Schemas aren't neutral
– Presumes there is a "correct" way of modelling or categorising ideas and that, given enough time and incentive, people can agree on the correct way…
– Any hierarchy of concepts necessarily implies the importance of some axes over others.

90 • Geographic/cartographic perspective
– Instance of Picea rubens is a feature that can be mapped
– Features inherently have geospatial coordinates
• Taxonomic perspective
– Instance of Picea rubens is a specimen of some biological taxon
– Taxa inherently have characteristics used in classification
[Figure labels: Pinaceae, Picea, Picea rubens, Picea abies; Building, Feature, Observation, Organism occurrence, Picea rubens]

91 Meta-utopia - a pipe dream?
• There's more than one way to describe something

92 Exploiting Diverse Sources of Scientific Data

93 Meta-utopia - a pipe dream?
• There's more than one way to describe something
– Reasonable people can disagree forever on how to describe something.
– Requiring scientists to use the same vocabulary to describe their data enforces homogeneity in ideas.
– Which could limit science…

94 Meta-utopia - a pipe dream?
• Metrics influence results
– Agreeing to a common metric for measuring important things in a domain necessarily privileges the items that score high on that metric, regardless of those items' overall suitability.
• Ranking axes are mutually exclusive
– Software that scores high for security scores low for convenience.
• Everyone wants to emphasize their high-scoring axes
– and de-emphasize (or, if possible, ignore altogether) their low-scoring axes.

95 Meta-utopia - a pipe dream?
• People are not altruistic
– Scientists have their own immediate deliverables
– Doesn’t leave time for thinking about who else might do what with their data
• Metadata exists in a competitive world.
– People want their work cited and will (ab)use metadata to do so.
• People are busy
– e-Scientists understand the importance of excellent metadata
– Jo-scientist is mainly concerned about publishing the results.
– No time for added extras

96 Meta-utopia - a pipe dream?
• People make mistakes
– Even when there's a positive benefit to creating good metadata, people don’t exercise enough care and diligence in their metadata creation.
• Mission Impossible?
– Simple observation demonstrates people are poor observers of their own behaviours.
– Therefore any metadata will be a poor representation

97 Life Science Identifiers (LSIDs): the vision
• WWW provides a globally distributed communication framework
• LSID and the LSID Resolution System
– will provide a simple mechanism to globally resolve locally named objects distributed over the WWW.
• LSIDs will allow us to know
– what kind of object it is,
– who originated it,
– who is responsible for it,
– how to interface to it and
– what computations might be carried out on it.
• Adoption of LSIDs
– will facilitate more reliable integration of multiple knowledge bases, each of which has partial information of a shared domain
– will encourage stronger global collaboration in life sciences.
Clark T., Martin S., Liefeld T. Globally Distributed Object Identification for Biological Knowledgebases. Briefings in Bioinformatics 5.1:59-70, March 1, 2004.

98 Exploiting Diverse Sources of Scientific Data Life Science Identifiers  URI based naming scheme  urn:lsid:ipni.org:names:1234-1  Retrieval framework  http://lsid.sourceforge.net/ [Diagram: an LSID resolver offers "Get data", which returns the data record, and "Get metadata", which returns RDF.] An LSID has data  gene sequence in GenBank  ecological data set (in Excel, or in a text file)  image  The data should never change (but can be versioned) An LSID has metadata  format of the data  display title for clients  Dublin Core metadata  anything you want  The metadata can change
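The naming scheme on this slide can be illustrated with a small parser. This is only a sketch, assuming the usual LSID layout of `urn:lsid:authority:namespace:object` with an optional trailing revision; the names `LSID` and `parse_lsid` are made up for this example and are not part of any LSID toolkit.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """Components of an LSID (hypothetical helper type for this sketch)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID URN into its components; raise ValueError if malformed."""
    parts = text.split(":")
    # Expect "urn", "lsid", authority, namespace, object and an optional revision.
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated fields: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID URN: {text!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty component in {text!r}")
    return LSID(*parts[2:])


lsid = parse_lsid("urn:lsid:ipni.org:names:1234-1")
print(lsid.authority, lsid.namespace, lsid.object_id, lsid.revision)
```

For the example identifier from the slide, the authority is `ipni.org`, the namespace is `names`, the object is `1234-1`, and no revision is present. Resolving the identifier to data or metadata is a separate step that this sketch does not attempt.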

99 Exploiting Diverse Sources of Scientific Data Issues For Each Community  What gets an LSID?  Real life objects  Biological specimen  Abstract concepts  Taxon concept or name – Bellis perennis  Electronic representations of things  Image of specimen, description of specimen or concept  For each thing, what’s the data and metadata?  LSIDs  Data doesn’t change but Meta data can  Should all data become meta data?  Maybe it implies a temporal database approach

100 Exploiting Diverse Sources of Scientific Data Issues For Each Community  Who issues LSIDs?  Owner of data  Not always clear who owns data especially legacy data  A central authority  One authority responsible for issuing LSID for specific types of information  This would help enforce a 1:1 mapping of LSIDs and data items  It MAY also reduce the likelihood of LSIDs becoming unresolvable  A respected authority  This would help enforce a 1:1 mapping for those who use the authority  It may also be more feasible  Free for all (possibly with an index)  List your LSID authority in an index so your LSIDs are easy to find  Perhaps structured delegation has best potential to globally unite science

101 Exploiting Diverse Sources of Scientific Data Organizations Using LSIDs  Biopathways consortium  National Center for Biotechnology Information (NCBI)  PubMed, GenBank  European Bioinformatics Institute (EBI)  BioMOBY – a biological database interoperability program (biomoby.org)  represents all entities in MOBY Ontologies (Object, Service, and Namespace), as well as all instances of BioMOBY services.  myGrid (mygrid.org.uk)  used throughout as object naming device  TDWG (tdwg.org)  IPNI – plant names  Index Fungorum – fungi names  US Long Term Ecological Research Network (LTER)  SEEK (seek.ecoinformatics.org)  Used in Kepler – actors, components, TOS – taxon concepts…

102 Use of LSIDs Lined seahorse Hippocampus erectus Perry 1810 urn:lsid:biocast.org:concept:347 Hippocampus marginalis Kaup, 1856 Hippocampus tetragonous Mitchill, 1814 Hippocampus erectus 347 TAX 347 Ecological Data Sets
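The idea on this slide, several names pointing at one concept LSID so that ecological data sets can be combined, might be sketched as follows. The names and the concept LSID come from the slide; the observation records, the data set labels and the function `total_by_concept` are hypothetical and exist only for illustration.

```python
CONCEPT_347 = "urn:lsid:biocast.org:concept:347"

# Names from the slide, all treated here as labels for the same taxon concept.
NAME_TO_CONCEPT = {
    "Lined seahorse": CONCEPT_347,
    "Hippocampus erectus": CONCEPT_347,
    "Hippocampus marginalis": CONCEPT_347,
    "Hippocampus tetragonous": CONCEPT_347,
}

# Hypothetical records from two data sets that happen to use different names.
observations = [
    {"dataset": "survey_a", "name": "Hippocampus erectus", "count": 3},
    {"dataset": "survey_b", "name": "Hippocampus marginalis", "count": 5},
    {"dataset": "survey_b", "name": "Bellis perennis", "count": 7},
]


def total_by_concept(records, name_to_concept):
    """Sum counts per concept LSID, skipping names that cannot be resolved."""
    totals = {}
    for record in records:
        concept = name_to_concept.get(record["name"])
        if concept is None:
            continue  # no known concept for this name; leave it out
        totals[concept] = totals.get(concept, 0) + record["count"]
    return totals


totals = total_by_concept(observations, NAME_TO_CONCEPT)
print(totals)
```

Without the shared identifier, the two seahorse records would look like different organisms; with it, they are counted together under concept 347.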

103 Exploiting Diverse Sources of Scientific Data Moving to a world of LSIDs  Using LSIDs alone will not address all issues of data sharing  Data repositories must (re)use LSIDs to cross reference data  within and outwith their own repository.  it is important that we use the same LSID to refer to the same entity  If multiple LSIDs exist for the same entity we would be required to decide whether or not two LSIDs were really the same thing.  We would be in a worse situation than we are today,  for example when trying to decide if two taxonomic names mean the same.  Generating LSIDs for any self contained data set is a fairly trivial task  Appointing LSIDs to existing data from an authoritative repository to re-use them is more challenging  Investigate what’s involved…

104 Exploiting Diverse Sources of Scientific Data Convert Data Provider to use LSIDs [Diagram: the original data repository (target), the Hexacorallia Data Provider, holds data to be updated with LSIDs from authority providers; its data is mapped to an ontology and held as RDF in the Hexacorallia Data Triple Store. Authority LSID resolution services (source) for Specimen, Publication, Concept, Name and Person each return LSID + RDF, also mapped to the ontology. A Linker Tool matches data from the repository with data in the LSID resolvers and returns the LSID to the repository.]

105 Exploiting Diverse Sources of Scientific Data Linking…. [Diagram labels: Linker Client; two WASABI Service Request Dispatchers, each offering LSID, SPARQL and OAI, one also offering Linker; authoritative (“source”) provider & linker; local (“target”) provider; Hexacorallia Thematic Triple Store; Person Triple Store.] Step shown: Request linkable classes and select one to be linked

106 Exploiting Diverse Sources of Scientific Data Linking…. [Same diagram as slide 105.] Step shown: Select class to be linked

107 Exploiting Diverse Sources of Scientific Data Linking…. [Same diagram as slide 105.] Step shown: Request possible LSIDs

108 Exploiting Diverse Sources of Scientific Data Confirm/Skip Annotations Person to find LSID for Choice of possible persons with LSIDs

109 Exploiting Diverse Sources of Scientific Data Issues in converting to LSIDs  Mapping to ontology  LSIDs  RDF  schema?  ontology?  agreement on ontology - problem?  Replace or annotate existing data?  If we replace an author with a person LSID  what is returned when resolving that LSID is unlikely to match the data that was stored in the DB for the author.  Dependencies between objects with LSIDs  If you link via a taxon name LSID – the resolved name should have an embedded LSID for a publication – so there shouldn’t be any need (in principle) to match publications for names  What about authorities that issue LSIDs but don’t map to other authorities  e.g. name providers not mapping to either publication or specimen providers

110 Exploiting Diverse Sources of Scientific Data Issues in converting to LSIDs  What support would a linking tool need to provide end users?  How would users want to process this data?  How much automation?  E.g. above a certain confidence level  Would this be trusted?  Order of matching  E.g. match all instances of persons at once  Match persons by publication?  Other Issues…  Performance of existing linking tool approach  Lots of data passing going on  Need more efficient approach which matches user needs  Finding authorities that provide linking services  How do scientists find out about authorities with linking services?  How do they know which ones to use?
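The "how much automation?" question can be made concrete with a small sketch: candidates above a confidence threshold are linked automatically, and everything else is queued for the confirm/skip step. This is not how the linking tool on the slides works. The string similarity from Python's `difflib`, the threshold of 0.9, the function names and the `example.org` LSIDs are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1], using difflib's ratio()."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def propose_links(local_names, authority, threshold=0.9):
    """Split local names into automatic links and cases needing human review.

    authority maps a name held by the (hypothetical) authority to its LSID.
    Returns (auto, review): auto maps a local name to the chosen LSID,
    review maps a local name to up to three (candidate name, LSID) pairs.
    """
    auto, review = {}, {}
    for name in local_names:
        ranked = sorted(
            ((similarity(name, cand), cand, lsid) for cand, lsid in authority.items()),
            reverse=True,
        )
        if ranked and ranked[0][0] >= threshold:
            auto[name] = ranked[0][2]  # confident enough to link automatically
        else:
            review[name] = [(cand, lsid) for _, cand, lsid in ranked[:3]]
    return auto, review


# Hypothetical authority records and local author strings.
authority = {
    "Ada Example": "urn:lsid:example.org:person:1",
    "Bob Sample": "urn:lsid:example.org:person:2",
}
auto, review = propose_links(["ada example", "A. Exmpl"], authority)
print("linked automatically:", auto)
print("needs confirm/skip:", review)
```

Whether such a threshold "would be trusted" is exactly the open question on the slide; the sketch only shows where the decision point sits and that uncertain cases can still be routed to a person.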

111 Exploiting Diverse Sources of Scientific Data To Summarise….  We have seen that (Life) Science is  Complex & Changing  The fundamental challenges of science that have always been there are still here  Now we have additional opportunities associated with the explosion of scientific information and the move to a virtual world  And now the challenge is how best to exploit these….  e-Science uses computation to aid scientists  By providing appropriate infrastructure and tool support  Speed up scientific processes  Do them repeatedly  Re-evaluation  Can give scientists time for more thoughtful science…  May require a change of emphasis in how scientists work  Must support the inherent features of science, scientists and scientific data

112 Exploiting Diverse Sources of Scientific Data e-Science: Complex Science  Support decomposition of scientific domains, problems and associated data  Fundamental to data & software analysis and design  Support re-composition, linking or building on the components  Need to know when components or links have changed  Identify the overlaps/linkages in the different domains  Need useful approximations of things to simplify linked domain  Need to understand the approximations or linking points well  Raise level of abstraction  Artefact of storage mechanisms  Implies lingua franca  Need more evaluation of the different approaches

113 Exploiting Diverse Sources of Scientific Data e-Science: Changing Science  Science is full of legacy data  Today’s scientific research is tomorrow’s legacy data  Provide long-term persistent storage  Any published scientific discovery should store the data as evidence  Data needs to be accurately annotated  Sufficient to repeat analyses to test hypotheses  e-Science already changing the way scientists do science  But to be effective it needs to change even more…  More emphasis on well curated, accessible, persistent data  Evidence for results

114 Exploiting Diverse Sources of Scientific Data Meta Data & Ontologies?  Do we throw out meta data/ontologies, then?  No…  To benefit from stored data we need to know what it means!  However, there are no large-scale benefits while there is insufficient coverage of meta data  if only 10% data has meta data people won’t use meta data…  Need to reach the tipping point…  Controlled vocabulary and schemas shown useful for large projects or small communities with common goal  Need long-term projects to see if they sustain their value as the community and the science evolves.

115 Exploiting Diverse Sources of Scientific Data Describe or Prescribe?  Descriptions become vocabularies used by others  Folksonomy or ontologies?  Informal versus formal or free versus constrained  Informal can be basis for something formal  Move towards common vocabularies  with built in flexibility and extensibility  Issue of what language(s)…  Need more research evaluating these issues…

116 Exploiting Diverse Sources of Scientific Data Reliability of Meta Data  Automatic recording of meta data  From machines, software, workflows…  Avoids labour  Starting to happen  Helps reach critical mass of available meta data  Still need to decide what it is that the machines/software are collecting…  Human input still needed  Purpose of experiment, deviations from planned protocol etc.
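Automatic recording of metadata from software can be sketched with a decorator that logs each analysis step as it runs. This is a minimal illustration, not a provenance system; `record_metadata`, `PROVENANCE_LOG`, the field names and the `mean_depth` step are hypothetical. The `purpose` field is left empty on purpose, since that is the kind of information the slide says still needs human input.

```python
import functools
import platform
from datetime import datetime, timezone

# In-memory log for the sketch; a real system would persist this somewhere.
PROVENANCE_LOG = []


def record_metadata(func):
    """Record what was run, with which inputs, when, and on which Python."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc)
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "step": func.__name__,
            "args": repr(args),
            "kwargs": repr(kwargs),
            "started_utc": started.isoformat(),
            "python_version": platform.python_version(),
            "purpose": None,  # cannot be captured automatically; a person fills it in
        })
        return result
    return wrapper


@record_metadata
def mean_depth(depths_m):
    """Example analysis step: average of depth readings in metres."""
    return sum(depths_m) / len(depths_m)


value = mean_depth([10.0, 20.0])
print(value, PROVENANCE_LOG[-1]["step"])
```

The scientist writes only `mean_depth`; the record of the run is produced as a side effect, which is the labour-saving point made on the slide.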

117 Exploiting Diverse Sources of Scientific Data Support  Community ontologies need to be easily available to all scientists  Listing the known ontologies on a web site is not enough  Need to understand when (meta) data is fit for purpose  Accurate enough, not overly precise  Need collaborative approaches to extending ontologies  Allow users to be involved to achieve community buy-in  Ontologies are difficult for people to comprehend  Need good visualisation  Need to trust system

118 Exploiting Diverse Sources of Scientific Data Tools  Simple tools would go a long way to help  Contextual data is consistent for many data sets  e.g. observer/location  Tools should support collection and re-use of this data  Make use of (incorporate) existing ontologies into tools  Get the software to do as much work as possible  Good at repetitive tasks, faster than humans  Personalisation  How application specific do tools have to be to be useful  Generic/ Domain specific/ Individual?  The more generic the more widely applicable  Pluggable components for personalisation?

119 Exploiting Diverse Sources of Scientific Data Finally…  It will take time and commitment for any of these approaches to work.  Focus on central important resources that are reused in many (sub-)domains  Ensure the data are well managed and curated, identified, described, easily available, lasting and evolving  Observe whether they benefit the community or act as a straitjacket  A good test case for this approach is the development of a taxon concept name resolution service  To allow scientists to find correct names for the concepts they are working with,  Mark up their data,  Resolve their concepts against other scientists’ data so they know they are talking about the same thing.  Is central to communication in all life sciences  Poses many computational, social and data research issues

120 Exploiting Diverse Sources of Scientific Data Acknowledgements  E-Science Institute for sponsoring theme leadership  Malcolm Atkinson  For support and many interesting discussions on exploiting scientific data.  Collaborators  on SEEK project,  Matt Jones, Bill Michener, Aimee Stewart, Robert Gales, Josh Madin, Shaun Bowers  Collaborators in TDWG/GBIF  Robert Kukla, Roger Hyam,  funding, slides, interesting problems

