The Semantic Web: New-style data-integration (and how it works for life-scientists too!) Frank van Harmelen AI Department Vrije Universiteit Amsterdam.


1 The Semantic Web: New-style data-integration (and how it works for life-scientists too!) Frank van Harmelen AI Department Vrije Universiteit Amsterdam

2 What’s the problem? (data-mess in bio-inf)

3 The Study of Genes... Chromosomal location Sequence Sequence Variation Splicing Protein Sequence Protein Structure

4 … and Their Function Homology Motifs Publications Expression HTS In Vivo/Vitro Functional Characterization

5 Understanding Mechanisms of Disease Metabolic and regulatory pathway induction

6 Development of Drugs, Vaccines, Diagnostics Differing types of Drugs, Vaccines, and Diagnostics Small molecules Protein therapeutics Gene therapy In vitro, In vivo diagnostics Development requires Preclinical research Clinical trials Long-term clinical research All of which often feeds back into ongoing Genomics research and discovery.

7 Sample Problem: Hyperprolactinemia
Overproduction of prolactin
- prolactin stimulates mammary gland development and milk production
Hyperprolactinemia is characterized by:
- inappropriate milk production
- disruption of menstrual cycle
- can lead to conception difficulty

8 Understanding transcription factors for prolactin production
“Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors.”
SEQUENCE: “Show me all genes that are homologous to known transcription factors”
EXPRESSION: “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells”
LITERATURE: “Show me all genes in the public literature that are putatively related to hyperprolactinemia”
Combined: (Q1 ∩ Q2 ∩ Q3)
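The full question on this slide is a conjunction of three sub-questions, each answered by a different kind of source, so its answer is the set of genes returned by all three. A minimal sketch of that combination step, assuming each source has already returned a set of gene identifiers (the identifiers and the `combine` helper are hypothetical, not from the talk):

```python
# Hypothetical results of the three sub-queries; the gene names are made up.
literature_hits = {"GENE_A", "GENE_B", "GENE_C"}   # related to the disease in the literature
expression_hits = {"GENE_B", "GENE_C", "GENE_D"}   # more than 3-fold expression differential
sequence_hits = {"GENE_B", "GENE_C", "GENE_E"}     # homologous to known transcription factors


def combine(*result_sets):
    """Return the identifiers that occur in every sub-query result."""
    if not result_sets:
        return set()
    return set.intersection(*(set(r) for r in result_sets))


answer = combine(literature_hits, expression_hits, sequence_hits)
print(sorted(answer))
```

The intersection itself is trivial; the hard part, and the point of the talk, is that the three sources must use the same identifier for the same gene before it means anything.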

9 The Industry’s Problem Too much unintegrated data: –from a variety of incompatible sources –no standard naming convention –each with a custom browsing and querying mechanism (no common interface) –and poor interaction with other data sources

10 ESTC Sept, 2008 Andy Law’s First Law “The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats.”

11 Andy Law’s Second Law “The second step in developing a new genetic analysis algorithm is to decide how to make the output data file format incompatible with all pre-existing analysis data file input formats.”

12 What are the Data Sources? Flat Files URLs Proprietary Databases Public Databases Data Marts Spreadsheets Emails …

13 Stitching this all together by hand? Source: Stephens et al. J Web Semantics 2006

14 Why would Semantic Web technology help?

15 Semantic Web Approach
1. Convert all data sources to RDF representation (local or distributed)
2. Optional: Collect the data in a scalable semantic repository
3. Apply light-weight reasoning to specify formal interpretations of the data, e.g.:
   - remove redundancy,
   - establish equalities, etc.
4. Derive new implicit knowledge
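Steps 1 and 2 can be illustrated with a few lines of plain Python. This is only a sketch under stated assumptions: the `urn:example:` identifiers, the field names, and the `record_to_triples` helper are invented for illustration, and a real deployment would use an RDF library and a triple store rather than a Python set.

```python
def record_to_triples(subject, record, vocabulary):
    """Turn one flat record into a set of (subject, property, value) triples."""
    return {
        (subject, vocabulary + field, value)
        for field, value in record.items()
        if value not in (None, "")  # skip empty cells
    }


VOCAB = "urn:example:vocab#"  # hypothetical vocabulary prefix

# Two hypothetical sources describing the same gene.
spreadsheet_rows = {"urn:example:gene:G1": {"symbol": "GENE_X", "chromosome": "6"}}
database_rows = {"urn:example:gene:G1": {"symbol": "GENE_X", "organism": "human", "alias": ""}}

# Step 2: collect everything in one repository (here simply a set of triples).
repository = set()
for source in (spreadsheet_rows, database_rows):
    for subject, record in source.items():
        repository |= record_to_triples(subject, record, VOCAB)

for triple in sorted(repository):
    print(triple)
```

Because the repository is a set, the `symbol` statement that both sources share is stored once, which is the simplest form of the "remove redundancy" item in step 3.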

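16 machine accessible meaning (What it’s like to be a machine) [labels set in a symbol font on the slide:] ναμε σψμπτομσ δρυγ αδμινιστρατιον <τρεατμεντ> <ναμε> <σψμπτομσ> <δρυγ> <αδμινιστρατιον> drug administration disease IS-A alleviates META-DATA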

17 What is meta-data? it's just data; it's data describing other data; it's meant for machine consumption. [figure labels: disease, name, symptoms, drug, administration]

18 Required are:
1. one or more standard vocabularies
   - so search engines, producers and consumers all speak the same language
2. a standard syntax,
   - so meta-data can be recognised as such
3. lots of resources with meta-data attached
mechanisms for attribution and trust

19 What are ontologies & what are they used for
no shared understanding
Conceptual and terminological confusion
Actors: both humans and machines
Agree on a conceptualization
Make it explicit in some language.
[figure labels: world, concept, language]

20 standard vocabularies (“Ontologies”)
Identify the key concepts in a domain
Identify a vocabulary for these concepts
Identify relations between these concepts
Make these precise enough so that they can be shared between
- humans and humans
- humans and machines
- machines and machines

21 Real life examples
handcrafted
- music: CDnow (2410/5), MusicMoz (1073/7)
- biomedical: SNOMED (200k), GO (15k), Emtree (45k+190k), Systems biology
ranging from lightweight
- Yahoo, UNSPSC, Open directory (400k)
to heavyweight (Cyc (300k))
ranging from small (METAR) to large (UNSPSC)

22 Biomedical ontologies (a few..)
MeSH
- Medical Subject Headings, National Library of Medicine
- 22.000 descriptions
EMTREE
- Commercial Elsevier, Drugs and diseases
- 45.000 terms, 190.000 synonyms
UMLS
- Integrates 100 different vocabularies
SNOMED
- 200.000 concepts, College of American Pathologists
Gene Ontology
- 15.000 terms in molecular biology
NCBI Cancer Ontology:
- 17,000 classes (about 1M definitions)

23 Remember “required are”:
✓ one or more standard vocabularies
   - so search engines, producers and consumers all speak the same language
2. a standard syntax,
   - so meta-data can be recognised as such
3. lots of resources with meta-data attached

24 Stack of languages

25 Bluffer’s guide to RDF (1)
Object --Attribute-> Value triples
objects are web-resources
Value is again an Object:
- triples can be linked
- data-model = graph
[figure: pers05 --Author-of-> ISBN... --Publ-by-> MIT]
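The "triples can be linked, data-model = graph" point can be made concrete with a small sketch. It stores the slide's two example triples as plain tuples and follows a chain of attributes through the resulting graph; the `follow` helper is a hypothetical name, and `ISBN...` is kept exactly as elided on the slide.

```python
from collections import defaultdict

# (object, attribute, value) triples taken from the slide's figure.
triples = [
    ("pers05", "Author-of", "ISBN..."),
    ("ISBN...", "Publ-by", "MIT"),
]

# Index the triples by their first element: this adjacency list is the graph.
graph = defaultdict(list)
for obj, attribute, value in triples:
    graph[obj].append((attribute, value))


def follow(graph, start, *path):
    """Follow a chain of attributes from `start`; return the set of values reached."""
    frontier = {start}
    for attribute in path:
        frontier = {
            value
            for node in frontier
            for (attr, value) in graph.get(node, [])
            if attr == attribute
        }
    return frontier


publishers = follow(graph, "pers05", "Author-of", "Publ-by")
print(publishers)
```

The second hop only works because the value of the first triple is itself an object with its own triples, which is the linking property the slide describes.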

26 What does RDF Schema add?
Defines vocabulary for RDF
Organizes this vocabulary in a typed hierarchy
Class, subClassOf, type
Property, subPropertyOf
domain, range
[figure: classes Person, Teacher, Student (subClassOf); individuals Frank and Marta (type); property supervises (domain, range)]

27 RDF Triples in Life Sciences

28 OWL: things RDF Schema can’t do
equality
enumeration
number restrictions
- Single-valued/multi-valued
- Optional/required values
inverse, symmetric, transitive
boolean algebra
- Union, complement
…

29 Web of Data: anybody can say anything about anything
All identifiers are URLs (= on the Web)
- Allows total decoupling of data, vocabulary, meta-data
[figure: x [IsOfType] T, with different owners & locations]

30 RDF(S) have a (very small) formal semantics Defines what other statements are implied by a given set of RDF(S) statements Ensures mutual agreement on minimal content between parties without further contact In the form of “entailment rules” Very simple to compute (and not explosive in practice)

31 RDF(S) semantics: examples
Aspirin isOfType Painkiller
Painkiller subClassOf Drug
⇒ Aspirin isOfType Drug
aspirin alleviates headache
alleviates range symptom
⇒ headache isOfType symptom

32 RDF(S) semantics: examples [the same examples, with the names set in a symbol font]
Ασπιριν isOfType Παινκιλλερ
Παινκιλλερ subClassOf Δρυγ
⇒ Ασπιριν isOfType Δρυγ
ασπιριν αλλεϖιατεσ ηεαδαχηε
αλλεϖιατεσ range σψμπτομ
⇒ ηεαδαχηε isOfType σψμπτομ

33 RDF(S) semantics
X R Y + R domain T ⇒ X IsOfType T
X R Y + R range T ⇒ Y IsOfType T
T1 SubClassOf T2 + T2 SubClassOf T3 ⇒ T1 SubClassOf T3
X IsOfType T1 + T1 SubClassOf T2 ⇒ X IsOfType T2
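The four entailment rules on this slide are simple enough to apply by brute force. The sketch below is a naive forward-chaining loop over tuples, written for readability rather than speed; it covers only these four rules, not the full RDF(S) rule set, and the function name `rdfs_closure` is an assumption of this example. The facts are the aspirin examples from the earlier slide.

```python
def rdfs_closure(triples):
    """Apply the four rules repeatedly until no new triple appears."""
    closure = set(triples)
    while True:
        new = set()
        for (s, p, o) in closure:
            for (s2, p2, o2) in closure:
                # X R Y + R domain T  =>  X isOfType T
                if p2 == "domain" and s2 == p:
                    new.add((s, "isOfType", o2))
                # X R Y + R range T  =>  Y isOfType T
                if p2 == "range" and s2 == p:
                    new.add((o, "isOfType", o2))
                # T1 subClassOf T2 + T2 subClassOf T3  =>  T1 subClassOf T3
                if p == "subClassOf" and p2 == "subClassOf" and o == s2:
                    new.add((s, "subClassOf", o2))
                # X isOfType T1 + T1 subClassOf T2  =>  X isOfType T2
                if p == "isOfType" and p2 == "subClassOf" and o == s2:
                    new.add((s, "isOfType", o2))
        if new <= closure:  # fixed point reached
            return closure
        closure |= new


facts = {
    ("Aspirin", "isOfType", "Painkiller"),
    ("Painkiller", "subClassOf", "Drug"),
    ("aspirin", "alleviates", "headache"),
    ("alleviates", "range", "symptom"),
}
inferred = rdfs_closure(facts) - facts
for triple in sorted(inferred):
    print(triple)
```

Each pass is quadratic in the number of triples, which is fine for a slide-sized example; repositories of the size mentioned later in the talk rely on indexed rule engines instead.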

34 OWL also has a formal semantics
Defines what other statements are implied by a given set of statements
Ensures mutual agreement on content (both minimal and maximal) between parties without further contact
Can be used for integrity/consistency checking
Hard to compute (and rarely/sometimes/always explosive in practice)

35 OWL semantics: minimal
vanGogh isOfType Impressionist
Impressionist subClassOf Painter
⇒ vanGogh isOfType Painter
vanGogh painter-of sunflowers
painter-of domain painter
⇒ vanGogh isOfType painter

36 OWL semantics: maximal
vanGogh isOfType Impressionist
Impressionist disjointFrom Cubist
⇒ NOT: vanGogh isOfType Cubist
painted-by has-cardinality 1
sun-flowers painted-by vanGogh
Picasso different-individual-from vanGogh
⇒ NOT: sun-flowers painted-by Picasso
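The "maximal" reading means some statements can be ruled out. The sketch below hard-codes just the two patterns shown on this slide (class disjointness and a cardinality-1 property combined with distinct individuals) and reports why a candidate statement would contradict the known facts. It is a toy check under those assumptions, not an OWL reasoner, and `contradictions` is a hypothetical helper name.

```python
def contradictions(facts, candidate):
    """Return the reasons why adding `candidate` would contradict `facts`."""
    s, p, o = candidate
    reasons = []

    def related(a, relation, b):
        # Treat the relation as symmetric, as disjointness and difference are.
        return (a, relation, b) in facts or (b, relation, a) in facts

    # Pattern 1: an individual cannot belong to two disjoint classes.
    if p == "isOfType":
        for (x, q, cls) in facts:
            if x == s and q == "isOfType" and related(cls, "disjointFrom", o):
                reasons.append(f"{s} is a {cls}, which is disjoint from {o}")

    # Pattern 2: a cardinality-1 property cannot hold two different values.
    if (p, "has-cardinality", 1) in facts:
        for (x, q, y) in facts:
            if x == s and q == p and related(y, "different-individual-from", o):
                reasons.append(f"{s} already has {p} {y}, and {y} differs from {o}")

    return reasons


facts = {
    ("vanGogh", "isOfType", "Impressionist"),
    ("Impressionist", "disjointFrom", "Cubist"),
    ("painted-by", "has-cardinality", 1),
    ("sun-flowers", "painted-by", "vanGogh"),
    ("Picasso", "different-individual-from", "vanGogh"),
}

print(contradictions(facts, ("vanGogh", "isOfType", "Cubist")))
print(contradictions(facts, ("sun-flowers", "painted-by", "Picasso")))
```

The `different-individual-from` fact matters: OWL makes no unique-name assumption, so without it a second `painted-by` value would lead a reasoner to conclude that Picasso and vanGogh are the same individual rather than to report a contradiction.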

37 Remember “required are”:
✓ one or more standard vocabularies
   - so search engines, producers and consumers all speak the same language
✓ a standard syntax,
   - so meta-data can be recognised as such
3. lots of resources with meta-data attached

38 Question: who writes the ontologies?
Professional bodies, scientific communities, companies, publishers, ….
See previous slide on Biomedical ontologies
- Same developments in many other fields
Good old fashioned Knowledge Engineering
Convert from DB-schema, UML, etc.

39 Question: Who writes the meta-data?
- Automated learning
- shallow natural language analysis
- Concept extraction
Example: Encyclopedia Britannica on “Amsterdam”
[figure: extracted concepts: amsterdam, trade, antwerp, europe, merchant, city, town, center, netherlands]

40 Remember “required are”
✓ one or more standard vocabularies
   - so search engines, producers and consumers all speak the same language
✓ a standard syntax,
   - so meta-data can be recognised as such
lots of resources with meta-data attached

41 How to handle multiple ontologies: ontology matching Linguistics & structure Shared vocabulary Instance-based matching Shared background knowledge

42 Matching through shared vocabulary

43 Matching through shared instances
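The slide gives only the title, so the following is one common way to realise instance-based matching, not necessarily the method shown in the talk: two concepts from different ontologies are proposed as a match when the sets of items annotated with them overlap strongly. The Jaccard measure, the 0.5 threshold, and all concept and document names are assumptions of this sketch.

```python
def jaccard(a, b):
    """Overlap of two sets: size of intersection divided by size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_by_instances(ontology1, ontology2, threshold=0.5):
    """Propose (concept1, concept2, score) pairs whose instance sets overlap enough."""
    matches = []
    for concept1, instances1 in ontology1.items():
        for concept2, instances2 in ontology2.items():
            score = jaccard(instances1, instances2)
            if score >= threshold:
                matches.append((concept1, concept2, score))
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Hypothetical annotations: concept -> documents tagged with that concept.
ontology1 = {
    "Painkiller": {"doc1", "doc2", "doc3"},
    "Antibiotic": {"doc4", "doc5"},
}
ontology2 = {
    "Analgesic": {"doc1", "doc2", "doc3", "doc6"},
    "Antibacterial": {"doc4", "doc5"},
}

for concept1, concept2, score in match_by_instances(ontology1, ontology2):
    print(f"{concept1} ~ {concept2} ({score:.2f})")
```

This only works when both ontologies have been used to annotate the same collection of items, which is why it is listed on the previous slide next to methods that need no shared instances.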

44 Matching using shared background knowledge [figure: ontology 1 and ontology 2, linked through shared background knowledge]

45 Some working examples?
Linked Life Data http://www.linkedlifedata.com
DOPE
HCLS http://www.w3.org/2001/sw/hcls/

46 Linked Life Data Overview
LinkedLifeData - statistics:
- Number of statements: 1,159,857,602
- Number of explicit statements: 403,361,589
- Number of entities: 128,948,564
Platform to automate the process:
- Infrastructure for storage and inference
- Transform the structured data sources to RDF
- Provide web interface to access the data
Currently operates over OWLIM semantic repository
Publicly available at: http://www.linkedlifedata.com

47 Light Weight Reasoning in Linked Life Data
Resolve the syntactic differences in the identifiers
Use relationships to derive new implicit knowledge
[figure: example graph with nodes urn:intact:1007, urn:uniprot:P104172, urn:uniprot:Protein, urn:biogrid:Interaction, urn:biogrid:15904, urn:biogrid:FBgn00134235, urn:biogrid:FBgn0068575, urn:pubmed:15904, urn:uniprot:FBgn0068575, urn:uniprot:FBgn00134235, urn:intact:Interaction, urn:uniprot:Q709356 and edges rdf:type, rdf:seeAlso, interactsWith, hasParticipant, sameAs]
These are only example resource names
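Both points on this slide can be sketched in a few lines. The example first merges identifiers linked by `sameAs` into one canonical name, then derives `interactsWith` between the participants of the same interaction. The resource names are taken from the slide's figure, but which of them the figure actually connects is not recoverable from the transcript, so the triples below and the derivation rule are assumptions.

```python
def find(parent, x):
    """Return the canonical representative of x (simple union-find lookup)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        x = parent[x]
    return x


def canonicalize(triples):
    """Merge sameAs-linked identifiers and rewrite all other triples accordingly."""
    parent = {}
    for s, p, o in triples:
        if p == "sameAs":
            root_s, root_o = find(parent, s), find(parent, o)
            parent[root_s] = root_o
    return {(find(parent, s), p, find(parent, o)) for s, p, o in triples if p != "sameAs"}


def derive_interactions(triples):
    """Assumed rule: two participants of one interaction interact with each other."""
    participants = {}
    for s, p, o in triples:
        if p == "hasParticipant":
            participants.setdefault(s, set()).add(o)
    return {
        (a, "interactsWith", b)
        for members in participants.values()
        for a in members
        for b in members
        if a != b
    }


# Assumed example triples built from the slide's resource names.
triples = [
    ("urn:biogrid:15904", "hasParticipant", "urn:biogrid:FBgn0068575"),
    ("urn:biogrid:15904", "hasParticipant", "urn:biogrid:FBgn00134235"),
    ("urn:biogrid:FBgn0068575", "sameAs", "urn:uniprot:FBgn0068575"),
    ("urn:biogrid:FBgn00134235", "sameAs", "urn:uniprot:FBgn00134235"),
]

derived = derive_interactions(canonicalize(triples))
for triple in sorted(derived):
    print(triple)
```

After canonicalization the derived statements use the `urn:uniprot:` names, so they join up with anything else the repository says about those proteins.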

48 Database | Dataset | Schema | Description
Uniprot | Curated entries | Original by the provider | Protein sequences and annotations
Entrez-Gene | Complete | Custom RDF schema | Genes and annotation
iProClass | Complete | Custom RDF schema | Protein cross-references
Gene Ontology | Complete | Schema by the provider | Gene and gene product annotation thesaurus
BioGRID | Complete | BioPAX 2.0 (custom generated) | Protein interactions extracted from the literature
NCI - Pathway Interaction Database | Complete | BioPAX 2.0 (original by the provider) | Human pathway interaction database
The Cancer Cell Map | Complete | BioPAX 2.0 (original by the provider) | Cancer pathways database
Reactome | Complete | BioPAX 2.0 (original by the provider) | Human pathways and interactions
BioCarta | Complete | BioPAX 2.0 (original by the provider) | Pathway database
KEGG | Complete | BioPAX 1.0 (original by the provider) | Molecular Interaction
BioCyc | Complete | BioPAX 1.0 (original by the provider) | Pathway database
NCBI Taxonomy | Complete | Custom RDF schema | Organisms

49 Some working examples?
Linked Life Data http://www.linkedlifedata.com
DOPE
HCLS http://www.w3.org/2001/sw/hcls/

50 The Data
Document repositories:
- ScienceDirect: approx. 500.000 fulltext articles
- MEDLINE: approx. 10.000.000 abstracts
Extracted Metadata
- The Collexis Metadata Server: concept-extraction ("semantic fingerprinting")
Thesauri and Ontologies
- EMTREE: 60.000 preferred terms, 200.000 synonyms

58 Summarising…
Data integration on the Web:
- machine processable data besides human processable data
Syntax for meta-data
- (not discussed in any detail)
Vocabularies for meta-data
- Lots of them in bio-inf.
Actual meta-data:
- Lots in bio-inf.
Will enable:
- Better search engines (recall, precision, concepts)
- Combining information across pages (inference)
- …

59 Things to do for you
Practical: Use existing software to construct new use scenarios
Conceptual: Create an ontology for some area of bio-medical expertise
- from scratch
- as a refinement of an existing ontology
Technical: Transform an existing data-set into meta-data format, and provide a query interface (for humans and machines)
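For the technical exercise, the smallest possible "query interface" over a set of triples is pattern matching with wildcards. The sketch below is one way to start, assuming the data-set has already been turned into (subject, property, value) tuples; the `match` function and the sample data are hypothetical, and a real interface would more likely expose a standard query language such as SPARQL.

```python
def match(triples, subject=None, prop=None, value=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (prop is None or p == prop)
        and (value is None or o == value)
    ]


# Hypothetical data-set reusing the vocabulary of the aspirin example.
dataset = [
    ("aspirin", "isOfType", "Painkiller"),
    ("aspirin", "alleviates", "headache"),
    ("ibuprofen", "isOfType", "Painkiller"),
    ("headache", "isOfType", "symptom"),
]

painkillers = [s for (s, _, _) in match(dataset, prop="isOfType", value="Painkiller")]
about_aspirin = match(dataset, subject="aspirin")
print(painkillers)
print(about_aspirin)
```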

