1
The Semantic Web: New-style data-integration (and how it works for life-scientists too!)
Frank van Harmelen
AI Department, Vrije Universiteit Amsterdam
2
What’s the problem? (data-mess in bio-inf)
3
Life Science Data
Recent focus on genetic data
“genomics: the study of genes and their function. Recent advances in genomics are bringing about a revolution in our understanding of the molecular mechanisms of disease, including the complex interplay of genetic and environmental factors. Genomics is also stimulating the discovery of breakthrough healthcare products by revealing thousands of new biological targets for the development of drugs, and by giving scientists innovative ways to design new drugs, vaccines and DNA diagnostics. Genomics-based therapeutics include "traditional" small chemical drugs, protein drugs, and potentially gene therapy.”
The Pharmaceutical Research and Manufacturers of America - http://www.phrma.org/genomics/lexicon/g.html
- Study of genes and their function
- Understanding molecular mechanisms of disease
- Development of drugs, vaccines, and diagnostics
Kenneth Griffiths and Richard Resnick, Tut. at Intell. Systems for Molec. Biol., 2003
4
The Study of Genes...
- Chromosomal location
- Sequence
- Sequence variation
- Splicing
- Protein sequence
- Protein structure
5
… and Their Function Homology Motifs Publications Expression HTS In Vivo/Vitro Functional Characterization
6
Understanding Mechanisms of Disease Metabolic and regulatory pathway induction
7
Development of Drugs, Vaccines, Diagnostics
Differing types of drugs, vaccines, and diagnostics:
- Small molecules
- Protein therapeutics
- Gene therapy
- In vitro, in vivo diagnostics
Development requires:
- Preclinical research
- Clinical trials
- Long-term clinical research
All of which often feeds back into ongoing genomics research and discovery.
8
The Industry’s Problem
Too much unintegrated data:
- from a variety of incompatible sources
- no standard naming convention
- each with a custom browsing and querying mechanism (no common interface)
- and poor interaction with other data sources
9
What are the Data Sources?
- Flat files
- URLs
- Proprietary databases
- Public databases
- Data marts
- Spreadsheets
- Emails
- …
10
Sample Problem: Hyperprolactinemia
Overproduction of prolactin
- prolactin stimulates mammary gland development and milk production
Hyperprolactinemia is characterized by:
- inappropriate milk production
- disruption of menstrual cycle
- can lead to conception difficulty
11
Understanding transcription factors for prolactin production
“Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors.”
- Q1 (SEQUENCE): “Show me all genes that are homologous to known transcription factors”
- Q2 (EXPRESSION): “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells”
- Q3 (LITERATURE): “Show me all genes in the public literature that are putatively related to hyperprolactinemia”
Combined: (Q1 ∩ Q2 ∩ Q3)
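The combined question is the intersection of the three sub-queries. A minimal sketch in Python, assuming each data source has already returned a set of gene identifiers; the identifiers and result sets are invented for illustration and are not real findings.

```python
# Hypothetical result sets, one per data source (placeholder identifiers only).
q_sequence = {"GENE_A", "GENE_B", "GENE_C"}    # homologous to known transcription factors
q_expression = {"GENE_B", "GENE_C", "GENE_D"}  # more than 3-fold expression differential
q_literature = {"GENE_B", "GENE_C", "GENE_E"}  # linked to hyperprolactinemia in the literature

# The overall question is the conjunction of the three sub-queries,
# which for sets of identifiers is plain set intersection.
candidates = q_sequence & q_expression & q_literature

print(sorted(candidates))
```

The intersection itself is trivial; the hard part in practice is that the three sources rarely use the same identifiers for the same gene, which is the integration problem the rest of the talk addresses.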
12
The Complexity of Biological Data
13
Source: PhRMA & FDA 2003 Pharmaceutical Productivity
14
Stitching this all together by hand? Source: Stephens et al. J Web Semantics 2006
15
The Medical tower of Babel
MeSH
- Medical Subject Headings, National Library of Medicine
- 22,000 descriptions
EMTREE
- Commercial (Elsevier), drugs and diseases
- 45,000 terms, 190,000 synonyms
UMLS
- Integrates 100 different vocabularies
SNOMED
- 200,000 concepts, College of American Pathologists
Gene Ontology
- 15,000 terms in molecular biology
NCI Cancer Ontology
- 17,000 classes (about 1M definitions)
16
Problem with the Current WWW
17
Why would Semantic Web technology help?
18
machine accessible meaning (What it’s like to be a machine)
[Diagram labels (META-DATA): symptoms, drug administration, disease; relations: IS-A, alleviates]
19
What is meta-data?
- it's just data
- it's data describing other data
- it's meant for machine consumption
disease name symptoms drug administration
20
Required are:
1. one or more standard vocabularies
   - so search engines, producers and consumers all speak the same language
2. a standard syntax
   - so meta-data can be recognised as such
3. lots of resources with meta-data attached
mechanisms for attribution and trust
- is this page really about Pamela Anderson?
21
What are ontologies & what are they used for
- no shared understanding
- Conceptual and terminological confusion
- Actors: both humans and machines
- Agree on a conceptualization
- Make it explicit in some language
[Diagram labels: world, concept, language]
22
standard vocabularies (“Ontologies”)
- Identify the key concepts in a domain
- Identify a vocabulary for these concepts
- Identify relations between these concepts
- Make these precise enough so that they can be shared between
  - humans and humans
  - humans and machines
  - machines and machines
23
Shared content-vocabularies: Ontologies
Formal, explicit specification of a shared conceptualisation
- Abstract model of some domain
- Consensual knowledge
- concepts, properties, relations, functions
- machine processable
24
Real life examples
handcrafted
- music: CDnow (2410/5), MusicMoz (1073/7)
- biomedical: SNOMED (200k), GO (15k), Emtree (45k+190k), Systems biology
ranging from lightweight
- Yahoo, UNSPSC, Open Directory (400k)
to heavyweight (Cyc (300k))
ranging from small (METAR) to large (UNSPSC)
25
Biomedical ontologies (a few..)
MeSH
- Medical Subject Headings, National Library of Medicine
- 22,000 descriptions
EMTREE
- Commercial (Elsevier), drugs and diseases
- 45,000 terms, 190,000 synonyms
UMLS
- Integrates 100 different vocabularies
SNOMED
- 200,000 concepts, College of American Pathologists
Gene Ontology
- 15,000 terms in molecular biology
NCI Cancer Ontology
- 17,000 classes (about 1M definitions)
26
What’s inside an ontology?
- terms + specialisation hierarchy
- classes + class-hierarchy
- instances
- slots/values
- inheritance (multiple? defaults?)
- restrictions on slots (type, cardinality)
- properties of slots (symm., trans., …)
- relations between classes (disjoint, covers)
- reasoning tasks: classification, subsumption
Increasing semantic “weight”
27
NB: we’re not doing philosophy
- Ontologies are not definitive descriptions of what exists in the world (= philosophy)
- Ontologies are models of the world constructed to facilitate communication
- Yes, ontologies exist (because we build them)
28
Remember “required are”:
✓ one or more standard vocabularies
   - so search engines, producers and consumers all speak the same language
2. a standard syntax
   - so meta-data can be recognised as such
3. lots of resources with meta-data attached
29
Stack of languages
30
XML:
- Surface syntax, no semantics
XML Schema:
- Describes structure of XML documents
RDF:
- Datamodel for “relations” between “things”
RDF Schema:
- RDF Vocabulary Definition Language
OWL:
- A more expressive Vocabulary Definition Language
31
Why XML
Structuring data in documents on the internet
HTML not meant to store data:
The Netherlands
Geography
Capital: Amsterdam (The Hague is the seat of the government)
Neighboring countries: Germany, Belgium
32
Why XML - 2
Humans understand information written in HTML
Computers cannot “work” with it:
- meaning of pieces of text, and their relation?
XML makes this partly explicit by:
- giving possibly meaningful names to tags
- allowing the nesting of tags (tags inside tags)
[XML example, tags lost in extraction: The Hague is the seat of the government; Germany; Belgium]
33
XML data model
document is an ordered, labeled tree, with nodes to represent the document entity, elements, attributes, processing instructions, and comments
[Tree diagram. Node labels: country, geography, capital, name, remark, neighboring country, comment. Values: The Netherlands, Amsterdam, The Hague is the seat of the government, Germany, Belgium, Should be...]
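To make the "ordered, labeled tree" concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The markup is a guess at what the slide's example looked like: the `name` attributes on `country` and `capital` follow the DTD fragment in the slides, while the element names `geography`, `remark` and `neighbor` are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the slide's example; element names are assumptions.
DOC = """
<country name="The Netherlands">
  <geography>
    <capital name="Amsterdam">
      <remark>The Hague is the seat of the government</remark>
    </capital>
    <neighbor>Germany</neighbor>
    <neighbor>Belgium</neighbor>
  </geography>
</country>
"""

root = ET.fromstring(DOC.strip())

# Navigate the tree by tag name instead of guessing from page layout.
capital = root.find("./geography/capital")
capital_name = capital.get("name")       # attribute node
remark = capital.findtext("remark")      # text of a nested element
neighbors = [n.text for n in root.iter("neighbor")]  # document order is preserved

print(root.get("name"), capital_name, neighbors)
```

The tags tell a program which string is the capital and which are neighbors, but the names themselves carry no agreed meaning, which is why the talk describes XML as surface syntax without semantics.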
34
Structuring methods
DTDs
- document type definitions
- traditional; inherited from SGML
- PCDATA = parsed character data
- no other datatypes
<!ATTLIST country name CDATA #REQUIRED>
<!ATTLIST capital name CDATA #REQUIRED>
….
35
Structuring methods - 2
XML Schema
- quite new (Rec. 02 May 2001)
- same function as DTD: prescribes structure
but has some advantages:
- XML Schema is XML itself
- simple datatyping
- richer grammar
- type hierarchy with derivation
36
XML Schema - richer grammar
- content models: grouping, by choice, sequence or all
- cardinality: attributes minOccurs, maxOccurs
- defaults and constants: attributes default, fixed
<element name="version" type="string" minOccurs="0" maxOccurs="1" default="W98"/>
<element name="includedBrowser" type="string" minOccurs="0" maxOccurs="1" fixed="Internet Explorer"/>
37
Stack of languages
XML:
- Surface syntax, no semantics
XML Schema:
- Describes structure of XML documents
RDF:
- Datamodel for “relations” between “things”
RDF Schema:
- RDF Vocabulary Definition Language
OWL:
- A more expressive Vocabulary Definition Language
38
RDF Triples in Life Sciences
39
Bluffer’s guide to RDF (1)
- Object --Attribute-> Value triples
- objects are web-resources
- Value is again an Object:
  - triples can be linked
  - data-model = graph
[Diagram: pers05 --Author-of-> ISBN... --Publ-by-> MIT]
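A minimal sketch of the triple model in plain Python, with no RDF library. The `http://example.org/` namespace and the identifier `book1` (standing in for the elided ISBN on the slide) are assumptions for illustration.

```python
# Triples are (object, attribute, value); every identifier is a URL.
EX = "http://example.org/"  # hypothetical namespace

triples = [
    (EX + "pers05", EX + "author-of", EX + "book1"),
    (EX + "book1", EX + "publ-by", EX + "MIT"),
]

def values(subject, attribute):
    """Return all values reachable from `subject` via `attribute`."""
    return [o for s, p, o in triples if s == subject and p == attribute]

# Because a value is itself an object, triples chain into a graph:
# follow author-of, then publ-by, to find the publishers of pers05's books.
publishers = [
    publisher
    for book in values(EX + "pers05", EX + "author-of")
    for publisher in values(book, EX + "publ-by")
]

print(publishers)
```

Since the identifiers are globally unique URLs, triples from independent sources can be concatenated into one list and still join up wherever they mention the same resource.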
40
Bluffer’s guide to RDF (2)
- Every identifier is a URL = world-wide unique naming!
- Has XML syntax
- Any statement can be an object
- graphs can be nested
[Diagram: NYT --claims-> (pers05 --Author-of-> ISBN...)]
41
What does RDF Schema add?
- Defines vocabulary for RDF
- Organizes this vocabulary in a typed hierarchy
- Class, subClassOf, type
- Property, subPropertyOf
- domain, range
[Diagram: classes Person, Teacher, Student (subClassOf); instances Marta, Frank (type); property supervises (domain, range)]
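What the schema buys is inference. A minimal sketch of three RDFS-style rules (domain, range, subClassOf) in plain Python. The direction of the diagram is an assumption: here `supervises` has domain Teacher and range Student, and Frank supervises Marta.

```python
# Schema (assumed reading of the slide's diagram).
sub_class_of = {"Teacher": "Person", "Student": "Person"}
domain = {"supervises": "Teacher"}
range_ = {"supervises": "Student"}

# Instance data: a single asserted fact, with no explicit type statements.
facts = {("Frank", "supervises", "Marta")}

def infer_types(facts):
    """Derive (individual, class) pairs from domain, range and subClassOf."""
    types = set()
    for s, p, o in facts:
        if p in domain:
            types.add((s, domain[p]))   # subject belongs to the property's domain
        if p in range_:
            types.add((o, range_[p]))   # object belongs to the property's range
    # Propagate membership up the class hierarchy until nothing new appears.
    changed = True
    while changed:
        changed = False
        for individual, cls in list(types):
            parent = sub_class_of.get(cls)
            if parent and (individual, parent) not in types:
                types.add((individual, parent))
                changed = True
    return types

types = infer_types(facts)
print(sorted(types))
```

From one asserted triple the schema yields four type statements, none of which anyone had to write down by hand.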
42
Stack of languages
XML:
- Surface syntax, no semantics
XML Schema:
- Describes structure of XML documents
RDF:
- Datamodel for “relations” between “things”
RDF Schema:
- RDF Vocabulary Definition Language
OWL:
- A more expressive Vocabulary Definition Language
43
OWL: things RDF Schema can’t do
- equality
- enumeration
- number restrictions
  - Single-valued/multi-valued
  - Optional/required values
- inverse, symmetric, transitive
- boolean algebra
  - Union, complement
- …
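The property characteristics listed above (inverse, symmetric, transitive) each license a simple kind of inference. A minimal sketch over sets of pairs in plain Python; the property names and individuals are hypothetical, and a real OWL reasoner does far more than this.

```python
def transitive_closure(pairs):
    """If (a, b) and (b, c) hold, add (a, c); repeat until stable."""
    closure = set(pairs)
    while True:
        new = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not new:
            return closure
        closure |= new

def symmetric_closure(pairs):
    """If (a, b) holds, (b, a) holds as well."""
    return set(pairs) | {(b, a) for a, b in pairs}

def inverse(pairs):
    """Extension of the inverse property: every pair flipped."""
    return {(b, a) for a, b in pairs}

# Hypothetical properties and individuals.
part_of = {("nucleus", "cell"), ("cell", "tissue")}   # declared transitive
interacts_with = {("proteinA", "proteinB")}           # declared symmetric

part_of_inferred = transitive_closure(part_of)
interacts_inferred = symmetric_closure(interacts_with)
has_part = inverse(part_of_inferred)                  # declared inverse of part_of

print(sorted(part_of_inferred))
print(sorted(has_part))
```

Declaring a property transitive once in the ontology replaces an open-ended number of explicit statements in the data.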
44
OWL: more expressivity
[Diagram: nested sublanguages, Full contains DL contains Lite]
OWL Full
- Allow meta-classes etc
OWL DL
- Negation
- Disjunction
- Full cardinality
- Enumerated types
OWL Lite
- (sub)classes, individuals
- (sub)properties, domain, range
- conjunction
- (in)equality
- cardinality 0/1
- datatypes
- inverse, transitive, symmetric
- hasValue
- someValuesFrom
- allValuesFrom
RDF Schema
45
Remember “required are”:
✓ one or more standard vocabularies
   - so search engines, producers and consumers all speak the same language
✓ a standard syntax
   - so meta-data can be recognised as such
3. lots of resources with meta-data attached
46
Question: who writes the ontologies?
- Professional bodies, scientific communities, companies, publishers, ….
- See previous slide on Biomedical ontologies
  - Same developments in many other fields
- Good old fashioned Knowledge Engineering
- Convert from DB-schema, UML, etc.
47
Question: Who writes the meta-data?
- Automated learning
- shallow natural language analysis
- Concept extraction
Example: Encyclopedia Britannica on “Amsterdam”
[Extracted concepts: amsterdam, trade, antwerp, europe, merchant, city, town, center, netherlands]
48
Question: Who writes the meta-data?
- exploit existing legacy-data
  - Amazon
  - Lab equipment?
- side-effect from user interaction
  - MIT Lab photo-annotator
- NOT from manual effort
- Web 2.0 community/social interaction
49
Remember “required are”
✓ one or more standard vocabularies
   - so search engines, producers and consumers all speak the same language
✓ a standard syntax
   - so meta-data can be recognised as such
lots of resources with meta-data attached
50
Some working examples?
- DOPE
- HCLS (http://www.w3.org/2001/sw/hcls/)
51
DOPE: Background
Vertical Information Provision
- Buy a topic instead of a Journal!
- Web provides new opportunities
Business driver: drug development
- Rich, information-hungry market
- Good thesaurus (EMTREE)
52
The Data
Document repositories:
- ScienceDirect: approx. 500,000 fulltext articles
- MEDLINE: approx. 10,000,000 abstracts
Extracted Metadata
- The Collexis Metadata Server: concept-extraction ("semantic fingerprinting")
Thesauri and Ontologies
- EMTREE: 60,000 preferred terms, 200,000 synonyms
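A thesaurus with preferred terms and synonyms lets differently worded documents and queries meet on one concept. A minimal sketch of that lookup in plain Python; the tiny thesaurus below is invented for illustration and is not taken from EMTREE.

```python
# Hypothetical mini-thesaurus: preferred term -> synonyms.
thesaurus = {
    "hyperprolactinemia": ["hyperprolactinaemia", "prolactin excess"],
    "prolactin": ["PRL"],
}

# Build a case-insensitive index from every surface form to its preferred term.
to_preferred = {}
for preferred, synonyms in thesaurus.items():
    to_preferred[preferred.lower()] = preferred
    for synonym in synonyms:
        to_preferred[synonym.lower()] = preferred

def normalize(term):
    """Map a surface form to its preferred term, or None if unknown."""
    return to_preferred.get(term.lower())

print(normalize("PRL"))
print(normalize("Hyperprolactinaemia"))
```

Indexing documents and queries by the normalized term means a search for one spelling also finds documents that use another.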
53
Architecture:
[Diagram labels: Query interface; RDF Schema (EMTREE); RDF Datasource 1 …. RDF Datasource n]
54
[Architecture diagram. Components: GUI: Spectacle (Aduna); Java Client; Mediator: Sesame (Aduna); Metadata Server (Collexis); EMTREE Thesaurus (RDFS); Document Model (RDFS); Source Model (RDF); Additional Source of Data with its own Source Model (RDF); Gene Thesaurus (RDFS). Connections: http requests, SOAP, SeRQL]
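The mediator's role in such an architecture can be sketched in a few lines: send one query pattern to every source and merge the answers. This is an illustrative toy in plain Python, not the Sesame or SeRQL API; the class names, document identifiers and the `about` property are all hypothetical.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

class Source:
    """An in-memory stand-in for one RDF data source."""
    def __init__(self, name: str, triples: List[Triple]):
        self.name = name
        self.triples = triples

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> List[Triple]:
        # None acts as a wildcard for that position.
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

class Mediator:
    """Fans a pattern out to all sources and merges results without duplicates."""
    def __init__(self, sources: List[Source]):
        self.sources = sources

    def match(self, s=None, p=None, o=None) -> List[Triple]:
        seen, merged = set(), []
        for source in self.sources:
            for triple in source.match(s, p, o):
                if triple not in seen:
                    seen.add(triple)
                    merged.append(triple)
        return merged

source1 = Source("repo1", [("doc1", "about", "prolactin"), ("doc2", "about", "dopamine")])
source2 = Source("repo2", [("doc3", "about", "prolactin"), ("doc1", "about", "prolactin")])
mediator = Mediator([source1, source2])

hits = mediator.match(p="about", o="prolactin")
print(hits)
```

Adding another data source then means adding one more `Source` to the list; the client's query does not change, which is the point of putting a shared data model between the GUI and the repositories.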
66
Some working examples?
- DOPE
- Community analysis: http://flink.semanticweb.org
67
Author teams In HIV research?
68
Some working examples?
- DOPE
- Community analysis: http://flink.semanticweb.org
- Biological pathway database: http://pkb.stanford.edu/
69
Source: http://pkb.stanford.edu/ Stanford University Use Case
70
Summarising…
Data integration on the Web:
- machine processable data besides human processable data
Syntax for meta-data
- XML (not much meaning)
- RDF (some meaning)
- RDF Schema (some meaning)
- OWL (more meaning)
Vocabularies for meta-data
- Lots of them in bio-inf.
Actual meta-data:
- Lots in bio-inf.
Will enable:
- Better search engines (recall, precision, concepts)
- Combining information across pages (inference)
- …
71
Things to do for you
Practical: Use existing software to construct new use scenarios
Conceptual: Create an ontology for some area of bio-medical expertise
- from scratch
- as a refinement of an existing ontology
Technical: Transform an existing data-set into meta-data format, and provide a query interface (for humans and machines)