The Semantic Web: New-style data-integration (and how it works for life-scientists too!) Frank van Harmelen AI Department Vrije Universiteit Amsterdam
What’s the problem? (data-mess in bio-inf)
Source: PhRMA & FDA 2003 Pharmaceutical Productivity
The Industry’s Problem Too much unintegrated data: –from a variety of incompatible sources –no standard naming convention –each with a custom browsing and querying mechanism (no common interface) –and poor interaction with other data sources Kenneth Griffiths and Richard Resnick Tut. At Intell. Systems for Molec. Biol., 2003
What are the Data Sources? Flat Files URLs Proprietary Databases Public Databases Data Marts Spreadsheets s …
Sample Problem: Hyperprolactinemia Over production of prolactin –prolactin stimulates mammary gland development and milk production Hyperprolactinemia is characterized by: –inappropriate milk production –disruption of menstrual cycle –can lead to conception difficulty
Understanding transcription factors for prolactin production
“Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors.”
- SEQUENCE: “Show me all genes that are homologous to known transcription factors”
- EXPRESSION: “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells”
- LITERATURE: “Show me all genes in the public literature that are putatively related to hyperprolactinemia”
(Q1 ∩ Q2 ∩ Q3)
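The decomposition on this slide can be sketched as a set intersection: each sub-query is answered by a different kind of source, and the combined answer is whatever all three agree on. The sketch below is only an illustration. The gene identifiers are placeholders, not real findings, and the `integrate` helper is a hypothetical name. It also assumes all three sources already use the same gene names, which is exactly what the unintegrated sources described earlier do not provide.

```python
# Hypothetical result sets, one per sub-query. Identifiers are placeholders.
sequence = {"GENE_B", "GENE_C", "GENE_F"}                 # SEQUENCE
expression = {"GENE_B", "GENE_C", "GENE_E"}               # EXPRESSION
literature = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}     # LITERATURE


def integrate(*result_sets):
    """Return the genes that appear in every result set (the intersection)."""
    return set.intersection(*result_sets)


candidates = integrate(sequence, expression, literature)
print(sorted(candidates))
```

In practice the hard part is not the intersection itself but getting the three sources to name the same gene the same way.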
The Medical tower of Babel
MeSH
- Medical Subject Headings, National Library of Medicine
- descriptions
EMTREE
- Commercial Elsevier, Drugs and diseases
- terms, synonyms
UMLS
- Integrates 100 different vocabularies
SNOMED
- concepts, College of American Pathologists
Gene Ontology
- terms in molecular biology
NCI Cancer Ontology:
- 17,000 classes (about 1M definitions)
Stitching this all together by hand? Source: Stephens et al. J Web Semantics 2006
Why would Semantic technology help?
machine accessible meaning (What it’s like to be a machine) symptoms drug administration disease IS-A alleviates META-DATA
What is meta-data? it's just data it's data describing other data it's meant for machine consumption disease name symptoms drug administration
Required are:
1. one or more standard vocabularies
- so search engines, producers and consumers all speak the same language
2. a standard syntax,
- so meta-data can be recognised as such
3. lots of resources with meta-data attached
mechanisms for attribution and trust: is this page really about Pamela Anderson?
no shared understanding Conceptual and terminological confusion Actors: both humans and machines Agree on a conceptualization Make it explicit in some language. world concept language What are ontologies & what are they used for
standard vocabularies (“Ontologies”)
Identify the key concepts in a domain
Identify a vocabulary for these concepts
Identify relations between these concepts
Make these precise enough so that they can be shared between
- humans and humans
- humans and machines
- machines and machines
Biomedical ontologies (a few..)
MeSH
- Medical Subject Headings, National Library of Medicine
- descriptions
EMTREE
- Commercial Elsevier, Drugs and diseases
- terms, synonyms
UMLS
- Integrates 100 different vocabularies
SNOMED
- concepts, College of American Pathologists
Gene Ontology
- terms in molecular biology
NCI Cancer Ontology:
- 17,000 classes (about 1M definitions)
Remember “required are”:
✓ one or more standard vocabularies
- so search engines, producers and consumers all speak the same language
2. a standard syntax,
- so meta-data can be recognised as such
3. lots of resources with meta-data attached
Stack of languages
XML:
- Surface syntax, no semantics
XML Schema:
- Describes structure of XML documents
RDF:
- Datamodel for “relations” between “things”
RDF Schema:
- RDF Vocabulary Definition Language
OWL:
- A more expressive Vocabulary Definition Language
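To make the RDF and RDF Schema layers concrete, here is a minimal sketch in plain Python, not an RDF library. RDF statements are modelled as (subject, predicate, object) triples, and an RDF Schema style `rdfs:subClassOf` hierarchy lets a program infer types that were never stated explicitly. The `ex:` names, the class hierarchy, and the helper functions are assumptions made up for illustration.

```python
# Each statement is a (subject, predicate, object) triple.
# All "ex:" names are hypothetical and only illustrate the data model.
triples = {
    ("ex:Hyperprolactinemia", "rdf:type", "ex:EndocrineDisease"),
    ("ex:EndocrineDisease", "rdfs:subClassOf", "ex:Disease"),
    ("ex:Disease", "rdfs:subClassOf", "ex:MedicalCondition"),
    ("ex:DrugX", "ex:alleviates", "ex:Hyperprolactinemia"),
}


def superclasses(cls, graph):
    """All classes reachable from cls by following rdfs:subClassOf."""
    found, todo = set(), [cls]
    while todo:
        current = todo.pop()
        for s, p, o in graph:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                todo.append(o)
    return found


def types_of(resource, graph):
    """Asserted types plus the types implied by the subclass hierarchy."""
    asserted = {o for s, p, o in graph if s == resource and p == "rdf:type"}
    inferred = set()
    for cls in asserted:
        inferred |= superclasses(cls, graph)
    return asserted | inferred


print(sorted(types_of("ex:Hyperprolactinemia", triples)))
```

The point of the schema layer is visible in `types_of`: a query for every `ex:Disease` would find the resource even though no triple says so directly.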
Remember “required are”:
✓ one or more standard vocabularies
- so search engines, producers and consumers all speak the same language
✓ a standard syntax,
- so meta-data can be recognised as such
3. lots of resources with meta-data attached
Question: who writes the ontologies?
Professional bodies, scientific communities, companies, publishers, ….
See previous slide on Biomedical ontologies
- Same developments in many other fields
Good old fashioned Knowledge Engineering
Convert from DB-schema, UML, etc.
Question: Who writes the meta-data?
- Automated learning
- shallow natural language analysis
- Concept extraction
Example: Encyclopedia Britannica on “Amsterdam”
[Figure: extracted concepts: amsterdam, trade, antwerp, europe, amsterdam, merchant, city, town, center, netherlands, merchant, city, town]
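As a rough illustration of what shallow concept extraction can look like, the sketch below tokenises a text and counts how often terms from a controlled vocabulary occur, with no parsing involved. It is not the method used for the Britannica example. The vocabulary, the sample sentence, and the `extract_concepts` name are all assumptions made up for this sketch.

```python
import re
from collections import Counter

# Hypothetical controlled vocabulary: surface word -> concept.
VOCABULARY = {
    "amsterdam": "Amsterdam",
    "city": "City",
    "town": "City",
    "merchant": "Trade",
    "trade": "Trade",
    "netherlands": "Netherlands",
}


def extract_concepts(text, vocabulary=VOCABULARY):
    """Count vocabulary concepts mentioned in the text (shallow, no parsing)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(vocabulary[t] for t in tokens if t in vocabulary)


# A made-up sample sentence, not a quotation from any encyclopedia.
sample = ("Amsterdam is a city in the Netherlands. The town grew rich on trade, "
          "and merchant houses still line the city center.")
fingerprint = extract_concepts(sample)
print(fingerprint.most_common())
```

The resulting counts can be attached to the document as meta-data without anyone annotating it by hand.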
Question: Who writes the meta-data?
exploit existing legacy-data
- Databases
- Lab equipment
- (Amazon) side-effect from user interaction
- keyword extraction
NOT from manual effort
Remember “required are”
✓ one or more standard vocabularies
- so search engines, producers and consumers all speak the same language
✓ a standard syntax,
- so meta-data can be recognised as such
lots of resources with meta-data attached
Some working examples? DOPE
DOPE: Background
Vertical Information Provision
- Buy a topic instead of a Journal!
- Web provides new opportunities
Business driver: drug development
- Rich, information-hungry market
- Good thesaurus (EMTREE)
The Data
Document repositories:
- ScienceDirect: approx fulltext articles
- MEDLINE: approx abstracts
Extracted Metadata
- The Collexis Metadata Server: concept extraction ("semantic fingerprinting")
Thesauri and Ontologies
- EMTREE: preferred terms synonyms
RDF Schema EMTREE Query interface RDF Datasource 1 RDF Datasource n …. Architecture:
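The architecture on this slide, one query interface over several RDF data sources that share a vocabulary, can be sketched as follows. This is a simplified illustration, not the DOPE implementation. The document identifiers, the `dc:subject` predicate, the `term:` names, and the `query` function are assumptions chosen for the example.

```python
# Two hypothetical sources annotate their documents with terms from one
# shared thesaurus, so one triple pattern works across both of them.
source_1 = [
    ("doc:1", "dc:subject", "term:prolactin"),
    ("doc:2", "dc:subject", "term:aspirin"),
]
source_n = [
    ("doc:3", "dc:subject", "term:prolactin"),
]


def query(sources, s=None, p=None, o=None):
    """Match a triple pattern against every source; None acts as a wildcard."""
    pattern = (s, p, o)
    return [
        triple
        for source in sources
        for triple in source
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


hits = query([source_1, source_n], p="dc:subject", o="term:prolactin")
documents = sorted(triple[0] for triple in hits)
print(documents)
```

Because both sources use the same term identifiers, the query interface needs no per-source translation step.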
Ontology disambiguates query
Ontology groups results
Ontology clusters results
Ontology refines query
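Three of the roles named on these slides (disambiguating a query, grouping results, and refining a query) can be sketched with a tiny thesaurus of preferred terms, synonyms, and narrower terms. The entries below are illustrative only and are not taken from EMTREE. The function names are hypothetical, and clustering is left out of the sketch.

```python
from collections import defaultdict

# Hypothetical mini-thesaurus: synonym -> preferred term, and
# preferred term -> narrower preferred terms.
SYNONYMS = {"asa": "acetylsalicylic acid", "aspirin": "acetylsalicylic acid"}
NARROWER = {
    "analgesic agent": ["acetylsalicylic acid", "paracetamol"],
    "acetylsalicylic acid": [],
    "paracetamol": [],
}


def disambiguate(term):
    """Map a user's term to its preferred term (unchanged if no synonym matches)."""
    key = term.strip().lower()
    return SYNONYMS.get(key, key)


def refine(term):
    """Suggest narrower preferred terms the user could restrict the query to."""
    return NARROWER.get(disambiguate(term), [])


def group(annotated_results):
    """Group (document, term) pairs under the preferred term of each annotation."""
    groups = defaultdict(list)
    for document, term in annotated_results:
        groups[disambiguate(term)].append(document)
    return dict(groups)


print(disambiguate("Aspirin"))
print(refine("analgesic agent"))
print(group([("doc1", "Aspirin"), ("doc2", "ASA"), ("doc3", "paracetamol")]))
```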
Some working examples? DOPE HCLS (
RDF Schema EMTREE Query interface RDF Datasource 1 RDF Datasource n …. Architecture: RDF Schema Gene Ontology ….
Summarising…
Data integration on the Web:
- machine processable data besides human processable data
Syntax for meta-data
- Representation
- Inference
Vocabularies for meta-data
- Lots of them in bio-inf.
Actual meta-data:
- Lots in bio-inf.
Will enable:
- Better search engines (recall, precision, concepts)
- Combining information across pages (inference)
- …
Things to do for you
Practical: Use existing software to construct new use scenarios
Conceptual: Create an ontology for some area of bio-medical expertise
- from scratch
- as a refinement of an existing ontology
Technical: Transform an existing data-set into meta-data format, and provide a query interface (for humans and machines)
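For the technical task, one possible starting point is sketched below: a small tabular data-set is turned into triples, serialised in an N-Triples style for machines, and exposed through a simple lookup function. The CSV content, the `http://example.org/` namespace, and all function names are assumptions for illustration. The serialiser does not escape special characters in literals, which a real converter would need to do.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace

# Made-up sample data; identifiers are placeholders.
CSV_DATA = """gene,organism,function
GENE_A,human,transcription factor
GENE_B,mouse,kinase
"""


def rows_to_triples(csv_text, key="gene"):
    """Turn each non-key cell into a (subject, predicate, literal) triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = BASE + "gene/" + row[key]
        for column, value in row.items():
            if column != key:
                triples.append((subject, BASE + "prop/" + column, value))
    return triples


def to_ntriples(triples):
    """Serialise triples one per line (no escaping of literals; sketch only)."""
    return "\n".join(f'<{s}> <{p}> "{o}" .' for s, p, o in triples)


def find(triples, predicate, value):
    """Minimal query interface: subjects with the given property value."""
    return [s for s, p, o in triples if p == BASE + "prop/" + predicate and o == value]


triples = rows_to_triples(CSV_DATA)
print(to_ntriples(triples))
print(find(triples, "organism", "mouse"))
```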