The Semantic Web: New-style data-integration (and how it works for life-scientists too!) Frank van Harmelen AI Department Vrije Universiteit Amsterdam.

Presentation transcript:

The Semantic Web: New-style data-integration (and how it works for life-scientists too!) Frank van Harmelen AI Department Vrije Universiteit Amsterdam

What’s the problem? (data-mess in bio-inf)

The Study of Genes...
- Chromosomal location
- Sequence
- Sequence Variation
- Splicing
- Protein Sequence
- Protein Structure

… and Their Function
- Homology
- Motifs
- Publications
- Expression
- HTS
- In Vivo/Vitro Functional Characterization

Understanding Mechanisms of Disease
- Metabolic and regulatory pathway induction

Development of Drugs, Vaccines, Diagnostics
Differing types of Drugs, Vaccines, and Diagnostics:
- Small molecules
- Protein therapeutics
- Gene therapy
- In vitro, in vivo diagnostics
Development requires:
- Preclinical research
- Clinical trials
- Long-term clinical research
All of which often feeds back into ongoing Genomics research and discovery.

Sample Problem: Hyperprolactinemia
Overproduction of prolactin
- prolactin stimulates mammary gland development and milk production
Hyperprolactinemia is characterized by:
- inappropriate milk production
- disruption of the menstrual cycle
- can lead to conception difficulty

Understanding transcription factors for prolactin production
“Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors.”
This breaks down into three sub-queries, one per kind of data source:
- SEQUENCE: “Show me all genes that are homologous to known transcription factors”
- EXPRESSION: “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells”
- LITERATURE: “Show me all genes in the public literature that are putatively related to hyperprolactinemia”
The answer is the intersection of the three result sets: (Q1 ∩ Q2 ∩ Q3)
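The combination of the three sub-queries can be sketched as a plain set intersection. This is only an illustration: the gene identifiers are hypothetical, and the sketch assumes that all three sources already use the same identifier for the same gene, which is exactly what the following slides say is missing in practice.

```python
# Hypothetical result sets; none of these identifiers come from the talk.
sequence_hits = {"geneA", "geneB", "geneC"}    # homologous to known transcription factors
expression_hits = {"geneB", "geneC", "geneD"}  # more than 3-fold expression differential
literature_hits = {"geneC", "geneB", "geneE"}  # putatively related in the literature


def combined_query(*result_sets):
    """Return the identifiers that occur in every result set."""
    if not result_sets:
        return set()
    return set.intersection(*map(set, result_sets))


candidates = combined_query(sequence_hits, expression_hits, literature_hits)
```

If the sources named the same gene differently, the intersection would come out empty even though the biology overlaps.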

The Industry’s Problem
Too much unintegrated data:
- from a variety of incompatible sources
- no standard naming convention
- each with a custom browsing and querying mechanism (no common interface)
- and poor interaction with other data sources

ESTC Sept, 2008
Andy Law’s First Law
“The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats.”

Andy Law’s Second Law
“The second step in developing a new genetic analysis algorithm is to decide how to make the output data file format incompatible with all pre-existing analysis data file input formats.”

What are the Data Sources? Flat Files URLs Proprietary Databases Public Databases Data Marts Spreadsheets s …

Stitching this all together by hand? Source: Stephens et al. J Web Semantics 2006

Why would Semantic Web technology help?

Semantic Web Approach
1. Convert all data sources to RDF representation (local or distributed)
2. Optional: collect the data in a scalable semantic repository
3. Apply light-weight reasoning to specify formal interpretations of the data, e.g.:
   - remove redundancy,
   - establish equalities, etc.
4. Derive new implicit knowledge
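Step 1 can be sketched for a single tabular record as follows. This is a minimal illustration, not the conversion used by any particular tool: the `urn:example:` identifiers, the column names and the helper `row_to_triples` are all made up for the example.

```python
def row_to_triples(subject_uri, row, namespace):
    """Turn one record (a dict of column -> value) into
    (subject, predicate, object) triples, skipping empty cells."""
    return [
        (subject_uri, namespace + column, value)
        for column, value in row.items()
        if value is not None
    ]


# Hypothetical record from a spreadsheet or flat file.
row = {"name": "prolactin", "organism": "human", "note": None}
triples = row_to_triples("urn:example:gene:1", row, "urn:example:vocab:")
```

Each column becomes a predicate and each cell a value, so records from differently shaped sources end up in one uniform triple format that can be pooled.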

machine accessible meaning (What it’s like to be a machine)
[Diagram: META-DATA labels such as drug, administration and disease, connected by the relations IS-A and alleviates]

What is meta-data?
- it's just data
- it's data describing other data
- it's meant for machine consumption
[Diagram labels: disease, name, symptoms, drug, administration]

Required are:
1. one or more standard vocabularies, so search engines, producers and consumers all speak the same language
2. a standard syntax, so meta-data can be recognised as such
3. lots of resources with meta-data attached
- mechanisms for attribution and trust

What are ontologies & what are they used for
- no shared understanding
- Conceptual and terminological confusion
- Actors: both humans and machines
- Agree on a conceptualization
- Make it explicit in some language
[Diagram labels: world, concept, language]

standard vocabularies (“Ontologies”)
- Identify the key concepts in a domain
- Identify a vocabulary for these concepts
- Identify relations between these concepts
- Make these precise enough so that they can be shared between
  - humans and humans
  - humans and machines
  - machines and machines

Real life examples
- handcrafted
  - music: CDnow (2410/5), MusicMoz (1073/7)
  - biomedical: SNOMED (200k), GO (15k), Emtree (45k+190k), Systems biology
- ranging from lightweight (Yahoo, UNSPSC, Open directory (400k)) to heavyweight (Cyc (300k))
- ranging from small (METAR) to large (UNSPSC)

Biomedical ontologies (a few..)
- MeSH
  - Medical Subject Headings, National Library of Medicine
  - descriptions
- EMTREE
  - Commercial (Elsevier), drugs and diseases
  - terms, synonyms
- UMLS
  - Integrates 100 different vocabularies
- SNOMED
  - concepts, College of American Pathologists
- Gene Ontology
  - terms in molecular biology
- NCBI Cancer Ontology
  - 17,000 classes (about 1M definitions)

Remember “required are”:
✓ 1. one or more standard vocabularies, so search engines, producers and consumers all speak the same language
2. a standard syntax, so meta-data can be recognised as such
3. lots of resources with meta-data attached

Stack of languages

Bluffer’s guide to RDF (1)
- Object --Attribute-> Value triples
- objects are web-resources
- Value is again an Object:
  - triples can be linked
  - data-model = graph
[Diagram: nodes pers05, ISBN... and MIT, linked by Author-of and Publ-by edges]
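The "triples can be linked, data-model = graph" idea can be sketched with plain tuples. The edge directions below (pers05 is Author-of the ISBN resource, which is Publ-by MIT) are an assumption read off the slide's diagram labels, the ISBN stays elided as on the slide, and the helper names `objects` and `follow` are made up for the illustration.

```python
# Each triple is (object, attribute, value); a value can itself be the
# object of another triple, which is what turns the triples into a graph.
triples = [
    ("pers05", "Author-of", "ISBN..."),
    ("ISBN...", "Publ-by", "MIT"),
]


def objects(triples, subject, predicate):
    """All values reachable from `subject` over one `predicate` edge."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}


def follow(triples, start, path):
    """Follow a sequence of predicates through the graph, starting at `start`."""
    frontier = {start}
    for predicate in path:
        frontier = {o for node in frontier for o in objects(triples, node, predicate)}
    return frontier


publishers = follow(triples, "pers05", ["Author-of", "Publ-by"])
```

Following two edges answers "who published what pers05 wrote" without either triple mentioning the other.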

What does RDF Schema add?
- Defines vocabulary for RDF
- Organizes this vocabulary in a typed hierarchy
  - Class, subClassOf, type
  - Property, subPropertyOf
  - domain, range
[Diagram: classes Person, Teacher and Student linked by subClassOf; individuals Marta and Frank linked to classes by type; the property supervises with its domain and range]

RDF Triples in Life Sciences

OWL: things RDF Schema can’t do
- equality
- enumeration
- number restrictions
  - Single-valued/multi-valued
  - Optional/required values
- inverse, symmetric, transitive
- boolean algebra
  - Union, complement
- …

Web of Data: anybody can say anything about anything
- All identifiers are URLs (= on the Web)
  - Allows total decoupling of data, vocabulary and meta-data
[Diagram: x [IsOfType] T, with data, vocabulary and meta-data held by different owners & locations]

RDF(S) has a (very small) formal semantics
- Defines what other statements are implied by a given set of RDF(S) statements
- Ensures mutual agreement on minimal content between parties without further contact
- In the form of “entailment rules”
- Very simple to compute (and not explosive in practice)

RDF(S) semantics: examples
- Aspirin isOfType Painkiller + Painkiller subClassOf Drug ⇒ Aspirin isOfType Drug
- aspirin alleviates headache + alleviates range symptom ⇒ headache isOfType symptom

RDF(S) semantics: examples
[Slide repeats the two inference patterns with resource names that did not survive in the transcript: … isOfType … + … subClassOf … ⇒ … isOfType …; … range … ⇒ … isOfType …]

RDF(S) semantics
- X R Y + R domain T ⇒ X IsOfType T
- X R Y + R range T ⇒ Y IsOfType T
- T1 SubClassOf T2 + T2 SubClassOf T3 ⇒ T1 SubClassOf T3
- X IsOfType T1 + T1 SubClassOf T2 ⇒ X IsOfType T2
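The four entailment rules can be sketched as a naive forward-chaining loop that keeps applying them until nothing new is derived. This is an illustration over plain tuples, not a real RDF(S) reasoner; the function name `rdfs_closure` is made up, the predicate names follow the slides (isOfType, subClassOf, domain, range), and the last rule is taken to conclude X IsOfType T2.

```python
def rdfs_closure(triples):
    """Return the input triples plus everything the four rules imply."""
    closure = set(triples)
    while True:
        derived = set()
        for (s, p, o) in closure:
            for (s2, p2, o2) in closure:
                # X R Y + R domain T  =>  X isOfType T
                if p2 == "domain" and s2 == p:
                    derived.add((s, "isOfType", o2))
                # X R Y + R range T  =>  Y isOfType T
                if p2 == "range" and s2 == p:
                    derived.add((o, "isOfType", o2))
                # T1 subClassOf T2 + T2 subClassOf T3  =>  T1 subClassOf T3
                if p == "subClassOf" and p2 == "subClassOf" and o == s2:
                    derived.add((s, "subClassOf", o2))
                # X isOfType T1 + T1 subClassOf T2  =>  X isOfType T2
                if p == "isOfType" and p2 == "subClassOf" and o == s2:
                    derived.add((s, "isOfType", o2))
        if derived <= closure:  # fixed point reached
            return closure
        closure |= derived


facts = {
    ("Aspirin", "isOfType", "Painkiller"),
    ("Painkiller", "subClassOf", "Drug"),
    ("aspirin", "alleviates", "headache"),
    ("alleviates", "range", "symptom"),
}
inferred = rdfs_closure(facts) - facts
```

The pairwise scan is quadratic per round, which is fine for a handful of triples; it is meant to show the rules, not to scale.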

OWL also has a formal semantics
- Defines what other statements are implied by a given set of statements
- Ensures mutual agreement on content (both minimal and maximal) between parties without further contact
- Can be used for integrity/consistency checking
- Hard to compute (and rarely/sometimes/always explosive in practice)

OWL semantics: minimal
- vanGogh isOfType Impressionist + Impressionist subClassOf Painter ⇒ vanGogh isOfType Painter
- vanGogh painter-of sunflowers + painter-of domain painter ⇒ vanGogh isOfType painter

OWL semantics: maximal
- vanGogh isOfType Impressionist + Impressionist disjointFrom Cubist ⇒ NOT: vanGogh isOfType Cubist
- painted-by has-cardinality 1 + sun-flowers painted-by vanGogh + Picasso different-individual-from vanGogh ⇒ NOT: sun-flowers painted-by Picasso
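The "maximal" side is what makes consistency checking possible: some statements are ruled out. A minimal sketch of the disjointness case only, using plain Python data rather than an OWL reasoner; the function name and data layout are assumptions made for the illustration.

```python
def disjointness_violations(types, disjoint_pairs):
    """Return (individual, class_a, class_b) for every individual asserted
    to belong to two classes that are declared disjoint."""
    violations = []
    for individual, classes in sorted(types.items()):
        for class_a, class_b in disjoint_pairs:
            if class_a in classes and class_b in classes:
                violations.append((individual, class_a, class_b))
    return violations


disjoint_pairs = [("Impressionist", "Cubist")]
consistent = {"vanGogh": {"Impressionist"}}
inconsistent = {"vanGogh": {"Impressionist", "Cubist"}}

ok = disjointness_violations(consistent, disjoint_pairs)
bad = disjointness_violations(inconsistent, disjoint_pairs)
```

RDF(S) alone could never report the second data set as wrong, because it has no way to say that two classes exclude each other.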

Remember “required are”:
✓ 1. one or more standard vocabularies, so search engines, producers and consumers all speak the same language
✓ 2. a standard syntax, so meta-data can be recognised as such
3. lots of resources with meta-data attached

Question: who writes the ontologies?
- Professional bodies, scientific communities, companies, publishers, …
- See previous slide on Biomedical ontologies
  - Same developments in many other fields
- Good old fashioned Knowledge Engineering
- Convert from DB-schema, UML, etc.

Question: Who writes the meta-data?
- Automated learning
- shallow natural language analysis
- Concept extraction
Example: Encyclopedia Britannica on “Amsterdam”
[Extracted concepts: amsterdam, trade, antwerp, europe, merchant, city, town, center, netherlands]

Remember “required are”:
✓ 1. one or more standard vocabularies, so search engines, producers and consumers all speak the same language
✓ 2. a standard syntax, so meta-data can be recognised as such
3. lots of resources with meta-data attached

How to handle multiple ontologies: ontology matching
- Linguistics & structure
- Shared vocabulary
- Instance-based matching
- Shared background knowledge

Matching through shared vocabulary

Matching through shared instances
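One common way to realise instance-based matching is to score two concepts by how much their instance sets overlap. The sketch below uses the Jaccard measure as a stand-in; the slide does not say which measure was intended, and the concept names, instances and threshold are all hypothetical.

```python
def instance_overlap(instances_a, instances_b):
    """Jaccard overlap between two instance sets (0.0 when both are empty)."""
    a, b = set(instances_a), set(instances_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_concepts(ontology_1, ontology_2, threshold=0.5):
    """Pair up concepts whose instance overlap reaches the threshold,
    best matches first."""
    matches = []
    for concept_1, instances_1 in ontology_1.items():
        for concept_2, instances_2 in ontology_2.items():
            score = instance_overlap(instances_1, instances_2)
            if score >= threshold:
                matches.append((concept_1, concept_2, score))
    return sorted(matches, key=lambda match: -match[2])


ontology_1 = {"Painkiller": {"aspirin", "ibuprofen", "paracetamol"}}
ontology_2 = {
    "Analgesic": {"aspirin", "ibuprofen", "morphine"},
    "Antibiotic": {"penicillin"},
}
matches = match_concepts(ontology_1, ontology_2)
```

The two concepts never need to share a label; the match rests only on the items both ontologies have classified.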

Matching using shared background knowledge
[Diagram: ontology 1 and ontology 2, linked via shared background knowledge]

Some working examples? Linked Life Data DOPE HCLS

Linked Life Data Overview
LinkedLifeData - statistics:
- Number of statements: 1,159,857,602
- Number of explicit statements: 403,361,589
- Number of entities: 128,948,564
Platform to automate the process:
- Infrastructure for storage and inference
- Transform the structured data sources to RDF
- Provide web interface to access the data
Currently operates over OWLIM semantic repository
Publicly available at:

Light Weight Reasoning in Linked Life Data
- Resolve the syntactic differences in the identifiers
- Use relationships to derive new implicit knowledge
[Diagram: resources such as urn:intact:1007, urn:biogrid:15904, urn:pubmed:15904, urn:uniprot:P, urn:uniprot:Q, urn:biogrid:FBgn and urn:uniprot:FBgn, typed via rdf:type as urn:uniprot:Protein, urn:biogrid:Interaction or urn:intact:Interaction, and connected by rdf:seeAlso, sameAs, interactsWith and hasParticipant. These are only example resource names.]
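Resolving identifier differences with sameAs links amounts to grouping identifiers into equivalence classes, since sameAs is symmetric and transitive. A minimal sketch using union-find; the `urn:example:` identifiers are hypothetical and this is not a description of how Linked Life Data or OWLIM implement it.

```python
def same_as_groups(pairs):
    """Group identifiers connected by sameAs links, directly or indirectly."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for identifier in list(parent):
        groups.setdefault(find(identifier), set()).add(identifier)
    return list(groups.values())


# Hypothetical sameAs statements between identifiers from three sources.
links = [
    ("urn:example:a:1", "urn:example:b:1"),
    ("urn:example:b:1", "urn:example:c:1"),
    ("urn:example:a:2", "urn:example:b:2"),
]
groups = same_as_groups(links)
```

Once identifiers are grouped, a statement made about any member of a group can be read as a statement about all of them, which is one way new implicit knowledge gets derived.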

Database | Dataset | Schema | Description
Uniprot | Curated entries | Original by the provider | Protein sequences and annotations
Entrez-Gene | Complete | Custom RDF schema | Genes and annotation
iProClass | Complete | Custom RDF schema | Protein cross-references
Gene Ontology | Complete | Schema by the provider | Gene and gene product annotation thesaurus
BioGRID | Complete | BioPAX 2.0 (custom generated) | Protein interactions extracted from the literature
NCI - Pathway Interaction Database | Complete | BioPAX 2.0 (original by the provider) | Human pathway interaction database
The Cancer Cell Map | Complete | BioPAX 2.0 (original by the provider) | Cancer pathways database
Reactome | Complete | BioPAX 2.0 (original by the provider) | Human pathways and interactions
BioCarta | Complete | BioPAX 2.0 (original by the provider) | Pathway database
KEGG | Complete | BioPAX 1.0 (original by the provider) | Molecular Interaction
BioCyc | Complete | BioPAX 1.0 (original by the provider) | Pathway database
NCBI Taxonomy | Complete | Custom RDF schema | Organisms

Some working examples? Linked Life Data DOPE HCLS

The Data
- Document repositories:
  - ScienceDirect: approx fulltext articles
  - MEDLINE: approx abstracts
- Extracted Metadata
  - The Collexis Metadata Server: concept-extraction ("semantic fingerprinting")
- Thesauri and Ontologies
  - EMTREE: preferred terms, synonyms

Summarising…
- Data integration on the Web:
  - machine processable data besides human processable data
- Syntax for meta-data
  - (not discussed in any detail)
- Vocabularies for meta-data
  - Lots of them in bio-inf.
- Actual meta-data:
  - Lots in bio-inf.
- Will enable:
  - Better search engines (recall, precision, concepts)
  - Combining information across pages (inference)
  - …

Things to do for you
- Practical: Use existing software to construct new use scenarios
- Conceptual: Create an ontology for some area of bio-medical expertise
  - from scratch
  - as a refinement of an existing ontology
- Technical: Transform an existing data-set into meta-data format, and provide a query interface (for humans and machines)