Introduction to Semantic Web What? Why? How? So far? Next? Frank van Harmelen AI Department Vrije Universiteit Amsterdam Creative Commons License: allowed to share & remix, but must attribute & non-commercial
Who am I Frank van Harmelen Prof in AI at Vrije Universiteit Amsterdam Knowledge Representation Early Semantic Web Projects (> 1999) Co-designed OWL Tech advisor of Aduna (Sesame) Scientific Director of LarKC (Large Knowledge Collider) I know nothing about image analysis…
Who are you? who knows roughly what Semantic Web is? who has heard of RDF & OWL? who has studied RDF & OWL? who has used RDF & OWL? who expects ever to use RDF & OWL? who is a logician who is a KR researcher who is a Web researcher who is an image researcher
General idea of the Semantic Web
General idea of Semantic Web Make current web more machine accessible (currently all the intelligence is in the user) Motivating use-cases search personalisation semantic linking data integration web services...
General idea of Semantic Web Do this by: 1. Making data and meta-data available on the Web in machine-understandable form (formalised) 2. Structuring the data and meta-data in ontologies These are non-trivial design decisions. Alternative would be: Make current web more machine accessible (currently all the intelligence is in the user)
What's wrong with the Web? Linked web pages, written by people, written for people, used only by people... Many of these pages already come from data, usable by computers! But we can't link the data... Linked data: usable by computers, useful for people!
"Web of Data" (TBL) 1. Expose data on the web (facts) in interoperable form (RDF) 2. Expose knowledge on the web with interoperable semantics (ontologies, RDF Schema, OWL) 3. Apply lightweight inference for: interoperability, query answering, search, unexpected reuse … Semantic Web
Not just data, also knowledge All of this: Low expressivity logic (RDF) That allows some inference: Property inheritance, domain/range inference Some of this: Medium expressive logic (OWL) That allows more inference: (in)equality, number restrictions, datatypes
Desideratum: On the Web of Data, anyone can say anything about anything. Need for total decoupling of data, vocabulary and meta-data: in a statement x [IsOfType] T, each part can have a different owner & location
Two versions of Semantic Web story: V1: Semantic Web = annotated Web; 1 & 2 are embedded in text & images on the Web. V2: Semantic Web = Web of Data; 1 & 2 live in dedicated repositories (triple stores)
Why is this hard?
Machine accessible meaning (what it's like to be a machine) [Figure labels: symptoms, drug, administration, disease; relations: IS-A, alleviates; META-DATA]
What is meta-data? it's just data it's data describing other data it's meant for machine consumption disease name symptoms drug administration
What is required?
Required are: 1. one or more standard vocabularies, so search engines, producers and consumers all speak the same language 2. a standard syntax, so meta-data can be recognised as such 3. lots of resources with meta-data attached
Bluffer's Guide to RDF & RDF Schema
Bluffer's Guide to RDF. Express relations between things: results in a labelled network (graph). All labels are actually web addresses (URIs). You can ping any label and find out more. Bits of the graph can live at physically different locations & have different owners. Example (subject, predicate, object): Franky [AuthorOf] x; x [publishedBy] MIT
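The labelled graph on this slide can be written down as a set of (subject, predicate, object) triples. The following is a minimal Python sketch, not a real RDF library: the `http://example.org/` namespace and the `match` helper are assumptions made for illustration, since the slide gives labels only and no actual web addresses.

```python
# Hypothetical namespace: the slide gives labels only, not real URIs.
EX = "http://example.org/"

# Each triple is (subject, predicate, object); the graph is just a set of them.
graph = {
    (EX + "Franky", EX + "AuthorOf", EX + "x"),
    (EX + "x", EX + "publishedBy", EX + "MIT"),
}

def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# Who published x?
publishers = [o for (_, _, o) in match(graph, s=EX + "x", p=EX + "publishedBy")]
```

Because the graph is only a set of statements, triples held by different owners at different locations can be combined with a plain set union.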
Bluffer's Guide to RDF Schema: types for subjects & objects & predicates. Types organised in a hierarchy. Inheritance of properties. [Figure labels: Franky, x, AuthorOf, MIT, publishedBy; types: author, book, publisher, person, artifact, man]
So what's special about RDF(S)? Statements about an identifier can be distributed; no unique name assumption; no closed world assumption. Remember web-style decoupling
Remember: Need for total decoupling of data, vocabulary and meta-data: in a statement x [IsOfType] T, each part can have a different owner & location
RDF(S) have a (very small) formal semantics Defines what other statements are implied by a given set of RDF(S) statements Ensures mutual agreement on minimal content between parties without further contact In the form of entailment rules Very simple to compute (and not explosive in practice)
RDF(S) semantics: examples. Aspirin isOfType Painkiller + Painkiller subClassOf Drug ⇒ Aspirin isOfType Drug. aspirin alleviates headache + alleviates range symptom ⇒ headache isOfType symptom
RDF(S) semantics: examples [Figure: inference examples drawn as graphs with isOfType, subClassOf and range edges]
RDF(S) semantics: X R Y + R domain T ⇒ X IsOfType T; X R Y + R range T ⇒ Y IsOfType T; T1 SubClassOf T2 + T2 SubClassOf T3 ⇒ T1 SubClassOf T3; X IsOfType T1 + T1 SubClassOf T2 ⇒ X IsOfType T2. Semantics = predictable inference
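The four entailment rules can be sketched as a naive forward-chaining loop that keeps applying them until nothing new is derived. This is an illustrative sketch only: it uses the slide's informal predicate names (`IsOfType`, `SubClassOf`, `domain`, `range`) as plain strings rather than real RDF/RDFS URIs, and the function name `rdfs_closure` is made up for this example.

```python
def rdfs_closure(triples):
    """Apply the four rules from the slide until no new triple appears."""
    inferred = set(triples)
    while True:
        new = set()
        for (s, p, o) in inferred:
            for (s2, p2, o2) in inferred:
                # X R Y + R domain T  =>  X IsOfType T
                if p2 == "domain" and s2 == p:
                    new.add((s, "IsOfType", o2))
                # X R Y + R range T  =>  Y IsOfType T
                if p2 == "range" and s2 == p:
                    new.add((o, "IsOfType", o2))
                # T1 SubClassOf T2 + T2 SubClassOf T3  =>  T1 SubClassOf T3
                if p == "SubClassOf" and p2 == "SubClassOf" and o == s2:
                    new.add((s, "SubClassOf", o2))
                # X IsOfType T1 + T1 SubClassOf T2  =>  X IsOfType T2
                if p == "IsOfType" and p2 == "SubClassOf" and o == s2:
                    new.add((s, "IsOfType", o2))
        if new <= inferred:  # fixpoint reached: nothing new was derived
            return inferred
        inferred |= new

facts = {
    ("Aspirin", "IsOfType", "Painkiller"),
    ("Painkiller", "SubClassOf", "Drug"),
    ("Aspirin", "alleviates", "Headache"),
    ("alleviates", "range", "Symptom"),
}
closure = rdfs_closure(facts)
derived = closure - facts
```

The pairwise loop is quadratic per pass, which is fine for a toy graph; real triple stores index the triples instead. The loop always terminates because the rules only recombine names that already occur in the input.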
Bluffer's Guide to OWL
OWL: things RDF Schema can't do: equality; enumeration; number restrictions (single-valued/multi-valued, optional/required values); inverse, symmetric, transitive; boolean algebra (union, complement) …
Layered language. OWL Lite: classification hierarchy, simple constraints. OWL DL: maximal expressiveness while maintaining decidability; standard formalisation. OWL Full: very high expressiveness, losing decidability; non-standard formalisation; all syntactic freedom of RDF (self-modifying). Syntactic layering and semantic layering: Lite ⊂ DL ⊂ Full
Language layers. OWL Full: allows meta-classes etc. OWL DL: negation, disjunction, full cardinality, enumerated types. OWL Lite: (sub)classes, individuals, (sub)properties, domain, range, conjunction, (in)equality, cardinality 0/1, datatypes, inverse, transitive, symmetric, hasValue, someValuesFrom, allValuesFrom. All on top of RDF Schema
Backward compatibility with RDF: OWL agents understand everything… others still understand the most important aspects
OWL also has a formal semantics. Defines what other statements are implied by a given set of statements. Ensures mutual agreement on content (both minimal and maximal) between parties without further contact. Can be used for integrity/consistency checking. Hard to compute (and rarely/sometimes/always explosive in practice)
OWL semantics: minimal. vanGogh isOfType Impressionist + Impressionist subClassOf Painter ⇒ vanGogh isOfType Painter. vanGogh painter-of sunflowers + painter-of domain painter ⇒ vanGogh isOfType painter
OWL semantics: maximal. vanGogh isOfType Impressionist + Impressionist disjointFrom Cubist ⇒ NOT: vanGogh isOfType Cubist. painted-by has-cardinality 1 + sun-flowers painted-by vanGogh + Picasso different-individual-from vanGogh ⇒ NOT: sun-flowers painted-by Picasso
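The "maximal" side of the semantics is what makes consistency checking possible: some combinations of statements are ruled out. Here is a small Python sketch of the two checks on this slide, assuming the slide's informal names (`isOfType`, `painted-by`) as plain strings. The function `find_conflicts` and its parameters are hypothetical and cover only these two patterns; this is not how an OWL reasoner works in general.

```python
def find_conflicts(triples, disjoint_classes, cardinality_one, different):
    """Return human-readable conflicts; an empty list means none were found."""
    conflicts = []

    # Disjointness: no individual may belong to both classes of a disjoint pair.
    types = {}
    for s, p, o in triples:
        if p == "isOfType":
            types.setdefault(s, set()).add(o)
    for individual, classes in sorted(types.items()):
        for a, b in disjoint_classes:
            if a in classes and b in classes:
                conflicts.append(f"{individual} is both {a} and {b}")

    # Cardinality 1: a subject may not have two values known to be different.
    values = {}
    for s, p, o in triples:
        if p in cardinality_one:
            values.setdefault((s, p), set()).add(o)
    for (s, p), objs in sorted(values.items()):
        for a in sorted(objs):
            for b in sorted(objs):
                if a < b and ((a, b) in different or (b, a) in different):
                    conflicts.append(f"{s} {p} both {a} and {b}")
    return conflicts

facts = [
    ("vanGogh", "isOfType", "Impressionist"),
    ("sun-flowers", "painted-by", "vanGogh"),
]
disjoint = [("Impressionist", "Cubist")]
single_valued = {"painted-by"}
different = {("Picasso", "vanGogh")}

ok = find_conflicts(facts, disjoint, single_valued, different)
bad = find_conflicts(
    facts + [("vanGogh", "isOfType", "Cubist"), ("sun-flowers", "painted-by", "Picasso")],
    disjoint, single_valued, different,
)
```

The cardinality check only fires when the two values are explicitly stated to be different. Without the unique name assumption, two names for the value of a single-valued property would otherwise be taken to denote the same individual rather than to contradict each other.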
Remember: required are 1. standard vocabularies 2. a standard syntax 3. lots of resources with meta-data attached
Ontologies: real life examples. Handcrafted. Music: CDnow (2410/5), MusicMoz (1073/7). Biomedical: SNOMED (200k), GO (15k), Emtree (45k+190k), Systems biology. Ranging from lightweight (Yahoo, UNSPSC, Open Directory (400k)) to heavyweight (Cyc (300k)). Ranging from small (METAR) to large (UNSPSC)
Biomedical ontologies (a few...). MeSH: Medical Subject Headings, National Library of Medicine, descriptions. EMTREE: commercial (Elsevier), drugs and diseases, terms, synonyms. UMLS: integrates 100 different vocabularies. SNOMED: concepts, College of American Pathologists. Gene Ontology: terms in molecular biology. NCBI Cancer Ontology: 17,000 classes (about 1M definitions)
Remember: required are 1. standard vocabularies 2. a standard syntax 3. lots of resources with meta-data attached
Who makes the meta-data? Don't throw away what we already have: databases (Amazon.com), navigation structures, meta-data in documents (Office, Acrobat, MP3, jpg). As a spin-off of what we already do: MIT Media Lab photo annotator. Automated analysis: text, images, video
Summary so far
Linked Data/Semantic Web. Identification: Uniform Resource Identifier (URI); global identifier (NB: persistent!); looks like a URL, is often an Internationalized Resource Identifier (IRI). Description: Resource Description Framework (RDF), RDF Schema (RDFS), Simple Knowledge Organization System (SKOS), Web Ontology Language (OWL). Querying: RDF triple stores, SPARQL Query Language
What does RDF look like? The data model is a (directed) graph. Each data item is a resource with a URI as identifier. Each property is a binary relation: a triple. Between resources: Between a resource and a literal:
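The two kinds of triple (resource to resource, and resource to literal) can be made concrete by writing them out in N-Triples, the line-based W3C syntax for RDF. The following Python sketch makes several assumptions: the `Literal` wrapper, the `escape` and `to_ntriples` helpers and the `http://example.org/` URIs are all invented for illustration, and the escaping covers only the most common characters rather than the full grammar.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Literal:
    """Marks an object as a plain string value rather than a resource."""
    value: str

def escape(text):
    """Escape backslash, double quote, newline and carriage return."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))

def to_ntriples(triples):
    """Serialise (subject, predicate, object) tuples, one statement per line."""
    lines = []
    for s, p, o in triples:
        # Literals are written in quotes, resources as URIs in angle brackets.
        obj = f'"{escape(o.value)}"' if isinstance(o, Literal) else f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

EX = "http://example.org/"  # hypothetical namespace
triples = [
    (EX + "Amsterdam", EX + "capitalOf", EX + "Netherlands"),  # resource to resource
    (EX + "Amsterdam", EX + "name", Literal("Amsterdam")),     # resource to literal
]
output = to_ntriples(triples)
```

A literal is an end point of the graph: it can only appear in object position, so nothing further can be said about it, whereas any resource can become the subject of more triples.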
Why is this a Web of data? Globally unique identifiers. Reuse of identifiers in other datasets. For data: (two sources say something about Amsterdam). For schema: (two sources each use the same concept City). This reuse builds links between datasets
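How identifier reuse links datasets can be sketched with two independent sets of triples that happen to share URIs. Everything named here is hypothetical: the `example.org` identifiers, the predicate labels and the `describe` helper are assumptions for illustration, not taken from any real dataset.

```python
CITY = "http://example.org/schema/City"        # hypothetical shared concept
AMSTERDAM = "http://example.org/id/Amsterdam"  # hypothetical shared identifier

# Two sources, maintained independently, that reuse the same identifiers.
source_a = {
    (AMSTERDAM, "isOfType", CITY),
    (AMSTERDAM, "locatedIn", "http://example.org/id/Netherlands"),
}
source_b = {
    (AMSTERDAM, "hasMuseum", "http://example.org/id/Rijksmuseum"),
    ("http://example.org/id/Utrecht", "isOfType", CITY),
}

# Merging is a set union: no schema alignment step is needed, because
# identical URIs already denote the same thing in both sources.
merged = source_a | source_b

def describe(graph, resource):
    """Everything the graph says about one resource."""
    return sorted((p, o) for (s, p, o) in graph if s == resource)

about_amsterdam = describe(merged, AMSTERDAM)
cities = sorted(s for (s, p, o) in merged if p == "isOfType" and o == CITY)
```

After the merge, a query about Amsterdam draws on both sources (reuse at the data level), and a query for all cities finds instances from both (reuse at the schema level).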
Does this work in practice?
Already many billions of facts & rules: Linked Open Data cloud. Encyclopedia; geographic names (millions); names of artists & art works (10,000s); scientific bibliographies; hierarchical dictionaries (UK, FR, NL); life-science databases; any CD ever recorded (almost); basic facts on every country on the planet; common sense rules & facts. May 09 estimate: > 4.2 billion triples, million interlinks. It gets bigger every month
And remember: not just data. All of this: low expressivity logic (RDF/RDFS) that allows some inference: property inheritance, domain/range inference. Some of this: medium expressive logic (OWL) that allows more inference: (in)equality, number restrictions, datatypes
Nice in the lab, but are you getting anywhere in practice?
Semantic Web News Quiz: Google, Reuters, New York Times, Microsoft, Zemanta, Obama Government, BBC (music, World Cup, wildlife), BestBuy.com, Facebook
Challenges
What to do when success is becoming a problem? Heterogeneity: ontology mapping, instance identification. Scale (10^10 statements). Dynamics, versioning (Flickr: 3000 pictures/minute, Wikipedia: 100 edits/minute). Trust, attribution, provenance. Multimedia: in both directions