Creating and Exploiting a Web of Semantic Data

Overview
- Introduction
- Semantic Web 101
- Recent Semantic Web trends
- Examples: DBpedia, Wikitology
- Conclusion

The Age of Big Data
- Massive amounts of data are available today
- Advances in many fields are driven by the availability of unstructured data, e.g., text, audio, images
- Increasingly, large amounts of structured and semi-structured data are also online
- Much of this is available in the Semantic Web language RDF, fostering integration and interoperability
- Such structured data is especially important for the sciences

Twenty years ago…
Tim Berners-Lee’s 1989 WWW proposal described a web of relationships among named objects, unifying many information management tasks
http://www.w3.org/History/1989/proposal.html
Capsule history
- Guha’s MCF (~94)
- XML+MCF=>RDF (~96)
- RDF+OO=>RDFS (~99)
- RDFS+KR=>DAML+OIL (00)
- W3C’s SW activity (01)
- W3C’s OWL (03)
- SPARQL, RDFa (08)
- Rules (09)

Ten years ago…
- The W3C started developing standards for the Semantic Web
- The vision, technology and use cases are still evolving
- Moving from a web of documents to a web of data

Today
4.5 billion integrated facts published on the Web as RDF Linked Open Data

Tomorrow
Large collections of integrated facts published on the Web for many disciplines and domains

W3C’s Semantic Web Goal
“The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”
-- Berners-Lee, Hendler and Lassila, The Semantic Web, Scientific American, 2001

Contrast with a non-Web approach
The W3C Semantic Web approach is
- Distributed
- Open
- Non-proprietary
- Standards based

How can we share data on the Web?
- POX, Plain Old XML, is one approach, but it has deficiencies
- The Semantic Web languages RDF and OWL offer a simpler and more abstract data model (a graph) that is better for integration
- Its well-defined semantics supports knowledge modeling and inference
- Supported by a stable, funded standards organization, the World Wide Web Consortium

Simple RDF Example
[Figure: RDF graph. The node http://umbc.edu/~finin/talks/idm02/ has a dc:Title arc to “Intelligent Information Systems on the Web and in the Aether” and a dc:Creator arc to a blank node. The blank node has a bib:name arc to “Tim Finin”, a bib:email arc to “finin@umbc.edu”, and a bib:Aff arc to http://umbc.edu/.]

The RDF Data Model
- An RDF document is an unordered collection of statements, each with a subject, predicate and object
- Such triples can be thought of as labelled arcs in a graph
- Statements describe properties of resources
- A resource is any object that can be referenced or denoted by a URI
- Properties themselves are also resources (URIs)
- Dereferencing a URI produces useful additional information, e.g., a definition or additional facts
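A minimal sketch of this data model in Python, assuming plain tuples stand in for triples. The blank node label `_:creator` and the helper name `objects` are illustrative and do not come from the slides.

```python
# Each statement is a (subject, predicate, object) triple.
# An RDF document is an unordered collection, so a set is a natural fit.
TALK = "http://umbc.edu/~finin/talks/idm02/"

triples = {
    (TALK, "dc:title", "Intelligent Information Systems on the Web and in the Aether"),
    (TALK, "dc:creator", "_:creator"),          # "_:creator" is a hypothetical blank node label
    ("_:creator", "bib:name", "Tim Finin"),
    ("_:creator", "bib:email", "finin@umbc.edu"),
    ("_:creator", "bib:aff", "http://umbc.edu/"),
}

def objects(graph, subject, predicate):
    """Return the objects of all triples with the given subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}

# Follow the arc from the talk to its creator, then read the creator's name.
creator = next(iter(objects(triples, TALK, "dc:creator")))
creator_names = objects(triples, creator, "bib:name")
print(creator_names)
```

Walking from one node to the next along labelled arcs is all a graph query does at this level; a real triple store adds indexing and URI handling on top.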

RDF is the first SW language
RDF is a simple language for graph-based representations. The RDF data model is shown in three forms:
- Graph: good for human viewing
- XML encoding: good for machine processing
  <rdf:RDF ……..> <….> </rdf:RDF>
- Triples: good for storage and reasoning
  stmt(docInst, rdf_type, Document)
  stmt(personInst, rdf_type, Person)
  stmt(inroomInst, rdf_type, InRoom)
  stmt(personInst, holding, docInst)
  stmt(inroomInst, person, personInst)

XML encoding for RDF
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:bib="http://daml.umbc.edu/ontologies/bib/">
  <rdf:Description rdf:about="http://umbc.edu/~finin/talks/idm02/">
    <dc:title>Intelligent Information … and in the Aether</dc:title>
    <dc:creator>
      <rdf:Description>
        <bib:Name>Tim Finin</bib:Name>
        <bib:Email>finin@umbc.edu</bib:Email>
        <bib:Aff rdf:resource="http://umbc.edu/" />
      </rdf:Description>
    </dc:creator>
  </rdf:Description>
</rdf:RDF>
[Figure: the RDF graph from the Simple RDF Example slide, repeated alongside the XML]

N3 is a friendlier encoding
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix bib: <http://daml.umbc.edu/ontologies/bib/> .
<http://umbc.edu/~finin/talks/idm02/>
    dc:title "Intelligent ... and in the Aether" ;
    dc:creator [ bib:Name "Tim Finin" ;
                 bib:Email "finin@umbc.edu" ;
                 bib:Aff <http://umbc.edu/> ] .
[Figure: the RDF graph from the Simple RDF Example slide, repeated alongside the N3]

RDFS supports simple inferences
- RDF Schema adds vocabulary for classes, properties & constraints
- An RDF ontology plus some RDF statements may imply additional RDF statements (not possible in XML)
- Note that this is part of the data model and not of the accessing or processing code
Statements:
@prefix rdfs: <http://www.....>.
@prefix : <genesis.n3>.
parent a rdf:Property; rdfs:domain person; rdfs:range person.
mother rdfs:subPropertyOf parent; rdfs:domain woman.
eve mother cain.
Implied statements:
person a class.
woman subClass person.
mother a property.
eve a person; a woman; parent cain.
cain a person.
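The kind of inference described here can be sketched as a small forward-chaining loop in Python. This is an illustration under assumptions, not a full RDFS reasoner: it applies only four entailment rules (subPropertyOf, domain, range, subClassOf), the function name `rdfs_closure` is made up for the example, and `woman rdfs:subClassOf person` is asserted as an input here so that the rules can use it.

```python
def rdfs_closure(triples):
    """Apply four RDFS rules repeatedly until no new triples appear."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                if p2 == "rdfs:subPropertyOf" and s2 == p:
                    new.add((s, o2, o))             # s p o, p subPropertyOf q => s q o
                elif p2 == "rdfs:domain" and s2 == p:
                    new.add((s, "rdf:type", o2))    # subject belongs to the domain class
                elif p2 == "rdfs:range" and s2 == p:
                    new.add((o, "rdf:type", o2))    # object belongs to the range class
                elif p2 == "rdfs:subClassOf" and p == "rdf:type" and s2 == o:
                    new.add((s, "rdf:type", o2))    # instance of a subclass
        if new <= inferred:
            return inferred
        inferred |= new

# Schema and data from the genesis example; the subClassOf triple is an assumption.
genesis = {
    ("parent", "rdfs:domain", "person"),
    ("parent", "rdfs:range", "person"),
    ("mother", "rdfs:subPropertyOf", "parent"),
    ("mother", "rdfs:domain", "woman"),
    ("woman", "rdfs:subClassOf", "person"),
    ("eve", "mother", "cain"),
}

closure = rdfs_closure(genesis)
for triple in sorted(closure - genesis):
    print(triple)
```

From the single fact `eve mother cain`, the loop derives that eve is a woman and a person, that eve is a parent of cain, and that cain is a person.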

OWL adds further richness
OWL adds richer representational vocabulary, e.g.
- parentOf is the inverse of childOf
- Every person has exactly one mother
- Every person is a man or a woman but not both
- A man is the equivalent of a person with a sex property with value “male”
OWL is based on ‘description logic’, a logic subset with efficient reasoners that are complete
- Good algorithms for reasoning about descriptions

That was then, this is now
- 1996-2000: focus on RDF and data
- 2000-2007: focus on OWL, developing ontologies, sophisticated reasoning
- 2008-…: integrating and exploiting large RDF data collections backed by lightweight ontologies

A Linked Data story
- Wikipedia as a source of knowledge: wikis are a great way to collaborate on building up knowledge resources
- Wikipedia as an ontology: every Wikipedia page is a concept or object
- Wikipedia as RDF data: map this ontology into RDF
- DBpedia as the lynchpin for Linked Data: exploit its breadth of coverage to integrate things

Populating Freebase KB

Underlying Powerset’s KB

Mined by TrueKnowledge

Wikipedia as an ontology
- Using Wikipedia as an ontology
  - each article (~3M) is an ontology concept or instance
  - terms are linked via the category system (~200k), infobox template use, inter-article links, infobox links
  - article history contains metadata for trust, provenance, etc.
- It’s a consensus ontology with broad coverage
- Created and maintained by a diverse community for free!
- Multilingual
- Very current
- Overall content quality is high

Wikipedia as an ontology
- Uncategorized and miscategorized articles
- Many ‘administrative’ categories (e.g., articles needing revision) and useless ones (e.g., 1949 births)
- Multiple infobox templates for the same class
- Multiple infobox attribute names for the same property
- No datatypes or domains for infobox attribute values
- etc.

DBpedia: Wikipedia in RDF
- A community effort to extract structured information from Wikipedia and publish it as RDF on the Web
- Effort started in 2006 with EU funding
- Data and software open sourced
- DBpedia doesn’t extract information from Wikipedia’s text, but from its structured information, e.g., links, categories, infoboxes

DBpedia: Linked Data lynchpin

http://lookup.dbpedia.org/

DBpedia uses WP structured data
DBpedia extracts structured data from Wikipedia, especially from infoboxes

DBpedia ontology
- DBpedia 3.2 (Nov 2008) added a manually constructed ontology with
  - 170 classes in a subsumption hierarchy
  - 880K instances
  - 940 properties with domain and range
- A partial, manual mapping was constructed from infobox attributes to these terms
- Current domain and range constraints are “loose”
- Namespace: http://dbpedia.org/ontology/
Class      Instances
Place      248,000
Person     214,000
Work       193,000
Species     90,000
Org.        76,000
Building    23,000

Person 56 properties

Organisation 50 properties

Place 110 properties

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/resource/>
PREFIX dbpo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?Property ?Place
WHERE {
  dbp:Barack_Obama ?Property ?Place .
  ?Place rdf:type dbpo:Place .
}
Endpoint: http://dbpedia.org/sparql/
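A hedged sketch of how such a query could be packaged for a SPARQL endpoint from Python. It only builds the request URL and sends nothing. The helper name `build_request_url` is made up for the example, and the explicit `rdf:` prefix is added here so the query does not rely on prefixes predefined by the server; the `query` parameter name comes from the SPARQL protocol.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/resource/>
PREFIX dbpo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?Property ?Place
WHERE {
  dbp:Barack_Obama ?Property ?Place .
  ?Place rdf:type dbpo:Place .
}
"""

def build_request_url(endpoint, query):
    """Encode the query text as the `query` parameter of an HTTP GET request."""
    return endpoint + "?" + urlencode({"query": query.strip()})

url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Fetching `url` with any HTTP client would then return the variable bindings; the result format depends on the server and on the `Accept` header sent.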

DBpedia: Linked Data lynchpin

Consider Baltimore, MD

Looking at the RDF description
We find assertions equating DBpedia's object for Baltimore with those in other LOD datasets:
dbpedia:Baltimore%2C_Maryland
    owl:sameAs census:us/md/counties/baltimore/baltimore;
    owl:sameAs cyc:concept/Mx4rvVin-5wpEbGdrcN5Y29ycA;
    owl:sameAs freebase:guid.9202a8c04000641f800000000004921a;
    owl:sameAs geonames:4347778/ .
Since owl:sameAs is defined as an equivalence relation, the mapping works both ways
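Because owl:sameAs is symmetric and transitive, the identifiers linked this way fall into equivalence classes. A minimal sketch of computing those classes with a union-find structure, using the identifiers from the slide; the function name `same_as_classes` is illustrative only.

```python
def same_as_classes(pairs):
    """Group identifiers into equivalence classes from (a, b) owl:sameAs pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b         # merge the two classes

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return list(classes.values())

BALTIMORE = "dbpedia:Baltimore%2C_Maryland"
links = [
    (BALTIMORE, "census:us/md/counties/baltimore/baltimore"),
    (BALTIMORE, "cyc:concept/Mx4rvVin-5wpEbGdrcN5Y29ycA"),
    (BALTIMORE, "freebase:guid.9202a8c04000641f800000000004921a"),
    (BALTIMORE, "geonames:4347778/"),
]

classes = same_as_classes(links)
print(classes)
```

All five identifiers end up in one class, so a fact stated about the GeoNames identifier can be read as a fact about the Census, Cyc, Freebase, and DBpedia identifiers as well.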

Linked Data Cloud, March 2009

Four principles for linked data
1. Use URIs to identify things that you expose to the Web as resources
2. Use HTTP URIs so that people can locate and look up (dereference) these things
3. When someone looks up a URI, provide useful information
4. Include links to other, related URIs in the exposed data as a means of improving information discovery on the Web
-- Tim Berners-Lee, 2006

4.5 billion triples for free
- The full public LOD dataset has about 4.5 billion triples as of March 2009
- Linking assertions are spotty, but probably include on the order of 10M equivalences
- Availability:
  - download the data in RDF
  - query it via public SPARQL servers
  - load it as an Amazon EC2 public dataset
  - launch it and the required software as an Amazon public AMI image

Wikitology
We’ve been exploring a different approach to derive an ontology from Wikipedia through a series of use cases:
- Identifying user context in a collaboration system from documents viewed (2006)
- Improving IR accuracy by adding Wikitology tags to documents (2007)
- ACE: cross-document co-reference resolution for named entities in text (2008)
- TAC KBP: Knowledge Base population from text (2009)
- Improving Web search engines by tagging documents and queries (2009)

Wikitology 2.0 (2008)
[Figure: architecture diagram with labels: Freebase KB, RDF, graphs, text, Yago, WordNet, Databases, Human input & editing]
The Wikitology KB was constructed by integrating information from three sources: Wikipedia, DBpedia and Freebase.
- Wikipedia: we extract the text and links from articles, categories, disambiguation pages, etc. and represent them in a Lucene index, several database tables and a spreading activation graph.
- DBpedia extracts information from Wikipedia, mixes it with information from the WordNet lexical KB and the Yago ontology, and materializes it as RDF.
- Freebase exposes information extracted from Wikipedia and many databases in a graph/object database. Information can also be entered and edited manually by people.
Given some text and associated RDF triples, Wikitology returns a vector of matching articles and categories based on a combination of text similarity, RDF data match and spreading activation score.

Wikitology tagging
- Using Serif’s output, we produced an entity document for each entity. It included the entity’s name, nominal and pronominal mentions, APF type and subtype, and words in a window around the mentions
- We tagged entity documents using Wikitology, producing vectors of (1) terms and (2) categories for the entity
- We used the vectors to compute features measuring entity pair similarity/dissimilarity

Wikitology Entity Document & Tags
The entity document holds the name, the type & subtype, the mention heads, and the words surrounding the mentions:
<DOC>
<DOCNO>ABC19980430.1830.0091.LDC2000T44-E2</DOCNO>
<TEXT>
Webb Hubbell
PER Individual
NAM: "Hubbell" "Hubbells" "Webb Hubbell" "Webb_Hubbell"
PRO: "he" "him" "his"
abc's accountant after again ago all alleges alone also and arranged attorney avoid been before being betray but came can cat charges cheating circle clearly close concluded conspiracy cooperate counsel counsel's department did disgrace do dog dollars earned eightynine enough evasion feel financial firm first four friend friends going got grand happening has he help him his hope house hubbell hubbells hundred hush income increase independent indict indicted indictment inner investigating jackie jackie_judd jail jordan judd jury justice kantor ken knew lady late law left lie little make many mickey mid money mr my nineteen nineties ninetyfour not nothing now office other others paying peter_jennings president's pressure pressured probe prosecutors questions reported reveal rock saddened said schemed seen seven since starr statement such tax taxes tell them they thousand time today ultimately vernon washington webb webb_hubbell were what's whether which white whitewater why wife years
</TEXT>
</DOC>
Wikitology article tag vector
Webster_Hubbell 1.000
Hubbell_Trading_Post_National_Historic_Site 0.379
United_States_v._Hubbell 0.377
Hubbell_Center 0.226
Whitewater_controversy 0.222
Wikitology category tag vector
Clinton_administration_controversies 0.204
American_political_scandals 0.204
Living_people 0.201
1949_births 0.167
People_from_Arkansas 0.167
Arkansas_politicians 0.167
American_tax_evaders 0.167
Arkansas_lawyers 0.167
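Tag vectors like these can be compared with cosine similarity, which the talk lists among its entity-pair features. The sketch below assumes a sparse `{tag: weight}` dictionary representation, which the slides do not specify; the second vector, `other_entity`, is invented purely for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as {tag: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0          # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

# Article tag vector from the slide.
webb_hubbell = {
    "Webster_Hubbell": 1.000,
    "Hubbell_Trading_Post_National_Historic_Site": 0.379,
    "United_States_v._Hubbell": 0.377,
    "Hubbell_Center": 0.226,
    "Whitewater_controversy": 0.222,
}

# Hypothetical tag vector for a second entity document (not from the slides).
other_entity = {
    "Webster_Hubbell": 0.9,
    "Whitewater_controversy": 0.5,
}

similarity = cosine_similarity(webb_hubbell, other_entity)
print(round(similarity, 3))
```

Two entity documents that share their top-weighted articles score close to 1, while documents with no tags in common score 0.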

Top Ten Features (by F1)
Prec.   Recall  F1      Feature Description
90.8%   76.6%   83.1%   some NAM mention has an exact match
92.9%   71.6%   80.9%   Dice score of NAM strings (based on the intersection of NAM strings, not words or n-grams of NAM strings)
95.1%   65.0%   77.2%   the/a longest NAM mention is an exact match
86.9%   66.2%   75.1%   Similarity based on cosine similarity of Wikitology Article Medium article tag vector
86.1%   65.4%   74.3%   Similarity based on cosine similarity of Wikitology Article Long article tag vector
64.8%   82.9%   72.8%   Dice score of character bigrams from the 'longest' NAM string
95.9%   56.2%   70.9%   all NAM mentions have an exact match in the other pair
85.3%   52.5%           Similarity based on a match of entities' top Wikitology article tag
                52.3%
85.7%   32.9%   47.5%   Pair has a known alias
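One of the listed features is a Dice score over character bigrams. A minimal sketch of that score follows, assuming lowercased strings and sets of bigrams rather than multisets; the slides do not say which variant was used, and the function names are illustrative.

```python
def char_bigrams(text):
    """Return the set of adjacent character pairs in the lowercased string."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}

def dice_score(a, b):
    """Dice coefficient: twice the overlap divided by the total size of both sets."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# "hubbell" yields 6 distinct bigrams and "webb hubbell" yields 10; they share 6.
score = dice_score(char_bigrams("Webb Hubbell"), char_bigrams("Hubbell"))
print(score)  # 0.75
```

Unlike an exact string match, the bigram score gives partial credit when one name is a fragment or a spelling variant of the other, which fits the higher recall and lower precision reported for this feature.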

Knowledge Base Population
- The 2009 NIST Text Analysis Conference (TAC) will include a new Knowledge Base Population track
- Goal: discover information about named entities (people, organizations, places) and incorporate it into a KB
- TAC KBP has two related tasks:
  - Entity linking: doc. entity mention -> KB entity
  - Slot filling: given a document entity mention, find missing slot values in a large corpus

KBs and IE are Symbiotic
[Figure: a cycle between Knowledge Base and Information Extraction from Text]
- KB info helps interpret text
- IE helps populate KBs

Wikitology 3.0 (2009)
[Figure: architecture diagram with components: Application Specific Algorithms, Wikitology Code, IR collection (Articles), Category Links Graph, Infobox Graph, Page Link Graph, RDF reasoner, Relational Database, Triple Store, Linked Semantic Web data & ontologies]
This slide is from Zareen’s dissertation proposal. It describes a revised architecture where the structured and semi-structured data associated with Wikitology nodes is stored in a relational database or RDF store, as appropriate. I’ve also shown a connection from the “Linked Open Data” cloud of ontologies and data sources that is on the Web, providing more structured data in RDF.

Wikipedia’s social network
- Wikipedia has an implicit ‘social network’ that can help disambiguate PER mentions
- Resolving PER mentions in a short document to KB people who are linked in the KB is good
- The same can be done for the network of ORG and GPE entities

WSN Data
- We extracted 213K people from DBpedia’s Infobox dataset, ~30K of whom participate in an infobox link to another person
- We extracted 875K people from Freebase, 616K of whom were linked to Wikipedia pages, 431K of whom are in one of 4.8M person-person article links
- Consider a document that mentions two people: George Bush and Mr. Quayle

Which Bush & which Quayle? Six George Bushes Nine Male Quayles

A simple closeness metric
Let Si = {two-hop neighbors of entity i}
Cij = |intersection(Si, Sj)| / |union(Si, Sj)|
Cij > 0 for six of the 56 possible pairs:
0.43 George_H._W._Bush -- Dan_Quayle
0.24 George_W._Bush -- Dan_Quayle
0.18 George_Bush_(biblical_scholar) -- Dan_Quayle
0.02 George_Bush_(biblical_scholar) -- James_C._Quayle
0.02 George_H._W._Bush -- Anthony_Quayle
0.01 George_H._W._Bush -- James_C._Quayle
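The metric is the Jaccard overlap of two neighborhoods, and can be sketched as follows. Two points are assumptions, since the slide does not define them: "two-hop neighbors" is taken to mean every node reachable in one or two links, excluding the entity itself, and the graph is stored as an adjacency dictionary. The toy graph and its node names are made up for illustration.

```python
def two_hop_neighbors(graph, node):
    """Nodes reachable from `node` in one or two links, excluding `node` itself."""
    one_hop = set(graph.get(node, ()))
    reachable = set(one_hop)
    for neighbor in one_hop:
        reachable |= set(graph.get(neighbor, ()))
    reachable.discard(node)
    return reachable

def closeness(graph, i, j):
    """Cij = |Si & Sj| / |Si | Sj|, the Jaccard overlap of the two neighborhoods."""
    s_i = two_hop_neighbors(graph, i)
    s_j = two_hop_neighbors(graph, j)
    union = s_i | s_j
    return len(s_i & s_j) / len(union) if union else 0.0

# Hypothetical undirected person-person link graph (each link listed in both directions).
toy_graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A"},
    "D": {"B", "E"},
    "E": {"D"},
}

# S_A = {B, C, D} and S_E = {B, D}, so the score is 2/3.
c_ae = closeness(toy_graph, "A", "E")
print(round(c_ae, 2))
```

For disambiguation, the score would be computed for every candidate pair, such as each George Bush against each Quayle, and the highest-scoring pair preferred.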

Application to TAC KBP
Entity network data extracted from DBpedia and Wikipedia provides evidence to support KBP tasks:
- Mapping document mentions into infobox entities
- Mapping potential slot fillers into infobox entities
- Evaluating the coherence of entities as potential slot fillers

Conclusion
- The Semantic Web approach is a powerful approach for data interoperability and integration
- The research focus is shifting to a “Web of Data” perspective
- Many research issues remain: uncertainty, provenance, trust, parallel graph algorithms, reasoning over billions of triples, user-friendly tools, etc.
- Just as the Web enhances human intelligence, the Semantic Web will enhance machine intelligence
- The ideas and technology are still evolving

http://ebiquity.umbc.edu/