Information Retrieval and the Semantic Web

Tim Finin, James Mayfield, Anupam Joshi, R. Scott Cost and Clay Fink
University of Maryland, Baltimore County
Johns Hopkins University, Applied Physics Lab
04 January 2004
DARPA contract F and NSF awards ITR-IIS and ITR-IIS provided partial research support for this work

Introduction and motivation

“XML is Lisp's bastard nephew, with uglier syntax and no semantics. Yet XML is poised to enable the creation of a Web of data that dwarfs anything since the Library at Alexandria.” -- Philip Wadler, Et tu XML? The fall of the relational empire, VLDB, Rome, September 2001.

“The web has made people smarter. We need to understand how to use it to make machines smarter, too.” -- Michael I. Jordan (UC Berkeley), paraphrased from a talk at AAAI, July 2002

“The Semantic Web will globalize KR, just as the WWW globalized hypertext”
-- Tim Berners-Lee

“The multi-agent systems paradigm and the web both emerged around 1990. One has succeeded beyond imagination and the other has not yet made it out of the lab.” -- Anonymous, 2001

Software agents will need something similar to maximize the use of information on the semantic web.

Vision and Model

Vision
Semantic markup (e.g., OWL) as markup
  Web documents are traditional HTML documents, augmented with machine-readable semantic markup that describes their content.
Inference and retrieval are tightly bound
  Inference over semantic markup improves retrieval, and text retrieval facilitates inference.
Agents should use the web like humans do
  Think of a query, encode it to retrieve possibly relevant documents, read some and extract knowledge, and repeat until objectives are met.

Why use IR techniques?
We will want to retrieve over structured and unstructured knowledge.
  We should prepare for the appearance of text documents with embedded SW markup.
We may want to get our SWDs into conventional search engines, such as Google.
  Mature, scalable, low cost, deployed infrastructure.
IR techniques also have some unique characteristics that may be very useful,
  e.g., ranking matches, document similarity, clustering, relevance feedback, etc.

Framework: Semantic Markup
[Figure: framework diagram. Components: agent, Local KB, Inference Engine, Extractor, Encoder (“swangler”), Filters, Web Search Engine, Semantic Web. Labeled flows: Query, Statement to be proved, Semantic Markup, Encoded Markup, Ranked Pages.]

Framework: Incorporating Text
[Figure: framework diagram extended with text. Components: Local KB, Inference Engine, Extractor, Encoder (“swangler”), Filters, Web Search Engine, Semantic Web. Labeled flows: Query, Text Query, Statement to be proved, Semantic Markup, Encoded Markup, Text, Ranked Pages.]

Harnessing Google
Google started indexing RDF documents some time in late 2003.
Can we take advantage of this?
We’ve developed techniques to get some structured data indexed by Google and then later retrieved.
Technique: give Google enhanced documents with additional annotations containing Swangle Terms ™

Swangle definition
swan·gle
Pronunciation: ‘swa[ng]-g&l
Function: transitive verb
Inflected Forms: swan·gled; swan·gling /-g(&-)li[ng]/
Etymology: Postmodern English, from C++ mangle
Date: 20th century
1: to convert an RDF triple into one or more IR indexing terms
2: to process a document or query so that its content bearing markup will be indexed by an IR system
Synonym: see tblify
- swan·gler /-g(&-)l&r/ noun

Swangling
Swangling turns a SW triple into 7 word-like terms.
  One for each non-empty subset of the three components, with the missing elements replaced by the special “don’t care” URI.
  Terms are generated by a hashing function (e.g., SHA1).
Swangling an RDF document means adding in triples with swangle terms.
  The result can be indexed and retrieved via conventional search engines like Google.
This allows one to search for a SWD with a triple that claims “Ossama bin Laden is located at X”.
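The subset-and-hash step described above can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the `DONT_CARE` URI, the space-joined key, the base32 encoding, and the 26-character truncation are all assumptions chosen so the output resembles the word-like terms on the next slide.

```python
import base64
import hashlib
from itertools import product

# Hypothetical placeholder for the special "don't care" URI (assumption).
DONT_CARE = "http://example.org/swangle#dontCare"


def swangle_term(subject, predicate, obj, length=26):
    """Hash one (possibly wildcarded) triple into a word-like token."""
    key = " ".join((subject, predicate, obj)).encode("utf-8")
    digest = hashlib.sha1(key).digest()
    # Base32 uses only A-Z and 2-7, so an indexer sees the token as a word.
    return base64.b32encode(digest).decode("ascii").rstrip("=")[:length]


def swangle(subject, predicate, obj):
    """Return one term per non-empty subset of the triple's components."""
    components = (subject, predicate, obj)
    terms = []
    for keep in product((True, False), repeat=3):
        if not any(keep):
            continue  # the empty subset carries no information
        parts = [value if kept else DONT_CARE
                 for value, kept in zip(components, keep)]
        terms.append(swangle_term(*parts))
    return terms
```

A query is swangled the same way: to look for any document asserting a given predicate and object, one would search for the single term produced with the subject replaced by the "don't care" URI.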

A Swangled Triple
<rdf:RDF
xmlns:s=" </rdf>
<s:SwangledTriple>
  <s:swangledText>N656WNTZ36KQ5PX6RFUGVKQ63A</s:swangledText>
  <rdfs:comment>Swangled text for [ </rdfs:comment>
  <s:swangledText>M6IMWPWIH4YQI4IMGZYBGPYKEI</s:swangledText>
  <s:swangledText>HO2H3FOPAEM53AQIZ6YVPFQ2XI</s:swangledText>
  <s:swangledText>2AQEUJOYPMXWKHZTENIJS6PQ6M</s:swangledText>
  <s:swangledText>IIVQRXOAYRH6GGRZDFXKEEB4PY</s:swangledText>
  <s:swangledText>75Q5Z3BYAKRPLZDLFNS5KKMTOY</s:swangledText>
  <s:swangledText>2FQ2YI7SNJ7OMXOXIDEEE2WOZU</s:swangledText>
</s:SwangledTriple>

What’s the point?
We’d like to get our documents into Google.
  Swangle terms look like words to Google and other search engines.
Cloaking obviates modifying the document.
  Add rules to the web server so that, when a search spider asks for document X, the document swangled(X) is returned.
  Caching makes this efficient.
A swangle term length of 7 may be an acceptable length for a Semantic Web of 10^10 triples; the collision probability for a triple is ~ 2*10^-6.
We could also use Swanglish: hashing each triple into N of the 50K most common English words.
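The Swanglish idea could look something like the sketch below. It is illustrative only: the slide does not say how words are chosen, so slicing a SHA-1 digest into 4-byte chunks is an assumption, and the eight-word `VOCABULARY` is a stand-in for the 50K-word list.

```python
import hashlib

# Tiny stand-in vocabulary (assumption); the slide suggests the 50K most
# common English words.
VOCABULARY = ["apple", "bridge", "candle", "delta",
              "ember", "forest", "garden", "harbor"]


def swanglish(subject, predicate, obj, n=3, vocabulary=VOCABULARY):
    """Map a triple to n vocabulary words picked by chunks of its SHA-1 digest."""
    if not 1 <= n <= 5:
        raise ValueError("a 20-byte digest supports between 1 and 5 words")
    key = " ".join((subject, predicate, obj)).encode("utf-8")
    digest = hashlib.sha1(key).digest()
    words = []
    for i in range(n):
        chunk = digest[4 * i: 4 * i + 4]  # 4 bytes of the digest per word
        index = int.from_bytes(chunk, "big") % len(vocabulary)
        words.append(vocabulary[index])
    return words
```

With a realistic vocabulary the resulting word sequences would be ordinary English tokens, which is the property the slide is after; with a list this small, collisions are of course frequent.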

Student Event Scenario
UMBC sends out descriptions of ~50 events a week to students.
Each student has a “standing query” used to route event messages.
  A student only receives announcements of events matching his/her interests and schedule.
Use LMCO’s AeroText system to automatically add DAML+OIL markup to event descriptions.
  Categorize text announcements into event types.
  Identify key elements and add DAML markup.
Use JESS to reason over the markup, drawing ontology-supported inferences.

Event Ontology
A simple ontology for University events.
Includes classes, subclasses, properties, etc.
Can include instance data, e.g., UMBC, NEC, Fairleigh Dickinson, etc.

OWLIR Architecture
[Figure: OWLIR architecture. Event descriptions (text) pass through information extraction and classification (LMCO AeroText + Java) into event categories (Movie, Sport, Talk, ..., Trip), producing text + DAML. Jess extracts triples and reasons, yielding text + triples; triples are converted to index terms and indexed by SIRE. On the query side, the user interface collects text with “must”, “OK”, and “must not” terms; Jess extracts triples and reasons, triples are converted to index terms, SIRE retrieves text + triples, and inference on the results produces the final results shown in the user interface. Agents expand event descriptions.]

Swoogle

http://swoogle.umbc.edu/
The web, like Gaul, is divided into three parts: the regular web (e.g., HTML), Semantic Web Ontologies (SWOs), and Semantic Web Instance files (SWIs). SWD = SWO + SWI.
Swoogle uses four kinds of crawlers to discover semantic web documents and several analysis agents to compute metadata and relations among documents and ontologies. Metadata is stored in a relational DBMS.
A SWD’s rank is a function of its type (SWO/SWI) and the rank and types of the documents to which it’s related.
Swoogle provides services to people via a web interface and to agents as web services.
Swoogle puts documents into a character n-gram based IR engine to compute document similarity and do retrieval from queries.
Statistics as of November 2004
  SWDs         336,000
  Ontologies   4,200
  Classes      95,000
  Properties   53,000
  Individuals  7,200,000
  Triples      47,000,000
[Figure: Swoogle architecture. Labels: The Web (HTML documents, images, CGI scripts, audio files, video files, SWOs, SWIs); Web Crawler (discovery); Candidate URLs; SWD Reader; SWD Cache; SWD Metadata; IR analyzer and SWD analyzer (analysis, digest); services: Swoogle Search, Ontology Dictionary, Swoogle Statistics, SWD Rank, SWD IR Engine; Web Server for human users; Web Service for intelligent agents.]
Contributors include Tim Finin, Anupam Joshi, Yun Peng, R. Scott Cost, Jim Mayfield, Joel Sachs, Pavan Reddivari, Vishal Doshi, Rong Pan, Li Ding, and Drew Ogle.
Partial research support was provided by DARPA contract F and by NSF awards NSF-ITR-IIS and NSF-ITR-IDM. November 2004.

Concepts
Document
  A Semantic Web Document (SWD) is an online document written in semantic web languages (i.e. RDF and OWL).
  An ontology document (SWO) is a SWD that contains mostly term definitions (i.e. classes and properties). It corresponds to the T-Box in Description Logic.
  An instance document (SWI or SWDB) is a SWD that contains mostly class individuals. It corresponds to the A-Box in Description Logic.
Term
  A term is a non-anonymous RDF resource which is the URI reference of either a class or a property.
Individual
  An individual refers to a non-anonymous RDF resource which is the URI reference of a class member.
In Swoogle, a document D is a valid SWD iff. JENA* correctly parses D and produces at least one triple.
*JENA is a Java framework for writing Semantic Web applications.
[Figure: example triples using rdf:type, foaf:Person, and rdfs:Class.]
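The document definitions above can be made concrete with a small sketch. It is an illustration under stated assumptions, not Swoogle's actual logic: triples are assumed to be already parsed into `(subject, predicate, object)` string tuples, blank nodes are assumed to start with `_:`, "mostly" is read as a simple majority, and ties default to `SWI`.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDF_PROPERTY = "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"

# Types whose instances count as term definitions (classes and properties).
TERM_TYPES = {RDFS_CLASS, OWL_CLASS, RDF_PROPERTY}


def classify_swd(triples):
    """Label parsed triples as 'SWO', 'SWI', or None when there are no triples."""
    if not triples:
        return None  # a valid SWD must yield at least one triple
    terms, individuals = set(), set()
    for subject, predicate, obj in triples:
        if predicate != RDF_TYPE or subject.startswith("_:"):
            continue  # only typed, non-anonymous resources are counted
        if obj in TERM_TYPES:
            terms.add(subject)        # defined class or property
        else:
            individuals.add(subject)  # member of some class
    # Assumption: "mostly" means a strict majority of typed resources.
    return "SWO" if len(terms) > len(individuals) else "SWI"
```

Under this reading, a FOAF profile that only types a resource as `foaf:Person` comes out as an instance document, while a vocabulary file declaring classes and properties comes out as an ontology.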

Demo
1 Find “Time” Ontology (Swoogle Search)
2 Digest “Time” Ontology
    Document view
    Term view
3 Find Term “Person” (Ontology Dictionary)
4 Digest Term “Person”
    Class properties
    (Instance) properties
5 Swoogle Statistics

Demo 1 Find “Time” Ontology
We can use a set of keywords to search for ontologies. For example, “time, before, after” are basic concepts for a “Time” ontology.

Usage of Terms in SWD
[Figure: term usage in http://www.cs.umbc.edu/~finin/foaf.rdf. Labels: defined Class, defined Property, defined Individual, populated Class, populated Property. Example triples involve foaf:Person, foaf:mbox, rdf:type, rdfs:Class, rdf:Property, rdfs:domain, rdfs:subClassOf, and wordNet:Agent.]

Demo 2(a) Digest “Time” Ontology (term view)
Example terms: TimeZone, before, ..., intAfter

Demo 2(b) Digest “Time” Ontology (document view)

Demo 3 Find Term “Person”
Not capitalized! URIref is case sensitive!

Demo 4 Digest Term “Person” 167 different properties 562 different properties

Demo 5 Swoogle Statistics

Swoogle IR Search
This is work in progress, not yet fully integrated into Swoogle.
Documents are put into an n-gram IR engine (after processing by Jena) in canonical XML form.
  Each contiguous sequence of N characters is used as an index term (e.g., N=5).
  Queries are processed the same way.
Character n-grams work almost as well as words but have some advantages.
  No tokenization, so they work well with artificial languages and agglutinative languages => good for RDF!
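A minimal sketch of character n-gram indexing terms follows. It only illustrates the idea on the slide: lowercasing and the overlap count used as a match score are assumptions, and a real engine would apply proper term weighting.

```python
from collections import Counter


def char_ngrams(text, n=5):
    """Return every contiguous run of n characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_overlap(query, document, n=5):
    """Count the n-grams shared by query and document (with multiplicity)."""
    query_terms = Counter(char_ngrams(query.lower(), n))
    document_terms = Counter(char_ngrams(document.lower(), n))
    # Counter intersection keeps the minimum count of each shared n-gram.
    return sum((query_terms & document_terms).values())
```

Because no tokenization is involved, a query word such as `before` can still share n-grams with a run-together identifier such as `intBefore` inside a URI, which a word-based index would treat as a single unrelated token.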

Why character n-grams?
Suppose we want to find ontologies for time.
We might use the following query:
  “time temporal interval point before after during day month year eventually calendar clock duration end begin zone”
And have matches for documents with URIs like

Another approach: URIs as words
Remember: ontologies define vocabularies.
In OWL, URIs of classes and properties are the words.
So, take a SWD, reduce it to triples, extract the URIs (with duplicates), discard URIs for blank nodes, hash each URI to a token (use MD5Hash), and index the document.
Process queries in the same way.
Variation: include literal data (e.g., strings) too.
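The pipeline above might be sketched as follows. The input representation is an assumption made for illustration: triples are string tuples, blank nodes start with `_:`, and literals are written in double quotes (N-Triples style). Python's `hashlib.md5` stands in for the MD5Hash step.

```python
import hashlib


def uri_tokens(triples, include_literals=False):
    """Turn a document's triples into a bag of hashed tokens for indexing."""
    tokens = []
    for triple in triples:
        for node in triple:
            if node.startswith("_:"):
                continue  # blank nodes have no stable URI
            if node.startswith('"') and not include_literals:
                continue  # literals are indexed only in the variation
            # Duplicates are kept so that term frequency survives hashing.
            tokens.append(hashlib.md5(node.encode("utf-8")).hexdigest())
    return tokens
```

The resulting tokens can be joined with spaces and handed to any word-based indexer; a query is hashed with the same function so that identical URIs produce identical tokens.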

Conclusion

What we have done
Developed Swoogle, a crawler-based retrieval system for SWDs.
Developed and implemented a technique to get Google to index and retrieve SWDs.
Prototyped (twice) an n-gram based IR engine for SWDs.
Explored the integration of inference and retrieval.
Used these in several demonstration systems.
