Information Retrieval and the Semantic Web
Tim Finin, James Mayfield, Anupam Joshi, R. Scott Cost and Clay Fink
University of Maryland, Baltimore County
Johns Hopkins University, Applied Physics Lab
04 January 2004
DARPA contract F30602-00-0591 and NSF awards ITR-IIS-0326460 and ITR-IIS-0325464 provided partial research support for this work
Introduction and motivation
“XML is Lisp's bastard nephew, with uglier syntax and no semantics. Yet XML is poised to enable the creation of a Web of data that dwarfs anything since the Library at Alexandria.” -- Philip Wadler, Et tu XML? The fall of the relational empire, VLDB, Rome, September 2001.
“The web has made people smarter. We need to understand how to use it to make machines smarter, too.” -- Michael I. Jordan (UC Berkeley), paraphrased from a talk at AAAI, July 2002
“The Semantic Web will globalize KR, just as the WWW globalized hypertext” -- Tim Berners-Lee
“The multi-agent systems paradigm and the web both emerged around 1990. One has succeeded beyond imagination and the other has not yet made it out of the lab.” -- Anonymous, 2001
Software agents will need something similar to maximize the use of information on the semantic web.
Vision and Model
Vision
Semantic markup (e.g., OWL) as markup
  Web documents are traditional HTML documents, augmented with machine-readable semantic markup that describes their content
Inference and retrieval are tightly bound
  Inference over semantic markup improves retrieval, and text retrieval facilitates inference
Agents should use the web like humans do
  Think of a query, encode it to retrieve possibly relevant documents, read some and extract knowledge, repeat until objectives are met
Why use IR techniques?
We will want to retrieve over structured and unstructured knowledge
  We should prepare for the appearance of text documents with embedded SW markup
We may want to get our SWDs into conventional search engines, such as Google
Mature, scalable, low-cost, deployed infrastructure
IR techniques also have some unique characteristics that may be very useful
  e.g., ranking matches, document similarity, clustering, relevance feedback, etc.
Framework–Semantic Markup
[Diagram. Components: agent, Local KB, Inference Engine, Extractor, Encoder (“swangler”), Web Search Engine, Filters. Data labels: Semantic Web Query, Statement to be proved, Encoded Markup, Semantic Markup, Ranked Pages.]
Framework–Incorporating Text
[Diagram. Same components as the previous slide. Additional data labels: Text Query, Text, and a second set of Filters applied to text.]
Harnessing Google
Google started indexing RDF documents some time in late 2003
Can we take advantage of this?
We’ve developed techniques to get some structured data indexed by Google
  And then later retrieved
Technique: give Google enhanced documents with additional annotations containing Swangle Terms™
Swangle definition
swan·gle
Pronunciation: ‘swa[ng]-g&l
Function: transitive verb
Inflected Forms: swan·gled; swan·gling /-g(&-)li[ng]/
Etymology: Postmodern English, from C++ mangle
Date: 20th century
1: to convert an RDF triple into one or more IR indexing terms
2: to process a document or query so that its content-bearing markup will be indexed by an IR system
Synonym: see tblify
- swan·gler /-g(&-)l&r/ noun
Swangling
Swangling turns a SW triple into 7 word-like terms
  One for each non-empty subset of the three components, with the missing elements replaced by the special “don’t care” URI
  Terms generated by a hashing function (e.g., SHA1)
Swangling an RDF document means adding in triples with swangle terms
  The result can be indexed and retrieved via conventional search engines like Google
Allows one to search for a SWD with a triple that claims “Osama bin Laden is located at X”
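A minimal sketch of the swangling step, under stated assumptions: the “don’t care” URI, the way the three components are joined before hashing, the base32 encoding, and the 26-character truncation are all hypothetical choices made for illustration. The slides only say that each non-empty subset of the components is hashed (e.g., with SHA1).

```python
import base64
import hashlib
from itertools import product

# Hypothetical placeholder; the slides do not give the real "don't care" URI.
DONT_CARE = "http://example.org/swangle#dontCare"


def swangle_term(subject, predicate, obj, length=26):
    """Hash one (possibly wildcarded) triple pattern into a word-like term."""
    joined = " ".join((subject, predicate, obj))  # assumed serialization
    digest = hashlib.sha1(joined.encode("utf-8")).digest()
    # Base32 uses only A-Z and 2-7, so the term looks like a word to an indexer.
    return base64.b32encode(digest).decode("ascii").rstrip("=")[:length]


def swangle(subject, predicate, obj):
    """Return 7 terms: one per non-empty subset of the components that are kept."""
    triple = (subject, predicate, obj)
    terms = []
    for keep in product((True, False), repeat=3):
        if not any(keep):
            continue  # skip the empty subset, where every component is a wildcard
        pattern = [part if kept else DONT_CARE for part, kept in zip(triple, keep)]
        terms.append(swangle_term(*pattern))
    return terms


if __name__ == "__main__":
    for term in swangle(
        "http://www.xfront.com/owl/ontologies/camera/#Camera",
        "http://www.w3.org/2000/01/rdf-schema#subClassOf",
        "http://www.xfront.com/owl/ontologies/camera/#PurchaseableItem",
    ):
        print(term)
```

A query such as “which documents claim that something is a subclass of PurchaseableItem?” would be swangled the same way, with the subject replaced by the wildcard, and the resulting single term submitted to the search engine.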
A Swangled Triple
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:s="http://swoogle.umbc.edu/ontologies/swangle.owl#">
  <s:SwangledTriple>
    <s:swangledText>N656WNTZ36KQ5PX6RFUGVKQ63A</s:swangledText>
    <rdfs:comment>Swangled text for [http://www.xfront.com/owl/ontologies/camera/#Camera, http://www.w3.org/2000/01/rdf-schema#subClassOf, http://www.xfront.com/owl/ontologies/camera/#PurchaseableItem]</rdfs:comment>
    <s:swangledText>M6IMWPWIH4YQI4IMGZYBGPYKEI</s:swangledText>
    <s:swangledText>HO2H3FOPAEM53AQIZ6YVPFQ2XI</s:swangledText>
    <s:swangledText>2AQEUJOYPMXWKHZTENIJS6PQ6M</s:swangledText>
    <s:swangledText>IIVQRXOAYRH6GGRZDFXKEEB4PY</s:swangledText>
    <s:swangledText>75Q5Z3BYAKRPLZDLFNS5KKMTOY</s:swangledText>
    <s:swangledText>2FQ2YI7SNJ7OMXOXIDEEE2WOZU</s:swangledText>
  </s:SwangledTriple>
</rdf:RDF>
What’s the point?
We’d like to get our documents into Google
  Swangle terms look like words to Google and other search engines
Cloaking obviates modifying the document
  Add rules to the web server so that, when a search spider asks for document X, the document swangled(X) is returned
  Caching makes this efficient
A swangle term length of 7 may be acceptable for a Semantic Web of 10^10 triples
  Collision probability for a triple is ~2*10^-6
We could also use Swanglish: hashing each triple into N of the 50K most common English words
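One way the Swanglish idea might look, as a rough sketch only. The vocabulary here is a tiny stand-in for the 50K most common English words, and the scheme for deriving N word indices from a single SHA1 digest is an assumption; the slides do not specify how the N words are chosen.

```python
import hashlib

# Tiny stand-in vocabulary (assumption); the slide envisions ~50K common English words.
VOCABULARY = [
    "apple", "bridge", "candle", "doctor", "engine", "forest", "garden", "harbor",
    "island", "jacket", "kettle", "ladder", "market", "needle", "orange", "pencil",
]


def swanglish(subject, predicate, obj, n=3, vocabulary=VOCABULARY):
    """Map a triple to n ordinary words chosen deterministically from its hash."""
    if not 1 <= n <= 5:
        # A SHA1 digest is 20 bytes, which gives at most five 4-byte slices.
        raise ValueError("n must be between 1 and 5")
    joined = " ".join((subject, predicate, obj))  # assumed serialization
    digest = hashlib.sha1(joined.encode("utf-8")).digest()
    words = []
    for i in range(n):
        chunk = digest[4 * i: 4 * i + 4]  # successive 4-byte slices of the digest
        words.append(vocabulary[int.from_bytes(chunk, "big") % len(vocabulary)])
    return words


if __name__ == "__main__":
    print(swanglish(
        "http://www.xfront.com/owl/ontologies/camera/#Camera",
        "http://www.w3.org/2000/01/rdf-schema#subClassOf",
        "http://www.xfront.com/owl/ontologies/camera/#PurchaseableItem",
    ))
```

Because the output consists of real words, a search engine that ignores unknown tokens would still index it; the trade-off is a higher chance of accidental matches against ordinary text.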
OWLIR
Student Event Scenario
UMBC sends out descriptions of ~50 events a week to students
Each student has a “standing query” used to route event messages
  A student only receives announcements of events matching his/her interests and schedule
Use LMCO’s AeroText system to automatically add DAML+OIL markup to event descriptions
  Categorize text announcements into event types
  Identify key elements and add DAML markup
Use JESS to reason over the markup, drawing ontology-supported inferences
Event Ontology
A simple ontology for University events
Includes classes, subclasses, properties, etc.
Can include instance data, e.g., UMBC, NEC, Fairleigh Dickinson, etc.
OWLIR Architecture
[Diagram. Components: Event Descriptions, Info Extraction (LMCO AeroText + Java), Classification into Event Categories (Movie, Sport, Talk, . . ., Trip), Jess (extract triples & reason, expand, inference on results), conversion of triples to index terms, SIRE (index and retrieve), Agents, User Interface. Data labels: Text, Text + DAML, Text + triples, Query (Must, OK, Must not), Results, Final Results.]
Swoogle
http://swoogle.umbc.edu/
The web, like Gaul, is divided into three parts: the regular web (e.g. HTML), Semantic Web Ontologies (SWOs), and Semantic Web Instance files (SWIs). SWD = SWO + SWI.
Swoogle uses four kinds of crawlers to discover semantic web documents and several analysis agents to compute metadata and relations among documents and ontologies. Metadata is stored in a relational DBMS. Services are provided to people and agents.
A SWD’s rank is a function of its type (SWO/SWI) and the rank and types of the documents to which it’s related.
Swoogle puts documents into a character n-gram based IR engine to compute document similarity and do retrieval from queries.
Swoogle provides services to people via a web interface and to agents as web services.
[Architecture diagram (SWOOGLE 2). Discovery: The Web, Web Crawler, Candidate URLs, SWD Reader, SWD Cache. Digest: SWD Metadata. Analysis: SWD analyzer, IR analyzer. Service: Web Server (human users), Web Service (intelligent agents). Offered services: Swoogle Search, Ontology Dictionary, Swoogle Statistics, SWD Rank, SWD IR Engine. The regular web is shown as HTML documents, images, CGI scripts, audio files, and video files.]
Statistics as of November 2004
  SWDs          336,000
  Ontologies    4,200
  Triples       47,000,000
  Classes       95,000
  Properties    53,000
  Individuals   7,200,000
Contributors include Tim Finin, Anupam Joshi, Yun Peng, R. Scott Cost, Jim Mayfield, Joel Sachs, Pavan Reddivari, Vishal Doshi, Rong Pan, Li Ding, and Drew Ogle. Partial research support was provided by DARPA contract F30602-00-0591 and by NSF awards NSF-ITR-IIS-0326460 and NSF-ITR-IDM-0219649. November 2004.
Concepts
Document
  A Semantic Web Document (SWD) is an online document written in semantic web languages (i.e. RDF and OWL).
  An ontology document (SWO) is a SWD that contains mostly term definitions (i.e. classes and properties). It corresponds to the T-Box in Description Logic.
  An instance document (SWI or SWDB) is a SWD that contains mostly class individuals. It corresponds to the A-Box in Description Logic.
Term
  A term is a non-anonymous RDF resource which is the URI reference of either a class or a property.
  Example: foaf:Person rdf:type rdfs:Class
Individual
  An individual refers to a non-anonymous RDF resource which is the URI reference of a class member.
  Example: http://.../foaf.rdf#finin rdf:type foaf:Person
In Swoogle, a document D is a valid SWD iff JENA* correctly parses D and produces at least one triple.
*JENA is a Java framework for writing Semantic Web applications. http://www.hpl.hp.com/semweb/jena2.htm
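The definitions above can be sketched as a small classifier. This is an illustration under assumptions, not Swoogle's actual procedure: parsing (done by Jena in Swoogle) is assumed to have already happened, triples are plain tuples of prefixed names, blank nodes are written with a leading "_:", and “mostly” is read as a simple majority of typed resources.

```python
# Types whose instances count as terms (classes or properties). The exact set
# Swoogle uses is not given in the slides; this list is an assumption.
TERM_TYPES = {
    "rdfs:Class", "owl:Class",
    "rdf:Property", "owl:ObjectProperty", "owl:DatatypeProperty",
}


def classify_swd(triples):
    """Return "SWO", "SWI", or None when the document yields no triples."""
    if not triples:
        return None  # not a valid SWD: at least one triple is required
    terms, individuals = set(), set()
    for subject, predicate, obj in triples:
        if predicate != "rdf:type" or subject.startswith("_:"):
            continue  # only typed, non-anonymous resources are counted
        if obj in TERM_TYPES:
            terms.add(subject)        # a class or property definition
        else:
            individuals.add(subject)  # a member of some class
    # Assumption: "mostly" means a strict majority; ties fall to SWI.
    return "SWO" if len(terms) > len(individuals) else "SWI"


if __name__ == "__main__":
    ontology = [
        ("foaf:Person", "rdf:type", "rdfs:Class"),
        ("foaf:mbox", "rdf:type", "rdf:Property"),
    ]
    instances = [
        ("http://foo.com/foaf.rdf#finin", "rdf:type", "foaf:Person"),
    ]
    print(classify_swd(ontology), classify_swd(instances), classify_swd([]))
```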
Demo
1 Find “Time” Ontology (Swoogle Search)
2 Digest “Time” Ontology: document view, term view
3 Find Term “Person” (Ontology Dictionary)
4 Digest Term “Person”: class properties, (instance) properties
5 Swoogle Statistics
Demo 1 Find “Time” Ontology
We can use a set of keywords to search for an ontology. For example, “time, before, after” are basic concepts for a “Time” ontology.
Usage of Terms in SWD
[Diagram. Two instance documents, http://www.cs.umbc.edu/~finin/foaf.rdf and http://foo.com/foaf.rdf, each describe an individual (e.g., http://foo.com/foaf.rdf#finin) with rdf:type foaf:Person and foaf:mbox finin@umbc.edu. The ontology at http://xmlns.com/foaf/1.0/ states foaf:Person rdf:type rdfs:Class, foaf:Person rdfs:subClassOf wordNet:Agent, foaf:mbox rdf:type rdf:Property, and foaf:mbox rdfs:domain foaf:Person. Usage labels: defined Class, defined Property, defined Individual, populated Class, populated Property.]
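A rough sketch of how the defined/populated distinction could be computed from a document's triples. The rules below are an interpretation of the slide, not Swoogle's code: a class or property counts as defined where it is typed as one, a class counts as populated where something is declared a member of it, and a property counts as populated where it is used as a predicate outside the rdf:, rdfs:, and owl: vocabularies. Prefixed names stand in for full URIs.

```python
CLASS_TYPES = {"rdfs:Class", "owl:Class"}
PROPERTY_TYPES = {"rdf:Property", "owl:ObjectProperty", "owl:DatatypeProperty"}
SCHEMA_PREFIXES = ("rdf:", "rdfs:", "owl:")  # assumed to be schema-level vocabulary


def term_usage(triples):
    """Group the terms in one document by how the document uses them."""
    usage = {
        "defined_classes": set(),
        "defined_properties": set(),
        "populated_classes": set(),
        "populated_properties": set(),
    }
    for subject, predicate, obj in triples:
        if predicate == "rdf:type" and obj in CLASS_TYPES:
            usage["defined_classes"].add(subject)
        elif predicate == "rdf:type" and obj in PROPERTY_TYPES:
            usage["defined_properties"].add(subject)
        elif predicate == "rdf:type":
            usage["populated_classes"].add(obj)  # obj has an instance here
        elif not predicate.startswith(SCHEMA_PREFIXES):
            usage["populated_properties"].add(predicate)
    return usage


if __name__ == "__main__":
    instance_doc = [
        ("http://foo.com/foaf.rdf#finin", "rdf:type", "foaf:Person"),
        ("http://foo.com/foaf.rdf#finin", "foaf:mbox", "finin@umbc.edu"),
    ]
    print(term_usage(instance_doc))
```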
Demo 2(a) Digest “Time” Ontology (term view)
TimeZone before …………. intAfter
Demo 2(b) Digest “Time” Ontology (document view)
Demo 3 Find Term “Person” Not capitalized! URIref is case sensitive!
Demo 4 Digest Term “Person”
167 different properties
562 different properties
Demo 5 Swoogle Statistics
Swoogle IR Search
This is work in progress, not yet fully integrated into Swoogle
Documents are put into an n-gram IR engine (after processing by Jena) in canonical XML form
  Each contiguous sequence of N characters is used as an index term (e.g., N=5)
  Queries are processed the same way
Character n-grams work almost as well as words but have some advantages
  No tokenization, so they work well with artificial languages and agglutinative languages => good for RDF!
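The indexing step described above can be sketched in a few lines. This is an illustrative toy, not the Swoogle engine: the inverted-index layout and the function names are assumptions, and the canonical XML form produced via Jena is replaced by plain strings.

```python
from collections import defaultdict


def char_ngrams(text, n=5):
    """Every contiguous run of n characters in text; no tokenization is done."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(documents, n=5):
    """Map each character n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


def lookup(index, query, n=5):
    """Documents sharing at least one n-gram with the query (processed the same way)."""
    hits = set()
    for gram in char_ngrams(query, n):
        hits |= index.get(gram, set())
    return hits


if __name__ == "__main__":
    docs = {
        "d1": "http://foo.com/timeont.owl#timeInterval",
        "d2": "http://foo.com/camera.owl#Lens",
    }
    idx = build_index(docs)
    print(lookup(idx, "interval"))
```

Because the n-grams slide across the whole string, a fragment buried inside a URI such as `timeInterval` is still reachable from the plain query word `interval`.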
Why character n-grams?
Suppose we want to find ontologies for time
We might use the following query
  “time temporal interval point before after during day month year eventually calendar clock duration end begin zone”
And have matches for documents with URIs like
  http://foo.com/timeont.owl#timeInterval
  http://foo.com/timeont.owl#CalendarClockInterval
  http://purl.org/upper/temporal/t13.owl#timeThing
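To make the example concrete, here is a small sketch that scores the three URIs against the query by 5-gram overlap. The scoring function (fraction of a URI's n-grams that also occur in the query), the case-folding, and the unrelated comparison URI are all assumptions added for illustration; the slides do not say how matches are ranked.

```python
QUERY = ("time temporal interval point before after during day month year "
         "eventually calendar clock duration end begin zone")

URIS = [
    "http://foo.com/timeont.owl#timeInterval",
    "http://foo.com/timeont.owl#CalendarClockInterval",
    "http://purl.org/upper/temporal/t13.owl#timeThing",
    "http://example.org/pizza.owl#Topping",  # hypothetical unrelated URI for contrast
]


def char_ngrams(text, n=5):
    """Set of character n-grams, case-folded (case-folding is an assumption)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap_score(query, document, n=5):
    """Fraction of the document's n-grams that also occur in the query."""
    query_grams = char_ngrams(query, n)
    doc_grams = char_ngrams(document, n)
    return len(query_grams & doc_grams) / len(doc_grams) if doc_grams else 0.0


ranked = sorted(URIS, key=lambda uri: overlap_score(QUERY, uri), reverse=True)

if __name__ == "__main__":
    for uri in ranked:
        print(f"{overlap_score(QUERY, uri):.3f}  {uri}")
```

No word-level tokenizer would split `CalendarClockInterval` into its parts, but its character n-grams still overlap with `calendar`, `clock`, and `interval` in the query.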
Another approach: URIs as words
Remember: ontologies define vocabularies
In OWL, URIs of classes and properties are the words
So, take a SWD, reduce it to triples, extract the URIs (with duplicates), discard URIs for blank nodes, hash each URI to a token (use MD5Hash), and index the document
Process queries in the same way
Variation: include literal data (e.g., strings) too
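The pipeline above might be sketched as follows. The triple representation is an assumption borrowed from N-Triples conventions (URIs in angle brackets, blank nodes starting with "_:", literals in double quotes), and the hex-digest token format is likewise illustrative; the slide only names MD5 as the hash.

```python
import hashlib


def uri_tokens(triples, include_literals=False):
    """Turn a document's triples into a list of hashed tokens, duplicates kept."""
    tokens = []
    for triple in triples:
        for node in triple:
            if node.startswith("_:"):
                continue  # blank nodes carry no shared vocabulary, so drop them
            if node.startswith("<") and node.endswith(">"):
                value = node[1:-1]  # a URI reference
            elif include_literals:
                value = node        # the variation: index literal data too
            else:
                continue
            tokens.append(hashlib.md5(value.encode("utf-8")).hexdigest())
    return tokens


if __name__ == "__main__":
    doc = [
        ("<http://foo.com/foaf.rdf#finin>",
         "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>",
         "<http://xmlns.com/foaf/0.1/Person>"),
        ("_:b0",
         "<http://xmlns.com/foaf/0.1/mbox>",
         '"finin@umbc.edu"'),
    ]
    print(uri_tokens(doc))
```

Keeping duplicates means a URI used many times in a document contributes a higher term frequency, which a conventional IR engine can then use for ranking.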
Conclusion
What we have done
Developed Swoogle – a crawler-based retrieval system for SWDs
Developed and implemented a technique to get Google to index and retrieve SWDs
Prototyped (twice) an n-gram based IR engine for SWDs
Explored the integration of inference and retrieval
Used these in several demonstration systems