Information Retrieval and the Semantic Web


Information Retrieval and the Semantic Web. Tim Finin, James Mayfield, Anupam Joshi, R. Scott Cost and Clay Fink. University of Maryland, Baltimore County; Johns Hopkins University, Applied Physics Lab. 04 January 2004. DARPA contract F30602-00-0591 and NSF awards ITR-IIS-0326460 and ITR-IIS-0325464 provided partial research support for this work.

Introduction and motivation

“XML is Lisp's bastard nephew, with uglier syntax and no semantics. Yet XML is poised to enable the creation of a Web of data that dwarfs anything since the Library at Alexandria.” -- Philip Wadler, Et tu XML? The fall of the relational empire, VLDB, Rome, September 2001.

“The web has made people smarter. We need to understand how to use it to make machines smarter, too.” -- Michael I. Jordan (UC Berkeley), paraphrased from a talk at AAAI, July 2002

“The Semantic Web will globalize KR, just as the WWW globalized hypertext” -- Tim Berners-Lee

“The multi-agent systems paradigm and the web both emerged around 1990. One has succeeded beyond imagination and the other has not yet made it out of the lab.” -- Anonymous, 2001

Software agents will need something similar to maximize the use of information on the semantic web.

Vision and Model

Vision. Semantic markup (e.g., OWL) as markup: web documents are traditional HTML documents, augmented with machine-readable semantic markup that describes their content. Inference and retrieval are tightly bound: inference over semantic markup improves retrieval, and text retrieval facilitates inference. Agents should use the web like humans do: think of a query, encode it to retrieve possibly relevant documents, read some and extract knowledge, and repeat until objectives are met.

Why use IR techniques? We will want to retrieve over structured and unstructured knowledge. We should prepare for the appearance of text documents with embedded SW markup. We may want to get our SWDs into conventional search engines, such as Google, which offer a mature, scalable, low-cost, deployed infrastructure. IR techniques also have some unique characteristics that may be very useful, e.g., ranking matches, document similarity, clustering, relevance feedback, etc.

Framework–Semantic Markup [Figure: an agent with a Local KB and an Inference Engine. Labeled components: Semantic Web Query, Statement to be proved, Extractor, Encoder (“swangler”), Encoded Markup, Semantic Markup, Web Search Engine, Ranked Pages, Filters.]

Framework–Incorporating Text [Figure: labeled components: Local KB, Inference Engine, Semantic Web Query, Statement to be proved, Extractor, Encoder (“swangler”), Encoded Markup, Semantic Markup, Web Search Engine, Text Query, Text, Ranked Pages, Filters.]

Harnessing Google. Google started indexing RDF documents some time in late 2003. Can we take advantage of this? We’ve developed techniques to get some structured data indexed by Google and then later retrieved. Technique: give Google enhanced documents with additional annotations containing Swangle Terms™.

Swangle definition. swan·gle. Pronunciation: ‘swa[ng]-g&l. Function: transitive verb. Inflected forms: swan·gled; swan·gling /-g(&-)li[ng]/. Etymology: Postmodern English, from C++ mangle. Date: 20th century. 1: to convert an RDF triple into one or more IR indexing terms. 2: to process a document or query so that its content-bearing markup will be indexed by an IR system. Synonym: see tblify. - swan·gler /-g(&-)l&r/ noun

Swangling. Swangling turns a SW triple into 7 word-like terms: one for each non-empty subset of the three components, with the missing elements replaced by the special “don’t care” URI. Terms are generated by a hashing function (e.g., SHA1). Swangling an RDF document means adding in triples with swangle terms. The result can be indexed and retrieved via conventional search engines like Google. This allows one to search for a SWD with a triple that claims “Ossama bin Laden is located at X”.
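The subset-and-hash step described on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the “don’t care” URI, the way the three components are joined before hashing, and the base32 encoding truncated to 26 characters are all hypothetical choices, so the output will not reproduce the actual terms generated by the authors' swangler.

```python
import base64
import hashlib
from itertools import product

# Hypothetical placeholder; the slides do not give the real "don't care" URI.
DONT_CARE = "http://example.org/swangle#dontCare"


def swangle_term(subject, predicate, obj, length=26):
    """Hash one triple pattern into a word-like term.

    A component given as None is treated as missing and is replaced
    by the "don't care" URI before hashing.
    """
    parts = [DONT_CARE if c is None else c for c in (subject, predicate, obj)]
    # Joining with a newline is an assumption made for this sketch.
    digest = hashlib.sha1("\n".join(parts).encode("utf-8")).digest()
    # Base32 gives upper-case letters and digits, which look like a word.
    return base64.b32encode(digest).decode("ascii").rstrip("=")[:length]


def swangle(subject, predicate, obj):
    """Return the 7 terms for a triple, one per non-empty subset."""
    triple = (subject, predicate, obj)
    terms = []
    for mask in product([True, False], repeat=3):
        if any(mask):  # skip the empty subset
            pattern = [c if keep else None for c, keep in zip(triple, mask)]
            terms.append(swangle_term(*pattern))
    return terms
```

Under these assumptions, a query such as “subject S is located at X” would be encoded as swangle_term(S, locatedAt, None), which is one of the seven terms added to any document containing a matching triple.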

A Swangled Triple
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:s="http://swoogle.umbc.edu/ontologies/swangle.owl#">
  <s:SwangledTriple>
    <s:swangledText>N656WNTZ36KQ5PX6RFUGVKQ63A</s:swangledText>
    <rdfs:comment>Swangled text for [http://www.xfront.com/owl/ontologies/camera/#Camera, http://www.w3.org/2000/01/rdf-schema#subClassOf, http://www.xfront.com/owl/ontologies/camera/#PurchaseableItem]</rdfs:comment>
    <s:swangledText>M6IMWPWIH4YQI4IMGZYBGPYKEI</s:swangledText>
    <s:swangledText>HO2H3FOPAEM53AQIZ6YVPFQ2XI</s:swangledText>
    <s:swangledText>2AQEUJOYPMXWKHZTENIJS6PQ6M</s:swangledText>
    <s:swangledText>IIVQRXOAYRH6GGRZDFXKEEB4PY</s:swangledText>
    <s:swangledText>75Q5Z3BYAKRPLZDLFNS5KKMTOY</s:swangledText>
    <s:swangledText>2FQ2YI7SNJ7OMXOXIDEEE2WOZU</s:swangledText>
  </s:SwangledTriple>
</rdf:RDF>

What’s the point? We’d like to get our documents into Google. Swangle terms look like words to Google and other search engines. Cloaking obviates modifying the document: add rules to the web server so that, when a search spider asks for document X, the document swangled(X) is returned. Caching makes this efficient. A swangle term length of 7 may be an acceptable length for a Semantic Web of 10^10 triples -- collision prob for a triple ~ 2*10^-6. We could also use Swanglish: hashing each triple into N of the 50K most common English words.
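The Swanglish idea mentioned at the end of this slide might look roughly like the sketch below. It is an assumption-laden illustration, not the authors' implementation: the slides do not say how a hash is mapped to words, so the 4-byte chunking, the use of SHA-1, and the vocabulary argument (standing in for the 50K most common English words) are all hypothetical.

```python
import hashlib


def swanglish(triple_text, vocabulary, n=3):
    """Map a triple (given as one string) to n words from a vocabulary.

    Each word is picked by reading 4 bytes of the SHA-1 digest as an
    integer and reducing it modulo the vocabulary size.
    """
    if not 1 <= n <= 5:
        # SHA-1 yields 20 bytes, enough for at most five 4-byte chunks.
        raise ValueError("n must be between 1 and 5")
    digest = hashlib.sha1(triple_text.encode("utf-8")).digest()
    words = []
    for i in range(n):
        chunk = int.from_bytes(digest[4 * i:4 * i + 4], "big")
        words.append(vocabulary[chunk % len(vocabulary)])
    return words
```

The same triple always maps to the same word sequence, so a query can be encoded in the same way and submitted to an ordinary search engine as plain English words.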

OWLIR

Student Event Scenario. UMBC sends out descriptions of ~50 events a week to students. Each student has a “standing query” used to route event messages, so a student only receives announcements of events matching his/her interests and schedule. Use LMCO’s AeroText system to automatically add DAML+OIL markup to event descriptions: categorize text announcements into event types, then identify key elements and add DAML markup. Use JESS to reason over the markup, drawing ontology-supported inferences.

Event Ontology. A simple ontology for University events. Includes classes, subclasses, properties, etc. Can include instance data, e.g., UMBC, NEC, Fairleigh Dickinson, etc.

OWLIR Architecture [Figure: labeled components: Agents, Event Descriptions, Info Extraction (LMCO AeroText + Java), Classification, Event Categories (Movie, Sport, Talk, . . ., Trip), Jess, Expand Event Description, Extract triples & reason, Text, Text + DAML, Text + triples, Convert triples to index terms, Index, Query, Must / OK / Must not, SIRE, Retrieve, Results, Inference on results, Final Results, User Interface.]

Swoogle

Swoogle 2 (http://swoogle.umbc.edu/), November 2004.
The web, like Gaul, is divided into three parts: the regular web (e.g. HTML), Semantic Web Ontologies (SWOs), and Semantic Web Instance files (SWIs). SWD = SWO + SWI.
Swoogle uses four kinds of crawlers to discover semantic web documents and several analysis agents to compute metadata and relations among documents and ontologies. Metadata is stored in a relational DBMS. Services are provided to people and agents.
SWD Rank: a SWD’s rank is a function of its type (SWO/SWI) and the rank and types of the documents to which it’s related.
SWD IR Engine: Swoogle puts documents into a character n-gram based IR engine to compute document similarity and do retrieval from queries.
Swoogle provides services to people via a web interface and to agents as web services.
Statistics as of November 2004: SWDs 336,000; Classes 95,000; Triples 47,000,000; Properties 53,000; Ontologies 4,200; Individuals 7,200,000.
[Figure: labeled components: The Web (SWOs, SWIs, HTML documents, images, CGI scripts, audio files, video files), Web Crawler, Candidate URLs, SWD Reader, SWD Cache, SWD Metadata, SWD analyzer, IR analyzer, SWD Rank, SWD IR Engine, Swoogle Search, Ontology Dictionary, Swoogle Statistics, Web Server, Web Service, Human users, Intelligent Agents. Stages: discovery, digest, analysis, service.]
Contributors include Tim Finin, Anupam Joshi, Yun Peng, R. Scott Cost, Jim Mayfield, Joel Sachs, Pavan Reddivari, Vishal Doshi, Rong Pan, Li Ding, and Drew Ogle. Partial research support was provided by DARPA contract F30602-00-0591 and by NSF awards NSF-ITR-IIS-0326460 and NSF-ITR-IDM-0219649.

Concepts: Document, Term, Individual.
Document: A Semantic Web Document (SWD) is an online document written in semantic web languages (i.e. RDF and OWL). An ontology document (SWO) is a SWD that contains mostly term definitions (i.e. classes and properties). It corresponds to the T-Box in Description Logic. An instance document (SWI or SWDB) is a SWD that contains mostly class individuals. It corresponds to the A-Box in Description Logic.
Term: A term is a non-anonymous RDF resource which is the URI reference of either a class or a property.
Individual: An individual refers to a non-anonymous RDF resource which is the URI reference of a class member.
In Swoogle, a document D is a valid SWD iff. JENA* correctly parses D and produces at least one triple.
*JENA is a Java framework for writing Semantic Web applications. http://www.hpl.hp.com/semweb/jena2.htm
[Figure: example triples: foaf:Person rdf:type rdfs:Class; http://.../foaf.rdf#finin rdf:type foaf:Person.]
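The SWO/SWI distinction above can be illustrated with a small Python sketch that works on already-parsed triples. Several things here are assumptions rather than Swoogle's actual rules: triples are plain tuples of strings, blank nodes start with "_:", “mostly” is read as a simple majority, and only rdf:type statements are used to decide whether a resource is a term or an individual. Parsing itself (done by Jena in Swoogle) is outside the sketch.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Types whose instances are treated as terms (classes or properties).
TERM_TYPES = {
    "http://www.w3.org/2000/01/rdf-schema#Class",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property",
    "http://www.w3.org/2002/07/owl#Class",
    "http://www.w3.org/2002/07/owl#ObjectProperty",
    "http://www.w3.org/2002/07/owl#DatatypeProperty",
}


def classify_swd(triples):
    """Return "SWO", "SWI", or None for a list of (s, p, o) string tuples.

    None means the document produced no triples, so it is not a valid SWD.
    """
    if not triples:
        return None
    terms, individuals = set(), set()
    for s, p, o in triples:
        if p != RDF_TYPE or s.startswith("_:"):
            continue  # only typed, non-anonymous resources are counted
        if o in TERM_TYPES:
            terms.add(s)
        else:
            individuals.add(s)
    # Simple-majority reading of "mostly"; ties fall to SWI arbitrarily.
    return "SWO" if len(terms) > len(individuals) else "SWI"
```

With this reading, a document that defines a Person class and an mbox property and mentions one person would count as an SWO, while a document holding only the person's description would count as an SWI.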

Demo. 1: Find “Time” Ontology (Swoogle Search). 2: Digest “Time” Ontology (document view, term view). 3: Find Term “Person” (Ontology Dictionary). 4: Digest Term “Person” (class properties, (instance) properties). 5: Swoogle Statistics.

Demo 1 Find “Time” Ontology. We can use a set of keywords to search for ontologies. For example, “time, before, after” are basic concepts for a “Time” ontology.

Usage of Terms in SWD [Figure: labeled nodes: http://www.cs.umbc.edu/~finin/foaf.rdf, http://foo.com/foaf.rdf, http://foo.com/foaf.rdf#finin, http://xmlns.com/foaf/1.0/, foaf:Person, foaf:mbox, finin@umbc.edu, rdfs:Class, rdf:Property, wordNet:Agent. Labeled edges: rdf:type, foaf:mbox, rdfs:subClassOf, rdfs:domain. Annotations: populated Class, populated Property, defined Class, defined Property, defined Individual.]

Demo 2(a) Digest “Time” Ontology (term view): TimeZone, before, ..., intAfter

Demo 2(b) Digest “Time” Ontology (document view)

Demo 3 Find Term “Person”. Not capitalized! URIref is case sensitive!

Demo 4 Digest Term “Person”: 167 different properties; 562 different properties

Demo 5 Swoogle Statistics

Swoogle IR Search. This is work in progress, not yet fully integrated into Swoogle. Documents are put into an ngram IR engine (after processing by Jena) in canonical XML form. Each contiguous sequence of N characters is used as an index term (e.g., N=5). Queries are processed the same way. Character ngrams work almost as well as words but have some advantages: no tokenization, so they work well with artificial languages and agglutinative languages => good for RDF!

Why character n-grams? Suppose we want to find ontologies for time. We might use the following query: “time temporal interval point before after during day month year eventually calendar clock duration end begin zone”, and have matches for documents with URIs like http://foo.com/timeont.owl#timeInterval, http://foo.com/timeont.owl#CalendarClockInterval, and http://purl.org/upper/temporal/t13.owl#timeThing.
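A minimal Python sketch of character n-gram matching, using the query and URIs from this slide, may make the point concrete. The scoring function is a deliberate simplification chosen for illustration (the fraction of query n-grams found in the document); it is not the ranking used by the Swoogle IR engine, and lower-casing the text is likewise an assumption.

```python
def char_ngrams(text, n=5):
    """Return the set of contiguous n-character sequences in text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap_score(query, document, n=5):
    """Fraction of the query's n-grams that also occur in the document."""
    q = char_ngrams(query, n)
    d = char_ngrams(document, n)
    return len(q & d) / len(q) if q else 0.0


query = ("time temporal interval point before after during day month year "
         "eventually calendar clock duration end begin zone")
uris = [
    "http://foo.com/timeont.owl#timeInterval",
    "http://foo.com/timeont.owl#CalendarClockInterval",
    "http://purl.org/upper/temporal/t13.owl#timeThing",
]

for uri in uris:
    # Each URI shares some 5-grams with the query even though none of the
    # query words appears in it as a separate whitespace-delimited token.
    print(uri, round(overlap_score(query, uri), 3))
```

A word-based index would see each URI as a single opaque token, whereas the n-gram view picks up fragments such as "inter", "calen" and "tempo" without any tokenization rules for camel case or URI syntax.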

Another approach: URIs as words. Remember: ontologies define vocabularies. In OWL, URIs of classes and properties are the words. So, take a SWD, reduce it to triples, extract the URIs (with duplicates), discard URIs for blank nodes, hash each URI to a token (use MD5Hash), and index the document. Process queries in the same way. Variation: include literal data (e.g., strings) too.
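The pipeline on this slide can be sketched in Python as below. The representation of triples is an assumption made for the example: each node is a string in an N-Triples-like style, where blank nodes start with "_:" and literals start with a double quote. The resulting token list would then be handed to whatever text indexer is in use, which is not shown.

```python
import hashlib


def uri_tokens(triples, include_literals=False):
    """Turn a list of (s, p, o) string tuples into hashed index tokens.

    Duplicates are kept, so term frequency is preserved for the indexer.
    """
    tokens = []
    for triple in triples:
        for node in triple:
            if node.startswith("_:"):
                continue  # discard blank nodes
            if node.startswith('"') and not include_literals:
                continue  # literals are only indexed in the variation
            tokens.append(hashlib.md5(node.encode("utf-8")).hexdigest())
    return tokens
```

Because a query is processed the same way, a query URI and a document URI produce the same token exactly when the URIs are identical, which also means matching is case sensitive.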

Conclusion

What we have done. Developed Swoogle, a crawler-based retrieval system for SWDs. Developed and implemented a technique to get Google to index and retrieve SWDs. Prototyped (twice) an ngram-based IR engine for SWDs. Explored the integration of inference and retrieval. Used these in several demonstration systems.