XSEarch: A Semantic Search Engine for XML
Sara Cohen, Jonathan Mamou, Yaron Kanza, Yehoshua Sagiv
Presented at VLDB 2003, Germany.

Presentation transcript:


XSEarch: an XML Search Engine
Our Goal: Find the “relevant” XML fragments, given tag names and keywords

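Excerpt from the XML Version of DBLP
<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>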

A Search Example Find papers by Vianu on the topic of “logical databases” How can we find such papers?

Attempt 1: Standard Search Engine Each document in the corpus is treated as an integral unit. A document containing some of the three query terms is considered as a result.

This does not work!!!
The document contains the three query terms. Hence, it is returned by a standard search engine, BUT it does not contain any paper on “logical databases” by Vianu:
– Moshe Y. Vardi, Querying Logical Databases: this fragment does not represent a paper by Vianu
– Victor Vianu, A Web Odyssey: From Codd to XML: this fragment does not represent a paper about logical databases
The document is not relevant to the query.

Attempt 2: XML Query Language
FOR $i IN document("bib.xml")//inproceedings
WHERE $i/author contains 'Vianu'
  AND $i/title contains 'Logical'
  AND $i/title contains 'Databases'
RETURN $i/author $i/title
This does work, BUT
– Complicated syntax
– Extensive knowledge of the document structure required to write the query
– No mechanism for ranking results

Our Requirements from the Search Tool
– A simple syntax that can be used by naive users
– Search results should include XML fragments and not necessarily full documents
– The XML fragments in an answer should be semantically related. For example, a paper and an author should be in an answer only if the paper was written by this author
– Search results should be ranked
– Search results should be returned in “reasonable” time

Overall Architecture
[Diagram: User Interface, Query Processor, Ranker, Indexer, Index Repository (Indices), XML Files]

Query Syntax and Semantics
[Architecture diagram, with the User Interface and the Query Processor highlighted]

XSEarch Query Syntax
A query is a list of query terms
A query term can be a
– Keyword, e.g., database
– Tag, e.g., inproceedings:
– Tag-keyword combination, e.g., author:Vianu
A query term is optionally preceded by a ‘+’

Example
Find papers by Vianu on the topic of “logical databases”
Query: logical +database inproceedings: author:Vianu
– logical: appearance of logical in the fragment increases the rank of this fragment
– +database: the keyword database must appear in the fragment
– inproceedings: appearance of the tag inproceedings in the fragment increases the rank of this fragment
– author:Vianu: appearance of Vianu under the tag author in the fragment increases the rank of this fragment
Note that the different document fragments matching these query terms must be “semantically related”
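The query syntax above can be sketched as a small parser. This is only an illustration, not the XSEarch implementation: the names `QueryTerm` and `parse_query` are hypothetical, and the sketch assumes terms are separated by whitespace and that a keyword contains no colon.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class QueryTerm:
    """One query term: a tag, a keyword, or both (hypothetical structure)."""
    tag: Optional[str]
    keyword: Optional[str]
    required: bool  # True when the term was preceded by '+'


def parse_query(query: str) -> List[QueryTerm]:
    """Split a query string into terms of the form keyword, tag: or tag:keyword."""
    terms = []
    for token in query.split():
        required = token.startswith("+")
        if required:
            token = token[1:]
        left, colon, right = token.partition(":")
        if colon:
            # "tag:" or "tag:keyword"
            terms.append(QueryTerm(left or None, right or None, required))
        else:
            # plain keyword
            terms.append(QueryTerm(None, token, required))
    return terms


terms = parse_query("logical +database inproceedings: author:Vianu")
```

Here only `database` comes out as a required term; the other three terms are optional and would only influence ranking.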

Semantic Relation: The Intuition

XSEarch query: author:Vianu title:
Document: Moshe Y. Vardi, Querying Logical Databases; Victor Vianu, A Web Odyssey: From Codd to XML
Returned fragment: title “A Web Odyssey: From Codd to XML” with author “Victor Vianu”
Good Result! title and author elements ARE semantically related

XSEarch query: author:Vianu title:
Document: Moshe Y. Vardi, Querying Logical Databases; Victor Vianu, A Web Odyssey: From Codd to XML
Candidate fragment: title “Querying Logical Databases” with author “Victor Vianu”
Bad Result! title and author elements ARE NOT semantically related

Semantic Relation: Formalization

Data Model: Document Tree
[Figure: document tree; tags are colored in green, data is colored in red]
proceedings
  inproceedings
    author: Moshe Y. Vardi
    title: Querying Logical Databases
  inproceedings
    author: Victor Vianu
    title: A Web Odyssey: From Codd to XML
GOAL: Find pairs of semantically related titles and authors.

Relationship Trees
[Figure: the relationship tree of $n_1, n_2, \ldots, n_k$, rooted at the lowest common ancestor of $n_1, n_2, \ldots, n_k$, with $n_1, n_2, \ldots, n_k$ below it]

Our “Semantic Relation”: Interconnection
$n_1, \ldots, n_k$ are strongly interconnected if the relationship tree of $n_1, \ldots, n_k$ does not contain two nodes with the same label
$n_1, \ldots, n_k$ are interconnected if either
– they are strongly interconnected, or
– the only nodes with the same label in the relationship tree of $n_1, \ldots, n_k$ are among $n_1, \ldots, n_k$
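The two definitions can be checked directly by building the relationship tree and comparing labels. The sketch below is an assumption-laden illustration, not the authors' code: the tree is encoded with hypothetical `parent` and `label` lists over element nodes only, and the relationship tree is taken to be the set of nodes on the paths from the lowest common ancestor down to the given nodes.

```python
def path_to_root(parent, n):
    """Nodes from n up to the root, n first."""
    path = [n]
    while parent[n] is not None:
        n = parent[n]
        path.append(n)
    return path


def relationship_tree(parent, nodes):
    """Node set of the paths from the lowest common ancestor to each given node."""
    paths = [path_to_root(parent, n) for n in nodes]
    common = set(paths[0]).intersection(*(set(p) for p in paths[1:]))
    lca = next(x for x in paths[0] if x in common)  # deepest common ancestor
    tree = set()
    for p in paths:
        for x in p:
            tree.add(x)
            if x == lca:
                break
    return tree


def strongly_interconnected(parent, label, nodes):
    """No two nodes of the relationship tree share a label."""
    labels = [label[x] for x in relationship_tree(parent, nodes)]
    return len(labels) == len(set(labels))


def interconnected(parent, label, nodes):
    """Label repetitions are allowed only among the given nodes themselves."""
    inner = relationship_tree(parent, nodes) - set(nodes)
    inner_labels = [label[x] for x in inner]
    if len(inner_labels) != len(set(inner_labels)):
        return False
    return not (set(inner_labels) & {label[n] for n in nodes})


# Hypothetical encoding: one proceedings with two inproceedings;
# the second one has two authors.
parent = [None, 0, 1, 1, 0, 4, 4, 4]
label = ["proceedings", "inproceedings", "author", "title",
         "inproceedings", "author", "title", "author"]
```

With this encoding, nodes 5 and 6 (author and title of the same paper) are strongly interconnected, nodes 3 and 5 (title and author of different papers) are not interconnected, and nodes 5 and 7 (two authors of the same paper) are interconnected but not strongly interconnected. Each check walks the tree, which matches the linear cost per pair mentioned in the slides.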

Example (1)
[Figure: document tree (proceedings with two inproceedings: Moshe Y. Vardi, Querying Logical Databases; Victor Vianu, A Web Odyssey: From Codd to XML), with two nodes circled and their relationship tree, rooted at the lowest common ancestor of the circled nodes, highlighted]
Circled nodes belong to different inproceedings entities. They ARE NEITHER strongly interconnected NOR interconnected!

Example (2)
[Figure: document tree (proceedings with two inproceedings: Moshe Y. Vardi, Querying Logical Databases; Victor Vianu, A Web Odyssey: From Codd to XML), with two nodes circled and their relationship tree, rooted at the lowest common ancestor of the circled nodes, highlighted]
Circled nodes belong to the same inproceedings entity. They ARE strongly interconnected, thus, interconnected!

Example (3)
[Figure: document tree (proceedings with two inproceedings: Moshe Y. Vardi, Querying Logical Databases; Victor Vianu and Serge Abiteboul, Queries and Computation on the Web), with two author nodes circled and their relationship tree, rooted at the lowest common ancestor of the circled nodes, highlighted]
Circled nodes belong to the same inproceedings entity, but are labeled with the same tag. They ARE interconnected, BUT NOT strongly interconnected!
We can see the advantage of using interconnection rather than strong interconnection. These two author nodes ARE semantically related.

Interconnection
Based on theoretical results of “Generating relations from XML documents”, S. Cohen, Y. Kanza, Y. Sagiv, ICDT
– Three types of interconnection
We have implemented two types of interconnection
XSEarch can easily accommodate different types of interconnection, or other semantic relations between nodes

Checking Whether Two Nodes Are Interconnected
During query processing, we need to check whether pairs of nodes are interconnected
Given a document T, it is possible to check whether nodes n and n’ are interconnected in O(|T|) time
Too expensive to do it during query processing!

Interconnection Index Is built offline Allows for checking interconnection between two nodes, during query processing, in O(1) time We have two implementations –as a hash table –as a symmetric matrix The Indexer is responsible for building the Interconnection Index
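The two index layouts can be sketched as follows. This is an illustration under assumptions, not the IIH/IIM code from the system: the class names `HashIndex` and `MatrixIndex` are hypothetical, nodes are assumed to be numbered 0..n-1, and the symmetric matrix is stored as a packed lower triangle so that each unordered pair takes one cell.

```python
class HashIndex:
    """Hash-table layout: one entry per unordered pair that has been stored."""

    def __init__(self):
        self._pairs = {}

    def put(self, a, b, value):
        self._pairs[(min(a, b), max(a, b))] = value

    def get(self, a, b):
        """Stored value, or None if the pair was never stored."""
        return self._pairs.get((min(a, b), max(a, b)))


class MatrixIndex:
    """Symmetric-matrix layout, packed as a lower triangle in a flat list."""

    def __init__(self, n):
        self._cells = [None] * (n * (n + 1) // 2)

    def _pos(self, a, b):
        lo, hi = (a, b) if a <= b else (b, a)
        return hi * (hi + 1) // 2 + lo

    def put(self, a, b, value):
        self._cells[self._pos(a, b)] = value

    def get(self, a, b):
        return self._cells[self._pos(a, b)]


h = HashIndex()
h.put(5, 2, True)
m = MatrixIndex(4)
m.put(3, 1, False)
```

Both lookups take constant time; the matrix reserves space for every pair up front, while the hash table grows only with the pairs actually stored.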

Indexer
[Architecture diagram: User Interface, Query Processor, Ranker, Indexer, Index Repository (Indices), XML Files]

Building the Interconnection Index: Naïve Approach
For each pair of nodes, check whether this pair is interconnected
– There are $O(|T|^2)$ pairs
– Checking interconnection is in $O(|T|)$ time
As a result, checking for interconnection of all pairs of nodes in T is in $O(|T|^3)$ time
Too expensive, even if it is done offline!!!

Building the Interconnection Index: Dynamic Programming Approach
Idea: Checking whether two nodes are interconnected can be done by checking interconnection between their parents/children
There are two characterizations of node interconnection
– For child-ancestor nodes
– For non child-ancestor nodes

Interconnection Characterization: n is an ancestor of n’
n and n’ are interconnected if and only if both of the following hold:
– the parent of n’ is strongly interconnected with n
– the child of n on the path to n’ is strongly interconnected with n’
[Figure: the path from n down to n’, marking the child of n and the parent of n’]

Interconnection Characterization: n is not an ancestor of n’
n and n’ are interconnected if and only if both of the following hold:
– the parent of n’ is strongly interconnected with n
– the parent of n is strongly interconnected with n’
[Figure: the paths from the lowest common ancestor (lca) down to n and to n’, marking the parent of n and the parent of n’]

Building the Interconnection Index Using Dynamic Programming
Theorem: Let T be a document. Then it is possible to determine interconnection of all pairs of nodes in T in $O(|T|^2)$ time
Proof hint:
– Derive node numbers in T by a depth-first traversal of T
– Compute the index using dynamic programming, based on the characterizations
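The characterizations can be turned into a small memoized computation. This is a sketch under stated assumptions, not the algorithm from the paper: it uses memoized recursion instead of a bottom-up table, it assumes that strong interconnection obeys the same two sub-pair conditions plus a label inequality between the two nodes, and its ancestor tests walk parent pointers, so it does not reach the quadratic bound of the theorem. The function name and the `parent`/`label` encoding are hypothetical.

```python
from functools import lru_cache
from itertools import combinations


def build_interconnection_index(parent, label):
    """Return {(a, b): bool} for all node pairs a < b (illustrative sketch)."""

    def is_ancestor(a, b):
        """True if a is a proper ancestor of b."""
        b = parent[b]
        while b is not None:
            if b == a:
                return True
            b = parent[b]
        return False

    def child_towards(a, b):
        """The child of a that lies on the path from a down to b."""
        while parent[b] != a:
            b = parent[b]
        return b

    def sub_pairs(a, b):
        """The two smaller pairs named by the characterizations."""
        if is_ancestor(a, b):
            return (a, parent[b]), (child_towards(a, b), b)
        if is_ancestor(b, a):
            return (parent[a], b), (a, child_towards(b, a))
        return (a, parent[b]), (parent[a], b)

    @lru_cache(maxsize=None)
    def strongly(a, b):
        # Assumption: both sub-pairs are strongly interconnected
        # and the two end nodes carry different labels.
        if a == b:
            return True
        p, q = sub_pairs(a, b)
        return label[a] != label[b] and strongly(*p) and strongly(*q)

    def interconnected(a, b):
        # As in the characterizations: both sub-pairs strongly interconnected.
        p, q = sub_pairs(a, b)
        return strongly(*p) and strongly(*q)

    return {(a, b): interconnected(a, b)
            for a, b in combinations(range(len(parent)), 2)}


# Hypothetical encoding of element nodes: one proceedings, two inproceedings,
# the second with two authors.
parent = [None, 0, 1, 1, 0, 4, 4, 4]
label = ["proceedings", "inproceedings", "author", "title",
         "inproceedings", "author", "title", "author"]
index = build_interconnection_index(parent, label)
```

Each recursive step shortens the path between the two nodes by one edge, so every pair is resolved from pairs that are closer together, which is the dependency order a bottom-up table would also follow.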

Query Processing Document fragments are extracted using the interconnection index and other indices Extracted fragments are returned ranked by the estimated relevance

Ranker
[Architecture diagram: User Interface, Query Processor, Ranker, Indexer, Index Repository (Indices), XML Files]

Ranking Factors
Several factors increase the rank of a result
– Similarity between query and result
– Weight of labels appearing in the result
– Characteristics of result tree

Query and Result Similarity
TFILF
– Extension of TFIDF, classical in IR
– Term Frequency: number of occurrences of a query term in a fragment
– Inverse Leaf Frequency: based on the number of leaves containing a query term relative to the number of leaves in the corpus; the rarer the term among the leaves, the higher the value

TFILF
– $tf(k, n_l)$: term frequency of keyword $k$ in a leaf node $n_l$
– $ilf(k)$: inverse leaf frequency of $k$
TFILF is the product of tf and ilf: $tfilf(k, n_l) = tf(k, n_l) \times ilf(k)$
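A possible reading of the tf and ilf product is sketched below. The exact formulas did not survive in this transcript, so the normalizations here are assumptions borrowed from common TFIDF practice: tf is the count of the keyword divided by the count of the most frequent word in the leaf, and ilf is the logarithm of one plus the ratio of all leaves to leaves containing the keyword. The function names are hypothetical.

```python
import math
from collections import Counter


def tf(keyword, leaf_words):
    """Assumed form: occurrences of keyword over the top word count in the leaf."""
    counts = Counter(leaf_words)
    if not counts:
        return 0.0
    return counts[keyword] / max(counts.values())


def ilf(keyword, leaves):
    """Assumed form: log(1 + number of leaves / leaves containing keyword)."""
    containing = sum(1 for words in leaves if keyword in words)
    if containing == 0:
        return 0.0
    return math.log(1 + len(leaves) / containing)


def tfilf(keyword, leaf_words, leaves):
    """Product of tf and ilf, as stated in the slides."""
    return tf(keyword, leaf_words) * ilf(keyword, leaves)


# Leaves taken from the DBLP excerpt, lowercased and tokenized by hand.
leaves = [
    ["querying", "logical", "databases"],
    ["a", "web", "odyssey", "from", "codd", "to", "xml"],
    ["victor", "vianu"],
]
score = tfilf("logical", leaves[0], leaves)
```

Under these assumptions, a keyword that appears in few leaves gets a larger ilf, so a leaf containing a rare query term scores higher than one containing a common term.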

Weight of Labels Some labels are considered more important than others –Text under an element labeled with title is more “important” than text under element labeled with section Label weights can be –system generated –user defined

Relationship between Nodes
Size of the relationship tree: a small fragment indicates that its nodes are closer, and thus, probably, “more related”
Query: article: title:XML
– Fragment with 2 nodes: article, title (XML). This fragment will obtain a higher rank
– Fragment with 3 nodes: article, section, title (XML)

Relationship between Nodes
An ancestor-descendant relationship between a pair of nodes in a fragment indicates a “strong relation” between these nodes
Query: section: title:XML
[Figure: two fragments over article, section and title (XML) nodes; in one of them the section node is an ancestor of the title node]
The fragment in which the section node is an ancestor of the title node will obtain a higher rank

Experimental Results

Hardware and Software Used
– Language: Java
– Processor: 1.6 GHz Pentium 4
– RAM: 2 GB (limited to 1.46 GB by JVM)
– OS: Windows XP

Choosing the Implementation for the Interconnection Index
We have experimented with the two implementations of the interconnection index
1. IIH: the index is a hash table
2. IIM: the index is a symmetric matrix
We compare the two implementations
– Cost of building the index
– Cost of query processing, i.e., using the index

Time For Building Indices
XML corpus   Size (KB)   Number of nodes   IIM time (ms)   IIH time (ms)
Dream        146         3,360             29              36
Hamlet       281         …,635             …               …
Sigmod       704         21,246            1,552           1,729
Mondial      1,198       49,422            6,231           7,837
IIM is better than IIH, because of the additional overhead of hashing
Both implementations are reasonable

On the Fly Indexing (OFI) Fully building the indices as a preprocess of querying is expensive in memory for “huge” corpuses! –Also expensive in time because of the additional overhead of using virtual memory Instead, compute interconnection index incrementally on-the-fly during query processing for each pair that must be checked –By how much will query processing be slowed down?
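The on-the-fly idea can be sketched as a cache that computes a pair only when a query first asks for it. This is an illustration, not the OFI implementation: the class name `OnTheFlyIndex` is hypothetical, and `check` stands for any function that decides interconnection for a pair of node numbers.

```python
class OnTheFlyIndex:
    """Interconnection index filled lazily during query processing (sketch)."""

    def __init__(self, check):
        self._check = check  # function (a, b) -> bool, supplied by the caller
        self._cache = {}

    def interconnected(self, a, b):
        key = (a, b) if a <= b else (b, a)
        if key not in self._cache:
            # First request for this pair: compute and remember the answer.
            self._cache[key] = self._check(*key)
        return self._cache[key]

    def fraction_checked(self, node_count):
        """Share of all unordered node pairs that have been computed so far."""
        total_pairs = node_count * (node_count - 1) // 2
        return len(self._cache) / total_pairs


calls = []


def toy_check(a, b):
    """Stand-in check that records how often it is invoked."""
    calls.append((a, b))
    return (a + b) % 2 == 0


ofi = OnTheFlyIndex(toy_check)
first = ofi.interconnected(4, 2)
second = ofi.interconnected(2, 4)
```

The second lookup is answered from the cache, so repeated queries over the same part of a document pay for each pair only once.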

Time For Building Indices: Comparing IIH, IIM, OFI
XML corpus   Size (KB)   Number of nodes   OFI time (ms)   IIH time (ms)   IIM time (ms)
Dream        146         3,360             …               36              29
Hamlet       281         …,635             …               …               …
Sigmod       704         21,246            …               1,729           1,552
Mondial      1,198       49,422            …               7,837           6,231
For these corpuses, OFI time is less than 10 ms. Actually it is the time to build all the indices other than the interconnection index.

Query Execution Time We generated 1000 random queries for the Sigmod Record corpus Each query had: –At most 3 optional search terms –At most 3 required search terms We checked time with IIH, IIM and OFI

IIH/IIM: Query Processing Time Note: Logarithmic scale Both approaches lead to similar results Average run time for queries: 35 ms

OFI: Query Processing Time
More than 50% of the queries processed in under 10 ms
After processing the 1000 queries, 0.75% of all pairs of nodes were checked for interconnection.
– Space saved in main memory
– Slowdown in response time not too large!
Locality property: queries tend to be similar in the parts of the document that they may access

How Good are the Results? We measured recall & precision for the query: –Find papers written by Buneman that contain the keyword database in the title We tried two different queries that reflect different amounts of user knowledge –Kw: +Buneman +database (classical search engine query) –Tag-kw: +author:Buneman +title:database Corpus: Sigmod, DBLP

Precision and Recall
We computed the "correct answers" using XQuery
Recall: $\text{recall} = \frac{|\text{correct returned answers}|}{|\text{correct answers}|}$
– Perfect recall, i.e., XSEarch returns all the correct answers
Precision at n: $\text{precision at } n = \frac{|\text{correct answers in the first } n \text{ returned answers}|}{n}$
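The two measures are simple enough to state as code. This is a minimal sketch with hypothetical function names and made-up answer identifiers; it assumes the returned answers come as a ranked list and the correct answers as a set.

```python
def recall(returned, correct):
    """Correct returned answers divided by all correct answers."""
    return len(set(returned) & set(correct)) / len(correct)


def precision_at(returned, correct, n):
    """Correct answers among the first n returned answers, divided by n."""
    return sum(1 for answer in returned[:n] if answer in correct) / n


# Made-up identifiers, only to exercise the functions.
returned = ["p1", "p7", "p2", "p9", "p3"]
correct = {"p1", "p2", "p3"}
r = recall(returned, correct)
p5 = precision_at(returned, correct, 5)
```

In this toy run every correct answer is returned, so recall is perfect, while precision at 5 is lower because two of the first five answers are not correct.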

Precision at 5, 10 and 20 Sigmod: Perfect precision DBLP: 0.8/0.9 for query containing only keywords Combining tags and keywords leads to perfect precision

Conclusions Paradigm for querying XML combining IR and database techniques Returns semantically related fragments, ranked by estimated relevance Combining tags and keywords in the query leads to good results

Conclusions Efficient index structures –IIM/IIH for “small” documents –OFI for “big” documents Efficient evaluation algorithms –Dynamic algorithm for computing interconnection Extensible implementation –The system can easily accommodate different types of semantic relations between nodes, other than interconnection

Thank You. Questions?