XSEarch: A Semantic Search Engine for XML
Sara Cohen, Jonathan Mamou, Yaron Kanza, Yehoshua Sagiv
The Hebrew University of Jerusalem
Presented by Deniz Kasap & Sarp Baran Özkan

XSEarch: an XML Search Engine
Goal: Find the “relevant” XML fragments, given tag names and keywords

Introduction
It is becoming increasingly popular to publish data on the Web in the form of XML documents. Current search engines, which are an indispensable tool for finding HTML documents, have two main drawbacks when it comes to searching for XML documents.
–It is not possible to pose queries that explicitly refer to XML tags.
–Search engines return references (i.e., links) to documents and not specific fragments thereof. This is problematic, since large XML documents may contain thousands of elements storing many pieces of information that are not necessarily related to each other.

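Excerpt from the XML Version of DBLP
<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>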

A Search Example Find papers by Vianu on the topic of “logical databases” How can we find such papers?

Attempt 1: Standard Search Engine A document containing some of the three query terms is considered as a result.

This does not work!!!
The document contains the three query terms. Hence, it is returned by a standard search engine.
BUT the document is not relevant to the query: it does not contain any paper on “logical databases” by Vianu.
–Moshe Y. Vardi, Querying Logical Databases: this fragment does not represent a paper by Vianu.
–Victor Vianu, A Web Odyssey: From Codd to XML: this fragment does not represent a paper about logical databases.

Since a reference to a whole XML document is usually not a useful answer, the granularity of the search should be refined. Instead of returning an entire document, an XML search engine should return fragments of XML documents.

A query language for XML, such as XQuery, can be used to extract data from XML documents. However, such a query language is not an alternative to an XML search engine, for several reasons.
–The syntax of XQuery is more complicated than the syntax of a standard search query. Hence, it is not appropriate for a naive user.
–Extensive knowledge of the document structure is required in order to correctly formulate a query. Thus, queries must be formulated on a per-document basis.
–XQuery lacks any mechanism for ranking answers.

Attempt 2: XML Query Language
FOR $i IN document(“bib.xml”)//inproceedings
WHERE $i/author contains ‘Vianu’
  AND $i/title contains ‘Logical’
  AND $i/title contains ‘Databases’
RETURN $i/author $i/title
This does work, BUT
–Complicated syntax
–Extensive knowledge of the document structure required to write the query
–No mechanism for ranking results

Our Requirements from the Search Tool
–A simple syntax that can be used by naive users
–Search results should include XML fragments and not necessarily full documents
–The XML fragments in an answer should be semantically related. For example, a paper and an author should be in an answer only if the paper was written by this author
–Search results should be ranked
–Search results should be returned in “reasonable” time

The design and implementation of XSEarch involved several challenges.
–A syntax that is suitable for a naive user had to be designed.
–The theoretical results were adapted so that XSEarch always returns semantically related document fragments as answers.
–Answers are highly relevant to the keywords of the query.
–A suitable ranking mechanism that takes into account both the degree of the semantic relationship and the relevance of the keywords has been developed.
–Index structures and evaluation algorithms that allow the system to deal efficiently with large documents have been developed.
–The implementation of XSEarch is extensible in the sense that it can easily accommodate different types of semantic relationships.

Query Syntax
The query language of a standard search engine is simply a list of keywords. Keywords with a plus (+) sign must appear in a satisfying document, whereas keywords without a plus sign may or may not appear in a satisfying document (but the appearance of such keywords is desirable).

The query language of XSEarch is a simple extension of the language described above. In addition to keywords, the user can specify labels and keyword-label combinations that must or may appear in a satisfying document. A search term may have a plus sign prepended, in which case it is a required term. Otherwise, it is an optional term. We use t, t_1, t_2, etc., as an abstract notation for required and optional terms. A query has the form Q(S), where S = t_1, ..., t_m is a sequence of required and optional search terms.

Formally, a search term has one of the forms l:k, l: or :k, where l is a label and k is a keyword.
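The term syntax above can be illustrated with a small parser. This is only a sketch: the names SearchTerm, parse_term and parse_query are hypothetical, and it assumes that terms are separated by whitespace and that a bare keyword k (as in the example query on the next slide) is shorthand for :k.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SearchTerm:
    label: Optional[str]    # l, or None for a term of the form :k
    keyword: Optional[str]  # k, or None for a term of the form l:
    required: bool          # True when the term was prefixed with '+'


def parse_term(token: str) -> SearchTerm:
    """Parse one token of the form [+]l:k, [+]l:, [+]:k or [+]k."""
    required = token.startswith("+")
    body = token[1:] if required else token
    if ":" in body:
        label, keyword = body.split(":", 1)
    else:
        # Assumption: a bare keyword k is treated like :k.
        label, keyword = "", body
    if not label and not keyword:
        raise ValueError(f"empty search term: {token!r}")
    return SearchTerm(label or None, keyword or None, required)


def parse_query(text: str) -> List[SearchTerm]:
    """Split a query string on whitespace and parse every term."""
    return [parse_term(token) for token in text.split()]
```

For instance, parse_query("logical +database inproceedings: author:Vianu") yields four terms, of which only the second is required.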

Example
Find papers by Vianu on the topic of “logical databases”
Query: logical +database inproceedings: author:Vianu
–logical: appearance of logical in the fragment increases the rank of this fragment
–+database: the keyword database must appear in the fragment
–inproceedings: appearance of the tag inproceedings in the fragment increases the rank of this fragment
–author:Vianu: appearance of Vianu under the tag author in the fragment increases the rank of this fragment
Note that the different document fragments matching these query terms must be “semantically related”

Query Semantics This section presents the semantics of our queries. In order to satisfy a query Q, each of the required terms in Q must be satisfied. In addition, the elements satisfying Q must be meaningfully related.

XSEarch query: author:Vianu title:
Document: Moshe Y. Vardi, Querying Logical Databases; Victor Vianu, A Web Odyssey: From Codd to XML
Result: title “A Web Odyssey: From Codd to XML”, author “Victor Vianu”
Good Result! title and author elements ARE semantically related

XSEarch query: author:Vianu title:
Document: Moshe Y. Vardi, Querying Logical Databases; Victor Vianu, A Web Odyssey: From Codd to XML
Result: title “Querying Logical Databases”, author “Victor Vianu”
Bad Result! title and author elements ARE NOT semantically related

Satisfaction of a Search Term
XML documents are modeled as trees in the standard fashion.
–Each interior node is associated with a label and each leaf node is associated with a sequence of keywords.
–If k is a keyword in the sequence associated with a leaf node n, we say that n contains k.
In Figure 1 there is a tree that represents a small portion of the Sigmod Record. We will refer to this tree as T_sr.

Let n be an interior node in a tree T. We say that n satisfies the search term
–l:k if n is labeled with l and has a descendant that contains the keyword k;
–l: if n is labeled with l;
–:k if n has a leaf child that contains the keyword k.
Example: in the tree T_sr, node number 14 satisfies :Kempster and node number 9 satisfies authors:Kempster. Node 9 does not satisfy :Kempster, position: or :position.
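The three satisfaction rules can be sketched in code as follows. The Node class and the tiny tree at the end are hypothetical (Figure 1 is not reproduced in this transcript); the tree only mimics the shape of the Kempster example, with an authors node above an author node above a leaf.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    label: Optional[str] = None                        # label of an interior node
    keywords: List[str] = field(default_factory=list)  # keywords of a leaf node
    children: List["Node"] = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        # Assumption of this sketch: leaves are exactly the unlabeled nodes.
        return self.label is None


def descendants(n: Node) -> Iterator[Node]:
    """Yield every proper descendant of n."""
    for child in n.children:
        yield child
        yield from descendants(child)


def satisfies(n: Node, label: Optional[str], keyword: Optional[str]) -> bool:
    """Check l:k, l: or :k against interior node n (None marks the absent part)."""
    if n.is_leaf:
        return False
    if label is not None and n.label != label:
        return False
    if keyword is None:
        return True  # l:
    if label is None:
        # :k requires a leaf child that contains k
        return any(c.is_leaf and keyword in c.keywords for c in n.children)
    # l:k allows any descendant that contains k
    return any(keyword in d.keywords for d in descendants(n))


# Hypothetical fragment: authors -> author -> leaf("Kempster")
leaf = Node(keywords=["Kempster"])
author = Node("author", children=[leaf])
authors = Node("authors", children=[author])
```

Here satisfies(author, None, "Kempster") and satisfies(authors, "authors", "Kempster") hold, while satisfies(authors, None, "Kempster") does not, because the leaf is not a child of the authors node.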

Meaningfully Related Sets of Nodes
Let T be a tree and R be a binary, reflexive and symmetric relationship on the nodes in T. We assume that R contains pairs of nodes that are meaningfully related. We present two different ways to extend R to arbitrary sets of nodes.

–A set of nodes N is all-pairs R-related if (n_1, n_2) is in R for every pair of nodes n_1, n_2 ∈ N. This states that a set of nodes is meaningfully related if every pair of nodes in the set is meaningfully related.
–N is star R-related if there is a node n* ∈ N such that the pair (n*, n) is in R for all nodes n ∈ N. This states that the nodes of a set are meaningfully related if all these nodes are meaningfully related to a node in the set.
Depending on the structure of the documents, either the all-pairs relationship or the star relationship may be more appropriate.
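The difference between the two extensions is easy to see in code. The sketch below assumes R is given as a Python predicate that is already reflexive and symmetric; the function names and the toy relation are illustrative only.

```python
from itertools import combinations
from typing import Callable, Hashable, Iterable

Relation = Callable[[Hashable, Hashable], bool]


def all_pairs_related(nodes: Iterable[Hashable], R: Relation) -> bool:
    """Every pair of nodes in the set must be in R."""
    return all(R(a, b) for a, b in combinations(list(nodes), 2))


def star_related(nodes: Iterable[Hashable], R: Relation) -> bool:
    """Some node n* of the set must be related to every node of the set."""
    nodes = list(nodes)
    return any(all(R(center, n) for n in nodes) for center in nodes)


# Toy relation: 1 is related to 2 and to 3, but 2 and 3 are unrelated.
_pairs = {frozenset((1, 2)), frozenset((1, 3))}


def toy_relation(a: Hashable, b: Hashable) -> bool:
    return a == b or frozenset((a, b)) in _pairs
```

With this toy relation, the set {1, 2, 3} is star R-related (with n* = 1) but not all-pairs R-related.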

Query Answers
Let Q(t_1, …, t_m) be a query. A sequence N = n_1, …, n_m of nodes and null values is an all-pairs R-answer for Q if the nodes in N are all-pairs R-related and, for all 1 ≤ i ≤ m:
–n_i is not the null value if t_i is a required term;
–n_i satisfies t_i if it is not the null value.
Similarly, N is a star R-answer when the nodes in N are star R-related.

We use
–Ans_{a,R}(Q) to denote the set of all-pairs R-answers for the query Q over a tree T;
–Ans_{s,R}(Q) to denote the set of star R-answers for Q over T;
–MaxAns_{a,R}(Q) to denote the set of maximal answers in Ans_{a,R}(Q).

The Interconnection Relationship
We present a relation which can be used to determine whether a pair of nodes is meaningfully related. Let T be a tree and let n_1 and n_2 be nodes in T. The shortest undirected path between n_1 and n_2 consists of the paths from the lowest common ancestor of n_1 and n_2 to n_1 and to n_2.

We denote the tree consisting of these two paths as T|n_1,n_2. This tree describes the relationship between the nodes n_1 and n_2. For example, in T_sr, the tree T|8,13 consists of the nodes 7, 8, 9, 12 and 13.
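A minimal sketch of how T|n_1,n_2 can be computed from parent pointers is shown below. The parent map is a hypothetical fragment chosen to be consistent with the T|8,13 example (Figure 1 is not part of this transcript, and node 1 is an invented root); the function names are likewise illustrative.

```python
from typing import Dict, List, Optional, Set


def path_to_root(node: int, parent: Dict[int, Optional[int]]) -> List[int]:
    """Return the nodes on the path from node up to the root, node first."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def relationship_tree(n1: int, n2: int, parent: Dict[int, Optional[int]]) -> Set[int]:
    """Nodes on the two paths from the lowest common ancestor to n1 and to n2."""
    up1 = path_to_root(n1, parent)
    up2 = path_to_root(n2, parent)
    ancestors_of_n2 = set(up2)
    # Walking up from n1, the first node that is also above n2 is the LCA.
    lca = next(n for n in up1 if n in ancestors_of_n2)
    return set(up1[: up1.index(lca) + 1]) | set(up2[: up2.index(lca) + 1])


# Hypothetical fragment: node 7 has children 8 and 9, 12 is below 9, 13 below 12.
parent = {1: None, 7: 1, 8: 7, 9: 7, 12: 9, 13: 12}
```

Under these assumptions, relationship_tree(8, 13, parent) returns the node set {7, 8, 9, 12, 13}, and the invented root 1 is excluded because it lies above the lowest common ancestor.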

Relationship Trees
[Figure: the relationship tree of n_1, n_2, …, n_k, rooted at the lowest common ancestor of n_1, n_2, …, n_k]

Our “Semantic Relation”: Interconnection
n_1, ..., n_k are interconnected if either
–the relationship tree of n_1, ..., n_k does not contain two nodes with the same label, or
–the only nodes with the same label in the relationship tree of n_1, ..., n_k are among n_1, ..., n_k
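The definition can be checked directly by building the relationship tree and grouping its nodes by label. The following is a naive sketch, not the indexed algorithm described later; the node ids and the parent and label maps are hypothetical and model the DBLP excerpt with interior nodes only (a third author node is added for the same-label case).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set


def path_to_root(node: int, parent: Dict[int, Optional[int]]) -> List[int]:
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def relationship_tree(nodes: Iterable[int], parent: Dict[int, Optional[int]]) -> Set[int]:
    """Union of the paths from the lowest common ancestor to each given node."""
    paths = [path_to_root(n, parent) for n in nodes]
    common = set(paths[0]).intersection(*map(set, paths[1:]))
    lca = next(n for n in paths[0] if n in common)  # lowest common ancestor
    tree: Set[int] = set()
    for p in paths:
        tree.update(p[: p.index(lca) + 1])
    return tree


def interconnected(nodes: Iterable[int], parent: Dict[int, Optional[int]],
                   label: Dict[int, str]) -> bool:
    """Labels may repeat in the relationship tree only among the given nodes."""
    given = set(nodes)
    by_label = defaultdict(set)
    for n in relationship_tree(list(given), parent):
        by_label[label[n]].add(n)
    return all(len(group) == 1 or group <= given for group in by_label.values())


# Hypothetical ids: 0 proceedings; 1 and 4 inproceedings; 2, 5, 7 author; 3, 6 title.
parent = {0: None, 1: 0, 2: 1, 3: 1, 4: 0, 5: 4, 6: 4, 7: 4}
label = {0: "proceedings", 1: "inproceedings", 2: "author", 3: "title",
         4: "inproceedings", 5: "author", 6: "title", 7: "author"}
```

With these maps, a title and an author from different inproceedings elements (3 and 5) are not interconnected, a title and an author of the same paper (6 and 5) are, and so are two authors of the same paper (5 and 7).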

Example (1)
[Figure: a proceedings tree with two inproceedings elements, one with author Moshe Y. Vardi and title Querying Logical Databases, the other with author Victor Vianu and title A Web Odyssey: From Codd to XML. The relationship tree is rooted at the lowest common ancestor of the circled nodes.]
Circled nodes belong to different inproceedings entities. They ARE NOT interconnected!

Example (2)
[Figure: the same proceedings tree, with the relationship tree rooted at the lowest common ancestor of the circled nodes.]
Circled nodes belong to the same inproceedings entity. They ARE interconnected!

Example (3)
[Figure: a proceedings tree with two inproceedings elements, one with author Moshe Y. Vardi and title Querying Logical Databases, the other with authors Victor Vianu and Serge Abiteboul and title Queries and Computation on the Web. The relationship tree is rooted at the lowest common ancestor of the circled nodes.]
Circled nodes belong to the same inproceedings entity, but are labeled with the same tag. They ARE interconnected.

Example 1 of Query Semantics
Consider the query Q1 defined as Q1(+title:, author:).
–The query Q1 finds pairs of titles and authors belonging to the same article.
–Only tuples where the title is non-null will be returned.
–The answers created for T_sr are (8,10), (8,12), (8,14), (17,18) and (25, null).

Example 2 of Query Semantics
The answers for Q1 over this document would consist of (6,3) and (6,4).

Query Processing Document fragments are extracted using the interconnection index and other indices Extracted fragments are returned ranked by the estimated relevance

[Figure: system architecture, showing the User, Query Processor, Ranker, Index Repository, Indexer, Indices and XML Files]

Ranking Factors
Several factors increase the rank of a result:
–Similarity between query and result
–Weight of labels appearing in the result
–Characteristics of the result tree

Query and Result Similarity
TFILF
–Extension of TFIDF, classical in IR
–Term Frequency: number of occurrences of a query term in a fragment
–Inverse Leaf Frequency: inversely related to the fraction of leaves that contain the query term (number of leaves containing the term divided by the number of leaves in the corpus), so that rarer terms weigh more

TFILF
–tf(k, n_l): term frequency of keyword k in a leaf node n_l
–ilf(k): inverse leaf frequency of keyword k
–TFILF is the product of tf and ilf: tfilf(k, n_l) = tf(k, n_l) × ilf(k)
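The exact tf and ilf formulas from the slide did not survive in this transcript, so the sketch below fills them in with common TFIDF-style choices, which are assumptions: tf is the occurrence count normalized by the most frequent keyword in the leaf, and ilf is the logarithm of one plus the inverse fraction of leaves containing the keyword. Only the final product is taken from the slide.

```python
import math
from collections import Counter
from typing import List, Sequence


def tf(keyword: str, leaf: Sequence[str]) -> float:
    """Assumed form: occurrences of keyword / occurrences of the most frequent keyword."""
    counts = Counter(leaf)
    if not counts:
        return 0.0
    return counts[keyword] / max(counts.values())


def ilf(keyword: str, leaves: Sequence[Sequence[str]]) -> float:
    """Assumed form: log(1 + number of leaves / number of leaves containing keyword)."""
    containing = sum(1 for leaf in leaves if keyword in leaf)
    if containing == 0:
        return 0.0
    return math.log(1 + len(leaves) / containing)


def tfilf(keyword: str, leaf: Sequence[str], leaves: Sequence[Sequence[str]]) -> float:
    """TFILF is the product of tf and ilf, as stated on the slide."""
    return tf(keyword, leaf) * ilf(keyword, leaves)


# Toy corpus of three leaves, each a sequence of keywords.
leaves: List[List[str]] = [
    ["querying", "logical", "databases"],
    ["a", "web", "odyssey", "from", "codd", "to", "xml"],
    ["xml", "xml", "search"],
]
```

In this toy corpus, "logical" occurs in one leaf and "xml" in two, so ilf("logical", leaves) is larger than ilf("xml", leaves).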

Weight of Labels
Some labels are considered more important than others
–Text under an element labeled with title is more “important” than text under an element labeled with section
Label weights can be
–system generated
–user defined

Relationship between Nodes
Size of the relationship tree: a small fragment indicates that its nodes are closer and thus, probably, “more related”.
Query: article: title:XML
–Fragment with 2 nodes: article with a child title containing XML. This fragment will obtain a higher rank.
–Fragment with 3 nodes: article with a child section, which has a child title containing XML.

Relationship between Nodes
An ancestor-descendant relationship between a pair of nodes in a fragment indicates a “strong relation” between these nodes.
Query: section: title:XML
[Figure: two fragments under an article element that match the query. In one of them the section node is an ancestor of the title node containing XML; this fragment will obtain a higher rank.]

Combining the Factors
Given a query Q and an answer N, we use the measures sim(Q,N), tsize(N) and anc-des(N) to determine the ranking of the answer. We experimented with the following combination of factors by varying the values of α, β and γ:
$\dfrac{\mathrm{sim}(Q,N)^{\alpha}}{\mathrm{tsize}(N)^{\beta}} \times \bigl(1 + \gamma \times \text{anc-des}(N)\bigr)$
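As a small worked illustration of the combined formula, the function below evaluates it for given measure values. The function name and the default values of alpha, beta and gamma are placeholders; the slides do not state which values were used, and the three measures are simply passed in as numbers.

```python
def rank_score(sim: float, tsize: float, anc_des: float,
               alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """Evaluate sim^alpha / tsize^beta * (1 + gamma * anc_des)."""
    if tsize <= 0:
        raise ValueError("tsize must be positive")
    return (sim ** alpha) / (tsize ** beta) * (1 + gamma * anc_des)
```

With the placeholder weights, an answer with sim = 0.5 and tsize = 2 scores 0.25 when anc_des is 0 and 0.5 when anc_des is 1, and a larger relationship tree (tsize = 3) lowers the score, matching the two "Relationship between Nodes" slides.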

System Implementation The architecture of the XSEarch system is depicted in the following figure:

[Figure: XSEarch architecture, showing the User Interface, Query Processor, Ranker, Index Repository, Indexer, Indices and XML Files]

The basic flow of information is as follows:
–The user enters a query using a browser.
–The Search-Query Processor parses the query into a list of search terms.
–The Index Repository is used to find nodes that satisfy the search terms and to find whether pairs of nodes are interconnected. It responds by checking the stored indices. If these indices do not contain sufficient information, the Indexer is used to augment the current indices. Once the relevant information is returned to the Search-Query Processor, it creates the answers, which are ranked, sorted and then returned.
–The Indexer creates several different indices in the Index Repository based on a set of XML documents.

We focus on the most important and novel index structures:
–The interconnection index
–The path index
The interconnection index allows for rapid checking of the interconnection relationship. The path index allows us to create answers with a higher estimated ranking first.

Dynamic Offline Interconnection Indexing
Checking for interconnection of nodes online is expensive.
–Hence, we first create a node-interconnection index that stores information about the interconnection relationship between each pair of nodes.
–This requires solving the following problem: given a document T, for all pairs of nodes n and n’ in T, determine whether n and n’ are interconnected.
–The algorithm that solves this problem is based on the following lemma:

Lemma (Interconnection Characterization)
Let T be a document and let n and n’ be nodes in T.
–If n is an ancestor of n’, then n and n’ are interconnected if and only if the following hold:
  (1) the parent of n’ is strongly-interconnected with n;
  (2) the child of n on the path to n’ is strongly-interconnected with n’.
–If n is not an ancestor of n’ and n’ is not an ancestor of n, then n and n’ are interconnected if and only if the following hold:
  (1) the parent of n’ is strongly-interconnected with n;
  (2) the parent of n is strongly-interconnected with n’.

In the XSEarch system, we have explored the possibilities of storing the node-interconnection index in either a hash table or a symmetric matrix. When implemented as a hash table, the node-interconnection index contains pairs of ids of interconnected nodes. When implemented as a symmetric matrix, the node-interconnection index contains a boolean value for each pair of nodes, indicating whether they are interconnected or not. A comparison of the time and space efficiency of these structures is given in the experimental results.
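The two storage options can be sketched as follows. The class names are hypothetical and XSEarch itself was implemented in Java; this sketch also assumes nodes are identified by dense integer ids 0..n-1 and stores only the lower triangle of the symmetric matrix, which is a layout choice of the sketch rather than something the slides specify.

```python
class HashIndex:
    """Stores only the pairs of ids of interconnected nodes."""

    def __init__(self) -> None:
        self._pairs = set()

    def add(self, a: int, b: int) -> None:
        self._pairs.add((min(a, b), max(a, b)))

    def interconnected(self, a: int, b: int) -> bool:
        return (min(a, b), max(a, b)) in self._pairs


class MatrixIndex:
    """Stores one boolean per unordered pair, as the lower triangle of a matrix."""

    def __init__(self, n: int) -> None:
        self._cells = bytearray(n * (n + 1) // 2)  # all entries start as False

    @staticmethod
    def _pos(a: int, b: int) -> int:
        lo, hi = min(a, b), max(a, b)
        return hi * (hi + 1) // 2 + lo  # row hi starts after hi*(hi+1)/2 cells

    def add(self, a: int, b: int) -> None:
        self._cells[self._pos(a, b)] = 1

    def interconnected(self, a: int, b: int) -> bool:
        return bool(self._cells[self._pos(a, b)])
```

The hash variant uses memory proportional to the number of interconnected pairs, whereas the matrix variant always uses memory quadratic in the number of nodes but answers a lookup with a single array access.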

Dynamic Online Interconnection Indexing Offline computation of the node-interconnection index may be expensive. In order to amortize the cost of computing this index over the queries received, we have also considered an online indexing method. When indexing online, for each pair of nodes n and n’, we compute the section of the node interconnection index corresponding to T|n,n’

We use a hash table to store the part of the index that has already been computed at any given moment. The hash table contains a boolean value for each pair of nodes whose interconnection has already been checked. The boolean value indicates whether the nodes are interconnected or not.
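The online scheme is essentially memoization of the interconnection check. The sketch below is simplified: it caches only the pair that was asked for, whereas the slides describe computing the whole section of the index corresponding to T|n,n’. The class name and the check callback (any function that decides interconnection for a pair of node ids) are hypothetical.

```python
from typing import Callable, Dict, Tuple


class OnTheFlyIndex:
    """Lazily filled hash table of interconnection results."""

    def __init__(self, check: Callable[[int, int], bool]) -> None:
        self._check = check  # the expensive interconnection test
        self._cache: Dict[Tuple[int, int], bool] = {}

    def interconnected(self, a: int, b: int) -> bool:
        key = (a, b) if a <= b else (b, a)  # the relation is symmetric
        if key not in self._cache:
            self._cache[key] = self._check(*key)
        return self._cache[key]

    def __len__(self) -> int:
        """Number of pairs whose interconnection has been checked so far."""
        return len(self._cache)
```

Because the cache stores both positive and negative results, a repeated question about the same pair never triggers the expensive check again, regardless of the order in which the two nodes are given.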

During query processing, usually only a small part of the node-interconnection index will be created, thus the slowdown in response time is not large. In addition, queries tend to be similar in the parts of the document that they must access. Therefore, even after many queries have been evaluated, it is likely for the node-interconnection index to be only partially computed. This speeds up execution time when loading the index into main memory.

Experimental Results

Hardware and Software Used
–Language: Java
–Processor: 1.6 GHz Pentium 4
–RAM: 2 GB (limited to 1.46 GB by JVM)
–OS: Windows XP

Interconnection Index Is built offline Allows for checking interconnection between two nodes, during query processing, in O(1) time We have two implementations –as a hash table –as a symmetric matrix The Indexer is responsible for building the Interconnection Index

Choosing the Implementation for the Interconnection Index
We have experimented with the two implementations of the interconnection index:
1. IIH: the index is a hash table
2. IIM: the index is a symmetric matrix
We compare the two implementations on
–Cost of building the index
–Cost of query processing, i.e., using the index

Time For Building Indices
XML corpus   Size (KB)   Number of nodes   IIM time (ms)   IIH time (ms)
Dream        146         3,360             29              36
Hamlet       281         6,635             114             185
Sigmod       704         21,246            1,552           1,729
Mondial      1,198       49,422            6,231           7,837
IIM is better than IIH, because of the additional overhead of hashing. Both implementations are reasonable.

On the Fly Indexing (OFI) Fully building the indices as a preprocess of querying is expensive in memory for “huge” corpuses! –Also expensive in time because of the additional overhead of using virtual memory Instead, compute interconnection index incrementally on-the-fly during query processing for each pair that must be checked –By how much will query processing be slowed down?

Time For Building Indices: Comparing IIH, IIM, OFI
XML corpus   Size (KB)   Number of nodes   OFI time (ms)   IIH time (ms)   IIM time (ms)
Dream        146         3,360             0.6             36              29
Hamlet       281         6,635             1.1             185             114
Sigmod       704         21,246            2.2             1,729           1,552
Mondial      1,198       49,422            10.0            7,837           6,231
For these corpuses, OFI time is at most 10 ms. Actually, it is the time to build all the indices other than the interconnection index.

Query Execution Time We generated 1000 random queries for the Sigmod Record corpus Each query had: –At most 3 optional search terms –At most 3 required search terms We checked time with IIH, IIM and OFI

IIH/IIM: Query Processing Time
[Plot of query processing times for IIH and IIM. Note: logarithmic scale.]
Both approaches lead to similar results. Average run time for queries: 35 ms.

OFI: Query Processing Time
–After processing the 1000 queries, 0.75% of all pairs of nodes were checked for interconnection.
–Space saved in main memory.
–Slowdown in response time not too large! More than 50% of the queries were processed in under 10 ms.
–Locality property: queries tend to be similar in the parts of the document that they may access.

How Good are the Results? We measured recall & precision for the query: –Find papers written by Buneman that contain the keyword database in the title We tried two different queries that reflect different amounts of user knowledge –Kw: +Buneman +database (classical search engine query) –Tag-kw: +author:Buneman +title:database Corpus: Sigmod, DBLP

Precision and Recall
We computed the "correct answers" using XQuery.
Recall = (correct returned answers) / (correct answers)
Perfect recall, i.e., XSEarch returns all the correct answers.
Precision at n = (correct answers in the first n returned answers) / n
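The two measures can be computed as in the sketch below, where returned is the ranked list of answers produced by the engine and correct is the set of correct answers. The function names and the toy data in the usage note are illustrative and are not the experiment's data.

```python
from typing import Hashable, Iterable, Sequence


def recall(returned: Iterable[Hashable], correct: Iterable[Hashable]) -> float:
    """Correct returned answers divided by all correct answers."""
    correct_set = set(correct)
    if not correct_set:
        raise ValueError("the set of correct answers is empty")
    return len(correct_set & set(returned)) / len(correct_set)


def precision_at(returned: Sequence[Hashable], correct: Iterable[Hashable], n: int) -> float:
    """Correct answers among the first n returned answers, divided by n."""
    if n <= 0:
        raise ValueError("n must be positive")
    correct_set = set(correct)
    return sum(1 for answer in returned[:n] if answer in correct_set) / n
```

For a ranked list ["a", "x", "b", "c"] with correct answers {"a", "b", "c"}, recall is 1.0, precision at 2 is 0.5 and precision at 4 is 0.75.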

Precision at 5, 10 and 20 Sigmod: Perfect precision DBLP: 0.8/0.9 for query containing only keywords Combining tags and keywords leads to perfect precision

Related Work
Numerous query languages for XML have been developed.
–For example, the XQuery working group is considering how to add full-text search features and ranking to XQuery. Such capabilities have already been added to various XML query languages. But these languages are not suitable for a naive user, since the query syntax is always complex.
–A recent related work is the XRANK system for keyword searching in XML documents.

Conclusions
The main contribution of this paper is in laying the foundations for a semantic search engine over XML documents. XSEarch returns semantically related fragments, ranked by estimated relevance. This system is extensible and can easily accommodate different types of relationships between nodes. We have shown that it is possible to combine these qualities with an efficient, scalable and modular system. Thus, XSEarch can be seen as a general framework for semantic searching in XML documents.

Efficient index structures –IIM/IIH for “small” documents –OFI for “big” documents Efficient evaluation algorithms –Dynamic algorithm for computing interconnection Extensible implementation –The system can easily accommodate different types of semantic relations between nodes, other than interconnection

Thank You. Questions?
