Published by Maximilian Foster. Modified over 9 years ago.
1
XRANK: Ranked Keyword Search over XML Documents
Ece AKSU, Gökay Burak AKKUŞ
2
This Paper...
Describes the architecture, implementation and evaluation of the XRANK system. The contributions of the paper are:
(a) the problem definition and system architecture
(b) an algorithm for computing the ranking of XML elements
(c) new inverted list index structures and associated query processing algorithms
(d) an experimental evaluation of XRANK
3
Overview
Problem: efficiently producing ranked results for keyword search queries over hierarchical XML documents.
New challenges:
1. Results can be deeply nested XML elements.
2. Ranking is at the granularity of an XML element (not the document).
3. Keyword proximity is more complex.
4
Overview - 2
This paper presents the XRANK system to handle these features of XML keyword search. XRANK offers both space and performance benefits. XRANK generalizes a hyperlink-based HTML search engine such as Google. XRANK can be used to query both HTML and XML documents.
5
Keyword Search Querying - 1
Advantage: keyword search querying is simple. Users do not have to learn a complex query language and can issue queries without any prior knowledge about the structure of the underlying data.
Consequence: the interface is flexible, but queries may not always be precise and can return a large number of query results.
6
Keyword Search Querying - 2
An important requirement for keyword search is to rank the query results so that the most relevant results appear first. Certain limitations of the HTML data model make such systems ineffective in many domains: HTML is a presentation language, and HTML cannot capture much semantics.
7
Keyword Search Querying - 3
The XML data model addresses this limitation by allowing for extensible element tags. (Example: Figure 1.)
9
Querying XML Documents
One approach is a sophisticated query language such as XQuery. It is effective in some cases, but users have to learn a complex query language and understand the schema of the underlying XML.
An alternative approach is XRANK: retain the simple keyword search query interface, and exploit XML's tagged and nested structure during query processing.
10
New Challenges
Keyword searching over XML introduces many new challenges.
1. The result of a keyword search query can be a deeply nested XML element, so the system should return the 'deepest' node.
2. Ranking is not solely based on hyperlinks. The semantics of containment links (relating parent and child elements) is very different from that of hyperlinks (such as IDREFs and XLinks).
11
New Challenges
3. The notion of proximity among keywords is more complex. In HTML, proximity among keywords translates directly to the distance between keywords in a document. For XML there is a two-dimensional proximity metric: keyword distance and ancestor distance.
12
XML Data Model
XML is a hierarchical format for data representation and exchange. An XML document consists of a root element, nested sub-elements, attributes and values. XML also supports intra-document and inter-document references.
13
XML Data Model - 2
Intra-document references are represented using IDREFs. Inter-document references are represented using XLink. Both IDREFs and XLinks are referred to as hyperlinks.
14
Definitions
A collection of hyperlinked XML documents can be defined as a directed graph $G = (N, CE, HE)$, where:
$N$: the set of nodes, $N = NE \cup NV$
$NE$: the set of elements
$NV$: the set of values
$CE$: the set of containment edges relating nodes
$HE$: the set of hyperlink edges relating nodes
15
Definitions - 2
The edge $(u, v) \in CE$ iff $v$ is a value or nested sub-element of $u$.
The edge $(u, v) \in HE$ iff $u$ contains a hyperlink reference to $v$.
An element $u$ is a sub-element of an element $v$ if $(v, u) \in CE$.
An element $u$ is the parent of node $v$ if $(u, v) \in CE$.
The predicate $contains^*(v, k)$ is true if the node $v$ directly or indirectly contains the keyword $k$.
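The definitions above can be sketched as a small in-memory graph. This is only an illustration: the class name `XMLGraph`, its field names, and the whitespace-based keyword matching are assumptions, not part of XRANK.

```python
from dataclasses import dataclass, field


@dataclass
class XMLGraph:
    """Hypothetical container for G = (N, CE, HE)."""
    elements: set = field(default_factory=set)      # NE: element nodes
    values: dict = field(default_factory=dict)      # NV: value node id -> text
    containment: set = field(default_factory=set)   # CE: (parent, child) pairs
    hyperlinks: set = field(default_factory=set)    # HE: (source, target) pairs

    def children(self, u):
        # All v such that (u, v) is a containment edge.
        return [v for (p, v) in self.containment if p == u]

    def contains_star(self, v, k):
        """True if node v directly or indirectly contains keyword k."""
        if v in self.values:
            # Assumption: a value node "contains" k if k is one of its words.
            return k in self.values[v].lower().split()
        return any(self.contains_star(c, k) for c in self.children(v))


g = XMLGraph()
g.elements |= {"workshop", "paper", "title"}
g.values["t1"] = "XML keyword search"
g.containment |= {("workshop", "paper"), ("paper", "title"), ("title", "t1")}
```

Here `g.contains_star("workshop", "xml")` follows containment edges down to the value node `t1`, while hyperlink edges play no role in the predicate.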
16
Keyword Query Results
There are two possible semantics for keyword search queries. Under conjunctive keyword query semantics, elements that contain all of the query keywords are returned. Under disjunctive keyword query semantics, elements that contain at least one of the query keywords are returned. This paper focuses on conjunctive keyword query semantics.
17
Keyword Query Results - 2
Let $Q = \{k_1, \ldots, k_n\}$.
$R_0 = \{v \mid v \in NE \wedge \forall k \in Q\,(contains^*(v, k))\}$ is the set of elements that directly or indirectly contain all of the query keywords.
$Result(Q) = \{v \mid \forall k \in Q\ \exists c \in N\,((v, c) \in CE \wedge c \notin R_0 \wedge contains^*(c, k))\}$
This definition ensures that only the most specific results are returned. It also ensures that an element that has multiple independent occurrences of the query keywords is returned. Containment edges (CE) are considered for the result set; hyperlink edges (HE) are considered for ranking.
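A direct, unoptimized reading of these two set definitions can be sketched as follows. The tree, the node names, and the helper names `r0` and `result` are made up for illustration; an element is excluded when a keyword is only reachable through a child that is itself in $R_0$.

```python
# Hypothetical document: CE as parent -> children, value nodes as id -> text.
children = {
    "workshop": ["paper1", "paper2"],
    "paper1": ["title1", "body1"],
    "title1": ["v1"],
    "body1": ["v2"],
    "paper2": ["title2"],
    "title2": ["v3"],
}
text = {"v1": "xml search", "v2": "ranking xml", "v3": "xml ranking"}
elements = set(children)  # NE (every element in this toy tree has children)


def contains_star(v, k):
    """True if v directly or indirectly contains keyword k."""
    if v in text:
        return k in text[v].split()
    return any(contains_star(c, k) for c in children.get(v, []))


def r0(query):
    """Elements that directly or indirectly contain all query keywords."""
    return {v for v in elements if all(contains_star(v, k) for k in query)}


def result(query):
    """Elements where each keyword is reachable via a child outside R0."""
    base = r0(query)
    return {
        v for v in elements
        if all(
            any(c not in base and contains_star(c, k)
                for c in children.get(v, []))
            for k in query
        )
    }


query = {"xml", "ranking"}
```

For this query, `paper1`, `paper2` and `workshop` are in `r0(query)` but not in `result(query)`, because every occurrence of `ranking` below them sits inside a more specific element (`body1` or `title2`).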
18
Keyword Query Results - 3
Returning XML elements provides more context information, but it also poses interesting user-interface challenges. One solution is to allow the user to navigate up to the ancestors of the query result. Another solution is to predefine a set of "answer nodes" AN, which may require knowledge of the domain and the underlying XML schema. XRANK supports both.
19
Ranking Keyword Query Results
Desired properties of the ranking function:
1) Result specificity: more specific results should be ranked higher than less specific results. This is one dimension of result proximity.
2) Keyword proximity: this is another dimension of result proximity.
3) Hyperlink awareness: the ranking should take the hyperlinked structure of XML documents into account.
20
Ranking Function: Definition
ElemRank is defined at the granularity of an element and takes the nested structure of XML into account. It is similar to Google's PageRank.
Let $Q = (k_1, k_2, \ldots, k_n)$ be a query, $R = Result(Q)$, and $v_1 \in R$ a result element. First define the ranking of $v_1$ with respect to one query keyword $k_i$, $r(v_1, k_i)$, before defining the overall rank, $rank(v_1, Q)$.
21
Ranking with respect to one keyword
There exists a sub-element or value node $v_2$ of $v_1$ such that $v_2 \notin R_0$ and $contains^*(v_2, k_i)$. There is a sequence of containment edges in $CE$ of the form $(v_1, v_2), (v_2, v_3), \ldots, (v_t, v_{t+1})$ such that $v_{t+1}$ is a value node that directly contains the keyword $k_i$.
22
Ranking with respect to one keyword
$r(v_1, k_i)$ does not depend on the ElemRank of the result node $v_1$, except when $v_1 = v_t$, for two reasons:
1. Less specific results should indeed get lower ranks.
2. The rank is in fact related to $ElemRank(v_1)$ due to certain properties of containment edges.
For multiple occurrences of $k_i$ in $v_1$, the per-occurrence ranks are combined with an aggregation function $f$; here $f = \max$.
23
Overall Ranking
The overall ranking is the sum of the ranks with respect to each query keyword, multiplied by a measure of keyword proximity $p(v_1, k_1, k_2, \ldots, k_n)$:
$rank(v_1, Q) = \left(\sum_{1 \le i \le n} r(v_1, k_i)\right) \times p(v_1, k_1, k_2, \ldots, k_n)$
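The pieces of the ranking function can be put together in a short sketch. The slides do not spell out the per-occurrence formula, so the decayed form ElemRank(v_t) × decay^(t-1) used below is an assumption, as are the function names, the default `decay=0.75`, and the idea of passing the proximity measure in as a plain number.

```python
def keyword_rank(elem_rank_vt, t, decay=0.75):
    """Rank of v1 for one keyword occurrence.

    Assumed form: the ElemRank of v_t (the element directly containing the
    keyword), scaled down by `decay` for each level between v1 and v_t.
    """
    return elem_rank_vt * decay ** (t - 1)


def combined_keyword_rank(occurrences, decay=0.75, f=max):
    """Combine several occurrences of one keyword; f = max as on the slide.

    occurrences: list of (ElemRank(v_t), t) pairs.
    """
    return f(keyword_rank(e, t, decay) for e, t in occurrences)


def overall_rank(per_keyword_occurrences, proximity, decay=0.75):
    """Sum of per-keyword ranks multiplied by the keyword proximity measure."""
    total = sum(combined_keyword_rank(occ, decay)
                for occ in per_keyword_occurrences)
    return total * proximity


# Hypothetical numbers: two keywords, the first occurring twice.
score = overall_rank(
    [[(0.8, 3), (0.4, 1)], [(0.6, 2)]],
    proximity=0.5,
    decay=0.5,
)
```

With `decay=0.5` the deeper occurrence of the first keyword is scaled down twice, so the shallower occurrence wins under `f = max`.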
24
XRANK System Architecture
25
XRANK System Architecture - 2
ElemRank Computation Module: computes the ElemRanks of XML elements. The ElemRanks are combined with ancestor information to generate an index structure called HDIL.
Query Evaluator Module: evaluates queries using HDIL and returns ranked results.
26
ElemRank Computation Module
ElemRank is a measure of the objective importance of an XML element and is based on the hyperlinked structure of XML documents.
The PageRank function (with parameter d = 0.85) is the sum of two probabilities: the probability of visiting v at random, and the probability of visiting v by navigating.
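As a reminder of the baseline that ElemRank refines, here is a minimal sketch of the standard PageRank iteration, where (1 - d) / N is the random-visit term and the d-weighted sum is the navigation term. The function name, the toy link graph, and the choice to simply drop rank from nodes without out-links are assumptions made for brevity.

```python
def pagerank(out_links, d=0.85, iterations=100):
    """Power iteration for PageRank on a dict: node -> list of link targets."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Random-visit term: every node gets (1 - d) / N.
        new_rank = {v: (1 - d) / n for v in nodes}
        # Navigation term: each node splits d * rank evenly over its out-links.
        for u in nodes:
            targets = out_links[u]
            if not targets:
                continue  # simplification: dangling nodes pass nothing on
            share = d * rank[u] / len(targets)
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical graph: b and c both link to a, a links back to b.
ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
```

Node `c` has no incoming links, so it ends up with only the random-visit share.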
27
ElemRank Computation Module
PageRank is unidirectional. ElemRank needs propagation in both directions along containment edges:
Forward ElemRank propagation: paper --> section
Reverse ElemRank propagation: paper --> workshop
28
Refinements of PageRank
1. Bi-directional transfer of ElemRanks
2. Discrimination between containment and hyperlink edges
3. Aggregate ElemRanks for reverse containment relationships
29
Bi-directional Transfer of ElemRanks
A simple solution is to add reverse containment edges. However, this does not distinguish between containment and hyperlink edges.
30
Discrimination between containment and hyperlink edges
This refinement treats containment and hyperlink edges separately, but it still weights forward and reverse containment relationships similarly.
31
Aggregate ElemRanks for reverse containment relationships
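The three refinements can be combined into one iteration, sketched below under stated assumptions. The slides do not show the final formula, so the update rule here is an assumed form: separate navigation probabilities `d1` (hyperlinks), `d2` (forward containment, split among children) and `d3` (reverse containment, aggregated rather than split), with the remainder as the random-visit term. The parameter values, the single-document normalization, and all names are illustrative.

```python
def elemrank(elements, ce, he, d1=0.35, d2=0.25, d3=0.25, iterations=100):
    """Assumed ElemRank-style iteration over element nodes of one document.

    ce: list of (parent, child) containment edges between elements.
    he: list of (source, target) hyperlink edges.
    """
    n = len(elements)
    n_h = {u: sum(1 for (a, _) in he if a == u) for u in elements}
    n_c = {u: sum(1 for (a, _) in ce if a == u) for u in elements}
    e = {v: 1.0 / n for v in elements}
    for _ in range(iterations):
        # Random-visit term.
        new_e = {v: (1 - d1 - d2 - d3) / n for v in elements}
        for (u, v) in he:
            new_e[v] += d1 * e[u] / n_h[u]   # hyperlink: split over out-links
        for (u, v) in ce:
            new_e[v] += d2 * e[u] / n_c[u]   # forward: parent -> child, split
            new_e[u] += d3 * e[v]            # reverse: child -> parent, aggregated
        e = new_e
    return e


# Hypothetical document: a workshop with two papers; paper2 cites paper1.
scores = elemrank(
    elements=["workshop", "paper1", "paper2", "sec1"],
    ce=[("workshop", "paper1"), ("workshop", "paper2"), ("paper1", "sec1")],
    he=[("paper2", "paper1")],
)
```

In this toy graph `paper1` outranks `paper2` because it additionally receives rank through the hyperlink and from its own section.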
32
XRANK System: Efficiently Evaluating XML Keyword Search Queries
33
Efficiently Evaluating XML Keyword Search Queries
Approaches covered: the naïve approach, the Dewey Inverted List (DIL), the Ranked Dewey Inverted List (RDIL), and the Hybrid Dewey Inverted List (HDIL).
34
Naïve Approach
The main difference between XML and HTML keyword search is the granularity of query results: XML keyword search returns elements, whereas HTML keyword search returns documents. One way to do XML keyword search is therefore to treat each element as a document.
35
Problems of the Naïve Approach
1. Space overhead
2. Spurious query results
3. Inaccurate ranking of results
36
Space Overhead
An inverted list contains, for each keyword, the list of documents that contain the keyword; for XML documents, it contains the list of elements. This causes a large space overhead, because each inverted list contains (1) the XML element that directly contains the keyword and (2) all of that element's ancestors, redundantly.
37
Spurious Query Results
The naïve approach ignores ancestor-descendant relationships: all elements are treated as independent documents. The results therefore will not correspond to the desired semantics for XML keyword search.
38
Inaccurate Ranking of Results
Existing approaches do not take result specificity into account when ranking results.
39
Dewey Inverted List (DIL)
The naïve approach has drawbacks because it decouples the representation of ancestors and descendants. Dewey encoding of element IDs jointly captures ancestor and descendant information.
41
DIL
An interesting feature: the ID of an ancestor is a prefix of the ID of a descendant. Ancestor-descendant relationships are implicitly captured in the Dewey ID.
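The prefix property makes the ancestor test a simple comparison. A minimal sketch, assuming Dewey IDs are written as dot-separated integers and stored as tuples (the function names and the sample IDs are illustrative):

```python
def parse_dewey(dewey):
    """Turn a dotted Dewey ID such as '5.0.3.0' into a tuple of ints."""
    return tuple(int(part) for part in dewey.split("."))


def is_ancestor(a, b):
    """True if the element with Dewey ID a is a proper ancestor of b."""
    return len(a) < len(b) and b[:len(a)] == a


section = parse_dewey("5.0.3")
paragraph = parse_dewey("5.0.3.0.1")
other = parse_dewey("5.0.4")
```

Comparing components as integers rather than as raw strings avoids treating `5.0.3` as a prefix of `5.0.30`.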
42
DIL Data Structure
The inverted list for a keyword k contains the Dewey IDs of all the XML elements that directly contain the keyword k. To handle multiple documents, the first component of each Dewey ID is the document ID.
43
DIL Data Structure - 2
Besides the Dewey ID, an entry in the DIL contains the ElemRank of the corresponding XML element and the list of all positions where the keyword k appears in that element. Entries are sorted by Dewey ID. The size of the DIL is smaller than that of the naïve approach's index.
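To make the space argument concrete, the sketch below builds both a DIL-style index and a naïve index from the same toy input and lets you compare posting counts. The input triples, the function names, and the use of word offsets within an element's own text as "positions" are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical input: (Dewey ID, ElemRank, text directly contained in the element).
# The first Dewey component is the document ID.
sample = [
    ((5, 0, 3, 0, 0), 0.2, "xql language"),
    ((5, 0, 3, 0, 1), 0.3, "xql ranking"),
    ((6, 0, 1), 0.1, "ranking"),
]


def build_dil(elems):
    """keyword -> list of (Dewey ID, ElemRank, positions), sorted by Dewey ID."""
    dil = defaultdict(list)
    for dewey, rank, text in sorted(elems):  # sorted by Dewey ID
        positions = defaultdict(list)
        for pos, word in enumerate(text.split()):
            positions[word].append(pos)
        for word, pos_list in positions.items():
            dil[word].append((dewey, rank, pos_list))
    return dict(dil)


def build_naive(elems):
    """keyword -> every element containing it, including all ancestors."""
    naive = defaultdict(set)
    for dewey, _rank, text in elems:
        for word in set(text.split()):
            # Prefixes of length >= 2 are elements; length 1 is the document ID.
            for depth in range(2, len(dewey) + 1):
                naive[word].add(dewey[:depth])
    return {word: sorted(ids) for word, ids in naive.items()}


dil = build_dil(sample)
naive = build_naive(sample)
```

For the keyword `xql`, the DIL holds two entries while the naïve index holds five, because the shared ancestors are stored explicitly.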
45
DIL Query Processing
The algorithm works in a single pass over the query keyword inverted lists. The key idea is to merge the query keyword inverted lists and simultaneously compute the longest common prefix of the Dewey IDs in the different lists.
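The two primitives named in the key idea can be sketched as follows. This is deliberately not the full single-pass algorithm (which also has to track per-keyword state for each open ancestor); it only shows merging Dewey-sorted lists and taking a longest common prefix. The list contents and function names are hypothetical.

```python
import heapq


def longest_common_prefix(a, b):
    """Longest common prefix of two Dewey IDs (tuples)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def merged_entries(lists):
    """Merge per-keyword lists, each sorted by Dewey ID, into one sorted
    stream of (Dewey ID, keyword) pairs."""
    streams = ([(dewey, kw) for dewey in ids] for kw, ids in lists.items())
    return list(heapq.merge(*streams))


# Hypothetical inverted lists (Dewey IDs only), already sorted.
lists = {
    "xql": [(5, 0, 3, 0, 0), (5, 0, 3, 0, 1)],
    "ranking": [(5, 0, 3, 0, 1), (6, 0, 1)],
}
stream = merged_entries(lists)
deepest_common = longest_common_prefix((5, 0, 3, 0, 0), (5, 0, 3, 0, 1))
```

Because the merged stream is in Dewey order, entries under the same ancestor arrive next to each other, which is what allows the common prefix to be maintained incrementally in one pass.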
48
Ranked Dewey Inverted List (RDIL)
“If inverted lists are long (due to common keywords or large document collections) even the cost of a single scan of the inverted list can be expensive, especially if the users want only the top few results.”
49
RDIL - 2
One solution: order the inverted lists by ElemRank instead of by Dewey ID, so that higher-ranked results appear first in the inverted list. Query processing can then use the Threshold Algorithm.
50
RDIL Data Structure
RDIL is similar to DIL except that inverted lists are ordered by ElemRank, and each inverted list has a B+-tree index on the Dewey ID field.
51
Gökay Burak AKKUŞ Ece AKSU
52
RDIL Query Processing
Consider an entry retrieved from the inverted list of keyword $k_i$. The entry contains the Dewey ID d of a top-ranked element that directly contains the query keyword $k_i$. To determine a query result, the longest prefix of d that also contains the other query keywords needs to be determined.
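The prefix probe can be sketched with a sorted list and binary search standing in for the B+-tree on the Dewey ID field. This covers only the "longest prefix that also contains the other keywords" step, not the threshold bookkeeping or the check against more specific results; the function names and sample lists are assumptions.

```python
from bisect import bisect_left


def has_entry_under(sorted_ids, prefix):
    """True if some Dewey ID in sorted_ids equals prefix or starts with it.

    All IDs sharing a prefix are contiguous in sorted order, so it is enough
    to look at the first ID that is >= prefix.
    """
    i = bisect_left(sorted_ids, prefix)
    return i < len(sorted_ids) and sorted_ids[i][:len(prefix)] == prefix


def longest_containing_prefix(d, other_lists):
    """Longest prefix of d under which every other keyword also occurs.

    Prefixes shorter than 2 are skipped: the first component is the
    document ID, not an element.
    """
    for length in range(len(d), 1, -1):
        prefix = d[:length]
        if all(has_entry_under(ids, prefix) for ids in other_lists):
            return prefix
    return None


# Hypothetical Dewey-sorted index for the other query keyword.
ranking_ids = [(5, 0, 3, 0, 1), (6, 0, 1)]
candidate = longest_containing_prefix((5, 0, 3, 0, 0), [ranking_ids])
```

Each probe costs one binary search per keyword list, which is why this approach pays off only when a few top-ranked entries suffice.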
54
Hybrid Dewey Inverted List (HDIL)
In many cases RDIL is likely to perform well. However, it may perform worse than DIL when the query keywords are not correlated.
55
HDIL - 2
Uncorrelated keywords: the individual query keywords occur relatively frequently in the document collection but rarely occur together in the same document. Since the number of results is small, RDIL has to scan most (or all) of the inverted lists to produce the output.
Can we combine the benefits of DIL and RDIL without replicating the entire inverted list index?
57
HDIL Query Processing
An adaptive strategy: periodically monitor performance and calculate
t: the time spent so far
r: the number of results above the threshold so far
m: the desired number of query results
Estimated time remaining for RDIL = (m - r) * t / r
If the estimated time is more than the expected time for DIL, then switch to DIL.
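The switching rule is plain arithmetic and can be sketched directly. The function names are illustrative, and treating r = 0 as an unbounded estimate is an assumption; the slides do not say how that case is handled.

```python
def estimated_rdil_remaining(m, r, t):
    """Estimated remaining RDIL time: (m - r) * t / r.

    m: desired number of results, r: results found above the threshold so far,
    t: time spent so far.
    """
    if r == 0:
        return float("inf")  # assumption: no progress yet means no finite estimate
    return (m - r) * t / r


def should_switch_to_dil(m, r, t, expected_dil_time):
    """Switch when RDIL is estimated to take longer than a DIL scan."""
    return estimated_rdil_remaining(m, r, t) > expected_dil_time


# Hypothetical checkpoint: 2 of 10 results found after 4 time units.
remaining = estimated_rdil_remaining(m=10, r=2, t=4.0)
```

At this checkpoint the estimate is 16 time units, so the evaluator would switch if a full DIL pass were expected to take less than that.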
58
Experimental Evaluation
Experimental setup
Quality and ranking function
Space requirements
Query performance, with respect to: (1) the number of query keywords; (2) the correlation between the keywords; (3) the desired number of query results; (4) the selectivity of the keywords.