XRANK: Ranked Keyword Search over XML Documents Presentation by: Meghana Kshirsagar Nitin Gupta Indian Institute of Technology, Bombay Lin Guo Feng Shao.

XRANK: Ranked Keyword Search over XML Documents Presentation by: Meghana Kshirsagar Nitin Gupta Indian Institute of Technology, Bombay Lin Guo Feng Shao Chavdar Botev Jayavel Shanmugasundaram

Outline ● Motivation ● Problem Definition, Query Semantics ● Ranking Function ● A New Datastructure – Dewey Inverted List (DIL) ● Algorithms ● Performance Evaluation

Motivation

Motivation - I ● Why do we need search over XML data? ● Why not use search techniques used on WWW (keyword search on HTML)?

Motivation - II Keyword Search: XML Vs HTML HTML structural ● Links: document-to-document ● Tags: Format specifiers ranking ● Result: Document ● Page-level ranking ● Proximity: ● width: distance between words XML structural ● Links: IDREFs and Xlinks ● Tags: Content specifiers ranking ● Result: XML element (a tree) ● Element-level ranking ● Proximity: ● width ● height

Problem Definition, Query Semantics, and Ranking

Problem Definition ● Input:Set of keywords ● Output:Ranked XML elements What is a result? How to rank results ?

Bird's eye view of the system Query Evaluator Data Structures (DIL) XML doc repository Preprocessing (ElemRank computation) Query Keywords Results

What is a result? ● A minimal Steiner tree of XML elements ● Result-set is a set of XML elements that ● includes a subset of elements containing all query-keywords at least once, after excluding the occurrences of keywords in contained results (if any).

result 1 result 2

Result: Graphical representation containment edge descendant ancestor

Ranking: Which results to return first? Properties: The Ranking function should ● reflect Result Specificity ● consider Keyword-Proximity ● be Hyperlink Aware Ranking function: f (height, width, link-structure)

Less specific result More specific result

Ranking Function r (v 1, k i ) = ElemRank ( v t ). decay t-1 v1v1 vtvt kiki For a single XML element (node):

Ranking Function Combining ranks in case of multiple occurrences: Overall Rank:

Semantics of the ranking function r (v 1, k i ) = ElemRank ( v t ). decay t-1 Specificity (height) Proximity Link structure

ElemRank Computation – adopt PageRank?? ● PageRank ● Short-comings: Fails to capture: ✗ bidirectional transfer of “ElemRanks” ✗ discrimination between edge-types (containment and hyperlink) ✗ doesn't aggregate “ElemRanks” for reverse containment relationships

ElemRank Computation - I ● N e = total # of XML elements ● N h (u) = # hyperlinks from 'u' ● N c (u) = # children of 'u' ● E = HE U CE U CE' ● CE' = reverse containment edges ● Consider Both forward and reverse ElemRank propagation.

ElemRank Computation - II ● Seperate containment and hyperlink edges ● CE = containment edges ● HE = hyperlink edges ● ElemRank (sub elements) α 1 / ( # sibling sub-elements )

ElemRank Computation - III ● Sum over the reverse-containment edges, instead of distributing the weight ● N d (u) = total # XML documents ● N de (v) = # elements in the XML doc containing v ● ElemRank (parent) α Sum (ElemRank(sub-elements))

Datastructures and Algorithms

Naïve Algorithm Approach: ● XML element ~ doc ● Use “keyword search on WWW” Limitations: ● Space overhead (in inverted indices) ● Failure to model Hierarchical relationships (ancestor~decendent). ● Inaccurate Ranking Need a new datastructure which can model hierarchical relationships !! Answer: Dewey Inverted Lists

Labeling nodes using Dewey Ids

Dewey Inverted Lists ● One entry per keyword ● Entry for keyword 'k' has Dewey-IDs of elements directly containing 'k' Simple equi merge-join of Dewey-ID-lists won't work ! Need to compute prefixes.

System Architecture

DIL : Query Processing ● Simple equality merge-join will not work ● Need to find LCP (longest common prefix) over all elements with query keyword-match. ● Single pass over the inverted lists suffices! ● Compute LCP while merging the ILs of individual keywords. ● ILs are sorted on Dewey-IDs

Datastructures ● Array of all inverted lists : invertedList[] ● invertedList[i] for keyword 'i' ● each invertedList[i] is sorted on Dewey-ID ● Heap to maintain top-m results : resultHeap ● Stack to store current Dewey-ID, ranks, position List, longest common prefixes : deweyStack

Algorithm on DILs - Abstract While all inverted-lists are not processed ● Read the next entry from DIL having smallest Dewey-ID ● call this 'currentEntry' ● Find the longest common prefix (lcp) between stack components and entry read from DIL ● lcp (deweyStack, currentEntry) ● Pop non-matching entries from Dewey-stack; Add result to heap if appropriate ● check if current top-of-stack contains all keywords ● if yes, compute OverallRank, put this result onto heap ● else ● non-matching entries are popped one component at a time and update (rank, posList) on each pop ● Push non-matching part of 'currentEntry' to 'deweyStack' ● non-matching components of 'currentEntry.deweyID' are pushed onto stack ● Update components of top entry of deweyStack

Example Query: “XQL Ricardo”

Algorithm Trace – Step 1 Rank[i] = Rank due to keyword 'i' PosList[i] = List of occurrences of keyword 'i' DIL: invertedList[] DeweyStack push all components and find rank, posL Smallest ID: 5.0.3.0.0

Algorithm Trace – Step 2 DIL: invertedList[] DeweyStack find lcp and pop nonmatching components Smallest ID: 5.0.3.0.1

Algorithm Trace – Step 3 DIL: invertedList[] DeweyStack updated rank, posL Smallest ID: 5.0.3.0.1

Algorithm Trace – Step 4 DIL: invertedList[] DeweyStack push non-matching components Smallest ID: 5.0.3.0.1

Algorithm Trace – Step 5 DIL: invertedList[] DeweyStack find lcp, update, finally pop all components Smallest ID: 6.0.3.8.3

Problems with DIL ● Scans the entire inverted-list for all keywords before a result is output ● Very inefficient for top-k computation

Other Techniques - RDIL ● Ranked Dewey Inverted List: ● For efficient top-k result computation ● IL is ordered by ElemRank ● Each IL has a B+ tree index on the Dewey- IDs ● Algorithm with RDIL uses a threshold

Algorithm using RDIL (Abstract) ● Choose the next entry from one of the invertedList[] in a Round- Robin fashion. ● say chosen IL = invertedList[i] ● d = top-ranked Dewey-ID from invertedList[i] ● Find the longest common prefix that contains all query-keywords ● Probe the B+ tree index of all other keyword ILs, for the longest common prefix ● Claim: ● d2 = smallest Dewey-ID in invertedList[j] of query-keyword 'j' ● d3 = immediate predecessor of d2 ● lcp = max_prefix (lcp ( d, d2), lcp ( d, d3)) ● Check if 'lcp' is a complete result ● Recompute 'threshold' = sum (ElemRank of last processed element in each query keyword IL) ● If (rank of top-k results on heap) >= threshold) return;

Performance of RDIL ● Works well for queries with highly correlated keywords ● BUT ! becomes equivalent (actually worse) to DIL for totally uncorrelated keywords ● Need an intermediate technique

HDIL ● Uses both DIL and RDIL ● Adaptive strategy: – Start with RDIL – Switch to DIL if performance is bad ● Performance? – Estimated remaining time for RDIL = (m – r ) * t / r ● t = time spent so far ● r = no. of results above threshold so far ● m = desired no. of results – Estimated remaining time for DIL ? ● No. of query-keywords is known ● Size of each IL is known

HDIL ● Datastructures? – Store full IL sorted on Dewey-ID – Store small fraction of IL sorted on ElemRank – Share the leaf level between IL and B+ tree (in RDIL) – Overhead : top levels of B+ tree

Updating the lists ● Updation is easy ● Insertion – very bad! – techniques from Tatarinov et al. – we've seen a better technique in this course :) – OrdPath

Evaluation ● Criteria: ● no. of query-keywords ● correlation between query-keywords ● desired no. of query results ● selectivity of keywords ● Setup: ● Datasets used: DBLP, Xmark ● d1 = 0.35, d2 = 0.25, d3 = 0.25 ● 2.8GHz Pentium IV + 1GB RAM + 80GB HDD

Performance - 1

Performance - 2

Critique ● New datastructure (DIL) defined to represent hierarchical relationships accurately and efficiently. ● Hyperlinks and IDREFs are considered only while computing ElemRank. Not used while returning results. ● Only containment edges (ancestor-descendant) are considered while computing result trees. ● Works only on trees, can't handle graphs.

The SphereSearch Engine for Unified Banked Retrieval of Heterogenous XML and Web Documents Jens Graupmann Ralf Schenkel Gerhard Weikum Max-Plack-Institut fur Informatik Presentation by: Nitin Gupta Meghana Kshirsagar Indian Institute of Technology Bombay

Why another search engine ? ● To cope with diversity in the structures and annotations of the data ● Ranked retrieval paradigm for producing relevance ordered results lists rather than a mere boolean retrieval. ● Short comings of the current search engines – Concept aware – Context aware (or link-awareness) – Abstraction aware – Query Language

Concept awareness ● Example: researcher max planck yields many results about researchers who work at the institute “Max Plack” Society ● Better formulation would be researcher person=“max planck” ● Objective attained by – Transformation to XML – Data Annotation

Concept awareness :: Transformation Experiments... Text1... Settings... Text2......... Text1...... Text2......

Abstraction Awareness ● Example: Synonyms, Ontologies ● Is connection to various encyclopedias/ Wiki's possible? ● Objective attained by using – Ontology Service: provides quantified ontological information to the system – Preprocessed information based on focused web crawls to estimate statistical correlations between the characteristic words of related concepts

Context Awareness ● Query may not be answered by web search engines as no single web page may be a match ● Unlike usual navigation axes in XML, context should go beyond trees. ● Consider graph structure spanned by Xlink/XPointer references and href hyperlinks ● Objective attained by – introduction of the concept of a SPHERE

Context Awareness :: Sphere ● What is a sphere? – Relevance of an element for a group of query conditions is not just determined by its own content, but also by the content of other neighboring elements, including linked documents, in an environment - called Sphere - of the element.

Query Language ● Query S = (Q, J) consists of – set Q = { G 1.. G q } of query groups – set J = { J 1.. J m } of join conditions ● Each Q i consists of – set of keyword conditions t 1.. t k – set of concept value conditions c 1 = v 1... c l = v l ● Each join has the form Q i.v = (or ~) Q j.w

Query Language ● Example: – P(professor, location=~Germany) – C(course, ~databases) – R(~project, ~XML) – A(gothic, church) – B(romanic, church) – A.location = B.location German professors who teach database courses and have projects on XML Gothic and Romanic churches at the same location

Data Model ● Collection X = (D, L) of XML documents D together with a set L of (href, Xpointer, or Xlink) links between their elements ● Consider all attributes as elements: then element level graph G E (X) = (V E (X), E E (X)) has the union of all the elements of the document as nodes and undirected edges between them ● Each edge has nonnegative weight – 1 for parent-child; ‘λ’ for links ● A distance function δ X (x,y) : computes weight of a shortest path in G E (X) between x and y

Spheres and Query Groups ● Node-score ns(n,t) is computed using Okapi BM25 model ● Similarity condition ~K: Compute exp(K) for the keyword. The node score is defined as max xЄexp(K) sim(K,x) * ns(n,x) where sim(K,x) is the ontological similarity ● Concept value: – ns(n, c=v) = 0if name(n) ≠ c – ns(n,v) otherwise ● Similarity concept value: ~c = v: sim(name(n), c) * ns(n,v) ● This is insufficient – in the presence of linked documents – when content is spread over several elements {

Spheres and Query Groups Sphere S d (n): set of nodes at distance d from node n s d (n,t) = ∑ v Є Sd(n) ns(v,t) s(n,t) = ∑ s i (n,t) * α i s(1,t) = 1 + 4*0.5 + 2*0.25 + 5*0.125 = 4.175 s(2,t) = 3 + 0*0.5 + 0*0.25 + 1*0.125 = 3.125 s(n, G) = ∑ j s(n,t j ) + ∑ j s(n, c j =v j )

Spheres and Query Groups :: Ranking Create a connection graph G(N) = (V(N), E(N)) Weight of an edge between x,y: 0if x and y are not connected 1/ δ x (x,y)+1otherwise Compactness C(N) of a potential answer N is then the sum of the total edge weights of a maximal spanning tree for G(N), and the score is given by: s(N, S) = β C(N) + (1- β ) ∑ i s(n i, G i )

Spheres and Query Groups :: Joins New virtual links to form an extended collection X' = (D, L') – Connect the elements that match the join – Similarity join: For Qi.v ~ Qj.w, consider sets N(v) (resp N(w)) with name v (w) or contain v (w) in their content. For each pair x N(v), y N(w) add a link {x,y} with weight 1/csim(x,y)

System Architecture Content stored in inverted lists with corresponding tf*idf- style term statistics Indexer stores with each element the corresponding Dewer encoding of its position within the document Focused web crawls used to estimate statistical correlations between the characteristic words of related concepts. Current version uses Dice coefficient.

Query Processor 1.First compute a result list for each query group 2.Add virtual links for join conditions 3.Compute the compactness of a subset of all potential answers of the query in order to return the top-k results 1.Compute a list of results for each of query keywords and concept-value conditions. 2.Candidate nodes: Nodes that are at distance at most D from any node that occurs in at least one of the lists. Sphere score is computed only for these nodes since only these can have a non-zero score! 3.For eachl candidate node N, look up the node scores of nodes in the sphere of N, and adding these scores with a proper damping factor.

Query Processor ● Virtual links: Processor considers only a limited set of possible end points for efficient computation ● Nodes in the spheres upto distance D around nodes with nonzero sphere score for any query group – Why? Any other node will have distance atleast D+1 to any results node and thus contributes at most 1/ (D+1)+1 to the compactness, which is negligible – This set of candidate nodes can be computed on the fly ● Set further reduced by testing join attributes, for example A.x = B.y results in two sets of potential end points.

Query Processor ● Generating answers – Naïve method: generate all possible potential answers from the answers to query groups, compute connection graphs and compactness, and finally their score – For top-k answers, use Fagin's Threshold Algorithm with sorted lists only ● Input: Sorted list of node scores and pairwise node scores (edges) ● Output: k potential answers with the best scores

Experiments ● Sun V40z, 16GB RAM, Windows 2003 Server, Tomcat 4.0.6 environment, Oracle 10g database ● Benchmarks: XMach, Xmark, INEX, TREC Designed for XQuery-style exact match Semantically poor tags Does not consider XML at all Wikipedia Collection from the Wikipedia project: HTML Collection transformed into XML and annotated Wikipedia++ Collection: Extension of Wikipedia with IMDB data, with generated XML files for each movie and actor DBLP++ Collection: Based on the DBLP project which indexes more than 480,000 publications INEX: Set of 12,107 XML documents, a set of queries with and without structural constraints

Experiments Conversion from HTML to XML Dataset Statistics

Experiments ● SSE-basic: basic version limited to keyword conditions using sphere- based scoring ● SSE-CV: basic version plus concept-value conditions ● SSE-QC: CV version plus query groups (full contest awareness) ● SSE-Join: full version will all features ● SSE-KW: very restricted version with simple keyword search ● GoogleWiki: Google search restricted to Wikipedia.org ● Google~Wiki: Google on wikipedia.org with Google's ~ operator for query expansion ● GoogleWeb: Google search on the entire web ● Google~Web: Google search on the entire web with query expansion

Experiments Aggregated results for Wikipedia

Experiments Aggregated results for Wikipedia++ and DBLP++

Experiments Graph showing the average runtimes for different versions

Thank you

XRANK: Ranked Keyword Search over XML Documents Presentation by: Meghana Kshirsagar Nitin Gupta Indian Institute of Technology, Bombay Lin Guo Feng Shao.

Similar presentations

Presentation on theme: "XRANK: Ranked Keyword Search over XML Documents Presentation by: Meghana Kshirsagar Nitin Gupta Indian Institute of Technology, Bombay Lin Guo Feng Shao."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

XRANK: Ranked Keyword Search over XML Documents Presentation by: Meghana Kshirsagar Nitin Gupta Indian Institute of Technology, Bombay Lin Guo Feng Shao.

Similar presentations

Presentation on theme: "XRANK: Ranked Keyword Search over XML Documents Presentation by: Meghana Kshirsagar Nitin Gupta Indian Institute of Technology, Bombay Lin Guo Feng Shao."— Presentation transcript:

Similar presentations

About project

Feedback