CIKM
Finding and Approximating Top-k Answers in Keyword Proximity Search
PODS'06
Benny Kimelfeld and Yehoshua Sagiv
The Selim and Rachel Benin School of Engineering and Computer Science
The Hebrew University of Jerusalem

Keyword Proximity Search (KPS)
A paradigm for data extraction
Data have varying degrees of structure
–Relational databases, XML, Web sites
Queries are sets of keywords
–No structural constraints
The Goal: Extract meaningful parts of data w.r.t. the keywords

Querying Structure & Content by Keywords
Keywords appear in different parts of the data
Answers show occurrences of keywords, as well as the associations among these occurrences
Proximity of the keywords in the answer indicates a close (strong) semantic association among them
[Illustration: a search box containing the keywords "Vardi Databases"]

Past Work on KPS (Keyword Proximity Search)
DataSpot (SIGMOD 1998)
Information Units (WWW 2001)
BANKS (ICDE 2002, VLDB 2005)
DISCOVER (VLDB 2002)
DBXplorer (ICDE 2002)
XKeyword (ICDE 2003)
…

The Goal of this Paper
Devise efficient algorithms for finding high-quality answers in keyword proximity search

Contents
Introduction
Formal Setting
The Main Results
Enumerating in the Exact Order
Enumerating in an Approximate Order
Conclusion and Future Work

Formal Setting

Data Graphs
Structural and keyword nodes
Edges may have weights
–Weak relationships are penalized by high weights

Queries
Q = {Summers, Cohen, coffee}
Queries are sets of keywords from the data graph

Query Answers
An answer is a directed subtree of the data graph
Contains all keywords of the query
Has no redundant edges (and nodes)
The keywords of the query are the leaves
The root has two or more children

Ranking: Inversely Proportional to Weight
rank(A) = weight(A)^{-1}
Smaller subtrees represent closer associations

Enumerating in Exact (Ranked) Order
If answer A precedes answer B in the enumeration, then weight(A) ≤ weight(B)
The first k answers generated are the top-k answers

Enumerating in a C-Approximate Order
If answer A precedes answer B in the enumeration, then weight(A) ≤ C · weight(B)
The first k answers generated are a C-approximation of the top-k answers (Fagin et al., PODS’01)
C may be a function of G and Q
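
The two ordering conditions can be checked mechanically. The following is a minimal sketch, assuming an answer is given as a list of (source, target, weight) triples and that the C-approximate condition is read as weight(A) ≤ C · weight(B) whenever A precedes B. The function names `tree_weight`, `rank`, and `is_c_approximate_order` are illustrative and do not come from the talk.

```python
def tree_weight(edges):
    """Total weight of an answer given as (source, target, weight) triples."""
    return sum(w for _, _, w in edges)


def rank(edges):
    """Rank is the inverse of the weight, so lighter answers rank higher."""
    return 1.0 / tree_weight(edges)


def is_c_approximate_order(weights, c=1.0):
    """Check a sequence of answer weights, listed in enumeration order.

    Returns True if every answer weighs at most c times each answer that
    comes after it. With c == 1.0 this is the exact (ranked) order.
    """
    heaviest_so_far = 0.0
    for w in weights:
        # Comparing against the heaviest earlier answer covers all earlier ones.
        if heaviest_so_far > c * w:
            return False
        heaviest_so_far = max(heaviest_so_far, w)
    return True
```

For example, the weight sequence 3, 2, 5 is not in exact order, but it is in a 2-approximate order because 3 ≤ 2 · 2.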

Polynomial Delay
Yardstick of efficiency: polynomial delay
Polynomial time between generating successive answers
Exponentially many answers even for 2 keywords (it is inefficient to generate all answers and then sort)

The Main Results

Top Answers are Steiner Trees
Finding the top answer in KPS (a.k.a. the Steiner-tree problem) is intractable
–Therefore, one cannot enumerate all answers in ranked order with polynomial delay
However, the top answer can be found efficiently under data complexity
–That is, the number of keywords is fixed
Approximations can be found efficiently under query-and-data complexity
–There is a lot of work on Steiner-tree approximations

So What Can Be Done?
Can answers of KPS be enumerated in the exact order with polynomial delay, under data complexity?
Can approximations of Steiner trees be used for efficiently enumerating in an approximate order (while preserving the approximation ratio)?

Our Results
Theorem 1: Under data complexity, answers of KPS can be enumerated in the exact order with polynomial delay

Our Results (cont’d)
Theorem 2: Under query-and-data complexity, given an efficient C-approximation for finding Steiner trees, one can enumerate with polynomial delay in a (C+1)-approximate order

The Meaning of the Results
KPS is tractable under data complexity
Under query-and-data complexity, an efficient enumeration in an approximate order can be done with almost the same ratios as Steiner trees
All results on Steiner trees can be applied to KPS
Existing approaches to KPS are heuristics
–Exponential delay in the worst case
–No provable nontrivial approximation ratios
From a theoretical point of view, using heuristics is not the only option

Enumerating in the Exact Order

Lawler’s Method
We use the technique of Lawler (1972), which is an iterative method for finding the top-k answers
Each iteration generates the next answer by finding the top answer under constraints
Lawler’s method is designed for general (discrete) optimization problems
When applying it to a specific problem, one needs to deal with the following two issues

Two Problems to Solve
1. What exactly are the constraints? (That is, how can we apply Lawler’s method so that the constraints make it possible to find top answers efficiently?)
2. How can we efficiently find the top answer under constraints?

Solving the First Problem
Constraints are subtrees of the graph
Pairwise node disjoint
Their leaves are exactly the keywords of the query
An answer satisfies the constraints if it contains all the subtrees (i.e., it is a supertree)

Two Problems to Solve (One Left)
1. What exactly are the constraints? (That is, how can we apply Lawler’s method so that the constraints make it possible to find top answers efficiently?)
2. How can we efficiently find the top answer under constraints?

Formulation of the Second Problem
Input: constraints (node-disjoint subtrees, keywords as leaves)
Objective: A minimal answer satisfying the constraints (i.e., containing all the subtrees)
Next, an algorithm that solves “almost” this problem, namely:
(Almost the same) Objective: A minimal supertree satisfying the constraints

Finding a Minimal Supertree
Input: G and 𝒯 (the constraints, i.e., subtrees)
1. Collapse each of the subtrees of 𝒯 into a node
2. Find a Steiner tree T of the collapsed subtrees
3. Restore the collapsed subtrees in T
(more details in the proceedings…)

This is not Enough!
Input: constraints (node-disjoint subtrees, keywords as leaves)
Objective: A minimal answer satisfying the constraints (i.e., containing all the subtrees)
(Almost the same) Objective: A minimal supertree satisfying the constraints
Not the same!

Query Answers Revisited
An answer is a directed subtree of the data graph
Contains all keywords of the query
Has no redundant edges (and nodes)
Keywords are the leaves
The root has two or more children

An Example
[Figure labels: the minimal supertree satisfying the constraints; the minimal answer satisfying the constraints]
This edge is redundant! But it cannot be removed since it is a constraint!
The minimal answer can be completely different from the minimal supertree
Furthermore, there can be no answer even if there is a supertree

What if We Remove Edges of Constraints?
What if we first generate a minimal supertree and, if the root has only one child, just remove the root (repeating until an answer is obtained)?
The constraints are violated, leading to a failure of Lawler’s method! That is,
–Some answers will be duplicated
–While other answers will not be generated at all

Our Approach
[Diagram labels: Constraints, Min. Supertree, Transform, New constraints, Answer]
The root of this subtree has more than one child and it must be the root of the answer

This Process is Repeated
[Diagram labels: Constraints, Min. Supertree (repeated), Transform]
Up to 2^(#keywords) times (fixed & usually fewer)
The best is the final answer

About the Transformation
The details of the exact transformation and the proof of correctness are intricate
All can be found in the proceedings…
This concludes the algorithm for enumerating in the exact order

A Different View: Chain of Reductions
Enumerating answers in ranked order
–reduces to the next problem by: adapting Lawler’s method
Finding the top answer under constraints
–reduces to the next problem by: transformation of constraints
Finding minimal supertrees
–reduces to the next problem by: collapse and restore
Finding Steiner trees

Enumerating in an Approximate Order

Modifying the Chain of Reductions
Enumeration in an approximate order
Finding approximate answers under constraints
Finding approximations of minimal supertrees
Finding approximations of Steiner trees
[Annotations on the reduction steps: "Similar"; "Completely different!"]

Exact Order Revisited
[Diagram labels: Constraints, Min. Supertree (repeated), Transform]
Up to 2^(#keywords) repetitions
We cannot allow it under query-and-data complexity!

The Algorithm
Input: constraints
A C-approximation of the minimal supertree (collapse and restore): ≤ C times the optimum
A minimal answer for 3 or fewer constraints (the algorithm for the exact order): ≤ 1 times the optimum

Combine the Subtrees
A C-approximation of the minimal supertree (collapse and restore): ≤ C times the optimum
A minimal answer for 3 or fewer constraints (the algorithm for the exact order): ≤ 1 times the optimum
The combined subgraph contains an answer: ≤ (C+1) times the optimum
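
The (C+1) factor is the sum of the two bounds on the slide. A short derivation sketch follows, under assumptions that are not stated on the slides: edge weights are non-negative, OPT denotes the weight of the optimum referred to above, S is the approximate supertree, A' is the minimal answer for 3 or fewer constraints, and H is their combination. All of these symbols are introduced here for illustration only.

```latex
\begin{align*}
\mathrm{weight}(S)  &\le C \cdot \mathrm{OPT}, \qquad
\mathrm{weight}(A') \le 1 \cdot \mathrm{OPT}, \\
\mathrm{weight}(H)  &\le \mathrm{weight}(S) + \mathrm{weight}(A')
                     \le (C+1) \cdot \mathrm{OPT}.
\end{align*}
% Any answer contained in H uses only edges of H, so with non-negative
% edge weights its weight is at most weight(H) <= (C+1) * OPT.
```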

Conclusion and Future Work

Keyword Proximity Search
A common paradigm for keyword search over structured databases
In the formal model:
–Data are directed and weighted graphs
–Queries are sets of keywords (i.e., nodes) from the data graph
–Query answers are non-redundant subtrees containing the keywords of the query
The goal is to find the top-k answers, where the rank is inversely proportional to the weight
A stronger goal: enumeration with polynomial delay

Our Results
Under data complexity, answers can be enumerated in the exact ranked order with polynomial delay
Under query-and-data complexity, every efficient C-approximation to the Steiner-tree problem yields an algorithm for enumerating answers with polynomial delay in a (C+1)-approximate order

Our Chain of Reductions
Enumerating answers in sorted order
–reduces to the next problem by: Lawler’s approach
Finding the top answer under constraints
–reduces to the next problem by: the intricate part…
Finding minimal supertrees
–reduces to the next problem by: subtree collapse/restore
Finding Steiner trees

Other Variants of KPS
Our algorithms can be adapted to other popular variants of KPS

Undirected Variant
Answers are undirected trees

Strong Variant
Answers are undirected trees and keywords are leaves

Open Problems
Can we improve the space efficiency of our algorithms?
Some ranking functions (e.g., height) are easier than weight when looking for the top answer (no constraints), but
–The chain of reductions doesn’t work
–The complexity of finding the top answer under constraints is unknown
Can our results hold for richer queries that also have structural constraints?

Implementation Considerations
Bottlenecks: Steiner-tree algorithms and approximations
Thin graphs allow in-memory execution of our algorithms, even for large XML documents (e.g., DBLP)
New and intuitive ranking functions that are easier to implement efficiently

Related Work: Order vs. Efficiency
[Diagram: a spectrum of Exact Order, Approximate Order, Heuristic Order (no approximation guaranteed), No Order; one direction is labeled "More Desirable", the other "More Efficient"; further labels: "This work", "Past work", "(Queries have a fixed size)"]

Thank you. Questions?

Illustration of Lawler’s Method

Lawler’s Method (1972)

Find the Top Answer
In principle, at this point we should find the second-best answer
But instead…

Partition the Remaining Answers
Each partition is defined by a distinct set of constraints

Find the Top of each Set

Find the Second Answer
The second answer is the best among all the top answers in the partitions

Further Divide the Chosen Partition

And so on…

Adapting Lawler’s Method

Our Constraints
Inclusion constraints
–Node-disjoint subtrees of the data graph
–All the leaves are keywords
–An answer must contain all the subtrees
Exclusion constraints
–Edges of the data graph
–An answer must not contain any of the edges

Partitioning a Partition (cont)
The current partition has inclusion constraints I, exclusion constraints E, and top answer A
edges(A) \ I = {e_1, …, e_k}
The partition is divided into k partitions with top answers A_0, …, A_{k-1}:
A_0: inclusion I, exclusion E ∪ {e_1}
A_1: inclusion I ∪ {e_1}, exclusion E ∪ {e_2}
A_2: inclusion I ∪ {e_1, e_2}, exclusion E ∪ {e_3}
A_3: inclusion I ∪ {e_1, e_2, e_3}, exclusion E ∪ {e_4}
…
A_{k-1}: inclusion I ∪ {e_1, …, e_{k-1}}, exclusion E ∪ {e_k}

Generating Constraints (intuition)
Constraints (subtrees/edges) are obtained from existing constraints of the current partition and the top answer
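
The inclusion/exclusion partitioning can be tried out on a much simpler problem than KPS. The sketch below is not the algorithm of the talk: it applies Lawler’s scheme to a toy task, enumerating all m-element subsets of weighted items in increasing total weight, where the "top answer under constraints" is trivial to compute. The names `best_under_constraints` and `lawler_enumerate` are hypothetical, and items (rather than edges or subtrees) play the role of constraints.

```python
import heapq
from itertools import count


def best_under_constraints(weights, m, include, exclude):
    """Toy solver: cheapest m-subset containing `include`, avoiding `exclude`."""
    free = sorted((w, x) for x, w in weights.items()
                  if x not in include and x not in exclude)
    need = m - len(include)
    if need < 0 or need > len(free):
        return None  # the partition is empty
    chosen = frozenset(include) | {x for _, x in free[:need]}
    return sum(weights[x] for x in chosen), chosen


def lawler_enumerate(weights, m):
    """Yield (weight, subset) pairs in non-decreasing order of weight."""
    tie = count()  # tie-breaker so the heap never compares sets
    heap = []

    def push(include, exclude):
        best = best_under_constraints(weights, m, include, exclude)
        if best is not None:
            w, answer = best
            heapq.heappush(heap, (w, next(tie), answer, include, exclude))

    push(frozenset(), frozenset())
    while heap:
        w, _, answer, include, exclude = heapq.heappop(heap)
        yield w, answer
        # Split the rest of this partition: the i-th new partition keeps
        # e_1..e_{i-1} as inclusion constraints and excludes e_i.
        grown = include
        for e in sorted(answer - include):
            push(grown, exclude | {e})
            grown = grown | {e}
```

Each popped answer costs one call to the constrained solver per new partition, which is why an efficient "top answer under constraints" routine is the key requirement when adapting the method.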

Collapsing Subtrees

Collapsing a Subtree

Remove All Edges and Internal Nodes
Only the root is left

Remove Incoming Edges of Internal Nodes

Add Outgoing Edges to the Root
An edge that emanates from an internal node becomes an outgoing edge of the root

More Details
When adding an outgoing edge (r,u) to the root, the weight of (r,u) is the minimal weight among all the edges from the collapsed subtree to u
When restoring a subtree, each outgoing edge (r,u) of the root is replaced with an (arbitrary) original edge from the restored subtree to u, with the same weight
Incoming edges of internal nodes of the subtree are never restored
–Such edges cannot participate in G-supertrees
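
The collapse step described above can be sketched as follows. This is an illustrative reading, not the implementation from the proceedings: the graph is assumed to be a dictionary mapping directed edges (u, v) to weights, "internal nodes" is read as every node of the subtree other than its root, and the function name `collapse_subtree` is hypothetical. Restoring the subtree is not shown.

```python
def collapse_subtree(edges, root, subtree_nodes):
    """Collapse a subtree of a directed weighted graph into its root.

    edges: dict mapping (u, v) -> weight
    root: the root node of the subtree
    subtree_nodes: all nodes of the subtree, including the root
    Returns a new edge dict; the input is left unchanged.
    """
    removed = set(subtree_nodes) - {root}
    collapsed = {}
    for (u, v), w in edges.items():
        if v in removed:
            continue  # incoming edges of removed nodes are dropped
        if u in removed:
            u = root  # an edge leaving the subtree now leaves the root
        if u == v:
            continue  # an edge back into the root would become a self-loop
        # Keep the minimal weight among parallel edges to the same target.
        if (u, v) not in collapsed or w < collapsed[(u, v)]:
            collapsed[(u, v)] = w
    return collapsed
```

For instance, if the subtree rooted at r contains an internal node x with an edge (x,u) of weight 5, and the graph also has an edge (r,u) of weight 7, the collapsed graph has a single edge (r,u) of weight 5.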