1
Keyword Proximity Search on Graphs M.Sc. Systems Course The Hebrew University of Jerusalem, Winter 2006
2
Keyword Proximity Search on Graphs MSSYS 2006
Keyword Proximity Search
A rapidly evolving paradigm for data extraction
– Data have varying degrees of structure: relational databases, Web sites, XML documents
– Queries are sets of keywords, with no structural constraints
The goal: extract meaningful parts of the data w.r.t. the keywords
3
Recent Work on KPS (Keyword Proximity Search)
– DataSpot (SIGMOD 1998)
– Information Units (WWW 2001)
– BANKS (ICDE 2002, VLDB 2005)
– DISCOVER (VLDB 2002)
– DBXplorer (ICDE 2002)
– XKeyword (ICDE 2003)
– …
4
Systems for KPS on Relational Data
– BANKS, DISCOVER and DBXplorer implemented KPS (Keyword Proximity Search) on relational databases
– Different algorithms are used
– Slight differences in semantics
G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, pages 431–440, 2002.
V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword search in relational databases. In VLDB, pages 670–681, 2002.
S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: enabling keyword search over relational databases. In SIGMOD Conference, page 627, 2002.
5
Example: KPS on RDB
Query: Belgium, Brussels

Cities
ID   Name       Population
22   Amsterdam  1101407
73   Brussels   951580

Organizations
ID   Name  Head Q.
135  EU    73
175  ESA   81

Memberships
Country  Org.
B        135
NL       135

Countries
Code  Name         Area   Capital
NL    Netherlands  37330  22
B     Belgium      30510  73
6
Example: KPS on RDB (same tables)
Query: Belgium, Brussels
Result: Brussels is the capital city of Belgium
7
Example: KPS on RDB (same tables)
Query: Belgium, Brussels
Result: Brussels is the capital city of Belgium
Tuples in the result:
– Countries: B, Belgium, 30510, 73
– Cities: 73, Brussels, 951580
8
Example: KPS on RDB (same tables)
Query: Belgium, Brussels
Result: Brussels hosts EU and Belgium is a member
9
Example: KPS on RDB (same tables)
Query: Belgium, Brussels
Result: Brussels hosts EU and Belgium is a member
Tuples in the result:
– Countries: B, Belgium, 30510, 73
– Memberships: B, 135
– Organizations: 135, EU, 73
– Cities: 73, Brussels, 951580
10
XKeyword: KPS on XML
– XKeyword implemented KPS on XML
– Architecture is based on that of DISCOVER
– A demo over DBLP is available: http://kebab.ucsd.edu:81/xkeyword
V. Hristidis, Y. Papakonstantinou, and A. Balmin. Keyword proximity search on XML graphs. In ICDE, pages 367–378, 2003.
11
Example: KPS on XML
Query: Yannakakis, Approximation
12
Query: Yannakakis, Approximation
Result: Yannakakis wrote a paper about Approximation
13
Query: Yannakakis, Approximation
Result: Yannakakis is cited by a paper about Approximation
14
KPS on Web Sites (Information Units)
– KPS can also be used for retrieving information from Web sites
– For a given query, results are collections of Web pages from the site
  – Pages are relevant w.r.t. the keywords
  – Pages are connected by hyperlinks
Wen-Syan Li, K. Selçuk Candan, Quoc Vu, and Divyakant Agrawal. Retrieving and organizing web pages by “information unit”. In WWW, pages 230–244, 2001.
15
Example: KPS in Web Sites
Site: http://www.goisrael.com/
Query: Hilton, Beach
16
Example: KPS in Web Sites
Query: Hilton, Beach
[Figure: a result made of the linked pages Eilat, Eilat Beaches, and Hilton Eilat Queen of Sheba]
17
A Formal Framework for KPS
18
Data Graphs
Data graphs have two types of nodes:
– Structural nodes
– Keywords
19
Queries
– Queries are sets of keywords from the data graph
– Example: K = {Summers, Cohen, coffee}
20
Query Results
21
Query Results
Query results are subtrees of the data graph that:
– Contain all keywords in the query
– Have no redundant edges
Such a subtree is said to be reduced w.r.t. the keywords
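As an illustration of the "no redundant edges" condition, here is a minimal Python sketch that tests whether an undirected tree is reduced w.r.t. a keyword set. It relies on the observation that a tree containing all keywords has a removable edge exactly when one of its leaves is not a keyword. The function name, the edge-list representation and the node names `paper` and `title` are assumptions made for this example; the rooted and strong variants impose further conditions that are not checked here.

```python
from collections import defaultdict

def is_reduced_undirected(tree_edges, keywords):
    """Check that a tree contains all keywords and that every leaf is a keyword.

    tree_edges: non-empty list of (u, v) pairs, assumed to form an undirected tree.
    keywords:   set of keyword nodes (the query K).
    """
    degree = defaultdict(int)
    for u, v in tree_edges:
        degree[u] += 1
        degree[v] += 1
    if not set(keywords) <= set(degree):
        return False  # some keyword is missing from the tree
    # A leaf that is not a keyword can be dropped together with its edge,
    # so its edge would be redundant.
    return all(node in keywords for node, d in degree.items() if d == 1)

query = {"Summers", "Cohen"}
reduced = [("paper", "Summers"), ("paper", "Cohen")]
not_reduced = reduced + [("paper", "title")]  # the edge to "title" is redundant
print(is_reduced_undirected(reduced, query))
print(is_reduced_undirected(not_reduced, query))
```

The check runs in time linear in the size of the tree, since it only counts degrees.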
22
Three Variants
Three variants of keyword proximity search are considered:
– Rooted proximity
– Undirected proximity
– Strong proximity
23
Rooted Variant
– Used in BANKS
– Results are rooted trees
24
Undirected Variant
– Used in Interconnection Semantics for XML
– Results are undirected trees
25
Strong Variant
– Used in XKeyword, Information Units, DBXplorer and DISCOVER
– Results are undirected trees and keywords are leaves
26
Problem Definition
Input:
– Data: a data graph G
– Query: a set K of keywords in G
Output:
– Query results: subtrees of G that are reduced w.r.t. K (rooted/undirected/strong)
27
Creating Data Graphs from Relational Databases
– Nodes are tuples
– Edges are foreign-key references
28
Creating Data Graphs from Relational Databases
– Edges from each tuple node to all the keywords in that tuple
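The construction can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary layout of the tables, the explicit list of foreign keys, the `(table, key)` naming of tuple nodes and the rule that every textual attribute value counts as one keyword are all choices made for this example. The sample rows use a subset of the columns from the Cities and Countries tables of the earlier example.

```python
from collections import defaultdict

def build_data_graph(tables, foreign_keys):
    """Build a directed data graph as {node: set of successor nodes}.

    tables:       {table_name: {primary_key: {column: value}}}
    foreign_keys: list of (table, column, referenced_table) triples
    Tuple nodes are (table, primary_key) pairs; keyword nodes are plain strings.
    """
    graph = defaultdict(set)
    for table, rows in tables.items():
        for key, row in rows.items():
            node = (table, key)
            graph[node]  # create the tuple node even if it gets no edges
            # Edge from the tuple node to every keyword in the tuple
            # (assumption: each textual attribute value is one keyword).
            for value in row.values():
                if isinstance(value, str):
                    graph[node].add(value)
                    graph[value]  # a keyword is represented by a single node
    # One edge per foreign-key reference, from the referencing tuple
    # to the referenced tuple.
    for table, column, referenced in foreign_keys:
        for key, row in tables[table].items():
            if row[column] in tables[referenced]:
                graph[(table, key)].add((referenced, row[column]))
    return graph

tables = {
    "Cities": {73: {"Name": "Brussels", "Population": 951580}},
    "Countries": {"B": {"Name": "Belgium", "Capital": 73}},
}
graph = build_data_graph(tables, [("Countries", "Capital", "Cities")])
```

In the resulting graph the tuple node for Belgium points both to the keyword Belgium and to the tuple node of its capital, which in turn points to the keyword Brussels.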
29
Creating Data Graphs from XML
– Nodes are XML elements
30
Creating Data Graphs from XML
– Nodes are XML elements
– Edges represent nesting of elements …
31
Creating Data Graphs from XML
– Nodes are XML elements
– Edges represent nesting of elements …
– … and ID references
32
Creating Data Graphs from XML
– Nodes are XML elements
– Edges represent nesting of elements …
– … and ID references
– Keywords appear in PCDATA
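A possible Python sketch of this construction, using the standard `xml.etree.ElementTree` module. The attribute names `id` and `ref`, the `tag#position` naming of structural nodes, the whitespace tokenization of character data and the small DBLP-like document are all assumptions made for this example; text that follows a closing tag (the element's tail) is ignored.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def xml_to_data_graph(xml_text, id_attr="id", ref_attr="ref"):
    """Return a directed data graph {node: set of successors} for an XML document."""
    root = ET.fromstring(xml_text)
    elements = list(root.iter())
    # Structural nodes are named by tag and document position, e.g. "article#1".
    name = {el: f"{el.tag}#{i}" for i, el in enumerate(elements)}
    by_id = {el.get(id_attr): el for el in elements if el.get(id_attr) is not None}
    graph = defaultdict(set)
    for el in elements:
        node = name[el]
        graph[node]
        for child in el:                        # edges for nesting of elements
            graph[node].add(name[child])
        target = by_id.get(el.get(ref_attr))    # edge for an ID reference
        if target is not None:
            graph[node].add(name[target])
        for word in (el.text or "").split():    # keywords from character data
            graph[node].add(word)
            graph[word]                         # one node per keyword
    return graph

doc = """
<dblp>
  <article id="a1"><author>Yannakakis</author><title>Approximation</title></article>
  <article id="a2"><cite ref="a1"/></article>
</dblp>
"""
graph = xml_to_data_graph(doc)
```

Here the `cite` element of the second article gets an edge to the first article, so a result can connect a citing paper to the keywords Yannakakis and Approximation.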
33
All Occurrences of a Keyword are Represented by One Node
– A keyword is represented by a single node
[Figure: a single node for the keyword Approximation]
34
Creating Data Graphs from Web Sites
Example site: http://www.goisrael.com/
– Nodes are Web pages …
– Edges are hyperlinks/XLinks
– Keywords appear in these pages …
– A keyword is represented by a single node
35
Ranking and Enumeration Order
36
Ranking Results
– Ranking of results is determined by size
37
Edges Have Weights
– Edges incident to dblp have a large weight
– Edges from cite to article have a medium weight
38
Order of Results
– Arbitrary Order
– Exact Order
39
Order of Results (cont’d)
– Heuristic Order
– C-Approximate Order
40
Measuring the Efficiency of Enumerations
41
Polynomial Runtime is not Appropriate for KPS
– In the theory of CS, the usual notion of efficiency is polynomial running time
  – That is, the algorithm terminates in time that is polynomial in the size of the input
– However, in KPS the number of results can be exponential in the size of the input
  – Algorithms cannot be expected to terminate in polynomial time, even for two keywords
– Therefore, other notions are required
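To see why two keywords already suffice, consider the following Python sketch. It assumes the undirected variant, where a reduced subtree for two keywords is simply a simple path between them, and it uses an illustrative graph that is not taken from the slides: a chain of n "diamonds" between the keywords `s0` and `sn`. The graph has 3n + 1 nodes, yet the number of results doubles with every diamond.

```python
def diamond_chain(n):
    """Undirected graph s0 - {u0, w0} - s1 - {u1, w1} - ... - sn as an adjacency dict."""
    graph = {}

    def add_edge(a, b):
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    for i in range(n):
        for middle in (f"u{i}", f"w{i}"):
            add_edge(f"s{i}", middle)
            add_edge(middle, f"s{i + 1}")
    return graph

def count_simple_paths(graph, source, target, visited=frozenset()):
    """Count the simple paths from source to target by exhaustive search."""
    if source == target:
        return 1
    visited = visited | {source}
    return sum(count_simple_paths(graph, nxt, target, visited)
               for nxt in graph[source] if nxt not in visited)

n = 10
graph = diamond_chain(n)
print(len(graph), count_simple_paths(graph, "s0", f"s{n}"))  # 31 nodes, 1024 paths
```

Any algorithm that outputs all results for this input must take time at least proportional to 2^n, whatever its internal strategy.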
42
Time Efficiency
– Polynomial Total Time: the runtime is polynomial in the combined size of the input and the output
– Polynomial Delay: the runtime between two successive results is polynomial in the size of the input
43
About Polynomial Delay
With polynomial delay you can:
– Generate the first few results quickly
– Efficiently return results in pages
In most cases of keyword search, this is the suitable notion of efficiency
Goal: develop algorithms that enumerate KPS results with polynomial delay
44
Space Efficiency
– Polynomial Space
– Linearly-Incremental Space: generating i results requires space that is i times a polynomial in the size of the input
45
Data and Query-and-Data Complexity
– Under query-and-data complexity, we assume that both the query and the data are of unbounded size
  – Many problems in database theory, e.g., computing joins of relational tables, are intractable under this measure
– In practice, however, queries are very small compared to the data
– Under data complexity, the size of the query is assumed to be fixed
46
Enumerating Results of KS with Polynomial Delay
47
Keyword Search with Polynomial Delay
– The following algorithm enumerates reduced subtrees (i.e., results of keyword search) with polynomial delay
– Results are not ranked
– There is a different version of the algorithm for each of the three variants: rooted, undirected and strong
48
Importance of the Algorithm
– An upper bound for ranked keyword search: results can be enumerated in ranked order in polynomial total time
  – Generate all the results and then sort them
– In some cases, ranking is not required
– A basis for developing efficient heuristics that enumerate in an “almost” ranked order (discussed later)
49
The Algorithm for Enumerating Rooted Reduced Subtrees
50
Overview
– The algorithm uses two reductions
– Each reduction alone either does not solve the problem or runs in exponential total time
– However, the two reductions can be combined to enumerate reduced subtrees with polynomial delay
51
Data Reduction
1. Choose an arbitrary node v in K
2. For each parent p of v do:
   I. In K: replace v with p
   II. In G: remove v
   III. Generate all results for the new input
   IV. Add p→v to each result of the new input
52
Example Showing Failure
[Figure: data graph and query K; labels A, B, C]
Four results! Two with this root
53
Failure Example
[Figure: data graph and query K; labels A, B, C]
54
Failure Example
[Figure: data graph and query K; labels B, C]
55
Failure Example
[Figure: data graph and query K; label C]
56
Failure Example
[Figure: query K]
57
Failure Example
[Figure: the result that was found; labels C, B, A]
Only one result! Three others are missing!
58
Why Data Reduction Fails
– We assumed that v is a leaf in every result
– This does not hold for structural nodes in recursive steps!
– Therefore, some results are not found!
– Solution(?): repeat data reduction for every v in K
  – Exponential total time in the worst case!
59
Query Reduction
1. Remove one keyword from the query
2. Find all results for the smaller query
3. Extend each result to include the missing keyword, in every possible way
[Figure: K = {A,B,C}]
60
Extending Partial Results
In query reduction, we need to extend a result T of the query K\{k} to all results of the query K
This is done as follows:
– For all nodes v of T:
  – Remove from G all nodes of T, except for v
  – Find all simple directed paths P from v to k and print the concatenation of T and P
– If v is the root of T, we also need to concatenate T with all subtrees that are reduced w.r.t. v and k
More details can be found in the paper
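The path-based part of this step can be sketched in Python as below. The sketch covers only the extensions by simple directed paths; the additional extensions by subtrees at the root of T are left out. The function names, the adjacency-dict representation of G, the edge-set representation of T and the small example graph are assumptions made for illustration.

```python
def simple_paths(graph, source, target, blocked):
    """Yield every simple directed path from source to target that avoids blocked nodes."""
    if source == target:
        yield [source]
        return
    for successor in graph.get(source, ()):
        if successor not in blocked:
            for rest in simple_paths(graph, successor, target, blocked | {source}):
                yield [source] + rest

def extend_by_paths(graph, tree_nodes, tree_edges, keyword):
    """Yield the edge sets obtained by attaching a path to `keyword` at each node v of T."""
    for v in tree_nodes:
        # "Remove from G all nodes of T, except for v": treat them as blocked.
        blocked = set(tree_nodes) - {v}
        for path in simple_paths(graph, v, keyword, blocked):
            # Concatenate T with the edges of the path P.
            yield set(tree_edges) | set(zip(path, path[1:]))

# Hypothetical data graph; T is a result for {A, B} and the missing keyword is C.
graph = {"r": {"A", "B", "x"}, "x": {"C"}, "B": {"C"}}
tree_nodes = ["r", "A", "B"]
tree_edges = {("r", "A"), ("r", "B")}
extensions = list(extend_by_paths(graph, tree_nodes, tree_edges, "C"))
```

In this example T is extended in two ways: by the path r→x→C attached at the root, and by the edge B→C attached at B.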
61
Extensions by Directed Paths
62
Extensions by Directed Subtrees
63
Query Reduction is not Efficient!
– Query reduction completely solves the problem, but it is inefficient
– Problem: a subset of the query may have many more results than the query itself
– Exponential total time!
[Figure: 2^n results for {A,B}, 1 result for {A,B,C}]
64
Combining the Reductions
In order to enumerate in polynomial total time, combine query and data reductions:
– If some node v of K is reachable, in the data graph, from another node u of K, use query reduction (remove v from K)
– Otherwise, use data reduction
By combining the two reductions, results can be enumerated in polynomial total time
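The choice between the two reductions comes down to a reachability test, which might be sketched in Python as follows. Only the decision rule is shown, not the reductions themselves; the function names, the return convention and the two small graphs are assumptions made for this example.

```python
from collections import deque

def reachable_from(graph, start):
    """Return the nodes reachable from start by a non-empty directed path (BFS)."""
    seen, queue = set(), deque(graph.get(start, ()))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph.get(node, ()))
    return seen

def choose_reduction(graph, query):
    """Return ("query", v) if some v in K is reachable from another u in K,
    and ("data", None) otherwise."""
    for u in query:
        for v in reachable_from(graph, u):
            if v in query and v != u:
                return ("query", v)
    return ("data", None)

# B is reachable from A, so query reduction removes B from K.
print(choose_reduction({"p": {"A"}, "A": {"B"}}, {"A", "B"}))  # ('query', 'B')
# No node of K is reachable from another, so data reduction is used.
print(choose_reduction({"p": {"A", "B"}}, {"A", "B"}))         # ('data', None)
```

Each call performs at most one breadth-first search per node of K, so the test itself is polynomial in the size of the input.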
65
Achieving Polynomial Delay
– To achieve polynomial delay, we cannot wait until a recursive subroutine terminates
– Use coroutines instead of subroutines!
– That is, each recursive execution of the algorithm
  – stops after generating each result
  – resumes when the next result is required
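The difference can be illustrated with Python generators, which are used here as a stand-in for coroutines. The recursive enumeration of bit tuples below is only a placeholder for the recursive enumeration of KPS results; both function names are made up for this example.

```python
from itertools import islice

def results_subroutine(n, prefix=()):
    """Subroutine style: the caller gets nothing until the whole list is built."""
    if n == 0:
        return [prefix]
    return (results_subroutine(n - 1, prefix + (0,))
            + results_subroutine(n - 1, prefix + (1,)))

def results_coroutine(n, prefix=()):
    """Coroutine style: each recursive call pauses at every result
    and resumes only when the next result is requested."""
    if n == 0:
        yield prefix
        return
    yield from results_coroutine(n - 1, prefix + (0,))
    yield from results_coroutine(n - 1, prefix + (1,))

# The coroutine version returns the first results of a 2**50-element
# enumeration immediately; the subroutine version could not finish.
first_three = list(islice(results_coroutine(50), 3))
print(len(first_three))  # 3
```

Both versions produce the same results in the same order. Only the coroutine version lets the consumer stop after a page of results without paying for the rest.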
66
Subroutines: Polynomial Total Time
[Figure: routine 1, routine 2, routine 3]
67
Coroutines: Polynomial Delay
[Figure: routine 1, routine 2, routine 3]
68
For papers and projects related to this topic, see the home page of Benny Kimelfeld