Keyword Proximity Search on Graphs M.Sc. Systems Course The Hebrew University of Jerusalem, Winter 2006.



Keyword Proximity Search on Graphs MSSYS 2006
Keyword Proximity Search
- A rapidly evolving paradigm for data extraction
- Data have varying degrees of structure (relational databases, Web sites, XML documents)
- Queries are sets of keywords, with no structural constraints
- The goal: extract meaningful parts of the data w.r.t. the keywords

Recent Work on KPS (Keyword Proximity Search)
- DataSpot (SIGMOD 1998)
- Information Units (WWW 2001)
- BANKS (ICDE 2002, VLDB 2005)
- DISCOVER (VLDB 2002)
- DBXplorer (ICDE 2002)
- XKeyword (ICDE 2003)
- …

Systems for KPS on Relational Data
BANKS, DISCOVER and DBXplorer implemented KPS (Keyword Proximity Search) on relational databases
- Different algorithms are used
- Slight differences in semantics
References:
- G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, pages 431–440.
- V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword search in relational databases. In VLDB, pages 670–681.
- S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: enabling keyword search over relational databases. In SIGMOD Conference, page 627, 2002.

Example: KPS on RDB
Search: Belgium, Brussels

Cities
ID    Name         Population
22    Amsterdam    …
…     Brussels     …

Countries
Code  Name         Area  Capital
NL    Netherlands  …     …
B     Belgium      …     …

Organizations
ID    Name  Head Q.
135   EU    73
175   ESA   81

Memberships
Country  Org.
B        135
NL       135

Example: KPS on RDB (first result)
Search: Belgium, Brussels
- Tuples in the result: Countries (B, Belgium) and Cities (Brussels)
- Meaning: Brussels is the capital city of Belgium

Example: KPS on RDB (second result)
Search: Belgium, Brussels
- Tuples in the result: Countries (B, Belgium), Memberships (B, 135), Organizations (EU, 73) and Cities (Brussels)
- Meaning: Brussels hosts the EU, and Belgium is a member

XKeyword: KPS on XML
XKeyword implemented KPS on XML
- Architecture is based on that of DISCOVER
- A demo over DBLP is available
Reference: V. Hristidis, Y. Papakonstantinou, and A. Balmin. Keyword proximity search on XML graphs. In ICDE, pages 367–378, 2003.

Example: KPS on XML
Search: Yannakakis, Approximation
- First result: Yannakakis wrote a paper about Approximation
- Second result: Yannakakis is cited by a paper about Approximation

KPS on Web Sites (Information Units)
- KPS can also be used for retrieving information from Web sites
- For a given query, results are collections of Web pages from the site
  - Pages are relevant w.r.t. the keywords
  - Pages are connected by hyperlinks
Reference: Wen-Syan Li, K. Selçuk Candan, Quoc Vu, and Divyakant Agrawal. Retrieving and organizing web pages by “information unit”. In WWW, 2001.

Example: KPS in Web Sites
http://
Search: Hilton, Beach
Pages shown in the result: Eilat, Eilat Beaches, Hilton Eilat Queen of Sheba

A Formal Framework for KPS

Data Graphs
Data graphs have two types of nodes:
- Structural nodes
- Keywords

Queries
Queries are sets of keywords from the data graph
Example: K = {Summers, Cohen, coffee}

Query Results
Query results are subtrees of the data graph that
- contain all keywords in the query
- have no redundant edges
Such a subtree is said to be reduced w.r.t. the keywords

Three Variants
Three variants of keyword proximity search are considered:
- Rooted proximity
- Undirected proximity
- Strong proximity

Rooted Variant
- Used in BANKS
- Results are rooted trees

Undirected Variant
- Used in Interconnection Semantics for XML
- Results are undirected trees

Strong Variant
- Used in XKeyword, Information Units, DBXplorer and DISCOVER
- Results are undirected trees in which the keywords are leaves

Problem Definition
Input:
- Data: a data graph G
- Query: a set K of keywords in G
Output:
- Query results: subtrees of G that are reduced w.r.t. K (rooted / undirected / strong)
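To make "reduced w.r.t. K" concrete, here is a minimal sketch for the undirected variant only. It relies on a standard observation rather than on anything stated in the slides: an undirected tree that contains all keywords has no redundant edges exactly when every leaf is a keyword. The function name and the input representation (a node set plus an edge list) are assumptions made for illustration.

```python
def is_reduced_undirected(nodes, edges, keywords):
    """Check whether an undirected tree is reduced w.r.t. `keywords`.

    Assumes (nodes, edges) really form a tree (connected and acyclic)
    and that `keywords` is non-empty; neither is verified here.
    """
    nodes, keywords = set(nodes), set(keywords)
    if not keywords <= nodes:
        return False              # some keyword is missing from the tree
    if len(nodes) == 1:
        return True               # a single keyword node, no edges at all
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    leaves = {n for n, d in degree.items() if d == 1}
    # A non-keyword leaf could be removed, so its edge would be redundant.
    return leaves <= keywords


# Hypothetical example: a structural node "c" connecting two keywords.
good = is_reduced_undirected({"c", "Belgium", "Brussels"},
                             [("c", "Belgium"), ("c", "Brussels")],
                             {"Belgium", "Brussels"})
# The same tree with a dangling structural node "x" is not reduced.
bad = is_reduced_undirected({"c", "x", "Belgium", "Brussels"},
                            [("c", "Belgium"), ("c", "Brussels"), ("c", "x")],
                            {"Belgium", "Brussels"})
```

The rooted and strong variants need additional conditions (on the root and on keyword positions) that this sketch does not cover.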

Creating Data Graphs from Relational Databases
- Nodes are tuples
- Edges are foreign-key references
- There is an edge from each tuple node to every keyword in that tuple
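The construction can be sketched in a few lines. The representation below (a dictionary from each node to its set of successors), the tuple identifiers, and the choice to direct each foreign-key edge from the referencing tuple to the referenced one are all assumptions for illustration; the slides only say that edges are foreign-key references.

```python
def build_data_graph(tuples, foreign_keys):
    """Build a directed data graph from relational tuples.

    tuples:       dict mapping a tuple id to the keywords it contains
    foreign_keys: iterable of (referencing_tuple_id, referenced_tuple_id)
    Returns a dict mapping each node to the set of its successors.
    """
    graph = {}
    for tid, words in tuples.items():
        tuple_node = ("tuple", tid)
        graph.setdefault(tuple_node, set())
        for word in words:
            keyword_node = ("keyword", word)   # one node per distinct keyword
            graph.setdefault(keyword_node, set())
            graph[tuple_node].add(keyword_node)
    for src, dst in foreign_keys:
        graph[("tuple", src)].add(("tuple", dst))
    return graph


# Hypothetical ids, loosely based on the Countries/Organizations/Memberships example.
graph = build_data_graph(
    tuples={"country:B": ["Belgium"], "org:135": ["EU"], "member:B-135": []},
    foreign_keys=[("member:B-135", "country:B"), ("member:B-135", "org:135")],
)
```

Keyword nodes end up with no outgoing edges, so in this sketch they can only appear as leaves of a rooted result.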

Creating Data Graphs from XML
- Nodes are XML elements
- Edges represent nesting of elements and ID references
- Keywords appear in PCDATA
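A rough sketch of this construction using Python's standard `xml.etree.ElementTree` module follows. The attribute names `id` and `ref`, the node naming scheme, and the sample document are hypothetical; a real document would declare its ID/IDREF attributes in a DTD or schema. For brevity, only the text directly inside an element (not tail text) is treated as PCDATA.

```python
import xml.etree.ElementTree as ET


def xml_to_data_graph(xml_text, id_attr="id", ref_attr="ref"):
    """Return the set of directed edges of a data graph built from XML."""
    root = ET.fromstring(xml_text)
    names, by_id = {}, {}
    for i, elem in enumerate(root.iter()):       # document order
        names[elem] = f"{elem.tag}#{i}"          # one node per element
        if id_attr in elem.attrib:
            by_id[elem.attrib[id_attr]] = names[elem]
    edges = set()
    for elem in root.iter():
        for child in elem:
            edges.add((names[elem], names[child]))       # nesting
        ref = elem.attrib.get(ref_attr)
        if ref in by_id:
            edges.add((names[elem], by_id[ref]))         # ID reference
        for word in (elem.text or "").split():
            edges.add((names[elem], ("kw", word)))       # keyword in PCDATA
    return edges


doc = ('<dblp><article id="a1"><author>Yannakakis</author></article>'
       '<article id="a2"><title>Approximation</title><cite ref="a1"/></article></dblp>')
edges = xml_to_data_graph(doc)
```

Because a keyword node is identified by the word itself, every occurrence of the same word maps to the same node.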

All Occurrences of a Keyword are Represented by One Node
- A keyword (e.g., Approximation) is represented by a single node

Creating Data Graphs from Web Sites
- Nodes are Web pages
- Edges are hyperlinks/XLinks
- Keywords appear in these pages
- A keyword is represented by a single node

Ranking and Enumeration Order

Ranking Results
Ranking of results is determined by size

Edges Have Weights
- Edges incident to dblp have a large weight
- Edges from cite to article have a medium weight
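One way to combine "ranking by size" with edge weights is to take the total edge weight of a result as its size and to list lighter results first. That reading, the function names, and the concrete weights below are assumptions for illustration, not values from the slides.

```python
def tree_weight(tree_edges, weight):
    """Total weight of a result given as a collection of (u, v) edges."""
    return sum(weight[edge] for edge in tree_edges)


def rank_results(results, weight):
    """Sort results from lightest to heaviest (assumed to mean best first)."""
    return sorted(results, key=lambda tree: tree_weight(tree, weight))


# Hypothetical weights: edges incident to dblp are heavy, cite-to-article is medium.
weight = {("dblp", "a1"): 10, ("dblp", "a2"): 10,
          ("a2", "cite"): 1, ("cite", "a1"): 3}
via_dblp = [("dblp", "a1"), ("dblp", "a2")]   # connects a1 and a2 through the root
via_cite = [("a2", "cite"), ("cite", "a1")]   # connects them through a citation
ranked = rank_results([via_dblp, via_cite], weight)
```

With these weights the citation-based connection ranks ahead of the connection through dblp, which links every pair of articles and therefore says little about them.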

Order of Results
- Arbitrary order
- Exact order
- Heuristic order
- C-approximate order

Measuring the Efficiency of Enumerations

Polynomial Runtime is not Appropriate for KPS
- In the theory of CS, the usual notion of efficiency is polynomial running time
  - That is, the algorithm terminates in time that is polynomial in the size of the input
- However, in KPS the number of results can be exponential in the size of the input
  - Algorithms cannot be expected to terminate in polynomial time
  - This holds even for two keywords
- Therefore, other notions are required

Time Efficiency
- Polynomial total time: the runtime is polynomial in the combined size of the input and the output
- Polynomial delay: the runtime between two successive results is polynomial in the size of the input

About Polynomial Delay
With polynomial delay you can:
- Generate the first few results quickly
- Efficiently return results in pages
In most cases of keyword search, this is the suitable notion of efficiency
Goal: develop algorithms that enumerate KPS results with polynomial delay
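The "results in pages" point can be illustrated with a lazy enumerator. The sketch below uses a stand-in generator in place of a real KPS algorithm; `fake_enumerator` and `pages` are hypothetical names. It shows that producing the first page consumes only as many results as the page holds.

```python
from itertools import islice


def fake_enumerator(log):
    """Stand-in for a polynomial-delay enumerator: yields results forever."""
    i = 0
    while True:
        log.append(i)            # record that result i was actually computed
        yield f"result-{i}"
        i += 1


def pages(results, page_size):
    """Group a (possibly huge) stream of results into pages, lazily."""
    it = iter(results)
    while True:
        page = list(islice(it, page_size))
        if not page:
            return
        yield page


log = []
pager = pages(fake_enumerator(log), page_size=3)
first_page = next(pager)         # only three results are computed here
```

If each result arrives after a polynomial delay, each page costs the page size times that delay, regardless of how many results exist in total.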

Space Efficiency
- Polynomial space
- Linearly-incremental space
  - Generating the first i results requires space that is at most i times a polynomial in the size of the input

Data and Query-and-Data Complexity
- Under query-and-data complexity, we assume that both the query and the data are of unbounded size
  - Many problems in database theory, e.g., computing joins of relational tables, are intractable under this measure
- In practice, however, queries are very small compared to the data
- Under data complexity, the size of the query is assumed to be fixed

Enumerating Results of KS with Polynomial Delay

Keyword Search with Polynomial Delay
- The following algorithm enumerates reduced subtrees (i.e., results of keyword search) with polynomial delay
  - Results are not ranked
- There is a different version of the algorithm for each of the three variants:
  - rooted
  - undirected
  - strong

Importance of the Algorithm
- An upper bound for ranked keyword search: results can be enumerated in ranked order in polynomial total time
  - Generate all the results and then sort them
- In some cases, ranking is not required
- A basis for developing efficient heuristics that enumerate in an “almost” ranked order (discussed later)

The Algorithm for Enumerating Rooted Reduced Subtrees

Overview
- The algorithm uses two reductions
- Each reduction alone either does not solve the problem or runs in exponential total time
- However, the two reductions can be combined to enumerate reduced subtrees with polynomial delay

Data Reduction
1. Choose an arbitrary node v in K
2. For each parent p of v do:
   I. In K: replace v with p
   II. In G: remove v
   III. Generate all results for the new input
   IV. Add p→v to each result of the new input

Example Showing Failure
[Figure sequence: data reduction applied step by step to a data graph with the nodes A, B and C in K]
- The data graph has four results, two of them with the same root
- Data reduction finds only one result; the three others are missing

Why Data Reduction Fails
- We assumed that v is a leaf in every result
- This does not hold for structural nodes in recursive steps
- Therefore, some results are not found
- Solution(?): repeat data reduction for every v in K
  - Exponential total time in the worst case!

Query Reduction
1. Remove one keyword from the query
2. Find all results for the smaller query
3. Extend each result to include the missing keyword, in every possible way
Example: K = {A, B, C}

Extending Partial Results
In query reduction, we need to extend a result T of the query K\{k} to all results of the query K
This is done as follows:
- For every node v of T:
  - Remove from G all nodes of T, except for v
  - Find all simple directed paths P from v to k and print the concatenation of T and P
- If v is the root of T, we also need to concatenate T with all subtrees that are reduced w.r.t. v and k
More details can be found in the paper
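The path-finding step can be sketched as a generator. This is an illustrative implementation, not necessarily the one used in the paper: it enumerates simple directed paths from v to k by depth-first search, and it checks reachability before each descent so that no branch is a dead end, which should keep the work between two successive paths polynomial. The `blocked` parameter plays the role of the nodes of T removed from G; all names and the sample graph are hypothetical.

```python
def reachable(graph, source, target, blocked):
    """True if `target` can be reached from `source` while avoiding `blocked`."""
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                stack.append(nxt)
    return False


def simple_paths(graph, v, k, blocked=frozenset()):
    """Yield every simple directed path from v to k that avoids `blocked`."""
    def extend(path, used):
        node = path[-1]
        if node == k:
            yield list(path)
            return
        for nxt in graph.get(node, ()):
            # Descend only if k is still reachable without reusing nodes.
            if nxt not in used and reachable(graph, nxt, k, used):
                path.append(nxt)
                used.add(nxt)
                yield from extend(path, used)
                path.pop()
                used.remove(nxt)

    if v not in blocked and reachable(graph, v, k, set(blocked)):
        yield from extend([v], set(blocked) | {v})


graph = {"v": ["a", "b"], "a": ["k", "b"], "b": ["k"], "k": []}
all_paths = list(simple_paths(graph, "v", "k"))
avoiding_b = list(simple_paths(graph, "v", "k", blocked={"b"}))
```

Each yielded path would then be concatenated with T to form one extended result.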

Extensions by Directed Paths

Extensions by Directed Subtrees

Query Reduction is not Efficient!
- Query reduction completely solves the problem, but it is inefficient
- Problem: a subset of the query may have many more results than the query itself
  - In the example: 2^n results for {A,B}, but only 1 result for {A,B,C}
- Exponential total time!

Combining the Reductions
In order to enumerate in polynomial total time, combine query and data reductions:
- If some node v of K is reachable, in the data graph, from another node u of K, use query reduction to remove v from K
- Otherwise, use data reduction
By combining the two reductions, results can be enumerated in polynomial total time

Achieving Polynomial Delay
- To achieve polynomial delay, we cannot wait until a recursive subroutine terminates
- Use coroutines instead of subroutines!
- That is, each recursive execution of the algorithm
  - stops after generating each result
  - resumes when the next result is required
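In Python, generators give this stop-and-resume behavior directly. The sketch below contrasts the two styles on a toy recursive enumeration (all binary strings of length n) that merely stands in for the KPS recursion; the function names are hypothetical.

```python
def strings_subroutine(n):
    """Subroutine style: the recursive call returns only when ALL results exist."""
    if n == 0:
        return [""]
    shorter = strings_subroutine(n - 1)       # waits for every shorter result
    return [bit + s for s in shorter for bit in "01"]


def strings_coroutine(n):
    """Coroutine style: each level stops after every result and resumes on demand."""
    if n == 0:
        yield ""
        return
    for s in strings_coroutine(n - 1):        # pulls one shorter result at a time
        for bit in "01":
            yield bit + s


# The first result of a huge enumeration is available almost immediately;
# the subroutine version would have to build 2**50 strings first.
first = next(strings_coroutine(50))
```

Both versions produce the same set of results; only the moment at which each result becomes available differs.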

Subroutines: Polynomial Total Time
[Figure: routine 1, routine 2, routine 3]

Coroutines: Polynomial Delay
[Figure: routine 1, routine 2, routine 3]

For papers and projects related to this topic, see the home page of Benny Kimelfeld.