Published by Veronica Moore. Modified over 9 years ago.
1
Sanjay Agrawal, Surajit Chaudhuri, Gautam Das
Presented by: SRUTHI GUNGIDI
2
Outline
- Introduction
- Overview of DBXplorer
- Symbol Table Design: Publish
- Keyword Search
- Generalized Matches
- Results and Conclusions
CSE 6339: Data Exploration and Analysis in Relational Databases, Fall 2010
3
Introduction
- Keyword-based search: given a set of keywords, a ranked list of documents is returned.
- Traditional database systems require knowledge of the schema.
- Search over databases: applying IR techniques from the document world to databases without that requirement is a challenging task.
- Given a set of keywords, matching rows may need to be obtained by joining several tables on the fly.
- Example: search for a book on 'Programming' by 'Ritchie'.
[Figure: schema diagram; recoverable labels: Authors, Books, Stores]
4
DBXplorer Overview (1/2)
- Goal: enable keyword search without requiring knowledge of the schema of the respective databases.
- Given a set of query keywords, DBXplorer returns all matching rows (either from a single table or by joining tables connected by foreign-key joins) such that each row contains all keywords.
- The following slides consider the case of a single database. The system also allows searching multiple databases simultaneously.
5
DBXplorer Overview (2/2)
DBXplorer requires two steps.
Publish (preprocessing step):
- Identify the database, along with the set of tables and columns within it to be published.
- Build the symbol table (auxiliary tables).
Search step, given a query of keywords:
- Look up the symbol table to identify the tables, columns, or cells containing the keywords.
- Enumerate join trees.
- For each join tree, construct and execute a SQL statement that selects the rows containing all keywords.
- Rank the rows and return them.
6
Symbol Table Designs
- Traditional IR techniques use data structures such as inverted lists to efficiently identify the documents containing a query keyword.
- The symbol table is the key data structure; it stores information about keywords at different granularities. For each keyword:
  - Column level (Pub-Col): list of table.column entries that contain it.
  - Cell level (Pub-Cell): list of table.column.rowid entries that contain it.
  - Row level: list of all rows that contain it (offers very little advantage).
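The two main granularities can be illustrated with a minimal Python sketch. It assumes exact matches on whole cell values, and the function name `build_symbol_tables` and the sample tables are hypothetical, not taken from DBXplorer itself.

```python
from collections import defaultdict

def build_symbol_tables(tables):
    """tables: {table_name: {column_name: [cell values; list index = row id]}}."""
    pub_col = defaultdict(set)   # keyword -> {(table, column)}
    pub_cell = defaultdict(set)  # keyword -> {(table, column, rowid)}
    for table, columns in tables.items():
        for column, values in columns.items():
            for rowid, value in enumerate(values):
                keyword = str(value).lower()  # assumption: whole-cell, case-insensitive match
                pub_col[keyword].add((table, column))
                pub_cell[keyword].add((table, column, rowid))
    return pub_col, pub_cell

# Hypothetical sample data.
tables = {
    "Books": {"title": ["Programming", "Databases", "Programming"]},
    "Authors": {"name": ["Ritchie", "Date"]},
}
pub_col, pub_cell = build_symbol_tables(tables)
```

In this sample, 'programming' has a single Pub-Col entry but one Pub-Cell entry per row that contains it, which is the space difference between the two designs.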
7
Symbol Table Granularity
- The Pub-Col symbol table is almost always the better choice, unless certain columns do not have indexes.
- Space and build time: Pub-Col symbol tables take less space and time to build, since only the distinct values in a column need to be recorded.
- Keyword search performance: depends on the efficient generation and execution of the SQL statements. Pub-Cell generates more SQL statements than Pub-Col.
- Ease of symbol table maintenance: a Pub-Col table requires an update only if an insertion introduces a new value into some column, whereas a Pub-Cell table needs to be updated for every inserted row. Deletions behave similarly.
8
Symbol Table Granularity
- Hybrid symbol table: if an index is available for a column, publish it at Pub-Col granularity; otherwise publish it at Pub-Cell granularity.
9
Storing Symbol Tables in Databases
- Store the Pub-Col symbol table as (keyword hash, ColId) pairs. Two compression methods:
  - FK-Comp (foreign-key compression): if there is a key/foreign-key relationship between columns c1 and c2, store only c1.
  - CP-Comp (columns not necessarily tied by foreign-key relationships): partition H, the uncompressed hash table viewed as a bipartite graph between keyword hashes and columns, into a minimum number of bipartite cliques, then compress each clique.
- Store the Pub-Cell symbol table as (keyword hash, list of CellIds).
[Figure: uncompressed hash table, compressed hash table, and ColumnsMap table]
10
Keyword Search (1/2)
1. Look up the symbol table to find the tables and columns that contain at least one of the keywords (the MatchedTables).
2. Enumerate join trees: view the schema graph G as an undirected graph and enumerate all possible join trees, i.e., subtrees of G such that
   a. the leaves belong to MatchedTables, and
   b. together, the leaves contain all keywords of the query.
[Figure: schema graph and join trees]
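Step 2 can be sketched as a brute-force enumeration over subsets of schema-graph edges. This is only an illustration under stated assumptions: the function names, the three-table schema, and the keyword matches are hypothetical, and the exhaustive search is not the enumeration algorithm DBXplorer uses.

```python
from itertools import combinations

def _connected(nodes, subset):
    """Return True if the edges in subset connect all of nodes."""
    adjacency = {n: set() for n in nodes}
    for a, b in subset:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == nodes

def enumerate_join_trees(edges, matches, keywords):
    """edges: undirected FK links; matches: {table: query keywords found in it}."""
    keywords = set(keywords)
    trees = []
    # Single-table trees: one table holds every keyword.
    for table, found in matches.items():
        if keywords <= found:
            trees.append(((table,), ()))
    # Multi-table trees: try every edge subset (exponential; tiny schemas only).
    for size in range(1, len(edges) + 1):
        for subset in combinations(edges, size):
            nodes = {t for edge in subset for t in edge}
            if len(subset) != len(nodes) - 1 or not _connected(nodes, subset):
                continue  # not a tree
            degree = {n: 0 for n in nodes}
            for a, b in subset:
                degree[a] += 1
                degree[b] += 1
            leaves = [n for n in nodes if degree[n] == 1]
            if any(not matches.get(leaf) for leaf in leaves):
                continue  # condition (a): every leaf is a matched table
            covered = set().union(*(matches[leaf] for leaf in leaves))
            if keywords <= covered:  # condition (b): leaves cover all keywords
                trees.append((tuple(sorted(nodes)), subset))
    return trees

# Hypothetical schema graph and symbol-table lookups.
edges = [("Authors", "AuthorsBooks"), ("AuthorsBooks", "Books")]
matches = {"Authors": {"ritchie"}, "Books": {"programming"}}
trees = enumerate_join_trees(edges, matches, ["ritchie", "programming"])
```

Here the only join tree connects Authors and Books through AuthorsBooks; the single-edge candidates are rejected because AuthorsBooks would be an unmatched leaf.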
11
Keyword Search (2/2)
3. Identify the matching rows.
   - Each join tree is mapped to a single SQL statement that joins the tables as specified in the tree and selects the rows that contain all keywords.
   - The retrieved rows are ranked before being output.
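The mapping from a join tree to one SQL statement can be sketched with Python's built-in sqlite3 module. The helper `join_tree_to_sql`, the table and column names, and the sample rows are hypothetical, and exact-match predicates are assumed; ranking is left out.

```python
import sqlite3

def join_tree_to_sql(tables, joins, predicates):
    """tables: tables in the join tree; joins: (left_col, right_col) pairs;
    predicates: (qualified_column, keyword) pairs taken from the symbol table."""
    conditions = [f"{left} = {right}" for left, right in joins]
    conditions += [f"{column} = ?" for column, _ in predicates]
    sql = f"SELECT * FROM {', '.join(tables)} WHERE {' AND '.join(conditions)}"
    params = [keyword for _, keyword in predicates]  # keywords are bound, not inlined
    return sql, params

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Authors (au_id INTEGER, name TEXT);
    CREATE TABLE Books (book_id INTEGER, title TEXT);
    CREATE TABLE AuthorsBooks (au_id INTEGER, book_id INTEGER);
    INSERT INTO Authors VALUES (1, 'Ritchie'), (2, 'Date');
    INSERT INTO Books VALUES (10, 'Programming'), (11, 'Databases');
    INSERT INTO AuthorsBooks VALUES (1, 10), (2, 11);
""")

sql, params = join_tree_to_sql(
    ["Authors", "AuthorsBooks", "Books"],
    [("Authors.au_id", "AuthorsBooks.au_id"),
     ("AuthorsBooks.book_id", "Books.book_id")],
    [("Authors.name", "Ritchie"), ("Books.title", "Programming")],
)
rows = conn.execute(sql, params).fetchall()
```

Table and column names are interpolated into the statement because they come from the published schema, while the user-supplied keywords are passed as bound parameters.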
12
Generalized Matches (1/2)
- Token matches: the keyword in the query matches only a token or substring of an attribute value (e.g., LIKE '%string%').
- Pub-Prefix method: some precomputation is done at publish time so that token searches can use B+-tree indexes. A pattern with a fixed prefix lets the index narrow the scan, which a pattern starting with '%' cannot do.
- The generated clause is of the form WHERE T.C LIKE 'P%K%'.
- During publishing of a database, for every keyword K, the entry (hash(K), T.C, P) is kept in the symbol table if there exists a string in column T.C that contains the token K and has prefix P.
13
Generalized Matches (2/2)
Example: let the hash values of the searchable tokens 'string', 'ball', and 'round' be 1, 2, and 3 respectively.

Database table:
Row Id  C
1       this is a string
2       this string
3       this is a ball
4       no string
5       any ball is round

Pub-Prefix table:
Hash Val  Col Id  Prefix
1         C       th
1         C       no
2         C       th
2         C       an
3         C

Consider searching for the keyword 'string': the Pub-Prefix table returns the prefixes 'th' and 'no', and the subsequent SQL will contain
WHERE (T.C LIKE 'th%string%') OR (T.C LIKE 'no%string%')
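The example can be reproduced with a short Python sketch. It assumes a prefix length of 2 and whitespace tokenization; the function names `build_pub_prefix` and `like_clause` are hypothetical. For the token 'round', the sketch derives the prefix 'an' from row 5.

```python
def build_pub_prefix(rows, column_id, hash_of, prefix_len=2):
    """rows: cell values of one column; hash_of: {searchable token: hash value}."""
    entries = set()
    for value in rows:
        prefix = value[:prefix_len]
        for token in value.split():  # assumption: tokens are whitespace-separated
            if token in hash_of:
                entries.add((hash_of[token], column_id, prefix))
    return entries

def like_clause(entries, hash_of, keyword, qualified_column="T.C"):
    """Build the OR-ed LIKE predicates for one keyword from its stored prefixes."""
    prefixes = sorted(p for h, _, p in entries if h == hash_of[keyword])
    return " OR ".join(
        f"({qualified_column} LIKE '{p}%{keyword}%')" for p in prefixes
    )

rows = [
    "this is a string",
    "this string",
    "this is a ball",
    "no string",
    "any ball is round",
]
hash_of = {"string": 1, "ball": 2, "round": 3}
entries = build_pub_prefix(rows, "C", hash_of)
clause = like_clause(entries, hash_of, "string")
```

A real system would bind the keyword as a parameter rather than inline it; the string is inlined here only to show the shape of the clause.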
14
Results (1/2)
Fig 2: Quality of compression techniques
- The search scales with the number of query keywords, even when the databases have complex schemas.
- The compression achieved by FK-Comp is consistently lower than that of CP-Comp.
15
Results & Conclusions (2/2)
Fig 3: Search time vs. prefix length
- As the prefix length increases, the discriminating ability of the prefix increases; in the limit, the prefix method degenerates to Pub-Cell.
- If a full-text index is available, use Pub-Col. If only a traditional index is available and the column width is small, use Pub-Prefix; otherwise use Pub-Cell.
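The recommendation above amounts to a small decision rule, sketched below in Python. The function name, its parameters, and the `small_width` threshold of 32 characters are assumptions made for illustration; the slides do not say what counts as a small column width.

```python
def choose_granularity(has_full_text_index, has_traditional_index,
                       column_width, small_width=32):
    """Pick a symbol-table method for one column.

    small_width is a hypothetical cutoff for what counts as a small column.
    """
    if has_full_text_index:
        return "Pub-Col"
    if has_traditional_index and column_width <= small_width:
        return "Pub-Prefix"
    return "Pub-Cell"

choice_full_text = choose_granularity(True, False, 200)
choice_narrow = choose_granularity(False, True, 20)
choice_wide = choose_granularity(False, True, 200)
choice_unindexed = choose_granularity(False, False, 20)
```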
16
GUI for DBXplorer
- Keywords: {livia, karsen, computer}
- Published databases
17
GUI for DBXplorer
- For each keyword, the list of table and column pairs.
- Enumerated join trees.
18
GUI for DBXplorer
Fig: Matching rows
20
Questions?