Sanjay Agarwal, Surajit Chaudhuri, Gautam Das. Presented by: SRUTHI GUNGIDI


Outline
- Introduction
- Overview of DBXplorer
- Symbol Table Design (Publish)
- Keyword Search
- Generalized Matches
- Results and Conclusions

CSE 6339: Data Exploration and Analysis in Relational Databases, Fall 2010

Introduction
- Keyword-based search: given a set of keywords, a ranked list of documents is returned.
- Traditional database systems require knowledge of the schema.
- Search over databases: applying IR techniques from the document world to databases without the above requirement is a challenging task. Given a set of keywords, matching rows may need to be obtained by joining several tables on the fly.
- Example: search for a book on 'Programming' by 'Ritchie'.
[Figure: example tables Authors, Books, and Stores]

DBXplorer Overview (1/2)
- Given a set of query keywords, DBXplorer returns all matching rows (either from a single table or by joining tables connected by foreign-key joins) such that each row contains all keywords.
- Goal: enable keyword search without requiring knowledge of the schema of the underlying databases.
- We first consider the case of a single database.
- The system also allows searching multiple databases simultaneously.

DBXplorer Overview (2/2)
Requires two steps:
- Publish (preprocessing step):
  - Identify the database, along with the set of tables and columns within it to be published.
  - Build the symbol table (auxiliary tables).
- Search step, given a query of keywords:
  - Look up the symbol table to identify the tables, columns, or cells containing the keywords.
  - Enumerate join trees.
  - For each join tree, construct and execute a SQL statement that selects the rows containing all keywords.
  - Rank the rows and return them.

Symbol Table Designs
- Traditional IR techniques use data structures such as inverted lists to efficiently identify documents containing a query keyword.
- The symbol table is the key data structure. It stores information about keywords at different granularities:
  - Column level (Pub-Col): for each keyword, the list of table.column entries that contain it.
  - Cell level (Pub-Cell): for each keyword, the list of table.column.rowid entries that contain it.
  - Row level: for each keyword, the list of all rows that contain it (offers very little advantage).
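The two main granularities can be illustrated with a minimal sketch. The toy database, the whitespace tokenizer, and the function names below are assumptions made for illustration; the sketch keeps the symbol tables in memory rather than in auxiliary database tables.

```python
from collections import defaultdict

# Hypothetical toy database: table name -> list of rows (dicts).
# A row's id is simply its position in the list.
DATABASE = {
    "Authors": [{"name": "Ritchie"}, {"name": "Kernighan"}],
    "Books": [{"title": "C Programming"}, {"title": "Unix Programming"}],
}


def tokenize(value):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return str(value).lower().split()


def build_pub_col(database):
    """Column level: keyword -> set of (table, column)."""
    symbol_table = defaultdict(set)
    for table, rows in database.items():
        for row in rows:
            for column, value in row.items():
                for keyword in tokenize(value):
                    symbol_table[keyword].add((table, column))
    return symbol_table


def build_pub_cell(database):
    """Cell level: keyword -> set of (table, column, rowid)."""
    symbol_table = defaultdict(set)
    for table, rows in database.items():
        for rowid, row in enumerate(rows):
            for column, value in row.items():
                for keyword in tokenize(value):
                    symbol_table[keyword].add((table, column, rowid))
    return symbol_table


pub_col = build_pub_col(DATABASE)
pub_cell = build_pub_cell(DATABASE)
```

In this toy data the keyword "programming" has one Pub-Col entry but two Pub-Cell entries, which hints at why the cell-level table tends to be larger.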

Symbol Table Granularity
- The Pub-Col symbol table is almost always better, unless certain columns do not have indexes.
- Space and time: building Pub-Col symbol tables takes less space and time, since only the distinct values in a column need to be recorded.
- Keyword search performance: depends on efficient generation and execution of the SQL statements. Pub-Cell leads to more SQL statements than Pub-Col.
- Ease of symbol table maintenance: a Pub-Col table requires an update only if insertions introduce new values in some column, whereas a Pub-Cell table needs to be updated for every inserted row. The same holds for deletions.

Symbol Table Granularity
- Hybrid symbol table: if an index is available for a column, publish it with Pub-Col granularity; otherwise publish it with Pub-Cell granularity.
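The hybrid rule amounts to a per-column choice. A minimal sketch, assuming the set of indexed columns is already known (the column names and the function name are hypothetical):

```python
def hybrid_granularity(columns, indexed_columns):
    """Pick Pub-Col for columns that have an index, Pub-Cell otherwise."""
    return {
        column: "Pub-Col" if column in indexed_columns else "Pub-Cell"
        for column in columns
    }


# Hypothetical columns; only two of them have an index.
columns = ["Books.title", "Books.summary", "Authors.name"]
indexed = {"Books.title", "Authors.name"}
plan = hybrid_granularity(columns, indexed)
```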

Storing Symbol Tables in Databases
- Store the Pub-Col symbol table as (keyword hash, ColId).
- FK-Comp (foreign key): if there is a key/foreign-key relationship between c1 and c2, store only c1.
- CP-Comp (columns not necessarily tied by foreign-key relationships): partition H into a minimum number of bipartite cliques and compress each clique.
- Store the Pub-Cell table as (keyword hash, list of CellIds).
[Figure: uncompressed hash table, compressed hash table, and ColumnsMap table (labels v2, v3, v4, c1, c2, x)]

Keyword Search (1/2)
1. Look up the symbol table to find the tables and columns that contain at least one of the keywords (the MatchedTables).
2. Enumerate join trees: view the schema graph G as an undirected graph and enumerate all possible join trees, i.e., subtrees of G such that
   a. the leaves belong to MatchedTables, and
   b. together, the leaves contain all keywords of the query.
[Figure: schema graph and join trees]
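Conditions (a) and (b) can be checked with a brute-force sketch that tries every subset of schema edges. This is an illustration only, exponential in the number of edges, and is not claimed to be the enumeration algorithm the system uses. The schema (Authors, Writes, Books, Stores) and the function names are assumptions.

```python
from itertools import combinations


def _connected(nodes, edges):
    """Return True if the undirected graph (nodes, edges) is connected."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == set(nodes)


def enumerate_join_trees(edges, matches, keywords):
    """edges: undirected schema edges as (table, table) pairs.
    matches: table -> set of query keywords found in it (MatchedTables).
    Returns a list of (sorted table tuple, edge tuple) join trees."""
    keywords = set(keywords)
    trees = []
    # A single table that contains every keyword is a one-node join tree.
    for table in sorted(matches):
        if keywords <= matches[table]:
            trees.append(((table,), ()))
    for size in range(1, len(edges) + 1):
        for subset in combinations(edges, size):
            nodes = {t for edge in subset for t in edge}
            # A tree has exactly |nodes| - 1 edges and is connected.
            if len(subset) != len(nodes) - 1 or not _connected(nodes, subset):
                continue
            degree = {n: 0 for n in nodes}
            for a, b in subset:
                degree[a] += 1
                degree[b] += 1
            leaves = [n for n in nodes if degree[n] == 1]
            # (a) every leaf must be a matched table.
            if not all(matches.get(n) for n in leaves):
                continue
            # (b) together, the leaves must cover all keywords.
            covered = set().union(*(matches[n] for n in leaves))
            if keywords <= covered:
                trees.append((tuple(sorted(nodes)), subset))
    return trees


# Hypothetical schema graph and symbol-table lookup result.
schema_edges = [("Authors", "Writes"), ("Writes", "Books"), ("Books", "Stores")]
matched = {"Authors": {"ritchie"}, "Books": {"programming"}}
trees = enumerate_join_trees(schema_edges, matched, ["ritchie", "programming"])
```

For this input the only join tree is Authors, Writes, Books: every other subtree has an unmatched table as a leaf.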

Keyword Search (2/2)
3. Identify the matching rows: each join tree is mapped to a single SQL statement that joins the tables as specified in the tree and selects the rows that contain all keywords. The retrieved rows are ranked before being output.
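The mapping from a join tree to one SQL statement might look like the sketch below. It assumes exact matches on attribute values, hypothetical join columns (Authors.id, Writes.author_id, and so on), and bound parameters (`?`) for the keywords; the ranking step is not shown.

```python
def join_tree_to_sql(tables, join_conditions, keyword_columns):
    """Build one SELECT for a join tree.

    tables: tables in the join tree.
    join_conditions: (left_column, right_column) pairs, one per tree edge.
    keyword_columns: keyword -> columns reported by the symbol table.
    Returns (sql, params); keywords are passed as bound parameters.
    """
    predicates = [f"{left} = {right}" for left, right in join_conditions]
    params = []
    for keyword, columns in keyword_columns.items():
        # A keyword may match in any of its columns, hence the OR.
        alternatives = " OR ".join(f"{column} = ?" for column in columns)
        predicates.append(f"({alternatives})")
        params.extend([keyword] * len(columns))
    sql = f"SELECT * FROM {', '.join(tables)} WHERE {' AND '.join(predicates)}"
    return sql, params


# Hypothetical join tree for the 'Programming' by 'Ritchie' example.
sql, params = join_tree_to_sql(
    ["Authors", "Writes", "Books"],
    [("Authors.id", "Writes.author_id"), ("Writes.book_id", "Books.id")],
    {"Ritchie": ["Authors.name"], "Programming": ["Books.title"]},
)
```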

Generalized Matches (1/2)
- Token matches: the keyword in the query matches only a token or substring of an attribute value (e.g., LIKE '%string%').
- Pub-Prefix method: some precomputation is required, but token searches can then use B+ tree indexes. The clause is of the form WHERE T.C LIKE 'P%K%'.
- During publishing of a database, for every keyword K, the entry (hash(K), T.C, P) is kept in the symbol table if there exists a string in column T.C that contains the token K and has prefix P.

Generalized Matches (2/2)
- Example: let the hash values of the searchable tokens 'string', 'ball', and 'round' be 1, 2, and 3 respectively.
- Consider searching for the keyword 'string'. The Pub-Prefix table returns the prefixes 'th' and 'no', and the subsequent SQL will contain WHERE (T.C LIKE 'th%string%') OR (T.C LIKE 'no%string%').

Database table:
Row Id  C
1       this is a string
2       this string
3       this is a ball
4       no string
5       any ball is round

Pub-Prefix table:
Hash Val  Col Id  Prefix
1         C       th
1         C       no
2         C       th
2         C       an
3         C
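The example can be reproduced with a short sketch that builds Pub-Prefix entries from the five rows and then generates the LIKE clause for a keyword. The fixed prefix length of 2, the whitespace tokenization, and the function names are assumptions; the token hash values are the ones given in the example.

```python
# Rows of column C from the example database table.
ROWS = {
    1: "this is a string",
    2: "this string",
    3: "this is a ball",
    4: "no string",
    5: "any ball is round",
}
# Hash values of the searchable tokens, as given in the example.
TOKEN_HASH = {"string": 1, "ball": 2, "round": 3}


def build_pub_prefix(rows, token_hash, column="C", prefix_len=2):
    """Return the set of (hash value, column id, prefix) entries."""
    entries = set()
    for value in rows.values():
        for token in value.split():
            if token in token_hash:
                entries.add((token_hash[token], column, value[:prefix_len]))
    return entries


def prefix_where_clause(keyword, entries, token_hash, table="T"):
    """Build the LIKE 'P%K%' disjunction for one keyword."""
    hash_value = token_hash[keyword]
    hits = sorted((col, prefix) for hv, col, prefix in entries if hv == hash_value)
    return " OR ".join(
        f"({table}.{col} LIKE '{prefix}%{keyword}%')" for col, prefix in hits
    )


entries = build_pub_prefix(ROWS, TOKEN_HASH)
clause = prefix_where_clause("string", entries, TOKEN_HASH)
```

Because each pattern starts with a fixed prefix rather than a wildcard, a B+ tree index on the column can narrow the scan to values beginning with that prefix.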

Results (1/2)
[Fig 2: Quality of compression techniques]
- The search scales with the number of query keywords, even when the databases have complex schemas.
- The compression achieved by FK-Comp is consistently less than that achieved by CP-Comp.

Results & Conclusions (2/2)
[Fig 3: Search time vs. prefix length]
- As the prefix length increases, its discriminating ability increases; in the limit, the prefix method degenerates to Pub-Cell.
- If a full-text index is available, use Pub-Col. If only a traditional index is available and the column width is small, use Pub-Prefix; otherwise use Pub-Cell.

GUI for DBXplorer
- Keywords: {livia, karsen, computer}
- Published databases

GUI for DBXplorer
- For each keyword, the list of table and column pairs.
- Enumerated join trees.

GUI for DBXplorer
[Fig: Matching rows]


Questions?
