Sanjay Agrawal, Surajit Chaudhuri, Gautam Das. Presented by: SRUTHI GUNGIDI.



Outline
- Introduction
- Overview of DBXplorer
- Symbol Table Design – Publish
- Keyword Search
- Generalized Matches
- Results and Conclusions
CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

Introduction
- Keyword-based search: given a set of keywords, a ranked list of documents is returned.
- Traditional database systems require:
  - knowledge of the schema
- Search over databases: applying IR techniques from the document world to databases without the above requirements is a challenging task. Given a set of keywords, matching rows may need to be obtained by joining several tables on the fly.
- Example: search for a book on 'Programming' by 'Ritchie'.
[Figure: example tables Authors, Books, and Stores]

DBXplorer Overview (1/2)
- Given a set of query keywords, DBXplorer returns all matching rows (either from a single table or by joining tables connected by foreign-key joins) such that each row contains all keywords.
- Goal: enable keyword search without requiring the user to know the schema of the databases.
- We first consider the case of a single database.
- The system also allows searching multiple databases simultaneously.

DBXplorer Overview (2/2)
Keyword search requires two steps:
- Publish (preprocessing step):
  - Identify the database, along with the set of tables and columns within it to be published.
  - Build the symbol table (auxiliary tables).
- Search step, given a keyword query:
  - Look up the symbol table to identify the tables and columns/cells containing the keywords.
  - Enumerate join trees.
  - For each join tree, construct and execute a SQL statement that selects the rows containing all keywords.
  - Rank the rows and return them.

Symbol Table Designs
- Traditional IR techniques use data structures such as inverted lists to efficiently identify documents containing a query keyword.
- The symbol table is the key data structure. It stores information about keywords at different granularities:
  - Column level (Pub-Col): for each keyword, the list of table.column entries that contain it.
  - Cell level (Pub-Cell): for each keyword, the list of table.column.rowid entries that contain it.
  - Row level: for each keyword, the list of all rows that contain it (offers very little advantage).
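The two main granularities can be illustrated with a minimal Python sketch. The toy database, the function name build_symbol_tables, and the use of whole lower-cased cell values as keywords are assumptions made for illustration; they are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical toy database: table name -> list of rows (column -> value).
TOY_DB = {
    "Authors": [{"name": "Ritchie"}, {"name": "Kernighan"}],
    "Books": [{"title": "Programming"}, {"title": "Programming"}],
}


def build_symbol_tables(db):
    """Return (pub_col, pub_cell) symbol tables for exact-match keywords."""
    pub_col = defaultdict(set)    # keyword -> {(table, column)}
    pub_cell = defaultdict(list)  # keyword -> [(table, column, rowid)]
    for table, rows in db.items():
        for rowid, row in enumerate(rows, start=1):
            for column, value in row.items():
                keyword = str(value).lower()
                # Pub-Col keeps one entry per distinct value in a column.
                pub_col[keyword].add((table, column))
                # Pub-Cell keeps one entry per cell that holds the value.
                pub_cell[keyword].append((table, column, rowid))
    return pub_col, pub_cell


pub_col, pub_cell = build_symbol_tables(TOY_DB)
```

In this toy data the value 'Programming' occurs in two rows, so Pub-Cell holds two entries for it while Pub-Col holds a single (Books, title) entry.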

Symbol Table Granularity
- The Pub-Col symbol table is almost always the better choice, unless certain columns do not have indexes.
- Space and time: building a Pub-Col symbol table takes less space and time, since only the distinct values in a column need to be recorded.
- Keyword search performance: depends on the efficient generation and execution of the SQL statements. Pub-Cell produces more SQL statements than Pub-Col.
- Ease of maintenance: a Pub-Col table needs an update only if an insertion introduces new values in some column, whereas a Pub-Cell table must be updated for every inserted row. The same holds for deletions.

Symbol Table Granularity
- Hybrid symbol table: if an index is available for a column, publish it at Pub-Col granularity; otherwise publish it at Pub-Cell granularity.

Storing Symbol Tables in Databases
- Store the Pub-Col symbol table as (keyword hash, ColId) pairs.
- FK-Comp (foreign-key compression): if there is a key/foreign-key relationship between columns c1 and c2, store only c1.
- CP-Comp (not necessarily tied to foreign-key relationships): partition the hash table H into a minimum number of bipartite cliques and compress each clique.
- Store the Pub-Cell table as (keyword hash, list of CellIds).
[Figure: uncompressed hash table, compressed hash table, and ColumnsMap table]

Keyword Search (1/2)
1. Look up the symbol table to find the tables/columns that contain at least one of the keywords.
2. Enumerate join trees: view the schema graph G as an undirected graph and enumerate all possible join trees, i.e., sub-trees of G such that
   a. the leaves belong to MatchedTables, and
   b. together, the leaves contain all keywords of the query.
[Figure: schema graph and join trees]
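The two leaf conditions can be checked with a brute-force sketch in Python that tries every subset of schema-graph edges. This is only practical for very small schema graphs and is not claimed to be the enumeration algorithm DBXplorer uses. The table names and the helper names connected and join_trees are hypothetical.

```python
from itertools import combinations


def connected(nodes, edges):
    """Return True if the undirected edges connect all of the given nodes."""
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        n = stack.pop()
        for a, b in edges:
            for x, y in ((a, b), (b, a)):
                if x == n and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return seen == nodes


def join_trees(edges, matches):
    """Enumerate join trees by brute force.

    edges:   undirected schema-graph edges as (table, table) pairs
    matches: keyword -> set of tables that contain it (from the symbol table)
    Returns a list of (nodes, tree_edges) pairs.
    """
    matched = set().union(*matches.values())

    def covers(leaves):
        # (a) every leaf is a matched table, (b) the leaves cover all keywords.
        return leaves <= matched and all(leaves & t for t in matches.values())

    # A single table that contains every keyword is a one-node join tree.
    results = [({t}, []) for t in sorted(matched) if covers({t})]

    for k in range(1, len(edges) + 1):
        for subset in combinations(edges, k):
            nodes = {n for e in subset for n in e}
            # A tree with k edges has exactly k + 1 nodes and is connected.
            if len(nodes) != k + 1 or not connected(nodes, subset):
                continue
            leaves = {n for n in nodes if sum(n in e for e in subset) == 1}
            if covers(leaves):
                results.append((nodes, list(subset)))
    return results


# Hypothetical schema graph and symbol-table lookups for two keywords.
schema_edges = [
    ("Authors", "AuthorsBooks"),
    ("AuthorsBooks", "Books"),
    ("Books", "BooksStores"),
    ("BooksStores", "Stores"),
]
trees = join_trees(schema_edges, {"ritchie": {"Authors"}, "programming": {"Books"}})
```

For these inputs the only join tree is Authors, AuthorsBooks, Books, with Authors and Books as the leaves.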

Keyword Search (2/2)
3. Identify the matching rows: each join tree is mapped to a single SQL statement that joins the tables as specified in the tree and selects the rows that contain all keywords. The retrieved rows are ranked before being output.
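A minimal sketch of this mapping, using Python's sqlite3 module and an in-memory database. The schema, the data, and the helper join_tree_to_sql are assumptions made for illustration; ranking is omitted.

```python
import sqlite3


def join_tree_to_sql(tables, join_conds, keyword_cols):
    """Build one SQL statement for a join tree (hypothetical helper).

    tables:       tables in the join tree
    join_conds:   one 'A.x = B.y' condition per tree edge
    keyword_cols: (qualified column, keyword) pairs from the symbol table
    """
    where = list(join_conds) + [f"{col} = ?" for col, _ in keyword_cols]
    sql = f"SELECT * FROM {', '.join(tables)} WHERE {' AND '.join(where)}"
    return sql, [kw for _, kw in keyword_cols]


# Hypothetical toy schema and data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Authors(aid INTEGER, name TEXT);
CREATE TABLE AuthorsBooks(aid INTEGER, bid INTEGER);
CREATE TABLE Books(bid INTEGER, title TEXT);
INSERT INTO Authors VALUES (1, 'Ritchie'), (2, 'Knuth');
INSERT INTO Books VALUES (1, 'Programming'), (2, 'Algorithms');
INSERT INTO AuthorsBooks VALUES (1, 1), (2, 2);
""")

sql, params = join_tree_to_sql(
    ["Authors", "AuthorsBooks", "Books"],
    ["Authors.aid = AuthorsBooks.aid", "AuthorsBooks.bid = Books.bid"],
    [("Authors.name", "Ritchie"), ("Books.title", "Programming")],
)
rows = conn.execute(sql, params).fetchall()
```

The keywords are passed as bound parameters, so only the table and column names, which come from the symbol table, are interpolated into the statement.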

Generalized Matches (1/2)
- Token matches: the keyword in the query matches only a token or substring of an attribute value (e.g., LIKE '%string%').
- Pub-Prefix method: some pre-computation is done at publish time so that token searches can use B+ tree indexes.
  - The clause is of the form WHERE T.C LIKE 'P%K%'.
  - During publishing of a database, for every keyword K, the entry (hash(K), T.C, P) is kept in the symbol table if column T.C contains a string that contains the token K and has prefix P.

Generalized Matches (2/2)
- Example: let the hash values of the searchable tokens 'string', 'ball', and 'round' be 1, 2, and 3 respectively.
- Consider searching for the keyword 'string'. The Pub-Prefix table returns the prefixes 'th' and 'no', and the subsequent SQL contains WHERE (T.C LIKE 'th%string%') OR (T.C LIKE 'no%string%').

Database table:
Row Id  C
1       this is a string
2       this string
3       this is a ball
4       no string
5       any ball is round

Pub-Prefix table:
Hash Val  Col Id  Prefix
1         C       th
1         C       no
2         C       th
2         C       an
3         C       an
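The example can be reproduced with a short Python sketch. It keys the Pub-Prefix table by the token itself rather than by a hash value, fixes the prefix length at 2, and uses hypothetical helper names (build_pub_prefix, where_clause); all of these are simplifying assumptions.

```python
# Column C of the example table, in row order.
ROWS = [
    "this is a string",
    "this string",
    "this is a ball",
    "no string",
    "any ball is round",
]
SEARCHABLE = {"string", "ball", "round"}


def build_pub_prefix(rows, column="C", prefix_len=2, searchable=SEARCHABLE):
    """Map each searchable token to the (column, prefix) pairs it occurs under."""
    table = {}
    for value in rows:
        prefix = value[:prefix_len]
        for token in value.split():
            if token in searchable:
                table.setdefault(token, set()).add((column, prefix))
    return table


def where_clause(pub_prefix, keyword, table="T"):
    """Build the LIKE 'P%K%' disjunction for one keyword.

    Illustration only: values are interpolated directly into the SQL text,
    which is not safe for untrusted input.
    """
    entries = sorted(pub_prefix.get(keyword, ()))
    return " OR ".join(
        f"({table}.{col} LIKE '{prefix}%{keyword}%')" for col, prefix in entries
    )


pub_prefix = build_pub_prefix(ROWS)
clause = where_clause(pub_prefix, "string")
```

Because each pattern starts with a fixed prefix instead of a wildcard, a B+ tree index on the column can narrow the scan to strings with that prefix before the rest of the pattern is checked.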

Results (1/2)
[Fig 2: Quality of compression techniques]
- The search scales with the number of query keywords, even when the databases have complex schemas.
- The compression achieved by FK-Comp is consistently less than that achieved by CP-Comp.

Results & Conclusions (2/2)
[Fig 3: Search time vs. prefix length]
- As the prefix length increases, the discriminating ability of the prefix increases; in the limit, the Pub-Prefix method degenerates to Pub-Cell.
- If a full-text index is available, use Pub-Col. If only a traditional index is available and the column width is small, use Pub-Prefix; otherwise use Pub-Cell.
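The selection rule can be written as a small Python function. The function name, its parameters, and in particular the 32-character threshold for a 'small' column are hypothetical; the slides do not give a cut-off.

```python
def choose_method(full_text_index, traditional_index, column_width, small_width=32):
    """Pick a publishing method for one column.

    small_width is an assumed cut-off for a 'small' column width.
    """
    if full_text_index:
        return "Pub-Col"
    if traditional_index and column_width <= small_width:
        return "Pub-Prefix"
    return "Pub-Cell"
```

For example, choose_method(False, True, 20) selects Pub-Prefix, while the same column at width 500 falls back to Pub-Cell.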

GUI for DBXplorer
- Keywords: {livia, karsen, computer}
- Published databases

GUI for DBXplorer
- For each keyword, the list of table and column pairs.
- Enumerated join trees.

GUI for DBXplorer
[Fig: Matching rows]


Questions?