Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over.

Presentation transcript:

DBXplorer: A System for Keyword Based Search over Relational Databases
Sanjay Agrawal, Surajit Chaudhuri, Gautam Das (Microsoft Research)
Presented by: DEEP PANCHOLI

Introduction
The two most common types of search are structured search and keyword-based search.
Example of structured search: searching for books in a bookseller's database, e.g. Books->Travel->Maps->Asia->Russia.
Keyword-based search is already familiar; one example is searching for "Jim Gray" on the Microsoft intranet to obtain the matching rows.

Introduction (cont)
Problems faced when implementing keyword-based search: the need to know the schema, normalized databases, and the availability of indexes.
The work builds on the concepts laid out in the BANKS paper explained in the last lecture.
Topics: symbol tables, compacting the symbol tables, and search requirements.

Overview of DBXplorer
DBXplorer returns all rows, either from a single table or from multiple tables joined through FK-joins, such that each row contains all the keywords.
Publish
1. Identify a database, and the tables and columns within it, that are to be enabled for search.
2. Create auxiliary tables (symbol tables).
Search
1. Look up the symbol table.
2. Search the possible subsets of tables.
3. Construct and execute an SQL statement, and rank the results before displaying them to the user.

Different Symbol Table Designs
Only the exact match problem is considered here.
There are two important levels of granularity: column-level granularity (Pub-Col) and cell-level granularity (Pub-Cell).
Example table Authors, with columns Fname and Lname: John Marshall; John Shawn Archer; Jacquelyn Marshall.
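The difference between the two granularities can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `tables` dictionary, the function names, and the sample rows are hypothetical, and a keyword is taken to be a whole cell value (exact match), lowercased.

```python
from collections import defaultdict

# Hypothetical in-memory data; row positions stand in for row ids.
tables = {
    "Authors": [
        {"Fname": "John", "Lname": "Marshall"},
        {"Fname": "John", "Lname": "Shawn"},
        {"Fname": "Jacquelyn", "Lname": "Marshall"},
    ]
}

def build_pub_col(tables):
    """Column-level granularity: keyword -> set of (table, column)."""
    symbol = defaultdict(set)
    for tname, rows in tables.items():
        for row in rows:
            for col, value in row.items():
                symbol[str(value).lower()].add((tname, col))
    return symbol

def build_pub_cell(tables):
    """Cell-level granularity: keyword -> set of (table, column, row id)."""
    symbol = defaultdict(set)
    for tname, rows in tables.items():
        for rowid, row in enumerate(rows):
            for col, value in row.items():
                symbol[str(value).lower()].add((tname, col, rowid))
    return symbol

pub_col = build_pub_col(tables)
pub_cell = build_pub_cell(tables)

# Pub-Col stores one entry per distinct (keyword, column) pair,
# Pub-Cell one entry per matching cell.
col_entries = sum(len(v) for v in pub_col.values())
cell_entries = sum(len(v) for v in pub_cell.values())
```

With Pub-Col, the rows themselves still have to be located afterwards by querying the column, which is why an index on that column matters; Pub-Cell already records which rows match.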

Factors that affect granularity
Space and time requirements: Pub-Col is faster and occupies less space.
Keyword search performance: Pub-Col performs well if there is an index on the column.
Ease of symbol table maintenance: Pub-Col is easier to maintain, because it needs an update only when a new distinct value is added.
Hence the Pub-Col alternative is almost always better than Pub-Cell, unless certain columns have no indexes. If an index is available for a column, Pub-Col granularity should be used.

Pub-Col representation
The simplest representation stores Keyword-ColId pairs.
The alternative is to store Hashvalue-ColId pairs, since storing keywords is wasteful: strings can be long and of varying lengths.
Compression algorithms
FK-Comp: if the values in column c1 are a subset of the values in another column c2, only the values in c1 are retained.
CP-Comp: used when pairs of columns share common keywords but are not tied by a foreign key.
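The following Python sketch illustrates the Hashvalue-ColId representation and one possible reading of FK-Comp. It is an assumption-laden illustration, not the paper's algorithm: the helper names (`h`, `build_hashed_pub_col`, `fk_compress`, `fk_expand`), the column ids, and the choice of SHA-1 truncated to 32 bits are all hypothetical. FK-Comp is interpreted here as "when a keyword occurs in c1, do not store c2 as well, and re-derive c2 at search time".

```python
import hashlib

def h(keyword, num_bytes=4):
    """Fixed-width hash of a keyword (assumption: truncated SHA-1)."""
    digest = hashlib.sha1(keyword.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:num_bytes], "big")

def build_hashed_pub_col(columns):
    """columns: {col_id: iterable of values} -> {hash value: set of col_ids}."""
    table = {}
    for col_id, values in columns.items():
        for value in values:
            table.setdefault(h(value), set()).add(col_id)
    return table

def fk_compress(table, fk_pairs):
    """fk_pairs: (c1, c2) pairs where values(c1) is a subset of values(c2).
    Wherever c1 is present, the c2 entry is implied and is dropped."""
    compressed = {}
    for key, cols in table.items():
        kept = set(cols)
        for c1, c2 in fk_pairs:
            if c1 in kept:
                kept.discard(c2)
        compressed[key] = kept
    return compressed

def fk_expand(cols, fk_pairs):
    """Restore the implied c2 entries at search time."""
    expanded = set(cols)
    for c1, c2 in fk_pairs:
        if c1 in expanded:
            expanded.add(c2)
    return expanded

# Hypothetical columns: Books.AuthorId references Authors.AuthorId.
columns = {
    "Books.AuthorId": ["a1", "a2"],
    "Authors.AuthorId": ["a1", "a2", "a3"],
}
fk_pairs = [("Books.AuthorId", "Authors.AuthorId")]

full = build_hashed_pub_col(columns)
small = fk_compress(full, fk_pairs)
```

Because distinct keywords can hash to the same value, a lookup in a hashed symbol table can return columns that do not actually contain the keyword; in this sketch such false positives would simply produce no rows when the final SQL query runs.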

Pub-Col Algorithm

Search Component
This step is common to all granularities and makes use of join trees.
If we join the tables that occur in a join tree, the resulting relation contains all potential rows that contain all the keywords specified in the query.
Example of graph tree (figure).
Finally, an SQL query is generated and run, and the result is ranked before it is output. The basic approach is to rank results by the number of joins involved, which is similar to the BANKS approach.
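The last two steps, generating SQL for a join tree and ranking by number of joins, can be sketched as follows. This is a minimal illustration using Python's standard `sqlite3` module; the schema, the sample data, the `candidate` dictionary layout, and the function names are hypothetical, and the enumeration of join trees itself is not shown.

```python
import sqlite3

def build_sql(tables, joins, matches):
    """Build one SELECT for a join tree.
    joins:   (table1, col1, table2, col2) FK equalities
    matches: (keyword, table, column) exact-match conditions (non-empty)
    Table and column names come from the symbol table, not from user
    input; keywords are passed as bound parameters."""
    conditions = [f"{t1}.{c1} = {t2}.{c2}" for (t1, c1, t2, c2) in joins]
    conditions += [f"{t}.{c} = ?" for (_kw, t, c) in matches]
    sql = f"SELECT * FROM {', '.join(tables)} WHERE {' AND '.join(conditions)}"
    params = [kw for (kw, _t, _c) in matches]
    return sql, params

def rank_by_joins(candidates):
    """Fewer joins rank first."""
    return sorted(candidates, key=lambda cand: len(cand["joins"]))

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Authors (AuthorId INTEGER PRIMARY KEY, Lname TEXT);
    CREATE TABLE Books (BookId INTEGER PRIMARY KEY, Title TEXT,
                        AuthorId INTEGER REFERENCES Authors(AuthorId));
    INSERT INTO Authors VALUES (1, 'Marshall'), (2, 'Shawn');
    INSERT INTO Books VALUES (10, 'Maps', 1), (11, 'Maps', 2), (12, 'Russia', 1);
""")

# Join tree for the keywords "Marshall" and "Maps".
candidate = {
    "tables": ["Authors", "Books"],
    "joins": [("Books", "AuthorId", "Authors", "AuthorId")],
    "matches": [("Marshall", "Authors", "Lname"), ("Maps", "Books", "Title")],
}
sql, params = build_sql(candidate["tables"], candidate["joins"], candidate["matches"])
rows = conn.execute(sql, params).fetchall()
```

Each returned row contains every keyword, because every keyword contributes one equality condition and the join conditions connect the tables of the join tree.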

Search Algorithm

Case of Token matches
Token matches are matches in which a keyword matches a token or a substring of an attribute value.
The Pub-Prefix method efficiently enables token-match capability by exploiting available B+ tree indexes.
The symbol table has entries of the form (hash(k), T.C, P).
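A rough Python sketch of the idea follows, under explicit assumptions: P is taken to be a fixed-length prefix of the attribute value in which keyword k occurs, T.C is the table and column, and the search issues a `LIKE` condition anchored on the prefix so that an ordered index on the column could narrow the scan. The function names, the `Products` table, and the prefix length are hypothetical, and SQLite stands in for the DBMS.

```python
import hashlib
import sqlite3

def h(keyword):
    """Fixed-width keyword hash (assumption: truncated SHA-1)."""
    digest = hashlib.sha1(keyword.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def publish_prefix(conn, table, column, prefix_len):
    """Symbol table: hash(k) -> set of (T.C, P) for every token k."""
    symbol = {}
    for (value,) in conn.execute(f"SELECT {column} FROM {table}"):
        prefix = value[:prefix_len]
        for token in value.lower().split():
            symbol.setdefault(h(token), set()).add((f"{table}.{column}", prefix))
    return symbol

def token_search(conn, symbol, keyword):
    """Find values containing the keyword, one prefix-anchored query per entry.
    Note: '%' or '_' inside a stored prefix would act as LIKE wildcards;
    this sketch does not escape them."""
    results = []
    for col_id, prefix in sorted(symbol.get(h(keyword), ())):
        table, column = col_id.split(".")
        sql = (f"SELECT {column} FROM {table} "
               f"WHERE {column} LIKE ? AND {column} LIKE ?")
        cursor = conn.execute(sql, (prefix + "%", "%" + keyword + "%"))
        results += [row[0] for row in cursor]
    return results

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (Name TEXT);
    CREATE INDEX idx_products_name ON Products(Name);
    INSERT INTO Products VALUES ('this is a memory chip'),
                                ('this is a disk drive'),
                                ('fast memory module');
""")
symbol = publish_prefix(conn, "Products", "Name", prefix_len=8)
hits = token_search(conn, symbol, "memory")
```

The trade-off in this sketch mirrors the prefix-length question: a longer prefix makes each query more selective but produces more distinct (hash, column, prefix) entries in the symbol table.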

Case of Token matches (cont)
The results of the Pub-Prefix method are comparable to those of the Pub-Cell method when the column width is small (i.e. less than 100 characters).
For columns whose strings are hundreds of characters long, Pub-Cell significantly outperforms Pub-Prefix.
An important issue is determining the appropriate prefix length to store in the symbol table. The Pub-Prefix method is still being researched.
Other research is under way on stemming of query keywords.

Experimental Results
The experiments were carried out on a 450 MHz Intel P-3 machine with 256 MB of memory.
Four databases were used for evaluation:
TPC-H: data of sizes from 100 to 500 MB.
USR: a Microsoft employee address DB of 130 MB with 19 tables.
ML: a 375 MB mailing list DB with 38 tables.
KB: a 365 MB DB with 84 tables containing information on articles and help manuals on various shipped products.

System Architecture for DBXplorer

Experimental Results (cont)
In particular, the authors show the following:
Pub-Col is compact compared to Pub-Cell.
Pub-Col scales linearly with data size and is independent of data distribution.
Pub-Prefix is compact compared to Pub-Cell and performs significantly better when full-text indexes are not present.

Pub-Col and Pub-Cell symbol table size comparison

Symbol table publishing time comparison

Query performance

Other Observations
Search scales with the number of query keywords: as the query was varied from 2 to 10 keywords, the average query time stayed between 1 and 1.3 seconds.
FK-Comp and CP-Comp reduce the size of Pub-Col by a factor of 0.45 to 0.90, depending on the size of the original table.
Compression added only a negligible overhead to search performance.

Effectiveness of Pub-Prefix method
The Pub-Prefix method was tested on a workload consisting of 100 random keywords from a character column of width 64 bytes in the KB database.
The performance of Pub-Prefix improved as the prefix length increased, and was best at a prefix length of 8.
This is because, as the length increases beyond a certain limit, the optimizer decides to scan the original table rather than use an index search.

Conclusion
Although only querying a single database was discussed, this technique can also be applied to searching multiple databases.
DBXplorer is easy to use with any database management system.
As mentioned before, the Pub-Col alternative is best when columns have indexes on them. A hybrid symbol table can be created so that a column with an index uses Pub-Col granularity and a column without an index uses Pub-Cell granularity.
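The hybrid rule amounts to a per-column decision, sketched below in Python. The function name and the column names are hypothetical; how the set of indexed columns is obtained (for example from the DBMS catalog) is left out.

```python
def choose_granularity(columns, indexed_columns):
    """Per-column granularity for a hybrid symbol table:
    Pub-Col where an index exists, Pub-Cell otherwise."""
    indexed = set(indexed_columns)
    return {col: ("Pub-Col" if col in indexed else "Pub-Cell") for col in columns}

# Hypothetical columns; only Authors.Lname is assumed to be indexed.
plan = choose_granularity(
    ["Authors.Fname", "Authors.Lname", "Books.Title"],
    indexed_columns=["Authors.Lname"],
)
```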