Probabilistic Ranking of Database Query Results
Surajit Chaudhuri, Microsoft Research; Gautam Das, Microsoft Research; Vagelis Hristidis, Florida International University; Gerhard Weikum, MPI Informatik

Presentation transcript:

Surajit Chaudhuri, Microsoft Research; Gautam Das, Microsoft Research; Vagelis Hristidis, Florida International University; Gerhard Weikum, MPI Informatik. 30th VLDB Conference, Toronto, Canada, 2004. Presented by Abhishek Jamloki.

Realtor DB: Table D = (TID, Price, City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock). SQL query: SELECT * FROM D WHERE City = 'Seattle' AND View = 'Waterfront'

Consider a database table D with n tuples {t1, …, tn} over a set of m categorical attributes A = {A1, …, Am}, and a query Q: SELECT * FROM D WHERE X1=x1 AND … AND Xs=xs, where each Xi is an attribute from A and xi is a value in its domain. Specified attributes: X = {X1, …, Xs}; unspecified attributes: Y = A − X. For the realtor query above, X = {City, View} and Y contains the remaining attributes (Price, Bedrooms, Bathrooms, and so on). Let S be the answer set of Q. The Many-Answers Problem: how do we rank the tuples in S and return only the top-k to the user?

 IR treatment: query reformulation vs. automatic ranking; correlations are ignored in the high-dimensional spaces typical of IR.
 An automated ranking function is proposed, based on:
1) a global score of the unspecified attributes, and
2) a conditional score (the strength of the correlation between specified and unspecified attributes).
 Both scores are estimated automatically using workload and data analysis.

Document t, query Q. R: relevant document set; R̄ = D − R: irrelevant document set. Each document is scored by its odds of relevance, expanded using Bayes' rule, p(R | t) = p(t | R) p(R) / p(t), and the product rule, p(a, b | c) = p(a | c) p(b | a, c):

Score(t) = p(R | t) / p(R̄ | t) ∝ p(t | R) / p(t | R̄)

 Each tuple t is treated as a document.
 Partition t into two parts: t(X) contains the specified attributes, t(Y) the unspecified attributes.
 Replace t with its value pair (X, Y).
 Approximate the irrelevant set R̄ by D itself, since the relevant tuples are only a tiny fraction of the database.
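In symbols, these substitutions turn the odds score from the previous slide into a quantity computable from the database (a compact restatement, using the notation above):

```latex
% t -> (X, Y), irrelevant set \bar{R} -> D
Score(t) \propto \frac{p(t \mid R)}{p(t \mid \bar{R})}
        \approx \frac{p(X, Y \mid R)}{p(X, Y \mid D)}
```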

 Comprehensive dependency models have unacceptable preprocessing and query-processing costs, so a middle ground is chosen.
 Given a query Q and a tuple t, the X values (and likewise the Y values) are assumed to be independent among themselves, though dependencies between the X and Y values are allowed.

 Workload W: a collection of ranking queries that have been executed on the system in the past, represented as a set of “tuples”, each of which stands for a query and is a vector of the values of that query's specified attributes.
 R is approximated as the set of query “tuples” in W that also request X (this approximation is novel to the paper): properties of the relevant set R can be obtained by examining only the subset of the workload whose queries also request X.
 Accordingly, substitute p(y | X, W) for p(y | R).
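Putting the limited-independence factorization and the workload substitutions together, the score factors into the global and conditional parts previewed earlier; this reconstructed form is what the quantities on the next slide plug into:

```latex
% Global part x Conditional part, for specified values X and unspecified values Y
Score(t) \propto
  \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}
  \times
  \prod_{x \in X} \prod_{y \in Y} \frac{p(x \mid y, W)}{p(x \mid y, D)}
```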

 p(y | W): the relative frequency of each distinct value y in the workload.
 p(y | D): the relative frequency of each distinct value y in the database (similar to the IDF concept in IR).
 p(x | y, W): the confidence of the pair-wise association rule y ⇒ x in the workload, i.e., (# of tuples in W that contain both x and y) / (# of tuples in W that contain y).
 p(x | y, D): defined analogously over the database D.
 All four are stored as auxiliary tables in the intermediate knowledge-representation layer.
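A minimal sketch of how these four quantities could be computed; the tuple representation (an iterable of (attribute, value) pairs) and function name are illustrative assumptions, not the paper's implementation. Calling it once on the workload and once on the database yields the two frequency tables and the two conditional tables above.

```python
from collections import Counter

def atomic_probabilities(tuples):
    """Compute the relative frequency p(v) of every attribute value v and the
    confidence p(a | b) of every pair-wise association rule b => a, over a
    collection of tuples given as iterables of (attribute, value) pairs.
    Run once over the workload W and once over the database D."""
    n = len(tuples)
    single = Counter()   # v -> number of tuples containing value v
    pair = Counter()     # (a, b) -> number of tuples containing both a and b
    for t in tuples:
        vals = set(t)
        single.update(vals)
        for a in vals:
            for b in vals:
                if a != b:
                    pair[(a, b)] += 1
    p = {v: c / n for v, c in single.items()}                   # p(v)
    p_cond = {ab: c / single[ab[1]] for ab, c in pair.items()}  # p(a | b)
    return p, p_cond
```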

 p(y | W): {AttName, AttVal, Prob} ◦ B+-tree index on (AttName, AttVal)
 p(y | D): {AttName, AttVal, Prob} ◦ B+-tree index on (AttName, AttVal)
 p(x | y, W): {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob} ◦ B+-tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
 p(x | y, D): {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob} ◦ B+-tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)

Preprocessing - Atomic Probabilities Module: computes and indexes the quantities p(y | W), p(y | D), p(x | y, W), and p(x | y, D) for all distinct values x and y.
Execution (the Scan algorithm):
 select the tuples that satisfy the query,
 scan and compute the score of each result tuple,
 return the top-k tuples.
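A sketch of this execution phase, assuming the atomic-probability dictionaries from the sketch above (p_w, p_d over single values; cond_w, cond_d over (x, y) pairs). The eps smoothing for unseen values is an assumption of this sketch, not something the slides specify.

```python
import heapq

def score(t, x_pairs, p_w, p_d, cond_w, cond_d, eps=1e-6):
    """Score(t) = prod_y p(y|W)/p(y|D) * prod_x prod_y p(x|y,W)/p(x|y,D).
    t: all (attribute, value) pairs of the tuple;
    x_pairs: the specified (attribute, value) pairs from the query."""
    x_attrs = {a for a, _ in x_pairs}
    s = 1.0
    for y in ((a, v) for a, v in t if a not in x_attrs):
        s *= p_w.get(y, eps) / p_d.get(y, eps)                       # global part
        for x in x_pairs:
            s *= cond_w.get((x, y), eps) / cond_d.get((x, y), eps)   # conditional part
    return s

def scan_topk(result_tuples, x_pairs, p_w, p_d, cond_w, cond_d, k=10):
    """Scan execution: score every tuple satisfying the query, return top-k."""
    return heapq.nlargest(
        k, result_tuples,
        key=lambda t: score(t, x_pairs, p_w, p_d, cond_w, cond_d))
```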

 Trade-off between preprocessing and query processing.
 Pre-compute ranked lists of the tuples for all possible “atomic” queries; at query time, given an actual query that specifies a set of values X, “merge” the ranked lists corresponding to each x in X to compute the final top-k tuples.
 The merge should not have to scan the entire ranked lists; the Threshold Algorithm (TA) can be used for this purpose.
 A feasible adaptation of TA should keep the number of sorted streams small; naively, the number of streams would depend on the number of attributes in the database.

 At query time, a TA-like merge of several ranked lists (i.e., sorted streams) is performed.
 The required number of sorted streams depends only on the number of specified attribute values in the query, not on the total number of attributes in the database.
 Such a merge is possible only because of the specific functional form the ranking function takes under the limited independence assumptions.

 Index Module: takes as input the association rules and the database and, for every distinct value x, creates two lists Cx and Gx, each containing the tuple-ids of all data tuples that contain x, ordered in specific ways.
 Conditional List Cx: pairs of the form ⟨TID, CondScore⟩, ordered by descending CondScore, where TID is the tuple-id of a tuple t that contains x and CondScore = Π_z p(x | z, W) / p(x | z, D), the product ranging over all attribute values z of t.
 Global List Gx: pairs of the form ⟨TID, GlobScore⟩, ordered by descending GlobScore, where TID is the tuple-id of a tuple t that contains x and GlobScore = Π_z p(z | W) / p(z | D), again over all attribute values z of t.

 At query time, the scores of t in the lists Cx1, …, Cxs and in one of Gx1, …, Gxs are retrieved and multiplied. This requires only s+1 multiplications and yields a score that is proportional to the actual score. Two kinds of efficient access operations are needed:
 First, given a value x, it should be possible to perform a GetNextTID operation on the lists Cx and Gx in constant time; tuple-ids should be retrievable one by one in order of decreasing score. This corresponds to TA's sorted-stream access.
 Second, random access on the lists must be possible: given a TID, the corresponding score (CondScore or GlobScore) should be retrievable in constant time.
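Below is a sketch of how this List Merge step could look over in-memory lists; it is an illustrative adaptation of TA, not the paper's exact algorithm. The dict-based random access stands in for the B+-tree lookups described on the next slide, and a TID missing from some stream is skipped because such a tuple does not satisfy the whole conjunctive query.

```python
import heapq

def list_merge_topk(cond_lists, glob_list, k=10):
    """TA-style merge of s conditional streams plus one global stream.
    Each stream is a list of (tid, score) pairs sorted by descending score.
    The merged score of a tuple is glob_score * prod_i cond_score_i, which is
    proportional to the actual ranking score. Product aggregation is monotone
    here because all scores are positive ratios."""
    streams = cond_lists + [glob_list]
    lookups = [dict(s) for s in streams]   # random access: tid -> score
    seen = set()
    top = []                               # min-heap of (merged_score, tid)
    last = [s[0][1] for s in streams]      # last score seen per stream (streams assumed non-empty)
    pos = [0] * len(streams)
    while True:
        advanced = False
        for i, s in enumerate(streams):    # one sorted access per stream per round
            if pos[i] >= len(s):
                continue
            tid, sc = s[pos[i]]
            pos[i] += 1
            last[i] = sc
            advanced = True
            if tid in seen:
                continue
            seen.add(tid)
            full, ok = 1.0, True
            for lu in lookups:             # complete the score by random access
                if tid not in lu:
                    ok = False             # absent from a stream => not an answer tuple
                    break
                full *= lu[tid]
            if ok:
                heapq.heappush(top, (full, tid))
                if len(top) > k:
                    heapq.heappop(top)
        threshold = 1.0                    # best possible score of any unseen tuple
        for v in last:
            threshold *= v
        if not advanced or (len(top) == k and top[0][0] >= threshold):
            return sorted(top, reverse=True)
```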

 These lists are stored as database tables:
 CondList Cx: {AttName, AttVal, TID, CondScore}, with a B+-tree index on (AttName, AttVal, CondScore)
 GlobList Gx: {AttName, AttVal, TID, GlobScore}, with a B+-tree index on (AttName, AttVal, GlobScore)

 The lists consume O(mn) space (m is the number of attributes and n the number of tuples of the database table).
 Only a subset of the lists can be stored at preprocessing time, at the expense of increased query-processing time.
 Which lists to retain or omit is determined by analyzing the workload: store the conditional lists Cx and the corresponding global lists Gx only for the attribute values x that occur most frequently in the workload.
 At query time, probe the intermediate knowledge-representation layer to compute the missing information.

 Datasets: MSR HomeAdvisor Seattle and the Internet Movie Database.
 Software and hardware: Microsoft SQL Server 2000 RDBMS; P4 2.8-GHz PC with 1 GB RAM; implementation in C#, connected to the RDBMS through DAO.

 Two ranking methods were evaluated: 1) Conditional and 2) Global.
 Several hundred workload queries were collected for each dataset, and the ranking algorithm was trained on this workload.

 For each query Qi, generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples.
 Let each user mark the 10 tuples in Hi most relevant to Qi.
 Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm.
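A tiny helper showing one way to quantify that match; the slides do not spell out the exact formula, so this fraction-of-overlap measure is an assumption of the sketch.

```python
def overlap_fraction(user_marked, algo_topk):
    """Fraction of the user's marked tuples that also appear in an
    algorithm's returned list (both given as collections of tuple-ids)."""
    marked = set(user_marked)
    return len(marked & set(algo_topk)) / len(marked)
```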

 Users were shown the top-5 results of the two ranking methods for 5 queries (different from those in the previous survey) and asked which ranking they preferred.

 The performance of the various implementations of the Conditional algorithm was compared: List Merge, its space-saving variant, and Scan, on the datasets described in the experimental setup.

 A completely automated approach to the Many-Answers Problem that leverages data and workload statistics and correlations.
 Probabilistic IR models were adapted to structured data.
 Experiments demonstrate both the efficiency and the quality of the ranking system.

 Many relational databases contain text columns in addition to numeric and categorical columns; can correlations between text and non-text data be leveraged in a meaningful way for ranking?
 Comprehensive quality benchmarks for database ranking need to be established.