Probabilistic Ranking of Database Query Results
Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik
Presented by Weimin He, CSE@UTA
Outline
- Motivation
- Problem Definition
- System Architecture
- Construction of the Ranking Function
- Implementation
- Experiments
- Conclusion and Open Problems
11/21/2018 Weimin He CSE@UTA
Motivating Example
- Realtor database: table D = (TID, Price, City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)
- SQL query:
  SELECT * FROM D WHERE City = 'Seattle' AND View = 'Waterfront'
Motivation
- The many-answers problem: queries like the one above can return far more tuples than a user can inspect
- Two alternative solutions: query reformulation, or automatic ranking
- This work applies the probabilistic model from Information Retrieval (IR) to ranking database tuples
Problem Definition
- Given a database table D with n tuples {t1, ..., tn} over a set of m categorical attributes A = {A1, ..., Am}, and a query
  Q: SELECT * FROM D WHERE X1 = x1 AND ... AND Xs = xs
  where each Xi is an attribute from A and xi is a value in its domain
- The set X = {X1, ..., Xs} is known as the set of attributes specified by the query, while Y = A - X is known as the set of unspecified attributes
- Let S ⊆ D be the answer set of Q
- Problem: how to rank the tuples in S and return the top-k tuples to the user?
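To make the setup concrete, here is a minimal sketch in Python; the table, query, and attribute values below are hypothetical illustrations, not data from the talk:

```python
# Hypothetical mini-table and conjunctive point query, illustrating the
# specified attributes X, unspecified attributes Y, and answer set S.

def answer_set(table, query):
    """Return all tuples satisfying the conjunctive equality query."""
    return [t for t in table if all(t.get(a) == v for a, v in query.items())]

homes = [
    {"TID": 1, "City": "Seattle", "View": "Waterfront", "Pool": "Yes"},
    {"TID": 2, "City": "Seattle", "View": "Waterfront", "Pool": "No"},
    {"TID": 3, "City": "Dallas",  "View": "Waterfront", "Pool": "No"},
]
query = {"City": "Seattle", "View": "Waterfront"}

S = answer_set(homes, query)            # the "many answers" to be ranked
X = set(query)                          # specified attributes
Y = set(homes[0]) - X - {"TID"}         # unspecified attributes
```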
System Architecture
Intuition for the Ranking Function
- Example query: SELECT * FROM D WHERE City = 'Seattle' AND View = 'Waterfront'
- The score of a result tuple t depends on two factors:
  - Global score: the global importance of the unspecified attribute values (e.g., homes in good school districts are globally desirable)
  - Conditional score: correlations between specified and unspecified attribute values (e.g., View = 'Waterfront' correlates with BoatDock = 'Yes')
Probabilistic Model in IR
- Document t, query Q; R: relevant document set; R̄ = D - R: irrelevant document set
- Rank documents by the odds of relevance: Score(t) = p(R|t) / p(R̄|t)
- Bayes' rule: p(R|t) = p(t|R) p(R) / p(t), so Score(t) ∝ p(t|R) / p(t|R̄)
- Product rule: p(a, b | c) = p(a | b, c) p(b | c)
Adaptation of PIR to Databases
- Each tuple t is treated as a document
- Partition t into t(X) and t(Y), its specified and unspecified parts; for brevity, t(X) and t(Y) are written as X and Y
- Starting from the initial scoring function, derive step by step until the final ranking function is obtained
Preliminary Derivation
- Score(t) = p(R|t) / p(R̄|t) ∝ p(t|R) / p(t|R̄)
- Since R is typically much smaller than D, approximate R̄ ≈ D: Score(t) ∝ p(t|R) / p(t|D)
- Writing t as its specified and unspecified parts: Score(t) ∝ p(X, Y | R) / p(X, Y | D)
Limited Independence Assumptions
- Given a query Q and a tuple t, the X values (and the Y values) are assumed to be independent within themselves, though dependencies between the X and Y values are allowed
Continuing Derivation
- By the product rule: Score(t) ∝ [ p(X | Y, R) p(Y | R) ] / [ p(X | Y, D) p(Y | D) ]
- With the limited independence assumptions, each factor decomposes over individual attribute values:
  Score(t) ∝ ∏_{y in Y} [ p(y|R) / p(y|D) ] · ∏_{y in Y} ∏_{x in X} [ p(x|y,R) / p(x|y,D) ]
- What remains is to estimate the quantities involving the relevant set R
Workload-Based Estimation of R
- Assume a collection of "past" queries exists in the system
- The workload W is represented as a set of "tuples", one per query, recording the attribute values the query specified
- Given query Q and specified attribute set X, approximate R as all query "tuples" in W that also request X
- All properties of the relevant tuple set R can then be obtained by examining only the subset of the workload that contains queries that also request X
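A sketch of this approximation, assuming each workload "tuple" is a mapping from the attributes a past query specified to the values it requested (the workload data below is hypothetical):

```python
# Approximate the relevant set R by the past queries in the workload W
# whose specified attributes include all attributes of the current query.

def relevant_workload_subset(workload, specified):
    return [q for q in workload if specified <= set(q)]

workload = [
    {"City": "Seattle", "View": "Waterfront"},
    {"City": "Seattle"},
    {"City": "Dallas", "View": "Waterfront", "Pool": "Yes"},
]
R_approx = relevant_workload_subset(workload, {"City", "View"})
# Only the first and third queries also request both City and View.
```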
Final Ranking Function
- Replacing R by its workload approximation W:
  Score(t) ∝ ∏_{y in Y} [ p(y|W) / p(y|D) ] · ∏_{y in Y} ∏_{x in X} [ p(x|y,W) / p(x|y,D) ]
- The first product is the global part of the score; the second is the conditional part
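The final ranking function multiplies, for each unspecified value y, a global ratio p(y|W)/p(y|D) and conditional ratios p(x|y,W)/p(x|y,D) for each specified value x. A sketch of evaluating it for one tuple, assuming the four atomic probabilities are pre-computed (the probability tables below are hypothetical):

```python
def score(t, X, Y, p_w, p_d, p_xy_w, p_xy_d):
    """Product over unspecified values y of the global ratio p(y|W)/p(y|D)
    times the conditional ratios p(x|y,W)/p(x|y,D) for each specified x."""
    s = 1.0
    for ay in Y:                       # unspecified attributes
        y = t[ay]
        s *= p_w[y] / p_d[y]           # global part
        for ax in X:                   # specified attributes
            s *= p_xy_w[(t[ax], y)] / p_xy_d[(t[ax], y)]  # conditional part
    return s

t = {"View": "waterfront", "SchoolDistrict": "excellent"}
s = score(
    t, X={"View"}, Y={"SchoolDistrict"},
    p_w={"excellent": 0.1}, p_d={"excellent": 0.01},
    p_xy_w={("waterfront", "excellent"): 0.02},
    p_xy_d={("waterfront", "excellent"): 0.01},
)
# (0.1 / 0.01) * (0.02 / 0.01) = 20.0
```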
Pre-Computing the Atomic Probabilities in the Ranking Function
- p(y|W): relative frequency of y in W
- p(y|D): relative frequency of y in D
- p(x|y,W): (# of tuples in W that contain x and y) / (total # of tuples in W)
- p(x|y,D): (# of tuples in D that contain x and y) / (total # of tuples in D)
Example: Computing Atomic Probabilities
- Query: SELECT * FROM D WHERE City = 'Seattle' AND View = 'Waterfront'
- Y = {SchoolDistrict, BoatDock, ...}
- |D| = 10,000; |W| = 1,000
- W{excellent} = 10; W{waterfront & yes} = 5 (for this example, the corresponding counts in D are taken to be the same)
- p(excellent|W) = 10/1,000 = 0.01; p(excellent|D) = 10/10,000 = 0.001
- p(waterfront|yes,W) = 5/1,000 = 0.005; p(waterfront|yes,D) = 5/10,000 = 0.0005
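Recomputing the example's atomic probabilities directly from its counts (note that 10/1,000 is 0.01 and 10/10,000 is 0.001; the counts in D are assumed equal to those in W for this illustration):

```python
# Atomic probability estimates from the example's counts.
W_size, D_size = 1_000, 10_000
count_excellent_W = 10          # workload queries containing 'excellent'
count_waterfront_yes_W = 5      # workload queries containing waterfront & yes

p_excellent_W = count_excellent_W / W_size            # 0.01
p_excellent_D = count_excellent_W / D_size            # 0.001 (same count assumed in D)
p_waterfront_yes_W = count_waterfront_yes_W / W_size  # 0.005
p_waterfront_yes_D = count_waterfront_yes_W / D_size  # 0.0005 (same count assumed in D)
```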
Indexing the Atomic Probabilities
- For p(y|W) and p(y|D): two tables {AttName, AttVal, Prob} (one for W, one for D), each with a B+-tree index on (AttName, AttVal)
- For p(x|y,W) and p(x|y,D): two tables {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob} (one for W, one for D), each with a B+-tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
Scan Algorithm
- Preprocessing: the Atomic Probabilities Module computes and indexes the quantities p(y|W), p(y|D), p(x|y,W), and p(x|y,D) for all distinct values x and y
- Execution:
  - Select the tuples that satisfy the query
  - Scan and compute the score for each result tuple
  - Return the top-k tuples
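The execution phase can be sketched as a filter-score-select pipeline; `score_fn` stands in for the pre-computed ranking function (here a trivial hypothetical stand-in):

```python
import heapq

# Sketch of the Scan algorithm's execution phase:
# select qualifying tuples, score each, keep the top-k.

def scan_top_k(table, query, score_fn, k):
    answers = (t for t in table if all(t.get(a) == v for a, v in query.items()))
    return heapq.nlargest(k, answers, key=score_fn)

rows = [{"TID": i, "City": "Seattle"} for i in range(5)]
top2 = scan_top_k(rows, {"City": "Seattle"}, lambda t: t["TID"], k=2)
```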
Beyond the Scan Algorithm
- The Scan algorithm is inefficient when there are many tuples in the answer set
- The other extreme, pre-computing the top-k tuples for all possible queries, is still infeasible in practice
- Trade-off solution:
  - Pre-compute ranked lists of tuples for all possible atomic queries
  - At query time, merge the ranked lists to get the top-k tuples
Two Kinds of Ranked Lists
- CondList Cx: {AttName, AttVal, TID, CondScore}, with a B+-tree index on (AttName, AttVal, CondScore)
- GlobList Gx: {AttName, AttVal, TID, GlobScore}, with a B+-tree index on (AttName, AttVal, GlobScore)
Index Module
List Merge Algorithm
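The algorithm itself appears as a figure in the talk; as a hedged sketch, here is a Threshold-Algorithm-style merge over the ranked lists, under two simplifying assumptions: every TID appears in every list, and a tuple's overall score is the product of its per-list scores:

```python
def list_merge_top_k(lists, k):
    """lists: one (score, tid) list per atomic query, sorted by score descending.
    Sorted access advances all lists in lock step; random access uses dicts.
    Stops once k seen tuples score at least the threshold, i.e. the best
    score any still-unseen tuple could achieve."""
    lookup = [{tid: s for s, tid in lst} for lst in lists]
    scores = {}
    for pos in range(max(len(lst) for lst in lists)):
        # Sorted access: score every newly seen tuple via random access.
        for lst in lists:
            if pos < len(lst):
                tid = lst[pos][1]
                if tid not in scores:
                    total = 1.0
                    for lk in lookup:
                        total *= lk[tid]
                    scores[tid] = total
        # Threshold: product of the current scores at this depth.
        threshold = 1.0
        for lst in lists:
            threshold *= lst[min(pos, len(lst) - 1)][0]
        top = sorted(scores.values(), reverse=True)[:k]
        if len(top) == k and top[-1] >= threshold:
            break
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

cond = [(0.9, 1), (0.5, 2)]   # hypothetical CondList entries
glob = [(0.8, 2), (0.6, 1)]   # hypothetical GlobList entries
best = list_merge_top_k([cond, glob], k=1)   # tuple 1 scores 0.9 * 0.6
```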
Experimental Setup
- Datasets:
  - MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
  - Internet Movie Database (http://www.imdb.com)
- Software and hardware:
  - Microsoft SQL Server 2000 RDBMS
  - P4 2.8-GHz PC, 1 GB RAM
  - Implemented in C#, connected to the RDBMS through DAO
Quality Experiments
- Conducted on the Seattle Homes and Movies tables
- A workload was collected from users
- The Conditional ranking method of this paper is compared with the Global method [CIDR03]
Quality Experiment: Average Precision
- For each query Qi, generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples
- Each user marks the 10 tuples in Hi most relevant to Qi
- Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
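The comparison step reduces to a precision-style overlap: the fraction of the user's marked tuples that also appear in an algorithm's top-10. A sketch (the TID ranges below are hypothetical):

```python
# Fraction of the user's marked tuples that an algorithm's top list recovers.
def overlap_fraction(user_marked, algorithm_top):
    return len(set(user_marked) & set(algorithm_top)) / len(user_marked)

# Hypothetical: user marks TIDs 0-9, algorithm returns TIDs 5-14.
frac = overlap_fraction(user_marked=range(10), algorithm_top=range(5, 15))
```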
Quality Experiment: Fraction of Users Preferring Each Algorithm
- 5 new queries
- Users were shown the top-5 results of each algorithm
Performance Experiments
- Datasets
- Two algorithms compared: the Scan algorithm and the List Merge algorithm
Performance Experiments – Pre-computation Time
Performance Experiments – Execution Time
Conclusion and Open Problems
- Automatic ranking for the many-answers problem
- Adaptation of PIR to databases
- Open problems: multiple-table queries, non-categorical attributes