Answering Why-Not Questions on Top-K Queries. Andy He and Eric Lo, The Hong Kong Polytechnic University.
Background: the database community has focused on performance issues for decades. Recently, attention has turned to usability issues, such as supporting keyword search, query auto-completion, and explaining query results (a.k.a. Why and Why-Not questions). 2/33
Why-Not Questions: you pose a query Q and the database returns a result R, but R gives you a "surprise". E.g., a tuple m that you were expecting in the result is missing, so you ask "WHY??!". You pose a why-not question (Q, R, m), and the database returns an explanation E. 3/33
The (short) history of Why-Not: Chapman and Jagadish, "Why Not?" [SIGMOD 09]: Select-Project-Join (SPJ) queries; explanation E = "tell you which operator excludes the expected tuple". Huang, Chen, A.H. Doan, and J. Naughton, "On the Provenance of Non-Answers to Queries Over Extracted Data" [PVLDB 09]: SPJ queries; explanation E = "tell you how to modify the data". 4/33
The (short) history of Why-Not: Herschel and Hernández, "Explaining Missing Answers to SPJUA Queries" [PVLDB 10]: SPJUA queries; explanation E = "tell you how to modify the data". Tran and C.Y. Chan, "How to ConQueR Why-Not Questions" [SIGMOD 10]: SPJA queries; explanation E = "tell you how to modify your query". 5/33
About this work: Why-Not questions on Top-k queries. Example: a Top-3 hotel query with weighting w_origin returns Rank 1: Sheraton; Rank 2: Westin; Rank 3: InterContinental. "WHY is my favorite Renaissance NOT in the Top-3 result?" Is my value of k too small? Should I revise my weighting? Or do I need to modify both k and the weighting? Explanation E = "tell you how to refine your Top-K query in order to get your favorites back into the result". 6/33
One possible answer: modify k only. Original query Q(k_origin=3, w_origin). The ranking of Renaissance under the original weighting w_origin: Rank 1: Sheraton; Rank 2: Westin; Rank 3: InterContinental; Rank 4: Hilton; Rank 5: Renaissance. Refined query #1: Q1(k=5, w=w_origin). 7/33
Another possible answer: modify the weighting only. Original query Q(k=3, w_origin); refined query #1: Q1(k=5, w=w_origin). If we set a new weighting w': Rank 1: Hotel E; Rank 2: Hotel F; Rank 3: Renaissance. Refined query #2: Q2(k=3, w=w'). 8/33
Yet another possible answer: modify both. Original query Q(k=3, w_origin); refined query #1: Q1(k=5, w=w_origin); refined query #2: Q2(k=3, w=w'). If we set yet another weighting w'': Rank 1: Hotel A; Rank 2: Hotel B; Rank 3: Hotel C; ...; Rank 10000: Renaissance. Refined query #3: Q3(k=10000, w=w''). 9/33
Our objective: find the refined query that minimizes a penalty function while including the missing tuple m in its Top-K result. Penalty settings: Prefer Modify K (PMK); Prefer Modify Weighting (PMW); Never Mind (NM, the default). 10/33
Basic idea: for each weighting w_i ∈ W, run PROGRESS(w_i, UNTIL-SEE-m) to obtain the ranking r_i of m under w_i and form a refined query Q_i(k=r_i, w=w_i); return the refined query with the least penalty. Problem: W is infinite!!! 11/33
Our approach: sampling. For each weighting w_i ∈ W, run PROGRESS(w_i, UNTIL-SEE-m) to obtain the ranking r_i of m under w_i and form a refined query Q_i(k=r_i, w=w_i); return the refined query with the least penalty. Here W is a set of weightings drawn from a restricted weighting space. Key theorem: the optimal refined query Q_best is either Q_1 (the refined query that keeps w_origin and only enlarges k) or else has a weighting w_best inside the restricted weighting space. A sketch of this loop follows below. 12/33
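A minimal sketch of this sampling loop in Python, assuming linear scoring (score = w · t) and a caller-supplied penalty function. PROGRESS is replaced here by a naive full sort, and every name and signature is illustrative rather than the paper's actual API:

```python
def progress_until_see(data, w, m):
    """Naive stand-in for PROGRESS(w, UNTIL-SEE-m): rank every tuple by its
    weighted score and report the rank of the missing tuple m (m must be in data)."""
    ranked = sorted(data, key=lambda t: -sum(wi * ti for wi, ti in zip(w, t)))
    return ranked.index(m) + 1


def answer_why_not(data, w_origin, k_origin, m, sampled_weightings, penalty):
    """Sampling-based refinement loop (illustrative sketch).

    sampled_weightings: weightings drawn from the restricted weighting space
    penalty           : callable penalty(delta_k, delta_w) -> float
    """
    best_query, best_penalty = None, float("inf")
    for w in sampled_weightings:
        r = progress_until_see(data, w, m)              # ranking r_i of m under w_i
        delta_k = max(0, r - k_origin)                  # only count increases of k
        delta_w = sum((a - b) ** 2
                      for a, b in zip(w, w_origin)) ** 0.5   # ||w - w_origin||_2
        p = penalty(delta_k, delta_w)
        if p < best_penalty:                            # keep the least-penalty query
            best_query, best_penalty = {"k": r, "w": w}, p
    return best_query, best_penalty
```

Including w_origin itself among the sampled weightings also covers the modify-k-only candidate Q_1 from the key theorem.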
How large should the sample size be? We say a refined query is a best-T% refined query if its penalty is smaller than that of the other (1 - T%) of refined queries, and we hope to obtain such a query with probability larger than a threshold Pr. 13/33
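To make the sample size concrete, here is one standard derivation under the assumption that samples are drawn independently and uniformly (an illustration, not necessarily the paper's exact bound), with T and Pr written as fractions:

```latex
% Probability that at least one of s independent samples is a best-T% refined query:
\[
  1 - (1 - T)^{s} \;\ge\; \mathrm{Pr}
  \quad\Longrightarrow\quad
  s \;\ge\; \left\lceil \frac{\ln(1 - \mathrm{Pr})}{\ln(1 - T)} \right\rceil
\]
% Example: T = 0.01 (best 1%), Pr = 0.95  =>  s >= ceil(ln 0.05 / ln 0.99) = 299
```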
The PROGRESS operation can be expensive. Original query Q(k=3, w_origin); refined query #1: Q1(k=5, w=w_origin). If we set the weighting w'': Rank 1: Hotel A; Rank 2: Hotel B; Rank 3: Hotel C; ...; Rank 10000: Renaissance. Refined query: Q(k=10000, w=w''). Very slow!!! 14/33
Two optimization techniques: stop each PROGRESS operation early, and skip some PROGRESS operations. 15/33
Stop earlier. Original query Q(k=3, w_origin); refined query #1: Q1(k=5, w=w_origin). If we set another weighting w: Rank 1: Hotel A; Rank 2: Hotel B; Rank 3: Hotel C; ...; Rank 5: Hotel D; ... Renaissance has not shown up by rank 5, so this weighting can no longer beat the best refined query found so far, and PROGRESS can stop early (see the sketch below). 16/33
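A sketch of the early stop, under my reading of this slide: max_rank is the deepest rank at which m could still yield a penalty below the best found so far (how that bound is derived is the paper's detail; here it is simply a parameter, and all names are illustrative):

```python
import heapq

def progress_until_see_bounded(data, w, m, max_rank):
    """PROGRESS(w, UNTIL-SEE-m) with an early stop: results are produced rank
    by rank, and we give up once m cannot show up within max_rank positions."""
    heap = [(-sum(wi * ti for wi, ti in zip(w, t)), i) for i, t in enumerate(data)]
    heapq.heapify(heap)
    for rank in range(1, max_rank + 1):
        if not heap:
            break
        _, i = heapq.heappop(heap)   # next-highest-scoring tuple under weighting w
        if data[i] == m:
            return rank
    return None   # m ranks below max_rank: this weighting cannot beat the best so far
```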
Skip PROGRESS operations (a): similar weightings may lead to similar rankings (based on the "Reverse Top-k" paper, ICDE'10). Therefore, the result of PROGRESS(w_x, UNTIL-SEE-m) could be used to deduce the result of PROGRESS(w_y, UNTIL-SEE-m), provided that w_x and w_y are similar. 17/33
Skip PROGRESS operations (a), e.g.: original query Q(k=3, w_origin); refined query #1: Q1(k=5, w=w_origin).
Scores under one weighting w_x: Sheraton 10, Westin 9, InterContinental 8, Hilton 7, Renaissance 6.
Scores under a similar weighting w_y: Sheraton 9, Westin 10, InterContinental 7, Hilton 8, Renaissance 5.
How would the scores look if we set yet another nearby weighting w? 18/33
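One way to see why similar weightings give similar scores (my illustration via the Cauchy-Schwarz inequality; the paper's actual deduction rule builds on the Reverse Top-k machinery):

```latex
% For any tuple p, the linear scores under two weightings w_x and w_y differ by at most
% ||w_x - w_y|| * ||p||, so when the weightings are close the rankings can only change
% among tuples whose scores are already close.
\[
  \bigl|\, w_x \cdot p - w_y \cdot p \,\bigr|
  \;=\; \bigl|\, (w_x - w_y) \cdot p \,\bigr|
  \;\le\; \lVert w_x - w_y \rVert_2 \, \lVert p \rVert_2
\]
```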
Skip PROGRESS operations (b): we can skip a weighting w outright if its change ∆w from the original weighting w_origin is already too large. E.g., if the best refined query found so far has penalty 0.5 and a candidate weighting w has ∆w = 1, its penalty cannot be lower, so we can skip it entirely (see the sketch below). 19/33
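The check could look like this sketch, assuming the basic penalty model λk·∆k + λw·∆w from the backup slides (the normalized model would use the normalized ∆w instead; names are illustrative):

```python
def can_skip(w, w_origin, lambda_w, best_penalty):
    """Skip weighting w when its weighting-change cost alone already matches or
    exceeds the best penalty found so far (the Δk term is never negative)."""
    delta_w = sum((a - b) ** 2 for a, b in zip(w, w_origin)) ** 0.5
    return lambda_w * delta_w >= best_penalty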
Experiments Case Study on NBA data Experiments on Synthetic Data 20/33
Case study on NBA data: compare with a pure random sampling version, which does not draw samples from the restricted weighting space but from the complete weighting space. 21/33
Find the top-3 centers in NBA history. 5 attributes (each weight = 1/5): POINTS, REBOUND, BLOCKING, FIELD GOAL, FREE THROW. Initial result: Rank 1: Chamberlain; Rank 2: Abdul-Jabbar; Rank 3: O'Neal. 22/33
Find the top-3 centers in NBA history: "Why not?!" Comparing the two sampling strategies:
Sampling on the restricted weighting space: refined query = Top-3, ∆k = 0.
Sampling on the whole weighting space: refined query = Top-7, ∆k = 4.
(Time in ms and the penalty of each refined query are also reported on the slide.)
We choose "Prefer Modify Weighting". 23/33
Synthetic Data Uniform, Anti-correlated, Correlated Scalability 24/33
Varying query dimensions 25/33
Varying k_origin 26/33
Varying the ranking of the missing object 27/33
Varying the number of missing objects 28/33
Varying T% (plots: time and quality) 29/33
Varying Pr 30/33
Optimization effectiveness 31/33
Conclusions: We are the first to answer why-not questions on top-k queries. We prove that finding the optimal answer is computationally expensive, so a sampling-based method is proposed; the optimal answer is proved to lie in a restricted weighting space. Two optimization techniques are proposed: stop each PROGRESS operation early, and skip some PROGRESS operations. 32/33
Thanks Q&A
Dealing with multiple missing objects M: we have to modify the algorithm a little bit. First do a simple filtering on the set of missing objects: if m_i dominates m_j in the data space, remove m_i from M, because every time m_j shows up in a top-k result, m_i must be there too. The condition UNTIL-SEE-m then becomes UNTIL-SEE-ALL-OBJECTS-IN-M. 34/33
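A minimal sketch of that filtering step, assuming larger attribute values are better, so under any non-negative weighting a dominating object always scores at least as high as the object it dominates:

```python
def filter_dominated(missing):
    """Remove m_i from M when m_i dominates some other missing object m_j:
    whenever m_j appears in a top-k result, m_i is already there as well."""
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return [mi for mi in missing
            if not any(dominates(mi, mj) for mj in missing if mj is not mi)]
```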
Penalty model. Original query Q(3, w_origin); refined query Q1(5, w_origin). Penalty of changing k: ∆k = 5 - 3 = 2. Penalty of changing w: ∆w = ||w_origin - w_origin||_2 = 0. Basic penalty model: Penalty(5, w_origin) = λ_k·∆k + λ_w·∆w, where λ_k + λ_w = 1. 35/33
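The same basic model written out as code (the λ values used in the example comment are illustrative, not from the slides):

```python
def basic_penalty(k_new, w_new, k_origin, w_origin, lambda_k, lambda_w):
    """Basic penalty model: Penalty = λk·Δk + λw·Δw, with λk + λw = 1."""
    delta_k = k_new - k_origin                                       # e.g. 5 - 3 = 2
    delta_w = sum((a - b) ** 2 for a, b in zip(w_new, w_origin)) ** 0.5
    return lambda_k * delta_k + lambda_w * delta_w

# Slide's example, with illustrative lambda_k = lambda_w = 0.5:
# basic_penalty(5, w_origin, 3, w_origin, 0.5, 0.5) == 0.5 * 2 + 0.5 * 0 == 1.0
```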
Normalized penalty function 36/33