Automatic Categorization of Query Results A Paper by Kaushik Chakrabarti, Surajit Chaudhuri, Seung-won Hwang Presented by Arjun Saraswat

Flow of the Presentation
1. Introduction
2. Motivation
3. Basics of Model
4. Cost Estimation
5. Algorithms
6. Experimental Evaluation
7. Conclusion

INTRODUCTION

Introduction This paper addresses the "too many answers" problem. This phenomenon of too many answers is often referred to as information overload. Information overload happens when the user is not certain what she is looking for; in such situations, she typically fires a broad query in order to avoid excluding potentially interesting results. There are two techniques for handling information overload, categorization and ranking; this paper focuses on the categorization technique.

MOTIVATION

Motivation Example: A user fires a query on the MSN House & Home database with the following specifications: Area: Seattle/Bellevue area of Washington, USA; Price range: $200,000 to $300,000. The query returns 6,045 results, and it is hard for the user to separate the interesting ones from the uninteresting ones, which wastes a lot of her time and effort. The categorization technique introduced by this paper solves this problem: such queries are answered with a hierarchical category structure based on the contents of the answer set. The main motive is to reduce information overload.

Motivation Fig. 1. Hierarchical categorization of the results of the example query.

Basics of Model

R = a set of tuples; it can be a base relation, a materialized view, or the result of a query Q. Q = an SPJ (select-project-join) query. A hierarchical categorization of R is a recursive partitioning of the tuples in R based on the data attributes and their values, as shown in Fig. 1. Base case: the root (level 0) contains all the tuples in R; this tuple set is partitioned into mutually disjoint categories using a single attribute. Inductive step: for a given node C at level (l-1), the set of tuples tset(C) contained in C is partitioned into ordered, mutually disjoint subcategories (level-l nodes) using an attribute that is the same for all nodes at level (l-1).

Basics of Model A node C is partitioned only if it contains more than a certain number of tuples; the attribute on which it is partitioned is called the categorizing attribute of level l and the subcategorizing attribute of level (l-1). An attribute used once is not used again at later levels. Category label: the predicate label(C) describing node C, e.g., 'Neighborhood: Redmond, Bellevue' or 'Price: 200k - 225k'. Tuple set tset(C): the set of tuples contained in C, occurring either directly or under its subcategories. Example: the tset for the category with label 'Neighborhood: Seattle' is the set of all homes in R that are located in Seattle.
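As a concrete illustration (not from the paper's implementation), a category tree node might be represented as follows; the names CategoryNode, label, tset, sub_attr, and children are hypothetical:

```python
# Minimal sketch (hypothetical names, not the authors' code) of a node in
# the hierarchical category tree described above.
class CategoryNode:
    def __init__(self, label, tset, sub_attr=None):
        self.label = label          # predicate label(C), e.g. "Price: 200k - 225k"
        self.tset = tset            # tuples in tset(C), directly or via subcategories
        self.sub_attr = sub_attr    # subcategorizing attribute SA(C), if partitioned
        self.children = []          # ordered, mutually disjoint subcategories

    def is_leaf(self):
        return len(self.children) == 0
```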

Basics of Model Important points for each level: determine the categorizing attribute for that level, and partition on that attribute in such a way as to minimize the information overload on the user. Exploration model: two models capture the two common scenarios: 1. ALL scenario (the user wants all relevant tuples). 2. ONE scenario (the user wants only the first relevant tuple).

Basics of Model The model of exploration of the subtree rooted at an arbitrary node C : EXPLORE C if C is non leaf node CHOOSE one of the following : (1)Examine all tuples in tset(C)//option SHOWTUPLES (2)for(i=1;i≤n;i++)//option SHOWCAT Examine Label of this subcategory C i CHOOSE one of the following : (2.1)EXPLORE C i (2.2)Ignore C i else//C is a leaf node Examine all tuples in tset(C)//SHOWTUPLES is only option

Basics of Model 2. ONE Scenario:

EXPLORE C:
  if C is a non-leaf node:
    CHOOSE one of the following:
      (1) Examine the tuples in tset(C) from the beginning
          until the first relevant tuple is found   // option SHOWTUPLES
      (2) for (i = 1; i <= n; i++)                  // option SHOWCAT
            Examine the label of the i-th subcategory Ci
            CHOOSE one of the following:
              (2.1) EXPLORE Ci
              (2.2) Ignore Ci
            if (choice == EXPLORE) break            // examine until the first relevant tuple
  else:                                             // C is a leaf node
    Examine the tuples in tset(C) from the beginning
    until the first relevant tuple is found         // SHOWTUPLES is the only option

Cost Estimation

Cost Model for the ALL Scenario Cost_All(X, T) = the information overload cost (or simply cost) of a given user exploration X over category tree T. We want to generate the tree that minimizes the number of items this particular user needs to examine. Since X is not known in advance, we use aggregate knowledge of previous user behavior to estimate the information overload cost Cost_All(T) that a user will face, on average, during an exploration of a given category tree T.

Cost Estimation Exploration probability P(C): the probability that a user exploring T explores category C, using either SHOWTUPLES or SHOWCAT, upon examining its label. SHOWTUPLES probability P_w(C): the probability that the user chooses option SHOWTUPLES for category C, given that she explores C; (1 - P_w(C)) is then the SHOWCAT probability of C. Cost model for the ALL scenario: consider a non-leaf node C of T. Cost_All(T_C) = the cost of exploring the subtree T_C rooted at C; we denote Cost_All(T_C) by Cost_All(C), since the cost is always calculated in the context of the given tree.

Cost Estimation If SHOWTUPLES is chosen for C, the cost is P_w(C) * |tset(C)|. If SHOWCAT is chosen for C, the cost has two components: the first is K * n, where K is the cost of examining a category label relative to the cost of examining a data tuple and n is the number of subcategories; the second, for each subcategory Ci, is Cost_All(Ci) if she chooses to explore Ci, and 0 if she chooses to ignore it. Putting these together:

Cost_All(C) = P_w(C) * |tset(C)| + (1 - P_w(C)) * ( K*n + Σ_{i=1..n} P(Ci) * Cost_All(Ci) )    (1)

If C is a leaf node, Cost_All(C) = |tset(C)|.
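A minimal sketch of equation (1) as a recursion over the CategoryNode structure above; p_w and p are caller-supplied functions returning P_w(C) and P(C), and K is the relative label-examination cost (all names are assumptions, not the paper's code):

```python
def cost_all(node, p_w, p, K):
    """Estimated information overload cost Cost_All(C), per equation (1)."""
    if node.is_leaf():
        return len(node.tset)                  # SHOWTUPLES is the only option
    n = len(node.children)
    showtuples = p_w(node) * len(node.tset)    # user examines all tuples directly
    showcat = (1 - p_w(node)) * (
        K * n                                  # examine all n subcategory labels
        + sum(p(c) * cost_all(c, p_w, p, K) for c in node.children)
    )
    return showtuples + showcat
```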

Cost Estimation Cost Model for the ONE Scenario Cost_One(T) = the information overload cost. Let frac(C) denote the fraction of tuples in tset(C) the user examines, on average, before finding the first relevant tuple. The cost of the SHOWTUPLES option is P_w(C) * frac(C) * |tset(C)|. The cost of the SHOWCAT option is (1 - P_w(C)) * Σ_i (probability that Ci is the first category explored) * (K*i + Cost_One(Ci)). The probability that Ci is the first category explored (i.e., that the user explores Ci but none of C1 through C(i-1)) is ∏_{j=1..i-1} (1 - P(Cj)) * P(Ci). The final ONE-scenario cost is therefore:

Cost_One(C) = P_w(C) * frac(C) * |tset(C)| + (1 - P_w(C)) * Σ_{i=1..n} [ ∏_{j=1..i-1} (1 - P(Cj)) ] * P(Ci) * ( K*i + Cost_One(Ci) )    (2)

If C is a leaf node, Cost_One(C) = frac(C) * |tset(C)|.
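Equation (2) translates into a similar recursion; frac is a caller-supplied function returning frac(C), and the running product tracks the probability that none of C1..C(i-1) was explored (again a hedged sketch with hypothetical names):

```python
def cost_one(node, p_w, p, K, frac):
    """Estimated information overload cost Cost_One(C), per equation (2)."""
    if node.is_leaf():
        return frac(node) * len(node.tset)
    showtuples = p_w(node) * frac(node) * len(node.tset)
    showcat = 0.0
    none_explored = 1.0                        # running product of (1 - P(Cj)), j < i
    for i, c in enumerate(node.children, start=1):
        p_first = none_explored * p(c)         # Ci is the first category explored
        showcat += p_first * (K * i + cost_one(c, p_w, p, K, frac))
        none_explored *= 1 - p(c)
    return showtuples + (1 - p_w(node)) * showcat
```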

Cost Estimation Using the Workload to Estimate Probabilities P(C) and P_w(C) are needed to compute Cost_One(T) and Cost_All(T). We use aggregate knowledge of previous user behavior, captured in a query workload, to estimate these probabilities automatically. Computing the SHOWTUPLES probability: when a user explores a non-leaf node C, there are two choices, SHOWCAT or SHOWTUPLES. Let SA(C) denote the subcategorizing attribute of C.

Cost Estimation SHOWCAT Probability Let Wi be a workload query posed by user Ui. If Ui specified a selection condition on SA(C) in Wi, the condition indicates that the user is interested in only a few values of SA(C); if there is no condition, the user is interested in all values of SA(C).

Cost Estimation Let N_Attr(A) be the number of queries in the workload that contain a selection condition on attribute A, and let N be the total number of queries in the workload. N_Attr(SA(C))/N is then the fraction of users interested in only a few values of SA(C). SHOWCAT probability of C = N_Attr(SA(C))/N. SHOWTUPLES probability of C = 1 - N_Attr(SA(C))/N.
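Both probabilities reduce to simple workload counts. A sketch, assuming n_attr is a precomputed dict mapping each attribute to N_Attr(A) and n_total is N (hypothetical names):

```python
def showcat_probability(node, n_attr, n_total):
    """SHOWCAT probability of C = N_Attr(SA(C)) / N."""
    return n_attr.get(node.sub_attr, 0) / n_total

def showtuples_probability(node, n_attr, n_total):
    """SHOWTUPLES probability of C = 1 - N_Attr(SA(C)) / N."""
    return 1 - showcat_probability(node, n_attr, n_total)
```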

Cost Estimation Computing the Exploration Probability P(C): P(C) = the probability that the user explores category C = P(user explores C | user examines the label of C) = P(user explores C) / P(user examines the label of C). The user examines the label of C only if she explores C's parent C' and chooses SHOWCAT for C'. Therefore:

P(C) = P(user explores C) / [ P(user explores C') * P(user chooses SHOWCAT for C' | user explores C') ]

where P(user chooses SHOWCAT for C' | user explores C') is the SHOWCAT probability of C', namely N_Attr(SA(C'))/N.

Cost Estimation A user explores C if, upon examining the label of C, she thinks that one or more tuples in tset(C) may be of interest to her. The ratio P(user explores C) / P(user explores C') is therefore simply the probability that the user is interested in the predicate label(C). So:

P(C) = P(user interested in predicate label(C)) / ( N_Attr(SA(C'))/N )

Cost Estimation Let CA(C) be the categorizing attribute of C (note that CA(C) = SA(C')). If a query's selection condition on CA(C) overlaps with the predicate label(C), it means that its user Ui is interested in label(C). Let N_Overlap(C) be the number of queries in the workload whose selection condition on CA(C) overlaps with label(C); then P(user interested in predicate label(C)) = N_Overlap(C)/N. Substituting into the expression above, we get:

P(C) = N_Overlap(C) / N_Attr(CA(C))
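A sketch of the final estimate; n_overlap_c and n_attr_ca are the two workload counts N_Overlap(C) and N_Attr(CA(C)), and the helper shows one way to count overlaps for a categorical label (names are illustrative assumptions):

```python
def exploration_probability(n_overlap_c, n_attr_ca):
    """P(C) = N_Overlap(C) / N_Attr(CA(C))."""
    return n_overlap_c / n_attr_ca if n_attr_ca else 0.0

def n_overlap_categorical(label_values, workload_in_clauses):
    """Count workload queries whose IN clause on CA(C) shares at least one
    value with label(C). workload_in_clauses: one set of values per query."""
    label = set(label_values)
    return sum(1 for clause in workload_in_clauses if clause & label)
```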

Algorithms

Now that we can calculate the information overload cost for a given tree, we could enumerate all possible category trees on R and choose the one with minimum cost. This would give the cost-optimal tree, but it is expensive in the sense that a very large number of categorization trees are possible. To solve this problem we need to: 1. Eliminate a subset of relatively unattractive attributes without considering any of their partitionings. 2. For every attribute selected above, obtain a good partitioning efficiently instead of enumerating all possible partitionings.

Algorithms Reducing the Choices of Categorizing Attribute: 1) Eliminate the uninteresting attributes using the following simple heuristic: if an attribute A occurs in less than a fraction x of the queries in the workload, i.e., N_Attr(A)/N < x, we eliminate A. The threshold x needs to be specified by the system designer/domain expert. 2) For attribute elimination, we preprocess the workload and maintain, for each potential categorizing attribute A, the number N_Attr(A) of queries in the workload that contain a selection condition on A. A sketch follows.
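Assuming the same precomputed n_attr map as before, the elimination step is a one-liner (x is the designer-chosen threshold):

```python
def candidate_attributes(n_attr, n_total, x):
    """Keep attribute A only if N_Attr(A)/N >= x; the rest are eliminated
    as uninteresting per the heuristic above."""
    return [a for a, count in n_attr.items() if count / n_total >= x]
```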

Algorithms Partitioning for Categorical Attributes This paper considers only single-value partitionings of R: one category per attribute value. Consider the case where the user query Q contains a selection condition of the form "A IN {v1, ..., vk}" on A, e.g., Neighborhood IN {'Redmond', 'Bellevue', ...}. Among the single-value partitionings, we want to choose the one with minimum cost. Since the set of categories is identical in all possible single-value partitionings, the only factor that affects the cost of a single-value partitioning is the order in which the categories are presented to the user.

Algorithms Cost_All(T) is not affected by the ordering, so we consider only Cost_One(T). Cost_One(T) is minimized when the categories are presented in increasing order of 1/P(Ci) + Cost_One(Ci); as a heuristic, we present the categories in decreasing order of P(Ci). Here P(Ci) = N_Overlap(Ci)/N_Attr(A), and since Ci corresponds to a single value vi, N_Overlap(Ci) = occ(vi), the number of queries in the workload whose selection condition on A contains vi in its IN clause. To obtain the partitioning, we simply sort the values in the IN clause in decreasing order of occ(vi), as in the sketch below.
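The resulting ordering step, sketched with a hypothetical occ dict mapping each IN-clause value v to occ(v):

```python
def order_categorical_partitioning(in_values, occ):
    """Single-value partitioning of a categorical attribute: one category per
    value in the IN clause, presented in decreasing order of occ(v)."""
    return sorted(in_values, key=lambda v: occ.get(v, 0), reverse=True)
```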

Algorithms

Partitioning for Numeric Attributes Let Vmin and Vmax be the minimum and maximum values that the tuples in R take in attribute A, and consider a point v with Vmin < v < Vmax. If a significant number of query ranges in the workload begin or end at v, it is a good split point: the workload suggests that most users would be interested in just one of the resulting buckets. If none of the query ranges begin or end at v, then v is not a good split point. Hence, to partition the range into m buckets, the (m-1) split points should be selected among points where queries begin or end. The split points are not the only factor determining cost; the other factor is the number of tuples in each bucket, so this heuristic alone will not give the best partitioning in terms of cost.

Algorithms Consider the point v again (Vmin < v < Vmax), and let start_v and end_v denote the numbers of query ranges in the workload starting and ending at v, respectively. We use SUM(start_v, end_v) as the "goodness score" of the point v, as in the sketch below.
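A simplified sketch of the score-and-pick step only; as noted above, it ignores the per-bucket tuple counts, so it approximates rather than reproduces the paper's procedure (start_count and end_count are hypothetical dicts of query-range endpoints):

```python
def choose_split_points(candidates, start_count, end_count, m):
    """Rank candidate points v in (Vmin, Vmax) by the goodness score
    SUM(start_v, end_v) and keep the (m-1) best as bucket boundaries."""
    ranked = sorted(
        candidates,
        key=lambda v: start_count.get(v, 0) + end_count.get(v, 0),
        reverse=True,
    )
    return sorted(ranked[:m - 1])   # return boundaries in increasing order
```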

Algorithms Multilevel Categorization:
1. For multilevel categorization, for each level l we need to determine the categorizing attribute A and, for each category C at level (l-1), partition the domain of values of A in tset(C) so that the information overload is minimized.
2. The algorithm creates the categories level by level: all categories at level (l-1) are created and added to tree T before any category at level l. Let S denote the set of categories at level (l-1) with more than M tuples.
3. For each candidate attribute A, we partition each category C in S using the partitioning schemes for categorical and numeric attributes above.
4. We compute the cost of the attribute-partitioning combination for each candidate attribute A and select the attribute α with minimum cost. For each category C in S, we add the partitions of C based on α to T.
5. This completes the node creation at level l. A sketch of one such step follows.
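One level of the algorithm might look like the following sketch; partition_fn stands in for the categorical/numeric partitioning routines above and cost_fn for the cost model of the chosen scenario (all names hypothetical, simplified from the paper's description):

```python
def build_level(prev_level, candidates, M, partition_fn, cost_fn):
    """Create all level-l categories from the level-(l-1) categories.
    prev_level: list of CategoryNode at level l-1; candidates: attributes
    not yet used; M: minimum tuple count for further partitioning."""
    S = [c for c in prev_level if len(c.tset) > M]
    if not S or not candidates:
        return []                              # nothing left to partition
    best = None                                # (cost, attribute, partitions)
    for a in candidates:
        parts = {c: partition_fn(c, a) for c in S}
        total = sum(cost_fn(c, parts[c]) for c in S)
        if best is None or total < best[0]:
            best = (total, a, parts)
    _, alpha, parts = best
    candidates.remove(alpha)                   # each attribute is used at most once
    for c in S:
        c.sub_attr = alpha
        c.children = parts[c]
    return [child for c in S for child in c.children]
```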

Experimental Evaluation The evaluation covers the following: 1. Evaluate the accuracy of the cost models in modeling information overload. 2. Evaluate the cost-based categorization algorithms and compare them with categorizations that do not consider such cost models. Database: MSN House & Home; M = 20. All experiments were conducted on a Compaq Evo W workstation with 768 MB RAM, running Windows XP. Dataset: for both experiments, a single table called ListProperty containing 1.7 million rows. The workload comprises 176,262 query strings representing searches conducted by home buyers on the MSN House & Home website. In both studies, the paper's cost-based technique is compared to two baselines, No-Cost and Attr-Cost. No-Cost uses the same level-by-level categorization but picks the categorizing attribute at each level arbitrarily (without replacement).

Experimental Evaluation Attr-Cost: Attr-cost’ technique selects the attribute with the lowest cost as the categorizing attribute at each level but considers only those partitioning considered by the ‘No cost’ technique. Simulated User-Study Due to the difficulty of conducting a large- scale real-life user study, we develop a novel way to simulate a large scale user study. We pick a subset of 100 queries from the workload and imagine them as user explorations, Workload Query W is referred to as Synthetic Exploration. estimated (average) cost =Cost All (T) actual cost = Cost All (W,T) of exploration 8 Mutually disjoint subsets of 100 synthetic explorations are considered. Figure is Correlation between actual cost and Estimated Cost.

Experimental Evaluation Figure on the left: cost of the various techniques for the 8 subsets. Figure on the right: Pearson's correlation between the estimated cost and the actual cost.

Experimental Evaluation Real-Life User Study Tasks:
1. Any neighborhood in Seattle/Bellevue, price < 1 million.
2. Any neighborhood in Bay Area – Peninsula/San Jose, price between 300K and 500K.
3. Selected neighborhoods in NYC – Manhattan, Bronx, price < 1 million.
4. Any neighborhood in Seattle/Bellevue, price between 200K and 400K, bedroom count between 3 and 4.

Experimental Evaluation Real-Life User Study Figure on the left: average cost (number of items examined until the user finds all the relevant tuples) of the various techniques. Figure on the right: average number of relevant tuples found by users for the various techniques.

Experimental Evaluation Real-Life User Study Figure on the left: average normalized cost (items examined per relevant tuple found) of the various techniques. Figure on the right: average cost (until the user finds the first relevant tuple) of the various techniques.

Experimental Evaluation Real-Life User Study Figure on the left: results of the post-study survey. Figure on the right: average execution time of the cost-based categorization algorithm.

Conclusion This paper gives a solution to the information overload problem by proposing automatic categorization of query results. The solution is to dynamically generate a labeled, hierarchical category structure; the user can determine whether a category is relevant simply by examining its label and explore only the relevant categories, thereby reducing information overload.

Thank You