Antara Ghosh, Jignashu Parikh
QUERY CLUSTERING
Antara Ghosh, Jignashu Parikh

OVERVIEW
- Problem Definition
- Motivation
- Issues
- Our Approach
  - Notion of Similarity
  - Algorithm for Computing Similarity
  - Clustering
  - Classification
- Experimental Results
- Future Work
- Conclusion

PROBLEM DEFINITION
Define a query space and find clusters of "similar" queries.
Build a framework for:
- Finding similarity
- Clustering queries
- Classifying new queries

MOTIVATION
- Complex queries are commonplace.
- The plan space grows exponentially with the number of joins and relations involved.
- Optimisation times in the range of 8-15 seconds have been reported by Roy et al.
- The query space is considerably smaller than the plan space.
- Short-circuit the optimiser!

ISSUES
- Establishing the notion of similarity: dealing with "different similarities"
- Clustering queries based on similarity: the query space is huge (infinite?); how many clusters are there?
- Classification: searching through a large number of clusters

Notion of Similarity
"Syntactico-semantic" similarity, given that the tables are comparable in both original size and estimated result size.
Checking syntactico-semantic similarity:
- Whether each table is accessed through an index or not
- The types of operations each table is involved in (equality vs. non-equality)
- The number of predicates the corresponding tables are involved in
- Mapping compatible tables with respect to original and estimated result size

Notion of Similarity (contd.)
- The select and join predicates of the queries must be compatible.
- The total projected attribute sizes must be comparable.
- A special check for each table: whether all the attributes accessed in it are indexed.

Definitions
(Figure: example join graph over six tables A-F, omitted.)
- Degree of a table: the number of predicates the table is involved in.
- Index type counts of a table: the number of 2-way and 1-way indices on each predicate of the table.
- Operation type count: the number of equi-joins and non-equi-joins the table is involved in.
- Index flag: set when all the selection and projection predicates of a table are on indexed attributes.
- Validity of mapping: two tables may be mapped only if they have the same value for all of the variables above.
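The per-table variables above can be collected into a small signature. A minimal Python sketch (the class and field names are our own illustrations, not part of the original system):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TableSignature:
    """Per-table features used to decide whether two tables may be mapped."""
    degree: int            # number of predicates the table is involved in
    two_way_indices: int   # count of 2-way indices on the table's predicates
    one_way_indices: int   # count of 1-way indices on the table's predicates
    equi_joins: int        # number of equi-joins the table participates in
    non_equi_joins: int    # number of non-equi-joins the table participates in
    index_flag: bool       # True if all accessed attributes are indexed

def may_map(a: TableSignature, b: TableSignature) -> bool:
    """A mapping between two tables is valid only if every feature matches."""
    return a == b
```

Size-based distance is computed afterwards, only over pairs that pass this check.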

Algorithm
If the number of tables in the two queries is the same:
1. Match semantics at the query level: check that the total degree, index counts, operation type counts, and index flags are the same over all the tables.
2. For every valid mapping among the tables, find the distance between each pair of mapped tables using the formula:
   distance = 0.7 * |s1 - s2| / max(s1, s2) + 0.3 * |so1 - so2| / max(so1, so2)
   where si = original size of table i = original cardinality * original row size
   and soi = estimated result size of table i = rf * original cardinality * projected row size.
3. Find the mapping with the minimal sum of such distances.
4. If that sum is less than a magic threshold, the queries are called similar; otherwise they are not.
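The distance formula translates directly into code. A sketch (function names are illustrative; s and so follow the definitions above):

```python
def table_distance(s1: float, so1: float, s2: float, so2: float) -> float:
    """Weighted relative difference between two mapped tables.

    s_i  = original size  = original cardinality * original row size
    so_i = estimated result size = rf * original cardinality * projected row size
    """
    return (0.7 * abs(s1 - s2) / max(s1, s2)
            + 0.3 * abs(so1 - so2) / max(so1, so2))

def queries_similar(mapped_pairs, threshold: float) -> bool:
    """Sum per-table distances over a mapping; similar if under the threshold.

    mapped_pairs: iterable of (s1, so1, s2, so2) tuples, one per mapped pair.
    """
    return sum(table_distance(*p) for p in mapped_pairs) < threshold
```

Both terms are relative differences in [0, 1], so the total per-pair distance is also bounded by 1, which makes a single global threshold meaningful.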

Time Complexity
- The mapping problem reduces from all possible combinations of table joins and join types within a query to a mapping of tables between two queries.
- Table mapping is done at the end of the algorithm, once the queries are known to be syntactico-semantically compatible.
- Tables are partitioned according to their degrees, and mapping is done only among tables of the same partition.
- Time complexity = p1! + p2! + ... + pk!, where p1 + p2 + ... + pk = N and N = number of tables. Without partitioning, the number of mappings would be of order N!.
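Because the total distance is a sum over mapped tables, each degree-partition can be optimised independently, which is where the p1! + p2! + ... + pk! bound comes from. A brute-force sketch (the dist argument stands in for the table-distance formula above; names are illustrative):

```python
from itertools import permutations

def best_mapping_cost(partitions, dist):
    """Minimal total distance, optimising each degree-partition on its own.

    partitions: list of (tables_q1, tables_q2) pairs, one per degree class,
                with len(tables_q1) == len(tables_q2).
    dist:       distance function between two tables.

    Trying permutations within each partition costs p1! + ... + pk!
    evaluations, instead of N! for permutations over all N tables at once.
    """
    total = 0.0
    for left, right in partitions:
        # Best assignment within this partition, found by exhaustive search.
        total += min(
            sum(dist(a, b) for a, b in zip(left, perm))
            for perm in permutations(right)
        )
    return total
```

For the small per-partition sizes typical of real queries, exhaustive search within each partition is cheap.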

Similar-Looking Queries, Different Execution Plans
select resume from emp_resume where empno>'000010'
select empno from emp_resume where empno>'000010'

Different-Looking Queries, Similar Plans!
select * from employee as a, emp_act as b, emp_photo as c
where a.empno=b.empno and b.empno=c.empno
and a.empno>'000000' and b.empno<'000400'
and c.empno between '000010' and '000390'

select a.firstnme, a.lastname, b.projno, c.resume
from employee as a, emp_act as b, emp_resume as c
where a.empno=b.empno and b.empno=c.empno

Clustering Queries
- Given a (usually large) set of queries, generate clusters efficiently.
- We cannot afford several passes over the database of queries.
- A single-pass algorithm, similar to the leader algorithm (Hartigan, 1975), is proposed.
- Even BIRCH can be used.

Algorithm for Clustering
Inputs: a set of queries Q1 to Qn and a threshold T
Output: k clusters
1. Start with Q1 and make a cluster C1 with Q1 as its representative. Let k = the number of clusters and Ri = the representative of cluster Ci.
2. For i = 2 to n:
   - If Qi is similar to any one of R1 to Rk, say Rj, add Qi to cluster Cj.
   - Otherwise, create a new cluster Ck+1 with representative Rk+1 = Qi.
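The steps above can be sketched as a single-pass leader algorithm in a few lines of Python (the similar predicate stands in for the query-similarity test; names are illustrative):

```python
def leader_cluster(queries, similar):
    """Single-pass leader clustering (after Hartigan, 1975).

    queries: iterable of query descriptors, processed in order.
    similar: predicate similar(query, representative) -> bool.
    Returns a list of clusters; the first member of each is its representative.
    """
    clusters = []
    for q in queries:
        for cluster in clusters:
            if similar(q, cluster[0]):   # compare only against the representative
                cluster.append(q)
                break
        else:
            clusters.append([q])         # q founds a new cluster as its leader
    return clusters
```

Each query is examined exactly once, so the cost is one pass over the query database times the number of representatives, which is what makes the scheme viable for large query logs.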

Classification
A host of classification methods can be applied. Why decision-tree-based classification?
- Most of the features are deterministic, so a rule-based system suits well.
- Storage is minimal: there is no need to store the clusters once we have the decision tree.
The C5.0 decision tree induction program is used:
- It uses training data to generate decision rules.
- The decision rules split the query space.
- It chooses the hierarchy of rules by decreasing information gain from top to bottom.
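C5.0 itself is a proprietary program, but the criterion it orders rules by, information gain, is easy to sketch in pure Python (function and parameter names are our own illustrations):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a multiset of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Reduction in label entropy from splitting the rows on one feature.

    rows:    list of dicts mapping feature name -> value.
    labels:  class label per row.
    feature: name of the feature to split on.
    """
    n = len(labels)
    gain = entropy(labels)
    # Group the labels by the value this feature takes in each row.
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[feature], []).append(y)
    # Subtract the weighted entropy of each resulting subset.
    for subset in groups.values():
        gain -= (len(subset) / n) * entropy(subset)
    return gain
```

A tree inducer places the highest-gain feature at the root and recurses, which is the "decreasing information gain from top to bottom" hierarchy the slide refers to.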

GENERATING DECISION TREES
- A set of 400 queries was used for training + testing over 6 clusters.
- 50 attributes/features per query.
- The clusters were as follows:
  - 1 (82 queries) and 2 (72): single-relation queries with the same semantics but different size ratios
  - 3 (30): two-relation/single-join queries
  - 4 (107) and 5 (73): three-relation/double-join query clusters differing in semantics
  - 6 (35): four-relation/triple-join queries
- Leave-one-out cross-validation was performed.

C5.0: Results
Classification accuracy:
- Training set: 94.7%
- Mean leave-one-out accuracy: 88.2%
Time taken (induction + classification):
- Training set (1+400): 0.1 sec, i.e. about 0.00025 sec/query
- Leave-one-out (400+400): 9.3 seconds
We can surely beat the query optimiser!
Most of the errors are in classes 1 and 2, where the difference is much fuzzier.

Future Work
- Sub-query matching
- Size-based mapping: calibrating weights and thresholds
- Dealing with more sophisticated queries
- Extensive testing of clustering and classification

Conclusions
- The proposed framework is completely scalable: the concept of similarity can be applied to any number of relations.
- The time complexity of the algorithm has been tuned for queries involving a large number of relations.
- The clustering and classification schemes selected are meant for dealing with large data.
- The proposed metric of similarity is more general than using DB2 for clustering.
- The query space may be smaller than the plan space, but we are still dealing with "infinity".
- Thorough testing using a large real-life query set is needed to test the mettle of the algorithm.
- If the query space being infinite is a dark cloud, the silver lining is this: if the optimiser can handle the plan space, we should be able to deal with the query space!