QUERY CLUSTERING Antara Ghosh, Jignashu Parikh
OVERVIEW: Problem Definition; Motivation; Issues; Our Approach (Notion of Similarity, Algorithm for Computing Similarity, Clustering, Classification); Experimental Results; Future Work; Conclusion
PROBLEM DEFINITION: Define the query space and find clusters of “similar” queries. Build a framework for: finding similarity; clustering queries; classification of new queries.
MOTIVATION: Complex queries are commonplace. The plan space grows exponentially with the number of joins and relations involved. Optimisation times in the range of 8-15 seconds are reported by Roy et al. The query space is considerably smaller than the plan space. Short-circuit the optimiser!
ISSUES: Establishing the notion of similarity: dealing with “different similarities”. Clustering queries based on similarity: the query space is huge (infinite?); how many clusters are there? Classification: searching through a large number of clusters.
Notion of Similarity: “Syntactico-semantic” similarity, given that the tables are comparable in original size and estimated result size. Checking syntactico-semantic similarity: whether each table is accessed through an index or not; the types of operations each table is involved in (equality or non-equality); the number of predicates the corresponding tables are involved in. Mapping compatible tables with respect to original and estimated result size.
Notion of Similarity (contd.): The select and join predicates of the queries must be compatible. Total projected attribute size. A special check for each table on whether all the attributes accessed in it are indexed.
Definitions
[Figure: example graph with labels A-F and 1-6]
Degree of a table: the number of predicates the table is involved in.
Index type counts of a table: the number of 2-way and 1-way indices on each predicate of this table.
Operation type count: the number of equi-joins and non-equi-joins the table is involved in.
Index flag: set when all the selection and projection attributes of a table are indexed.
Validity of mapping: map two tables only if they have the same value for all the variables above.
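The definitions above can be read as a per-table signature, with a mapping between two tables valid only when the signatures agree. A minimal sketch under that reading; the class name `TableSignature`, its field names, and `is_valid_mapping` are hypothetical and not taken from the slides:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSignature:
    """Hypothetical container for the per-table variables defined on the slide."""
    degree: int            # number of predicates the table is involved in
    two_way_indices: int   # index type count: 2-way indices
    one_way_indices: int   # index type count: 1-way indices
    equi_joins: int        # operation type count: equi-joins
    non_equi_joins: int    # operation type count: non-equi-joins
    index_flag: bool       # True when all accessed attributes are indexed


def is_valid_mapping(a: TableSignature, b: TableSignature) -> bool:
    # Dataclass equality compares every field, which matches the rule
    # "same value for all the variables above".
    return a == b
```

Because the dataclass is frozen, a signature is hashable and could also serve as a key for grouping tables that are candidates for mapping.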
Algorithm
Applies only if the number of tables in the two queries is the same.
1. Match semantics at query level: check that the total degree, index counts, operation type counts, and index flags are the same over all the tables.
2. For every valid mapping among the tables, find the distance between each pair of mapped tables using the formula:
   distance = 0.7 * (|s1 - s2| / max(s1, s2)) + 0.3 * (|so1 - so2| / max(so1, so2))
   where si = original size of table i = original cardinality * original row size
   and soi = estimated result size of table i = rf * original cardinality * projected row size.
3. Find the mapping with the minimal sum of such distances. If that sum is less than a chosen threshold (a "magic number"), the queries are called similar; otherwise they are not.
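The distance formula and the search for the cheapest valid mapping can be sketched as follows. This is an illustration only: each table is assumed to be a tuple `(signature, s, so)` with positive sizes, the signature is any comparable value standing in for the per-table variables, the function names are hypothetical, and the slides give no threshold value, so it is left to the caller. The brute-force search over permutations folds the query-level check of step 1 into the per-mapping validity test.

```python
from itertools import permutations


def table_distance(s1, so1, s2, so2):
    # Weighted relative difference of original size (0.7) and
    # estimated result size (0.3), as in the slide's formula.
    return (0.7 * abs(s1 - s2) / max(s1, s2)
            + 0.3 * abs(so1 - so2) / max(so1, so2))


def min_mapping_distance(tables_a, tables_b):
    """Return the smallest summed distance over all valid mappings, or None.

    Each table is a tuple (signature, s, so); this layout is an assumption.
    """
    if len(tables_a) != len(tables_b):
        return None
    best = None
    for perm in permutations(tables_b):
        # A mapping is valid only if every mapped pair has equal signatures.
        if any(a[0] != b[0] for a, b in zip(tables_a, perm)):
            continue
        total = sum(table_distance(a[1], a[2], b[1], b[2])
                    for a, b in zip(tables_a, perm))
        if best is None or total < best:
            best = total
    return best


def are_similar(tables_a, tables_b, threshold):
    # The threshold is the slide's "magic number"; its value is not given there.
    d = min_mapping_distance(tables_a, tables_b)
    return d is not None and d < threshold
```

For example, two queries whose tables have matching signatures and identical sizes get a distance of 0, while halving the original size of one table contributes 0.7 * 0.5 = 0.35.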
Time complexity: The mapping problem reduces from all possible combinations of table joins and join types in a query to a mapping of tables between two queries. Table mapping is done at the end of the algorithm, once the queries are known to be syntactico-semantically compatible. Tables are partitioned according to their degrees, and mapping is done only among tables of the same partition. Time complexity = p1! + p2! + p3! + ... + pk!, where p1 + p2 + ... + pk = N and N = number of tables, whereas considering all possible mappings would be of order N!.
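A small sketch of the count above, comparing the partitioned cost with the unpartitioned N!. The function names are hypothetical. The sum (rather than a product) of factorials presumably reflects that each partition can be minimised on its own, since the total distance is a sum of per-table distances.

```python
from collections import Counter
from math import factorial


def partitioned_cost(degrees):
    # Group tables by degree; each partition of size p contributes p! mappings.
    partition_sizes = Counter(degrees).values()
    return sum(factorial(p) for p in partition_sizes)


def unpartitioned_cost(degrees):
    # Without partitioning, every permutation of the N tables is a candidate.
    return factorial(len(degrees))
```

With six tables of degrees 1, 1, 2, 2, 2, 3 the partitions have sizes 2, 3 and 1, giving 2! + 3! + 1! = 9 mappings instead of 6! = 720.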
Similar-looking queries, different execution plans:
select resume from emp_resume where empno>'000010'
select empno from emp_resume where empno>'000010'
Different-looking queries, similar plans!
select * from employee as a, emp_act as b, emp_photo as c where a.empno=b.empno and b.empno=c.empno and a.empno>'000000' and b.empno<'000400' and c.empno between '000010' and '000390'
select a.firstnme, a.lastname, b.projno, c.resume from employee as a, emp_act as b, emp_resume as c where a.empno=b.empno and b.empno=c.empno
Clustering Queries: Given a (usually large) set of queries, generate clusters efficiently. We cannot afford several passes over the database of queries. A single-pass algorithm, similar to the leader algorithm (Hartigan 1975), is proposed. BIRCH could also be used.
Algorithm for Clustering
Inputs: a set of queries Q1 to Qn and a threshold T.
Output: k clusters.
Notation: k = number of clusters so far; Ri = representative of cluster Ci.
1. Start with Q1 and make a cluster C1 with Q1 as its representative R1; set k = 1.
2. For i = 2 to n:
   If Qi is similar (within threshold T) to any one of R1 to Rk, say Rj, add Qi to cluster Cj.
   Otherwise, create one more cluster Ck+1 with representative Rk+1 = Qi, and set k = k + 1.
3. End.
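The single-pass procedure above can be sketched in a few lines. The function name `leader_cluster` and the `is_similar` callback are assumptions for illustration; the callback is expected to wrap whatever similarity test and threshold T are in use. When a query is similar to several representatives, this sketch picks the first match, a choice the slide leaves open.

```python
def leader_cluster(queries, is_similar):
    """Single-pass leader-style clustering.

    Returns (representatives, clusters); clusters[j] holds the members of the
    cluster whose representative is representatives[j].
    """
    representatives = []
    clusters = []
    for q in queries:
        for j, rep in enumerate(representatives):
            if is_similar(q, rep):
                clusters[j].append(q)   # join the first matching cluster
                break
        else:
            # No representative matched: q starts a new cluster.
            representatives.append(q)
            clusters.append([q])
    return representatives, clusters
```

As a toy usage, with numbers standing in for queries and `lambda a, b: abs(a - b) < 3` as the similarity test, the input `[1, 2, 10, 11, 1.5, 30]` yields three clusters led by 1, 10 and 30. Each query is compared only against the current representatives, so the data is read once.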
Classification: A host of classification methods can be applied; decision-tree-based classification is used. Why? Most of the features are deterministic, so a rule-based system suits well. Storage is minimal: there is no need to store the clusters once we have the decision trees. The C5.0 decision tree induction program is used: it uses training data to generate decision rules; the decision rules split the query space; the hierarchy of rules is ordered by decreasing information gain from top to bottom.
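To make the information-gain ordering concrete, here is a minimal sketch of entropy and information gain for a categorical feature. It illustrates the criterion named on the slide and is not C5.0's actual implementation; the function names and the toy feature values are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    # Shannon entropy (in bits) of the class distribution.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    # Entropy before the split minus the size-weighted entropy after it.
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

A feature that separates two equally sized clusters perfectly has a gain of 1 bit, while a feature unrelated to the cluster label has a gain of 0; a tree built this way would test the former nearer the root.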
GENERATING DECISION TREES: A set of 400 queries was used for training and testing, over 6 clusters, with 50 attributes/features per query. The clusters were as follows: 1 (82) and 2 (72): single-relation queries with the same semantics but different size ratios; 3 (30): two-relation/single-join queries; 4 (107) and 5 (73): three-relation/double-join query clusters differing in semantics; 6 (35): four-relation/triple-join queries. Leave-one-out cross-validation was performed.
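Leave-one-out cross-validation trains on all samples but one and tests on the one held out, repeating this for every sample. A generic sketch follows; the `train` and `predict` callables are placeholders for any classifier, and the majority-class classifier is a toy stand-in, not the decision tree used in the experiments.

```python
from collections import Counter


def leave_one_out_accuracy(samples, labels, train, predict):
    # For each sample: train on the rest, then test on the held-out sample.
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def train_majority(train_x, train_y):
    # Toy classifier: remember the most frequent training label.
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, sample):
    # Toy classifier: always predict the remembered label.
    return model
```

With n samples the classifier is trained n times, which is why the leave-one-out run costs far more than a single induction over the full training set.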
C5.0 Results: Classification accuracy: training set 94.7%; mean leave-one-out accuracy 88.2%. Time taken (induction + classification): training set (1+400): 0.1 sec, i.e. 0.00025 sec/query; leave-one-out (400+400): 9.3 seconds. We can surely beat the query optimiser! Most of the errors are in classes 1 and 2; the difference is much fuzzier there.
Future Work: Sub-query matching. Size-based mapping: calibrating weights and thresholds. Dealing with more sophisticated queries. Extensive testing for clustering and classification.
Conclusions: The proposed framework is completely scalable: the concept of similarity can be applied to any number of relations. The time complexity of the algorithm has been tuned for queries involving a large number of relations. The clustering and classification schemes selected are meant for dealing with large data. The proposed similarity metric is more general than using DB2 for clustering. The query space may be smaller than the plan space, but we are still dealing with “infinity”. Thorough testing on a large real-life query set is needed to test the mettle of the algorithm. If the query space being infinite is a dark cloud, the silver lining is this: if the optimiser can handle the plan space, we should be able to deal with the query space!