Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance [1]
Pirooz Chubak
May 22, 2008

Motivation
Selectivity estimation of approximate string matching queries
Applications
–Misspelling correction/suggestion
–Data integration and data cleaning
–Query optimization (generating query plans)

Approximate String Matching
String similarity measures
–Edit distance
–Hamming distance
–Jaccard similarity coefficient
Edit distance
–Minimum number of edit operations (insertion, deletion, replacement) needed to convert one string into the other
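The edit distance definition above can be made concrete with the standard dynamic-programming recurrence. This is a minimal sketch, not code from the presentation; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and replacements turning a into b."""
    # prev[j] is the distance between the already processed prefix of a
    # and the first j characters of b; initially that prefix is empty.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace ca by cb (free if equal)
            ))
        prev = curr
    return prev[-1]
```

For the low thresholds of interest here (1 to 3), a string such as "abed" is within distance 1 of "abcd" through a single replacement.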

Short Identifying Substring
SIS by Chaudhuri et al. [2]
–A string s usually has a short substring s’ such that if an attribute value contains s’, it almost always contains s
–Thus, approximate the selectivity of a long query string by that of its shorter identifying substring

Related Work
SEPIA [3]
–Clusters similar strings
–Selects a pivot for each cluster
–Captures the edit distance distribution with histograms
–For each query, visits all the clusters and estimates the number of strings within the distance threshold

Problem Statement
Given a query string s_q and a bag of strings DB, estimate the size of the answer set
Interested in low edit distance thresholds (1-3)

Basic Definitions
Q-gram
–Any string of length q
N-gram table
–Frequencies of all q-grams for q = 1…N
Ans(s_q, iDjImR) = set of strings s’ such that s_q can be converted to s’ with i deletions, j insertions and m replacements
Ans(s_q, k) = set of strings s’ obtained from s_q with exactly k edit operations

Examples
Ans(“abcd”,1R) = {“?bcd”, “a?cd”, “ab?d”, “abc?”}
Alphabet for extended Q-grams =
3-gram table for “beau” contains frequencies for
–1-grams (b, e, a, u)
–2-grams (#b, be, ea, au, u$)
–3-grams (#be, bea, eau, au$)
Extended 3-gram table also contains frequencies for
–2-grams (?b, ?e, ?u, b?, e?, a?, u?, ??, #?, ?$)
–3-grams (?ea, #?e, ??$, etc.)
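The plain (non-extended) N-gram table from the “beau” example can be sketched as follows. The names `qgrams` and `ngram_table` are hypothetical, and the padding rule (one `#` before and one `$` after the string, applied only for q >= 2) is inferred from the listed 1-, 2- and 3-grams rather than stated in the slides. Wildcard entries of the extended table are not covered by this sketch.

```python
from collections import Counter
from typing import List

def qgrams(s: str, q: int) -> List[str]:
    """All q-grams of s. Assumption: for q >= 2 the string is wrapped as '#' + s + '$'."""
    padded = s if q == 1 else "#" + s + "$"
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def ngram_table(strings: List[str], n: int) -> Counter:
    """Frequencies of all q-grams, q = 1..n, over a bag of strings."""
    table = Counter()
    for s in strings:
        for q in range(1, n + 1):
            table.update(qgrams(s, q))
    return table
```

Calling `ngram_table(["beau"], 3)` yields one entry per gram listed in the example, each with frequency 1.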

Replacement semi-lattice
Assume only replacements are allowed
E.g. Ans(“abcd”,2R)
–Possible answers = ab??, a?c?, ?bc?, a??d, ?b?d, ??cd
Find the value of |Ans(“abcd”,2R)| using S_1 = ab??, …, S_6 = ??cd

Replacement semi-lattice (Cont.)
Get the values of the intersections from this table and plug them into the formula for |Ans(“abcd”,2R)|
Figure: Semi-lattice for Ans(“abcd”,2R)
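One way to read the semi-lattice computation is as inclusion-exclusion over the sets S_1, …, S_6: the intersection of two wildcard patterns is again a pattern (for example ab?? and a?c? intersect to abc?), so every term is a frequency lookup. The sketch below assumes that `?` matches any single character and, in place of a real extended q-gram table, counts matches directly in a small bag of strings. All function names are hypothetical.

```python
from itertools import combinations
from typing import List, Optional

def base_strings(s: str, k: int) -> List[str]:
    """Patterns obtained by putting the wildcard '?' at k positions of s."""
    return ["".join("?" if i in pos else c for i, c in enumerate(s))
            for pos in combinations(range(len(s)), k)]

def intersect(p: str, r: str) -> Optional[str]:
    """Pattern matching exactly the strings matched by both p and r (equal lengths)."""
    chars = []
    for a, b in zip(p, r):
        if a == "?":
            chars.append(b)
        elif b == "?" or a == b:
            chars.append(a)
        else:
            return None  # conflicting fixed characters: empty intersection
    return "".join(chars)

def freq(pattern: str, bag: List[str]) -> int:
    """Stand-in for a table lookup: number of strings in the bag matching the pattern."""
    return sum(len(s) == len(pattern) and
               all(a in ("?", b) for a, b in zip(pattern, s))
               for s in bag)

def union_size(patterns: List[str], bag: List[str]) -> int:
    """|S_1 ∪ ... ∪ S_n| by inclusion-exclusion over pattern intersections."""
    total = 0
    for r in range(1, len(patterns) + 1):
        for subset in combinations(patterns, r):
            node = subset[0]
            for p in subset[1:]:
                node = intersect(node, p)
                if node is None:
                    break
            if node is not None:
                total += (-1) ** (r + 1) * freq(node, bag)
    return total
```

On a bag such as `["abcd", "abcx", "abxy", "xbcd", "xycd", "wxyz", "abc"]`, `union_size(base_strings("abcd", 2), bag)` agrees with a direct count of the strings that match at least one of the six patterns.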

General Formulas
Generalize the above idea to find |Ans(s_q,kR)|
The general formula for deletions is trivial: it can be shown to always be the sum of the frequencies of the level-0 nodes
The general case for insertions can be very complex; only up to 3 insertions are of interest

Estimate selectivity
General idea
–Group Ans(s_q,k) by the length of the strings (l-k to l+k, where l is the length of s_q)
–Estimate the size of each subset separately
Ans(“abcde”,2)
–5 subsets, with strings of length 3 to 7
–Length 3 is Ans(“abcde”,2D)
–Length 5 is Ans(“abcde”,1I1D) ∪ Ans(“abcde”,2R)
Lots of overlap between these sets
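The grouping by length can be enumerated mechanically: with exactly k operations, i insertions and d deletions give a result of length l + i - d, and the remaining operations are replacements. The sketch below is an illustration under that reading; the function name `operation_mixes` and the label format (following the “1I1D” style of the example) are assumptions.

```python
from typing import Dict, List

def operation_mixes(length: int, k: int) -> Dict[int, List[str]]:
    """Map each result length to the insert/delete/replace mixes that use exactly k operations."""
    groups: Dict[int, List[str]] = {}
    for d in range(k + 1):              # number of deletions
        for i in range(k - d + 1):      # number of insertions
            r = k - d - i               # the remaining operations are replacements
            if d + r > length:
                continue  # cannot delete or replace more characters than exist
            label = "".join(f"{n}{op}" for n, op in ((i, "I"), (d, "D"), (r, "R")) if n)
            groups.setdefault(length - d + i, []).append(label)
    return groups
```

For a query of length 5 and k = 2 this gives five groups, for lengths 3 to 7, with length 3 covered by 2D and length 5 by 1I1D and 2R, as in the example.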

Estimate selectivity (Cont.)
Combined Approach
–Obtain base strings for both sets
–Remove redundant base strings
Ans(“abcde”,2R) generates “abc??”
Ans(“abcde”,1I1D) generates “abcd?”
“abc??” covers all the strings in “abcd?”
Remove “abcd?” from the base strings
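The redundancy check can be sketched as a subsumption test between wildcard patterns: a base string is redundant when another base string of the same length has a `?` or the same character at every position. The helper names `subsumes` and `remove_redundant` are hypothetical, and `?` is assumed to match any single character.

```python
from typing import List

def subsumes(general: str, specific: str) -> bool:
    """True if every string matched by `specific` is also matched by `general`."""
    return len(general) == len(specific) and all(
        g == "?" or g == s for g, s in zip(general, specific)
    )

def remove_redundant(patterns: List[str]) -> List[str]:
    """Drop exact duplicates and every pattern covered by a different pattern."""
    unique = list(dict.fromkeys(patterns))  # keeps first occurrences, preserves order
    return [p for p in unique
            if not any(q != p and subsumes(q, p) for q in unique)]
```

With the two base strings from the example, `remove_redundant(["abc??", "abcd?"])` keeps only "abc??".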

Estimate selectivity (Cont.)
BasicEQ, for a given string length
–Find the base strings (remove redundancies)
–Iteratively intersect base strings to obtain r-intersections (r = 2..|base strings|)
This will generate new nodes in the hierarchy
–Partition the nodes and estimate their frequencies
–Add these estimated frequencies

Estimate selectivity (Cont.)
Node Partitioning
–Partition the nodes, so that every node q in a partition has the same coefficient C_q
–C_q is the number of times q appears in all the intersections of base strings
–For each partition, find C_q and the sum of the frequencies of its nodes
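As an illustration of node partitioning, the sketch below computes a coefficient for every node reachable by intersecting the six base strings of Ans(“abcd”,2R) and then groups nodes by coefficient. Reading C_q as the signed inclusion-exclusion count (+1 for odd-sized intersections, -1 for even-sized ones) is an assumption made here, as are the function names; the intersection helper also assumes all patterns derive from the same query string, so fixed characters never conflict.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, List

def coefficients(patterns: List[str]) -> Counter:
    """Signed count of how often each node arises as an intersection of base strings."""
    coeff = Counter()
    for r in range(1, len(patterns) + 1):
        for subset in combinations(patterns, r):
            node = subset[0]
            for p in subset[1:]:
                # Same source string assumed: a fixed character always wins over '?'.
                node = "".join(b if a == "?" else a for a, b in zip(node, p))
            coeff[node] += (-1) ** (r + 1)
    return coeff

def partition(coeff: Counter) -> Dict[int, List[str]]:
    """Group nodes that share the same coefficient."""
    groups = defaultdict(list)
    for node, c in coeff.items():
        groups[c].append(node)
    return {c: sorted(nodes) for c, nodes in groups.items()}
```

For the six base strings ab??, a?c?, ?bc?, a??d, ?b?d, ??cd this produces three partitions: the base strings with coefficient 1, the four one-wildcard nodes with coefficient -2, and abcd with coefficient 3. The estimate is then the sum over partitions of the coefficient times the summed frequencies of the partition's nodes.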

Frequency Estimation
Estimate the frequency of an extended q-gram that is not in the extended N-gram table
Maximal Overlap (MO) [4]
–Finds the substrings in the table that have the maximum overlap with s_q
MAX approach
–If MO(“abc?”) < MO(“abcd”), then use MO(“abcd”) as the estimate for “abc?”
MO+
–Find the substring with the minimum frequency
MM
–Combination of MAX and MO+
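For plain strings without wildcards, an estimate in the spirit of maximal overlap can be sketched as a chain of conditional frequencies: start from the first n-gram and, for each following n-gram, multiply by its frequency divided by the frequency of the (n-1)-character overlap with its predecessor. This is an illustrative simplification, assuming a table that holds every gram up to length n with n >= 2; the function name `mo_estimate` is hypothetical, and the wildcard handling of MAX, MO+ and MM is not covered.

```python
from typing import Dict

def mo_estimate(s: str, table: Dict[str, int], n: int) -> float:
    """Estimate the frequency of s from counts of grams up to length n (n >= 2 assumed)."""
    if len(s) <= n:
        return float(table.get(s, 0))  # short strings are looked up directly
    est = float(table.get(s[:n], 0))
    for i in range(1, len(s) - n + 1):
        gram = s[i:i + n]
        overlap = gram[:-1]  # the n-1 characters shared with the previous gram
        denom = table.get(overlap, 0)
        if denom == 0:
            return 0.0  # the overlap never occurs, so neither can s
        est *= table.get(gram, 0) / denom
    return est
```

With made-up counts `{"ab": 4, "bc": 2, "cd": 3, "b": 5, "c": 4}` and n = 2, the estimate for "abcd" is 4 × (2/5) × (3/4) = 1.2.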

Estimate selectivity (Cont.)
BasicEQ is efficient if the general formulas are applicable
Propose OptEQ, which adds two enhancements to BasicEQ
–Approximates the coefficient C_q in exchange for better performance
–Groups the set of strings obtained in each iteration of BasicEQ to speed up the tests for empty intersections

Experimental Evaluation
(method, N_B, N_E, PT)

Experimental Evaluation

Space vs. Accuracy

Conclusions
Proposed OptEQ
–Approximates coefficients of partitions
–Groups semi-lattices to obtain scalability
–More accurate than SEPIA
–Exploits disk space to give higher precision
MM and MAX estimates give good results

References
[1] H. Lee, R. T. Ng, and K. Shim, “Extending Q-grams to Estimate Selectivity of String Matching with Low Edit Distance”, VLDB 2007
[2] S. Chaudhuri, V. Ganti, and L. Gravano, “Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem”, ICDE 2004
[3] L. Jin and C. Li, “Selectivity Estimation for Fuzzy String Predicates in Large Data Sets”, VLDB 2005
[4] H. V. Jagadish, R. T. Ng, and D. Srivastava, “Substring Selectivity Estimation”, PODS 1999
