Download presentation
Presentation is loading. Please wait.
Published byKerry Fleming Modified over 9 years ago
1
Chen Li Department of Computer Science Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica Answering Approximate Queries Efficiently
2
2 30,000-Foot View of Info Systems Data Repository (RDBMS, Search Engines, etc.) Query Answers matching conditions
3
3 Example: a movie database StarTitleYearGenre Keanu ReevesThe Matrix1999Sci-Fi Samuel JacksonStar Wars: Episode III - Revenge of the Sith2005Sci-Fi SchwarzeneggerThe Terminator1984Sci-Fi Samuel JacksonGoodfellas1990Drama ………… Tom Find movies starred Samuel Jackson
4
4 How about our governor: Schwarrzenger? StarTitleYearGenre Keanu ReevesThe Matrix1999Sci-Fi Samuel JacksonStar Wars: Episode III - Revenge of the Sith2005Sci-Fi SchwarzeneggerThe Terminator1984Sci-Fi Samuel JacksonGoodfellas1990Drama ………… The user doesn’t know the exact spelling!
5
5 Relaxing Conditions StarTitleYearGenre Keanu ReevesThe Matrix1999Sci-Fi Samuel JacksonStar Wars: Episode III - Revenge of the Sith2005Sci-Fi SchwarzeneggerThe Terminator1984Sci-Fi Samuel JacksonGoodfellas1990Drama ………… Find movies with a star “similar to” Schwarrzenger.
6
6 In general: Gap between Queries and Facts Errors in the query –The user doesn’t remember a string exactly –The user unintentionally types a wrong string Samuel Jackson … Schwarzenegger Samuel Jackson Keanu Reeves Star … Samuel L. Jackson Schwarzenegger Samuel L. Jackson Keanu Reeves Star Relation R Relation S Errors in the database: –Data often is not clean by itself –Especially true in data integration and cleansing
7
7 “Did you mean…?” features in Search Engines
8
8 What if we don’t want the user to change the query? Answering Queries Approximately Data Repository (RDBMS, Search Engines, etc.) Query Answers matching conditions approximately
9
9 Technical Challenges How to relax conditions? –Name: “Schwarzenegger” vs “Schwarrzenger” –Salary: “in [50K,60K]” vs “in [49K,63K]” How to answer queries efficiently? –Index structures –Selectivity estimation See our three recent VLDB papers
10
10 Rest of the talk Selectivity estimation of fuzzy predicates Our approach: SEPIA Construction and maintenance of SEPIA Experiments Other works
11
11 Queries with Fuzzy String Predicates Stars: name similar to “Schwarrzenger” Employees: SSN similar to “430-87-7294” Customers: telephone number similar to “412-0964” Similar to: –a domain-specific function –returns a similarity value between two strings Examples: –Edit distance: ed(Schwarrzenger, Schwarzenegger)=2 –Cosine similarity –Jaccard coefficient distance –Soundex –… Database
12
12 A widely used metric to define string similarity Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2 Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2 Example Similarity Function: Edit Distance
13
13 Selectivity of Fuzzy Predicates star SIMILARTO ’Schwarrzenger’ Selectivity: # of records satisfying the predicate StarTitleYearGenre Keanu ReevesThe Matrix1999Sci-Fi Samuel JacksonStar Wars: Episode III - Revenge of the Sith2005Sci-Fi SchwarzeneggerThe Terminator1984Sci-Fi Samuel JacksonGoodfellas1990Drama …………
14
14 Selectivity Estimation: Problem Formulation A bag of strings Input: fuzzy string predicate P(q, δ) star SIMILARTO ’Schwarrzenger’ Output: # of strings s that satisfy dist(s,q) <= δ
15
15 Why Selectivity Estimation? SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND year BETWEEN [1980,1989]; StarTitleYearGenre Keanu ReevesThe Matrix1999Sci-Fi Samuel JacksonStar Wars: Episode III - Revenge of the Sith2005Sci-Fi SchwarzeneggerThe Terminator1984Sci-Fi Samuel JacksonGoodfellas1990Drama ………… Movies SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND year BETWEEN [1970,1971]; The optimizer needs to know the selectivity of a predicate to decide a good plan.
16
16 No “nice” order for strings Lexicographical order? –Similar strings could be far from each other: Kammy/Cammy –Adjacent strings have different selectivities: Cathy/Catherine Using traditional histograms?
17
17 Outline Selectivity estimation of fuzzy predicates Our approach: SEPIA –Overview –Proximity between strings –Estimation algorithm Construction and maintenance of SEPIA Experiments Other works
18
18 Our approach: SEPIA Selectivity Estimation of Approximate Predicates Intuition
19
19 Proximity between Strings Edit Distance? Not discriminative enough
20
20 Edit Vector from s1 to s2 A vector –I: # of insertions –D: # of deletions –S: # of substitutions in a sequence of edit operations with their edit distance –Easily computable –Not symmetric –Not unique, but tend to be (ed <= 3 91% unique)
21
21 Why Edit Vector? More discriminative
22
22 SEPIA histograms: Overview
23
23 Frequency table for each cluster
24
24 Global PPD Table Proximity Pair Distribution table
25
25 SEPIA histograms: summary
26
26 Selectivity Estimation: ed(lukas, 2) Do it for all v2 vectors in each cluster, for all clusters Take the sum of these contributions
27
27 Selectivity Estimation for ed(q,d) For each cluster C i For each v2 in frequency table of C i Use (v1,v2,d) to lookup PPD Take the sum of these f * N Pruning possible (triangle inequality)
28
28 Outline Selectivity estimation of fuzzy predicates Our approach: SEPIA –Overview –Proximity between strings –Estimation algorithm Construction and maintenance of SEPIA Experiments Other works
29
29 Clustering Strings Two example algorithms Lexicographic order based. K-Medoids –Choose initial pivots –Assign strings to its closest pivot –Swap a pivot with another string –Reassign the strings
30
30 Number of Clusters It affects: Cluster quality –Similarity of strings within each cluster Costs: –Space –Estimation time
31
31 Constructing Frequency Tables For each cluster, group strings based on their edit vector from the pivot Count the frequency for each group
32
32 Constructing PPD Table Get enough samples of string triplets (q,p,s) Propose a few heuristics –ALL_RAND –CLOSE_RAND –CLOSE_LEX –CLOSE_UNIQUE
33
33 Dynamic Maintenance: Frequency Table Take insertion as an example
34
34 Dynamic Maintenance: PPD
35
35 Improving Estimation Accuracy Reasons of estimate errors –Miss hits in PPD. –Inaccurate percentage entries in PPD. Improvement: use sample fuzzy predicates to analyze their estimation errors
36
36 Relative-Error Model Use the errors to build a model Use the model to adjust initial estimation
37
37 Outline Motivation: selectivity estimation of fuzzy predicates Our approach: SEPIA –Overview –Proximity between strings –Estimation algorithm Construction and maintenance of SEPIA Experiments Other works
38
38 Data Citeseer: –71K author names –Length: [2,20], avg = 12 Movie records from UCI KDD repository: –11K movie titles. –Length: [3,80], avg = 35 Introduced duplicates: –10% of records –# of duplicates: [1,20], uniform Final results: –Citeseer: 142K author names –UCI KDD: 23K movie titles
39
39 Setting Test bed –PC: 2.4G P4, 1.2GB RAM, Windows XP –Visual C++ compiler Query workload: –Strings from the data –String not in the data –Results similar Quality measurements –Relative error: (f est – f real ) / f real –Absolute relative error : |f est – f real | / f real
40
40 Clustering Algorithms K-Metoids is better
41
41 Quartile distribution of relative errors Data set 1. CLOSE_RAND; 1000 clusters
42
42 Number of Clusters
43
43 Effectiveness of Applying Relative-Error Model
44
44 Dynamic Maintenance
45
45 Other work 1: Relaxing SQL queries with Selections/Joins SELECT * FROM Jobs J, Candidate C WHERE J.Salary = 5 JobsCandidates JIDCompanyZipcodeSalary CID ZipcodeExpSalaryWorkYear r1Broadcom9204780s1936521203 r2Intel9365295s2926121306 r3Microsoft82632120s3826321005 r4IBM90391130s4903911501...……… ………
46
46 Query Relaxation: Skyline!
47
47 Other work 2: Fuzzy predicates on attributes of mixed types SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND |year – 1977| <= 3; StarTitleYearGenre Keanu ReevesThe Matrix1999Sci-Fi Samuel JacksonStar Wars: Episode III - Revenge of the Sith2005Sci-Fi SchwarzeneggerThe Terminator1984Sci-Fi Samuel JacksonGoodfellas1990Drama ………… Movies
48
48 Mixed-Typed Predicates String attributes: edit distance Numeric attributes: absolute numeric difference SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND |year – 1977| <= 3;
49
49 MAT-tree: Intuition Indexing on two attributes is more effective than two separate indexing structures Numeric attribute: B-tree String attribute: tree-based index structure?
50
50 MAT-tree: Overview Tree-based indexing structure: –Each node has MBR for both numeric attribute and string attribute Compressing strings as a “compressed trie” that fits into a limited space An edit distance between a string and compressed trie can be computed Experiments show that MAT-tree is very efficient
51
51 Conclusion It’s important to support answering approximate queries efficiently Our results so far: –SEPIA: provides accurate selectivity estimation for fuzzy string predicates –Relaxing SQL queries with selections and joins –MAT-tree: indexing structure supporting fuzzy queries with mixed-types predicates
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.