Chen Li Department of Computer Science Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica Answering Approximate Queries Efficiently.

Presentation transcript:

Chen Li Department of Computer Science Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica Answering Approximate Queries Efficiently

2 30,000-Foot View of Info Systems Data Repository (RDBMS, Search Engines, etc.) Query Answers matching conditions

3 Example: a movie database
Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama
… | … | … | …
Tom: Find movies starring Samuel Jackson

4 How about our governor: Schwarrzenger?
Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama
… | … | … | …
The user doesn’t know the exact spelling!

5 Relaxing Conditions
Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama
… | … | … | …
Find movies with a star “similar to” Schwarrzenger.

6 In general: Gap between Queries and Facts
Errors in the query:
–The user doesn’t remember a string exactly
–The user unintentionally types a wrong string
Errors in the database:
–Data often is not clean by itself
–Especially true in data integration and cleansing
Relation R (Star): Keanu Reeves, Samuel Jackson, Schwarzenegger, …, Samuel Jackson
Relation S (Star): Keanu Reeves, Samuel L. Jackson, Schwarzenegger, …, Samuel L. Jackson

7 “Did you mean…?” features in Search Engines

8 What if we don’t want the user to change the query? Answering Queries Approximately Data Repository (RDBMS, Search Engines, etc.) Query Answers matching conditions approximately

9 Technical Challenges How to relax conditions? –Name: “Schwarzenegger” vs “Schwarrzenger” –Salary: “in [50K,60K]” vs “in [49K,63K]” How to answer queries efficiently? –Index structures –Selectivity estimation See our three recent VLDB papers

10 Rest of the talk Selectivity estimation of fuzzy predicates Our approach: SEPIA Construction and maintenance of SEPIA Experiments Other works

11 Queries with Fuzzy String Predicates
Stars: name similar to “Schwarrzenger”
Employees: SSN similar to “ ”
Customers: telephone number similar to “ ”
Similar to:
–a domain-specific function
–returns a similarity value between two strings
Examples:
–Edit distance: ed(Schwarrzenger, Schwarzenegger)=3
–Cosine similarity
–Jaccard coefficient distance
–Soundex
–…
Database

12 Example Similarity Function: Edit Distance
A widely used metric to define string similarity
ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
Example: s1: Tom Hanks, s2: Ton Hank, ed(s1,s2) = 2
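
The definition above can be computed with the standard dynamic program over prefixes. The following is a minimal sketch, not code from the talk; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] is the distance between the previous prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2: substitute m with n, delete s
```

The table has one cell per pair of prefixes, so the running time is proportional to len(s1) * len(s2); only two rows are kept in memory at a time.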

13 Selectivity of Fuzzy Predicates
star SIMILARTO ’Schwarrzenger’
Selectivity: # of records satisfying the predicate
Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama
… | … | … | …

14 Selectivity Estimation: Problem Formulation
Given: a bag of strings
Input: fuzzy string predicate P(q, δ), e.g. star SIMILARTO ’Schwarrzenger’
Output: # of strings s that satisfy dist(s,q) <= δ
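
For reference, the exact answer that an estimator tries to approximate can be obtained by scanning the whole bag. The sketch below assumes edit distance as dist; the helper names and the sample bag are made up for illustration and are not from the talk.

```python
from typing import Callable, Iterable


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance using a single rolling row."""
    row = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        diag, row[0] = row[0], i  # diag keeps the value from the previous row
        for j, ct in enumerate(t, start=1):
            diag, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, diag + (cs != ct))
    return row[-1]


def exact_selectivity(strings: Iterable[str], q: str, delta: int,
                      dist: Callable[[str, str], int] = edit_distance) -> int:
    """Count the strings s in the bag with dist(s, q) <= delta (duplicates count)."""
    return sum(1 for s in strings if dist(s, q) <= delta)


# Hypothetical bag of strings; duplicates are kept because the input is a bag.
bag = ["Tom Hanks", "Tom Hanks", "Ton Hank", "Tim Hawks"]
print(exact_selectivity(bag, "Tom Hanks", 1))  # 2
print(exact_selectivity(bag, "Tom Hanks", 2))  # 4
```

This scan computes one distance per string, which is the cost that a compact summary structure is meant to avoid at optimization time.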

15 Why Selectivity Estimation?
SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND year BETWEEN [1980,1989];
SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND year BETWEEN [1970,1971];
Movies
Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama
… | … | … | …
The optimizer needs to know the selectivity of a predicate to decide a good plan.

16 No “nice” order for strings Lexicographical order? –Similar strings could be far from each other: Kammy/Cammy –Adjacent strings have different selectivities: Cathy/Catherine Using traditional histograms?
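
A small illustration of the first point: in sorted order, two names that differ by one character can end up far apart. Only Kammy, Cammy, and Cathy come from the slide; the other names are made up to fill the list.

```python
# Hypothetical name list; sorted() gives the lexicographical order.
names = sorted(["Kammy", "Cathy", "Jack", "Cammy", "Dana", "Kate", "Ivan", "Carl"])

# Kammy and Cammy differ by a single substitution, yet their positions
# in the sorted list are separated by every name that starts with C..J.
gap = names.index("Kammy") - names.index("Cammy")
print(names)
print(gap)  # 6
```

A histogram whose buckets are ranges of this order would place the two names in different buckets unless the buckets are very wide.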

17 Outline Selectivity estimation of fuzzy predicates Our approach: SEPIA –Overview –Proximity between strings –Estimation algorithm Construction and maintenance of SEPIA Experiments Other works

18 Our approach: SEPIA Selectivity Estimation of Approximate Predicates Intuition

19 Proximity between Strings Edit Distance? Not discriminative enough

20 Edit Vector from s1 to s2
A vector <I, D, S>:
–I: # of insertions
–D: # of deletions
–S: # of substitutions
counted over a sequence of edit operations whose length equals the edit distance
–Easily computable
–Not symmetric
–Not unique, but tends to be (ed <= 3 → 91% unique)
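
One way to obtain such a vector is to fill the edit-distance table and then walk back along one minimum-cost path, counting each kind of operation. This is a sketch under assumptions: the tie-breaking order (diagonal first, then deletion, then insertion) is chosen here for illustration, and since the vector is not always unique, a different order may return a different vector.

```python
from typing import Tuple


def edit_vector(s1: str, s2: str) -> Tuple[int, int, int]:
    """Return (insertions, deletions, substitutions) of one optimal edit script s1 -> s2."""
    n, m = len(s1), len(s2)
    # dist[i][j] is the edit distance between s1[:i] and s2[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]))

    # Walk back from the bottom-right cell along one minimum-cost path.
    ins = dele = sub = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]):
            sub += s1[i - 1] != s2[j - 1]  # a match adds nothing
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dele += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return ins, dele, sub


print(edit_vector("Tom Hanks", "Ton Hank"))  # (0, 1, 1)
print(edit_vector("Ton Hank", "Tom Hanks"))  # (1, 0, 1)
```

The two calls show the asymmetry: reversing the direction swaps the insertion and deletion counts, while the sum of the three components stays equal to the edit distance.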

21 Why Edit Vector? More discriminative

22 SEPIA histograms: Overview

23 Frequency table for each cluster

24 Global PPD Table Proximity Pair Distribution table

25 SEPIA histograms: summary

26 Selectivity Estimation: ed(lukas, 2) Do it for all v2 vectors in each cluster, for all clusters Take the sum of these contributions

27 Selectivity Estimation for ed(q,d)
For each cluster C_i
–For each v2 in the frequency table of C_i
–Use (v1,v2,d) to look up the PPD
Take the sum of these f * N
Pruning possible (triangle inequality)
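
The loop above can be sketched with plain dictionaries. Everything about the data layout is an assumption made for illustration: v1 is taken to be the edit vector relating the query to a cluster's pivot, each frequency table maps an edit vector v2 to a count N, the PPD maps (v1, v2, d) to a fraction f, and a missing PPD entry is treated as contributing zero. The numbers are invented and pruning is left out.

```python
from typing import Dict, List, Tuple

Vector = Tuple[int, int, int]  # (insertions, deletions, substitutions)


def estimate_selectivity(v1_by_cluster: List[Vector],
                         freq_tables: List[Dict[Vector, int]],
                         ppd: Dict[Tuple[Vector, Vector, int], float],
                         d: int) -> float:
    """Sum f * N over every v2 of every cluster, with f looked up by (v1, v2, d)."""
    total = 0.0
    for v1, freq_table in zip(v1_by_cluster, freq_tables):
        for v2, n in freq_table.items():
            f = ppd.get((v1, v2, d), 0.0)  # assumption: a missing entry adds nothing
            total += f * n
    return total


# Hypothetical summary with two clusters.
v1_by_cluster = [(0, 1, 1), (2, 0, 3)]
freq_tables = [
    {(0, 0, 0): 1, (1, 0, 0): 4},  # cluster 1: the pivot itself plus 4 strings one insertion away
    {(0, 0, 1): 10},               # cluster 2: 10 strings one substitution away from its pivot
]
ppd = {
    ((0, 1, 1), (0, 0, 0), 2): 1.0,
    ((0, 1, 1), (1, 0, 0), 2): 0.5,
}

print(estimate_selectivity(v1_by_cluster, freq_tables, ppd, d=2))  # 3.0
```

The work per query is proportional to the total number of frequency-table entries rather than to the number of strings in the data set.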

28 Outline Selectivity estimation of fuzzy predicates Our approach: SEPIA –Overview –Proximity between strings –Estimation algorithm Construction and maintenance of SEPIA Experiments Other works

29 Clustering Strings
Two example algorithms:
Lexicographic order based
K-Medoids
–Choose initial pivots
–Assign each string to its closest pivot
–Swap a pivot with another string
–Reassign the strings
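
The K-Medoids steps can be sketched as a simple swap-based local search. This is an illustrative variant under assumptions, not the implementation used in the talk: the first k distinct strings serve as initial pivots, a swap is kept only when it lowers the total distance of strings to their closest pivot, and the sample names are made up.

```python
from typing import Dict, List


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance using a single rolling row."""
    row = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        diag, row[0] = row[0], i
        for j, ct in enumerate(t, start=1):
            diag, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, diag + (cs != ct))
    return row[-1]


def k_medoids(strings: List[str], k: int, max_rounds: int = 100) -> Dict[str, List[str]]:
    """Cluster strings around k pivots; returns {pivot: members}."""
    def cost(pivots: List[str]) -> int:
        # Total distance from every string to its closest pivot.
        return sum(min(edit_distance(s, p) for p in pivots) for s in strings)

    pivots = list(dict.fromkeys(strings))[:k]  # step 1: initial pivots (first k distinct)
    best = cost(pivots)
    for _ in range(max_rounds):
        improved = False
        for i in range(k):
            for cand in strings:
                if cand in pivots:
                    continue
                trial = pivots[:i] + [cand] + pivots[i + 1:]  # step 3: swap one pivot
                trial_cost = cost(trial)
                if trial_cost < best:  # keep the swap only if it helps
                    pivots, best, improved = trial, trial_cost, True
        if not improved:
            break

    clusters: Dict[str, List[str]] = {p: [] for p in pivots}
    for s in strings:  # steps 2 and 4: assign each string to its closest pivot
        clusters[min(pivots, key=lambda p: edit_distance(s, p))].append(s)
    return clusters


names = ["cathy", "kathy", "cathie", "smith", "smyth", "smithe"]
for pivot, members in k_medoids(names, k=2).items():
    print(pivot, members)
```

Each round recomputes the full cost for every candidate swap, which is fine for a toy list but would need sampling or caching of distances on a large string collection.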

30 Number of Clusters It affects: Cluster quality –Similarity of strings within each cluster Costs: –Space –Estimation time

31 Constructing Frequency Tables For each cluster, group strings based on their edit vector from the pivot Count the frequency for each group

32 Constructing PPD Table Get enough samples of string triplets (q,p,s) Propose a few heuristics –ALL_RAND –CLOSE_RAND –CLOSE_LEX –CLOSE_UNIQUE

33 Dynamic Maintenance: Frequency Table Take insertion as an example

34 Dynamic Maintenance: PPD

35 Improving Estimation Accuracy
Reasons for estimation errors:
–Missed lookups (no matching entry) in the PPD
–Inaccurate percentage entries in the PPD
Improvement: use sample fuzzy predicates to analyze their estimation errors

36 Relative-Error Model Use the errors to build a model Use the model to adjust initial estimation

37 Outline Motivation: selectivity estimation of fuzzy predicates Our approach: SEPIA –Overview –Proximity between strings –Estimation algorithm Construction and maintenance of SEPIA Experiments Other works

38 Data Citeseer: –71K author names –Length: [2,20], avg = 12 Movie records from UCI KDD repository: –11K movie titles. –Length: [3,80], avg = 35 Introduced duplicates: –10% of records –# of duplicates: [1,20], uniform Final results: –Citeseer: 142K author names –UCI KDD: 23K movie titles

39 Setting
Test bed:
–PC: 2.4 GHz P4, 1.2GB RAM, Windows XP
–Visual C++ compiler
Query workload:
–Strings from the data
–Strings not in the data
–Results similar
Quality measurements:
–Relative error: (f_est – f_real) / f_real
–Absolute relative error: |f_est – f_real| / f_real
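
The two quality measurements translate directly into code. In this sketch f_est is the estimated count and f_real the true count; the function names and sample numbers are chosen here for illustration, and f_real is assumed to be non-zero.

```python
def relative_error(f_est: float, f_real: float) -> float:
    """Signed error: positive for overestimates, negative for underestimates."""
    return (f_est - f_real) / f_real


def absolute_relative_error(f_est: float, f_real: float) -> float:
    """Magnitude of the relative error, ignoring its direction."""
    return abs(f_est - f_real) / f_real


# Hypothetical estimates against a true count of 100.
for f_est in (120, 80):
    print(f_est, relative_error(f_est, 100), absolute_relative_error(f_est, 100))
```

The signed version shows whether an estimator is biased toward over- or underestimation, while the absolute version keeps errors of opposite sign from cancelling when they are averaged over a workload.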

40 Clustering Algorithms K-Medoids is better

41 Quartile distribution of relative errors Data set 1. CLOSE_RAND; 1000 clusters

42 Number of Clusters

43 Effectiveness of Applying Relative-Error Model

44 Dynamic Maintenance

45 Other work 1: Relaxing SQL queries with Selections/Joins
SELECT * FROM Jobs J, Candidate C WHERE J.Salary = 5
Jobs: JID | Company | Zipcode | Salary
Candidates: CID | Zipcode | ExpSalary | WorkYear
r1 Broadcom s r2 Intel s r3 Microsoft s r4 IBM s ……… ………

46 Query Relaxation: Skyline!

47 Other work 2: Fuzzy predicates on attributes of mixed types
SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND |year – 1977| <= 3;
Movies
Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama
… | … | … | …

48 Mixed-Typed Predicates String attributes: edit distance Numeric attributes: absolute numeric difference SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND |year – 1977| <= 3;

49 MAT-tree: Intuition Indexing on two attributes is more effective than two separate indexing structures Numeric attribute: B-tree String attribute: tree-based index structure?

50 MAT-tree: Overview Tree-based indexing structure: –Each node has MBR for both numeric attribute and string attribute Compressing strings as a “compressed trie” that fits into a limited space An edit distance between a string and compressed trie can be computed Experiments show that MAT-tree is very efficient

51 Conclusion It’s important to support answering approximate queries efficiently Our results so far: –SEPIA: provides accurate selectivity estimation for fuzzy string predicates –Relaxing SQL queries with selections and joins –MAT-tree: indexing structure supporting fuzzy queries with mixed-types predicates