String Similarity Measures and Joins with Synonyms

Name: String Similarity Measures and Joins with Synonyms
Uploaded: 2017-08-22T13:11:56+00:00
Duration: PTM16S14
Channel: Charles Haley
Description: String Similarity Measures and Joins with Synonyms

String Similarity Measures and Joins with Synonyms
Jiaheng Lu, Renmin University of China. Joint work with Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang.

Motivation Example (String Measure)
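S1=“International Conference on Management of Data NY USA”
S2=“SIGMOD 2013 New York United States”
No semantic knowledge is used: the two strings share no token, so a plain token-based measure treats them as dissimilar.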

Example (String Measure)
S1=“International Conference on Management of Data NY USA”
S2=“SIGMOD 2013 New York United States”
Synonyms:
SIGMOD -> International Conference on Management of Data
NY -> New York
USA -> United States
SIGMOD -> ACM's Special Interest Group on Management Of Data
How to use the existing synonyms?

Research Problem 1 (String Measurements)
Input: two strings s and t, and a set of synonyms R.
Output: the maximal Jaccard similarity Jaccard(s,t,R) obtainable by using R.

Problem 2 (String Similarity Join)
Input: two sets of strings S and T, a set of synonyms R, and a threshold value θ.
Output: all similar pairs (s, t), with s in S and t in T, such that Jaccard(s,t,R) >= θ.

An example of similarity join
Table S1:
ID   String
q1   2013 ACM Intl Conf on Management of Data USA
q2   Very Large Data Bases Conf
q3   VLDB Conf
q4   ICDE 2013
Table S2:
ID   String
s1   SIGMOD
s2   VLDB
Synonyms:
SIGMOD -> International Conference on Management of Data
VLDB -> Very Large Data Bases

Existing works on approximate string match with synonyms
Transformation-based framework (JaccT) [1]: compared with our method.
Machine learning method [2]: Hidden Markov Model-based measure. Depends on training data; not efficient.
[1] A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based framework for record matching. In ICDE, pages 40–49, 2008.
[2] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39–48, 2003.

Outline
Motivation & Problem Statement
String Similarity Measures
String Similarity Joins
Experimental Results
Conclusion

String Similarity Measures (Full-expansion)
S1=“International Conference on Management of Data NY USA”
S2=“SIGMOD 2013 New York United States”
Synonyms:
SIGMOD -> ACM's Special Interest Group on Management Of Data
SIGMOD -> International Conference on Management of Data
NY -> New York
USA -> United States
Expanding using all synonyms:
S1'=“International Conference on Management of Data NY USA SIGMOD New York United States”
S2'=“SIGMOD 2013 New York United States International Conference on Management of Data NY USA ACM's Special Interest Group on Management Of Data”
Jaccard(S1',S2') = 13/18 = 0.72
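The full-expansion example can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: tokens are lower-cased words, a synonym rule applies in either direction whenever one side occurs as a contiguous run of tokens, and the helper names (tokens, contains_phrase, full_expansion, jaccard) are made up for this sketch rather than taken from the talk.

```python
def tokens(text):
    """Lower-cased whitespace tokens of a string (assumed tokenization)."""
    return text.lower().split()


def contains_phrase(seq, phrase):
    """True if `phrase` occurs as a contiguous run of tokens inside `seq`."""
    n = len(phrase)
    return any(seq[i:i + n] == phrase for i in range(len(seq) - n + 1))


def full_expansion(text, rules):
    """Token set of `text` plus the other side of every applicable rule."""
    seq = tokens(text)
    expanded = set(seq)
    for lhs, rhs in rules:
        a, b = tokens(lhs), tokens(rhs)
        if contains_phrase(seq, a):
            expanded.update(b)
        if contains_phrase(seq, b):
            expanded.update(a)
    return expanded


def jaccard(x, y):
    """Plain Jaccard similarity of two token sets."""
    return len(x & y) / len(x | y)


RULES = [
    ("SIGMOD", "ACM's Special Interest Group on Management Of Data"),
    ("SIGMOD", "International Conference on Management of Data"),
    ("NY", "New York"),
    ("USA", "United States"),
]
S1 = "International Conference on Management of Data NY USA"
S2 = "SIGMOD 2013 New York United States"

e1 = full_expansion(S1, RULES)
e2 = full_expansion(S2, RULES)
print(len(e1 & e2), len(e1 | e2), round(jaccard(e1, e2), 2))  # 13 18 0.72
```

The four extra tokens that the ACM rule adds to S2 (ACM's, Special, Interest, Group) enlarge the union without enlarging the intersection, which is why full expansion stops at 13/18.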

String Similarity Measures (Selective-expansion)
S1=“International Conference on Management of Data NY USA”
S2=“SIGMOD 2013 New York United States”
Synonyms:
SIGMOD -> ACM's Special Interest Group on Management Of Data
SIGMOD -> International Conference on Management of Data
NY -> New York
USA -> United States
Expanding using only good synonyms:
S1'=“International Conference on Management of Data NY USA SIGMOD New York United States”
S2'=“SIGMOD 2013 New York United States International Conference on Management of Data NY USA”
Jaccard(S1',S2') = 13/14 = 0.93

String Similarity Measures (Selective)
Selective expansion is NP-hard (reduction from 3-SAT).
Greedy algorithm: choose the synonyms that increase the current similarity, by computing the similarity gain of each candidate.
Property: the greedy result is optimal in more than 70% of cases in practice.
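A simplified greedy loop in Python shows the idea of picking synonyms by similarity gain. It is a sketch, not the authors' exact algorithm: the gain here is just the change in Jaccard similarity after applying one expansion, rules are assumed to apply in both directions, and all function names are hypothetical.

```python
from fractions import Fraction


def tokens(text):
    """Lower-cased whitespace tokens of a string (assumed tokenization)."""
    return text.lower().split()


def contains_phrase(seq, phrase):
    """True if `phrase` occurs as a contiguous run of tokens inside `seq`."""
    n = len(phrase)
    return any(seq[i:i + n] == phrase for i in range(len(seq) - n + 1))


def applicable_expansions(text, rules):
    """Token sets that applicable rules could add to `text` (either direction)."""
    seq = tokens(text)
    out = []
    for lhs, rhs in rules:
        a, b = tokens(lhs), tokens(rhs)
        if contains_phrase(seq, a):
            out.append(frozenset(b))
        if contains_phrase(seq, b):
            out.append(frozenset(a))
    return out


def jaccard(x, y):
    """Exact Jaccard similarity as a fraction."""
    return Fraction(len(x & y), len(x | y))


def greedy_selective_jaccard(s, t, rules):
    """Repeatedly apply the single expansion with the largest positive gain."""
    sets = [set(tokens(s)), set(tokens(t))]
    pending = [(0, e) for e in applicable_expansions(s, rules)]
    pending += [(1, e) for e in applicable_expansions(t, rules)]

    def score(option):
        # Similarity obtained if this one expansion were applied now.
        side, extra = option
        trial = list(sets)
        trial[side] = sets[side] | extra
        return jaccard(*trial)

    current = jaccard(*sets)
    while pending:
        best = max(pending, key=score)
        if score(best) <= current:
            break  # no remaining synonym increases the similarity
        sets[best[0]] |= best[1]
        current = jaccard(*sets)
        pending.remove(best)
    return current


RULES = [
    ("SIGMOD", "ACM's Special Interest Group on Management Of Data"),
    ("SIGMOD", "International Conference on Management of Data"),
    ("NY", "New York"),
    ("USA", "United States"),
]
S1 = "International Conference on Management of Data NY USA"
S2 = "SIGMOD 2013 New York United States"
print(greedy_selective_jaccard(S1, S2, RULES))  # 13/14
```

On the running example the loop applies every rule except the ACM one, whose extra tokens would only lower the similarity, and ends at 13/14.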

Similarity Joins (Filtering and Verification)
Filtering candidates: generate signatures with full expansion (prefix method or LSH method).
Verify candidates: similarity measures (full expansion or selective expansion).

String Similarity Joins (SN-Join)
Prefix method
Global ordering: {a b c d e f g h i j k l}
S1=“c, k, e, a, f” S2=“d, b, f, e, k”
Threshold=0.8
Order the strings: S1'=“a, c, e, f, k” S2'=“b, d, e, f, k”
Get signatures: Sig(s1)=“a, c” Sig(s2)=“b, d”
No overlap, so Jacc(s1,s2)<0.8
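The prefix example can be checked with a few lines of Python. This sketch assumes the standard prefix-filter length for Jaccard, |s| - ceil(θ|s|) + 1 tokens after sorting by the global order; the function name prefix_signature is made up for illustration, and the synonym expansion step is left out.

```python
import math


def prefix_signature(token_list, theta, global_order):
    """Sort tokens by the global order and keep the first |s| - ceil(theta*|s|) + 1."""
    rank = {tok: i for i, tok in enumerate(global_order)}
    ordered = sorted(set(token_list), key=rank.__getitem__)
    prefix_len = len(ordered) - math.ceil(theta * len(ordered)) + 1
    return ordered[:prefix_len]


ORDER = list("abcdefghijkl")
sig1 = prefix_signature(["c", "k", "e", "a", "f"], 0.8, ORDER)
sig2 = prefix_signature(["d", "b", "f", "e", "k"], 0.8, ORDER)

# Disjoint signatures mean the pair cannot reach the threshold and is pruned.
print(sig1, sig2, set(sig1) & set(sig2))  # ['a', 'c'] ['b', 'd'] set()
```

With five tokens and a threshold of 0.8, two strings must share at least four tokens to qualify, so any qualifying pair has to overlap somewhere in the first two sorted tokens.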

Signatures selection is important
How to select signatures to enhance the signature filtering power? It is unrealistic to find a “one-size-fits-all” solution.

Estimation-based signatures selection
Three steps to select signatures:
1. Generate multiple signature schemes for each data set.
2. Given two tables for join, quickly estimate the filtering power of each scheme.
3. Select the scheme with the best filtering power.

An example on estimator
Self-join:
ID   String                                         Signatures
q1   2013 ACM Intl Conf on Management of Data USA   ACM, International, Conference, on
q2   Very Large Data Bases Conf                     Conf, Conference
q3   VLDB Conf                                      Conf, Conference
q4   ICDE 2013                                      ICDE
Inverted lists:
ACM: q1
Conf: q2, q3
Conference: q1, q2, q3
International: q1
on: q1
ICDE: q4
Filtering results (candidates): (q2,q3), (q1,q2), (q1,q3)
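The candidate pairs in this example follow directly from the inverted lists: two strings become a candidate pair when they appear together in at least one list. The Python sketch below counts them exactly, which is the quantity an estimator would approximate. The signature of q3 is assumed from the inverted lists, and the helper names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

signatures = {
    "q1": ["ACM", "International", "Conference", "on"],
    "q2": ["Conf", "Conference"],
    "q3": ["Conf", "Conference"],  # assumed from the inverted lists
    "q4": ["ICDE"],
}


def build_inverted_lists(sigs):
    """Map each signature token to the sorted list of string IDs containing it."""
    inverted = defaultdict(list)
    for string_id in sorted(sigs):
        for token in sigs[string_id]:
            inverted[token].append(string_id)
    return dict(inverted)


def candidate_pairs(inverted):
    """All ID pairs that co-occur in at least one inverted list."""
    pairs = set()
    for ids in inverted.values():
        pairs.update(combinations(ids, 2))
    return pairs


inverted = build_inverted_lists(signatures)
candidates = candidate_pairs(inverted)
print(sorted(candidates))  # [('q1', 'q2'), ('q1', 'q3'), ('q2', 'q3')]
```

A scheme that yields fewer candidates for the same data has better filtering power, since every candidate still has to be verified.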

Applying FM sketches on inverted lists
Self-join: same strings, signatures, and candidates as in the estimator example: (q2,q3), (q1,q2), (q1,q3).
Inverted lists:
ACM: q1
Conf: q2, q3
Conference: q1, q2, q3
International: q1
on: q1
ICDE: q4
Using a Flajolet-Martin (FM) sketch for each list.

FM sketches (Flajolet and Martin, JCSS 1985)
Estimates the number of distinct items in a multi-set of values from [0,…, M-1].
Assume a hash function h(x) that maps incoming values x in [0,…, M-1] uniformly across [0,…, 2^L-1], where L = O(log M).
Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y.
A value x is mapped to lsb(h(x)), and that position of the BITMAP is set to 1.
Figure: for x = 5, lsb(h(x)) = 2, so position 2 of the BITMAP is set to 1. Number of distinct values: 5.
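A minimal FM sketch in Python makes the bitmap mechanics concrete. The assumptions are loud ones: SHA-256 truncated to 32 bits stands in for the uniform hash h, L = 32 is an arbitrary size, a single bitmap is used (practical versions average many), and the class and function names are invented for this sketch.

```python
import hashlib

L = 32  # number of bitmap positions; an assumed size
PHI = 0.77351  # correction factor from Flajolet and Martin's analysis


def h(x):
    """Deterministic stand-in hash mapping x to an integer in [0, 2**L - 1]."""
    digest = hashlib.sha256(str(x).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def lsb(y):
    """0-based position of the least-significant 1 bit of y; L - 1 if y == 0."""
    return (y & -y).bit_length() - 1 if y else L - 1


class FMSketch:
    def __init__(self):
        self.bitmap = 0

    def add(self, x):
        """Set the bitmap position chosen by lsb(h(x)); duplicates change nothing."""
        self.bitmap |= 1 << lsb(h(x))

    def union(self, other):
        """Sketch of the union of two multi-sets: bitwise OR of the bitmaps."""
        merged = FMSketch()
        merged.bitmap = self.bitmap | other.bitmap
        return merged

    def estimate(self):
        """2**r / PHI, where r is the position of the lowest 0 bit in the bitmap."""
        r = 0
        while (self.bitmap >> r) & 1:
            r += 1
        return 2 ** r / PHI


sketch = FMSketch()
for value in ["q1", "q2", "q3", "q2", "q1"]:
    sketch.add(value)
print(bin(sketch.bitmap), sketch.estimate())
```

Because sketches of two lists combine with a bitwise OR, the size of a union of inverted lists can be estimated without touching the lists themselves, which is what makes the sketch attractive for estimating candidate counts.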

Estimating the filtering power of a signature scheme
Constructing a two-dimensional hash sketch.
Computing tighter upper and lower bounds on the candidate set size.

String Similarity Filtering with Length Filter
Filtering candidates: generate signatures (prefix method or LSH method); compute lengths (length filter).
Verify candidates: similarity measures (full expansion or selective expansion).

String Similarity Joins (SI-Join)
Length filtering
Strings: S1=“a b c d e” S2=“x y z”
Synonyms: a -> f g h, x -> s, b -> k
Full-expansion: S1'=“a b c d e f g h k” S2'=“x y z s”
Length range: s1: [5, 9] s2: [3, 4]
Jacc(s1,s2,R)<0.9
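The length-filter example reduces to a small bound computation, sketched here in Python. The sketch assumes single-token left-hand sides applied in one direction, as in this example, and relies on the fact that Jaccard(A, B) can never exceed min(|A|, |B|) / max(|A|, |B|); the function names are hypothetical.

```python
def length_range(token_list, rules):
    """(original token count, token count after full expansion)."""
    base = set(token_list)
    expanded = set(base)
    for lhs, rhs in rules.items():
        if lhs in base:
            expanded.update(rhs)
    return len(base), len(expanded)


def jaccard_upper_bound(range_s, range_t):
    """Largest Jaccard value any two sets with sizes in these ranges can reach."""
    (lo_s, hi_s), (lo_t, hi_t) = range_s, range_t
    if hi_s < lo_t:
        return hi_s / lo_t
    if hi_t < lo_s:
        return hi_t / lo_s
    return 1.0  # ranges overlap, so the sizes may be equal


RULES = {"a": ["f", "g", "h"], "x": ["s"], "b": ["k"]}
r1 = length_range("a b c d e".split(), RULES)
r2 = length_range("x y z".split(), RULES)
bound = jaccard_upper_bound(r1, r2)
print(r1, r2, bound, bound < 0.9)  # (5, 9) (3, 4) 0.8 True
```

Any selective expansion yields a size between the original and the fully expanded length, so the best case here is 4 tokens against 5, a bound of 0.8, and the pair is pruned at threshold 0.9 without computing any similarity.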

String Similarity Joins (SI-Join)
Filtering candidates: generate signatures (prefix/LSH method); compute lengths (length filter).
Verify candidates: similarity measures (full expansion or selective expansion).

String Similarity Joins (SI-tree)

Data sets and algorithms
Compared method: JaccT [Arasu et al. ICDE 2008]
Three datasets:
Data    # of strings   String len (avg/max)   # of synonyms   # of applied synonyms (avg/max)
USPS    1M             6.75/15                300             2.19/5
CONF    10K            5.84/14                1000            1.43/4
SPROT                  10.32/20                               37.78/104

String Similarity Measures
Effectiveness of different similarity measurements.
Selective-expansion (SE) achieves the best effectiveness.

String Similarity Joins
Efficiency of algorithms (S: selective expansion, F: full expansion).
SI-Join achieves the best performance.

Prefix scheme vs. LSH scheme
Figure regions: "Prefix is better" and "LSH is better".

Estimation effectiveness

Conclusion and future work
String similarity measures with synonyms: two new measures and a new join algorithm.
One estimator for signature selection.
Future work: how to deal with synonym ambiguity, e.g. UW = University of Washington or UW = University of Waterloo?
