Power-Law Based Estimation of Set Similarity Join Size
Hongrae Lee, University of British Columbia
Raymond T. Ng, University of British Columbia
Kyuseok Shim, Seoul National University

Highly Similar, but Not the Same, Data
– Nearly word-for-word copy: the duplicate does not cite the original
– Similar news articles

Introduction
Finding all pairs of similar objects is a very common task
– Near duplicate detection
– Data integration
– Record linkage
– Web search
Examples of near-duplicate records
– Sunita Sarawagi, Alok Kirpal: Efficient set joins on similarity predicates. SIGMOD Conference 2004: 743-754
– S. Sarawagi, A. Kirpal, Efficient set joins on similarity predicates. SIGMOD
– sp|P45680|YFMU_COXBU HYPOTHETICAL 15.8 KD PROTEIN IN FMU-RP...
– sp|P45680|YFMU_COXBU H 15.8 KD PROTEIN IN FMU-RP...

Set Similarity Join (SSJoin)
SSJoin is proposed as a general framework for finding similar objects
Input
– two collections of sets, R and S
– similarity function sim
– similarity threshold τ
Output
– all pairs (r,s), r ∈ R, s ∈ S, such that sim(r,s) ≥ τ
Example: documents are turned into sets of word n-grams
– {bolt, destroy, 200, meter, record}
– {bolt, smashes, 200, meter, world, record, berlin}
– Jaccard similarity = 0.5

Estimation of SSJoin Size
SSJoin in RDBMS
– SSJoin operator as a primitive operator [Chaudhuri, Ganti, Kaushik 06]
– Data cleaning as a repetitive operation [Fuxman, Fazli, Miller 05]
Efficient and accurate estimation of SSJoin size is crucial in query optimization
– Poor size estimations can result in sub-optimal plans
[Figure: two alternative query plans over S.A, R.A, T.B; different optimal plans depending on SSJ size]

Problem Statement
Input
– a collection of sets R (self-join)
– threshold τ on Jaccard similarity J_S
Output
– the number of pairs (r,s), SSJ(τ), such that J_S(r,s) ≥ τ, r, s ∈ R and r ≠ s
Jaccard similarity J_S
– J_S(r,s) = |r ∩ s| / |r ∪ s|
– e.g., J_S({1,2,3},{2,3,4}) = |{2,3}| / |{1,2,3,4}| = 0.5
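The problem statement can be made concrete with a brute-force baseline. The following is a minimal sketch, not the method proposed in the talk: it compares every pair exhaustively, which is exactly the cost the estimation approach tries to avoid. The function names `jaccard` and `ssjoin_size` are illustrative, and returning 0.0 for two empty sets is an assumed convention.

```python
from itertools import combinations


def jaccard(r, s):
    """Jaccard similarity |r ∩ s| / |r ∪ s| of two sets."""
    union = r | s
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(r & s) / len(union)


def ssjoin_size(sets, tau):
    """Exact self-join size SSJ(tau): unordered pairs with J_S >= tau."""
    return sum(1 for r, s in combinations(sets, 2) if jaccard(r, s) >= tau)


print(jaccard({1, 2, 3}, {2, 3, 4}))  # the slide's example
```

The exhaustive loop is quadratic in the number of sets, which is why it is only usable as a ground truth on small inputs.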

Related Work
Set similarity join (or selection)
– [Sarawagi, Kirpal 04], [Chaudhuri, Ganti, Kaushik 06], [Arasu, Ganti, Kaushik 06], [Bayardo, Ma, Srikant 07], [Xiao, Wang, Lin 08], [Xiao, Wang, Lin, Yu 08], [Hadjieleftheriou, Chandel, Koudas, Srivastava 08], [Xiao, Wang, Lin, Shang 09]
Hashed Samples: selectivity estimation of set similarity selection queries
– [Hadjieleftheriou, Yu, Koudas, Srivastava 08]
Estimation of the number of frequent patterns
– [Chuang, Huang, Chen 08], [Jin, McCallen, Breitbart, Fuhry, Wang 09], [Boley, Grosskreutz 09]

Outline
– Introduction
– Signature pattern & Lattice counting
– Power-law based estimation
– Correction of the estimation
– Experimental results

Min-Hash Signature [Cohen 97] [Broder, Glassman, Manasse, Zweig 97]
Min-wise hash function
– Prob[h(r) = h(s)] = |r ∩ s| / |r ∪ s|
Min-Hash signature
– Use M min-wise hash functions, h_1,…,h_M
– J_S(r,s) ≈ fraction of signature positions at which the Min-Hash values agree
Example (M = 4)
– r = {1,3,53,55,23,534,…}, sig(r) = [4,3,5,2]
– s = {2,3,50,51,52,53,…}, sig(s) = [4,3,3,5]
– J_S(r,s) ≈ 2/4
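A minimal sketch of how such signatures could be built and compared. It assumes set elements are non-negative integers smaller than the prime `P`, and it uses random linear hash functions `(a*x + b) mod P` as a common practical stand-in for truly min-wise independent functions; the talk does not say which hash family was used. All names here are illustrative.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; elements are assumed to be in [0, P)


def make_hashes(m, seed=0):
    """Draw m random linear hash functions, each stored as a pair (a, b)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(m)]


def signature(s, hashes):
    """Min-Hash signature: the minimum hash value of s under each function."""
    return [min((a * x + b) % P for x in s) for a, b in hashes]


def estimate_jaccard(sig_r, sig_s):
    """Fraction of signature positions on which the two signatures agree."""
    agree = sum(1 for u, v in zip(sig_r, sig_s) if u == v)
    return agree / len(sig_r)


hashes = make_hashes(256)
r, s = set(range(0, 100)), set(range(50, 150))  # true J_S = 50/150
print(estimate_jaccard(signature(r, hashes), signature(s, hashes)))
```

Larger M gives a tighter estimate at the cost of longer signatures; the estimate is a fraction with denominator M, so only M+1 distinct similarity values can be observed.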

Min-Hash Representation of Sets
We work on Min-Hash signatures of sets
– Succinct representation enables faster analysis
– Min-Hash signatures preserve Jaccard similarity between original sets
– Might be readily available
DB and Sig(DB), M (signature size) = 4
Set   Elements                      Signature
r1    {7,10,19,52,67}               [4,3,5,2]
r2    {10,19,43,52}                 [4,3,3,5]
r3    {10,13,43,52,67,85}           [4,3,2,2]
r4    {10,38,43,49,80,94}           [3,3,3,2]
r5    {3,25,29,47,50,66,73,75}      [1,1,1,2]

Signature Pattern
Define signature pattern to represent frequently co-occurring signature values
Signature pattern
– A Min-Hash signature possibly with ‘X’ (‘X’: don’t care position)
– A signature (set) matches a pattern if it (its signature) agrees with the pattern on all non-X positions
– e.g., [4,3,5,2] matches patterns [4,3,X,X] or [X,3,5,2] (and many more), but does not match [4,3,2,X] (position matters)
– length: # non-X positions
– freq (support count): # matching signatures in the DB

An Example Signature Pattern
Sig(DB)
– sig(r1) = [4,3,5,2]
– sig(r2) = [4,3,3,5]
– sig(r3) = [4,3,2,2]
– sig(r4) = [3,3,3,2]
– sig(r5) = [1,1,1,2]
Signature pattern [4,3,X,X]
– Pattern length: 2
– Pattern freq: 3 (r1, r2, r3)
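The pattern definitions can be checked mechanically against the example signature database. This is a small sketch; representing the don't-care position ‘X’ as `None` and the helper names are choices made here for illustration.

```python
X = None  # stands for the don't-care position 'X'

SIG_DB = {
    "r1": [4, 3, 5, 2],
    "r2": [4, 3, 3, 5],
    "r3": [4, 3, 2, 2],
    "r4": [3, 3, 3, 2],
    "r5": [1, 1, 1, 2],
}


def matches(sig, pattern):
    """A signature matches if it agrees with the pattern on every non-X position."""
    return all(p is X or p == v for v, p in zip(sig, pattern))


def pattern_length(pattern):
    """Length of a pattern: the number of non-X positions."""
    return sum(1 for p in pattern if p is not X)


def pattern_freq(pattern, sig_db):
    """Frequency (support count): the number of matching signatures."""
    return sum(1 for sig in sig_db.values() if matches(sig, pattern))


print(pattern_length([4, 3, X, X]), pattern_freq([4, 3, X, X], SIG_DB))
```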

# Similar Pairs by Pattern Frequency
Pattern freq f, length i ⇒ C(f,2) pairs have at least i matching positions in their signatures (J_S ≥ i/M)
– pattern length ⇒ J_S (estimated)
– pattern frequency ⇒ # pairs
Example: signature pattern [4,3,X,X] over Sig(DB), with sig(r1) = [4,3,5,2], sig(r2) = [4,3,3,5], sig(r3) = [4,3,2,2], sig(r4) = [3,3,3,2], sig(r5) = [1,1,1,2]
– Pattern length 2, pattern freq 3 (r1, r2, r3)
– C(3,2) = 3 signature pairs match on at least 2 positions
– J_S(r1,r2), J_S(r2,r3), J_S(r3,r1) ≥ 2/4 (est.)

SSJoin Size By Pattern Frequency
Given threshold τ, we find all patterns with length ≥ τ*M
For each pattern, C(freq,2) pairs satisfy τ
Example when τ = 0.5
Signature pattern      Length  Matching set  Freq  Matching pair set                # pairs
sig1 = [4, 3, X, X]    2       r1, r2, r3    3     {(r1,r2),(r1,r3),(r2,r3)}        3
sig2 = [4, X, X, 2]    2       r1, r3        2     {(r1,r3)}                        1
sig3 = [X, 3, 3, X]    2       r2, r4        2     {(r2,r4)}                        1
sig4 = [X, 3, X, 2]    2       r1, r3, r4    3     {(r1,r3),(r1,r4),(r3,r4)}        3
sig5 = [4, 3, X, 2]    3       r1, r3        2     {(r1,r3)}                        1
Naïve approach for SSJoin size: sum # pairs from all patterns, ∑ = 9
– There are overlaps in pattern frequency and thus in # pairs
– We need the cardinality of the union of the matching pair sets
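The gap between the naïve sum and the union can be reproduced on the five example patterns. The sketch below materializes the pair sets, which the actual technique avoids; it is only meant to show the overlap. `None` stands for ‘X’ and all names are illustrative.

```python
from itertools import combinations
from math import comb

X = None  # don't-care position

SIG_DB = {
    "r1": [4, 3, 5, 2],
    "r2": [4, 3, 3, 5],
    "r3": [4, 3, 2, 2],
    "r4": [3, 3, 3, 2],
    "r5": [1, 1, 1, 2],
}

# The five patterns of length >= tau*M = 2 listed on the slide.
PATTERNS = [
    [4, 3, X, X],
    [4, X, X, 2],
    [X, 3, 3, X],
    [X, 3, X, 2],
    [4, 3, X, 2],
]


def matching_sets(pattern):
    """Names of the sets whose signature agrees on all non-X positions."""
    return sorted(
        name
        for name, sig in SIG_DB.items()
        if all(p is X or p == v for v, p in zip(sig, pattern))
    )


def pair_set(pattern):
    """All unordered pairs of sets that match the pattern."""
    return set(combinations(matching_sets(pattern), 2))


naive_sum = sum(comb(len(matching_sets(p)), 2) for p in PATTERNS)
union_size = len(set().union(*(pair_set(p) for p in PATTERNS)))
print(naive_sum, union_size)
```

On this data the naïve sum counts (r1,r3) four times, which accounts for the difference between the two numbers.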

Computing the Union Size
We can compute the union size with the Inclusion-Exclusion (IE) formula
– Combinatorial # operations!
Signature pattern      Matching pair set
sig1 = [4, 3, X, X]    S1 = {(r1,r2),(r1,r3),(r2,r3)}
sig2 = [4, X, X, 2]    S2 = {(r1,r3)}
sig3 = [X, 3, X, 2]    S3 = {(r1,r3),(r1,r4),(r3,r4)}
|S1 ⋃ S2 ⋃ S3| = |S1| + |S2| + |S3| − (|S1 ⋂ S2| + |S2 ⋂ S3| + |S3 ⋂ S1|) + |S1 ⋂ S2 ⋂ S3|
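The IE formula for S1, S2, S3 can be evaluated directly. The sketch below is a generic inclusion-exclusion over all non-empty subsets, written here for illustration; its loop over subsets is the combinatorial cost the slide warns about.

```python
from itertools import combinations


def union_size_by_ie(sets):
    """Size of the union via inclusion-exclusion over all non-empty subsets."""
    total = 0
    for k in range(1, len(sets) + 1):
        sign = 1 if k % 2 == 1 else -1  # add odd-sized terms, subtract even-sized
        for group in combinations(sets, k):
            total += sign * len(set.intersection(*group))
    return total


S1 = {("r1", "r2"), ("r1", "r3"), ("r2", "r3")}
S2 = {("r1", "r3")}
S3 = {("r1", "r3"), ("r1", "r4"), ("r3", "r4")}

# (3 + 1 + 3) - (1 + 1 + 1) + 1
print(union_size_by_ie([S1, S2, S3]))
```

With n pair sets the loop visits 2^n − 1 subsets, so it is only practical for a handful of patterns.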

Efficient Evaluation of IE-Formula
Patterns and matching pairs exhibit lattice structure
– layer nodes according to the pattern length (= level)
– edges: inclusion relationship
– patterns of length < τ*M are not shown
Pattern lattice
– level 2: [4,3,X,X], [4,X,X,2], [X,3,X,2]
– level 3: [4,3,X,2]
Matching pair lattice
– S1 = {(r1,r2),(r1,r3),(r2,r3)}, S2 = {(r1,r3)}, S3 = {(r1,r3),(r1,r4),(r3,r4)}
– S4 = {(r1,r3)}
SSJ(0.5) = |S1 ⋃ S2 ⋃ S3|
= |S1| + |S2| + |S3| − (|S1 ⋂ S2| + |S2 ⋂ S3| + |S3 ⋂ S1|) + |S1 ⋂ S2 ⋂ S3|
= 1*(|S1| + |S2| + |S3|) + (−3 + 1)*|S4|

Lattice Counting
Compute SSJoin size from the ‘pattern distribution’ (# patterns for each length and frequency)
– Basically a simplified IE-formula computation using lattices
– Does not store actual matching sets or pair sets, only counts!
Signature pattern      Length  Matching set  Freq  Matching pair set                # pairs
sig1 = [4, 3, X, X]    2       r1, r2, r3    3     {(r1,r2),(r1,r3),(r2,r3)}        3
sig2 = [4, X, X, 2]    2       r1, r3        2     {(r1,r3)}                        1
sig3 = [X, 3, 3, X]    2       r2, r4        2     {(r2,r4)}                        1
sig4 = [X, 3, X, 2]    2       r1, r3, r4    3     {(r1,r3),(r1,r4),(r3,r4)}        3
sig5 = [4, 3, X, 2]    3       r1, r3        2     {(r1,r3)}                        1
Pattern distribution
Length  Frequency  # patterns
2       2          2
2       3          2
3       2          1
Please see the paper for details

Pattern Distribution
Length  Frequency  # patterns
2       2          2
2       3          2
3       2          1
[Plot: # of patterns (pattern count) vs. pattern frequency, one curve per level: level 2 (pattern length = 2), level 3 (pattern length = 3)]
– There are 2 patterns that match 3 sets and whose length is 2, i.e., sig1 = [4,3,X,X] ⇒ (r1,r2,r3) and sig4 = [X,3,X,2] ⇒ (r1,r3,r4)
– If we have the exact pattern distribution, we can compute the SSJoin size over the signatures exactly

Exact Pattern Distribution
Computing the exact pattern distribution is infeasible
– We need the pattern distribution for freq ≥ 2 (min freq for generating a pair) ⇒ minimum support threshold = 2
– Most frequent pattern mining algorithms are not designed to handle such a low support threshold
– Even if they could, it would take too long to be used for query optimization purposes

Power-Law Distribution of Pattern Count
A power-law distribution is observed in the # patterns vs. frequency relationship (pattern count vs. support count) [Chuang, Huang, Chen 08]
Power law: count = β * frequency^(−α)
[Plot: pattern count vs. frequency, split at the minimum support threshold into the mined pattern distribution and the missing pattern distribution]
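One simple way to recover α and β from the mined part of the distribution is a least-squares line in log-log space, since log(count) = log(β) − α·log(frequency). The slides do not say which fitting procedure was used, so the sketch below is an assumption, and the data in it is synthetic rather than taken from the experiments.

```python
import math


def fit_power_law(freqs, counts):
    """Fit count = beta * freq**(-alpha) by least squares on the logs."""
    xs = [math.log(f) for f in freqs]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)  # (alpha, beta)


def extrapolate(alpha, beta, freq):
    """Predicted pattern count at a frequency below the mining threshold."""
    return beta * freq ** (-alpha)


# Synthetic "mined" counts for frequencies 5..20 (alpha = 2, beta = 1000).
mined_freqs = list(range(5, 21))
mined_counts = [1000.0 * f ** -2.0 for f in mined_freqs]
alpha, beta = fit_power_law(mined_freqs, mined_counts)
print(alpha, beta, extrapolate(alpha, beta, 2))
```

The fit needs at least two distinct frequencies with non-zero counts; with real, noisy counts the recovered parameters are only approximate.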

SSJoin Size Estimation
1. Find frequent patterns with minimum support threshold ξ > 2
2. Estimate the parameters of the power-law distribution at each level with the acquired patterns
3. Compute the full pattern distribution based on the estimated parameters
4. Compute SSJoin size with the Lattice Counting formula

Systematic Overestimation By Min-Hash
Big overestimation is observed, e.g., relative error
– J_S = 0.4: 10332%
– J_S = 0.5: 2614%
– J_S = 0.6: 573%
[Plot: # pairs vs. similarity, from exhaustive pair-wise comparison]

Effect of Skewed Distribution
Assume 10% of pairs have one more (+1) and 10% have one fewer (−1) matching position in their Min-Hash signatures
# matching positions i             0       1      2    3   4
T(i): # pairs with J_S = i/M       10,000  1,000  100  10  1
Observed # pairs at i = 2: 100 − 2*10 + 100 + 1 = 181

Probabilistic Modeling
Example: r = {1,2,4,5,7}, s = {1,2,3,4,5,6,8}, J_S(r,s) = 4/8 = 0.5
– Each position of sig(r) and sig(s) matches with probability 0.5
– Pr(J=j | I=i) ≡ Prob(j matching positions when J_S = i/M)
– E[# matching positions] = 2 (M = 4)
– Prob(3 matching positions)? C(4,3) * 0.5^3 * (1−0.5)^(4−3)
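The probability on this slide is a binomial term: with M independent positions, each matching with probability J_S = i/M, the number of matching positions is Binomial(M, i/M). A small sketch that evaluates it; the function name `p_match` is illustrative.

```python
from math import comb


def p_match(j, i, m):
    """Pr(J = j | I = i): probability of j matching positions when J_S = i/m."""
    q = i / m
    return comb(m, j) * q ** j * (1 - q) ** (m - j)


M = 4
print(p_match(3, 2, M))  # C(4,3) * 0.5^3 * 0.5^1
print(sum(j * p_match(j, 2, M) for j in range(M + 1)))  # expected # matches
```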

Considering All # Pairs
– T(i): # pairs with J_S = i/M (true size)
– O(j): # pairs with j matching positions in their signatures (size observed by Min-Hash)
– P(j|i): transition probability
Every T(i), i = 0,…,4, contributes to every O(j), j = 0,…,4, e.g.
O(2) = T(0)*P(2|0) + T(1)*P(2|1) + T(2)*P(2|2) + T(3)*P(2|3) + T(4)*P(2|4)
In matrix form, AT = O, where
– A is the 5×5 transition probability matrix with entry P(j|i) in row j, column i
– T = [T(0), T(1), T(2), T(3), T(4)]ᵀ
– O = [O(0), O(1), O(2), O(3), O(4)]ᵀ
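The relation AT = O can be illustrated numerically. The sketch below assumes the binomial transition probabilities from the probabilistic modeling slide and reuses the skewed counts 10,000 / 1,000 / 100 / 10 / 1 as a made-up true distribution T; the resulting numbers are illustrative, not experimental results.

```python
from math import comb


def transition_matrix(m):
    """A[j][i] = P(j | i): j matching positions given J_S = i/m (binomial)."""
    def p(j, i):
        q = i / m
        return comb(m, j) * q ** j * (1 - q) ** (m - j)
    return [[p(j, i) for i in range(m + 1)] for j in range(m + 1)]


def observed_counts(A, T):
    """O = A T, i.e. O(j) = sum over i of T(i) * P(j | i)."""
    return [sum(a * t for a, t in zip(row, T)) for row in A]


T = [10_000, 1_000, 100, 10, 1]  # assumed true counts for i = 0..4
A = transition_matrix(4)
O = observed_counts(A, T)
print(O)
```

Because the lower levels hold far more pairs, even small transition probabilities push many of them upward, so the observed counts at high similarity exceed the true ones.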

NNLS Optimization
AT = O ⇒ T = A⁻¹O (A is non-singular)
Problems with direct inversion
– T may have negative values
– We actually have an estimated vector O′, not the exact O
– O is highly skewed and lower entries make higher entries negligible
We solve a non-negative least squares (NNLS) constrained optimization problem
– minimize ∥WAX − WO∥ subject to X ≥ 0
– Scale the matrix by a weight matrix W, W_{i,i} = 1/O(i) and W_{i,j} = 0 for i ≠ j
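A rough sketch of the weighted non-negative least squares step. It uses plain projected gradient descent as a simple stand-in for a dedicated NNLS solver; the slides do not say which solver was used, and this loop can converge slowly on ill-conditioned systems. The function names are illustrative, and `weighted_system` assumes every O(i) is non-zero.

```python
def weighted_system(A, O):
    """Apply W = diag(1/O(i)) to both sides: returns (W A, W O)."""
    WA = [[a / O[i] for a in A[i]] for i in range(len(A))]
    WO = [1.0] * len(O)  # O(i) / O(i)
    return WA, WO


def nnls_projected_gradient(B, c, iters=5000):
    """Approximately minimize ||B x - c|| subject to x >= 0."""
    rows, cols = len(B), len(B[0])
    # Step size 1/L, where L = trace(B^T B) bounds the largest eigenvalue
    # of B^T B (assumes B is not all zeros).
    L = sum(B[i][j] ** 2 for i in range(rows) for j in range(cols))
    x = [0.0] * cols
    for _ in range(iters):
        residual = [
            sum(B[i][j] * x[j] for j in range(cols)) - c[i] for i in range(rows)
        ]
        grad = [sum(B[i][j] * residual[i] for i in range(rows)) for j in range(cols)]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, x[j] - grad[j] / L) for j in range(cols)]
    return x


# Toy system: the unconstrained solution would be [1, -1]; NNLS clips it to [1, 0].
print(nnls_projected_gradient([[1.0, 0.0], [0.0, 1.0]], [1.0, -1.0]))
```

In use, one would pass the output of `weighted_system(A, O_estimated)` to the solver, so that each row of the system is scaled by 1/O(i) and small high-similarity counts are not drowned out by the large low-similarity ones.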

SSJoin Size Estimation Algorithm
1. Min-Hash signatures of DB
2. Freq. pattern mining algorithm ⇒ partial pattern distribution (# patterns for each length)
– No need for actual patterns; only count # patterns
3. Power-law parameter estimation ⇒ est. full pattern distribution
4. Lattice Counting ⇒ SSJoin size
5. NNLS optimization ⇒ error correction

Experimental Setup
Dataset
– DBLP, 800K
– IBM Quest synthetic data, 50K
Compared algorithms
– LC(ξ): the proposed solution with a minimum support threshold of ξ
– Independent Sum (IS): without lattice counting
– LCNC(ξ): LC without the error correction step
– HS(ρ): Hashed samples [Hadjieleftheriou, Yu, Koudas, Srivastava 08] adapted to SSJoin Opt_Merge [Sarawagi, Kirpal 04] (ρ: sampling ratio)
Evaluation metric
– Accuracy: actual count, relative error
– Runtime: pre-processing time, estimation time

Accuracy
[Plots: DBLP, Synthetic Data]
– LC delivers accurate estimations for high similarity thresholds
– HS: random samples will miss many highly similar pairs
– HS: accurate enough for very low similarity thresholds

Runtime
[Plots: DBLP 40K, estimation time and pre-processing time]
– LC is faster (with better accuracy)
– LC’s pre-processing time is smaller

Effect of Error Correction Step
[Plots: accuracy, runtime]
– Huge overestimation without considering the overlaps
– Error correction step effectively reduces the overestimation
– Computational overhead of the error correction step is negligible

Scalability
[Plots: estimation time, pre-processing time]
– Much slower increase in runtime and pre-processing time than HS and random sampling

Summary
– Proposed an SSJoin size estimation algorithm based on Min-Hash signatures and frequent pattern mining techniques, with error correction
– Evaluated the proposed algorithm with synthetic and real-world databases
Future work
– Apply recent developments in estimating the number of frequent patterns: random sampling

Thank you 37

Lattice Structure in Patterns
Patterns and corresponding matching pair sets have lattice structure
– Partial order by inclusion relationship, lub and glb by intersection and union
– E.g., if a set matches [4,3,X,2], it matches all of its children
– If a set matches both [4,3,X,X] and [4,X,X,2], it also matches [4,3,X,2]
We can compute the union size by the Inclusion-Exclusion (IE) formula
The lattice structure greatly simplifies the IE-formula computation
Pattern lattice
– level 2: [4,3,X,X], [4,X,X,2], [X,3,X,2]
– level 3: [4,3,X,2]
Matching pair lattice
– S1 = {(r1,r2),(r1,r3),(r2,r3)}, S2 = {(r1,r3)}, S3 = {(r1,r3),(r1,r4),(r3,r4)}
– S4 = {(r1,r3)}

Lattice Counting
Lattice Counting (LC)
– Efficient computation of the IE-formula exploiting the underlying lattices
– LC(t) = Σ_{t ≤ i ≤ M} C_i * F_i, t = τ*M
– Level sum F_i: # pairs with i matching values in their signatures
– Coefficient C_i for level i collapses repeated computation of the same results into a single operation [Lee, Ng, Shim 07]
Only needs # patterns of freq f and length i
– e.g., if τ = 0.5 and M = 4, SSJ(0.5) = LC(2) and LC needs # patterns of length 2, 3 and 4 for each frequency
– Does not need actual patterns

Parameter Estimation Might Fail for Longer Patterns
There are in general fewer higher-level (longer) patterns
– We may not have enough points for parameter estimation
– LC(t) requires the pattern distribution for all levels t through M: LC(t) = Σ_{t ≤ i ≤ M} C_i * F_i
Our solutions
– Approximate Lattice Counting
– Interpolation
[Plots: enough points for parameter estimation at i = 3; not enough points for parameter estimation at i = 9]

Approximate Lattice Counting
Full lattice (levels t, …, M): LC(t) = Σ_{t ≤ i ≤ M} C_i * F_i
Partial lattice (levels t, …, t+k): LC_k(t) = Σ_{t ≤ i ≤ t+k} C_{k,i} * F_i
– Partial independence assumption: ignore high-level nodes, considering only nodes up to level t+k
– t = τ*M, k: approximation constant
[Figure: for M = 4, the full lattice has the level-2 nodes [_,_,X,X], [_,X,_,X], [_,X,X,_], [X,_,_,X], [X,_,X,_], [X,X,_,_], the level-3 nodes [_,_,_,X], [_,_,X,_], [_,X,_,_], [X,_,_,_] and the level-4 node [_,_,_,_]; the partial lattice omits level 4]

Estimation with Limited Pattern Distribution
An observation
– SSJoin size is highly skewed, and the pair count vs. Jaccard similarity relationship follows a power law
– Used for interpolation in cases of very low support thresholds or NNLS optimization failure
[Plots: pair count (SSJoin size) vs. Jaccard similarity, DBLP]

Power Hypothesis 43 DBLP Synthetic Data
