Weighted Exact Set Similarity Join


1 Weighted Exact Set Similarity Join
Dongwon Lee, The Pennsylvania State University

2 Set Similarity Join
Def. Set Similarity Join (SSJoin): between collections A and B, find X pairs of objects whose similarity > t.
  If X = “MOST” → Approximate SSJoin
  If X = “ALL” → Exact SSJoin
[Figure: collections A and B with example sets {Lake, Monona, Wisc, Dane, County} and {University, Mendota, Wisc, Dane,}; pairwise similarities 0.7, 0.5, 0.4, 0.2, 0.9, 0.1]
Wisconsin DB Seminar, 2009

3 Set Similarity Join: Weighted vs. Unweighted
Weighting quantifies the relative importance of each token.
  E.g., “Microsoft” is more important than “Corp.”
How to assign meaningful weights to tokens is an important problem in itself; it is not discussed further here.

4 Set Similarity Join: Approximate vs. Exact
Approximate SSJoin
  Allows some false positives/negatives
  E.g., LSH as a solution
Exact SSJoin
  Does not allow any false positives/negatives
  Needs to be scalable
Weighted + Exact SSJoin
  Will simply be called “WESSJoin”

            unweighted  weighted
  exact     UESSJoin    WESSJoin
  approx.   UASSJoin    WASSJoin

5 Applications of WESSJoin
Entity resolution
Web document genre classification
  Find all pairs of documents with similar contents
Query refinement for web search
  For a query, find another with a similar search result
Movie recommendation
  Identify users who have similar movie tastes w.r.t. the rented movies
→ Focus on string data represented as a SET
  E.g., document, web page, record

6 Research Issues
Why not express WESSJoin in SQL?
  Join predicate as a UDF
  Cartesian product followed by UDF processing → inefficient evaluation
Special handling for WESSJoin is needed
  Scalability
  Support for diverse similarity (or distance) functions
    E.g., Overlap, Jaccard, Cosine vs. Edit, …
  Support for diverse computation models
    E.g., Threshold vs. Top-k

7 Similarity/Distance Functions
Jaccard coefficient: $J(x,y) = \frac{|x \cap y|}{|x \cup y|}$
Overlap similarity: $O(x,y) = |x \cap y|$
Cosine similarity: $C(x,y) = \frac{|x \cap y|}{\sqrt{|x|\,|y|}}$
Hamming distance: $H(x,y) = |(x - y) \cup (y - x)|$
Levenshtein distance L(x,y): minimum number of edit operations to transform x into y
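
As a rough illustration of the set-based measures on this slide, the sketch below implements them for unweighted token sets. The function names and the unweighted formulas (for instance, cosine as |x ∩ y| / sqrt(|x||y|)) are assumptions chosen to agree with the rewriting rules on the following slide; a weighted variant would presumably replace set sizes with sums of token weights.

```python
from math import sqrt

def overlap(x: set, y: set) -> int:
    """O(x, y): number of shared tokens."""
    return len(x & y)

def jaccard(x: set, y: set) -> float:
    """J(x, y): shared tokens over all distinct tokens."""
    union = len(x | y)
    return len(x & y) / union if union else 0.0

def cosine(x: set, y: set) -> float:
    """C(x, y): shared tokens over the geometric mean of the set sizes."""
    if not x or not y:
        return 0.0
    return len(x & y) / sqrt(len(x) * len(y))

def hamming(x: set, y: set) -> int:
    """H(x, y): size of the symmetric difference."""
    return len(x ^ y)

# Example sets taken from the "Properties of sim()" slide.
x = {"Lake", "Mendota", "Monona"}
y = {"Wisc", "Dane", "Mendota", "Lake"}
print(overlap(x, y), jaccard(x, y), cosine(x, y), hamming(x, y))
```

Note that hamming(x, y) equals |x| + |y| - 2 * overlap(x, y) for any pair of sets, which is the identity the rewriting rules rely on.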

8 Properties of sim()
Similarity functions can be rewritten in terms of one another:
  $J(x,y) > t \iff O(x,y) > \frac{t}{1+t}\,(|x|+|y|)$
  $O(x,y) > t \iff H(x,y) < |x|+|y|-2t$
  $C(x,y) > t \iff O(x,y) > t\,\sqrt{|x|\,|y|}$
E.g., x: {Lake, Mendota, Monona}, y: {Wisc, Dane, Mendota, Lake}
  J(x,y) > 0.5 ? ⇔ O(x,y) > 2.3 ?
Set representation: k-gram, word, phrase, …
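
The rewriting rules can be checked mechanically. The sketch below is an illustration only: the helper names are hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used so that boundary cases are not blurred by floating-point rounding. It reproduces the slide's example (|x| = 3, |y| = 4, t = 0.5 gives an overlap threshold of 7/3, roughly 2.3) and brute-forces the Jaccard and Hamming equivalences over small set sizes.

```python
from fractions import Fraction

def overlap_threshold_for_jaccard(t, size_x, size_y):
    """Overlap threshold equivalent to the Jaccard threshold t: t/(1+t) * (|x|+|y|)."""
    return t * (size_x + size_y) / (1 + t)

# Slide example: |x| = 3, |y| = 4, J(x, y) > 0.5 ?
bound = overlap_threshold_for_jaccard(Fraction(1, 2), 3, 4)
print(bound, float(bound))

def equivalences_hold(jaccard_t, overlap_t, max_size=6):
    """Brute-force check over all set sizes and all feasible overlaps o."""
    for nx in range(1, max_size + 1):
        for ny in range(1, max_size + 1):
            for o in range(min(nx, ny) + 1):
                j = Fraction(o, nx + ny - o)   # Jaccard from sizes and overlap
                h = nx + ny - 2 * o            # Hamming from sizes and overlap
                if (j > jaccard_t) != (o > overlap_threshold_for_jaccard(jaccard_t, nx, ny)):
                    return False
                if (o > overlap_t) != (h < nx + ny - 2 * overlap_t):
                    return False
    return True

print(equivalences_hold(Fraction(1, 2), 2))
```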

9 Naïve Solution
All pair-wise comparison between A and B
  Nested loop: |A||B| comparisons
  The sim() evaluation may be costly
    E.g., the Generalized Jaccard similarity function costs O(|x|^3)

  For x in A:
    For y in B:
      If sim(x,y) > t, return (x,y);

A, B: table; x, y: record as set

10 Naïve Solution Example
A
  ID  Content
  1   {Lake, Mendota}
  2   {Lake, Monona, Area}
  3   {Lake, Mendota, Monona, Dane}
B
  ID  Content
  4   {Lake, Monona, University}
  5   {Monona, Research, Area}
  6   {Lake, Mendota, Monona, Area}
O(x,y) > 2 ?
  O(x,y)  ID=4  ID=5  ID=6
  ID=1    1     0     2
  ID=2    2     2     3
  ID=3    2     1     3

11 Naïve Solution Example
A: 1 {Lake, Mendota}; 2 {Lake, Monona, Area}; 3 {Lake, Mendota, Monona, Dane}
B: 4 {Lake, Monona, University}; 5 {Monona, Research, Area}; 6 {Lake, Mendota, Monona, Area}
J(x,y) > 0.6 ?
  J(x,y)  ID=4  ID=5  ID=6
  ID=1    0.25  0     0.5
  ID=2    0.5   0.5   0.75
  ID=3    0.4   0.17  0.6
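
The nested-loop join and the two example queries can be sketched as follows. This is a minimal illustration, not the talk's implementation: the dictionaries `A` and `B` encode the example tables (IDs 1 to 3 and 4 to 6), and `naive_join` is a hypothetical name.

```python
# Example tables from the slides: record ID -> token set.
A = {
    1: {"Lake", "Mendota"},
    2: {"Lake", "Monona", "Area"},
    3: {"Lake", "Mendota", "Monona", "Dane"},
}
B = {
    4: {"Lake", "Monona", "University"},
    5: {"Monona", "Research", "Area"},
    6: {"Lake", "Mendota", "Monona", "Area"},
}

def overlap(x, y):
    return len(x & y)

def jaccard(x, y):
    return len(x & y) / len(x | y)

def naive_join(A, B, sim, t):
    """Compare every record of A with every record of B: |A||B| sim() calls."""
    return [(i, j) for i, x in A.items() for j, y in B.items() if sim(x, y) > t]

print(naive_join(A, B, overlap, 2))    # [(2, 6), (3, 6)]
print(naive_join(A, B, jaccard, 0.6))  # [(2, 6)]
```

The pair (3, 6) has J = 0.6 exactly, so it is excluded by the strict predicate J(x,y) > 0.6.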

12 2-Step Framework
Step 1: “Blocking”
  Using index/heuristics/filtering/etc., reduce the number of candidates to compare
Step 2: sim() only within candidate sets
  O(|A||C|) s.t. |C| << |B|

  For x in A:
    Using Foo, find a candidate set C in B
    For y in C:
      If sim(x,y) > t, return (x,y);

13 Variants for “Foo”
“Foo”: how to identify the candidate set C
  Fast
  Accurate: no false positives/negatives
Many variants for “Foo”
  Inverted Index [Sarawagi et al, SIGMOD 04]
  Size filtering [Arasu et al, VLDB 06]
  Prefix Index [Chaudhuri et al, ICDE 06]
  Prefix + Inverted Index [Bayardo et al, WWW 07]
  Bound filtering [On et al, ICDE 07]
  Position Index [Xiao et al, WWW 08]

14 Inverted Index [Sarawagi et al, SIGMOD 04]
A: 1 {Lake, Mendota}; 2 {Lake, Monona, Area}; 3 {Lake, Mendota, Monona, Dane}
B: 4 {Lake, Monona, University}; 5 {Monona, Research, Area}; 6 {Lake, Mendota, Monona, Area}
Inverted Index (IDX) for A
  Token in A  ID List
  Area        2
  Dane        3
  Lake        1, 2, 3
  Mendota     1, 3
  Monona      2, 3
Inverted Index (IDX) for B
  Token in B  ID List
  Area        5, 6
  Lake        4, 6
  Mendota     6
  Monona      4, 5, 6
  Research    5
  University  4

15 Inverted Index [Sarawagi et al, SIGMOD 04]
A: 1 {Lake, Mendota}; 2 {Lake, Monona, Area}; 3 {Lake, Mendota, Monona, Dane}
B: 4 {Lake, Monona, University}; 5 {Monona, Research, Area}; 6 {Lake, Mendota, Monona, Area}

  For x in A:
    Using IDX, find a candidate set C in B
    For y in C:
      If sim(x,y) > t, return (x,y);

Inverted Index (IDX) for B
  Token in B  ID List
  Area        5, 6
  Lake        4, 6
  Mendota     6
  Monona      4, 5, 6
  Research    5
  University  4
Probing with ID=1: {Lake, Mendota} (ID=2 and ID=3 are handled likewise)
  Candidate set C: {4, 6} + {6} = {4, 6}

16 Inverted Index [Sarawagi et al, SIGMOD 04]
A: 1 {Lake, Mendota}; 2 {Lake, Monona, Area}; 3 {Lake, Mendota, Monona, Dane}
B: 4 {Lake, Monona, University}; 5 {Monona, Research, Area}; 6 {Lake, Mendota, Monona, Area}

  For x in A:
    Using IDX, find a candidate set C in B
    For y in C:
      If sim(x,y) > t, return (x,y);

Inverted Index (IDX) for B
  Token in B  ID List
  Area        5, 6
  Lake        4, 6
  Mendota     6
  Monona      4, 5, 6
  Research    5
  University  4
Probing with ID=1: {Lake, Mendota} (ID=2 and ID=3 are handled likewise)
  ID  Freq.
  4   1
  6   2
Candidate set C: the frequency of an ID equals its overlap with x, so keep only IDs with O(x,y) > 2
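
The probe on this slide can be sketched as below. This is an illustrative sketch with hypothetical function names: an inverted index is built over B, each token of x is looked up, and the number of posting lists in which an ID appears is counted. For the overlap measure that count is exactly O(x,y), so the threshold can be applied to the counts directly.

```python
from collections import Counter, defaultdict

A = {
    1: {"Lake", "Mendota"},
    2: {"Lake", "Monona", "Area"},
    3: {"Lake", "Mendota", "Monona", "Dane"},
}
B = {
    4: {"Lake", "Monona", "University"},
    5: {"Monona", "Research", "Area"},
    6: {"Lake", "Mendota", "Monona", "Area"},
}

def build_inverted_index(records):
    """Map each token to the list of record IDs that contain it."""
    idx = defaultdict(list)
    for rid, tokens in records.items():
        for tok in tokens:
            idx[tok].append(rid)
    return idx

def probe(x, idx):
    """Count, for every ID in the index, how many tokens it shares with x."""
    counts = Counter()
    for tok in x:
        counts.update(idx.get(tok, ()))
    return counts

def overlap_join(A, B, t):
    idx = build_inverted_index(B)
    return sorted((i, j) for i, x in A.items()
                  for j, freq in probe(x, idx).items() if freq > t)

idx_B = build_inverted_index(B)
print(dict(probe(A[1], idx_B)))   # ID 4 appears once, ID 6 twice
print(overlap_join(A, B, 2))
```

Records of B that share no token with x never appear in the counts, which is where the saving over the nested loop comes from.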

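17 Size Filtering [Arasu et al, VLDB 06]
Idea: build an index on the size of inputs
Jaccard coefficient: $J(x,y) = \frac{|x \cap y|}{|x \cup y|}$
Upper bound for Jaccard: $J(x,y) \le \frac{\min(|x|,|y|)}{\max(|x|,|y|)}$
Bounding |y| w.r.t. |x|: combining the two → $t\,|x| \le |y| \le |x|/t$
[Figure: diagrams of the sets x and y]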

18 Size Filtering [Arasu et al, VLDB 06]
Intuition: if t and |x| are given, |y| is bounded
E.g., x: {Lake, Mendota}, y: {Lake, Mendota, Monona, Area}
  J(x,y) > 0.8 ?
  Then, according to $t\,|x| \le |y| \le |x|/t$ with |x| = 2 and t = 0.8: 1.6 <= |y| <= 2.5
  However, |y| = 4, so y cannot satisfy t = 0.8 → no need to compute J(x,y) at all

19 Size Filtering [Arasu et al, VLDB 06]

  For x in A:
    Using IDX, find a candidate set C in B
    For y in C:
      If sim(x,y) > t, return (x,y);

Algorithm
  For all input strings, build a B-tree w.r.t. their sizes
  Given a set x, use the B-tree index to find candidates y in B s.t. $t\,|x| \le |y| \le |x|/t$
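
A minimal sketch of the size filter follows. It swaps the B-tree for a sorted list searched with Python's `bisect` module, which supports the same range lookup on sizes; that substitution, the function names, and the size bounds t|x| <= |y| <= |x|/t (taken from the worked example with |x| = 2, t = 0.8) are assumptions of this sketch.

```python
import bisect

B = {
    4: {"Lake", "Monona", "University"},
    5: {"Monona", "Research", "Area"},
    6: {"Lake", "Mendota", "Monona", "Area"},
}

def build_size_index(records):
    """Sort records by set size; a sorted list stands in for the B-tree."""
    entries = sorted((len(tokens), rid) for rid, tokens in records.items())
    sizes = [size for size, _ in entries]
    return sizes, entries

def size_candidates(x, index, t):
    """Return IDs of records y with t*|x| <= |y| <= |x|/t."""
    sizes, entries = index
    lo = bisect.bisect_left(sizes, t * len(x))
    hi = bisect.bisect_right(sizes, len(x) / t)
    return [rid for _, rid in entries[lo:hi]]

index_B = build_size_index(B)
x = {"Lake", "Mendota"}
print(size_candidates(x, index_B, 0.8))   # sizes must lie in [1.6, 2.5]
print(size_candidates(x, index_B, 0.5))   # sizes must lie in [1, 4]
```

With t = 0.8 no record of B survives, since every record has size 3 or 4; only the surviving candidates would be passed to the exact Jaccard computation.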

20 Prefix Index [Chaudhuri et al, ICDE 06]
Intuition: if two sets are very similar, their prefixes, when ordered, must have some common tokens
E.g., x: {Dane, University, Monona, Mendota}, y: {Area, Lake, Mendota, Monona, Wisc}
  O(x,y) > 3 ?
  Ordered: x’: {Dane, Mendota, Monona, University}, y’: {Area, Lake, Mendota, Monona, Wisc}
  [Figure: the prefixes of x’ and y’ are highlighted]

21 Prefix Index [Chaudhuri et al, ICDE 06]
Theorem 1: If there is no overlap between Prefix(x) and Prefix(y), then sim(x,y) < t, where the prefix length is:
  If sim() = Overlap, |Prefix(x)| = |x| - (t - 1)
  If sim() = Jaccard, |Prefix(x)| = |x| - Ceiling(t * |x|) + 1
Algorithm using Theorem 1:
  Given a set x
  For each token t_x in the prefix of x
    Using an index, locate a candidate y that contains t_x in the prefix of y
    If sim(x,y) > t, return (x,y)
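
The prefix test for the overlap measure can be sketched as follows. This is an illustration under assumptions: tokens are ordered alphabetically unless another sort key is supplied, the prefix length is |x| - (t - 1) as stated on the slide, and the function names are hypothetical. A pair whose prefixes are disjoint can be skipped without computing sim(x,y); a pair whose prefixes intersect is only a candidate and still needs verification.

```python
def prefix(tokens, t, key=None):
    """First |x| - (t - 1) tokens of x under a fixed global ordering."""
    ordered = sorted(tokens, key=key)
    return ordered[: max(len(ordered) - (t - 1), 0)]

def is_candidate(x, y, t, key=None):
    """True if the prefixes share a token, i.e. the pair cannot be pruned."""
    return bool(set(prefix(x, t, key)) & set(prefix(y, t, key)))

# Example from the previous slide, with t = 3.
x = {"Dane", "University", "Monona", "Mendota"}
y = {"Area", "Lake", "Mendota", "Monona", "Wisc"}
print(prefix(x, 3))           # ['Dane', 'Mendota']
print(prefix(y, 3))           # ['Area', 'Lake', 'Mendota']
print(is_candidate(x, y, 3))  # True: "Mendota" is shared

# Disjoint prefixes: the overlap is 1, below t = 2, so the pair is pruned.
print(is_candidate({"a", "b", "c"}, {"c", "d", "e"}, 2))  # False
```

In the first example the pair survives the filter even though its actual overlap is only 2, which is why the final sim(x,y) check remains necessary.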

22 Prefix + Inverted Index [Bayardo et al, WWW 07]
A: 1 {Lake, Mendota}; 2 {Lake, Monona, Area}; 3 {Lake, Mendota, Monona, Dane}
B: 4 {Lake, Monona, University}; 5 {Monona, Research, Area}; 6 {Lake, Mendota, Monona, Area}
Inverted Index (IDX) for both A and B
  Token       ID List        DF  Order
  Area        2, 5, 6        3   4
  Dane        3              1   1
  Lake        1, 2, 3, 4, 6  5   6
  Mendota     1, 3, 6        3   5
  Monona      2, 3, 4, 5, 6  5   7
  Research    5              1   2
  University  4              1   3
Create a universal order: put rare tokens in front (ties broken alphabetically)
  Order: Dane > Research > University > Area > Mendota > Lake > Monona

23 Prefix + Inverted Index [Bayardo et al, WWW 07]
Ordered A
  ID  Content
  1   {Mendota, Lake}
  2   {Area, Lake, Monona}
  3   {Dane, Mendota, Lake, Monona}
Ordered B
  ID  Content
  4   {University, Lake, Monona}
  5   {Research, Area, Monona}
  6   {Area, Mendota, Lake, Monona}
Order: Dane > Research > University > Area > Mendota > Lake > Monona

24 Prefix + Inverted Index [Bayardo et al, WWW 07]
Ordered A: 1 {Mendota, Lake}; 2 {Area, Lake, Monona}; 3 {Dane, Mendota, Lake, Monona}
Ordered B: 4 {University, Lake, Monona}; 5 {Research, Area, Monona}; 6 {Area, Mendota, Lake, Monona}
O(x,y) > 2
  Prefix(x) = |x| - (t - 1) = |x| - 1
Prefix Inverted Index for B
  Token in B  ID List
  Area        5, 6
  Lake        4, 6
  Mendota     6
  Research    5
  University  4
Probing with ID=1: {Mendota, Lake}
  Candidate set C: {6}

25 Prefix + Inverted Index [Bayardo et al, WWW 07]
Ordered A: 1 {Mendota, Lake}; 2 {Area, Lake, Monona}; 3 {Dane, Mendota, Lake, Monona}
Ordered B: 4 {University, Lake, Monona}; 5 {Research, Area, Monona}; 6 {Area, Mendota, Lake, Monona}
O(x,y) > 2
  Prefix(x) = |x| - (t - 1) = |x| - 1
Prefix Inverted Index for B
  Token in B  ID List
  Area        5, 6
  Lake        4, 6
  Mendota     6
  Research    5
  University  4
Probing with ID=2: {Area, Lake, Monona}
  Candidate set C: {5, 6} + {4, 6} = {4, 5, 6}

26 Prefix + Inverted Index [Bayardo et al, WWW 07]
Ordered A: 1 {Mendota, Lake}; 2 {Area, Lake, Monona}; 3 {Dane, Mendota, Lake, Monona}
Ordered B: 4 {University, Lake, Monona}; 5 {Research, Area, Monona}; 6 {Area, Mendota, Lake, Monona}
O(x,y) > 2
  Prefix(x) = |x| - (t - 1) = |x| - 1
Prefix Inverted Index for B
  Token in B  ID List
  Area        5, 6
  Lake        4, 6
  Mendota     6
  Research    5
  University  4
Probing with ID=3: {Dane, Mendota, Lake, Monona}
  Candidate set C: {6} + {4, 6} = {4, 6}
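
The three probes on these slides can be reproduced with the sketch below. It is an illustration with hypothetical function names: tokens are ranked by document frequency over A and B together (rare first, ties broken alphabetically, which is an assumption that happens to yield the order shown in the talk), only the prefix of each record in B is indexed, and each record of A is probed with its own prefix.

```python
from collections import Counter, defaultdict

A = {
    1: {"Lake", "Mendota"},
    2: {"Lake", "Monona", "Area"},
    3: {"Lake", "Mendota", "Monona", "Dane"},
}
B = {
    4: {"Lake", "Monona", "University"},
    5: {"Monona", "Research", "Area"},
    6: {"Lake", "Mendota", "Monona", "Area"},
}

def global_order(*collections):
    """Rank tokens by document frequency, rare tokens first."""
    df = Counter(tok for coll in collections
                 for rec in coll.values() for tok in rec)
    ordered = sorted(df, key=lambda tok: (df[tok], tok))
    return {tok: rank for rank, tok in enumerate(ordered)}

def prefix(tokens, rank, t):
    """First |x| - (t - 1) tokens of x under the global order."""
    ordered = sorted(tokens, key=rank.__getitem__)
    return ordered[: max(len(ordered) - (t - 1), 0)]

def build_prefix_index(records, rank, t):
    """Inverted index over the prefix tokens only."""
    idx = defaultdict(set)
    for rid, rec in records.items():
        for tok in prefix(rec, rank, t):
            idx[tok].add(rid)
    return idx

def candidates(x, idx, rank, t):
    found = set()
    for tok in prefix(x, rank, t):
        found |= idx.get(tok, set())
    return found

rank = global_order(A, B)
idx_B = build_prefix_index(B, rank, t=2)
for rid, x in A.items():
    print(rid, sorted(candidates(x, idx_B, rank, t=2)))
```

Compared with the full inverted index, the frequent token Monona never enters the index, which shortens the posting lists that have to be scanned.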

27 Position Index [Xiao et al, WWW 08]
Order: Dane > Research > University > Area > Mendota > Lake > Monona
E.g., x: {Dane, Research, Area, Mendota, Lake}, y: {Research, Area, Mendota, Lake, Monona}
  O(x,y) > 4 ?
  Prefix(x) = Prefix(y) = 5 - (4 - 1) = 2
  “Research” is common between the prefixes → (x,y) is a candidate pair → need to compute sim(x,y)

28 Position Index [Xiao et al, WWW 08]
Order: Dane > Research > University > Area > Mendota > Lake > Monona
E.g., x: {Dane, Research, Area, Mendota, Lake}, y: {Research, Area, Mendota, Lake, Monona}
  O(x,y) > 4 ?
  Prefix(x) = Prefix(y) = 5 - (4 - 1) = 2
  Estimation of max overlap = overlap in prefixes + min # of unseen tokens = 1 + min(3,4) = 4, which is not > t
  → No need to compute sim(x,y)!
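
The estimate on this slide can be sketched as follows. This is a simplified illustration, not the published algorithm: it looks only at the first token shared by the two prefixes, counts it as one unit of overlap, and adds the smaller number of tokens that follow it in either ordered set. The names `RANK` and `max_overlap_estimate` are hypothetical.

```python
ORDER = ["Dane", "Research", "University", "Area", "Mendota", "Lake", "Monona"]
RANK = {tok: pos for pos, tok in enumerate(ORDER)}

def max_overlap_estimate(x, y, t):
    """Upper bound on O(x, y) from the position of the first shared prefix token.

    Returns None when the prefixes are disjoint (the prefix filter already
    prunes the pair in that case).
    """
    xs = sorted(x, key=RANK.__getitem__)
    ys = sorted(y, key=RANK.__getitem__)
    px = xs[: max(len(xs) - (t - 1), 0)]
    py = ys[: max(len(ys) - (t - 1), 0)]
    common = set(px) & set(py)
    if not common:
        return None
    first = min(common, key=RANK.__getitem__)
    i, j = xs.index(first), ys.index(first)
    # One shared token so far, plus at most the shorter remaining tail.
    return 1 + min(len(xs) - i - 1, len(ys) - j - 1)

x = {"Dane", "Research", "Area", "Mendota", "Lake"}
y = {"Research", "Area", "Mendota", "Lake", "Monona"}
t = 4
estimate = max_overlap_estimate(x, y, t)
print(estimate, estimate > t)   # 4 False -> the pair can be skipped
```

Because no token before the first shared prefix token can be common to both sets, the estimate never underestimates the true overlap, so skipping a pair whose estimate does not exceed t loses no answers.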

29 Bound Filtering [On et al, ICDE 07]
Generalized Jaccard (GJ) similarity
  Two sets: x = {a_1, …, a_|x|}, y = {b_1, …, b_|y|}
  Normalized weight of the maximum bipartite matching M in the bipartite graph (N = x ∪ y, E = x × y)

30 Bound Filtering [On et al, ICDE 07]
[Figure: two copies of the bipartite graph between x and y, with edge weights 0.7, 0.5, 0.4, 0.2, 0.9, 0.1]
M: maximum weight bipartite matching

31 Bound Filtering [On et al, ICDE 07]
Issues
  GJ captures more semantics between two sets, via the weighted bipartite matching, than Jaccard
  But it is more costly to compute: maximum weight bipartite matching
    Bellman-Ford: O(V^2 E)
    Hungarian: O(V^3)

  For x in A:
    Using Foo, find a candidate set C in B
    For y in C:
      If GJ(x,y) > t, return (x,y);

32 Bound Filtering [On et al, ICDE 07]
Bipartite matching computation is expensive because of the requirement:
  No node in the bipartite graph can have more than one edge incident on it
Relax this constraint:
  For each element a_i in x, find an element b_j in y with the highest element-level similarity → S1
  For each element b_j in y, find an element a_i in x with the highest element-level similarity → S2
Complexity becomes linear: O(|x|+|y|)

33 Bound Filtering [On et al, ICDE 07]
[Figure: the bipartite graph between x and y (edge weights 0.7, 0.5, 0.4, 0.2, 0.9, 0.1) with the edge selections labeled S1 and S2]

34 Bound Filtering [On et al, ICDE 07]
Properties:
  The numerator of UB is at least as large as that of GJ
  The denominator of UB is no larger than that of GJ
  Similar arguments hold for LB
Theorem 2: LB <= GJ <= UB

35 Bound Filtering [On et al, ICDE 07]

  For x in A:
    Using Foo, find a candidate set C in B
    For y in C:
      If GJ(x,y) > t, return (x,y);

Algorithm
  Compute UB(x,y)
  If UB(x,y) <= t → GJ(x,y) <= t → (x,y) is not an answer
  Else compute LB(x,y)
    If LB(x,y) > t → GJ(x,y) > t → (x,y) is an answer
    Else compute GJ(x,y)
This relies on LB <= GJ <= UB
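
The control flow of this algorithm can be sketched independently of how UB, LB, and GJ are defined. The sketch below takes the three functions as parameters and counts how often the expensive one is called. Everything concrete in it is an assumption made for illustration: plain Jaccard stands in for GJ, the size ratio min(|x|,|y|)/max(|x|,|y|) stands in for UB, and a trivial bound (1 for identical sets, 0 otherwise) stands in for LB; these are not the bounds from the paper. The sketch also scans all of B rather than a candidate set C.

```python
A = {
    1: {"Lake", "Mendota"},
    2: {"Lake", "Monona", "Area"},
    3: {"Lake", "Mendota", "Monona", "Dane"},
}
B = {
    4: {"Lake", "Monona", "University"},
    5: {"Monona", "Research", "Area"},
    6: {"Lake", "Mendota", "Monona", "Area"},
}

def bound_filter_join(A, B, t, ub, lb, exact):
    """Cheap upper/lower bounds first; the exact measure only if undecided."""
    answers, exact_calls = [], 0
    for i, x in A.items():
        for j, y in B.items():
            if ub(x, y) <= t:          # exact <= UB <= t: cannot be an answer
                continue
            if lb(x, y) > t:           # exact >= LB > t: certainly an answer
                answers.append((i, j))
                continue
            exact_calls += 1           # bounds are inconclusive
            if exact(x, y) > t:
                answers.append((i, j))
    return answers, exact_calls

# Stand-in functions (assumptions, see the text above).
def jaccard(x, y):
    return len(x & y) / len(x | y)

def size_ratio_ub(x, y):
    return min(len(x), len(y)) / max(len(x), len(y))

def identity_lb(x, y):
    return 1.0 if x == y else 0.0

answers, exact_calls = bound_filter_join(A, B, 0.6, size_ratio_ub, identity_lb, jaccard)
print(answers, exact_calls)
```

With tighter bounds, such as the UB and LB derived from S1 and S2 in the talk, more pairs would presumably be decided without reaching the exact computation.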

36 Takeaways
WESSJoin finds ALL pairs of sets between two collections whose similarity > t
  A good abstraction for various problems
The 2-step framework is promising
  Step 1: reduce candidates
  Step 2: similarity computation among candidates
Less researched issues
  Comparison among different WESSJoin methods
  WESSJoin + top-k/skyline/MapReduce/etc.

37 Reference
[Sarawagi et al, SIGMOD 04] Sunita Sarawagi, Alok Kirpal. Efficient set joins on similarity predicates. SIGMOD 2004.
[Arasu et al, VLDB 06] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. Efficient exact set-similarity joins. VLDB 2006.
[Chaudhuri et al, ICDE 06] Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006.
[Bayardo et al, WWW 07] R. J. Bayardo, Yiming Ma, Ramakrishnan Srikant. Scaling Up All-Pairs Similarity Search. WWW 2007.
[On et al, ICDE 07] Byung-Won On, Nick Koudas, Dongwon Lee, Divesh Srivastava. Group Linkage. ICDE 2007.
[Xiao et al, WWW 08] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.
Wei Wang. Efficient Exact Similarity Join Algorithms:
Jeffrey D. Ullman. High-Similarity Algorithms:

