Published by Friedrich Hoch. Modified over 5 years ago.
1
Text Joins for Data Cleansing and Integration in an RDBMS
Luis Gravano, Panagiotis G. Ipeirotis (Columbia University)
Nick Koudas, Divesh Srivastava (AT&T Labs - Research)
2
Why Text Joins?
Service A: EUROAFT CORP, HATRONIC INC, …
Service B: HATRONIC CORP, EUROAFT INC, EUROAFT CORP, …
Problem: Same entity has multiple textual representations
4/5/2019 Columbia University
3
Matching Text Attributes using Cosine Similarity
Similar tuples should share “infrequent” tokens
EUROAFT CORP ≈ EUROAFT INC ≠ HATRONIC CORP
  Infrequent token (high weight), e.g. EUROAFT
  Common token (low weight), e.g. CORP
Similarity = Σ_token weight(token, t1) * weight(token, t2)
Problem: Given two relations, report tuple pairs with similarity above threshold φ
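The similarity formula on this slide can be sketched in Python as a sum of weight products over the tokens two tuples share. The function name `similarity` is hypothetical, and the weights are illustrative values modeled on the tables shown on later slides, not an authoritative weighting scheme.

```python
def similarity(weights1, weights2):
    """Sum of weight products over tokens that appear in both tuples."""
    shared = weights1.keys() & weights2.keys()
    return sum(weights1[tok] * weights2[tok] for tok in shared)

# Illustrative token weights (assumed): rare tokens carry most of the weight.
euroaft_corp = {"EUROAFT": 0.98, "CORP": 0.02}
euroaft_inc = {"EUROAFT": 0.95, "INC": 0.05}
hatronic_corp = {"HATRONIC": 0.98, "CORP": 0.02}

# These two tuples share the rare token EUROAFT, so the similarity is high.
sim_same = similarity(euroaft_corp, euroaft_inc)
# These two share only the common token CORP, so the similarity is near zero.
sim_diff = similarity(euroaft_corp, hatronic_corp)
```

With these weights, sharing the rare token EUROAFT dominates the score, while sharing only CORP contributes almost nothing.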
4
Computing Text Joins in an RDBMS
Create in SQL relations RiWeights (token weights from Ri)

R1 (tid, Name): 1 EUROAFT CORP; 2 HATRONIC INC; …
R2 (tid, Name): 1 HATRONIC CORP; 2 EUROAFT INC; 3 EUROAFT CORP; …

R1Weights
  tid  Token     W
  1    EUROAFT   0.98
  1    CORP      0.02
  2    HATRONIC  0.98
  2    INC       0.01
  …

R2Weights
  tid  Token     W
  1    HATRONIC  0.98
  1    CORP      0.02
  2    EUROAFT   0.95
  2    INC       0.05
  3    EUROAFT   0.97
  3    CORP      0.03
  …

Compute similarity of each tuple pair:

  SELECT r1w.tid AS tid1, r2w.tid AS tid2
  FROM R1Weights r1w, R2Weights r2w
  WHERE r1w.token = r2w.token
  GROUP BY r1w.tid, r2w.tid
  HAVING SUM(r1w.weight*r2w.weight) >= φ

Expensive operation! Computes similarity for many useless pairs.

[Result table with columns R1, R2, Similarity; surviving entries: EUROAFT CORP, EUROAFT INC, HATRONIC CORP, HATRONIC INC with values 0.98, 1.00, 0.01, 0.02]
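To try the join query from this slide, here is a minimal sketch that runs it in an in-memory SQLite database through Python's `sqlite3` module (the presentation itself reports testing on MS SQL Server). The column names `tid`, `token`, `weight` follow the query; the weight values are read from the slide's garbled tables and should be treated as illustrative. The helper `text_join`, the extra `sim` output column, and the `ORDER BY` are additions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R1Weights (tid INTEGER, token TEXT, weight REAL);
    CREATE TABLE R2Weights (tid INTEGER, token TEXT, weight REAL);
""")

# Illustrative weights (assumed values, modeled on the slide's tables).
r1_rows = [(1, "EUROAFT", 0.98), (1, "CORP", 0.02),
           (2, "HATRONIC", 0.98), (2, "INC", 0.01)]
r2_rows = [(1, "HATRONIC", 0.98), (1, "CORP", 0.02),
           (2, "EUROAFT", 0.95), (2, "INC", 0.05),
           (3, "EUROAFT", 0.97), (3, "CORP", 0.03)]
conn.executemany("INSERT INTO R1Weights VALUES (?, ?, ?)", r1_rows)
conn.executemany("INSERT INTO R2Weights VALUES (?, ?, ?)", r2_rows)

def text_join(conn, phi):
    """Return (tid1, tid2, similarity) for pairs with similarity >= phi."""
    query = """
        SELECT r1w.tid AS tid1, r2w.tid AS tid2,
               SUM(r1w.weight * r2w.weight) AS sim
        FROM R1Weights r1w, R2Weights r2w
        WHERE r1w.token = r2w.token
        GROUP BY r1w.tid, r2w.tid
        HAVING SUM(r1w.weight * r2w.weight) >= ?
        ORDER BY tid1, tid2
    """
    return conn.execute(query, (phi,)).fetchall()

matches = text_join(conn, 0.5)     # pairs above the threshold
all_pairs = text_join(conn, 0.0)   # every pair sharing at least one token
```

Note that the join materializes every pair that shares any token, including pairs that only share CORP or INC, before the `HAVING` clause discards them. That is the cost the slide calls out.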
5
Sampling Step for Text Joins
Similarity = Σ_token weight(token, t1) * weight(token, t2)
Similarity is a sum of products
Products cannot be high when weight is small
Can (safely) drop low weights from RiWeights (adapted from [Cohen & Lewis, SODA97] for efficient execution inside an RDBMS)

RiWeights → (sampling 20 times) → RiSample

RiWeights
  Token     W
  EUROAFT   0.9144
  HATRONIC  0.8419
  …
  CORP
  INC

RiSample
  Token     #TIMES SAMPLED
  EUROAFT   18 (18/20=0.90)
  HATRONIC  17 (17/20=0.85)

Eliminates low similarity pairs (e.g., “EUROAFT INC” with “HATRONIC INC”)
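One way to picture the sampling step is the sketch below. It assumes each (tuple, token) entry is drawn S times, succeeding each time with probability weight / total, where total is the summed weight of that token across the relation. This is a reading inferred from the `total`, `c`, and `S` terms in the SQL on the next slide, not a confirmed description of the authors' procedure. The function name `sample_weights` and the synthetic data are hypothetical.

```python
import random
from collections import defaultdict

def sample_weights(weights, num_samples, rng):
    """Draw each (tid, token) entry num_samples times with p = weight / token total.

    weights: list of (tid, token, weight) rows, as in RiWeights.
    Returns (sample, totals): sample maps (tid, token) to the number of
    successful draws c (entries with c == 0 are dropped), and totals maps
    each token to its summed weight.
    """
    totals = defaultdict(float)
    for _tid, token, weight in weights:
        totals[token] += weight

    sample = {}
    for tid, token, weight in weights:
        p = weight / totals[token]
        c = sum(1 for _ in range(num_samples) if rng.random() < p)
        if c > 0:
            sample[(tid, token)] = c
    return sample, dict(totals)

# Synthetic relation (assumed data): 50 tuples share the common token CORP,
# and only tuple 0 contains the rare token EUROAFT.
S = 20
weights = [(0, "EUROAFT", 0.98)] + [(tid, "CORP", 0.02) for tid in range(50)]
sample, totals = sample_weights(weights, S, random.Random(0))

# The weight of a surviving entry is estimated as total * c / S.
estimates = {key: totals[key[1]] * c / S for key, c in sample.items()}
```

Under this reading, the rare token is kept in every draw, while most of the CORP entries are never drawn and disappear from the sample. This is the sense in which low weights can be dropped.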
6
Sampling-Based Text Joins in SQL
R1 (tid, Name): 1 EUROAFT CORP; 2 HATRONIC INC; …

R1Weights
  tid  Token     W
  1    EUROAFT   0.98
  1    CORP      0.02
  2    HATRONIC  0.98
  2    INC       0.01
  …

R2Sample
  tid  Token     W
  1    HATRONIC  0.98
  1    CORP      0.02
  2    EUROAFT   0.95
  2    INC       0.05
  3    EUROAFT   0.97
  3    CORP      0.03

Fully implemented in pure SQL!

  SELECT r1w.tid AS tid1, r2s.tid AS tid2
  FROM R1Weights r1w, R2Sample r2s, R2sum r2sum
  WHERE r1w.token = r2s.token AND r1w.token = r2sum.token
  GROUP BY r1w.tid, r2s.tid
  HAVING SUM(r1w.weight*r2sum.total*r2s.c) >= S*φ

[Result table with columns R1, R2, Similarity; surviving entries: EUROAFT CORP, EUROAFT INC, 0.98, 0.9, HATRONIC INC, HATRONIC CORP]
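The sampled join on this slide can be exercised with a small sketch in SQLite via Python's `sqlite3`. The schemas `R2Sample(tid, token, c)` and `R2sum(token, total)` are inferred from the query. The counts in `R2Sample` are hand-picked to sit near their expected values for S = 20 so the result is reproducible, whereas a real run would draw them at random. The weights, the helper `sampled_join`, and the extra `est` output column are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R1Weights (tid INTEGER, token TEXT, weight REAL);
    CREATE TABLE R2Sample (tid INTEGER, token TEXT, c INTEGER);
    CREATE TABLE R2sum (token TEXT, total REAL);
""")

# Illustrative data (assumed). R2sum.total is the summed weight of a token
# over R2; R2Sample.c is how many of the S draws picked that entry.
conn.executemany("INSERT INTO R1Weights VALUES (?, ?, ?)", [
    (1, "EUROAFT", 0.98), (1, "CORP", 0.02),
    (2, "HATRONIC", 0.98), (2, "INC", 0.01)])
conn.executemany("INSERT INTO R2sum VALUES (?, ?)", [
    ("HATRONIC", 0.98), ("CORP", 0.05), ("EUROAFT", 1.92), ("INC", 0.05)])
conn.executemany("INSERT INTO R2Sample VALUES (?, ?, ?)", [
    (1, "HATRONIC", 20), (1, "CORP", 8),
    (2, "EUROAFT", 10), (2, "INC", 20),
    (3, "EUROAFT", 10), (3, "CORP", 12)])

def sampled_join(conn, num_samples, phi):
    """Return (tid1, tid2, estimated similarity) for pairs passing S * phi."""
    query = """
        SELECT r1w.tid AS tid1, r2s.tid AS tid2,
               SUM(r1w.weight * r2sum.total * r2s.c) / :s AS est
        FROM R1Weights r1w, R2Sample r2s, R2sum r2sum
        WHERE r1w.token = r2s.token AND r1w.token = r2sum.token
        GROUP BY r1w.tid, r2s.tid
        HAVING SUM(r1w.weight * r2sum.total * r2s.c) >= :s * :phi
        ORDER BY tid1, tid2
    """
    return conn.execute(query, {"s": num_samples, "phi": phi}).fetchall()

result = sampled_join(conn, 20, 0.5)
```

Dividing the summed product by S gives the estimated similarity, which is why the slide's `HAVING` clause compares the raw sum against S*φ instead.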
7
Contributions
“WHIRL [Cohen, SIGMOD98] inside an RDBMS”: Scalability, no data exporting/importing
Different token choices:
  Words: Captures word swaps, deletion of common words
  Q-grams: All the above, plus spelling mistakes, but slower
SQL statements tested in MS SQL Server and available for download at:
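The difference between the two token choices can be sketched as follows. The helper names `word_tokens` and `qgram_tokens` are hypothetical, and the q-gram version uses no padding characters at the string boundaries, which is a simplification assumed here.

```python
def word_tokens(text):
    """Split on whitespace; word order is ignored because a set is returned."""
    return set(text.upper().split())

def qgram_tokens(text, q=3):
    """All overlapping substrings of length q (no boundary padding)."""
    s = text.upper()
    return {s[i:i + q] for i in range(len(s) - q + 1)}

# Word tokens are insensitive to word swaps.
swap_ok = word_tokens("CORP EUROAFT") == word_tokens("EUROAFT CORP")

# A spelling mistake breaks the word match but leaves most 3-grams intact.
shared_words = word_tokens("HATRONIC") & word_tokens("HATRONIK")
shared_qgrams = qgram_tokens("HATRONIC") & qgram_tokens("HATRONIK")
```

The misspelled name shares no word token with the original but still shares five of its six 3-grams. Each string also produces more q-grams than words, which fits the slide's note that q-grams are slower.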
8
Questions?
9
Overflow Slides
10
Recall for 3-grams
Upcoming WWW 2003 paper
11
Precision for 3-grams
Upcoming WWW 2003 paper