Text Joins for Data Cleansing and Integration in an RDBMS Luis Gravano Panagiotis G. Ipeirotis Nick Koudas Divesh Srivastava Columbia University AT&T Labs - Research
Problem: Same entity has multiple textual representations Why Text Joins? Service B HATRONIC CORP EUROAFT INC EUROAFT CORP … Service A EUROAFT CORP HATRONIC INC … Problem: Same entity has multiple textual representations 4/5/2019 Columbia University
Matching Text Attributes using Cosine Similarity Similar tuples should share “infrequent” tokens Infrequent token (high weight) EUROAFT CORP ≈ EUROAFT INC ≠ HATRONIC CORP Common token (low weight) Similarity = Σ weight(token, t1) * weight(token, t2) token Problem: Given two relations, report tuple pairs with similarity above threshold φ 4/5/2019 Columbia University
Computing Text Joins in an RDBMS Create in SQL relations RiWeights (token weights from Ri) 2 1 … INC HATRONIC CORP EUROAFT Token 0.01 0.98 W 0.02 R1Weights 0.03 3 0.05 0.97 0.95 R2Weights Computes similarity for many useless pairs Expensive operation! SELECT r1w.tid AS tid1, r2w.tid AS tid2 FROM R1Weights r1w, R2Weights r2w WHERE r1w.token = r2w.token GROUP BY r1w.tid, r2w.tid HAVING SUM(r1w.weight*r2w.weight) ≥ φ Compute similarity of each tuple pair R1 R2 Name 1 EUROAFT CORP 2 HATRONIC INC … Name 1 HATRONIC CORP 2 EUROAFT INC 3 EUROAFT CORP … R1 R2 Similarity EUROAFT CORP EUROAFT INC 0.98 1.00 HATRONIC CORP 0.01 HATRONIC INC 0.02 4/5/2019 Columbia University
Sampling Step for Text Joins Similarity = Σ weight(token, t1) * weight(token, t2) Similarity is a sum of products Products cannot be high when weight is small Can (safely) drop low weights from RiWeights (adapted from [Cohen & Lewis, SODA97] for efficient execution inside an RDBMS) RiWeights Token W EUROAFT 0.9144 HATRONIC 0.8419 … CORP 0.01247 INC 0.00504 RiSample → Sampling 20 times Token #TIMES SAMPLED EUROAFT 18 (18/20=0.90) HATRONIC 17 (17/20=0.85) Eliminates low similarity pairs (e.g., “EUROAFT INC” with “HATRONIC INC”) 4/5/2019 Columbia University
Sampling-Based Text Joins in SQL R1Weights R2Sample R1 Token W 1 EUROAFT 0.98 CORP 0.02 2 HATRONIC INC 0.01 … Token W 1 HATRONIC 0.98 CORP 0.02 2 EUROAFT 0.95 INC 0.05 3 0.97 0.03 Name 1 EUROAFT CORP 2 HATRONIC INC … Fully implemented in pure SQL! SELECT r1w.tid AS tid1, r2s.tid AS tid2 FROM R1Weights r1w, R2Sample r2s, R2sum r2sum WHERE r1w.token = r2s.token AND r1w.token = r2sum.token GROUP BY r1w.tid, r2s.tid HAVING SUM(r1w.weight*r2sum.total*r2s.c) ≥ S*φ R1 R2 Similarity EUROAFT CORP EUROAFT INC 0.98 0.9 HATRONIC INC HATRONIC CORP 4/5/2019 Columbia University
SQL statements tested in MS SQL Server and available for download at: Contributions “WHIRL [Cohen, SIGMOD98] inside an RDBMS”: Scalability, no data exporting/importing Different tokens choices: Words: Captures word swaps, deletion of common words Q-grams: All the above, plus spelling mistakes, but slower SQL statements tested in MS SQL Server and available for download at: http://www.cs.columbia.edu/~pirot/DataCleaning/ 4/5/2019 Columbia University
Questions? 4/5/2019 Columbia University
Overflow Slides 4/5/2019 Columbia University
Recall for 3-grams Upcoming WWW 2003 paper 4/5/2019 Columbia University
Precision for 3-grams Upcoming WWW 2003 paper 4/5/2019 Columbia University