Text Joins for Data Cleansing and Integration in an RDBMS


1 Text Joins for Data Cleansing and Integration in an RDBMS
Luis Gravano, Panagiotis G. Ipeirotis (Columbia University)
Nick Koudas, Divesh Srivastava (AT&T Labs - Research)

2 Why Text Joins?
Service A: EUROAFT CORP, HATRONIC INC
Service B: HATRONIC CORP, EUROAFT INC, EUROAFT CORP
Problem: Same entity has multiple textual representations
4/5/2019 Columbia University

3 Matching Text Attributes using Cosine Similarity
Similar tuples should share “infrequent” tokens
Example names: EUROAFT CORP, EUROAFT INC, HATRONIC CORP
Infrequent token (high weight), e.g. EUROAFT
Common token (low weight), e.g. CORP
Similarity = Σ_token weight(token, t1) * weight(token, t2)
Problem: Given two relations, report tuple pairs with similarity above threshold φ
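The similarity above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the weighting formula (`tf * log(1 + n/df)`, normalized to unit length), the helper names `tfidf_vectors` and `similarity`, and the eight-name corpus are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(names):
    """Return one unit-length {token: weight} dict per name."""
    token_counts = [Counter(name.split()) for name in names]
    n = len(names)
    # Document frequency: in how many names each token occurs.
    df = Counter(tok for counts in token_counts for tok in counts)
    vectors = []
    for counts in token_counts:
        # Assumed weighting: tf * log(1 + n/df), so rarer tokens weigh more.
        raw = {tok: tf * math.log(1 + n / df[tok]) for tok, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({tok: w / norm for tok, w in raw.items()})
    return vectors

def similarity(v1, v2):
    """Sum of weight products over the tokens the two vectors share."""
    return sum(w * v2[tok] for tok, w in v1.items() if tok in v2)

# Hypothetical corpus: CORP and INC are common, company names are rare.
names = [
    "EUROAFT CORP", "EUROAFT INC", "HATRONIC CORP", "HATRONIC INC",
    "ACME CORP", "ACME INC", "GLOBEX CORP", "GLOBEX INC",
]
vecs = dict(zip(names, tfidf_vectors(names)))

same_company = similarity(vecs["EUROAFT CORP"], vecs["EUROAFT INC"])
same_suffix = similarity(vecs["EUROAFT CORP"], vecs["HATRONIC CORP"])
print(same_company > same_suffix)  # True
```

Sharing the rare token EUROAFT contributes more to the score than sharing the common token CORP, which is the behavior the slide describes.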

4 Computing Text Joins in an RDBMS
Create in SQL relations RiWeights (token weights from Ri)
R1
    tid  Name
    1    EUROAFT CORP
    2    HATRONIC INC
R2
    tid  Name
    1    HATRONIC CORP
    2    EUROAFT INC
    3    EUROAFT CORP
R1Weights, R2Weights (columns: tid, Token, W): infrequent tokens (EUROAFT, HATRONIC) carry weights between 0.95 and 0.98; common tokens (CORP, INC) carry weights between 0.01 and 0.05
Compute similarity of each tuple pair:
    SELECT r1w.tid AS tid1, r2w.tid AS tid2
    FROM R1Weights r1w, R2Weights r2w
    WHERE r1w.token = r2w.token
    GROUP BY r1w.tid, r2w.tid
    HAVING SUM(r1w.weight*r2w.weight) >= φ
[Result table: columns R1, R2, Similarity; similarity values shown: 0.98, 1.00, 0.01, 0.02]
Computes similarity for many useless pairs
Expensive operation!
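The baseline join can be tried end to end with Python's built-in `sqlite3` module. This is only a sketch: the talk tested its SQL on MS SQL Server, the weights below (0.98 for rare tokens, 0.20 for common ones) are illustrative rather than the slide's exact tables, and the threshold `PHI = 0.5` is a hypothetical value.

```python
import sqlite3

PHI = 0.5  # similarity threshold (hypothetical value)

# Illustrative (tid, token, weight) rows; not the slide's exact numbers.
r1_weights = [
    (1, "EUROAFT", 0.98), (1, "CORP", 0.20),
    (2, "HATRONIC", 0.98), (2, "INC", 0.20),
]
r2_weights = [
    (1, "HATRONIC", 0.98), (1, "CORP", 0.20),
    (2, "EUROAFT", 0.98), (2, "INC", 0.20),
    (3, "EUROAFT", 0.98), (3, "CORP", 0.20),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R1Weights (tid INTEGER, token TEXT, weight REAL);
CREATE TABLE R2Weights (tid INTEGER, token TEXT, weight REAL);
""")
conn.executemany("INSERT INTO R1Weights VALUES (?, ?, ?)", r1_weights)
conn.executemany("INSERT INTO R2Weights VALUES (?, ?, ?)", r2_weights)

# Same query shape as on the slide; ORDER BY is added only to make the
# output deterministic, and the threshold is passed as a parameter.
query = """
SELECT r1w.tid AS tid1, r2w.tid AS tid2
FROM R1Weights r1w, R2Weights r2w
WHERE r1w.token = r2w.token
GROUP BY r1w.tid, r2w.tid
HAVING SUM(r1w.weight * r2w.weight) >= ?
ORDER BY tid1, tid2
"""
rows = conn.execute(query, (PHI,)).fetchall()
print(rows)  # [(1, 2), (1, 3), (2, 1)]
```

The equi-join on `token` still forms groups for pairs that share only CORP or only INC, and those groups are discarded by `HAVING`. That wasted work is what the slide calls "many useless pairs".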

5 Sampling Step for Text Joins
Similarity = Σ weight(token, t1) * weight(token, t2)
Similarity is a sum of products
Products cannot be high when weight is small
Can (safely) drop low weights from RiWeights (adapted from [Cohen & Lewis, SODA97] for efficient execution inside an RDBMS)
RiWeights (columns: Token, W)
    EUROAFT   0.9144
    HATRONIC  0.8419
    CORP
    INC
RiSample (columns: Token, #TIMES SAMPLED), sampling 20 times
    EUROAFT   18 (18/20 = 0.90)
    HATRONIC  17 (17/20 = 0.85)
Eliminates low similarity pairs (e.g., “EUROAFT INC” with “HATRONIC INC”)
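The effect of weight-proportional sampling can be illustrated with a simplified sketch. It assumes tokens are drawn with replacement with probability proportional to their weight; the exact scheme adapted from Cohen & Lewis may differ. The function `sample_tokens`, the weights, and the seed are hypothetical.

```python
import random
from collections import Counter

def sample_tokens(weights, s, rng):
    """Draw s tokens with replacement, proportionally to their weights."""
    tokens = list(weights)
    draws = rng.choices(tokens, weights=[weights[t] for t in tokens], k=s)
    return Counter(draws)

# Hypothetical weights: one rare (heavy) token and one common (light) token.
weights = {"EUROAFT": 0.98, "CORP": 0.02}
rng = random.Random(0)  # fixed seed so the run is repeatable

counts = sample_tokens(weights, 20, rng)
for token in weights:
    # The fraction of draws approximates the token's share of total weight.
    print(token, counts[token], counts[token] / 20)
```

Low-weight tokens are rarely or never drawn, so they mostly vanish from the sample, while the sampled fraction of a heavy token stays close to its share of the weight.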

6 Sampling-Based Text Joins in SQL
Fully implemented in pure SQL!
R1
    tid  Name
    1    EUROAFT CORP
    2    HATRONIC INC
[Tables R1Weights and R2Sample (columns: tid, Token, W): weights between 0.95 and 0.98 for EUROAFT and HATRONIC, between 0.01 and 0.05 for CORP and INC]
    SELECT r1w.tid AS tid1, r2s.tid AS tid2
    FROM R1Weights r1w, R2Sample r2s, R2sum r2sum
    WHERE r1w.token = r2s.token AND r1w.token = r2sum.token
    GROUP BY r1w.tid, r2s.tid
    HAVING SUM(r1w.weight*r2sum.total*r2s.c) >= S*φ
[Result table: columns R1, R2, Similarity; pairs involve EUROAFT CORP, EUROAFT INC, HATRONIC INC, HATRONIC CORP; similarity values shown: 0.98, 0.9]
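The sampling-based query can be exercised with Python's `sqlite3` as a rough sketch. The construction of `R2Sample` and `R2sum` below is an assumption inferred from the query's use of `r2sum.total` and `r2s.c`: for each token, `total` is the sum of its weights in R2, and `c` counts how often each R2 tuple is drawn in `S` weight-proportional draws. The weights, `S`, `PHI`, and the seed are hypothetical.

```python
import random
import sqlite3
from collections import Counter, defaultdict

S = 200    # draws per token (hypothetical)
PHI = 0.5  # similarity threshold (hypothetical)

# Illustrative (tid, token, weight) rows; not the slide's exact numbers.
r1_weights = [
    (1, "EUROAFT", 0.98), (1, "CORP", 0.20),
    (2, "HATRONIC", 0.98), (2, "INC", 0.20),
]
r2_weights = [
    (1, "HATRONIC", 0.98), (1, "CORP", 0.20),
    (2, "EUROAFT", 0.98), (2, "INC", 0.20),
    (3, "EUROAFT", 0.98), (3, "CORP", 0.20),
]

# Group R2 weights by token, then sample tuples per token.
by_token = defaultdict(list)
for tid, token, w in r2_weights:
    by_token[token].append((tid, w))

rng = random.Random(0)
r2_sum, r2_sample = [], []
for token, entries in by_token.items():
    tids = [tid for tid, _ in entries]
    ws = [w for _, w in entries]
    r2_sum.append((token, sum(ws)))
    # c = number of times each tuple was drawn for this token.
    draws = Counter(rng.choices(tids, weights=ws, k=S))
    r2_sample.extend((tid, token, c) for tid, c in draws.items())

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R1Weights (tid INTEGER, token TEXT, weight REAL);
CREATE TABLE R2Sample (tid INTEGER, token TEXT, c INTEGER);
CREATE TABLE R2sum (token TEXT, total REAL);
""")
conn.executemany("INSERT INTO R1Weights VALUES (?, ?, ?)", r1_weights)
conn.executemany("INSERT INTO R2Sample VALUES (?, ?, ?)", r2_sample)
conn.executemany("INSERT INTO R2sum VALUES (?, ?)", r2_sum)

# Same query shape as on the slide; ORDER BY added for stable output.
query = """
SELECT r1w.tid AS tid1, r2s.tid AS tid2
FROM R1Weights r1w, R2Sample r2s, R2sum r2sum
WHERE r1w.token = r2s.token AND r1w.token = r2sum.token
GROUP BY r1w.tid, r2s.tid
HAVING SUM(r1w.weight * r2sum.total * r2s.c) >= ? * ?
ORDER BY tid1, tid2
"""
rows = conn.execute(query, (S, PHI)).fetchall()
print(rows)
```

Under this assumed construction, `total * c / S` estimates a tuple's original weight for a token, so comparing the sum against `S*φ` is equivalent to comparing the estimated similarity against `φ`.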

7 Contributions
“WHIRL [Cohen, SIGMOD98] inside an RDBMS”: Scalability, no data exporting/importing
Different token choices:
    Words: Captures word swaps, deletion of common words
    Q-grams: All the above, plus spelling mistakes, but slower
SQL statements tested in MS SQL Server and available for download at:
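The two token choices named above (words and q-grams) can be compared with a short sketch. The padding convention (`q - 1` copies of `#` on each side) and the helper names `word_tokens` and `qgrams` are assumptions for illustration, not taken from the slides.

```python
def word_tokens(s):
    """Split a string into whitespace-separated word tokens."""
    return s.split()

def qgrams(s, q=3, pad="#"):
    """Return overlapping q-grams of s, padded so edge characters count."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

clean = "EUROAFT CORP"
typo = "EUROAFFT CORP"  # hypothetical spelling mistake

# With word tokens, the misspelled company name no longer matches.
shared_words = set(word_tokens(clean)) & set(word_tokens(typo))
# With 3-grams, most tokens of the misspelled name still match.
shared_grams = set(qgrams(clean)) & set(qgrams(typo))

print(shared_words)       # {'CORP'}
print(len(shared_grams))
```

Q-grams tolerate spelling mistakes because a single edit changes only a few of the overlapping grams. The cost is many more tokens per string, which fits the slide's note that they are slower.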

8 Questions?

9 Overflow Slides

10 Recall for 3-grams
Upcoming WWW 2003 paper

11 Precision for 3-grams
Upcoming WWW 2003 paper

