Text Joins in an RDBMS for Web Data Integration


1 Text Joins in an RDBMS for Web Data Integration
Luis Gravano, Panagiotis G. Ipeirotis (Columbia University); Nick Koudas, Divesh Srivastava (AT&T Labs - Research)

2 Why Text Joins?
Web Service A: EUROAFT CORP; HATRONIC INC
Web Service B: HATRONIC CORP; EUROAFT INC; EUROAFT CORP
Problem: the same entity has multiple textual representations.

3 Matching Text Attributes
Need for a similarity metric! Many desirable properties:
Match entries with typing mistakes: Microsoft Windpws XP vs. Microsoft Windows XP
Match entries with abbreviated information: Zurich International Airport vs. Zurich Intl. Airport
Match entries with different formatting conventions: Dept. of Computer Science vs. Computer Science Dept.
…and combinations thereof

4 Matching Text Attributes using Edit Distance
Edit distance: the number of character insertions, deletions, and modifications needed to transform one string into the other.
EUROAFT CORP vs. EURODRAFT CORP → 2
COMPUTER SCI. vs. COMPUTER → 3
KIA INTERNATIONAL vs. KIA → 13
Good for: spelling errors, insertions and deletions of short words.
Problems: word order variations, insertions and deletions of long words.
(“Approximate String Joins”, VLDB 2001)

5 Matching Text Attributes using Cosine Similarity
Similar entries should share “infrequent” tokens. In the running example, EUROAFT is an infrequent token (high weight), while CORP and INC are common tokens (low weight).
Similarity(t1, t2) = Σ_token weight(token, t1) * weight(token, t2)
Different token choices result in similarity metrics with different properties.
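The weights are tf.idf-style, so rare tokens score high and common ones low. As a minimal sketch of the idf component in SQL Server (R1Tokens and R1IDF are hypothetical names, and the length normalization the paper applies to the final weights is omitted here):

DECLARE @N FLOAT = (SELECT COUNT(*) FROM R1);

-- df = COUNT(DISTINCT tid) = number of tuples containing the token;
-- rare tokens (small df) get a high idf weight.
SELECT token,
       LOG(@N / COUNT(DISTINCT tid)) AS idf
INTO   R1IDF
FROM   R1Tokens
GROUP BY token;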

6 Using Words and Cosine Similarity
Using words as tokens: split each entry into words; similar entries then share infrequent words (again, EUROAFT is infrequent and heavily weighted, while CORP and INC are common and lightly weighted).
Good for word order variations and insertions/deletions of common words: Computer Science Dept. ~ Dept. of Computer Science
Problems with misspellings: Biotechnology Department ≠ Bioteknology Dept.
(“WHIRL”, W. Cohen, SIGMOD ’98)
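A minimal sketch of word tokenization in SQL Server (an illustration only: STRING_SPLIT postdates the paper's original statements, and R1Tokens is a hypothetical name):

-- One row per (tuple, word), lowercased so token matching is case-insensitive.
SELECT r.tid,
       LOWER(s.value) AS token
INTO   R1Tokens
FROM   R1 r
CROSS APPLY STRING_SPLIT(r.Name, ' ') s
WHERE  s.value <> '';   -- drop empty strings produced by repeated spaces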

7 Using q-grams and Cosine Similarity
Using q-grams as tokens: split each string into its substrings of length q (q-grams); similar entries share many infrequent q-grams.
Biotechnology Department → Bio, iot, ote, tec, ech, chn, hno, nol, olo, log, ogy, …, tme, men, ent
Bioteknology Department → Bio, iot, ote, tek, ekn, kno, nol, olo, log, ogy, …, tme, men, ent
Naturally handles misspellings, word order variations, and insertions and deletions of common or short words.
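Q-grams can be generated inside the RDBMS with an auxiliary integer table, the trick used in the authors' earlier VLDB 2001 work (table names here are illustrative):

-- Integers(n) holds 1, 2, 3, ...; each n yields the q-gram starting at position n.
SELECT r.tid,
       SUBSTRING(r.Name, i.n, 3) AS token    -- q = 3
INTO   R1QGrams
FROM   R1 r
JOIN   Integers i
  ON   i.n <= LEN(r.Name) - 2;               -- LEN(Name) - q + 1 starting positions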

8 Problem
For two entries t1, t2:
Similarity(t1, t2) = Σ_token weight(token, t1) * weight(token, t2), with 0 ≤ Similarity ≤ 1.
Problem that we address: given two relations, report all pairs of entries with cosine similarity above a threshold φ.

9 Computing Text Joins in an RDBMS
Create, in SQL, relations RiWeights(tid, token, weight) holding the token weights for each tuple of Ri. For example:

R1: 1 EUROAFT CORP; 2 HATRONIC INC
R2: 1 HATRONIC CORP; 2 EUROAFT INC; 3 EUROAFT CORP

Then compute the similarity of each tuple pair:

SELECT r1w.tid AS tid1, r2w.tid AS tid2
FROM R1Weights r1w, R2Weights r2w
WHERE r1w.token = r2w.token
GROUP BY r1w.tid, r2w.tid
HAVING SUM(r1w.weight * r2w.weight) ≥ φ

This is an expensive operation: it computes the similarity of many useless pairs, such as those pairing a EUROAFT entry with a HATRONIC entry (similarities of only 0.01–0.02 in this example).

10 Sampling Step for Text Joins
Similarity(t1, t2) = Σ_token weight(token, t1) * weight(token, t2)
The similarity is a sum of products, and a product cannot be high when either weight is small. We can therefore (safely) drop low weights from RiWeights by sampling, adapting [Cohen & Lewis, SODA ’97] for efficient execution inside an RDBMS.

RiWeights: EUROAFT 0.9144; HATRONIC 0.8419; low weights for the common tokens CORP and INC
RiSample (sampling 20 times): EUROAFT sampled 18 times (18/20 = 0.90); HATRONIC sampled 17 times (17/20 = 0.85)

Sampling eliminates low-similarity pairs (e.g., “EUROAFT INC” with “HATRONIC INC”).

11 Sampling-Based Text Joins in SQL
Example input:

R1: 1 EUROAFT CORP; 2 HATRONIC INC
R1Weights: 1 EUROAFT 0.98; 1 CORP 0.02; 2 HATRONIC …; 2 INC 0.01
R2Weights: 1 HATRONIC 0.98; 1 CORP 0.02; 2 EUROAFT 0.95; 2 INC 0.05; 3 EUROAFT 0.97; 3 CORP 0.03

Fully implemented in pure SQL! The join runs against the sample R2Sample instead of the full R2Weights:

SELECT r1w.tid AS tid1, r2s.tid AS tid2
FROM R1Weights r1w, R2Sample r2s, R2sum r2sum
WHERE r1w.token = r2s.token AND r1w.token = r2sum.token
GROUP BY r1w.tid, r2s.tid
HAVING SUM(r1w.weight * r2sum.total * r2s.c) ≥ S * φ

The high-similarity pairs survive, e.g., (EUROAFT CORP, EUROAFT INC) with similarity 0.98 and (HATRONIC INC, HATRONIC CORP) with roughly 0.9.
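The query references an auxiliary relation R2sum holding the total weight of each token in R2. A minimal sketch of one plausible way to materialize it (an assumption; the paper's exact definition may differ):

SELECT token,
       SUM(weight) AS total
INTO   R2sum
FROM   R2Weights
GROUP BY token;

S is the number of samples drawn (20 in the previous slide's example); comparing the sum against S * φ rescales the sampled counts r2s.c back to the similarity scale.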

12 Experimental Setup
40,000 entries from an AT&T customer database, split into R1 (26,000 entries) and R2 (14,000 entries).
Tokenizations: words; q-grams with q = 2 and q = 3.
Methods compared: variations of sample-based joins; a baseline in SQL; WHIRL [SIGMOD ’98], adapted to handle q-grams.

13 Metrics
Execute the (approximate) join for similarity > φ and measure:
Precision (measures accuracy): the fraction of pairs in the answer with real similarity > φ
Recall (measures completeness): the fraction of pairs with real similarity > φ that are also in the answer
Execution time
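In symbols, if A is the set of pairs the approximate join returns and C is the set of pairs with real similarity > φ:

Precision = |A ∩ C| / |A|        Recall = |A ∩ C| / |C|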

14 Comparing WHIRL and Sample-based Joins
Sample-based joins: good recall across similarity thresholds.
WHIRL: very low recall (almost 0 for thresholds below 0.7).

15 Changing Sample Size
Increased sample size → better recall and precision.
Drawback: increased execution time.

16 Execution Time
WHIRL and sample-based text joins break even at a sample size of S ≈ 64 to 128.

17 Contributions
“WHIRL [Cohen, SIGMOD ’98] inside an RDBMS”: scalability, with no data exporting or importing.
Different token choices:
Words: capture word swaps and deletions of common words.
Q-grams: all of the above, plus spelling mistakes, but slower.
The SQL statements were tested in MS SQL Server and are available for download.

18 Questions?

