Text Joins in an RDBMS for Web Data Integration


1 Text Joins in an RDBMS for Web Data Integration
Luis Gravano, Panagiotis G. Ipeirotis (Columbia University); Nick Koudas, Divesh Srivastava (AT&T Labs - Research)

2 Why Text Joins?
Web Service A: EUROAFT CORP; HATRONIC INC
Web Service B: HATRONIC CORP; EUROAFT INC; EUROAFT CORP
Problem: the same entity has multiple textual representations.

3 Matching Text Attributes
Need for a similarity metric! Many desirable properties:
Match entries with typing mistakes: Microsoft Windpws XP vs. Microsoft Windows XP
Match entries with abbreviated information: Zurich International Airport vs. Zurich Intl. Airport
Match entries with different formatting conventions: Dept. of Computer Science vs. Computer Science Dept.
…and combinations thereof

4 Matching Text Attributes using Edit Distance
Edit distance: the number of character insertions, deletions, and modifications needed to transform one string into the other.
EUROAFT CORP vs. EURODRAFT CORP → 2
COMPUTER SCI. vs. COMPUTER → 3
KIA INTERNATIONAL vs. KIA → 13
Good for: spelling errors, insertions and deletions of short words.
Problems: word order variations, insertions and deletions of long words.
(“Approximate String Joins”, VLDB 2001)

5 Matching Text Attributes using Cosine Similarity
Similar entries should share “infrequent” tokens. In the running example, EUROAFT is an infrequent token (high weight), while CORP and INC are common tokens (low weight).
Similarity(t1, t2) = Σ_token weight(token, t1) * weight(token, t2)
Different token choices result in similarity metrics with different properties.
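The weights are tf.idf-style, so rare tokens score high and common ones low. As a minimal sketch of the idf component in SQL Server (R1Tokens and R1IDF are hypothetical names, and the length normalization the paper applies to the final weights is omitted here):

DECLARE @N FLOAT = (SELECT COUNT(*) FROM R1);

-- df = COUNT(DISTINCT tid) = number of tuples containing the token;
-- rare tokens (small df) get a high idf weight.
SELECT token,
       LOG(@N / COUNT(DISTINCT tid)) AS idf
INTO   R1IDF
FROM   R1Tokens
GROUP BY token;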

6 Using Words and Cosine Similarity
Using words as tokens: split each entry into words; similar entries then share infrequent words (again, EUROAFT is infrequent and heavily weighted, while CORP and INC are common and lightly weighted).
Good for word order variations and insertions/deletions of common words: Computer Science Dept. ~ Dept. of Computer Science
Problems with misspellings: Biotechnology Department ≠ Bioteknology Dept.
(“WHIRL”, W. Cohen, SIGMOD ’98)
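A minimal sketch of word tokenization in SQL Server (an illustration only: STRING_SPLIT postdates the paper's original statements, and R1Tokens is a hypothetical name):

-- One row per (tuple, word), lowercased so token matching is case-insensitive.
SELECT r.tid,
       LOWER(s.value) AS token
INTO   R1Tokens
FROM   R1 r
CROSS APPLY STRING_SPLIT(r.Name, ' ') s
WHERE  s.value <> '';   -- drop empty strings produced by repeated spaces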

7 Using q-grams and Cosine Similarity
Using q-grams as tokens: split each string into its substrings of length q (q-grams); similar entries share many infrequent q-grams.
Biotechnology Department → Bio, iot, ote, tec, ech, chn, hno, nol, olo, log, ogy, …, tme, men, ent
Bioteknology Department → Bio, iot, ote, tek, ekn, kno, nol, olo, log, ogy, …, tme, men, ent
Naturally handles misspellings, word order variations, and insertions and deletions of common or short words.
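Q-grams can be generated inside the RDBMS with an auxiliary integer table, the trick used in the authors' earlier VLDB 2001 work (table names here are illustrative):

-- Integers(n) holds 1, 2, 3, ...; each n yields the q-gram starting at position n.
SELECT r.tid,
       SUBSTRING(r.Name, i.n, 3) AS token    -- q = 3
INTO   R1QGrams
FROM   R1 r
JOIN   Integers i
  ON   i.n <= LEN(r.Name) - 2;               -- LEN(Name) - q + 1 starting positions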

8 Problem
For two entries t1, t2:
Similarity(t1, t2) = Σ_token weight(token, t1) * weight(token, t2), with 0 ≤ Similarity ≤ 1.
Problem that we address: given two relations, report all pairs of entries with cosine similarity above a threshold φ.

9 Computing Text Joins in an RDBMS
Create, in SQL, relations RiWeights(tid, token, weight) holding the token weights for each tuple of Ri. For example:

R1: 1 EUROAFT CORP; 2 HATRONIC INC
R2: 1 HATRONIC CORP; 2 EUROAFT INC; 3 EUROAFT CORP

Then compute the similarity of each tuple pair:

SELECT r1w.tid AS tid1, r2w.tid AS tid2
FROM R1Weights r1w, R2Weights r2w
WHERE r1w.token = r2w.token
GROUP BY r1w.tid, r2w.tid
HAVING SUM(r1w.weight * r2w.weight) ≥ φ

This is an expensive operation: it computes the similarity of many useless pairs, such as those pairing a EUROAFT entry with a HATRONIC entry (similarities of only 0.01–0.02 in this example).

10 Sampling Step for Text Joins
Similarity(t1, t2) = Σ_token weight(token, t1) * weight(token, t2)
The similarity is a sum of products, and a product cannot be high when either weight is small. We can therefore (safely) drop low weights from RiWeights by sampling, adapting [Cohen & Lewis, SODA ’97] for efficient execution inside an RDBMS.

RiWeights: EUROAFT 0.9144; HATRONIC 0.8419; low weights for the common tokens CORP and INC
RiSample (sampling 20 times): EUROAFT sampled 18 times (18/20 = 0.90); HATRONIC sampled 17 times (17/20 = 0.85)

Sampling eliminates low-similarity pairs (e.g., “EUROAFT INC” with “HATRONIC INC”).

11 Sampling-Based Text Joins in SQL
Example input:

R1: 1 EUROAFT CORP; 2 HATRONIC INC
R1Weights: 1 EUROAFT 0.98; 1 CORP 0.02; 2 HATRONIC …; 2 INC 0.01
R2Weights: 1 HATRONIC 0.98; 1 CORP 0.02; 2 EUROAFT 0.95; 2 INC 0.05; 3 EUROAFT 0.97; 3 CORP 0.03

Fully implemented in pure SQL! The join runs against the sample R2Sample instead of the full R2Weights:

SELECT r1w.tid AS tid1, r2s.tid AS tid2
FROM R1Weights r1w, R2Sample r2s, R2sum r2sum
WHERE r1w.token = r2s.token AND r1w.token = r2sum.token
GROUP BY r1w.tid, r2s.tid
HAVING SUM(r1w.weight * r2sum.total * r2s.c) ≥ S * φ

The high-similarity pairs survive, e.g., (EUROAFT CORP, EUROAFT INC) with similarity 0.98 and (HATRONIC INC, HATRONIC CORP) with roughly 0.9.
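The query references an auxiliary relation R2sum holding the total weight of each token in R2. A minimal sketch of one plausible way to materialize it (an assumption; the paper's exact definition may differ):

SELECT token,
       SUM(weight) AS total
INTO   R2sum
FROM   R2Weights
GROUP BY token;

S is the number of samples drawn (20 in the previous slide's example); comparing the sum against S * φ rescales the sampled counts r2s.c back to the similarity scale.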

12 Experimental Setup
40,000 entries from an AT&T customer database, split into R1 (26,000 entries) and R2 (14,000 entries).
Tokenizations: words; q-grams with q = 2 and q = 3.
Methods compared: variations of sample-based joins; a baseline in SQL; WHIRL [SIGMOD ’98], adapted to handle q-grams.

13 Metrics
Execute the (approximate) join for similarity > φ and measure:
Precision (measures accuracy): the fraction of pairs in the answer with real similarity > φ
Recall (measures completeness): the fraction of pairs with real similarity > φ that are also in the answer
Execution time
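In symbols, if A is the set of pairs the approximate join returns and C is the set of pairs with real similarity > φ:

Precision = |A ∩ C| / |A|        Recall = |A ∩ C| / |C|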

14 Comparing WHIRL and Sample-based Joins
Sample-based joins: good recall across similarity thresholds.
WHIRL: very low recall (almost 0 for thresholds below 0.7).

15 Changing Sample Size
Increased sample size → better recall and precision.
Drawback: increased execution time.

16 Execution Time
WHIRL and sample-based text joins break even at a sample size of S ≈ 64 to 128.

17 Contributions
“WHIRL [Cohen, SIGMOD ’98] inside an RDBMS”: scalability, with no data exporting or importing.
Different token choices:
Words: capture word swaps and deletions of common words.
Q-grams: all of the above, plus spelling mistakes, but slower.
The SQL statements were tested in MS SQL Server and are available for download.

18 Questions?

