Text Joins for Data Cleansing and Integration in an RDBMS


Luis Gravano, Panagiotis G. Ipeirotis (Columbia University)
Nick Koudas, Divesh Srivastava (AT&T Labs - Research)

Why Text Joins?

Problem: The same entity has multiple textual representations

Service A: EUROAFT CORP, HATRONIC INC, …
Service B: HATRONIC CORP, EUROAFT INC, EUROAFT CORP, …

4/5/2019 Columbia University

Matching Text Attributes using Cosine Similarity

Similar tuples should share “infrequent” tokens

EUROAFT CORP ≈ EUROAFT INC ≠ HATRONIC CORP
  EUROAFT: infrequent token (high weight)
  CORP: common token (low weight)

Similarity = Σ_token weight(token, t1) * weight(token, t2)

Problem: Given two relations, report tuple pairs with similarity above threshold φ
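The weighting idea on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: it assumes whitespace tokenization, tf-idf weights with idf = log(N/df), and unit-length vectors. The function names are hypothetical. The ACME, GLOBEX, INITECH, UMBRELLA and HOOLI names are made-up filler, added only so that CORP and INC are common while EUROAFT stays rare.

```python
import math
from collections import Counter

def tfidf_vectors(names):
    """Map each name to a unit-length {token: weight} vector (assumed tf-idf scheme)."""
    docs = [Counter(name.split()) for name in names]
    n = len(docs)
    # Document frequency: in how many names each token occurs.
    df = Counter(tok for doc in docs for tok in doc)
    vectors = {}
    for name, doc in zip(names, docs):
        raw = {tok: tf * math.log(n / df[tok]) for tok, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0  # guard all-zero vectors
        vectors[name] = {tok: w / norm for tok, w in raw.items()}
    return vectors

def similarity(v1, v2):
    """Sum of weight products over the tokens that both tuples share."""
    return sum(w * v2[tok] for tok, w in v1.items() if tok in v2)

# First three names are from the slides; the rest are hypothetical filler.
names = ["EUROAFT CORP", "EUROAFT INC", "HATRONIC CORP",
         "ACME CORP", "GLOBEX CORP", "INITECH INC", "UMBRELLA CORP", "HOOLI INC"]
vec = tfidf_vectors(names)
print(similarity(vec["EUROAFT CORP"], vec["EUROAFT INC"]))    # shares the rare token
print(similarity(vec["EUROAFT CORP"], vec["HATRONIC CORP"]))  # shares only the common token
```

On this toy collection the first pair scores far higher than the second, because the shared token EUROAFT is rare while the shared token CORP is common.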

Computing Text Joins in an RDBMS

R1
tid  Name
1    EUROAFT CORP
2    HATRONIC INC
…

R2
tid  Name
1    HATRONIC CORP
2    EUROAFT INC
3    EUROAFT CORP
…

Create in SQL relations RiWeights (token weights from Ri)

R1Weights
tid  Token     W
1    EUROAFT   0.98
1    CORP      0.02
2    HATRONIC
2    INC       0.01
…

R2Weights
tid  Token     W
1    HATRONIC  0.98
1    CORP      0.02
2    EUROAFT   0.95
2    INC       0.05
3    EUROAFT   0.97
3    CORP      0.03
…

Compute similarity of each tuple pair

  SELECT r1w.tid AS tid1, r2w.tid AS tid2
  FROM R1Weights r1w, R2Weights r2w
  WHERE r1w.token = r2w.token
  GROUP BY r1w.tid, r2w.tid
  HAVING SUM(r1w.weight*r2w.weight) ≥ φ

Output columns: R1, R2, Similarity. Values shown: EUROAFT CORP, EUROAFT INC, 0.98, 1.00, HATRONIC CORP, 0.01, HATRONIC INC, 0.02

Computes similarity for many useless pairs
Expensive operation!
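The query on this slide can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: SQLite stands in for MS SQL Server, the threshold φ is passed as a bound parameter, and an ORDER BY is added only to make the output deterministic. The weights are the illustrative values from the slides, except that the HATRONIC weight for R1 tuple 2 is not legible in the transcript and is assumed to be 0.98. The function name is hypothetical.

```python
import sqlite3

# Illustrative weights from the slides; R1 tuple 2's HATRONIC weight (0.98) is an assumption.
R1_WEIGHTS = [(1, "EUROAFT", 0.98), (1, "CORP", 0.02),
              (2, "HATRONIC", 0.98), (2, "INC", 0.01)]
R2_WEIGHTS = [(1, "HATRONIC", 0.98), (1, "CORP", 0.02),
              (2, "EUROAFT", 0.95), (2, "INC", 0.05),
              (3, "EUROAFT", 0.97), (3, "CORP", 0.03)]

def exact_text_join(r1_weights, r2_weights, phi):
    """Run the slide's join in an in-memory SQLite database; return matching (tid1, tid2)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE R1Weights (tid INTEGER, token TEXT, weight REAL)")
    con.execute("CREATE TABLE R2Weights (tid INTEGER, token TEXT, weight REAL)")
    con.executemany("INSERT INTO R1Weights VALUES (?, ?, ?)", r1_weights)
    con.executemany("INSERT INTO R2Weights VALUES (?, ?, ?)", r2_weights)
    rows = con.execute(
        """
        SELECT r1w.tid AS tid1, r2w.tid AS tid2
        FROM R1Weights r1w, R2Weights r2w
        WHERE r1w.token = r2w.token
        GROUP BY r1w.tid, r2w.tid
        HAVING SUM(r1w.weight * r2w.weight) >= ?
        ORDER BY tid1, tid2
        """,
        (phi,),
    ).fetchall()
    con.close()
    return rows

print(exact_text_join(R1_WEIGHTS, R2_WEIGHTS, 0.5))  # → [(1, 2), (1, 3), (2, 1)]
```

With a threshold of 0 the same query returns five pairs, including ones that share only CORP or INC. The join still has to form and aggregate those groups before HAVING discards them, which is the cost the slide points out.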

Sampling Step for Text Joins

Similarity = Σ_token weight(token, t1) * weight(token, t2)

Similarity is a sum of products
Products cannot be high when weight is small
Can (safely) drop low weights from RiWeights (adapted from [Cohen & Lewis, SODA97] for efficient execution inside an RDBMS)

RiWeights
Token      W
EUROAFT    0.9144
HATRONIC   0.8419
…
CORP       0.01247
INC        0.00504

Sampling 20 times → RiSample
Token      #TIMES SAMPLED
EUROAFT    18 (18/20=0.90)
HATRONIC   17 (17/20=0.85)

Eliminates low similarity pairs (e.g., “EUROAFT INC” with “HATRONIC INC”)
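One way to read the sampling step is sketched below in Python. The exact scheme is an assumption: for each token, S tuples are drawn with probability proportional to their weight for that token, and the per-token weight total is kept so a weight can be estimated as count * total / S. This reading is suggested by the `r2sum.total`, `r2s.c` and `S*φ` terms in the query on the following slide, and it may differ from the simplified picture above. The function names and the seed are hypothetical.

```python
import random
from collections import Counter, defaultdict

def sample_weights(weights, S, seed=0):
    """weights: list of (tid, token, weight). Returns (counts, totals).

    counts[(tid, token)] = how many of the S draws for `token` picked tuple `tid`
    totals[token]        = sum of that token's weights over all tuples
    """
    rng = random.Random(seed)
    by_token = defaultdict(list)
    for tid, token, w in weights:
        by_token[token].append((tid, w))
    counts, totals = Counter(), {}
    for token, rows in by_token.items():
        tids = [tid for tid, _ in rows]
        ws = [w for _, w in rows]
        totals[token] = sum(ws)
        # Draw S tuples for this token, each with probability weight / total.
        for tid in rng.choices(tids, weights=ws, k=S):
            counts[(tid, token)] += 1
    return counts, totals

def estimated_weight(counts, totals, tid, token, S):
    """Estimate of weight(token, tid) from the sample; 0 if the pair was never drawn."""
    return counts[(tid, token)] * totals[token] / S

# Illustrative R2 weights taken from the slides.
R2_WEIGHTS = [(1, "HATRONIC", 0.98), (1, "CORP", 0.02),
              (2, "EUROAFT", 0.95), (2, "INC", 0.05),
              (3, "EUROAFT", 0.97), (3, "CORP", 0.03)]
counts, totals = sample_weights(R2_WEIGHTS, S=20)
print(estimated_weight(counts, totals, 1, "HATRONIC", 20))
```

In a large relation a common token such as CORP is spread over many tuples, so most of them receive no draws and vanish from the sample. That is how low-weight entries get dropped.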

Sampling-Based Text Joins in SQL

R1
tid  Name
1    EUROAFT CORP
2    HATRONIC INC
…

R1Weights
tid  Token     W
1    EUROAFT   0.98
1    CORP      0.02
2    HATRONIC
2    INC       0.01
…

R2Sample
tid  Token     W
1    HATRONIC  0.98
1    CORP      0.02
2    EUROAFT   0.95
2    INC       0.05
3    EUROAFT   0.97
3    CORP      0.03

  SELECT r1w.tid AS tid1, r2s.tid AS tid2
  FROM R1Weights r1w, R2Sample r2s, R2sum r2sum
  WHERE r1w.token = r2s.token AND r1w.token = r2sum.token
  GROUP BY r1w.tid, r2s.tid
  HAVING SUM(r1w.weight*r2sum.total*r2s.c) ≥ S*φ

Fully implemented in pure SQL!

Output columns: R1, R2, Similarity. Values shown: EUROAFT CORP, EUROAFT INC, 0.98, 0.9, HATRONIC INC, HATRONIC CORP
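The sampling-based query can be exercised the same way with Python's sqlite3 module. This is a sketch under several assumptions: SQLite stands in for MS SQL Server, R2sum is taken to hold the per-token sum of R2 weights, and the counts in R2Sample are hand-picked as one plausible outcome of S = 20 draws per token rather than real sampled data. The HATRONIC weight for R1 tuple 2 is again assumed to be 0.98, and the function name is hypothetical.

```python
import sqlite3

S = 20  # draws per token

# Illustrative weights from the slides; R1 tuple 2's HATRONIC weight is an assumption.
R1_WEIGHTS = [(1, "EUROAFT", 0.98), (1, "CORP", 0.02),
              (2, "HATRONIC", 0.98), (2, "INC", 0.01)]
R2_WEIGHTS = [(1, "HATRONIC", 0.98), (1, "CORP", 0.02),
              (2, "EUROAFT", 0.95), (2, "INC", 0.05),
              (3, "EUROAFT", 0.97), (3, "CORP", 0.03)]
# Hand-picked (tid, token, c) counts; each token's counts add up to S.
R2_SAMPLE = [(2, "EUROAFT", 9), (3, "EUROAFT", 11), (1, "HATRONIC", 20),
             (1, "CORP", 7), (3, "CORP", 13), (2, "INC", 20)]

def sampled_text_join(phi):
    """Run the slide's sampling-based join in SQLite; return matching (tid1, tid2)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE R1Weights (tid INTEGER, token TEXT, weight REAL);
        CREATE TABLE R2Weights (tid INTEGER, token TEXT, weight REAL);
        CREATE TABLE R2Sample  (tid INTEGER, token TEXT, c INTEGER);
        CREATE TABLE R2sum     (token TEXT, total REAL);
    """)
    con.executemany("INSERT INTO R1Weights VALUES (?, ?, ?)", R1_WEIGHTS)
    con.executemany("INSERT INTO R2Weights VALUES (?, ?, ?)", R2_WEIGHTS)
    con.executemany("INSERT INTO R2Sample VALUES (?, ?, ?)", R2_SAMPLE)
    # Assumed meaning of R2sum: total weight of each token across R2.
    con.execute("INSERT INTO R2sum SELECT token, SUM(weight) FROM R2Weights GROUP BY token")
    rows = con.execute(
        """
        SELECT r1w.tid AS tid1, r2s.tid AS tid2
        FROM R1Weights r1w, R2Sample r2s, R2sum r2sum
        WHERE r1w.token = r2s.token AND r1w.token = r2sum.token
        GROUP BY r1w.tid, r2s.tid
        HAVING SUM(r1w.weight * r2sum.total * r2s.c) >= ? * ?
        ORDER BY tid1, tid2
        """,
        (S, phi),
    ).fetchall()
    con.close()
    return rows

print(sampled_text_join(0.5))  # → [(1, 2), (1, 3), (2, 1)]
```

Dividing the aggregated sum by S gives the estimated similarity, which is why the threshold in HAVING is S*φ rather than φ.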

Contributions

“WHIRL [Cohen, SIGMOD98] inside an RDBMS”: scalability, no data exporting/importing

Different token choices:
  Words: capture word swaps, deletion of common words
  Q-grams: all the above, plus spelling mistakes, but slower

SQL statements tested in MS SQL Server and available for download at: http://www.cs.columbia.edu/~pirot/DataCleaning/
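The difference between the two token choices can be illustrated with a short Python sketch. The helper names are hypothetical, and the q-gram function is a plain sliding window without padding characters, which is an assumption rather than the authors' exact definition. The misspelling EUROAFFT is made up for the example.

```python
def word_tokens(s):
    """Split a string into upper-cased, whitespace-separated words."""
    return s.upper().split()

def qgrams(s, q=3):
    """All overlapping substrings of length q (no padding; an assumed simplification)."""
    s = s.upper()
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def shared(a, b):
    """Sorted list of distinct tokens that occur in both token lists."""
    return sorted(set(a) & set(b))

# Word swap: word tokens are unaffected by word order.
print(shared(word_tokens("EUROAFT CORP"), word_tokens("CORP EUROAFT")))

# Spelling mistake: the two names share only the common word CORP ...
print(shared(word_tokens("EUROAFT CORP"), word_tokens("EUROAFFT CORP")))

# ... but most of their 3-grams still overlap.
print(len(shared(qgrams("EUROAFT CORP"), qgrams("EUROAFFT CORP"))))
```

Q-grams produce many more tokens per string than words do, which is consistent with the slide's note that they are slower.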

Questions?

Overflow Slides

Recall for 3-grams
Upcoming WWW 2003 paper

Precision for 3-grams
Upcoming WWW 2003 paper