A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar.

Problems
- Poor data quality is due to the lack of unique representations for real-world entities.
- E.g. California can be represented as California, Calif, CA, etc.
- Although textually different, these 5 records correspond to just 2 authors.

Problem Definition
- The main problem in data cleaning is to determine whether two representations are duplicates, i.e. correspond to the same real-world entity.
- Cosine similarity and edit distance rely on textual similarity, which can be misleading.
- Two representations of the same entity can be highly dissimilar.
- Conversely, two representations that are textually very similar can correspond to different entities.
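A minimal sketch of why textual similarity can mislead, using a plain Levenshtein edit distance. The California/CA pair comes from the slides; the "Jeff Smith"/"Jeff Smyth" pair is an invented illustration of two near-identical strings that could belong to different people.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

# Same entity, textually far apart.
same_entity = edit_distance("California", "CA")

# Possibly different entities (hypothetical pair), textually very close.
different_entities = edit_distance("Jeff Smith", "Jeff Smyth")

print(same_entity, different_entities)
```

A threshold on this distance would reject the California/CA pair and accept the invented name pair, which is the opposite of what a data cleaner wants.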

Solution: Programmable Framework

Basic Definitions
- The program is a collection of triples of the form ⟨R, P, A⟩, where R is a grammar rule, P is a predicate and A is an action.
- A grammar rule has a head and a body. The head is a single non-terminal; the body is a sequence of non-terminals, terminals and variables.
- Terminals are words and punctuation.
- Non-terminals are written in angle brackets, terminals as single-quoted strings (e.g. 'Jeff') and variables as uppercase letters.
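One way to picture such a triple is as a small record type. The sketch below is an assumption for illustration only: the class name `AugmentedRule`, the `FIRST_NAMES` table and the output field `first_name` are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Binding = Dict[str, str]  # variable name mapped to the constant assigned to it

# Hypothetical reference table of known first names.
FIRST_NAMES = {"Jeff", "Andy"}

@dataclass
class AugmentedRule:
    head: str                                    # a single non-terminal
    body: List[str]                              # non-terminals, terminals, variables
    predicate: Callable[[Binding], bool]         # P: constrains the variables
    action: Callable[[Binding], Dict[str, str]]  # A: builds output fields

# Rule "first -> F": F must be a known first name; the action emits it.
first_name_rule = AugmentedRule(
    head="first",
    body=["F"],
    predicate=lambda b: b["F"] in FIRST_NAMES,
    action=lambda b: {"first_name": b["F"]},
)

print(first_name_rule.predicate({"F": "Jeff"}))  # predicate holds for a known name
```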

Example: Framework program

Expanded program G’ for program G
- The expanded program G’, like G, is a collection of augmented rules.
- To construct G’, we consider each augmented rule ⟨R, P, A⟩ and enumerate all possible assignments of constant values to the variables in R for which the predicate P evaluates to true.
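The expansion step can be sketched as a brute-force enumeration over candidate constants. Everything named here (`expand_rule`, the two-column `NICKNAMES` table, the variable domains) is a hypothetical stand-in, not the paper's implementation.

```python
from itertools import product

def expand_rule(head, body, domains, predicate):
    """Return one variable-free rule per assignment that satisfies the predicate."""
    names = list(domains)
    expanded = []
    for values in product(*(domains[v] for v in names)):
        binding = dict(zip(names, values))
        if predicate(binding):
            # Replace each variable in the body by its constant, quoted as a terminal.
            new_body = [f"'{binding[s]}'" if s in binding else s for s in body]
            expanded.append((head, new_body))
    return expanded

# Hypothetical nickname table: (nickname, full first name).
NICKNAMES = {("Andy", "Andrew"), ("Jeff", "Jeffrey")}

rules = expand_rule(
    head="first",
    body=["N"],
    domains={"N": ["Andy", "Jeff"], "F": ["Andrew", "Jeffrey"]},
    predicate=lambda b: (b["N"], b["F"]) in NICKNAMES,
)
print(rules)
```

Only the two assignments that appear in the table survive, so the single rule with variables becomes two rules over constants.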

Parse Tree:
- Handles variations in the order in which the first name and last name appear.
- The program also handles variations resulting from the use of nicknames.

Weights:
- A non-negative real number is assigned to each augmented rule in G’.
- The weight of an output record is the sum of the weights of the augmented rules involved in parsing it.
- Lower weights indicate higher confidence.
- The programmer can use “loose” rules, i.e. rules that the programmer is not very confident about.
- Higher weights are assigned to “loose” rules.
- If R’ is an augmented rule in the expanded program G’, the weight of R’ is the log of the number of rules in G’.
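The summation and the "lower is better" convention can be sketched as follows. The rule names and weight values are invented for illustration; only the sum-of-weights scoring comes from the slide.

```python
from typing import Dict, List

def parse_weight(rules_used: List[str], rule_weights: Dict[str, float]) -> float:
    """Weight of one parse: the sum of the weights of the rules it uses."""
    return sum(rule_weights[r] for r in rules_used)

# Hypothetical weights: the loose rule gets a higher weight than the strict ones.
rule_weights = {"exact_name": 0.0, "nickname": 1.0, "loose_any_token": 5.0}

# Two hypothetical parses of the same input record.
candidate_parses = {
    "parse_a": ["exact_name", "nickname"],
    "parse_b": ["loose_any_token"],
}

scores = {name: parse_weight(used, rule_weights) for name, used in candidate_parses.items()}
best = min(scores, key=scores.get)  # lowest weight means highest confidence
print(scores, best)
```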

Implementation
- Given a program G, we can construct the expanded program G’. Given an input record r, we can then use traditional parsing techniques to parse r.
- The main problem with this approach is that the expanded program G’ can be very large.
- Instead, construct Gr’, a partially expanded program, at query time.
- To construct Gr’, consider each augmented rule ⟨R, P, A⟩ and enumerate all possible assignments of constants to the variables in R such that P evaluates to true.
- Enforce an additional constraint: if a variable X occurs in R, then the constant c assigned to X must be a substring of the record r.
- Dictionary (X): P(X, …)
- E.g. Smith Andy, J: Dictionary (N): Nicknames (I,N,F,G)
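The substring constraint can be sketched as a filter applied to a reference table at query time. The record "Smith Andy, J" is the slide's example; the two-column `NICKNAMES` table and the function name `partially_expand` are assumptions, and the slide's four-column Nicknames relation is not reproduced here.

```python
from typing import List, Tuple

def partially_expand(record: str, table: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the rows whose nickname occurs as a substring of the record."""
    return [(nick, full) for nick, full in table if nick in record]

# Hypothetical nickname table: (nickname, full first name).
NICKNAMES = [("Andy", "Andrew"), ("Jeff", "Jeffrey"), ("Bill", "William")]

rows = partially_expand("Smith Andy, J", NICKNAMES)
print(rows)
```

Only the rows that can possibly match the record are expanded, so the size of the partially expanded program depends on the record rather than on the whole table.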

Case studies 1. UCD people data

Quality of record matching and Record matching

2. Author Affiliation Dataset

Program:

Discussion
Record matching:
- Previous work on record matching focused on the design of similarity functions.
- This framework indicates that, with the right preprocessing, the need for approximate equality in record matching is minimized and often eliminated.
- However, string similarity joins are still needed to capture variations such as typos and misspellings.
- This framework does not intend to replace this body of work.

Pay as you go:
- The goal of this framework is not to clean the entire dataset, because doing so is difficult.
- Instead, the framework takes a “pay as you go” approach, using, for example, reference tables that cover only part of the data to clean a subset of the data.

Lineage:
- Parse trees constitute a natural notion of lineage that can be used to program on top of the module.
- E.g. a data cleaning developer using this framework can choose not to use the rule weighting options and instead use if-then-else logic to capture parse tree preferences.

Uncertainty:
- The framework provides a tool to manage uncertainty in the data.
- The framework incorporates “possible worlds”, and thus allows multiple possible variations of the same entity.
- The framework also returns multiple parse trees for the same input record, each with an accompanying score.

Questions???

Thank you!