A Grammar-based Entity Representation Framework for Data Cleaning. Authors: Arvind Arasu, Raghav Kaushik. Presented by Rashmi Havaldar.

2 Problems  Poor data quality is due to the lack of unique representations for real-world entities.  E.g., California can be represented as California, Calif, CA, etc.  Although textually different, these 5 records correspond to just 2 authors.

3 Problem Definition  The main problem in data cleaning is to determine whether two representations are duplicates, i.e. whether they correspond to the same real-world entity.  Cosine similarity and edit distance rely on textual similarity, which can be misleading.  Two representations of the same entity can be highly dissimilar.  Conversely, two representations that are textually very similar can correspond to different entities.
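To make the point about textual similarity concrete, here is a minimal sketch that computes plain Levenshtein edit distance for two pairs of strings. The California/CA pair comes from the slides; the two person names are hypothetical values chosen only for illustration, and the function itself is a textbook implementation, not code from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Same entity, textually far apart (example from the slides).
same_entity = edit_distance("California", "CA")

# Hypothetical names: textually close, but possibly two different people.
different_entities = edit_distance("John Smith", "Joan Smith")

print(same_entity, different_entities)
```

The pair that denotes one entity ends up much farther apart than the pair that may denote two, which is the failure mode the slide describes.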

4 Solution: Programmable Framework

5 Basic Definitions  The program is a collection of triples of the form <R, P, A>, where R is a grammar rule, P is a predicate and A is an action.  A grammar rule has a head and a body: the head is a single non-terminal and the body is a sequence of non-terminals, terminals and variables.  Terminals are words and punctuation.  Non-terminals are written in angle brackets, terminals as single-quoted strings (e.g. ’Jeff’) and variables as uppercase letters.
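One possible way to hold such a triple in memory is sketched below. The class name `AugmentedRule`, the helper `symbol_kind`, the sample tables and the idea that an action builds output fields are all assumptions made for illustration; the slides only state that a rule, a predicate and an action go together.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Binding = Dict[str, str]  # variable name -> constant assigned to it


@dataclass(frozen=True)
class AugmentedRule:
    head: str                      # a single non-terminal, e.g. "<name>"
    body: Tuple[str, ...]          # non-terminals, terminals and variables
    predicate: Callable[[Binding], bool]
    action: Callable[[Binding], Dict[str, str]]  # assumed: builds output fields


def symbol_kind(sym: str) -> str:
    """Classify a body symbol using the notation from the slide."""
    if sym.startswith("<") and sym.endswith(">"):
        return "nonterminal"
    if len(sym) >= 2 and sym.startswith("'") and sym.endswith("'"):
        return "terminal"
    if sym.isalpha() and sym.isupper():
        return "variable"
    raise ValueError(f"unrecognised symbol: {sym!r}")


# Hypothetical reference tables and rule: <name> -> F L
FIRST_NAMES = {"Jeff"}
LAST_NAMES = {"Smith"}

name_rule = AugmentedRule(
    head="<name>",
    body=("F", "L"),
    predicate=lambda b: b["F"] in FIRST_NAMES and b["L"] in LAST_NAMES,
    action=lambda b: {"first": b["F"], "last": b["L"]},
)
```

Keeping the predicate and action as plain callables keeps the sketch short; a real system would more likely express them declaratively against reference tables.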

6 Example: Framework program

7 Expanded program G’ for program G  The expanded program G’, like G, is a collection of augmented rules.  To construct G’, we consider each augmented rule <R, P, A> and enumerate all possible assignments of constant values to the variables in R for which the predicate P evaluates to true, i.e.
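The enumeration step can be sketched as a brute-force search over candidate constants. The function name `expand`, the variable names N and F, and the contents of the nickname table are assumptions for illustration only; the slides do not show how the enumeration is actually carried out.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence

Binding = Dict[str, str]


def expand(variables: Sequence[str],
           domains: Dict[str, Sequence[str]],
           predicate: Callable[[Binding], bool]) -> List[Binding]:
    """Return every assignment of constants to variables that satisfies predicate."""
    bindings = []
    for values in product(*(domains[v] for v in variables)):
        binding = dict(zip(variables, values))
        if predicate(binding):
            bindings.append(binding)
    return bindings


# Hypothetical nickname table: (nickname, full first name).
NICKNAMES = {("Andy", "Andrew"), ("Jeff", "Jeffrey")}

domains = {"N": ["Andy", "Jeff"], "F": ["Andrew", "Jeffrey"]}
expanded = expand(["N", "F"], domains,
                  lambda b: (b["N"], b["F"]) in NICKNAMES)
print(expanded)
```

Each surviving binding corresponds to one rule of the expanded program, so the size of G’ grows with the size of the reference tables.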

8 Parse Tree:  The program handles variations in the order in which the first name and last name appear.  It also handles variations resulting from the use of nicknames.

9 Weights:  A non-negative real number is assigned to each augmented rule in G’.  The weight of an output record is the sum of the weights of the augmented rules involved in parsing that output record.  Lower weights indicate higher confidence.  The programmer can include “loose” rules, i.e. rules that the programmer is not very confident about.  Higher weights are assigned to “loose” rules.  If R’ is an augmented rule in the expanded program G’, the weight of R’ is the log of the number of rules in G’.
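The summation and the "lower is better" convention can be illustrated with a small sketch. The rule names, the weight values and the two candidate parses are hypothetical; only the scoring idea (sum the weights of the rules used, prefer the smaller total) is taken from the slide.

```python
from typing import Dict, List


def parse_weight(rules_used: List[str], rule_weight: Dict[str, float]) -> float:
    """Weight of an output record: sum of the weights of the rules in its parse."""
    return sum(rule_weight[r] for r in rules_used)


# Hypothetical weights: the "loose" rule gets the largest value.
rule_weight = {
    "first_last": 0.0,
    "last_first": 0.0,
    "nickname": 1.0,
    "loose_initial": 3.0,
}
assert all(w >= 0 for w in rule_weight.values())  # weights are non-negative

# Two hypothetical parses of the same input record.
parses = {
    "parse_a": ["first_last", "nickname"],
    "parse_b": ["last_first", "loose_initial"],
}

weights = {name: parse_weight(rules, rule_weight) for name, rules in parses.items()}
best = min(weights, key=weights.get)  # lowest weight = highest confidence
print(weights, best)
```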

10 Implementation  Given a program G, we can construct the expanded program G’; given an input record r, we can then use a traditional parsing technique to parse r.  The main problem with this approach is that the expanded program G’ can be very large.  Instead, construct Gr’, a partially expanded program, at query time.  To construct Gr’, consider each augmented rule <R, P, A> and enumerate all possible assignments of constants to the variables in R such that P evaluates to true.  Enforce an additional constraint: if variable X occurs in R, then the constant c assigned to X must be a substring of the record r. Dictionary (X): P(X,.…)  E.g.: Smith Andy, J: Dictionary (N): Nicknames (I,N,F,G)
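The substring constraint can be sketched as a filter applied before enumeration. The record "Smith Andy, J" is the example from the slide; the two-column nickname table, its contents and the function names are simplifying assumptions and do not reproduce the four-column Nicknames (I,N,F,G) relation mentioned above.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical table: (nickname N, full first name F).
NICKNAMES: List[Tuple[str, str]] = [
    ("Andy", "Andrew"),
    ("Jeff", "Jeffrey"),
    ("Bill", "William"),
]


def dictionary(table: List[Tuple[str, str]], column: int) -> Set[str]:
    """All constants that a variable bound to the given column could take."""
    return {row[column] for row in table}


def partial_expand(record: str, table: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Enumerate only the bindings whose nickname occurs as a substring of record."""
    present = {n for n in dictionary(table, 0) if n in record}
    return [{"N": n, "F": f} for (n, f) in table if n in present]


bindings = partial_expand("Smith Andy, J", NICKNAMES)
print(bindings)
```

Only one of the three table rows survives for this record, which is why the partially expanded program stays small even when the reference table is large.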

11 Case studies 1. UCD people data

12 Quality of record matching and Record matching

13 2. Author Affiliation Dataset

14 Program:


16 Discussion Record matching:  Previous work on record matching focused on the design of similarity functions.  This framework indicates that, with the right pre-processing, the need for approximate equality when performing record matching is minimized and often eliminated.  However, string similarity joins are still needed to capture variations such as typos and misspellings.  This framework does not intend to replace that body of work.

17 Pay as you go:  The goal of this framework is not to clean the entire dataset, because doing so is difficult.  Instead, the framework takes a “pay as you go” approach, using example reference tables that cover only part of the data to clean a subset of the data. Lineage:  Parse trees constitute a natural notion of lineage that can be used to program on top of the module.  E.g., a data cleaning developer using this framework can choose not to use the rule weighting option and instead use if-then-else logic to capture parse tree preferences.
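As a rough illustration of the if-then-else alternative to weights, the sketch below picks among candidate parses by inspecting which rules each parse used. The parse representation (a dict with a list of rule names and an output record), the rule names and the preference order are all hypothetical.

```python
from typing import Dict, List, Optional

Parse = Dict[str, object]  # assumed shape: {"rules": [...], "output": {...}}


def choose_parse(parses: List[Parse]) -> Optional[Parse]:
    """Prefer a parse that avoids the nickname rule; otherwise take the first one."""
    without_nickname = [p for p in parses if "nickname" not in p["rules"]]
    if without_nickname:
        return without_nickname[0]
    elif parses:
        return parses[0]
    else:
        return None


# Hypothetical candidate parses for one input record.
candidates = [
    {"rules": ["last_first", "nickname"], "output": {"first": "Andrew", "last": "Smith"}},
    {"rules": ["last_first"], "output": {"first": "Andy", "last": "Smith"}},
]

chosen = choose_parse(candidates)
print(chosen["output"])
```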

18 Uncertainty:  The framework provides a tool to manage uncertainty in the data.  It incorporates “possible worlds”, and thus allows multiple possible variations of the same entity.  It also returns multiple parse trees for the same input record, each with an accompanying score.

