Date: 2011/12/26 Source: Dustin Lange et al. (CIKM'11) Advisor: Jia-Ling Koh Speaker: Jiun-Jia Chiou Frequency-aware Similarity Measures 1
Outline Introduction Composing similarity Exploiting frequencies Partitioning strategies Experiment Conclusion 2
Introduction The authors propose a novel comparison method that partitions the data using value frequency information and then automatically determines similarity measures for each individual partition. Compared record pairs are partitioned according to the frequencies of their attribute values: Partition 1 contains all pairs with rare names, Partition 2 all pairs with medium frequent names, and Partition 3 all pairs with frequent names. 3
Introduction Motivation: Schufa, a credit rating agency, stores data of about 66 million citizens, reported by banks, insurance agencies, etc. Queries about the rating of an individual must be answered as precisely as possible. To ensure the quality of the data, it is necessary to detect and fuse duplicates. 4
Introduction Why is Arnold Schwarzenegger always a duplicate? In a person table of U.S. citizens, this name is very rare. If we find several Arnold Schwarzeneggers in it, it is very likely that they are duplicates. For such rare names, the authors argue that address and date-of-birth similarity are less important than for rows with frequent names. Attributes considered: person's name, birth date, address. 5
Introduction Determining the similarity (or distance) of two records in a database is a well-known but challenging problem. It comprises two main difficulties: 1. The data is dirty (typos, outdated values, sloppy data or query entries), which requires devising sophisticated similarity measures. 2. The amount of data might be very large, prohibiting exhaustive comparisons, which requires efficient algorithms and indexes that avoid comparing each entry with all other entries. 6
Composing Similarity Base Similarity Measures Define sim_p(r1, r2) with sim_p : R x R → [0,1] ⊂ ℝ, each responsible for calculating the similarity of a specific attribute p of the compared records r1 and r2 from a set R of records. Ex: sim_Name : Jaro-Winkler distance; sim_BirthDate : relative distance; sim_Address : Euclidean distance. Some measures also test for equality (e.g., for addresses) or boolean values (e.g., for gender). 7
Jaro-Winkler distance The Jaro distance d_j of strings s1 and s2 is d_j = 0 if m = 0, and otherwise d_j = (1/3) · (m/|s1| + m/|s2| + (m − t)/m), where m is the number of matching characters and t is half the number of transpositions. The Jaro-Winkler distance is d_w = d_j + ℓ · p · (1 − d_j), where ℓ is the length of the common prefix at the start of the string, up to a maximum of 4 characters, and p is a constant scaling factor. p should not exceed 0.25, otherwise the distance can become larger than 1; the standard value in Winkler's work is p = 0.1. 8
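A minimal Python sketch of the formulas above (an illustrative implementation, not the authors' code):

```python
def jaro(s1, s2):
    """Jaro distance d_j of two strings."""
    if not s1 or not s2:
        return 0.0
    # characters match if equal and at most this many positions apart
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    s1_matched = [False] * len(s1)
    s2_matched = [False] * len(s2)
    m = 0  # number of matching characters
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not s2_matched[j] and s2[j] == c:
                s1_matched[i] = s2_matched[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t = half the number of matched characters that appear in a different order
    transpositions = 0
    j = 0
    for i in range(len(s1)):
        if s1_matched[i]:
            while not s2_matched[j]:
                j += 1
            if s1[i] != s2[j]:
                transpositions += 1
            j += 1
    t = transpositions / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler: boost d_j by the common prefix (at most 4 characters)."""
    d_j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return d_j + prefix * p * (1 - d_j)


print(jaro_winkler("Arnold", "Arnnold"))  # high similarity despite the typo
```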
Composing Similarity Composition of Base Similarity Measures Integrate the base similarity measures into an overall judgement to calculate the overall similarity of two records. The classes are isSimilar and isDissimilar; the features are the results of the base similarity measures. To derive a general model, employ machine learning techniques (logistic regression, decision trees, SVM) with enough training data for supervised learning. 10
Learners considered: logistic regression, SVM (support vector machine), decision tree. 11
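A sketch of the composition step with scikit-learn, assuming the base similarity values have already been computed as features; the attribute order and the toy training pairs are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row holds [sim_Name, sim_BirthDate, sim_Address] for one record pair.
X_train = np.array([
    [0.95, 1.00, 0.80],   # likely the same person
    [0.90, 0.95, 0.20],
    [0.40, 0.10, 0.30],   # likely different persons
    [0.55, 0.00, 0.10],
])
y_train = np.array([1, 1, 0, 0])  # 1 = isSimilar, 0 = isDissimilar

model = LogisticRegression()
model.fit(X_train, y_train)

# Probability that a new record pair is a duplicate.
pair_features = np.array([[0.92, 0.90, 0.60]])
print(model.predict_proba(pair_features)[0, 1])
```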
Exploiting frequencies Frequency Function Determine the value frequencies of the selected attributes for two compared records. Define a frequency function f : R x R → ℕ (here on FirstName & LastName). Goal: partition the data according to the name frequencies. Several data quality problems must be handled: 1. swapping of first and last name, 2. typos (e.g., Arnold, Arnnold), 3. combining two attributes (e.g., Schwarzenegger is more distinguishing than Arnold). 12
[Figure: example FirstName and LastName frequency tables (values such as Josh, Kevin, Jack and Powell, Johnson, Wills) illustrating how the value frequencies used by f are counted.] 13
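One plausible way to implement such a frequency function in Python; the exact combination of first- and last-name counts (taking minima, guarding against swapped names) is an assumption for illustration, not the paper's precise definition:

```python
from collections import Counter

def build_name_counts(records):
    """Count how often each first and last name value occurs in the data set."""
    first_counts = Counter(r["first_name"] for r in records)
    last_counts = Counter(r["last_name"] for r in records)
    return first_counts, last_counts

def name_frequency(record, first_counts, last_counts):
    fn, ln = record["first_name"], record["last_name"]
    # the rarer of the two values is the more distinguishing one
    direct = min(first_counts[fn], last_counts[ln])
    # guard against swapped first/last names by also checking the other order
    swapped = min(first_counts[ln], last_counts[fn])
    return max(direct, swapped)

def f(r1, r2, first_counts, last_counts):
    """Frequency function f : R x R -> N for a compared record pair."""
    return min(name_frequency(r1, first_counts, last_counts),
               name_frequency(r2, first_counts, last_counts))
```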
Exploiting frequencies Frequency-enriched Models Another way to exploit frequency distributions is to alter the models learned with the machine learning techniques: 1. manually add rules to the models, or 2. integrate the frequencies directly into the machine learning models. Ex: for logistic regression, a rule could be "if the frequency of the name value is below 10, then increase the weight of the name similarity by 10% and appropriately decrease the weights of the other similarity functions". Drawback: manually defining such rules is cumbersome and error-prone. (Frequencies are normalized by M, the maximum frequency in the data set.) 14
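Option 1 (manual rules) could look like the following sketch, which post-processes learned weights for rare names. The threshold 10 and the 10% boost come from the example rule above; the weight dictionary and redistribution scheme are hypothetical:

```python
def adjust_weights(weights, name_frequency, rare_threshold=10, boost=0.10):
    """weights: e.g. {"name": w1, "birthdate": w2, "address": w3} from the learned model."""
    if name_frequency >= rare_threshold:
        return dict(weights)          # frequent name: keep the model as learned
    adjusted = dict(weights)
    gain = adjusted["name"] * boost   # increase the name weight by 10%
    adjusted["name"] += gain
    # decrease the other weights so the total weight mass stays the same
    others = [k for k in adjusted if k != "name"]
    for k in others:
        adjusted[k] -= gain / len(others)
    return adjusted
```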
Partitioning strategies Partition compared record pairs into n partitions using the determined frequencies. Number of partitions: too many partitions result in small partitions and overfitting; too few partitions result in large partitions in which frequency-specific differences cannot be discovered.
Partitioning strategies Define partitions: The entire frequency space is divided into non-overlapping, continuous partitions by a set of thresholds θ_0 < θ_1 < … < θ_n with θ_0 = 0 and θ_n = M + 1, where M is the maximum frequency in the data set. The partitions are defined as frequency ranges I_i = [θ_i, θ_(i+1)). A partition covers a set of record pairs: a record pair (r1, r2) falls into partition [θ_i, θ_(i+1)) iff the frequency function value for this pair lies in the partition's range, i.e., θ_i ≤ f(r1, r2) < θ_(i+1). 16
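Given such a sorted threshold list, assigning a record pair to its partition is a simple range lookup; a small sketch (the example thresholds are made up):

```python
import bisect

def partition_index(frequency, thresholds):
    """Return i such that thresholds[i] <= frequency < thresholds[i + 1].

    thresholds = [theta_0, ..., theta_n] with theta_0 = 0 and theta_n = M + 1,
    so every frequency in [0, M] falls into exactly one range.
    """
    return bisect.bisect_right(thresholds, frequency) - 1

thresholds = [0, 5, 50, 1001]            # example with M = 1000
print(partition_index(3, thresholds))    # 0 -> rare names
print(partition_index(17, thresholds))   # 1 -> medium frequent names
print(partition_index(400, thresholds))  # 2 -> frequent names
```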
Partitioning strategies Random partitioning: randomly pick several thresholds θ_i ∈ {0, …, M + 1}; the number of thresholds in each partitioning is also randomly chosen, with a maximum of 20 partitions in one partitioning. Equi-depth partitioning: divide the frequency space into e partitions such that each partition contains the same number of tuples from the original data set R, with e ∈ {2, …, 20}.
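Equi-depth thresholds can be derived from the empirical frequency distribution, for example with numpy quantiles; this is a sketch of one way to do it, the paper does not prescribe an implementation:

```python
import numpy as np

def equi_depth_thresholds(pair_frequencies, e):
    """Split the frequency space into e partitions with (roughly) equal numbers
    of record pairs; returns the thresholds theta_0, ..., theta_e."""
    freqs = np.sort(np.asarray(pair_frequencies))
    # quantile cut points at 1/e, 2/e, ..., (e-1)/e
    cuts = np.quantile(freqs, [i / e for i in range(1, e)])
    thresholds = [0] + [int(c) for c in cuts] + [int(freqs.max()) + 1]
    # drop duplicate thresholds that can occur for skewed distributions
    return sorted(set(thresholds))
```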
Partitioning strategies Greedy partitioning: define a list of threshold candidates C = {θ_0, …, θ_n} by dividing the frequency space into segments with the same number of tuples (similar to equi-depth partitioning, but with a fixed, large e = 50). Process: 1. Learn a partition for the first candidate thresholds [θ_0, θ_1). 2. Learn a second partition that extends the current partition by moving its upper threshold to the next threshold candidate: [θ_0, θ_2). 3. Continue with [θ_0, θ_3), and so on. At each step, compare both partitions using F-measure. 18
Partitioning strategies Greedy partitioning (continued): If the extended partition achieves better performance, the process is repeated for the next threshold slot. If not, the smaller partition is kept and a new partition is started at its upper threshold; another iteration starts with this new partition. This process is repeated until all threshold candidates have been processed. 19
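A compact sketch of this greedy loop. It assumes a hypothetical helper train_and_evaluate(lower, upper) that learns a composite similarity function on the record pairs whose frequency falls in [lower, upper) and returns its F-measure:

```python
def greedy_partitioning(candidates, train_and_evaluate):
    """candidates: sorted threshold candidates [theta_0, ..., theta_n]."""
    partitions = []
    lower = 0   # index of the current partition's lower threshold
    upper = 1   # index of its current upper threshold candidate
    best_f = train_and_evaluate(candidates[lower], candidates[upper])
    while upper + 1 < len(candidates):
        # try to extend the current partition to the next candidate threshold
        extended_f = train_and_evaluate(candidates[lower], candidates[upper + 1])
        if extended_f > best_f:
            upper += 1            # extension helps: keep growing this partition
            best_f = extended_f
        else:
            # keep the smaller partition and start a new one at its upper threshold
            partitions.append((candidates[lower], candidates[upper]))
            lower, upper = upper, upper + 1
            best_f = train_and_evaluate(candidates[lower], candidates[upper])
    partitions.append((candidates[lower], candidates[upper]))
    return partitions
```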
[Example: candidate frequency ranges [θ_i, θ_j), e.g. 0 ≤ Frequency < 1 up to 4 ≤ Frequency < 5, each with a confusion matrix of actual vs. predicted similar/dissimilar pairs, used to compare partitions by F-measure.] 20
Partitioning strategies Genetic Partitioning Algorithm 1. Initialization: Create an initial population consisting of several random partitionings, created as described above with the random partitioning approach. 2. Growth: Learn one composite similarity function for each partition in the current set of partitionings. 3. Selection: For each partition, determine the maximum F-measure that can be achieved by choosing an appropriate threshold for the similarity function. Select the partitionings with the highest weighted F-measure and keep the top five partitionings. 21
Partitioning strategies 4. Reproduction: Build pairs of the selected best individuals and combine them to create new individuals. a) Recombination: First create the union of the thresholds of both partitionings. For each threshold, randomly decide whether to keep it in the result partitioning or not; both decisions have equal chances. b) Mutation: Randomly decide whether to add another new (also randomly picked) threshold and whether to delete a (randomly picked) threshold from the current threshold list. A minimum partition size is defined (set to 20 record pairs); randomly created partitionings with too small partitions are discarded. 22
[Figure: threshold candidates θ_0, θ_1, θ_2, θ_3 combined into candidate partitionings such as [0,1), [1,3), [3,4) and [0,2), [2,4), [4,5); the top 5 partitionings are selected.] 23
Partitioning strategies 5. Termination: The resulting partitionings are evaluated and added to the set of evaluated partitionings. The selection/reproduction phases are repeated until a certain number of iterations is reached or until no significant improvement can be measured; a minimum F-measure improvement is required after 5 iterations. 24
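Putting steps 1 to 5 together, a compact sketch of the genetic loop. It assumes a hypothetical evaluate_partitioning(thresholds) that trains one similarity function per partition and returns the weighted F-measure; the 20-partition limit, the top-5 selection, and the equal-chance decisions follow the slides, the remaining constants are illustrative:

```python
import random

def genetic_partitioning(max_threshold, evaluate_partitioning,
                         population_size=20, top_k=5, iterations=50):
    # 1. Initialization: random partitionings, each a sorted list of thresholds
    def random_partitioning():
        n = random.randint(1, min(19, max_threshold))   # at most 20 partitions
        inner = random.sample(range(1, max_threshold + 1), n)
        return sorted({0, max_threshold + 1, *inner})

    population = [random_partitioning() for _ in range(population_size)]
    for _ in range(iterations):
        # 2./3. Growth and selection: keep the top-k partitionings by weighted F-measure
        population.sort(key=evaluate_partitioning, reverse=True)
        best = population[:top_k]
        # 4. Reproduction: recombine pairs of the best individuals and mutate
        children = []
        for mother, father in zip(best, best[1:]):
            union = set(mother) | set(father)
            # keep each threshold with probability 0.5
            child = {t for t in union if random.random() < 0.5}
            if random.random() < 0.5:                    # mutation: maybe add a new threshold
                child.add(random.randint(1, max_threshold))
            child |= {0, max_threshold + 1}              # keep the mandatory outer thresholds
            children.append(sorted(child))
        population = best + children
    # 5. Termination: return the best partitioning found
    return max(population, key=evaluate_partitioning)
```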
Experiment Evaluation on Schufa Data Set The data set consists of two parts: a person data set and a query data set. Record pairs were built of the form (query, correct result) or (query, incorrect result). 25
Experiment Evaluation on DBLP Data Set (a bibliographic database for computer science) Paper pairs of four types were built: (1) two papers from the same author, (2) two papers from the same author with different name aliases, (3) two papers from different authors with the same name, (4) two papers from different authors with different names. For each paper pair, the matching task is to decide whether the two papers were written by the same author. 26
Conclusion This paper introduced a novel approach for improving composite similarity measures: divide a data set consisting of record pairs into partitions according to frequencies of selected attributes, then learn optimal similarity measures for each partition. Experiments on different real-world data sets showed that partitioning the data can improve learning results and that genetic partitioning performs better than several other partitioning strategies. 27
Thank you for listening! 28