Frequency-aware Similarity Measures
Date: 2011/12/26
Source: Dustin Lange et al. (CIKM'11)
Advisor: Jia-Ling Koh
Speaker: Jiun-Jia Chiou

Outline
 Introduction
 Composing similarity
 Exploiting frequencies
 Partitioning strategies
 Experiment
 Conclusion

Introduction  Propose a novel comparison method that partitions the data using value frequency information and then automatically determines similarity measures for each individual partition.  Use by partitioning compared record pairs according to frequencies of attribute values. Partition 1  contains all pairs with rare names. Partition 2  all pairs with medium frequent names. Partition 3  all pairs with frequent names. 3

Introduction
Motivation: Schufa is a credit rating agency that stores data on about 66 million citizens, reported by banks, insurance agencies, etc. Queries about the rating of an individual must be answered as precisely as possible. To ensure the quality of the data, it is necessary to detect and fuse duplicates.

Introduction
Why is Arnold Schwarzenegger always a duplicate? In a person table of U.S. citizens, this name is very rare. If we find several Arnold Schwarzeneggers in it, it is very likely that these are duplicates. For such rare names, the authors argue, address and date-of-birth similarity are less important than they are for rows with frequent names. (Compared attributes: person's name, birth date, address.)

Introduction
Determining the similarity (or distance) of two records in a database is a well-known but challenging problem. It comprises two main difficulties:
1. Typos, outdated values, and sloppy data or query entries  addressed by devising sophisticated similarity measures.
2. The amount of data might be very large, prohibiting exhaustive comparisons  addressed by efficient algorithms and indexes that avoid comparing each entry with all other entries.

Composing Similarity
 Base Similarity Measures
Define sim_p: (R × R) → [0,1] ⊂ ℝ, each responsible for calculating the similarity of a specific attribute p of the compared records r1 and r2 from a set R of records.
Ex: Sim_Name: Jaro-Winkler distance; Sim_BirthDate: relative distance; Sim_Address: Euclidean distance.
Base measures can also test for equality (e.g., for addresses) or boolean values (e.g., for gender).
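To make the idea concrete, here is a minimal Python sketch of such base measures (my own illustration; the squashing of distances into [0,1] and the scale constants are assumptions, not the paper's definitions):

```python
# Sketch of base similarity measures sim_p: each maps one attribute of a
# record pair into [0, 1]. Scale constants are illustrative assumptions.
from datetime import date
from math import hypot

def sim_birthdate(d1: date, d2: date, scale_days: float = 365.0) -> float:
    # Relative distance between dates, squashed into [0, 1].
    return max(0.0, 1.0 - abs((d1 - d2).days) / scale_days)

def sim_address(p1: tuple, p2: tuple, scale_km: float = 10.0) -> float:
    # Euclidean distance between geocoded points, squashed into [0, 1].
    return max(0.0, 1.0 - hypot(p1[0] - p2[0], p1[1] - p2[1]) / scale_km)

def sim_gender(g1: str, g2: str) -> float:
    # Boolean equality test, as the slide mentions for some attributes.
    return 1.0 if g1 == g2 else 0.0
```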

Jaro-Winkler Distance
The Jaro distance d_j for strings s1 and s2:
d_j = 1/3 · ( m/|s1| + m/|s2| + (m − t)/m )
where m is the number of matching characters and t is half the number of transpositions.
The Jaro-Winkler distance d_w adds a bonus for a common prefix:
d_w = d_j + ℓ · p · (1 − d_j)
where ℓ is the length of the common prefix at the start of the string, up to a maximum of 4 characters, and p is a constant scaling factor. p should not exceed 0.25, otherwise the distance can become larger than 1; the standard value in Winkler's work is p = 0.1.
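A direct Python implementation of the two formulas above (the standard textbook algorithm, offered as a sketch rather than the paper's code):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: d_j = 1/3 * (m/|s1| + m/|s2| + (m - t)/m)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match if equal and within this window of each other.
    window = max(0, max(len1, len2) // 2 - 1)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t = half the number of transpositions among matched characters.
    k = transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """d_w = d_j + l * p * (1 - d_j), prefix length l capped at 4."""
    dj = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return dj + l * p * (1 - dj)

print(jaro_winkler("Arnold", "Arnnold"))  # typo still scores high
```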


Composing Similarity
 Composition of Base Similarity Measures
Integrate the base similarity measures into an overall judgment of the similarity of two records. This is a classification problem: the classes are isSimilar and isDissimilar, and the features are the results of the base similarity measures.
To derive a general model, employ supervised machine learning techniques (logistic regression, decision trees, SVM) with enough training data.
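A minimal sketch of this composition with logistic regression, one of the three learners named above (scikit-learn API; the feature layout and training pairs are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row holds [sim_name, sim_birthdate, sim_address] for one record pair.
X_train = np.array([
    [0.95, 1.00, 0.90],  # near-identical pair
    [0.91, 0.85, 0.20],  # same person after moving
    [0.40, 0.10, 0.80],  # different people at the same address
    [0.15, 0.05, 0.10],  # clearly different people
])
y_train = np.array([1, 1, 0, 0])  # 1 = isSimilar, 0 = isDissimilar

model = LogisticRegression().fit(X_train, y_train)

# Overall similarity of a new pair = predicted probability of isSimilar.
pair = np.array([[0.88, 0.90, 0.30]])
print(model.predict_proba(pair)[0, 1])
```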

[Slide: illustrations of logistic regression, SVM (support vector machine), and decision tree models]

Exploiting Frequencies
 Frequency Function
Determine the value frequencies of the selected attributes (FirstName & LastName) for two compared records: define a frequency function f: R × R → ℕ.
Goal: partition the data according to the name frequencies.
 Several data quality problems must be handled:
1. Swapping of first and last name (e.g., FirstName = Schwarzenegger, LastName = Arnold instead of FirstName = Arnold, LastName = Schwarzenegger)
2. Typos (e.g., Arnold, Arnnold)
3. Combining two attributes (e.g., Schwarzenegger is more distinguishing than Arnold)

[Slide: worked example of name-frequency counting, tallying FirstName frequencies (Josh, Kevin, Jack) and LastName frequencies (Powell, Johnson, Wills) over the compared records]
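A sketch of one plausible frequency function over FirstName and LastName (my own illustration; the paper's exact definition may differ). Taking the minimum frequency over all name values of both records also tolerates swapped first and last names, the first data-quality problem listed above:

```python
from collections import Counter

people = [
    ("Josh", "Powell"), ("Kevin", "Wills"), ("Jack", "Powell"),
    ("Jack", "Johnson"), ("Josh", "Wills"),
]
# Count every name value regardless of whether it is a first or last name.
name_freq = Counter(name for person in people for name in person)

def frequency(r1, r2):
    # f: R x R -> N, here the rarest name value seen in either record.
    return min(name_freq[v] for v in (*r1, *r2))

# Swap-tolerant: both orderings of the same name yield the same frequency.
print(frequency(("Josh", "Powell"), ("Powell", "Josh")))
```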

Exploiting Frequencies
 Frequency-enriched Models
One way to exploit frequency distributions is to alter the models learned with the machine learning techniques:
1. Manually add rules to the models. Ex (logistic regression): "if the frequency of the name value is below 10, then increase the weight of the name similarity by 10% and appropriately decrease the weights of the other similarity functions". Drawback: manually defining such rules is cumbersome and error-prone.
2. Integrate the frequencies directly into the machine learning models, e.g., scaled by M, where M is the maximum frequency in the data set.
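A sketch of the manual-rule option applied to a learned logistic regression model (the weight values are invented; the rule is the one quoted above):

```python
import numpy as np

# Learned weights for [sim_name, sim_birthdate, sim_address] (invented).
weights = np.array([1.8, 1.2, 0.9])
bias = -2.0

def overall_similarity(features, name_frequency):
    w = weights.copy()
    if name_frequency < 10:                   # rare name value
        boost = 0.1 * w[0]
        w[0] += boost                         # name weight +10%
        w[1:] -= boost * w[1:] / w[1:].sum()  # others decreased in proportion
    z = w @ features + bias
    return 1.0 / (1.0 + np.exp(-z))           # logistic function

print(overall_similarity(np.array([0.9, 0.4, 0.3]), name_frequency=3))
```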

Partitioning strategies  partition compared record pairs into n partitions using the determined frequencies.  Number of partition: Too large in small partitions: Overfitting 0 10 Too small in large partitions: discovering frequency-specific differences

Partitioning strategies  Define partitions: The entire frequency space is divided into non-overlapping, continuous partitions by a set of thresholds: Ɵ 0 = 0 and Ɵ n = M + 1, where M is the maximum frequency in the data set.  Defined as frequency ranges I i : A partition covers a set of record pairs. A record pair(r1,r2) falls into a partition [ Ɵ i, Ɵ i+1 ) iff the frequency function value for this pair lies in the partition's range: 16

Partitioning Strategies
Random partitioning: randomly pick several thresholds Ɵ_i ∈ {0, …, M + 1}. The number of thresholds in each partitioning is also randomly chosen, with a maximum of 20 partitions in one partitioning.
Equi-depth partitioning: divide the frequency space into e partitions, e ∈ {2, …, 20}. Each partition contains the same number of tuples from the original data set R.
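A sketch of computing equi-depth thresholds from observed pair frequencies (quantile-based; my own formulation of the idea):

```python
import numpy as np

def equi_depth_thresholds(frequencies, e):
    # Quantile cut points give e ranges holding roughly the same number
    # of pairs; duplicate cut points collapse for heavily skewed data,
    # yielding fewer than e partitions.
    qs = np.quantile(frequencies, np.linspace(0, 1, e + 1))
    cuts = sorted(set(int(q) for q in qs))
    m = max(frequencies)
    return [0] + [c for c in cuts if 0 < c < m] + [m + 1]

freqs = np.random.zipf(2.0, 10_000)  # name frequencies are typically skewed
print(equi_depth_thresholds(freqs, e=5))
```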

Partitioning Strategies
Greedy partitioning: define a list of threshold candidates C = {Ɵ_0, …, Ɵ_n} by dividing the frequency space into segments with the same number of tuples (similar to equi-depth partitioning, but with a fixed, large e = 50).
Process (see the sketch after the next slide):
1. Learn a partition for the first candidate thresholds [Ɵ_0, Ɵ_1).
2. Learn a second partition that extends the current partition by moving its upper threshold to the next threshold candidate: [Ɵ_0, Ɵ_2), then [Ɵ_0, Ɵ_3), ….
3. Compare both partitions using F-measure.

Partitioning Strategies
Greedy partitioning (continued):
 If the extended partition achieves better performance, the process is repeated for the next threshold slot.
 If not, the smaller partition is kept and a new partitioning is started at its upper threshold; another iteration starts with this new partition.
 This process is repeated until all threshold candidates have been processed.
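A compact sketch of this greedy loop; evaluate(lower, upper) is a hypothetical helper standing in for learning a similarity function on all pairs with frequency in [lower, upper) and returning its F-measure:

```python
def greedy_partitioning(candidates, evaluate):
    # candidates = [theta_0, ..., theta_n], the equi-depth cut points (e = 50).
    partitions = []
    lower, i = candidates[0], 1
    best_f = evaluate(lower, candidates[i])
    while i + 1 < len(candidates):
        extended_f = evaluate(lower, candidates[i + 1])
        if extended_f > best_f:     # extension helps: widen the partition
            best_f = extended_f
        else:                       # keep the smaller partition, start anew
            partitions.append((lower, candidates[i]))
            lower = candidates[i]
            best_f = evaluate(lower, candidates[i + 1])
        i += 1
    partitions.append((lower, candidates[i]))
    return partitions
```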

[Slide: worked example of the greedy comparison, showing frequency ranges [Ɵ_i, Ɵ_j) with counts of similar and dissimilar pairs and predicted-vs-actual confusion matrices from which the F-measures of a partition and its extension are computed]
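For reference, the F-measure used in the comparison, computed from such a confusion matrix (the counts below are illustrative; the slide's exact numbers did not survive extraction):

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f_measure(tp=51, fp=3, fn=1))    # current partition
print(f_measure(tp=101, fp=3, fn=6))   # extended partition
```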

Partitioning Strategies
 Genetic Partitioning Algorithm
1. Initialization: create an initial population consisting of several random partitionings, generated with the random partitioning approach described above.
2. Growth: learn one composite similarity function for each partition in the current set of partitionings.
3. Selection: for each partition, determine the maximum F-measure that can be achieved by choosing an appropriate threshold for the similarity function. Rank the partitionings by weighted F-measure and select the top five.

Partitioning Strategies
4. Reproduction: build pairs of the selected best individuals and combine them to create new individuals.
a) Recombination: first create the union of the thresholds of both partitionings. For each threshold, randomly decide whether to keep it in the result partitioning; both decisions have equal chances.
b) Mutation: randomly decide whether to add another new (also randomly picked) threshold and whether to delete a (randomly picked) threshold from the current threshold list.
Define a minimum partition size (set here to 20 record pairs); randomly created partitionings with too-small partitions are discarded.

[Slide: recombination example, merging the thresholds of two parent partitionings such as [0, 1), [1, 3), [3, 4) and [0, 2), [2, 4), [4, 5) into new candidate partitionings, of which the top 5 are kept]

Partitioning Strategies
5. Termination: the resulting partitionings are evaluated and added to the set of evaluated partitionings. The selection/reproduction phases are repeated until a certain number of iterations is reached or until no significant improvement can be measured; a minimum F-measure improvement is required within 5 iterations.
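A compact sketch of the whole genetic loop; random_partitioning() and fitness(partitioning) are hypothetical helpers standing in for steps 1-3 (random generation, learning, and weighted-F-measure evaluation):

```python
import random

def genetic_partitioning(random_partitioning, fitness,
                         population=20, top_k=5, iterations=50, M=500):
    pop = [random_partitioning() for _ in range(population)]
    for _ in range(iterations):
        # Selection: keep the top-k partitionings by (weighted) F-measure.
        best = sorted(pop, key=fitness, reverse=True)[:top_k]
        children = []
        for p1, p2 in zip(best, best[1:]):
            union = sorted(set(p1) | set(p2))
            # Recombination: keep each threshold with probability 1/2.
            child = [t for t in union if random.random() < 0.5]
            # Mutation: maybe add a random threshold, maybe drop one.
            if random.random() < 0.5:
                child.append(random.randint(1, M))
            if child and random.random() < 0.5:
                child.remove(random.choice(child))
            children.append(sorted(set(child) | {0, M + 1}))
        pop = best + children
    return max(pop, key=fitness)
```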

Experiment
 Evaluation on Schufa Data Set
The data set consists of two parts: a person data set and a query data set. Record pairs were built in the form (query, correct result) or (query, incorrect result).

Experiment
 Evaluation on DBLP Data Set (a bibliographic database for computer science)
Paper pairs of four types were built:
(1) two papers from the same author,
(2) two papers from the same author with different name aliases,
(3) two papers from different authors with the same name,
(4) two papers from different authors with different names.
For each paper pair, the matching task is to decide whether the two papers were written by the same author.

Conclusion  With this paper, introduced a novel approach for im- proving composite similarity measures.  Divide a data set consisting of record pairs into partitions according to frequencies of selected attributes.  Learn optimal similarity measures for each partition.  Experiments on different real-world data sets showed that partitioning the data can improve learning results and that genetic partitioning performs better than several other partitioning strategies. 27

Thank you for listening!