
Searching and Integrating Information on the Web. Seminar 3: Data Cleansing. Professor Chen Li, UC Irvine.

Paper readings:
- Efficient merge and purge: Hernandez and Stolfo, SIGMOD 1995.
- Approximate String Joins in a Database (Almost) for Free: Gravano et al., VLDB 2001.
- Efficient Record Linkage in Large Data Sets: Liang Jin, Chen Li, Sharad Mehrotra, DASFAA 2003.
- Interactive Deduplication Using Active Learning: Sarawagi and Bhamidipaty, KDD 2002.

Motivation: correlating data from different data sources (e.g., data integration). The data is often dirty and needs to be cleansed before being used. Example: a hospital needs to merge patient records from different data sources; they have different formats, typos, and abbreviations.

Example: find records from different datasets that could be the same entity.

  Table R:
    Name           SSN  Addr
    Jack Lemmon    ...  Maple St
    Harrison Ford  ...  Culver Blvd
    Tom Hanks      ...  Main St
    ...

  Table S:
    Name           SSN  Addr
    Ton Hanks      ...  Main Street
    Kevin Spacey   ...  Frost Blvd
    Jack Lemon     ...  Maple Street
    ...

Another example: two citation formats for the same paper.
- P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40 (1981).
- Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981.

Record linkage. Problem statement: given two relations, identify the potentially matched records, both efficiently and effectively.

Challenges:
- How to define good similarity functions? Many functions have been proposed (edit distance, cosine similarity, ...), and domain knowledge is critical. Names: "Wall Street Journal" and "LA Times"; addresses: "Main Street" versus "Main St".
- How to do the matching efficiently? There is an offline (join) version and an online (interactive) search version: nearest-neighbor search and range search.

Outline:
- Supporting string-similarity joins using an RDBMS
- Using mapping techniques
- Interactive deduplication

Edit distance: a widely used metric to define string similarity. ed(s1, s2) = the minimum number of operations (insertion, deletion, substitution) needed to change s1 into s2. Example: s1 = "Tom Hanks", s2 = "Ton Hank"; ed(s1, s2) = 2.
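
For concreteness, here is a minimal dynamic-programming implementation of edit distance in Python (a standard textbook sketch, not code from the papers discussed):

    def edit_distance(s1, s2):
        # classic dynamic program: d[i][j] = edits to turn s1[:i] into s2[:j]
        m, n = len(s1), len(s2)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[m][n]

    print(edit_distance('Tom Hanks', 'Ton Hank'))  # 2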

Approximate string joins: we want to join tuples with "similar" string fields. Similarity measure: edit distance, where each insertion, deletion, or replacement increases the distance by one.

  Service A             Service B              Edit distance
  Jenny Stamatopoulou   Jenny Stamatopulou     K=1
  John Paul McDougal    John P. McDougal       K=3
  Aldridge Rodriguez    Al Dridge Rodriguez    K=1
  Panos Ipeirotis       Panos Ipirotis         K=1
  John Smith            Jonh Smith             K=2

Focus: approximate string joins over relational DBMSs. Join two tables on string attributes and keep all pairs of strings with edit distance <= K. Solve the problem in a database-friendly way, if possible with an existing "vanilla" RDBMS.

Current approaches for processing approximate string joins: there is no native support for approximate joins in RDBMSs. Two existing (straightforward) solutions: join the data outside of the DBMS, or join the data via user-defined functions (UDFs) inside the DBMS.

Approximate string joins outside of a DBMS: (1) export the data, (2) join outside of the DBMS, (3) import the result. Main advantage: we can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionality. Disadvantages: substantial amounts of data must be exported and imported, and the join cannot be easily integrated with further processing steps in the DBMS.

Approximate string joins with UDFs: (1) write a UDF that checks whether two strings match within distance K, (2) write an SQL statement that applies the UDF to the string pairs:

  SELECT R.stringAttr, S.stringAttr
  FROM R, S
  WHERE edit_distance(R.stringAttr, S.stringAttr, K)

(The UDF returns true when the two strings are within edit distance K.) Main advantage: ease of implementation. Main disadvantage: the UDF is applied to the entire cross-product of the relations.

Our approach: approximate string joins over an unmodified RDBMS. (1) Preprocess the data and generate auxiliary tables; (2) perform the join exploiting standard RDBMS capabilities. Advantages: no modification of the underlying RDBMS is needed, the RDBMS query optimizer can be leveraged, and it is much more efficient than the approach based on naive UDFs.

Intuition and roadmap. Intuition: similar strings have many common substrings; use exact joins to perform approximate joins (current DBMSs are good at exact joins); a good candidate set can then be verified for false positives [Ukkonen 1992, Sutinen and Tarhio 1996, Ullman 1977]. Roadmap: break strings into substrings of length q (q-grams), perform an exact join on the q-grams, find candidate string pairs based on the results, and check only the candidate pairs with a UDF to obtain the final answer.

What is a "q-gram"? A sequence of q characters of the (padded) original string. Example for q = 3: vacations -> {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}. A string of length L produces L + q - 1 q-grams. Similar strings have many common q-grams.
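
A small sketch of q-gram extraction matching the slide's padding convention (q-1 '#' characters prepended, q-1 '$' characters appended):

    def qgrams(s, q=3):
        # pad so that every character participates in exactly q q-grams
        padded = '#' * (q - 1) + s + '$' * (q - 1)
        return [padded[i:i + q] for i in range(len(padded) - q + 1)]

    print(qgrams('vacations'))
    # ['##v', '#va', 'vac', 'aca', 'cat', 'ati', 'tio', 'ion', 'ons', 'ns$', 's$$']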

Q-grams and edit-distance operations (q = 3):
- No edits: L + q - 1 common q-grams.
- Replacement: (L + q - 1) - q common q-grams. Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}; Vacalions: {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns$, s$$}.
- Insertion: (Lmax + q - 1) - q common q-grams. Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}; Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns$, s$$}.
- Deletion: (Lmax + q - 1) - q common q-grams. Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}; Vacaions: {##v, #va, vac, aca, cai, aio, ion, ons, ns$, s$$}.

Number of common q-grams and edit distance: for edit distance K, there can be at most K replacements, insertions, or deletions. Two strings S1 and S2 with edit distance <= K have at least [max(S1.len, S2.len) + q - 1] - K*q q-grams in common. This yields a useful filter: eliminate all string pairs without "enough" common q-grams (no false dismissals).
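
A hedged sketch of this count filter in Python (the bound is from the slide; the helper names are illustrative):

    from collections import Counter

    def qgrams(s, q=3):
        padded = '#' * (q - 1) + s + '$' * (q - 1)
        return [padded[i:i + q] for i in range(len(padded) - q + 1)]

    def passes_count_filter(s1, s2, q=3, k=2):
        # strings within edit distance k share at least this many q-grams
        bound = max(len(s1), len(s2)) + q - 1 - k * q
        common = sum((Counter(qgrams(s1, q)) & Counter(qgrams(s2, q))).values())
        return common >= bound

    print(passes_count_filter('vacations', 'vacalions'))  # True -> candidate pair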

Using a DBMS for q-gram joins: if the q-grams are stored in the DBMS, this counting can be performed efficiently. Create auxiliary tables with tuples of the form (sid, strlen, pos, qgram) and join these tables; a GROUP BY ... HAVING COUNT clause performs the counting and filtering.

Eliminating candidate pairs: COUNT FILTERING. SQL for this filter (parts omitted for clarity):

  SELECT R.sid, S.sid
  FROM R, S
  WHERE R.qgram = S.qgram
  GROUP BY R.sid, S.sid
  HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

The result is the set of string pairs with sufficiently many common q-grams to ensure that we will not have false negatives.

Eliminating candidate pairs further: LENGTH FILTERING. Strings with a length difference larger than K cannot be within edit distance K:

  SELECT R.sid, S.sid
  FROM R, S
  WHERE R.qgram = S.qgram
    AND abs(R.strlen - S.strlen) <= K
  GROUP BY R.sid, S.sid
  HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

We refer to this filter as LENGTH FILTERING.

Exploiting q-gram positions for filtering: consider the strings aabbzzaacczz and aacczzaabbzz. They are at edit distance 4, yet they have identical q-grams for q = 3. The problem is matching q-grams that are at different positions in the two strings: either the q-grams do not "originate" from the same q-gram, or too many edit operations "caused" spurious q-grams at various parts of the strings to match.

POSITION FILTERING: keep the position of each q-gram, and do not match q-grams that are more than K positions apart:

  SELECT R.sid, S.sid
  FROM R, S
  WHERE R.qgram = S.qgram
    AND abs(R.strlen - S.strlen) <= K
    AND abs(R.pos - S.pos) <= K
  GROUP BY R.sid, S.sid
  HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

We refer to this filter as POSITION FILTERING.

The actual, complete SQL statement:

  SELECT R1.string, S1.string, R1.sid, S1.sid
  FROM R1, S1, R, S
  WHERE R1.sid = R.sid AND S1.sid = S.sid
    AND R.qgram = S.qgram
    AND abs(strlen(R1.string) - strlen(S1.string)) <= K
    AND abs(R.pos - S.pos) <= K
  GROUP BY R1.sid, S1.sid, R1.string, S1.string
  HAVING COUNT(*) >= (max(strlen(R1.string), strlen(S1.string)) + q - 1) - K*q
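
To make the pipeline concrete, here is a hedged end-to-end sketch using Python's built-in sqlite3 (table and helper names are illustrative; SQLite's length() plays the role of strlen(), and its two-argument max() is a scalar maximum):

    import sqlite3

    Q, K = 3, 2

    def positional_qgrams(s, q=Q):
        padded = '#' * (q - 1) + s + '$' * (q - 1)
        return [(i, padded[i:i + q]) for i in range(len(padded) - q + 1)]

    conn = sqlite3.connect(':memory:')
    conn.executescript("""
        CREATE TABLE R1 (sid INTEGER, string TEXT);
        CREATE TABLE S1 (sid INTEGER, string TEXT);
        CREATE TABLE R  (sid INTEGER, pos INTEGER, qgram TEXT);
        CREATE TABLE S  (sid INTEGER, pos INTEGER, qgram TEXT);
    """)

    def load(strings, str_table, qgram_table):
        for sid, s in enumerate(strings):
            conn.execute(f"INSERT INTO {str_table} VALUES (?, ?)", (sid, s))
            conn.executemany(f"INSERT INTO {qgram_table} VALUES (?, ?, ?)",
                             [(sid, pos, g) for pos, g in positional_qgrams(s)])

    load(['vacations', 'john smith'], 'R1', 'R')
    load(['vacalions', 'jonh smith'], 'S1', 'S')

    candidates = conn.execute("""
        SELECT R1.string, S1.string
        FROM R1, S1, R, S
        WHERE R1.sid = R.sid AND S1.sid = S.sid
          AND R.qgram = S.qgram
          AND abs(length(R1.string) - length(S1.string)) <= ?
          AND abs(R.pos - S.pos) <= ?
        GROUP BY R1.sid, S1.sid, R1.string, S1.string
        HAVING COUNT(*) >= max(length(R1.string), length(S1.string)) + ? - 1 - ? * ?
    """, (K, K, Q, K, Q)).fetchall()

    print(candidates)  # both pairs survive the filters

The surviving candidates would still go through a final edit-distance check (the UDF step) to remove false positives.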

Summary of the 1st paper: introduced a technique for mapping approximate string joins into a "vanilla" SQL expression. The technique does not require modifying the underlying RDBMS.

Outline:
- Supporting string-similarity joins using an RDBMS
- Using mapping techniques
- Interactive deduplication

Single-attribute case. Given: two sets of strings, R and S; a similarity function f between strings forming a metric space (reflexive: f(s1, s2) = 0 iff s1 = s2; symmetric: f(s1, s2) = f(s2, s1); triangle inequality: f(s1, s2) + f(s2, s3) >= f(s1, s3)); and a threshold k. Find: all pairs of strings (r, s) from R and S such that f(r, s) <= k.

Nested loop? Not desirable for large data sets: 5 hours for 30K strings!

Our 2-step approach. Step 1: map strings (in a metric space) to objects in a Euclidean space. Step 2: do a similarity join in the Euclidean space.

Advantages: applicable to many metric similarity functions (edit distance is the running example; other similarity functions were also tried, e.g., q-gram-based similarity), and open to existing algorithms, both mapping techniques and join techniques.

Step 1: map strings into a high-dimensional Euclidean space. (Figure: metric space mapped to Euclidean space.)

Mapping: StringMap. Input: a list of strings. Output: points in a high-dimensional Euclidean space that preserve the original distances well. A variation of FastMap: each step greedily picks two strings (pivots) to form an axis, and all axes are orthogonal.
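
The slides summarize StringMap only at a high level; below is a hedged, minimal FastMap-style sketch (function names and the random pivot heuristic are illustrative assumptions; StringMap's actual pivot selection and bookkeeping differ in detail):

    import random

    def fastmap_coords(strings, dist, dims, seed=0):
        rng = random.Random(seed)
        coords = {s: [] for s in strings}  # assumes the strings are distinct

        def dist2(x, y):
            # squared distance in the residual space left by earlier axes
            d2 = dist(x, y) ** 2
            for cx, cy in zip(coords[x], coords[y]):
                d2 -= (cx - cy) ** 2
            return max(d2, 0.0)

        for _ in range(dims):
            # greedy pivot pair: random start, farthest point, farthest again
            a = rng.choice(strings)
            b = max(strings, key=lambda s: dist2(a, s))
            a = max(strings, key=lambda s: dist2(b, s))
            dab2 = dist2(a, b)
            new = {x: 0.0 if dab2 == 0 else
                      (dist2(a, x) + dab2 - dist2(b, x)) / (2 * dab2 ** 0.5)
                   for x in strings}
            for x in strings:
                coords[x].append(new[x])
        return coords

    # e.g., coords = fastmap_coords(names, edit_distance, dims=20)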

Can it preserve distances? Data sources: IMDB star names (54,000) and German names (132,000). (Figure of the distribution of string lengths omitted.)

Can it preserve distances? Using data set 1 (54K names) as an example, with k = 2 and d = 20, a new threshold k' = 5.2 differentiates similar and dissimilar pairs.

Choosing the dimensionality d. Increasing d is good because it better differentiates similar pairs from dissimilar ones, but bad because Step 1 becomes less efficient and Step 2 suffers from the "curse of dimensionality".

Choose the dimensionality d using sampling: sample 1K x 1K strings and find their similar pairs (within distance k); calculate the maximum w of their new distances; define the "cost" of finding a similar pair as Cost = (# of pairs within distance w) / (# of similar pairs).

Choosing the dimensionality d: experiments suggest d = 15 to 25. (Figure omitted.)

Choose the new threshold k'. It is closely related to the mapping property: ideally, if ed(r, s) <= k, the Euclidean distance between the two corresponding points is <= k'. Choose k' using sampling: sample 1K x 1K strings, find the similar pairs, and calculate their maximum new distance as k'; repeat multiple times and choose the maximum.

New threshold k' in Step 2, with d = 20. (Figure omitted.)

Step 2: similarity join. Input: two sets of points in Euclidean space. Output: pairs of points whose distance is less than the new threshold k'. Many join algorithms can be used.

Example: an algorithm by Hjaltason and Samet was adopted: build two R-trees, traverse the two trees to find points whose distance is within k', and prune during the traversal (e.g., using MinDist).
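
As a hedged stand-in for the R-tree join (which SciPy does not provide), a k-d tree supports the same within-distance join; function and variable names here are illustrative:

    import numpy as np
    from scipy.spatial import cKDTree

    def euclidean_join(points_r, points_s, k_prime):
        # points_r, points_s: (n, d) and (m, d) arrays from the mapping step
        tree_r = cKDTree(points_r)
        tree_s = cKDTree(points_s)
        # for each point in R, indices of S-points within distance k_prime
        neighbors = tree_r.query_ball_tree(tree_s, r=k_prime)
        return [(i, j) for i, js in enumerate(neighbors) for j in js]

    # toy usage with random 20-dimensional points
    rng = np.random.default_rng(0)
    pairs = euclidean_join(rng.random((100, 20)), rng.random((80, 20)), k_prime=0.9)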

Final processing: among the pairs produced by the similarity-join step, check the actual edit distance, and return those pairs satisfying the original threshold k.
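
A one-function sketch of this verification step, assuming the edit_distance helper from the earlier sketch:

    def verify(candidate_pairs, strings_r, strings_s, k):
        # keep only candidates whose true edit distance is within k
        return [(i, j) for i, j in candidate_pairs
                if edit_distance(strings_r[i], strings_s[j]) <= k]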

Running time. (Figure omitted.)

Recall: (# of found similar pairs) / (# of all similar pairs).

Multi-attribute linkage. Example: title + name + year. Different attributes have different similarity functions and thresholds. Consider merge rules in disjunctive format, i.e., a disjunction of conjunctions of per-attribute conditions; schematically, a rule might look like (f1(title) <= k1 AND f2(name) <= k2) OR (f2(name) <= k3 AND f3(year) <= k4).

Evaluation strategies: there are many ways to evaluate rules, and finding an optimal one is NP-hard. Heuristics: treat different conjuncts independently and pick the "most efficient" attribute in each conjunct; or choose the largest threshold for each attribute, then choose the "most efficient" attribute among these thresholds.

Summary of the 2nd paper: a novel two-step approach to record linkage; many existing mapping and join algorithms can be adopted; applicable to many distance metrics; time and space efficient; the multi-attribute case was also studied.

Outline:
- Supporting string-similarity joins using an RDBMS
- Using mapping techniques
- Interactive deduplication

Problems with existing deduplication methods. Matching functions: calculating similarity scores and thresholds requires tedious coding. Learning-based methods: they require a large training set for accuracy (a static training set), and it is difficult to provide a covering and challenging training set that brings out the subtlety of the deduplication function.

New approach: relegate the task of finding the deduplication function to a machine learning algorithm. Design goals: fewer training instances, interactive response, fast convergence, and high accuracy. The result is an interactive deduplication system called ALIAS (Active Learning led Interactive Alias Suppression).

The ALIAS system: a learning-based method that exploits existing similarity functions and uses active learning (an active learner actively picks the unlabeled instances with the most information gain for the training set), producing a deduplication function that can identify duplicates.

Overall architecture (diagram omitted): labeled pairs L and the unlabeled database D are mapped, via the similarity functions F, into mapped instances Lp and Dp; the learner trains classifiers on the training data T, selects a set S of n instances from Dp for labeling (using a predicate for the uncertain region, supported by similarity indices), and folds the user's feedback back into T.

Primary inputs for ALIAS:
- A set of initial training pairs (L): fewer than 10 labeled records, arranged in pairs of duplicates and non-duplicates.
- A set of similarity functions (F), e.g., word-match, qgram-match, ..., which compute similarity scores between two records based on any subset of attributes; the learner finds the right way of combining those scores into the final deduplication function.
- A database of unlabeled records (D).
- The number of classifiers (< 5).

Mapped labeled instances (Lp): take a pair r1, r2 from the input L, where r1 = (a1, a2, a3) and r2 = (a1, a2, a3); use the similarity functions f1, f2, ..., fn to compute similarity scores between r1 and r2; form a new record r1&r2 = (s1, s2, ..., sn, y/n), where y = duplicate and n = non-duplicate; put the new record in Lp.

Mapped unlabeled instances (Dp): take a pair r1, r2 from D x D; use the similarity functions f1, f2, ..., fn to compute similarity scores between r1 and r2; form a new record r1&r2 = (s1, s2, ..., sn), with no y/n field; put the new record in Dp.
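
A hedged sketch of the mapper with two simple similarity functions (Jaccard word match and q-gram overlap; the exact formulas ALIAS uses are not specified on the slides):

    def word_match(a, b):
        # fraction of shared words (Jaccard similarity on word sets)
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

    def qgram_match(a, b, q=3):
        ga = {a[i:i + q] for i in range(max(len(a) - q + 1, 1))}
        gb = {b[i:i + q] for i in range(max(len(b) - q + 1, 1))}
        return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

    def map_pair(r1, r2, fns=(word_match, qgram_match)):
        # one mapped instance: a vector with one similarity score per function
        return [f(r1, r2) for f in fns]

    print(map_pair('Tom Hanks', 'Ton Hanks'))  # [0.3333..., 0.4]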

Active learner: trains classifiers on the training data T and selects a set S of n instances from Dp for labeling, based on user feedback. (Architecture diagram repeated; omitted.)

ALIAS algorithm:
1. Input: L, D, F.
2. Create pairs Lp from the labeled data L and F.
3. Create pairs Dp from the unlabeled data D and F.
4. Initial training set T = Lp.
5. Loop until user satisfaction:
   - Train a classifier C using T.
   - Use C to select a set S of n instances from Dp for labeling.
   - If S is empty, exit the loop.
   - Collect user feedback on the labels of S.
   - Add S to T and remove S from Dp.
6. Output the classifier C.
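
A hedged sketch of this loop using a scikit-learn committee (the classifier choices, the disagreement measure, and the stopping rule are illustrative assumptions, not the ALIAS implementation):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    def alias_loop(Lp_X, Lp_y, Dp_X, ask_user, n=5, max_rounds=10):
        # Lp_X, Lp_y: mapped labeled pairs with 0/1 labels (both classes present);
        # Dp_X: mapped unlabeled pairs; ask_user(x): stand-in for user feedback
        T_X, T_y = list(Lp_X), list(Lp_y)
        Dp_X = np.asarray(Dp_X)
        pool = list(range(len(Dp_X)))
        committee = [DecisionTreeClassifier(), GaussianNB(), SVC()]
        for _ in range(max_rounds):
            if not pool:
                break
            for c in committee:
                c.fit(np.asarray(T_X), np.asarray(T_y))
            votes = np.array([c.predict(Dp_X[pool]) for c in committee])
            uncertainty = votes.std(axis=0)  # committee disagreement
            if uncertainty.max() == 0:
                break  # the committee agrees everywhere: nothing left to ask
            picked = np.argsort(-uncertainty)[:n]
            for p in sorted(picked, reverse=True):
                i = pool[p]
                T_X.append(Dp_X[i])
                T_y.append(ask_user(Dp_X[i]))
                pool.pop(p)
        return DecisionTreeClassifier().fit(np.asarray(T_X), np.asarray(T_y))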

The indexing component. Purpose: avoid mapping all pairs of records in D x D. Three methods: grouping, sampling, and indexing.

The indexing component (cont.). Grouping: for example, group the records in D according to the field "year of publication"; mapped pairs are formed only within records of a group. Sampling: sample in units of a group instead of individual records.

The indexing component (cont.). Indexing: given a similarity function such as "fraction of common words between two text attributes >= 0.4", we can create an index on the words of the text attributes.

The learning component contains a number of classifiers. A classifier is a machine learning algorithm, such as a decision tree (D-tree), naïve Bayes (NB), or a Support Vector Machine (SVM), that classifies instances; a classifier is trained using a training data set.

Criteria for a classifier: accuracy, interpretability, indexability, and efficient training.

Accuracy of a classifier: measured by the F-measure, the harmonic mean F = 2rp / (r + p) of recall r and precision p, where r is the fraction of duplicates correctly classified and p is the fraction correct amongst all instances actually labeled duplicate.

Accuracy (cont.). Example: in a case with 1% duplicates, a classifier that labels all pairs as non-duplicates gets recall r = 0, so F = 0 and accuracy = 0%. A classifier that identifies all duplicates correctly but misclassifies 1% of the non-duplicates gets r = 1 and p = 0.5, so F = 2(1)(0.5)/(1 + 0.5) = 2/3 and accuracy = 66.7%. If we don't use r and p, both classifiers have 99% plain accuracy!
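
A quick check of the arithmetic above (a hedged helper, not ALIAS code):

    def f_measure(r, p):
        # harmonic mean of recall and precision; 0 when both are 0
        return 0.0 if r + p == 0 else 2 * r * p / (r + p)

    print(f_measure(0.0, 0.0))  # all-non-duplicate classifier: F = 0.0
    print(f_measure(1.0, 0.5))  # all duplicates found, precision 0.5: F = 0.666...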

Criteria for a classifier (cont.). Interpretability: the final deduplication rule is easy to understand and interpret. Indexability: the final deduplication rule has indexable predicates. Efficient training: fast to train.

Comparison of different classifiers. (Table omitted.)

Active learning: the goal is to seek out from the unlabeled pool the instances which, when labeled, will strengthen the classifier at the fastest possible rate.

A simple example: assume all points on a line are unlabeled except a and b, with a at coordinate 0 and b at coordinate 1. Any unlabeled point x to the left of a or to the right of b has no effect in reducing the region of uncertainty. By including the midpoint m, the most uncertain instance, in the training set, the size of the uncertain region is reduced by half.

How to select an unlabeled instance:
- Uncertainty: the instance about which the learner is most unsure is also the instance for which the expected reduction in confusion is largest. The uncertainty score is the disagreement among the predictions the instance gets from a committee of N classifiers; a sure duplicate or non-duplicate would get the same prediction from all members.
- Representativeness: an uncertain instance representing a larger number of unlabeled instances has greater impact on the classifier.

Example: three similarity functions (word match f1, qgram match f2, string edit distance f3) and three classifiers (D-tree, naïve Bayes, SVM). Take mapped unlabeled instances from Dp, e.g., r1&r2 = (s1, ..., sn) and r3&r4 = (s1, ..., sn), where the scores s1, ..., sn come from f1, f2, f3:

  Pair    D-tree      Naïve Bayes      SVM
  r1&r2   duplicate   non-duplicate    duplicate   <- selected (the committee disagrees)
  r3&r4   duplicate   duplicate        duplicate

How to combine uncertainty and representativeness: two approaches were proposed. 1st approach, weighted sum: cluster the unlabeled instances, estimate the density of points around each instance, score the instances using a weighted sum of density and uncertainty value, and select the n highest-scoring instances (see the sketch below). 2nd approach: sampling.
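
A hedged sketch of the weighted-sum idea, using cluster size as a density proxy (the clustering method, weights, and names are illustrative assumptions):

    import numpy as np
    from sklearn.cluster import KMeans

    def weighted_scores(Dp_X, uncertainty, alpha=0.5, n_clusters=10):
        # density of an instance ~ relative size of the cluster it falls in
        # (assumes len(Dp_X) >= n_clusters)
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Dp_X)
        density = np.bincount(labels, minlength=n_clusters)[labels] / len(Dp_X)
        return alpha * density + (1 - alpha) * uncertainty  # pick the n highest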

Conclusion. ALIAS makes deduplication much easier (fewer training instances), provides interactive response to the user, and achieves high accuracy.