Active Sampling for Entity Matching
Aditya Parameswaran, Stanford University
Jointly with: Kedar Bellare, Suresh Iyengar, Vibhor Rastogi (Yahoo! Research)
Entity Matching
Goal: Find duplicate entities in a given data set
—Fundamental data cleaning primitive; decades of prior work
—Especially important at Yahoo! (and other web companies)
Example duplicate pair: "Homma's Brown Rice Sushi, California Avenue, Palo Alto" vs. "Homma's Sushi, Cal Ave, Palo Alto"
Why is it important?
Pipeline: Websites, Databases, Content Providers (e.g., Yelp, Zagat, Foursquare) → Dirty Entities → Find Duplicates → Deduplicated Entities
Applications:
—Business Listings in Y! Local
—Celebrities in Y! Movies
—Events in Y! Upcoming
—…
How?
Reformulated Goal: Construct a high-quality classifier identifying duplicate entity pairs
Problem: How do we select training data?
Answer: Active Learning with Human Experts!
Reformulated Workflow
Websites, Databases, Content Providers → Dirty Entities → Our Technique → Deduplicated Entities
Active Learning (AL) Primer
Properties of an AL algorithm:
—Label Complexity
—Time Complexity
—Consistency
Prior work:
—Uncertainty Sampling
—Query by Committee
—…
—Importance Weighted Active Learning (IWAL)
—Online IWAL without constraints, implemented in Vowpal Wabbit (VW): targets the 0-1 metric, is time- and label-efficient, provably consistent, and works even under noisy settings
For intuition, a generic sketch of the first idea, uncertainty sampling, appears below.
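Here is a minimal pool-based uncertainty-sampling loop in Python — an illustration only, not this paper's method (the paper builds on online IWAL as implemented in VW). The function name and the scikit-learn model choice are ours, and we assume the pool contains both classes:

```python
# Minimal pool-based uncertainty sampling (illustration only; the paper's
# experiments use online IWAL as implemented in Vowpal Wabbit).
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(X_pool, oracle_labels, n_queries=50):
    # Seed with one example of each class so the first fit succeeds
    # (assumes the pool contains at least one match and one non-match).
    labeled = [int(np.argmax(oracle_labels == 1)),
               int(np.argmax(oracle_labels == 0))]
    clf = LogisticRegression(max_iter=1000)
    for _ in range(n_queries):
        clf.fit(X_pool[labeled], oracle_labels[labeled])
        p = clf.predict_proba(X_pool)[:, 1]
        margin = np.abs(p - 0.5)        # small margin = model is uncertain
        margin[labeled] = np.inf        # never re-query a labeled pair
        labeled.append(int(np.argmin(margin)))  # ask the human expert
    return clf, labeled
```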
Problem One: Imbalanced Data
Typical to have 100:1 non-matches to matches, even after blocking
Degenerate "solution": predict all pairs as non-matches — 0-1 error ≈ 0 and (vacuously) 100% precision, but not a single match is correctly identified
Solution: metric from [Arasu11]:
—Maximize recall (% of correct matches identified)
—Such that precision > τ
(See the numeric sketch below.)
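A tiny numeric sketch (hypothetical counts, assuming NumPy) of why plain 0-1 error misleads at 100:1 imbalance:

```python
# A classifier that predicts "non-match" for every pair looks nearly
# perfect under 0-1 error, yet finds no duplicates at all.
import numpy as np

n_matches, n_non_matches = 100, 10_000         # ~100:1 after blocking
y_true = np.array([1] * n_matches + [0] * n_non_matches)
y_pred = np.zeros_like(y_true)                  # degenerate all-non-match rule

error = np.mean(y_pred != y_true)               # ~0.0099: looks excellent
recall = (y_pred & y_true).sum() / y_true.sum() # 0.0: zero matches found
print(f"0-1 error = {error:.4f}, recall = {recall:.2f}")
```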
Problem Two: Guarantees
Prior work on entity matching:
—No guarantees on recall/precision
—Even when guarantees exist, they come with high time and label complexity
Can we adapt prior work on AL for the new objective (maximize recall, such that precision > τ) with:
—Sub-linear label complexity
—Efficient time complexity
Overview of Our Approach
Recall optimization with precision constraint
—Reduction (this talk): convex-hull search in a relaxed Lagrangian
Weighted 0-1 error
—Reduction (paper): rejection sampling
Active learning with 0-1 error
Objective
Given:
—Hypothesis class H
—Threshold τ in [0,1]
Objective: Find h in H that
—Maximizes recall(h)
—Such that: precision(h) >= τ
Equivalently:
—Maximize −falseneg(h)
—Such that: ε truepos(h) − falsepos(h) >= 0
—Where ε = (1−τ)/τ
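The equivalence is a rearrangement of the precision constraint (and recall maximization reduces to minimizing false negatives because truepos(h) + falseneg(h) is the fixed number of true matching pairs):

```latex
\begin{align*}
\mathrm{precision}(h) \ge \tau
&\iff \frac{tp(h)}{tp(h) + fp(h)} \ge \tau \\
&\iff (1-\tau)\, tp(h) \ge \tau\, fp(h) \\
&\iff \varepsilon\, tp(h) - fp(h) \ge 0,
\qquad \varepsilon = \frac{1-\tau}{\tau}.
\end{align*}
```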
Unconstrained Objective
Current formulation:
—Maximize X(h) = −falseneg(h)
—Such that Y(h) = ε truepos(h) − falsepos(h) >= 0
If we introduce a Lagrange multiplier λ:
—Maximize X(h) + λ Y(h), which can be rewritten as:
—Minimize δ falseneg(h) + (1 − δ) falsepos(h)
—A weighted 0-1 objective
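One way to see the rewriting, where P = truepos(h) + falseneg(h) is the fixed number of true matching pairs (the normalization choice below is ours for illustration; the paper's exact constants may differ):

```latex
\begin{align*}
X(h) + \lambda Y(h)
&= -fn(h) + \lambda\bigl(\varepsilon\, tp(h) - fp(h)\bigr) \\
&= -fn(h) + \lambda\varepsilon\bigl(P - fn(h)\bigr) - \lambda\, fp(h) \\
&= \lambda\varepsilon P - (1 + \lambda\varepsilon)\, fn(h) - \lambda\, fp(h).
\end{align*}
```

Since λεP is a constant, maximizing X(h) + λY(h) is equivalent to minimizing (1 + λε) falseneg(h) + λ falsepos(h); dividing through by 1 + λε + λ gives the weighted 0-1 objective with δ = (1 + λε)/(1 + λε + λ).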
Convex Hull of Classifiers
Plot each classifier h as a point in the (X(h), Y(h)) plane; the constrained problem is: maximize X(h) such that Y(h) >= 0
The convex hull is the shape formed by joining the classifiers that strictly dominate the others — it can have an exponential number of points inside
We want a classifier on the hull with the largest X(h) among those with Y(h) >= 0
Convex Hull of Classifiers (contd.)
For any λ > 0, some vertex or edge of the hull attains the largest value of X + λY
Plug λ into the weighted objective, get the classifier h with highest X(h) + λ Y(h)
If λ = −1/slope of an edge, we get a classifier on that edge; otherwise we get a vertex classifier
Convex Hull of Classifiers (contd.)
In the worst case, a poorly chosen λ returns a far-from-optimal vertex
Naïve strategy: try all λ (equivalently, try all slopes) — too long!
Instead, do binary search for λ
Problem: when to stop?
—1) Bounds
—2) Discretization of λ
Details in the paper!
Algorithm I (Ours Weighted)
Given: AL black box C for weighted 0-1 error
Goal: the precision-constrained objective
Range of λ: [Λmin, Λmax]
—Don't enumerate all candidate λ — too expensive: O(n³)
—Instead, discretize using a factor θ (see paper!)
Binary search over the discretized values
—Same complexity as binary search: O(log n)
A sketch of the outer search loop follows.
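A hedged sketch of the outer search, assuming a helper train_weighted(lam) that invokes the weighted 0-1 AL black box C, and a constraint evaluator Y(h) — both hypothetical names. The paper's θ-discretization and stopping bounds are simplified here to a fixed geometric grid:

```python
import numpy as np

def binary_search_lambda(train_weighted, Y, lam_min, lam_max, grid_size=64):
    """Sketch: binary-search the Lagrange multiplier lambda.

    train_weighted(lam) -> classifier h from the weighted 0-1 AL black box C,
    with weights delta derived from lam; Y(h) evaluates the relaxed precision
    constraint epsilon*tp(h) - fp(h).  Assumes the constraint value grows
    monotonically with lambda along the hull.
    """
    lams = np.geomspace(max(lam_min, 1e-9), lam_max, grid_size)
    lo, hi = 0, len(lams) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        h = train_weighted(lams[mid])
        if Y(h) >= 0:          # constraint satisfied: try a smaller lambda,
            best = h           # which shifts weight back toward recall
            hi = mid - 1
        else:                  # constraint violated: increase lambda
            lo = mid + 1
    return best
```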
Algorithm II (Weighted 0-1)
Given: AL black box B for 0-1 error
Goal: AL black box C for weighted 0-1 error
Use a trick from supervised learning [Zadrozny03]:
—Reduce the cost-sensitive objective to a plain binary one
—Reduction by rejection sampling (sketched below)
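A sketch of the rejection-sampling reduction in the spirit of [Zadrozny03]: each labeled example is accepted with probability proportional to its cost, so a plain 0-1 learner trained on the accepted stream optimizes the weighted objective in expectation. Interface names here are ours:

```python
import random

def rejection_sample(stream, cost_fn, z_max, rng=None):
    """Cost-proportionate rejection sampling [Zadrozny03] (sketch).

    Keep each labeled example (x, y) with probability cost_fn(x, y) / z_max,
    where z_max upper-bounds the largest cost.  Training a 0-1 AL black box B
    on the surviving stream optimizes the weighted objective in expectation.
    """
    rng = rng or random.Random(0)
    for x, y in stream:
        if rng.random() < cost_fn(x, y) / z_max:
            yield x, y

# Costs induced by the weighted 0-1 objective: a false negative costs delta,
# a false positive costs (1 - delta).
def make_cost_fn(delta):
    return lambda x, y: delta if y == 1 else 1 - delta
```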
Overview of Our Approach (recap)
Recall optimization with precision constraint
—Reduction (this talk): convex-hull search in a relaxed Lagrangian
Weighted 0-1 error
—Reduction (paper): rejection sampling
Active learning with 0-1 error
End-to-end cost, with L(B) and T(B) the label and time complexity of the 0-1 AL black box B:
—Binary search: O(log n) rounds
—Labels = O(log² n) L(B)
—Time = O(log² n) T(B)
Experiments
Four real-world data sets; all labels known, so we can simulate active learning
Two approaches for AL with a precision constraint:
—Ours, with Vowpal Wabbit as the 0-1 AL black box
—Monotone [Arasu11]: assumes monotonicity of similarity features; high computational + label complexity
Data sets (size, +/− ratio, and number of features reported per set): Y! Local Businesses, UCI Person Linkage, DBLP-ACM Bibliography, Scholar-DBLP Bibliography
Results I (Runtime with #Features)
[Plot: computational complexity (runtime vs. number of features) on UCI Person]
Results II (Quality & #Label Queries)
[Plots: quality vs. number of label queries on Business and Person]
Results II (Contd.)
[Plots: quality vs. number of label queries on DBLP-ACM and Scholar]
Results III (0-1 Active Learning)
[Plot: precision-constraint satisfaction % of plain 0-1 AL]
Conclusion
Active learning for entity matching
—Can use any 0-1 AL algorithm as a black box
Great real-world performance:
—Computationally efficient (600k examples in 25 seconds)
—Label-efficient, with better F-1 on four real-world tasks
Guaranteed:
—Precision of the matcher
—Time and label complexity