Active Sampling for Entity Matching
Aditya Parameswaran, Stanford University
Joint work with: Kedar Bellare, Suresh Iyengar, and Vibhor Rastogi (Yahoo! Research)
Entity Matching
Goal: find duplicate entities in a given data set. This is a fundamental data-cleaning primitive with decades of prior work, and it is especially important at Yahoo! (and other web companies).
Example of a duplicate pair: "Homma's Brown Rice Sushi, California Avenue, Palo Alto" vs. "Homma's Sushi, Cal Ave, Palo Alto".
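As a rough illustration of what a matcher sees, here is a minimal sketch that turns a candidate pair like the one above into similarity features; the feature functions and record schema are illustrative assumptions, not the feature set actually used in this work:

```python
from difflib import SequenceMatcher

def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def pair_features(r1, r2):
    """Similarity features for one candidate pair (illustrative only)."""
    return {
        "name_edit_sim": SequenceMatcher(None, r1["name"], r2["name"]).ratio(),
        "name_jaccard": jaccard(r1["name"], r2["name"]),
        "addr_jaccard": jaccard(r1["address"], r2["address"]),
    }

r1 = {"name": "Homma's Brown Rice Sushi", "address": "California Avenue Palo Alto"}
r2 = {"name": "Homma's Sushi", "address": "Cal Ave Palo Alto"}
print(pair_features(r1, r2))  # high name/address similarity suggests a duplicate
```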
Why is it important?
Websites, databases, and content providers (e.g., Yelp, Zagat, Foursquare) feed in dirty entities; finding duplicates turns them into deduplicated entities.
Applications: Business Listings in Y! Local, Celebrities in Y! Movies, Events in Y! Upcoming, ...
How?
Reformulated goal: construct a high-quality classifier that identifies duplicate entity pairs.
Problem: how do we select the training data?
Answer: active learning with human experts!
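A minimal sketch of the selection loop this implies, assuming a generic probabilistic classifier and a human-labeler stub; the names and the uncertainty criterion are illustrative, not the system's actual API:

```python
def active_learning_loop(model, unlabeled, ask_human, budget):
    """Pool-based active learning for entity matching.

    model: anything with fit(labeled_pairs) and predict_proba(pair) -> P(match).
    ask_human(pair) -> 0/1 label from a human expert.
    """
    labeled = []
    for _ in range(budget):
        # Query the pair the current model is least sure about.
        pair = min(unlabeled, key=lambda p: abs(model.predict_proba(p) - 0.5))
        unlabeled.remove(pair)
        labeled.append((pair, ask_human(pair)))
        model.fit(labeled)  # retrain on all labels gathered so far
    return model
```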
Reformulated Workflow
Websites, databases, and content providers supply dirty entities; our technique produces the deduplicated entities.
Active Learning (AL) Primer
Properties of an AL algorithm: label complexity, time complexity, consistency.
Prior work: uncertainty sampling, query by committee, ..., importance-weighted active learning (IWAL), and online IWAL without constraints, implemented in Vowpal Wabbit (VW).
The IWAL line of work targets the 0-1 metric, is time- and label-efficient, is provably consistent, and works even under noisy settings.
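A minimal sketch of the importance-weighting idea behind IWAL, assuming a margin-based query probability; the real IWAL rejection threshold (and VW's implementation) is more refined than this:

```python
import random

def iwal_step(model, x, ask_human, p_min=0.1):
    """One IWAL-style step: query the label with a probability that grows
    with the model's uncertainty, and weight queried examples by 1/p so
    the resulting loss estimates stay unbiased."""
    margin = abs(model.predict_proba(x) - 0.5)  # 0 means maximally uncertain
    p = max(p_min, 1.0 - 2.0 * margin)          # query probability in [p_min, 1]
    if random.random() < p:
        y = ask_human(x)
        return (x, y, 1.0 / p)  # importance weight corrects the sampling bias
    return None                 # label not requested for this example
```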
Problem One: Imbalanced Data
Non-matches typically outnumber matches 100:1, even after blocking. A degenerate classifier that labels every pair a non-match gets 0-1 error ≈ 0 (and vacuously 100% precision) while identifying no matches at all, so plain 0-1 error is the wrong objective.
Solution, the metric from [Arasu11]: maximize recall (the % of correct matches identified) such that precision (the % of predicted matches that are correct) > τ.
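A toy illustration, with made-up counts, of why 0-1 error fails under 100:1 imbalance and what the constrained metric measures instead:

```python
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 1.0  # vacuous when nothing is predicted
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# 101 pairs at 100:1 imbalance: 1 true match, 100 non-matches.
# Classifier A predicts "non-match" everywhere:
#   one mistake out of 101, so 0-1 error ~ 0.01, yet it finds nothing.
print(precision_recall(tp=0, fp=0, fn=1))  # (1.0, 0.0)

# Classifier B finds the match plus one false alarm:
#   also one mistake out of 101, but recall = 1.0 at precision = 0.5.
print(precision_recall(tp=1, fp=1, fn=0))  # (0.5, 1.0)

# "Maximize recall s.t. precision > tau" prefers B for any tau <= 0.5,
# while plain 0-1 error cannot tell A and B apart.
```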
Problem Two: Guarantees
Prior work on entity matching gives no guarantees on recall/precision, and the methods that do have high time and label complexity.
Can we adapt prior work on AL to the new objective (maximize recall such that precision > τ) with sub-linear label complexity and efficient time complexity?
Overview of Our Approach
Recall optimization with a precision constraint reduces, via convex-hull search in a relaxed Lagrangian, to active learning with weighted 0-1 error, which in turn reduces, via rejection sampling, to active learning with plain 0-1 error.
The first reduction is covered in this talk; the second is in the paper.
Objective
Given: hypothesis class H and a threshold τ in [0,1].
Objective: find h in H that maximizes recall(h) such that precision(h) >= τ.
Equivalently: maximize -falseneg(h) such that ε·truepos(h) - falsepos(h) >= 0, where ε = (1-τ)/τ.
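The equivalence is a one-line rearrangement of the definition of precision (writing tp, fp, fn for true/false positives and false negatives, and P = tp + fn for the fixed number of true matches):

```latex
\mathrm{precision}(h) \ge \tau
\iff \frac{tp(h)}{tp(h) + fp(h)} \ge \tau
\iff (1-\tau)\,tp(h) - \tau\,fp(h) \ge 0
\iff \varepsilon\,tp(h) - fp(h) \ge 0,
\quad \text{where } \varepsilon = \frac{1-\tau}{\tau}.
```

Likewise, recall(h) = tp(h)/P = 1 - fn(h)/P with P fixed, so maximizing recall is exactly maximizing -falseneg(h).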
Unconstrained Objective
Current formulation: maximize X(h) = -falseneg(h) such that Y(h) = ε·truepos(h) - falsepos(h) >= 0.
Introducing a Lagrange multiplier λ turns this into "maximize X(h) + λ·Y(h)", which can be rewritten as the weighted 0-1 objective: minimize δ·falseneg(h) + (1-δ)·falsepos(h).
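One way to see the rewrite, a sketch using tp = P - fn with P the fixed number of true matches (the paper's exact normalization may differ):

```latex
X(h) + \lambda Y(h)
  = -fn + \lambda\bigl(\varepsilon\,(P - fn) - fp\bigr)
  = \lambda \varepsilon P \;-\; (1 + \lambda\varepsilon)\,fn \;-\; \lambda\,fp .
```

Since λεP is a constant, maximizing this is the same as minimizing (1 + λε)·fn + λ·fp; dividing through by 1 + λε + λ gives the weighted 0-1 form with δ = (1 + λε) / (1 + λε + λ).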
Convex Hull of Classifiers
Plot each classifier h as the point (Y(h), X(h)); we want to maximize X(h) such that Y(h) >= 0, i.e., the highest point on or to the right of the line Y = 0.
The convex hull is the shape formed by joining the classifiers that strictly dominate the others; exponentially many classifiers can lie inside it.
Convex Hull of Classifiers (contd.)
For any λ > 0, there is a vertex or edge of the hull with the largest value of X + λ·Y: if λ = -1/slope of an edge, we get a classifier on that edge; otherwise we get a vertex classifier.
So plugging λ into the weighted objective yields the classifier h with the highest X(h) + λ·Y(h).
Convex Hull of Classifiers (contd.)
In the worst case, though, the vertex returned for a given λ can be far from the one we want.
Naive strategy: try all λ (equivalently, try all slopes). Too slow!
Instead, do binary search for λ. Problem: when to stop? (1) Bounds on λ, and (2) discretization of λ. Details in the paper!
Algorithm I (Ours Weighted)
Given: an AL black box C for weighted 0-1 error. Goal: the precision-constrained objective.
Range of λ: [Λmin, Λmax]. Don't enumerate all candidate λ (too expensive: O(n^3)); instead, discretize using a factor θ (see paper!) and binary search over the discretized values, as sketched below.
Same complexity as binary search: O(log n).
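A minimal sketch of that search, treating the weighted-0-1 black box as a function weighted_al(delta) that returns a trained classifier; the grid bounds, the value of eps, and the monotonicity assumption (larger λ makes the constraint easier to satisfy) are illustrative stand-ins for the paper's Λmin, Λmax, and θ analysis:

```python
import math

def lagrangian_binary_search(weighted_al, Y, lam_min=1e-3, lam_max=1e3,
                             theta=1.1, eps=1.0):
    """Binary search over a geometrically discretized Lagrange multiplier.

    weighted_al(delta) -> classifier minimizing delta*falseneg + (1-delta)*falsepos.
    Y(h) -> empirical eps*truepos(h) - falsepos(h); feasible means Y(h) >= 0.
    eps stands in for (1 - tau) / tau.
    """
    n_steps = int(math.log(lam_max / lam_min, theta))  # grid: lam_min * theta**i
    lo, hi = 0, n_steps
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        lam = lam_min * theta ** mid
        # Larger lambda puts more weight on the precision side of the objective.
        delta = (1 + lam * eps) / (1 + lam * eps + lam)
        h = weighted_al(delta)
        if Y(h) >= 0:
            best = h       # feasible: try a smaller lambda for better recall
            hi = mid - 1
        else:
            lo = mid + 1   # infeasible: increase lambda
    return best
```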
Algorithm II (Weighted 0-1)
Given: an AL black box B for 0-1 error. Goal: an AL black box C for weighted 0-1 error.
Use the trick from supervised learning [Zadrozny03]: reduce the cost-sensitive objective to a binary one by rejection sampling, as sketched below.
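A sketch of that costing-style reduction [Zadrozny03]: each labeled example is kept with probability proportional to its misclassification cost, and the plain 0-1 learner B is trained on whatever survives (the cost assignment below mirrors the weighted objective above):

```python
import random

def rejection_sample(examples, delta, rng=random):
    """Reduce weighted 0-1 learning to plain 0-1 learning.

    examples: iterable of (features, label) pairs with label in {0, 1}.
    delta: cost of a false negative; (1 - delta) is the cost of a false positive.
    """
    z = max(delta, 1 - delta)  # normalizer so acceptance probabilities are <= 1
    accepted = []
    for x, y in examples:
        cost = delta if y == 1 else 1 - delta
        if rng.random() < cost / z:  # keep with probability cost / z
            accepted.append((x, y))
    return accepted  # train the 0-1 AL black box B on these
```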
Overview of Our Approach (recap)
Recall optimization with a precision constraint reduces (convex-hull search in a relaxed Lagrangian; this talk) to weighted 0-1 error, which reduces (rejection sampling; paper) to active learning with 0-1 error.
Overall: Labels = O(log^2 n)·L(B) and Time = O(log^2 n)·T(B), where L(B) and T(B) are the label and time complexity of the 0-1 AL black box B.
Experiments
Four real-world data sets, with all labels known, so active learning can be simulated.
Two approaches for AL with a precision constraint:
- Ours, with Vowpal Wabbit as the 0-1 AL black box.
- Monotone [Arasu11], which assumes monotonicity of the similarity features and has high computational and label complexity.

Data Set                   | Size    | Ratio (+/-) | Features
Y! Local Businesses        | 3,958   | 0.11        | 55
UCI Person Linkage         | 574,913 | 0.004       | 9
DBLP-ACM Bibliography      | 494,437 | 0.005       | 7
Scholar-DBLP Bibliography  | 589,326 | 0.009       | 7
Results I (Runtime vs. #Features)
Figure: computational complexity on the UCI Person data set as the number of features grows.
Results II (Quality & #Label Queries)
Figures: quality versus number of label queries on the Business and Person data sets.
Results II (contd.)
Figures: quality versus number of label queries on the DBLP-ACM and Scholar data sets.
Results III (0-1 Active Learning)
Figure: how often plain 0-1 active learning satisfies the precision constraint.
Conclusion
Active learning for entity matching that can use any 0-1 AL algorithm as a black box.
Great real-world performance: computationally efficient (600k examples in 25 seconds), label-efficient, and better F-1 on four real-world tasks.
Guarantees on the precision of the matcher and on time and label complexity.