Active Sampling for Entity Matching
Aditya Parameswaran, Stanford University
Joint work with: Kedar Bellare, Suresh Iyengar, and Vibhor Rastogi (Yahoo! Research)
Entity Matching
Goal: find duplicate entities in a given data set. This is a fundamental data-cleaning primitive with decades of prior work, and it is especially important at Yahoo! (and other web companies).
Example of a duplicate pair: "Homma's Brown Rice Sushi, California Avenue, Palo Alto" vs. "Homma's Sushi, Cal Ave, Palo Alto".
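As a rough illustration of what a matcher sees, here is a minimal sketch that turns a candidate pair like the one above into similarity features; the feature functions and record schema are illustrative assumptions, not the feature set actually used in this work:

```python
from difflib import SequenceMatcher

def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def pair_features(r1, r2):
    """Similarity features for one candidate pair (illustrative only)."""
    return {
        "name_edit_sim": SequenceMatcher(None, r1["name"], r2["name"]).ratio(),
        "name_jaccard": jaccard(r1["name"], r2["name"]),
        "addr_jaccard": jaccard(r1["address"], r2["address"]),
    }

r1 = {"name": "Homma's Brown Rice Sushi", "address": "California Avenue Palo Alto"}
r2 = {"name": "Homma's Sushi", "address": "Cal Ave Palo Alto"}
print(pair_features(r1, r2))  # high name/address similarity suggests a duplicate
```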
Why is it important?
Websites, databases, and content providers (e.g., Yelp, Zagat, Foursquare) feed in dirty entities; finding duplicates turns them into deduplicated entities.
Applications: Business Listings in Y! Local, Celebrities in Y! Movies, Events in Y! Upcoming, ...
How?
Reformulated goal: construct a high-quality classifier that identifies duplicate entity pairs.
Problem: how do we select the training data?
Answer: active learning with human experts!
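A minimal sketch of the selection loop this implies, assuming a generic probabilistic classifier and a human-labeler stub; the names and the uncertainty criterion are illustrative, not the system's actual API:

```python
def active_learning_loop(model, unlabeled, ask_human, budget):
    """Pool-based active learning for entity matching.

    model: anything with fit(labeled_pairs) and predict_proba(pair) -> P(match).
    ask_human(pair) -> 0/1 label from a human expert.
    """
    labeled = []
    for _ in range(budget):
        # Query the pair the current model is least sure about.
        pair = min(unlabeled, key=lambda p: abs(model.predict_proba(p) - 0.5))
        unlabeled.remove(pair)
        labeled.append((pair, ask_human(pair)))
        model.fit(labeled)  # retrain on all labels gathered so far
    return model
```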
Reformulated Workflow
Websites, databases, and content providers supply dirty entities; our technique produces the deduplicated entities.
Active Learning (AL) Primer
Properties of an AL algorithm: label complexity, time complexity, consistency.
Prior work: uncertainty sampling, query by committee, ..., importance-weighted active learning (IWAL), and online IWAL without constraints, implemented in Vowpal Wabbit (VW).
The IWAL line of work targets the 0-1 metric, is time- and label-efficient, is provably consistent, and works even under noisy settings.
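A minimal sketch of the importance-weighting idea behind IWAL, assuming a margin-based query probability; the real IWAL rejection threshold (and VW's implementation) is more refined than this:

```python
import random

def iwal_step(model, x, ask_human, p_min=0.1):
    """One IWAL-style step: query the label with a probability that grows
    with the model's uncertainty, and weight queried examples by 1/p so
    the resulting loss estimates stay unbiased."""
    margin = abs(model.predict_proba(x) - 0.5)  # 0 means maximally uncertain
    p = max(p_min, 1.0 - 2.0 * margin)          # query probability in [p_min, 1]
    if random.random() < p:
        y = ask_human(x)
        return (x, y, 1.0 / p)  # importance weight corrects the sampling bias
    return None                 # label not requested for this example
```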
Problem One: Imbalanced Data
Non-matches typically outnumber matches 100:1, even after blocking. A degenerate classifier that labels every pair a non-match gets 0-1 error ≈ 0 (and vacuously 100% precision) while identifying no matches at all, so plain 0-1 error is the wrong objective.
Solution, the metric from [Arasu11]: maximize recall (the % of correct matches identified) such that precision (the % of predicted matches that are correct) > τ.
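A toy illustration, with made-up counts, of why 0-1 error fails under 100:1 imbalance and what the constrained metric measures instead:

```python
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 1.0  # vacuous when nothing is predicted
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# 101 pairs at 100:1 imbalance: 1 true match, 100 non-matches.
# Classifier A predicts "non-match" everywhere:
#   one mistake out of 101, so 0-1 error ~ 0.01, yet it finds nothing.
print(precision_recall(tp=0, fp=0, fn=1))  # (1.0, 0.0)

# Classifier B finds the match plus one false alarm:
#   also one mistake out of 101, but recall = 1.0 at precision = 0.5.
print(precision_recall(tp=1, fp=1, fn=0))  # (0.5, 1.0)

# "Maximize recall s.t. precision > tau" prefers B for any tau <= 0.5,
# while plain 0-1 error cannot tell A and B apart.
```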
Problem Two: Guarantees
Prior work on entity matching gives no guarantees on recall/precision, and the methods that do have high time and label complexity.
Can we adapt prior work on AL to the new objective (maximize recall such that precision > τ) with sub-linear label complexity and efficient time complexity?
Overview of Our Approach
Recall optimization with a precision constraint reduces, via convex-hull search in a relaxed Lagrangian, to active learning with weighted 0-1 error, which in turn reduces, via rejection sampling, to active learning with plain 0-1 error.
The first reduction is covered in this talk; the second is in the paper.
Objective
Given: hypothesis class H and a threshold τ in [0,1].
Objective: find h in H that maximizes recall(h) such that precision(h) >= τ.
Equivalently: maximize -falseneg(h) such that ε·truepos(h) - falsepos(h) >= 0, where ε = (1-τ)/τ.
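The equivalence is a one-line rearrangement of the definition of precision (writing tp, fp, fn for true/false positives and false negatives, and P = tp + fn for the fixed number of true matches):

```latex
\mathrm{precision}(h) \ge \tau
\iff \frac{tp(h)}{tp(h) + fp(h)} \ge \tau
\iff (1-\tau)\,tp(h) - \tau\,fp(h) \ge 0
\iff \varepsilon\,tp(h) - fp(h) \ge 0,
\quad \text{where } \varepsilon = \frac{1-\tau}{\tau}.
```

Likewise, recall(h) = tp(h)/P = 1 - fn(h)/P with P fixed, so maximizing recall is exactly maximizing -falseneg(h).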
Unconstrained Objective
Current formulation: maximize X(h) = -falseneg(h) such that Y(h) = ε·truepos(h) - falsepos(h) >= 0.
Introducing a Lagrange multiplier λ turns this into "maximize X(h) + λ·Y(h)", which can be rewritten as the weighted 0-1 objective: minimize δ·falseneg(h) + (1-δ)·falsepos(h).
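One way to see the rewrite, a sketch using tp = P - fn with P the fixed number of true matches (the paper's exact normalization may differ):

```latex
X(h) + \lambda Y(h)
  = -fn + \lambda\bigl(\varepsilon\,(P - fn) - fp\bigr)
  = \lambda \varepsilon P \;-\; (1 + \lambda\varepsilon)\,fn \;-\; \lambda\,fp .
```

Since λεP is a constant, maximizing this is the same as minimizing (1 + λε)·fn + λ·fp; dividing through by 1 + λε + λ gives the weighted 0-1 form with δ = (1 + λε) / (1 + λε + λ).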
Convex Hull of Classifiers
Plot each classifier h as the point (Y(h), X(h)); we want to maximize X(h) such that Y(h) >= 0, i.e., the highest point on or to the right of the line Y = 0.
The convex hull is the shape formed by joining the classifiers that strictly dominate the others; exponentially many classifiers can lie inside it.
Convex Hull of Classifiers (contd.)
For any λ > 0, there is a vertex or edge of the hull with the largest value of X + λ·Y: if λ = -1/slope of an edge, we get a classifier on that edge; otherwise we get a vertex classifier.
So plugging λ into the weighted objective yields the classifier h with the highest X(h) + λ·Y(h).
Convex Hull of Classifiers (contd.)
In the worst case, though, the vertex returned for a given λ can be far from the one we want.
Naive strategy: try all λ (equivalently, try all slopes). Too slow!
Instead, do binary search for λ. Problem: when to stop? (1) Bounds on λ, and (2) discretization of λ. Details in the paper!
Algorithm I (Ours Weighted)
Given: an AL black box C for weighted 0-1 error. Goal: the precision-constrained objective.
Range of λ: [Λmin, Λmax]. Don't enumerate all candidate λ (too expensive: O(n^3)); instead, discretize using a factor θ (see paper!) and binary search over the discretized values, as sketched below.
Same complexity as binary search: O(log n).
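A minimal sketch of that search, treating the weighted-0-1 black box as a function weighted_al(delta) that returns a trained classifier; the grid bounds, the value of eps, and the monotonicity assumption (larger λ makes the constraint easier to satisfy) are illustrative stand-ins for the paper's Λmin, Λmax, and θ analysis:

```python
import math

def lagrangian_binary_search(weighted_al, Y, lam_min=1e-3, lam_max=1e3,
                             theta=1.1, eps=1.0):
    """Binary search over a geometrically discretized Lagrange multiplier.

    weighted_al(delta) -> classifier minimizing delta*falseneg + (1-delta)*falsepos.
    Y(h) -> empirical eps*truepos(h) - falsepos(h); feasible means Y(h) >= 0.
    eps stands in for (1 - tau) / tau.
    """
    n_steps = int(math.log(lam_max / lam_min, theta))  # grid: lam_min * theta**i
    lo, hi = 0, n_steps
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        lam = lam_min * theta ** mid
        # Larger lambda puts more weight on the precision side of the objective.
        delta = (1 + lam * eps) / (1 + lam * eps + lam)
        h = weighted_al(delta)
        if Y(h) >= 0:
            best = h       # feasible: try a smaller lambda for better recall
            hi = mid - 1
        else:
            lo = mid + 1   # infeasible: increase lambda
    return best
```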
Algorithm II (Weighted 0-1)
Given: an AL black box B for 0-1 error. Goal: an AL black box C for weighted 0-1 error.
Use the trick from supervised learning [Zadrozny03]: reduce the cost-sensitive objective to a binary one by rejection sampling, as sketched below.
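A sketch of that costing-style reduction [Zadrozny03]: each labeled example is kept with probability proportional to its misclassification cost, and the plain 0-1 learner B is trained on whatever survives (the cost assignment below mirrors the weighted objective above):

```python
import random

def rejection_sample(examples, delta, rng=random):
    """Reduce weighted 0-1 learning to plain 0-1 learning.

    examples: iterable of (features, label) pairs with label in {0, 1}.
    delta: cost of a false negative; (1 - delta) is the cost of a false positive.
    """
    z = max(delta, 1 - delta)  # normalizer so acceptance probabilities are <= 1
    accepted = []
    for x, y in examples:
        cost = delta if y == 1 else 1 - delta
        if rng.random() < cost / z:  # keep with probability cost / z
            accepted.append((x, y))
    return accepted  # train the 0-1 AL black box B on these
```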
Overview of Our Approach (recap)
Recall optimization with a precision constraint reduces (convex-hull search in a relaxed Lagrangian; this talk) to weighted 0-1 error, which reduces (rejection sampling; paper) to active learning with 0-1 error.
Overall: Labels = O(log^2 n)·L(B) and Time = O(log^2 n)·T(B), where L(B) and T(B) are the label and time complexity of the 0-1 AL black box B.
Experiments
Four real-world data sets, with all labels known, so active learning can be simulated.
Two approaches for AL with a precision constraint:
- Ours, with Vowpal Wabbit as the 0-1 AL black box.
- Monotone [Arasu11], which assumes monotonicity of the similarity features and has high computational and label complexity.

Data Set                   | Size    | Ratio (+/-) | Features
Y! Local Businesses        | 3,958   | 0.11        | 55
UCI Person Linkage         | 574,913 | 0.004       | 9
DBLP-ACM Bibliography      | 494,437 | 0.005       | 7
Scholar-DBLP Bibliography  | 589,326 | 0.009       | 7
Results I (Runtime vs. #Features)
Figure: computational complexity on the UCI Person data set as the number of features grows.
Results II (Quality & #Label Queries)
Figures: quality versus number of label queries on the Business and Person data sets.
Results II (contd.)
Figures: quality versus number of label queries on the DBLP-ACM and Scholar data sets.
Results III (0-1 Active Learning)
Figure: how often plain 0-1 active learning satisfies the precision constraint.
Conclusion
Active learning for entity matching that can use any 0-1 AL algorithm as a black box.
Great real-world performance: computationally efficient (600k examples in 25 seconds), label-efficient, and better F-1 on four real-world tasks.
Guarantees on the precision of the matcher and on time and label complexity.