Active Sampling for Entity Matching Aditya Parameswaran Stanford University Jointly with: Kedar Bellare, Suresh Iyengar, Vibhor Rastogi Yahoo! Research.


1 Active Sampling for Entity Matching Aditya Parameswaran Stanford University Jointly with: Kedar Bellare, Suresh Iyengar, Vibhor Rastogi Yahoo! Research

2 Entity Matching
- Goal: find duplicate entities in a given data set
- Fundamental data cleaning primitive → decades of prior work
- Especially important at Yahoo! (and other web companies)
- Example duplicates: "Homma’s Brown Rice Sushi, California Avenue, Palo Alto" and "Homma’s Sushi, Cal Ave, Palo Alto"

3 Why is it important?
- Pipeline: Websites, Databases, Content Providers → Dirty Entities → ??? (Find Duplicates) → Deduplicated Entities
- Example sources shown on the slide: Yelp, Zagat, Foursq
- Applications:
  - Business Listings in Y! Local
  - Celebrities in Y! Movies
  - Events in Y! Upcoming
  - …

4 How?
- Reformulated goal: construct a high-quality classifier identifying duplicate entity pairs
- Problem: how do we select training data?
- Answer: active learning with human experts!

5 Reformulated Workflow
- Websites, Databases, Content Providers → Dirty Entities → Our Technique → Deduplicated Entities

6 Active Learning (AL) Primer
- Properties of an AL algorithm:
  - Label complexity
  - Time complexity
  - Consistency
- Prior work:
  - Uncertainty Sampling
  - Query by Committee
  - …
  - Importance Weighted Active Learning (IWAL)
  - Online IWAL without Constraints, implemented in Vowpal Wabbit (VW)
- Bracketed together on the slide: 0-1 metric; time and label efficient; provably consistent; work even under noisy settings

7 Problem One: Imbalanced Data
- Typical to have a 100:1 ratio of non-matches to matches, even after blocking
- Under 0-1 error, the trivial solution "all non-matches" has 0-1 error ≈ 0 (the slide also lists its precision as 100%)
- Solution: metric from [Arasu11]:
  - Maximize recall (correctly identified matches)
  - Such that precision (% of correct matches) > τ
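The imbalance problem can be made concrete with a few lines of Python. This is a minimal sketch on made-up data (10 matches, 1000 non-matches, an assumed stand-in for the 100:1 ratio); the helper names are hypothetical and not from the talk.

```python
def zero_one_error(y_true, y_pred):
    """Fraction of examples whose predicted label differs from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true matches (label 1) that are predicted as matches."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0


# Synthetic pool: 100 non-matches for every match (illustrative data only).
y_true = [1] * 10 + [0] * 1000

# The trivial classifier that labels every pair a non-match.
all_non_match = [0] * len(y_true)

err = zero_one_error(y_true, all_non_match)
rec = recall(y_true, all_non_match)
print(f"0-1 error = {err:.4f}, recall = {rec:.1f}")
```

The trivial classifier looks nearly perfect under 0-1 error while finding no duplicates at all, which is why the objective is restated in terms of recall and precision.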

8 Problem Two: Guarantees
- Prior work on entity matching:
  - No guarantees on recall/precision
  - Where guarantees exist, they come with high time + label complexity
- Can we adapt prior work on AL to the new objective?
  - Maximize recall, such that precision > τ
- With:
  - Sub-linear label complexity
  - Efficient time complexity

9 Overview of Our Approach
- Recall Optimization with Precision Constraint → Weighted 0-1 Error → Active Learning with 0-1 Error
- Reduction 1: Convex-hull Search in Relaxed Lagrangian
- Reduction 2: Rejection Sampling
- Slide labels: "This talk", "Paper"

10 Objective
- Given:
  - Hypothesis class H
  - Threshold τ in [0,1]
- Objective: find h in H that
  - Maximizes recall(h)
  - Such that precision(h) >= τ
- Equivalently:
  - Maximize -falseneg(h)
  - Such that ε truepos(h) - falsepos(h) >= 0
  - Where ε = (1-τ)/τ, obtained by rearranging truepos(h) / (truepos(h) + falsepos(h)) >= τ
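The step from the precision constraint to a linear constraint on counts can be sanity-checked numerically. The sketch below clears the denominator of precision(h) >= τ to get (1 - τ)·truepos - τ·falsepos >= 0 and confirms that both tests agree on a grid of counts. Exact fractions avoid rounding at the boundary. The function names and the value τ = 9/10 are assumptions for illustration.

```python
from fractions import Fraction


def precision(tp, fp):
    """Exact precision tp / (tp + fp); callers must ensure tp + fp > 0."""
    return Fraction(tp, tp + fp)


def linear_constraint(tp, fp, tau):
    """Precision constraint with the denominator cleared: (1 - tau)*tp - tau*fp >= 0."""
    return (1 - tau) * tp - tau * fp >= 0


tau = Fraction(9, 10)  # illustrative threshold, not a value from the talk

# Collect every (tp, fp) pair on which the two formulations disagree.
mismatches = [
    (tp, fp)
    for tp in range(30)
    for fp in range(30)
    if tp + fp > 0
    and (precision(tp, fp) >= tau) != linear_constraint(tp, fp, tau)
]
print("disagreements:", mismatches)
```

Because the number of true matches in a fixed pool is constant, maximizing recall is the same as minimizing false negatives, so the whole objective can be written with counts only.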

11 Unconstrained Objective
- Current formulation:
  - Maximize X(h) = -falseneg(h)
  - Such that Y(h) = ε truepos(h) - falsepos(h) >= 0
- If we introduce a Lagrange multiplier λ:
  - Maximize X(h) + λ Y(h), which can be rewritten as:
  - Minimize δ falseneg(h) + (1 - δ) falsepos(h)
- This is a weighted 0-1 objective
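The rewrite from the Lagrangian to the weighted objective can be checked with a short derivation. This is a sketch under assumptions not stated on the slide: X(h) = -falseneg(h) and Y(h) = ε truepos(h) - falsepos(h) are counts over a fixed pool, and P denotes the constant number of true matches, so truepos(h) = P - falseneg(h). The expression for δ is one normalization that makes the two objectives agree, and the paper may define δ differently.

```latex
\begin{align*}
X(h) + \lambda Y(h)
  &= -\mathrm{fn}(h) + \lambda\bigl(\varepsilon\,\mathrm{tp}(h) - \mathrm{fp}(h)\bigr) \\
  &= -\mathrm{fn}(h) + \lambda\varepsilon\bigl(P - \mathrm{fn}(h)\bigr) - \lambda\,\mathrm{fp}(h) \\
  &= \lambda\varepsilon P - \bigl[(1 + \lambda\varepsilon)\,\mathrm{fn}(h) + \lambda\,\mathrm{fp}(h)\bigr].
\end{align*}
% The first term does not depend on h, so maximizing the left-hand side
% is the same as minimizing the bracket. Dividing the bracket by the
% positive constant 1 + \lambda + \lambda\varepsilon gives weights that sum to 1:
\[
\delta = \frac{1 + \lambda\varepsilon}{1 + \lambda + \lambda\varepsilon},
\qquad
1 - \delta = \frac{\lambda}{1 + \lambda + \lambda\varepsilon},
\qquad
\min_{h}\; \delta\,\mathrm{fn}(h) + (1 - \delta)\,\mathrm{fp}(h).
\]
```

Under this normalization, λ near 0 puts almost all the weight on false negatives (pure recall), and a large λ shifts weight toward false positives (precision).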

12 Convex Hull of Classifiers
- Objective: maximize X(h) such that Y(h) >= 0
- Diagram: classifiers plotted as points against axes X(h) and Y(h); the label "We want a classifier here" marks the target
- The hull is the convex shape formed by joining classifiers strictly dominating others
- Can have an exponential number of points inside

13 Convex Hull of Classifiers
- Objective: maximize X(h) such that Y(h) >= 0
- For any λ > 0, there is a point / line with the largest value of X + λ Y
- If λ = -1/slope of a line, we get a classifier on the line; otherwise we get a vertex classifier
- Plug λ into the weighted objective to get the classifier h with the highest X(h) + λ Y(h)
- Diagram labels: u, v, u-v

14 Convex Hull of Classifiers
- Objective: maximize X(h) such that Y(h) >= 0
- Worst case, we get this point (marked on the diagram)
- Naïve strategy: try all λ (equivalently, try all slopes). Too long!
- Instead, do binary search for λ
- Problem: when to stop?
  1) Bounds
  2) Discretization of λ
- Details in paper!

15 Algorithm I (Ours → Weighted)
- Given: AL black box C for weighted 0-1 error
- Goal: precision-constrained objective
- Range of λ: [Λ_min, Λ_max]
  - Don’t enumerate all candidate λ → too expensive; O(n^3)
  - Instead, discretize using factor θ → see paper!
- Binary search over discretized values
- Same complexity as binary search
  - O(log n)
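The binary search over discretized λ can be sketched as follows. Several pieces are assumptions made only for illustration. The black box here is an exhaustive search over threshold classifiers on a fully labeled toy pool, not an active learner. The grid is geometric with ratio theta between lam_min and lam_max, whereas the talk defers the actual discretization and bounds to the paper. The mapping from λ to δ is one normalization consistent with the weighted objective.

```python
def counts(threshold, scores, labels):
    """Return (tp, fp, fn) for the rule 'predict match if score >= threshold'."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted_match = score >= threshold
        if predicted_match and label == 1:
            tp += 1
        elif predicted_match:
            fp += 1
        elif label == 1:
            fn += 1
    return tp, fp, fn


def weighted_black_box(delta, thresholds, scores, labels):
    """Hypothetical stand-in for black box C: minimize delta*fn + (1-delta)*fp."""
    def cost(threshold):
        _, fp, fn = counts(threshold, scores, labels)
        return delta * fn + (1 - delta) * fp
    return min(thresholds, key=cost)


def search_lambda(tau, thresholds, scores, labels,
                  lam_min=1e-3, lam_max=1e3, theta=2.0):
    """Binary-search a geometric lambda grid for the smallest feasible lambda."""
    eps = (1 - tau) / tau  # eps*tp - fp >= 0 is equivalent to precision >= tau
    grid, lam = [], lam_min
    while lam <= lam_max:
        grid.append(lam)
        lam *= theta
    lo, hi, best = 0, len(grid) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        lam = grid[mid]
        delta = (1 + lam * eps) / (1 + lam + lam * eps)
        h = weighted_black_box(delta, thresholds, scores, labels)
        tp, fp, _ = counts(h, scores, labels)
        if tp >= tau * (tp + fp):  # precision constraint holds: try a smaller lambda
            best, hi = h, mid - 1
        else:                      # constraint violated: need a larger lambda
            lo = mid + 1
    return best


# Toy pool of similarity scores with known labels (illustrative data only).
scores = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
thresholds = [1.0] + scores  # 1.0 means 'predict no matches'

best_threshold = search_lambda(0.75, thresholds, scores, labels)
print(best_threshold, counts(best_threshold, scores, labels))
```

Binary search is valid here because the constraint value Y of the returned classifier cannot decrease as λ grows. The search only ever returns classifiers on the convex hull, so in general it may miss a better feasible classifier lying inside the hull.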

16 Algorithm II (Weighted → 0-1)
- Given: AL black box B for 0-1 error
- Goal: AL black box C for weighted 0-1 error
- Use trick from supervised learning [Zadrozny03]
  - Cost-sensitive objective → binary
  - Reduction by rejection sampling
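The rejection-sampling idea can be illustrated in a plain supervised setting. Each example is kept with probability proportional to its misclassification cost, so ordinary 0-1 error on the kept sample is proportional, in expectation, to the weighted error on the original pool. This is a minimal sketch with hypothetical names and made-up data. The costs δ for matches and 1 - δ for non-matches are an assumption taken from the weighted objective, and the talk's version wraps an active learner rather than a fixed pool.

```python
import random


def acceptance_probability(label, delta):
    """Keep probability: the example's cost divided by the largest cost."""
    cost = delta if label == 1 else 1.0 - delta
    return cost / max(delta, 1.0 - delta)


def rejection_sample(examples, delta, rng):
    """Return the (x, y) pairs that survive cost-proportional rejection sampling."""
    return [
        (x, y)
        for x, y in examples
        if rng.random() < acceptance_probability(y, delta)
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Illustrative pool: 100 matches (label 1) and 1000 non-matches (label 0).
pool = [(i, 1) for i in range(100)] + [(i, 0) for i in range(100, 1100)]

sample = rejection_sample(pool, delta=0.8, rng=rng)
kept_pos = sum(1 for _, y in sample if y == 1)
kept_neg = sum(1 for _, y in sample if y == 0)
print(kept_pos, kept_neg)
```

With δ = 0.8 every match is kept and roughly a quarter of the non-matches survive, so a learner for plain 0-1 error trained on the sample implicitly weighs a missed match four times as heavily as a false match.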

17 Overview of Our Approach
- Recall Optimization with Precision Constraint → Weighted 0-1 Error → Active Learning with 0-1 Error
- Reduction 1: Convex-hull Search in Relaxed Lagrangian
- Reduction 2: Rejection Sampling
- Complexity annotations: O(log n); Labels = O(log^2 n) L(B); Time = O(log^2 n) T(B)
- Slide labels: "This talk", "Paper"

18 Experiments
- Four real-world data sets
- All labels known
  - Simulate active learning
- Two approaches for AL with precision constraint:
  - Ours, with Vowpal Wabbit as 0-1 AL black box
  - Monotone [Arasu11]: assumes monotonicity of similarity features; high computational + label complexity

Data Set                    Size     Ratio (+/-)  Features
Y! Local Businesses         3958     0.115        5
UCI Person Linkage          574913   0.004        9
DBLP-ACM Bibliography       494437   0.005        7
Scholar-DBLP Bibliography   589326   0.009        7
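Simulating active learning on a fully labeled data set usually means hiding the labels behind an oracle and counting how many distinct labels the learner asks for. The class below is a hypothetical sketch of that bookkeeping, not the harness used in the experiments.

```python
class LabelOracle:
    """Reveals known labels on request and counts distinct label queries."""

    def __init__(self, labels):
        self._labels = list(labels)
        self._queried = set()

    def query(self, index):
        """Return the label of example `index`, recording that it was requested."""
        self._queried.add(index)
        return self._labels[index]

    @property
    def num_queries(self):
        """Number of distinct examples whose label has been revealed."""
        return len(self._queried)


oracle = LabelOracle([1, 0, 0, 1])
first = oracle.query(0)
again = oracle.query(0)  # repeated query of the same example is not counted twice
last = oracle.query(3)
print(first, again, last, oracle.num_queries)
```

Counting distinct queries makes the label-complexity curves comparable across methods, since a method is charged once per labeled pair.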

19 Results I (Runtime with #Features)
- Computational complexity on UCI Person

20 Results II (Quality & #Label Queries)
- Panels: Business, Person

21 Results II (Contd.)
- Panels: DBLP-ACM, Scholar

22 Results III (0-1 Active Learning)
- Precision constraint satisfaction % of 0-1 AL

23 Conclusion
- Active learning for entity matching
- Can use any 0-1 AL as black box
- Great real-world performance:
  - Computationally efficient (600k examples in 25 seconds)
  - Label efficient and better F-1 on four real-world tasks
- Guaranteed:
  - Precision of matcher
  - Time and label complexity

