Active Sampling for Entity Matching
Aditya Parameswaran, Stanford University
Jointly with: Kedar Bellare, Suresh Iyengar, Vibhor Rastogi (Yahoo! Research)

Entity Matching
Goal: Find duplicate entities in a given data set. This is a fundamental data cleaning primitive with decades of prior work, and it is especially important at Yahoo! (and other web companies).
Example duplicate pair: "Homma's Brown Rice Sushi, California Avenue, Palo Alto" and "Homma's Sushi, Cal Ave, Palo Alto".

Why is it important?
Dirty entities flow in from websites, databases, and content providers (e.g., Yelp, Zagat, Foursquare) and must be deduplicated before use.
Applications: Business Listings in Y! Local, Celebrities in Y! Movies, Events in Y! Upcoming, ...

How?
Reformulated Goal: Construct a high-quality classifier that identifies duplicate entity pairs.
Problem: How do we select training data?
Answer: Active learning with human experts!
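To make the reformulation concrete: the classifier scores feature vectors computed over entity pairs. Below is a hypothetical sketch (not from the talk; the function and variable names are illustrative) of one such pairwise feature, applied to the example pair from the earlier slide:

```python
def jaccard(a, b):
    """Token-set Jaccard similarity, a typical entity-pair feature."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# The duplicate pair from the earlier slide, as a classifier would see it:
name_sim = jaccard("Homma's Brown Rice Sushi", "Homma's Sushi")         # 0.5
addr_sim = jaccard("California Avenue Palo Alto", "Cal Ave Palo Alto")  # ~0.33
```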

Reformulated Workflow
Websites, databases, and content providers produce dirty entities; our technique turns them into deduplicated entities.

Active Learning (AL) Primer
Properties of an AL algorithm:
- Label complexity
- Time complexity
- Consistency
Prior work:
- Uncertainty Sampling
- Query by Committee
- ...
- Importance Weighted Active Learning (IWAL)
- Online IWAL without constraints, implemented in Vowpal Wabbit (VW): uses the 0-1 metric, is time- and label-efficient, is provably consistent, and works even under noisy settings (see the sketch below)
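As a rough illustration of the IWAL idea, here is a minimal sketch; the names (model_a, model_b, oracle) are made up for exposition and this is not VW's implementation. A point is queried with probability that grows with hypothesis disagreement, and a queried label is stored with importance weight 1/p so the weighted sample stays unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)

def iwal_step(x, model_a, model_b, oracle, labeled, p_min=0.1):
    """One step of an IWAL-style loop: the query probability grows with the
    disagreement between two candidate hypotheses, and a queried label is
    kept with importance weight 1/p to keep the weighted sample unbiased."""
    disagreement = abs(model_a(x) - model_b(x))   # scores assumed in [0, 1]
    p = max(p_min, disagreement)                  # probability of querying x
    if rng.random() < p:
        y = oracle(x)                             # ask the human expert
        labeled.append((x, y, 1.0 / p))           # importance-weighted example
    return labeled
```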

Problem One: Imbalanced Data
- Typical to have a 100:1 ratio of non-matches to matches, even after blocking
- Plain 0-1 error is therefore misleading: predicting "non-match" for every pair yields 100% precision and 0-1 error ≈ 0, yet identifies 0% of the correct matches (recall = 0)
- Solution: metric from [Arasu11]: maximize recall, such that precision > τ
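A quick back-of-the-envelope check of that failure mode, with illustrative numbers assuming the 100:1 ratio above:

```python
n_neg, n_pos = 10_000, 100              # 100:1 imbalance after blocking
tp, fp, fn = 0, 0, n_pos                # predict "non-match" everywhere
error = (fp + fn) / (n_neg + n_pos)     # ~0.0099: near-perfect 0-1 error
recall = tp / n_pos                     # 0.0: the matcher is useless
```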

Problem Two: Guarantees
- Prior work on entity matching offers no guarantees on recall/precision, and even when it does, it has high time and label complexity
- Can we adapt prior work on AL to the new objective: maximize recall, such that precision > τ?
- With sub-linear label complexity and efficient time complexity?

Overview of Our Approach
Recall optimization with a precision constraint reduces to weighted 0-1 error (via convex-hull search in a relaxed Lagrangian; this talk), which in turn reduces to active learning with 0-1 error (via rejection sampling; in the paper).

Objective
Given: a hypothesis class H and a threshold τ in [0,1].
Objective: find h in H that maximizes recall(h), such that precision(h) ≥ τ.
Equivalently: maximize −falseneg(h), such that truepos(h) − ε·falsepos(h) ≥ 0, where ε = τ/(1−τ).
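For reference, the equivalence follows by clearing the denominator (here $P$ is the fixed number of true matching pairs, so $\mathrm{tp}(h) = P - \mathrm{fn}(h)$ and maximizing recall is the same as maximizing $-\mathrm{fn}(h)$):

$$
\frac{\mathrm{tp}(h)}{\mathrm{tp}(h)+\mathrm{fp}(h)} \ge \tau
\;\Longleftrightarrow\;
(1-\tau)\,\mathrm{tp}(h) \ge \tau\,\mathrm{fp}(h)
\;\Longleftrightarrow\;
\mathrm{tp}(h) - \frac{\tau}{1-\tau}\,\mathrm{fp}(h) \ge 0.
$$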

Unconstrained Objective
Current formulation: maximize X(h) = −falseneg(h), such that Y(h) = truepos(h) − ε·falsepos(h) ≥ 0.
Introducing a Lagrange multiplier λ ≥ 0, "maximize X(h) + λ·Y(h)" can be rewritten as "minimize δ·falseneg(h) + (1 − δ)·falsepos(h)": a weighted 0-1 objective.
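Spelled out, using $\mathrm{tp}(h) = P - \mathrm{fn}(h)$ for the fixed number $P$ of true matching pairs (the normalization constant below is my reconstruction from the slide's quantities):

$$
X(h) + \lambda Y(h) = -\mathrm{fn}(h) + \lambda\bigl(P - \mathrm{fn}(h) - \varepsilon\,\mathrm{fp}(h)\bigr)
= \lambda P - (1+\lambda)\,\mathrm{fn}(h) - \lambda\varepsilon\,\mathrm{fp}(h),
$$

so maximizing it is minimizing $(1+\lambda)\,\mathrm{fn}(h) + \lambda\varepsilon\,\mathrm{fp}(h)$, i.e., $\delta\,\mathrm{fn}(h) + (1-\delta)\,\mathrm{fp}(h)$ after normalizing with $\delta = (1+\lambda)/(1+\lambda+\lambda\varepsilon)$.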

Convex Hull of Classifiers
Plot each classifier h at the point (X(h), Y(h)); we want the classifier that maximizes X(h) such that Y(h) ≥ 0. The convex hull is the shape formed by joining the classifiers that strictly dominate the others, and it can have an exponential number of points inside it.

Convex Hull of Classifiers (contd.)
For any λ > 0, there is a vertex or an edge of the hull with the largest value of X + λ·Y: plug λ into the weighted objective and get the classifier h with the highest X(h) + λ·Y(h). If λ = −1/slope of an edge (u, v), we get a classifier on that edge; otherwise we get a vertex classifier.
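The slope condition is immediate: the weighted objective is constant along the edge between classifiers $u$ and $v$ exactly when

$$
\Delta X + \lambda\,\Delta Y = 0
\;\Longleftrightarrow\;
\lambda = -\frac{\Delta X}{\Delta Y} = -\frac{1}{\text{slope}},
$$

where the slope $\Delta Y / \Delta X$ is measured in the $(X, Y)$ plane of the figure.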

Convex Hull of Classifiers (contd.)
Naïve strategy: try all λ (equivalently, try all slopes). This takes too long, and in the worst case we still end up at a suboptimal vertex classifier (the point marked in the figure). Instead, binary search for λ.
Problem: when to stop? Handled via (1) bounds on λ and (2) discretization of λ. Details in the paper!

Algorithm I (Ours → Weighted)
- Given: an AL black box C for the weighted 0-1 error
- Goal: the precision-constrained objective
- Range of λ: [Λ_min, Λ_max]. Don't enumerate all candidate λ (too expensive: O(n³)); instead, discretize using a factor θ (see paper!)
- Binary search over the discretized values: same complexity as binary search, O(log n) (a sketch follows this list)
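A minimal sketch of this search with illustrative names (not the paper's code; the exact discretization and stopping rule are in the paper). Each probe converts λ into a cost ratio δ, calls the weighted 0-1 AL black box, and checks the precision constraint on labeled data:

```python
def precision_constrained_al(weighted_al, estimate_Y, eps,
                             lam_lo=0.0, lam_hi=1000.0, theta=1e-3):
    """Binary search over the Lagrange multiplier lambda. A feasible
    classifier (Y(h) >= 0, i.e. precision >= tau) lets us lower lambda to
    chase recall; an infeasible one forces lambda up. Takes
    O(log((lam_hi - lam_lo) / theta)) probes of the black box."""
    best_h = None
    while lam_hi - lam_lo > theta:
        lam = 0.5 * (lam_lo + lam_hi)
        delta = (1.0 + lam) / (1.0 + lam + lam * eps)  # weight on falseneg
        h = weighted_al(delta)            # train via the weighted AL black box
        if estimate_Y(h) >= 0.0:          # precision constraint satisfied
            best_h, lam_hi = h, lam       # try a smaller lambda (more recall)
        else:
            lam_lo = lam                  # constraint violated: raise lambda
    return best_h
```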

Algorithm II (Weighted → 0-1)
- Given: an AL black box B for the 0-1 error
- Goal: an AL black box C for the weighted 0-1 error
- Use a trick from supervised learning [Zadrozny03]: reduce the cost-sensitive objective to a binary one by rejection sampling (a sketch follows this list)
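A sketch of the Zadrozny-style rejection-sampling reduction (again illustrative, not the paper's code): keep each example with probability proportional to its misclassification cost, so a plain 0-1 learner on the accepted sample minimizes the weighted loss in expectation:

```python
import numpy as np

rng = np.random.default_rng(1)

def reject_sample(examples, delta):
    """Costs: delta for positives (missing one is a false negative) and
    1 - delta for negatives (missing one is a false positive). Accepting
    each example with probability cost/z makes an ordinary 0-1 learner
    on the kept sample minimize the weighted 0-1 loss in expectation."""
    z = max(delta, 1.0 - delta)                # normalizer keeps probs <= 1
    kept = []
    for x, y in examples:
        cost = delta if y == 1 else 1.0 - delta
        if rng.random() < cost / z:            # accept with prob cost / z
            kept.append((x, y))
    return kept                                # feed to any 0-1 AL black box
```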

Overview of Our Approach (recap)
Recall optimization with a precision constraint reduces to weighted 0-1 error via convex-hull search in a relaxed Lagrangian (this talk; O(log n) probes), which reduces to active learning with 0-1 error via rejection sampling (in the paper).
Overall: labels = O(log² n) · L(B) and time = O(log² n) · T(B), where L(B) and T(B) are the label and time complexities of the 0-1 AL black box B.

Experiments
- Four real-world data sets: Y! Local Businesses, UCI Person Linkage, DBLP-ACM Bibliography, and Scholar-DBLP Bibliography (the slide tabulates each set's size, +/- ratio, and number of features)
- All labels known, so active learning can be simulated
- Two approaches for AL with a precision constraint:
  - Ours, with Vowpal Wabbit as the 0-1 AL black box
  - Monotone [Arasu11]: assumes monotonicity of the similarity features; high computational and label complexity

Results I (Runtime vs. #Features)
Computational complexity as the number of features grows, on UCI Person Linkage.

Results II (Quality and #Label Queries)
Results on the Business and Person data sets.

Results II (contd.)
Results on the DBLP-ACM and Scholar data sets.

Results III (0-1 Active Learning)
Precision constraint satisfaction rate (%) of plain 0-1 AL.

Conclusion
- Active learning for entity matching
- Can use any 0-1 AL algorithm as a black box
- Great real-world performance: computationally efficient (600K examples in 25 seconds), label efficient, and better F-1 on four real-world tasks
- Guarantees on the matcher's precision and on time and label complexity