Sample Selection Bias Lei Tang Feb. 20th, 2007

Classical ML vs. Reality
- In classical machine learning, training data and test data are assumed to share the same distribution.
- But that's not always the case in reality:
  - Survey data
  - Species habitat modeling based on data from only one area
  - Training and test data collected by different experiments
  - Newswire articles with timestamps

Sample selection bias  Standard setting: data (x,y) are drawn independently from a distribution D  If the selected samples is not a random samples of D, then the samples are biased.  Usually, training data are biased, but we want to apply the classifier to unbiased samples.

Four cases of bias (1)
Let s denote whether or not a sample is selected.
- P(s=1|x,y) = P(s=1): not biased
- P(s=1|x,y) = P(s=1|x): depends only on the feature vector
- P(s=1|x,y) = P(s=1|y): depends only on the class label
- P(s=1|x,y): depends on both x and y

Four cases of bias (2)
- P(s=1|x,y) = P(s=1|y): learning from imbalanced data. The bias can be alleviated by changing the class prior.
- P(s=1|x,y) = P(s=1|x): implies P(y|x) remains unchanged. This is the most studied case.
- If the bias depends on both x and y, we lack the information to analyze it.

An intuitive example
P(s=1|x,y) = P(s=1|x) means that, given x, selection s is independent of the label y, so P(y|x, s=1) = P(y|x). Does the bias really matter if P(y|x) remains unchanged? It can: the marginal distribution of x in the training sample shifts, so a learner with a restricted model class or limited data may still end up with a different decision boundary.
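A minimal sketch of this point (my own illustration, not from the slides): inject a feature-dependent selection bias P(s=1|x) on synthetic data and check that P(y|x) is preserved in the selected sample while the marginal of x shifts. All distributions and parameters here are arbitrary choices.

```python
# Feature-dependent selection bias: P(y|x) is unchanged among selected points,
# but the marginal distribution of x in the selected sample is shifted.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200_000)
p_y_given_x = 1.0 / (1.0 + np.exp(-2.0 * x))      # true P(y=1|x), a logistic curve
y = rng.random(x.size) < p_y_given_x

p_select = 1.0 / (1.0 + np.exp(3.0 * x))          # P(s=1|x): prefers small x, ignores y
s = rng.random(x.size) < p_select

slice_ = np.abs(x - 0.5) < 0.05                   # a thin slice around x = 0.5
print("P(y=1 | x~0.5)          :", y[slice_].mean())
print("P(y=1 | x~0.5, s=1)     :", y[slice_ & s].mean())   # roughly the same
print("mean of x overall       :", x.mean())
print("mean of x among selected:", x[s].mean())            # clearly shifted
```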

Bias analysis for classifiers (1)
- Logistic regression: any classifier that directly models P(y|x) is asymptotically unaffected by a bias that depends only on x, since P(y|x, s=1) = P(y|x).
- Bayes classifier: the true posterior P(y|x) is likewise unchanged. But a naïve Bayes classifier estimates P(y|x) through P(y) and the per-feature terms P(x_j|y), and these estimates do change under the biased sample, so the learned classifier is affected.

Bias analysis for classifiers (2)
- Hard-margin SVM: no bias effect. Soft-margin SVM: there is a bias effect, since the effective cost of misclassification can change under the biased sample.
- A decision tree usually results in a different classifier if the bias is present.
- In sum, most classifiers are still sensitive to sample selection bias.
- Note: this is an asymptotic analysis assuming there are "enough" samples.

Correcting bias
- Expected risk: R[Pr', θ] = E_{(x,y)~Pr'}[ l(x, y, θ) ].
- Suppose the training set is drawn from Pr and the test set from Pr'. Then
  R[Pr', θ] = E_{(x,y)~Pr}[ β(x, y) l(x, y, θ) ],   with weights β(x, y) = Pr'(x, y) / Pr(x, y).
- So we minimize the empirical regularized risk on the (weighted) training set:
  min_θ (1/m) Σ_i β_i l(x_i, y_i, θ) + λ Ω[θ].
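As a concrete sketch of minimizing this weighted empirical risk (my illustration; it assumes the weights β have already been estimated somehow, e.g. by KMM below, and the model choice is arbitrary), most off-the-shelf learners accept per-sample weights:

```python
# Weighted empirical risk minimization: pass the estimated beta_i as per-sample
# weights so training examples that resemble the test distribution count more.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_weighted(X_train, y_train, beta):
    """Minimize the beta-weighted, regularized logistic loss on the training set."""
    clf = LogisticRegression(C=1.0, max_iter=1000)
    clf.fit(X_train, y_train, sample_weight=beta)
    return clf
```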

Estimating the weights
- Samples that are likely to appear in the test data should gain more weight.
- But how do we estimate the weight of each sample?
- Brute-force approach (a sketch follows below):
  - Estimate the densities Pr(x) and Pr'(x) separately,
  - then compute each sample weight as the ratio Pr'(x) / Pr(x).
- Not practical: density estimation is harder than classification given a limited number of samples.
- Existing works use simulation experiments in which both Pr(x) and Pr'(x) are known (NOT REALISTIC).
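A minimal sketch of the brute-force density-ratio approach (my illustration, not from the slides), using kernel density estimates; the bandwidth is a placeholder and would need tuning:

```python
# Brute-force weight estimation: fit KDEs to training and test inputs and take
# the density ratio. Only reasonable when x is low-dimensional and data plentiful.
import numpy as np
from sklearn.neighbors import KernelDensity

def density_ratio_weights(X_train, X_test, bandwidth=0.5):
    kde_tr = KernelDensity(bandwidth=bandwidth).fit(X_train)   # estimates Pr(x)
    kde_te = KernelDensity(bandwidth=bandwidth).fit(X_test)    # estimates Pr'(x)
    log_ratio = kde_te.score_samples(X_train) - kde_tr.score_samples(X_train)
    return np.exp(log_ratio)                                   # beta_i ~ Pr'(x_i)/Pr(x_i)
```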

Distribution matching (Kernel Mean Matching)
- The expectation in feature space: μ(Pr) = E_{x~Pr}[ Φ(x) ].
- We have: for a suitable (universal) kernel, matching the means in feature space, μ(Pr') = E_{x~Pr}[ β(x) Φ(x) ], implies Pr'(x) = β(x) Pr(x).
- Hence, the problem can be formulated as
  min_β || μ(Pr') − E_{x~Pr}[ β(x) Φ(x) ] ||   subject to β(x) ≥ 0 and E_{x~Pr}[ β(x) ] = 1.
- The solution is β(x) = Pr'(x) / Pr(x).

Empirical KMM optimization
- Empirically, with training samples x_1, ..., x_m ~ Pr and test samples x'_1, ..., x'_{m'} ~ Pr', minimize
  || (1/m) Σ_i β_i Φ(x_i) − (1/m') Σ_j Φ(x'_j) ||²   subject to β_i ∈ [0, B] and |(1/m) Σ_i β_i − 1| ≤ ε,
  where K_ij = k(x_i, x_j) and κ_i = (m/m') Σ_j k(x_i, x'_j).
- Therefore, it is equivalent to solving the QP problem
  min_β (1/2) βᵀKβ − κᵀβ   subject to β_i ∈ [0, B] and |Σ_i β_i − m| ≤ m ε.
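A minimal sketch of this QP (my own implementation under stated assumptions: an RBF kernel via scikit-learn, the cvxopt QP solver, and illustrative defaults for gamma, B, and ε):

```python
# Kernel Mean Matching: solve min_b 1/2 b'Kb - kappa'b
#   s.t. 0 <= b_i <= B and |sum(b) - m| <= m*eps.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from cvxopt import matrix, solvers

def kmm_weights(Xtr, Xte, gamma=1.0, B=10.0, eps=None):
    m, mp = Xtr.shape[0], Xte.shape[0]
    if eps is None:
        eps = B / np.sqrt(m)                         # a common heuristic choice
    K = rbf_kernel(Xtr, Xtr, gamma=gamma)            # K_ij = k(x_i, x_j)
    kappa = (float(m) / mp) * rbf_kernel(Xtr, Xte, gamma=gamma).sum(axis=1)

    P = matrix(K + 1e-8 * np.eye(m))                 # small ridge for numerical PSD-ness
    q = matrix(-kappa)
    G = matrix(np.vstack([np.ones((1, m)), -np.ones((1, m)),
                          np.eye(m), -np.eye(m)]))
    h = matrix(np.hstack([m * (1 + eps), m * (eps - 1),
                          B * np.ones(m), np.zeros(m)]))
    sol = solvers.qp(P, q, G, h)
    return np.array(sol['x']).ravel()                # the estimated beta_i
```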

Experiments  A Toy Regression Example

Simulation
- Take some UCI datasets, inject sample selection bias into the training set, then test on unbiased samples (a sketch of such a biased split follows below).
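A minimal sketch of such a simulated biased split (my illustration; the dataset, the biasing rule, and the classifier are arbitrary choices, not the ones used in the slides' experiments):

```python
# Simulate feature-dependent selection bias on a standard dataset: training
# examples are kept with probability depending on one standardized feature,
# then classifiers are evaluated on an untouched, unbiased test split.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

p_select = 1.0 / (1.0 + np.exp(2.0 * X_tr[:, 0]))   # P(s=1|x): depends on feature 0 only
keep = rng.random(len(X_tr)) < p_select
X_bias, y_bias = X_tr[keep], y_tr[keep]

clf_bias = LogisticRegression(max_iter=1000).fit(X_bias, y_bias)
clf_full = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy, trained on biased sample  :", clf_bias.score(X_te, y_te))
print("accuracy, trained on unbiased sample:", clf_full.score(X_te, y_te))
```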

Bias on Labels

Unexplained
- In theory, importance sampling with the true density ratio should be optimal; why does KMM perform better?
- Why kernel methods? Can we just do the matching using the input features directly?
- Can we just fit a logistic regression that treats test data as the positive class and training data as the negative class? Then β is (proportional to) the odds, as sketched below.
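A minimal sketch of that last idea (my illustration; the pool-size rescaling assumes the classifier is trained on the raw train/test pools):

```python
# Discriminative weight estimation: beta(x) = Pr'(x)/Pr(x) is proportional to the
# odds P(test|x)/P(train|x) of a classifier separating test from training inputs.
import numpy as np
from sklearn.linear_model import LogisticRegression

def discriminative_weights(X_train, X_test):
    X = np.vstack([X_train, X_test])
    z = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    clf = LogisticRegression(max_iter=1000).fit(X, z)
    p = clf.predict_proba(X_train)[:, 1]              # P(test | x) at training points
    odds = p / (1.0 - p)
    return odds * (len(X_train) / len(X_test))        # correct for pool-size imbalance
```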

Some related problems
- Semi-supervised learning (is it equivalent?)
- Multi-task learning: assumes P(y|x) differs across tasks, whereas sample selection bias (mostly) assumes P(y|x) stays the same; MTL also requires training data for each task.
- Is it possible to identify the features that introduce the bias, or to find invariant dimensions?

Any Questions? Happy Pig Year!