On Sample Selection Bias and Its Efficient Correction via Model Averaging and Unlabeled Examples
Wei Fan, Ian Davidson


On Sample Selection Bias and Its Efficient Correction via Model Averaging and Unlabeled Examples Wei Fan Ian Davidson

A Toy Example. Two classes, red and green: red means f2 > f1, green means f2 <= f1.

Unbiased and Biased Samples (figures contrasting a not-so-biased sampling with a biased sampling of the toy data).

Effect on Learning. Accuracy on unbiased vs. biased samples: unbiased 97.1% vs. biased 92.1%; unbiased 96.9% vs. biased 95.9%; unbiased (value missing) vs. biased 92.7%. Some techniques are more sensitive to bias than others. One important question: how to reduce the effect of sample selection bias?

Ubiquitous. Sample selection bias arises in loan approval, drug screening, weather forecasting, ad campaigns, fraud detection, user profiling, biomedical informatics, intrusion detection, insurance, etc. Loan approval example: (1) normally, banks only have data on their own customers; (2) late-payment and default models are computed from their own data; (3) new customers may not completely follow the same distribution; (4) is the New Century sub-prime mortgage bankruptcy due to bad modeling?

Bias as a Distribution. Think of sampling an example (x,y) into the training data as an event denoted by a random variable s: s=1 means example (x,y) is sampled into the training data; s=0 means it is not. Bias is then the conditional probability of s=1 given x and y. P(s=1|x,y) is the probability that (x,y) is sampled into the training data, conditional on the example's feature vector x and class label y.

Categorization (from Zadrozny, 2004):
- No sample selection bias: P(s=1|x,y) = P(s=1)
- Feature bias: P(s=1|x,y) = P(s=1|x)
- Class bias: P(s=1|x,y) = P(s=1|y)
- Complete bias: P(s=1|x,y) admits no further reduction
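A minimal sketch (not from the paper) of how these selection mechanisms can be simulated, treating s as a Bernoulli draw per example; the population, thresholds, and selection probabilities below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population mirroring the toy example: red means f2 > f1.
X = rng.uniform(0.0, 1.0, size=(10_000, 2))
y = (X[:, 1] > X[:, 0]).astype(int)          # 1 = red (f2 > f1), 0 = green

def sample(prob_s1):
    """Draw s ~ Bernoulli(P(s=1|x,y)) per example and return the selected subset."""
    s = rng.uniform(size=len(X)) < prob_s1
    return X[s], y[s]

# No bias: P(s=1|x,y) = P(s=1), a constant.
X_none, y_none = sample(np.full(len(X), 0.3))

# Feature bias: P(s=1|x,y) = P(s=1|x), here a function of the first feature only.
X_feat, y_feat = sample(np.where(X[:, 0] < 0.5, 0.8, 0.1))

# Class bias: P(s=1|x,y) = P(s=1|y), here over-sampling the red class.
X_cls, y_cls = sample(np.where(y == 1, 0.6, 0.1))

print("red proportion under each scheme:", y_none.mean(), y_feat.mean(), y_cls.mean())
```

Comparing the class proportions of the three selected subsets makes the difference between the bias types visible even on synthetic data.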

Bias for a Training Set: how P(s=1|x,y) is computed. Practically, for a given training set D, P(s=1|x,y) = 1 if (x,y) is sampled into D and P(s=1|x,y) = 0 otherwise. Alternatively, consider all datasets D of the given size that could be sampled from the universe of examples.

Are Realistic Datasets Biased? Most datasets are biased: it is unlikely that each and every feature vector is sampled. For most problems there is at least feature bias, P(s=1|x,y) = P(s=1|x).

Effect on Learning. Learning algorithms estimate the true conditional probability:
- True probability P(y|x), such as P(fraud|x).
- Estimated probability P(y|x,M), where M is the model built.
- Conditional probability in the biased data, P(y|x,s=1).
Key issue: does P(y|x,s=1) = P(y|x), at least for the sampled examples?

Appropriate Assumptions. Feature bias tends to leave more "good" training examples than either class bias or complete bias, where "good" means P(y|x,s=1) = P(y|x). Beware: it is incorrect to conclude that P(y|x,s=1) = P(y|x) in general; that holds only under restricted situations that rarely happen. For class bias and complete bias it is hard to derive anything. It is hard to make more detailed claims without knowing more about both the sampling process and the true function.

Categorizing a dataset into the exact bias type is difficult: you don't know what you don't know. This is not that bad, since the key issue is the number of examples with a bad conditional probability, which can range from small to large.

Solutions when the number is small: posterior weighting, i.e., class probability integration over model space. Average the estimated class probabilities weighted by each model's posterior; this removes model uncertainty by averaging.
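In symbols, posterior-weighted averaging of class probabilities (a standard formulation of this idea; the paper's exact notation may differ):

$$
P(y \mid x, D) = \sum_{M} P(y \mid x, M)\, P(M \mid D).
$$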

We prove that the expected error of model averaging is lower than that of any single model it combines. What this says: compute many models in different ways; don't hang everything on one tree.
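One standard way to see a related guarantee (a textbook convexity argument, not necessarily the exact theorem proved in the paper): for squared error and an ensemble that averages K models,

$$
\bar f(x) = \frac{1}{K}\sum_{k=1}^{K} f_k(x),
\qquad
\big(\bar f(x) - y\big)^2 \;\le\; \frac{1}{K}\sum_{k=1}^{K}\big(f_k(x) - y\big)^2,
$$

by Jensen's inequality, so the averaged model's error never exceeds the average error of the individual models.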

Solutions when the number is large. When too many base models' estimates are off track, the power of model averaging is limited. In this case we need to make smart use of unlabeled examples that are unbiased. Reasonable assumption: unlabeled examples are usually plentiful and easier to obtain.

How to Use Them. Estimate the joint probability P(x,y) instead of just the conditional probability, i.e., P(x,y) = P(y|x)P(x). This makes no difference with a single model, but it does with multiple models.

Example of How This Works. Two conditional-probability models: P1(+|x) = 0.8 and P2(+|x) = 0.4, so P1(-|x) = 0.2 and P2(-|x) = 0.6. With model averaging, P(+|x) = (0.8 + 0.4) / 2 = 0.6 and P(-|x) = (0.2 + 0.6) / 2 = 0.4, so the prediction is +.

But if there are two P(x) models that assign x probability 0.05 and 0.4 respectively, then averaging joint probabilities gives P(+,x) = 0.05 * 0.8 + 0.4 * 0.4 = 0.2 and P(-,x) = 0.05 * 0.2 + 0.4 * 0.6 = 0.25. Recall that with model averaging of conditional probabilities, P(+|x) = 0.6 and P(-|x) = 0.4 and the prediction is +; now the prediction is - instead of +. Key idea: unlabeled examples can be used as weights to re-weight the models.
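A small numeric sketch reproducing the slide's example; the conditional probabilities and P(x) values are taken from the slide, while the variable names and the lack of normalization (which does not affect the argmax) are illustrative choices.

```python
# Two models' conditional probabilities, combined first by plain model
# averaging, then by averaging joint probabilities P_k(y|x) * P_k(x),
# where the per-model P(x) estimates could come from unlabeled data.

p_pos = [0.8, 0.4]          # P_k(+|x) for models 1 and 2
p_neg = [0.2, 0.6]          # P_k(-|x)
p_x   = [0.05, 0.4]         # each model's estimate of P(x)

# Plain averaging of conditional probabilities.
avg_pos = sum(p_pos) / len(p_pos)            # (0.8 + 0.4) / 2 = 0.6
avg_neg = sum(p_neg) / len(p_neg)            # (0.2 + 0.6) / 2 = 0.4
print("conditional averaging ->", "+" if avg_pos > avg_neg else "-")     # +

# Averaging of joint probabilities, re-weighted by P_k(x).
joint_pos = sum(w * p for w, p in zip(p_x, p_pos))   # 0.05*0.8 + 0.4*0.4 = 0.20
joint_neg = sum(w * p for w, p in zip(p_x, p_neg))   # 0.05*0.2 + 0.4*0.6 = 0.25
print("joint averaging       ->", "+" if joint_pos > joint_neg else "-") # -
```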

Improve P(y|x). Use a semi-supervised discriminant learning procedure (Vittaut et al., 2002). Basic procedure (a sketch follows below):
1. Use the learned models to predict the unlabeled examples.
2. Combine a random sample of the predicted unlabeled examples with the labeled training data.
3. Re-train the model.
4. Repeat until the predictions on the unlabeled examples remain stable.
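A minimal self-training sketch along these lines, assuming scikit-learn is available; the base classifier, sample fraction, and convergence test are illustrative choices, not the paper's exact procedure.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, base=LogisticRegression(max_iter=1000),
               frac=0.2, max_iter=20, seed=0):
    """Iteratively augment the labeled data with predicted unlabeled examples
    until the predictions on the unlabeled pool stop changing."""
    rng = np.random.default_rng(seed)
    model = clone(base).fit(X_lab, y_lab)
    prev = model.predict(X_unlab)
    for _ in range(max_iter):
        # Random subset of unlabeled examples, labeled with current predictions.
        idx = rng.choice(len(X_unlab), size=int(frac * len(X_unlab)), replace=False)
        X_aug = np.vstack([X_lab, X_unlab[idx]])
        y_aug = np.concatenate([y_lab, prev[idx]])
        model = clone(base).fit(X_aug, y_aug)
        pred = model.predict(X_unlab)
        if np.array_equal(pred, prev):   # predictions stable -> stop
            break
        prev = pred
    return model
```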

Experiments: Feature Bias Generation. Sort the examples according to their feature values and chop off the top.

Class Bias Generation. Randomly generate a prior class probability distribution P(y), just the numbers, such as P(+) = 0.1 and P(-) = 0.9. Then sample without replacement from the class bins.

Complete Bias Generation. Recall that the probability of sampling an example (x,y) depends on both x and y. Easiest simulation: sample (x,y) pairs without replacement from the training data. (A sketch of all three generators follows below.)
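A compact sketch of the three bias generators as described on these slides, assuming a NumPy feature matrix X and label vector y; the cut-off fraction, class priors, and subset sizes are illustrative, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_bias(X, y, feature=0, keep=0.7):
    """Sort by one feature's values and chop off the top (1 - keep) fraction."""
    order = np.argsort(X[:, feature])
    kept = order[: int(keep * len(order))]
    return X[kept], y[kept]

def class_bias(X, y, priors, n):
    """Sample n examples without replacement from per-class bins according to
    a chosen prior, e.g., priors = {1: 0.1, 0: 0.9}."""
    idx = []
    for label, p in priors.items():
        bin_idx = np.flatnonzero(y == label)
        take = min(int(round(p * n)), len(bin_idx))
        idx.extend(rng.choice(bin_idx, size=take, replace=False))
    idx = np.array(idx)
    return X[idx], y[idx]

def complete_bias(X, y, n):
    """Sample (x, y) pairs without replacement from the training data."""
    idx = rng.choice(len(X), size=n, replace=False)
    return X[idx], y[idx]
```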

Feature Bias

Datasets:
- Adult: 2 classes
- SJ: 3 classes
- SS: 3 classes
- Pendig: 10 classes
- ArtiChar: 10 classes
- Query: 4 classes
- Donation: 2 classes, cost-sensitive
- Credit Card: 2 classes, cost-sensitive

Winners and Losers. A single model *never wins*.
- Under feature bias, winners: model averaging *with or without* improved conditional probability using unlabeled examples; joint probability averaging with *uncorrelated* P(y|x) and P(x) models (details in the paper).
- Under class bias, winners: joint probability averaging with *correlated* P(y|x) and *improved* P(x) models.
- Under complete bias: model averaging with improved P(y|x).

Summary. According to our definition, sample selection bias is ubiquitous. Categorizing sample selection bias into 4 types is useful for analysis but hard to do in practice. In practice, the key question is the relative number of examples with an inaccurate P(y|x):
- Small: use model averaging of the conditional probabilities of several models.
- Medium: use model averaging of improved conditional probabilities.
- Large: use joint probability averaging of uncorrelated conditional probability and feature probability models.

When the number is small: the paper proves that the expected error of model averaging is lower than that of any single model it combines. What this says: compute models in different ways; don't hang everything on one tree.