Tradeoffs Between Fairness and Accuracy in Machine Learning
Aaron Roth
The Basic Setup
Given: Data, consisting of $d$ features $x$ and a label $y$. Goal: Find a rule to predict the label from the features.
[Table: example training data with binary features College?, Bankruptcy?, Tall?, Employed?, Homeowner? (values Y/N) and label "Paid back loan?" (Yes/No).]
Example rule: (College and Employed and not Bankruptcy) or (Tall and Employed and not College).
The Basic Setup
Given data, select a hypothesis $h : \{Y, N\}^d \to \{Y, N\}$.
The goal is not prediction on the training data, but prediction on new examples.
The Basic Setup
[Figure: training data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ drawn from a distribution $P$ is fed to the Algorithm, which outputs $h$; a new example $(x, y) \sim P$ arrives and $h$ predicts $h(x)$.]
The example is misclassified if $h(x) \neq y$.
The Basic Setup
Goal: Find a classifier to minimize $err(h) = \Pr_{(x,y) \sim P}[h(x) \neq y]$.
We don't know $P$... but we can minimize the empirical error:
$err(h, D) = \frac{1}{n}\,\bigl|\{(x, y) \in D : h(x) \neq y\}\bigr|$
The Basic Setup
Empirical Error Minimization: Try a lookup table!
$h(x) = Y$ if $x \in \{YNYYN,\ YNNYN,\ YNYYN,\ NYYYY\}$; $h(x) = N$ otherwise.
$err(h, D) = 0$. This would over-fit. We haven't learned, just memorized. Learning must summarize.
The Basic Setup
Instead, limit hypotheses to come from a "simple" class $C$, e.g. linear functions, small decision trees, etc.
Compute $h^* = \arg\min_{h \in C} err(h, D)$.
$err(h^*) \le err(h^*, D) + \max_{h \in C} \bigl|err(h) - err(h, D)\bigr|$ (training error plus generalization error).
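To make the recipe concrete, here is a minimal Python sketch of empirical error minimization over a simple class; the particular class $C$ (single-feature rules and their negations) and the tiny data set are made-up illustrations, not from the talk.

```python
# Empirical error minimization over a toy "simple" class C:
# rules of the form "predict the j-th feature" or its negation.
def empirical_error(h, D):
    """Fraction of labeled examples (x, y) in D that h misclassifies."""
    return sum(1 for x, y in D if h(x) != y) / len(D)

def erm(D, num_features):
    """Return the hypothesis in C with the smallest empirical error on D."""
    def make_h(j, flip):
        # Predict the j-th feature's value, optionally negated.
        return lambda x: ('N' if x[j] == 'Y' else 'Y') if flip else x[j]
    C = [make_h(j, flip) for j in range(num_features) for flip in (False, True)]
    return min(C, key=lambda h: empirical_error(h, D))

# Hypothetical training data: features (College, Bankruptcy, Tall, Employed, Homeowner),
# label "Paid back loan?".
D = [
    (('Y', 'N', 'Y', 'Y', 'N'), 'Y'),
    (('Y', 'N', 'N', 'Y', 'N'), 'Y'),
    (('N', 'Y', 'Y', 'Y', 'Y'), 'Y'),
    (('N', 'Y', 'N', 'N', 'Y'), 'N'),
]
h_star = erm(D, num_features=5)
print(empirical_error(h_star, D))   # training error of the selected hypothesis
```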
Why might machine learning be "unfair"? Many reasons:
- Data might encode existing biases. E.g. labels are not "Committed a crime?" but "Was arrested."
- Data collection feedback loops. E.g. we only observe "Paid back loan?" if the loan was granted.
- Different populations with different properties. E.g. "SAT score" might correlate with the label differently in populations that employ SAT tutors.
- Less data (by definition) about minority populations.
[Figure: Population 1 and Population 2 plotted by SAT score vs. number of credit cards.]
Upshot: To be "fair", the algorithm may need to explicitly take into account group membership. Else, optimizing accuracy fits the majority population.
Another reason: less data (by definition) about minority populations.
[Figure: Population 1 and Population 2 plotted by SAT score vs. number of credit cards.]
Upshot: Algorithms trained on minority populations will be less accurate. Qualified individuals will be denied at a higher rate.
Another reason: data collection feedback loops. E.g. we only observe "Paid back loan?" if the loan was granted. A toy model follows.
Toy Model
Two kinds of loan applicants.
- Type 1: pays back loan with unknown probability $p_1$.
- Type 2: pays back loan with unknown probability $p_2$.
Initially, the bank believes $p_1, p_2$ are uniform in $[0, 1]$.
Every day, the bank makes a loan to the type most likely to pay it back according to its posterior distribution on $p_1, p_2$.
The bank observes if it is repaid, but not counterfactuals. The bank then updates its posteriors.
Toy Model
Suppose the bank has made $n_i$ loans to population $i$, of which $s_i$ were paid back and $d_i$ defaulted ($n_i = s_i + d_i$). The posterior expected payoff of the next loan to population $i$ is $\frac{s_i + 1}{s_i + d_i + 2}$. Suppose the true probabilities are $p_1 = p_2 = \frac{1}{2}$.
Example run (say population 1 receives the first loan and it defaults; thereafter population 2 always looks better):

Round | (n_1, s_1, d_1) | estimate 1 | (n_2, s_2, d_2)   | estimate 2
  0   | (0, 0, 0)       | 0.5        | (0, 0, 0)         | 0.5
  1   | (1, 0, 1)       | 0.33       | (0, 0, 0)         | 0.5
  2   | (1, 0, 1)       | 0.33       | (1, 1, 0)         | 0.66
  3   | (1, 0, 1)       | 0.33       | (2, 2, 0)         | 0.75
  4   | (1, 0, 1)       | 0.33       | (3, 2, 1)         | 0.60
  5   | (1, 0, 1)       | 0.33       | (4, 2, 2)         | 0.5
  6   | (1, 0, 1)       | 0.33       | (5, 3, 2)         | 0.57
  7   | (1, 0, 1)       | 0.33       | (6, 4, 2)         | 0.625
  8   | (1, 0, 1)       | 0.33       | (7, 4, 3)         | 0.55
  ... | (1, 0, 1)       | 0.33       | (x, ≈x/2, ≈x/2)   | ≈0.5 (e.g. 0.5000001)

Upshot: Algorithms making myopically optimal decisions may forever discriminate against a qualified population because of an unlucky prefix. Less myopic algorithms balance exploration and exploitation, but this has its own unfairness issues.
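The dynamics above are easy to reproduce in simulation. Below is a minimal sketch of the myopic bank, assuming Python and the posterior mean $(s_i + 1)/(n_i + 2)$ from the uniform prior; the function and variable names are invented for illustration.

```python
import random

def greedy_bank(p=(0.5, 0.5), rounds=10000, seed=0):
    """Myopic bank: each day, lend to the group whose posterior mean
    repayment probability (s_i + 1) / (n_i + 2) is currently highest."""
    rng = random.Random(seed)
    s = [0, 0]          # repaid loans per group
    n = [0, 0]          # total loans per group
    for _ in range(rounds):
        est = [(s[i] + 1) / (n[i] + 2) for i in range(2)]
        i = max(range(2), key=lambda j: est[j])   # ties broken toward group 0
        n[i] += 1
        if rng.random() < p[i]:                   # repayment observed only for the chosen group
            s[i] += 1
    return n, s

n, s = greedy_bank()
print("loans per group:", n)   # an unlucky early default can starve one group forever
```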
So what can we do? A solution in a stylized model…
Joint work with Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Seth Neel.
A Richer Model
The Contextual Bandit Problem
$k$ populations $i \in \{1, \ldots, k\}$, each with an unknown function $f_i : \mathbb{R}^d \to [0, 1]$, e.g. $f_i(x) = \langle \beta_i, x \rangle$.
In rounds $t \in \{1, \ldots, T\}$:
- A member of each population $i$ arrives with an application $x_i^t \in \mathbb{R}^d$.
- The bank chooses an individual $i_t$ and obtains reward $r_{i_t}^t$ such that $\mathbb{E}[r_{i_t}^t] = f_{i_t}(x_{i_t}^t)$.
- The bank does not observe rewards for individuals not chosen.
The Contextual Bandit Problem
Measure performance of the algorithm via regret:
$\mathrm{Regret} = \frac{1}{T}\left(\sum_t \max_i f_i(x_i^t) - \sum_t f_{i_t}(x_{i_t}^t)\right)$
"UCB"-style algorithms can get regret $O\!\left(\sqrt{d \cdot k / T}\right)$.
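For concreteness, here is a minimal simulation sketch of the interaction protocol and the regret bookkeeping; the linear reward functions, the Bernoulli reward noise, and the uniform-random placeholder policy are assumptions for illustration, not the talk's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, T = 3, 5, 1000
beta = rng.uniform(0, 1, size=(k, d))
beta /= beta.sum(axis=1, keepdims=True)        # keeps f_i(x) = <beta_i, x> in [0, 1] for x in [0, 1]^d

best_sum, got_sum = 0.0, 0.0
for _ in range(T):
    X = rng.uniform(0, 1, size=(k, d))         # one application per population this round
    f = np.einsum('ij,ij->i', beta, X)         # true expected rewards f_i(x_i^t)
    i_t = rng.integers(k)                      # placeholder policy: choose uniformly at random
    r = rng.binomial(1, f[i_t])                # observed reward for the chosen individual only
    best_sum += f.max()
    got_sum += f[i_t]                          # track expected (not realized) regret

print("average regret:", (best_sum - got_sum) / T)
```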
A Fairness Definition “Fair Equality of Opportunity”: “… assuming there is a distribution of natural assets, those who are at the same level of talent and ability, and have the same willingness to use them, should have the same prospects of success regardless of their initial place in the social system.” – John Rawls
A Fairness Definition “Fair Equality of Opportunity”: The algorithm should never preferentially favor (i.e. choose with higher probability) a less qualified individual over a more qualified individual.
A Fairness Definition
"Fair Equality of Opportunity": For every round $t$,
$\Pr[\text{Alg chooses } i] > \Pr[\text{Alg chooses } j]$ only if $f_i(x_i^t) > f_j(x_j^t)$.
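One way to read this constraint in code: in a given round, no applicant may be chosen with higher probability than a better-qualified applicant. The helper below and its argument format are hypothetical, purely to make the quantifiers concrete.

```python
def is_fair_round(choice_probs, true_rewards, tol=1e-12):
    """choice_probs[i]: probability the algorithm picks applicant i this round.
       true_rewards[i]: f_i(x_i^t), the applicant's true expected quality."""
    k = len(choice_probs)
    return all(
        choice_probs[i] <= choice_probs[j] + tol or true_rewards[i] > true_rewards[j]
        for i in range(k) for j in range(k) if i != j
    )

print(is_fair_round([0.5, 0.5, 0.0], [0.9, 0.9, 0.2]))  # True: equally qualified, equal probability
print(is_fair_round([0.7, 0.3], [0.4, 0.6]))            # False: favors the weaker applicant
```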
Is Fairness in Conflict with Profit?
Note: the optimal policy is fair! Every day, select $\arg\max_i f_i(x_i^t)$.
The constraint only prohibits favoring worse applicants. Maybe the goals of the learner are already aligned with this constraint?
Main takeaway: Wrong. Fairness is still an obstruction to learning the optimal policy. Even for $d = 0$, it requires $\Theta(k^3)$ rounds to obtain non-trivial regret.
FairUCB
[Figure: per-applicant confidence intervals around the estimated rewards $\hat{\mu}_i^t = \langle \hat{\beta}_i^t, x_i^t \rangle$. The "standard" choice [Auer et al. '02] plays the highest upper confidence bound. If two intervals overlap, we don't know which applicant is better, so FairUCB plays them with equal probability.]
"Standard Approach" Biases Towards Uncertainty
Example with 2 features: $x_1$ is "level of education", $x_2$ is "number of top hats owned".
For Group 1: $\beta = (1, 0)$.
- 90% of applications: random subject to $x_1 = x_2$ (perfectly correlated).
- 10% of applications: random ($x_1, x_2$ independent).
For Group 2: $\beta = (0.5, 0.5)$. All applications uniform.
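A small sketch of a data generator matching this example; the uniform ranges and any sampling details beyond the slide are assumptions for illustration.

```python
import numpy as np

def sample_application(group, rng):
    """Return (features, true expected reward) for one applicant from the given group."""
    if group == 1:                      # beta = (1, 0): only education matters
        x1 = rng.uniform(0, 1)
        x2 = x1 if rng.random() < 0.9 else rng.uniform(0, 1)   # 90% perfectly correlated
        return np.array([x1, x2]), x1                            # reward = <(1, 0), x>
    else:                               # beta = (0.5, 0.5)
        x = rng.uniform(0, 1, size=2)
        return x, 0.5 * x.sum()

rng = np.random.default_rng(1)
print(sample_application(1, rng), sample_application(2, rng))
```

The apparent point of the construction is that Group 1's applications almost always lie on the line $x_1 = x_2$, so its least-squares estimate of $\beta$ remains uncertain, which is exactly the kind of uncertainty a standard UCB rule is drawn toward.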
"Standard Approach" Biases Towards Uncertainty
"Unfair rounds": rounds $t$ in which the algorithm selects a sub-optimal applicant.
Discrimination index for an individual: the probability of being victimized, conditioned on being present at an unfair round.
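One way to operationalize this definition in code; the per-round log format below is entirely made up for illustration.

```python
def discrimination_index(log, group):
    """Empirical P(group is the victim | an unfair round occurred and group was present).
       log: list of dicts with keys 'unfair' (bool), 'present' (set of groups),
       'victim' (the group treated sub-optimally, or None). Hypothetical format."""
    rounds = [r for r in log if r['unfair'] and group in r['present']]
    if not rounds:
        return 0.0
    return sum(r['victim'] == group for r in rounds) / len(rounds)

# Tiny illustrative log: group 1 is the victim in one of the two unfair rounds it attends.
log = [
    {'unfair': True,  'present': {1, 2}, 'victim': 1},
    {'unfair': False, 'present': {1, 2}, 'victim': None},
    {'unfair': True,  'present': {1, 2}, 'victim': 2},
]
print(discrimination_index(log, group=1))   # 0.5
```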
[Figure: chaining. Overlapping confidence intervals are linked ("chained") together; arms connected to the top interval through a chain of overlaps form the chained set.]
FairUCB Pseudo-Code
For $t = 1$ to $\infty$:
- Using OLS estimators $\hat{\beta}_i$, construct reward estimates $\hat{\mu}_i^t = \langle \hat{\beta}_i, x_i^t \rangle$ and widths $w_i$ such that, with high probability, $\mu_i^t \in [\hat{\mu}_i^t - w_i, \hat{\mu}_i^t + w_i]$.
- Compute $i^* = \arg\max_i (\hat{\mu}_i^t + w_i)$.
- Select an individual $i_t$ uniformly at random from among the set of individuals chained to $i^*$.
- Observe reward $r_{i_t}^t$ and update $\hat{\beta}_{i_t}$.
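Below is a runnable sketch of the interval-chaining idea in the pseudo-code above, assuming Python/NumPy; the ridge-style estimates, the width formula alpha * sqrt(x^T A^{-1} x), the class name, and all constants are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np

class FairUCBSketch:
    def __init__(self, k, d, lam=1.0, alpha=1.0, seed=0):
        self.k, self.d, self.alpha = k, d, alpha
        self.A = [lam * np.eye(d) for _ in range(k)]   # per-arm regularized design matrices
        self.b = [np.zeros(d) for _ in range(k)]
        self.rng = np.random.default_rng(seed)

    def choose(self, X):
        """X[i]: this round's application for population i. Returns the chosen index."""
        lo, hi = np.empty(self.k), np.empty(self.k)
        for i in range(self.k):
            beta_hat = np.linalg.solve(self.A[i], self.b[i])
            mu = float(beta_hat @ X[i])
            w = self.alpha * np.sqrt(X[i] @ np.linalg.solve(self.A[i], X[i]))
            lo[i], hi[i] = mu - w, mu + w
        # Chain: start from the arm with the highest upper bound and repeatedly add
        # any arm whose interval overlaps an interval already in the chain.
        chain = {int(np.argmax(hi))}
        changed = True
        while changed:
            changed = False
            for i in range(self.k):
                if i not in chain and any(lo[i] <= hi[j] and lo[j] <= hi[i] for j in chain):
                    chain.add(i)
                    changed = True
        return int(self.rng.choice(sorted(chain)))     # uniform over the chained set

    def update(self, i, x, r):
        self.A[i] += np.outer(x, x)
        self.b[i] += r * x
```

Usage follows the protocol above: each round call `choose` on the $k$ current applications, observe the reward of the chosen arm only, then call `update` for that arm alone.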
FairUCB's Guarantees
Regret is $\tilde{O}\!\left(d\sqrt{k^3/T}\right)$: worse than without fairness, but only by a factor of $O(k\sqrt{d})$.
When the $f_i$ are linear, fairness has a cost, but only a mild one. What about when the $f_i$ are non-linear?
Being able to learn quickly and fairly $\Leftrightarrow$ being able to quickly find tight confidence intervals.
The "cost of fairness" can sometimes be exponential in $d$.
[Figure: discrimination index, Standard UCB vs. Fair UCB.]
[Figure: performance comparison.]
Discussion
- Even without a taste for discrimination "baked in", natural (optimal) learning procedures can disproportionately benefit certain populations.
- Imposing "fairness" constraints is not without cost. We must quantify the tradeoffs between fairness and other desiderata.
- There is no Pareto-dominant solution: how should we choose between different points on the tradeoff curve?
- None of these points is unique to "algorithmic" decision making; algorithms just allow formal analysis to bring these universal issues to the fore.
Where does the cost of fairness come from?
Lower bound distribution over instances: $k$ Bernoulli arms; the only uncertainty is their means. Arm $i$ has mean $\mu_i$ which is either $b_i$ or $b'_i$ (two nearby possible values).
Lower bound particulars
[Figure: the possible means $b'_k, b_k, b'_{k-1}, b_{k-1}, \ldots, b'_2, b_2, b'_1, b_1$ laid out along the "mean of distribution" axis, with the realized means $p'_k, p_k, \ldots, p'_1, p_1$ shown last. At each stage, arms that cannot yet be distinguished must be played with equal probability until the algorithm is confident their means differ.]
So, there's a separation between fair and unfair learning. Without fairness, standard algorithms achieve non-trivial regret within roughly $\tilde{O}(k)$ rounds, but this lower bound implies that any fair algorithm requires $\Omega(k^3)$ rounds before its regret is non-trivial.