
Active Learning
Zhipeng (Patrick) Luo, December 6th, 2016

Motivation
Labeling cost in supervised learning can be huge. Typical supervised learning, where a large labeled set is collected up front, is called passive learning. Active learning aims to find a good mapping without using too much labeled data.

Active Learning Paradigm
Iterative learning framework (a minimal sketch follows):
- Start with initial labeled data.
- Repeat:
  - Fit a classifier on the current labeled data;
  - Actively select the most informative unlabeled point according to the current classifier;
  - Obtain its label from a human labeler.
Many different querying strategies exist: uncertainty, novelty (redundancy), and representativeness.
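As a concrete illustration (not from the slides), here is a minimal pool-based uncertainty-sampling loop; the synthetic dataset, logistic-regression model, seed size, and query budget are all placeholder assumptions.

```python
# Minimal pool-based uncertainty-sampling loop (sketch; the data, model,
# and budget below are illustrative assumptions, not from the slides).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
# Seed with a few labeled points from each class.
labeled = list(rng.choice(np.where(y == 0)[0], 5, replace=False)) + \
          list(rng.choice(np.where(y == 1)[0], 5, replace=False))
pool = [i for i in range(len(X)) if i not in set(labeled)]
budget = 50                                  # number of label queries allowed

clf = LogisticRegression(max_iter=1000)
for _ in range(budget):
    clf.fit(X[labeled], y[labeled])          # fit on current labeled data
    proba = clf.predict_proba(X[pool])[:, 1]
    # Uncertainty sampling: query the point closest to the decision boundary.
    query = pool[int(np.argmin(np.abs(proba - 0.5)))]
    labeled.append(query)                    # human labeler reveals y[query]
    pool.remove(query)

print("accuracy on the full set:", clf.score(X, y))
```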

Three Scenarios of Active Learning
Pool-based:
- Pick one point from an unlabeled data pool.
- Uncertainty criteria: closest to the decision boundary, or highest entropy in the predicted label (see the scoring sketch below).
Query synthesis:
- Aggressively search the unlabeled data space and synthesize a data point, which can however be unrealistic to label.
Sequential labeling:
- Unlabeled data arrive sequentially, handled by a "mellow" learner that queries only when necessary.
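The two pool-based uncertainty criteria above can be written as simple scores; this sketch assumes a fitted probabilistic classifier `clf` and an unlabeled pool `X_pool`, both hypothetical names.

```python
# Two common uncertainty scores for pool-based querying (illustrative sketch).
import numpy as np

def margin_uncertainty(clf, X_pool):
    # Small gap between the top two class probabilities = close to the
    # decision boundary; negate so that larger scores mean "more uncertain".
    proba = np.sort(clf.predict_proba(X_pool), axis=1)
    return -(proba[:, -1] - proba[:, -2])

def entropy_uncertainty(clf, X_pool):
    # Entropy of the predicted label distribution.
    proba = clf.predict_proba(X_pool)
    return -np.sum(proba * np.log(proba + 1e-12), axis=1)

# Query the argmax of either score, e.g.:
# query_idx = int(np.argmax(entropy_uncertainty(clf, X_pool)))
```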

Notation
- The unlabeled data space: $\mathcal{X}$; the label space: a finite set $\mathcal{Y}$.
- A hypothesis is a mapping $h: \mathcal{X} \to \mathcal{Y}$, $h \in \mathcal{H}$, where $\mathcal{H}$ is a pre-specified hypothesis space.
- $h_n$: the hypothesis learned after seeing $n$ examples.
- An underlying distribution $\mathbf{P}$ over $\mathcal{X} \times \mathcal{Y}$.
- Error of a hypothesis $h$: $\mathrm{err}(h) = \mathbf{P}[h(X) \neq Y]$.
- The optimal hypothesis: $h^* = \arg\min_{h \in \mathcal{H}} \mathrm{err}(h)$.
- Separable case: $\mathrm{err}(h^*) = 0$; non-separable otherwise.
- Active learning: hope that $h_n \to h^*$ as $n$ grows.

A Toy Example
Setting: $\mathcal{X}$ is one-dimensional; $\mathcal{Y}$ is binary; $\mathcal{H}$ is the set of linear separators (thresholds); the data are linearly separable; $\epsilon$ is a pre-specified error rate.
- Passive learning: $O(1/\epsilon)$ randomly labeled points.
- Binary search: only $O(\log 1/\epsilon)$ label queries, as the sketch below illustrates.
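A minimal sketch of the binary-search argument under the toy setting above; the oracle, interval $[0, 1]$, and true threshold $w^*$ are illustrative assumptions.

```python
# Binary search for a 1-D threshold classifier (sketch). With separable data
# on [0, 1], halving the uncertainty interval needs O(log 1/eps) labels,
# versus O(1/eps) randomly labeled points for passive learning.
def learn_threshold(oracle, eps):
    """oracle(x) returns the true label sign(x - w*); w* is unknown."""
    lo, hi, queries = 0.0, 1.0, 0
    while hi - lo > eps:                 # uncertainty interval halves each step
        mid = (lo + hi) / 2.0
        queries += 1
        if oracle(mid) <= 0:             # mid is left of w*: move lo up
            lo = mid
        else:                            # mid is right of w*: move hi down
            hi = mid
    return (lo + hi) / 2.0, queries

w_star = 0.37                            # hypothetical true threshold
w_hat, n = learn_threshold(lambda x: 1 if x > w_star else -1, eps=1e-3)
print(f"estimated threshold {w_hat:.4f} after {n} label queries")
```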

Separable Case
$\mathcal{H}_1 = \mathcal{H}$
For $t = 1, 2, \dots$:
- Receive an unlabeled point $x_t$.
- If the hypotheses in $\mathcal{H}_t$ disagree about $x_t$'s label:
  - Query the label $y_t$ of $x_t$;
  - $\mathcal{H}_{t+1} = \{h \in \mathcal{H}_t : h(x_t) = y_t\}$.
- Else: $\mathcal{H}_{t+1} = \mathcal{H}_t$.
A runnable sketch of this version-space algorithm follows.
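This sketch instantiates the algorithm on a finite class of 1-D thresholds; the threshold grid, the true concept, and the stream length are illustrative assumptions.

```python
# Version-space active learning sketch for the separable case, using a
# finite hypothesis class of 1-D thresholds (illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(0, 1, 101)          # H: h_w(x) = sign(x - w)
H = set(range(len(thresholds)))              # indices of surviving hypotheses
w_star = 0.37                                # hypothetical true concept

def predict(w_idx, x):
    return 1 if x > thresholds[w_idx] else -1

queries = 0
for x in rng.uniform(0, 1, size=200):        # unlabeled points arrive one by one
    preds = {predict(i, x) for i in H}
    if len(preds) > 1:                       # disagreement: query the label
        y = 1 if x > w_star else -1
        queries += 1
        H = {i for i in H if predict(i, x) == y}   # shrink the version space
    # else: every surviving hypothesis agrees, so the label is implied

print(f"{queries} queries; {len(H)} hypotheses remain")
```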

Separable Case

Separable Case: Label Complexity
How many labels are needed?
$L(\epsilon, \delta) = t_0$ such that for all $t \ge t_0$, $P[\exists h \in \mathcal{H}_t,\ \mathrm{err}(h) > \epsilon] \le \delta$.
$L(\epsilon, \delta) \le \tilde{O}(\theta d \log 1/\epsilon)$, where
- $\tilde{O}$ suppresses terms logarithmic in $d$, $\theta$, and $\log 1/\epsilon$;
- $d$ is the VC dimension of the hypothesis space;
- $\theta$ is a constant called the disagreement coefficient (discussed later).
In contrast, the label complexity of passive learning is $\Omega(d/\epsilon)$, so active learning yields an exponential improvement.

Non-separable Case
The algorithm mirrors the separable case, but the version space can no longer discard hypotheses outright:
$\mathcal{H}_1 = \mathcal{H}$
For $t = 1, 2, \dots$:
- Receive an unlabeled point $x_t$.
- If there exists disagreement about $x_t$'s label: query the label $y_t$ of $x_t$.
- Else: infer the label $y_t$ of $x_t$ ($y_t$ may be wrong).
- $\mathcal{H}_{t+1} = \{h \in \mathcal{H}_t : \widehat{\mathrm{err}}(h) \le \widehat{\mathrm{err}}(h_t^*) + \Delta_t\}$, where $\widehat{\mathrm{err}}$ is the empirical error on the labels collected so far and $h_t^*$ its minimizer.

Label Complexity: Separable vs. Non-separable
Separable case:
- $L(\epsilon, \delta) = t_0$ such that for all $t \ge t_0$, $P[\exists h \in \mathcal{H}_t,\ \mathrm{err}(h) > \epsilon] \le \delta$
- Active learning: $L(\epsilon, \delta) \le \tilde{O}(\theta d \log 1/\epsilon)$; passive learning: $\Omega(d/\epsilon)$
Non-separable case, with $\nu = \mathrm{err}(h^*)$:
- $L(\epsilon, \delta) = t_0$ such that for all $t \ge t_0$, $P[\exists h \in \mathcal{H}_t,\ \mathrm{err}(h) > \epsilon + \nu] \le \delta$
- Active learning: $L(\epsilon, \delta) \le \tilde{O}\left(\theta d \left(\log^2 1/\epsilon + \nu^2/\epsilon^2\right)\right)$; passive learning: $\Omega\left(d/\epsilon + d\nu^2/\epsilon^2\right)$

The Disagreement Coefficient $\theta$
$d(h, h^*) = \mathbf{P}_X[h(X) \neq h^*(X)]$
$B(h^*, r) = \{h \in \mathcal{H} : d(h, h^*) \le r\}$
$\mathrm{DIS}(B(h^*, r)) = \{x \in \mathcal{X} : \exists h, h' \in B(h^*, r)\ \text{s.t.}\ h(x) \neq h'(x)\}$
$\theta = \sup_{r > 0} \dfrac{\mathbf{P}[\mathrm{DIS}(B(h^*, r))]}{r}$
The coefficient measures how $\mathbf{P}[\mathrm{DIS}(B(h^*, r))]$, the mass of the region of disagreement, scales with $r$.
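For intuition, here is a standard worked example (not on the slides, so treat it as supplementary): for 1-D threshold classifiers the coefficient is bounded by 2.

```latex
% Disagreement coefficient for 1-D thresholds h_w(x) = sign(x - w),
% a standard worked example (not from the slides).
d(h_w, h_{w^*}) = \mathbf{P}_X[\, x \text{ lies between } w \text{ and } w^* \,],
\qquad
B(h^*, r) = \{ h_w : d(h_w, h_{w^*}) \le r \}.
% The ball contains thresholds within probability mass r on either side
% of w^*, so its disagreement region is the union of two intervals of
% mass at most r each:
\mathbf{P}[\mathrm{DIS}(B(h^*, r))] \le 2r
\quad\Longrightarrow\quad
\theta = \sup_{r > 0} \frac{\mathbf{P}[\mathrm{DIS}(B(h^*, r))]}{r} \le 2 .
```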

Active Clinical Trials for Personalized Medicine
Stanislav Minsker, Ying-Qi Zhao, and Guang Cheng. Journal of the American Statistical Association, 2016.

Introduction
Individualized treatment rules (ITRs):
- Find the optimal treatment for each patient.
- A patient's characteristics include demographics, medical history, genetic or genomic information, etc.
Randomized clinical trials (RCTs):
- Focus only on the efficacy of treatments.
- Not efficient, as the required sample size (cost) can be large.
Active clinical trials (ACTs):
- Exclude patients for whom the benefit of some treatment is clear;
- Select those whose optimal treatment is hard to determine.
- An instance of uncertainty sampling.

Problem Setting
The data $(X, A, R)$ have joint distribution $\mathbf{P}$:
- $X \in \mathbb{R}^p$ represents a patient's covariates;
- $A \in \{1, -1\}$ represents a treatment decision: standard or alternative;
- $R \in \mathbb{R}$ is the treatment outcome; the larger the better.
An ITR $D: X \to A$ is a binary mapper: $D^*(x) = \mathrm{sign}\{f^*(x)\}$, where
$f^*(x) = E[R \mid A = 1, X = x] - E[R \mid A = -1, X = x]$
is called the contrast function. It defines the optimal decision boundary and can be modeled with a regression model (see the sketch below).
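As an illustration (not the paper's estimator), a plug-in way to estimate the contrast function is to regress the outcome within each treatment arm and subtract; the data-generating process and random-forest choice below are placeholder assumptions.

```python
# Plug-in estimate of the contrast f*(x) = E[R|A=1,x] - E[R|A=-1,x]
# by fitting one regression per treatment arm (sketch; synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 400, 3
X = rng.normal(size=(n, p))                  # patient covariates
A = rng.choice([-1, 1], size=n)              # randomized treatment assignment
f_true = X[:, 0] - 0.5 * X[:, 1]             # hypothetical true contrast
R = 1.0 + 0.5 * A * f_true + rng.normal(scale=0.1, size=n)   # outcome

reg_pos = RandomForestRegressor(random_state=0).fit(X[A == 1], R[A == 1])
reg_neg = RandomForestRegressor(random_state=0).fit(X[A == -1], R[A == -1])

f_hat = reg_pos.predict(X) - reg_neg.predict(X)   # estimated contrast
D_hat = np.sign(f_hat)                            # estimated ITR
print("agreement with optimal rule:", np.mean(D_hat == np.sign(f_true)))
```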

Active Clinical Trials
The key idea is to find the active set at each iteration: the set of patients whose optimal treatment is still uncertain.

Active Clinical Trials
Step 1: Initialization.
- Randomly recruit patients and randomly treat them;
- Observe their outcomes (labeled data);
- Train the initial estimators.
Step 2: Active learning.
- Find the active-set patients;
- Randomly treat them;
- Observe their outcomes;
- Update the estimators;
- Repeat until the budget runs out, then output the final estimator.

Active Set
Define $F(f, \delta)$ to be the set of hypotheses that are $\delta$-close to $f$ in sup norm: $F(f, \delta) = \{g : \|g - f\|_\infty \le \delta\}$.
For each iteration:
- Find $F(\hat{f}_{t-1}, \delta)$;
- Active set: $AS_t = \{x : \exists f_1, f_2 \in F(\hat{f}_{t-1}, \delta)\ \text{s.t.}\ \mathrm{sign}(f_1(x)) \neq \mathrm{sign}(f_2(x))\}$;
- Approximate $AS_t$ with a regular set $act_t$, whose purpose is to determine the active set based on intrinsic dimensions.
Note that since $F$ is a sup-norm ball, sign disagreement at $x$ is possible exactly when $|\hat{f}_{t-1}(x)| \le \delta$: the active set is a band around the estimated decision boundary (see the sketch below).
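A minimal sketch of the active-set computation under the sup-norm-ball observation above; `f_hat`, `delta`, and the candidate pool are hypothetical placeholders.

```python
# Active-set selection sketch. Because F(f, delta) is a sup-norm ball around
# the current estimate f_hat, two members can disagree in sign at x exactly
# when |f_hat(x)| <= delta, so the active set is a band around the boundary.
# (`f_hat`, `delta`, and `X_candidates` are illustrative placeholders.)
import numpy as np

def active_set(f_hat, X_candidates, delta):
    """Return candidates whose optimal treatment is still uncertain."""
    contrast = f_hat(X_candidates)          # estimated contrast per patient
    return X_candidates[np.abs(contrast) <= delta]

# Usage: recruit only patients in the active set, treat them at random,
# observe outcomes, refit f_hat, shrink delta, and repeat until the budget
# is exhausted (Step 2 of the trial design above).
```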

Smoothed Kernel Estimator

Kernel Bandwidth

Kernel Bandwidth

Theoretical Bound
With probability greater than , it holds that:
- $C$ is a constant depending on the kernel and the distribution of $X$;
- $N$ is the total number of patients recruited;
- $d$ is the intrinsic dimension, and $\gamma \in [1, d]$ is a constant.

Real Data Analysis
Two data sets:
- Nefazodone-CBASP Clinical Trial
- Twelve-Step Intervention on Stimulant Drug Use
Methods:
- AL-BV: active learning with the smoothed kernel estimator (non-parametric)
- AL-GP: active learning with a Gaussian process (parametric)
- OWL: passive learning with outcome weighted learning (hinge loss)
- OLS: passive learning with ordinary least squares loss

Nefazodone-CBASP Clinical Trial
Patient predictors: the baseline HRSD (Hamilton Rating Scale for Depression) score, alcohol dependence, and the HAMA somatic anxiety score.
Three treatments: Nefazodone, CBASP, and the combination of both.
Outcome: the HRSD score; the higher, the worse.

Nefazodone-CBASP Clinical Trial

Twelve-Step Intervention on Stimulant Drug Use
Patient predictors: age; the average number of days per month of self-reported stimulant drug use in the 3 months prior to randomization; and baseline alcohol use, drug use, employment status, medical status, and psychiatric status composite scores on the Addiction Severity Index (ASI).
Treatments, aimed at reducing stimulant drug use: treatment as usual (TAU), or TAU integrated with the Stimulant Abuser Groups to Engage in 12-Step intervention.
Outcome: the number of days of self-reported stimulant drug use over the 3- to 6-month post-randomization period, where a smaller value is preferable.

Twelve-Step Intervention on Stimulant Drug Use

Thank You!