1
Active Learning Zhipeng (Patrick) Luo December 6th, 2016
2
Motivation
The labeling cost in supervised learning can be huge.
This typical supervised setting is called passive learning.
Active learning aims to find a good mapping without using too much labeled data.
3
Active Learning Paradigm
Iterative learning framework (a pool-based sketch is given below):
- Start with an initial labeled set;
- Repeat: fit a classifier on the current labeled data; actively sample the most important unlabeled point according to the current classifier; obtain its label from a human labeler.
A variety of querying strategies:
- Uncertainty
- Novelty (redundancy)
- Representativeness
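A minimal sketch of the pool-based version of this loop, assuming a scikit-learn LogisticRegression classifier, least-confidence uncertainty as the query strategy, and a hypothetical `oracle` callable standing in for the human labeler; it is an illustration, not a specific method from the deck.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling_loop(X_pool, oracle, n_init=10, budget=50, seed=0):
    """Pool-based active learning with least-confidence sampling.

    X_pool : (n, d) array of unlabeled points
    oracle : callable x -> label (stands in for the human labeler; hypothetical)
    Assumes the initial random sample contains at least two classes.
    """
    rng = np.random.default_rng(seed)
    labeled = [int(i) for i in rng.choice(len(X_pool), size=n_init, replace=False)]
    labels = {i: oracle(X_pool[i]) for i in labeled}          # initial labeled data

    clf = LogisticRegression()
    for _ in range(budget):
        # Fit a classifier on the current labeled data.
        clf.fit(X_pool[labeled], [labels[i] for i in labeled])

        # Score every remaining unlabeled point by uncertainty (least confidence).
        unlabeled = [i for i in range(len(X_pool)) if i not in labels]
        if not unlabeled:
            break
        proba = clf.predict_proba(X_pool[unlabeled])
        query = unlabeled[int(np.argmin(proba.max(axis=1)))]  # least confident point

        # Obtain its label from the labeler and add it to the labeled set.
        labels[query] = oracle(X_pool[query])
        labeled.append(query)
    return clf
```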
4
Three Scenarios of Active Learning
Pool-based: pick one point from an unlabeled data pool. Uncertainty can be measured by distance to the decision boundary (closest first) or by the entropy of the predicted label (two such scores are sketched below).
Query synthesis: aggressively search the unlabeled data space and synthesize a query point, which can be unrealistic for a human to label.
Sequential labeling: unlabeled data arrive sequentially; a mellow learner queries only selectively as points stream in.
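Two of the pool-based uncertainty scores named above, sketched with NumPy; `proba` is assumed to be an (n_points, n_classes) array of predicted class probabilities.

```python
import numpy as np

def margin_uncertainty(proba):
    """Negative margin between the two most probable classes: larger = closer to the boundary."""
    part = np.sort(proba, axis=1)
    return -(part[:, -1] - part[:, -2])

def entropy_uncertainty(proba):
    """Entropy of the predicted label distribution: larger = more uncertain."""
    eps = 1e-12                                   # guard against log(0)
    return -np.sum(proba * np.log(proba + eps), axis=1)
```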
5
Notations
The unlabeled data space: $\mathcal{X}$. The label space: a finite set $\mathcal{Y}$.
A mapping $h: \mathcal{X} \to \mathcal{Y}$, $h \in \mathcal{H}$; $h_n$ is the hypothesis learned after seeing $n$ examples.
A pre-specified hypothesis space: $\mathcal{H}$.
An underlying distribution over $\mathcal{X} \times \mathcal{Y}$: $D$.
Error of a hypothesis $h$: $err(h) = P_D[h(X) \ne Y]$.
The optimal hypothesis: $h^* = \arg\min_{h \in \mathcal{H}} err(h)$.
Separable case: $err(h^*) = 0$; non-separable otherwise.
Active learning: hope that $h_n \to h^*$ as $n$ grows.
6
A Toy Example
Setting: $\mathcal{X}$ is one-dimensional; $\mathcal{Y}$ is binary; $\mathcal{H}$ is the set of linear separators (thresholds); the data are linearly separable; $\epsilon$ is a pre-specified error rate.
Passive learning: needs $O(1/\epsilon)$ randomly labeled points.
Binary search: needs $O(\log(1/\epsilon))$ labeled examples (sketched below).
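A sketch of the binary-search argument on a sorted unlabeled pool, assuming separability and a hypothetical `oracle` returning -1 left of the true threshold and +1 at or right of it.

```python
import numpy as np

def learn_threshold(x_pool, oracle):
    """Binary search for a consistent 1-D threshold using O(log n) label queries.

    Passive learning would need O(1/eps) random labels for target error eps.
    Assumes the pool straddles the true threshold (leftmost point -1, rightmost +1).
    """
    xs = np.sort(np.asarray(x_pool, dtype=float))
    lo, hi = 0, len(xs) - 1
    queries = 2
    assert oracle(xs[lo]) == -1 and oracle(xs[hi]) == +1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(xs[mid]) == +1:
            hi = mid                        # threshold is at or left of xs[mid]
        else:
            lo = mid                        # threshold is right of xs[mid]
    return (xs[lo] + xs[hi]) / 2, queries   # any value in (xs[lo], xs[hi]] is consistent
```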
7
Separable Case
$\mathcal{H}_1 = \mathcal{H}$
For $t = 1, 2, \ldots$:
  Receive an unlabeled point $x_t$;
  If the hypotheses in $\mathcal{H}_t$ disagree about $x_t$'s label: query the label $y_t$ of $x_t$; $\mathcal{H}_{t+1} = \{h \in \mathcal{H}_t : h(x_t) = y_t\}$.
  Else: $\mathcal{H}_{t+1} = \mathcal{H}_t$.
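For the 1-D threshold class of the toy example, the version space $\mathcal{H}_t$ is always an interval, so the disagreement test is an interval-membership check; a sketch under that assumption (with a hypothetical `oracle` as above):

```python
def cal_thresholds(stream, oracle, lo=0.0, hi=1.0):
    """Disagreement-based selective sampling for 1-D thresholds.

    The version space is the interval (lo, hi] of thresholds consistent with all
    queried labels.  A point is queried only if it falls inside that interval,
    i.e. only if hypotheses in the current version space disagree on its label.
    """
    queries = 0
    for x in stream:                    # unlabeled points arrive one at a time
        if lo < x < hi:                 # disagreement region: ask the labeler
            queries += 1
            if oracle(x) == +1:
                hi = x                  # every consistent threshold is at or left of x
            else:
                lo = x                  # every consistent threshold is right of x
        # else: all hypotheses in the version space agree, so the label is inferred for free
    return (lo + hi) / 2, queries
```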
8
Separable Case
9
Separable Case: Label Complexity
How many labels are needed?
$L(\epsilon, \delta) = t_\epsilon$ s.t. for all $t \ge t_\epsilon$, $P(\exists h \in \mathcal{H}_t : err(h) > \epsilon) \le \delta$.
For the disagreement-based algorithm above: $L(\epsilon, \delta) \le \tilde{O}(\theta\, d \log(1/\epsilon))$, where
- $\tilde{O}$ suppresses terms logarithmic in $\theta$, $d$ and $\log(1/\epsilon)$;
- $d$ is the VC dimension of the hypothesis space;
- $\theta$ is a constant called the disagreement coefficient (discussed later).
In contrast, the label complexity of passive learning is $\Omega(d/\epsilon)$, so active learning yields an exponential improvement.
10
Separable Case vs. Non-separable Case
Both start with $\mathcal{H}_1 = \mathcal{H}$ and, for $t = 1, 2, \ldots$, receive an unlabeled point $x_t$.
Separable case:
  If the hypotheses in $\mathcal{H}_t$ disagree about $x_t$'s label: query the label $y_t$ of $x_t$; $\mathcal{H}_{t+1} = \{h \in \mathcal{H}_t : h(x_t) = y_t\}$.
  Else: $\mathcal{H}_{t+1} = \mathcal{H}_t$.
Non-separable case:
  If the hypotheses in $\mathcal{H}_t$ disagree about $x_t$'s label: query the label $y_t$ of $x_t$.
  Else: infer the label $y_t$ of $x_t$ ($y_t$ may be wrong).
  $\mathcal{H}_{t+1} = \{h \in \mathcal{H}_t : \widehat{err}(h) \le \widehat{err}(\hat{h}_t) + \Delta_t\}$, where $\hat{h}_t$ is the current empirical-error minimizer and $\Delta_t$ a slack term.
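A sketch of the non-separable update over a finite hypothesis set, assuming each hypothesis is a callable and the slack $\Delta_t$ is supplied by the caller (in the analysis it is a generalization-bound term); this illustrates the pruning rule only, not the full agnostic algorithm.

```python
import numpy as np

def disagree(hypotheses, x):
    """True if the surviving hypotheses disagree about x's label (query x in that case)."""
    return len({h(x) for h in hypotheses}) > 1

def prune_version_space(hypotheses, X_labeled, y_labeled, delta_t):
    """Keep hypotheses whose empirical error is within delta_t of the best one."""
    y_labeled = np.asarray(y_labeled)
    emp_err = np.array([np.mean(np.array([h(x) for x in X_labeled]) != y_labeled)
                        for h in hypotheses])
    best = emp_err.min()
    return [h for h, e in zip(hypotheses, emp_err) if e <= best + delta_t]
```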
11
Separable Case vs. Non-separable Case: Label Complexity
Separable case:
  $L(\epsilon, \delta) = t_\epsilon$ s.t. for all $t \ge t_\epsilon$, $P(\exists h \in \mathcal{H}_t : err(h) > \epsilon) \le \delta$.
  Active learning: $L(\epsilon, \delta) \le \tilde{O}(\theta\, d \log(1/\epsilon))$.  Passive learning: $\Omega(d/\epsilon)$.
Non-separable case:
  $L(\epsilon, \delta) = t_\epsilon$ s.t. for all $t \ge t_\epsilon$, $P(\exists h \in \mathcal{H}_t : err(h) > \epsilon + \nu) \le \delta$, where $\nu = err(h^*)$.
  Active learning: $L(\epsilon, \delta) \le \tilde{O}(\theta\, d\, (\log^2(1/\epsilon) + \nu^2/\epsilon^2))$.  Passive learning: $\Omega(d/\epsilon + d\nu^2/\epsilon^2)$.
12
The Disagreement Coefficient $\theta$
$\rho(h, h^*) = P_D[h(X) \ne h^*(X)]$
$B(h^*, r) = \{h \in \mathcal{H} : \rho(h, h^*) \le r\}$
$DIS(B(h^*, r)) = \{x \in \mathcal{X} : \exists h, h' \in B(h^*, r) \text{ s.t. } h(x) \ne h'(x)\}$
$\theta = \sup_{r > 0} \frac{P_D[DIS(B(h^*, r))]}{r}$
This coefficient measures how $P_D[DIS(B(h^*, r))]$ scales with $r$.
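A Monte Carlo sketch of $\theta$ for the 1-D threshold class under a uniform marginal on [0, 1] (an assumption made purely for illustration); for this class the disagreement region of $B(h^*, r)$ is the interval $(t^* - r, t^* + r)$, so the estimate comes out close to the known value 2.

```python
import numpy as np

def theta_thresholds(t_star=0.5, radii=(0.01, 0.05, 0.1, 0.2), n_samples=200_000, seed=0):
    """Estimate theta = sup_r P[DIS(B(h*, r))] / r for 1-D thresholds, uniform X on [0, 1].

    rho(h_t, h_{t*}) = P[h_t(X) != h_{t*}(X)] = |t - t*|, so B(h*, r) contains the
    thresholds within distance r of t*, and its disagreement region is (t* - r, t* + r).
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n_samples)
    ratios = [np.mean((x > t_star - r) & (x < t_star + r)) / r for r in radii]
    return max(ratios)                  # sup over the sampled radii (close to 2 here)
```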
13
Active Clinical Trials for Personalized Medicine
Stanislav Minsker, Ying-Qi Zhao, and Guang Cheng. Journal of the American Statistical Association, 2016.
14
Introduction
Individualized treatment rules (ITRs)
- Find the optimal treatment for each patient. A patient's characteristics include demographics, medical history, genetic or genomic information, etc.
Randomized clinical trials (RCTs)
- Focus only on the efficacy of treatments.
- Not efficient, as the required sample size (and hence cost) can be large.
Active clinical trials (ACTs)
- Exclude patients for whom the benefit of some treatment is already clear; select those whose optimal treatment is hard to determine.
- An instance of uncertainty sampling.
15
Problem Setting
Data $(X, A, R)$ have a joint distribution $P$.
- $X \in \mathbb{R}^p$, representing a patient's covariates;
- $A \in \{1, -1\}$, representing a treatment decision: standard or alternative;
- $R \in \mathbb{R}$, a treatment outcome; the larger, the better.
An ITR $D: \mathcal{X} \to \mathcal{A}$ is a binary mapping. The optimal ITR is $D^*(x) = \mathrm{sign}\{f^*(x)\}$, where
$f^*(x) = E[R \mid A = 1, X = x] - E[R \mid A = -1, X = x]$
is called the contrast function. It defines the optimal decision boundary and can be modeled with a regression model (a plug-in sketch follows).
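A minimal plug-in sketch of the regression modeling mentioned above: fit one regression of the outcome on the covariates within each treatment arm and take the difference. This is a generic illustration (a T-learner-style estimator with a scikit-learn random forest), not the paper's smoothed kernel estimator.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_contrast(X, A, R):
    """Plug-in estimate of f*(x) = E[R | A=1, X=x] - E[R | A=-1, X=x].

    X : (n, p) covariates; A : (n,) treatments in {+1, -1}; R : (n,) outcomes (larger is better).
    Returns the estimated contrast f_hat and the induced ITR D_hat(x) = sign(f_hat(x)).
    """
    X, A, R = np.asarray(X), np.asarray(A), np.asarray(R)
    m_plus = RandomForestRegressor(random_state=0).fit(X[A == 1], R[A == 1])
    m_minus = RandomForestRegressor(random_state=0).fit(X[A == -1], R[A == -1])

    def f_hat(x_new):
        x_new = np.atleast_2d(x_new)
        return m_plus.predict(x_new) - m_minus.predict(x_new)

    def itr(x_new):
        return np.sign(f_hat(x_new))

    return f_hat, itr
```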
16
Active Clinical Trials
Its key idea is to find, in each iteration, the active set: the set of patients whose optimal treatment is still uncertain.
17
Active Clinical Trials
Step 1: Initialization
- Randomly recruit patients and randomly treat them;
- Observe their outcomes (these become the labeled data);
- Train the initial estimators.
Step 2: Active learning (a schematic loop is sketched below)
- Find the active-set patients;
- Randomly treat them;
- Observe their outcomes;
- Update the estimators;
- Repeat until the budget runs out, then output the final estimator.
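A schematic of the two steps above, assuming hypothetical helpers `recruit`, `treat_and_observe`, `estimate_contrast`, and `find_active_set` (one possible `find_active_set` is sketched on the Active Set slide below); it mirrors the structure of the procedure, not the paper's exact implementation.

```python
import numpy as np

def active_clinical_trial(recruit, treat_and_observe, estimate_contrast, find_active_set,
                          n_init=50, batch_size=20, budget=500):
    """Schematic active clinical trial (uncertainty-sampling flavour).

    recruit(n)             -> covariates of n candidate patients (hypothetical helper)
    treat_and_observe(X)   -> randomly assigned treatments A and observed outcomes R
    estimate_contrast(...) -> estimated contrast function f_hat
    find_active_set(X, f)  -> rows of X whose optimal treatment is still uncertain under f
    """
    # Step 1: initialization -- recruit, randomize, observe, and fit the initial estimator.
    X = recruit(n_init)
    A, R = treat_and_observe(X)
    f_hat = estimate_contrast(X, A, R)
    spent = n_init

    # Step 2: active learning -- only enroll patients whose optimal treatment is uncertain.
    while spent < budget:
        candidates = recruit(5 * batch_size)           # screen more candidates than we enroll
        active = find_active_set(candidates, f_hat)    # exclude clear-cut cases
        if len(active) == 0:
            break
        batch = active[:batch_size]
        A_new, R_new = treat_and_observe(batch)        # randomize treatment within the active set
        X = np.vstack([X, batch])
        A, R = np.concatenate([A, A_new]), np.concatenate([R, R_new])
        f_hat = estimate_contrast(X, A, R)             # update the estimator
        spent += len(batch)

    return f_hat
```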
18
Active Set Define πΉ(π,πΏ) to be a set of hypotheses that are πΏ-close to π. πΉ π,πΏ ={π: πβπ β β€πΏ} For each iteration: Find πΉ( π π‘β1 ,πΏ); Active Set π΄π π‘ ={π₯:β π 1 , π 2 βπΉ( π π‘β1 ,πΏ), sign( π 2 )β π πππ( π 1 )} Approximate π΄π π‘ with a regular set πππ‘ π‘ Its purpose is to determine the active set based on intrinsic dimensions.
19
Smoothed Kernel Estimator
20
Kernel Bandwidth
21
Kernel Bandwidth
22
Theoretical Bound
With probability greater than , it holds:
- $C$ is a constant depending on the kernel and the distribution of $X$;
- $n$ is the total number of patients recruited;
- $d$ is the intrinsic dimension, and $\gamma \in [1, d]$ is a constant.
23
Real Data Analysis
Two data sets:
- Nefazodone-CBASP Clinical Trial
- Twelve-Step Intervention on Stimulant Drug Use
Methods:
- AL-BV: active learning with the smoothed kernel estimator (non-parametric)
- AL-GP: active learning with a Gaussian process (parametric)
- OWS: passive learning with outcome-weighted learning (hinge loss)
- OLS: passive learning with ordinary least squares loss
24
Nefazodone-CBASP Clinical Trials
Patient predictors: the baseline HRSD (Hamilton Rating Scale for Depression) score, alcohol dependence, and the HAMA somatic anxiety score.
Three treatments: Nefazodone, CBASP, and the combination of both.
Outcome: the HRSD score; the higher, the worse.
25
Nefazodone-CBASP Clinical Trials
26
Twelve-Step Intervention on Stimulant Drug Use
Patient predictors: age, average number of days per month of self-reported stimulant drug use in the 3 months prior to randomization, baseline alcohol use, and the drug use, employment status, medical status, and psychiatric status composite scores on the Addiction Severity Index (ASI).
Treatments (aimed at reducing stimulant drug use): treatment as usual (TAU), or TAU integrated with Stimulant Abuser Groups to Engage in 12-Step intervention.
Outcome: the number of days of self-reported stimulant drug use over the 3- to 6-month post-randomization period, where a smaller value is preferable.
27
Twelve-Step Intervention on Stimulant Drug Use
28
Thank You!