Zhipeng (Patrick) Luo December 6th, 2016


1 Zhipeng (Patrick) Luo December 6th, 2016
Active Learning Zhipeng (Patrick) Luo December 6th, 2016

2 Motivation
Labeling cost in supervised learning can be huge. Standard supervised learning, where the learner has no control over which examples are labeled, is called passive learning. Active learning aims to find a good mapping without using too much labeled data.

3 Active Learning Paradigm
Iterative learning framework:
- Start with initial labeled data.
- Repeat:
  - Fit a classifier on the current labeled data;
  - Actively select the most informative unlabeled point according to the current classifier;
  - Obtain its label from a human labeler.
Many different querying strategies exist, based on:
- Uncertainty
- Novelty (redundancy)
- Representativeness
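The iterative framework can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are not in the slides: the data are one-dimensional and separable, the classifier is a threshold, and "most informative" is read as "closest to the current decision boundary" (uncertainty sampling). The names `fit_threshold`, `active_learning_loop` and `oracle` are hypothetical.

```python
def fit_threshold(labeled):
    """Fit a 1-D threshold classifier that predicts +1 when x >= threshold.

    Assumes the labeled set is separable and contains both classes.
    """
    positives = [x for x, y in labeled if y == 1]
    negatives = [x for x, y in labeled if y == -1]
    # Place the boundary halfway between the two classes.
    return (max(negatives) + min(positives)) / 2.0


def active_learning_loop(pool, oracle, initial, budget):
    """Pool-based active learning with uncertainty sampling."""
    # Start with initial labeled data.
    labeled = [(x, oracle(x)) for x in initial]
    unlabeled = [x for x in pool if x not in initial]
    for _ in range(budget):
        if not unlabeled:
            break
        # Fit a classifier on the current labeled data.
        threshold = fit_threshold(labeled)
        # Select the unlabeled point closest to the decision boundary.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        unlabeled.remove(query)
        # Obtain its label from the (simulated) human labeler.
        labeled.append((query, oracle(query)))
    return fit_threshold(labeled), len(labeled)


# Simulated labeler with a true boundary at 0.37 (an arbitrary choice).
pool = [i / 1000 for i in range(1001)]
threshold, n_labels = active_learning_loop(
    pool, lambda x: 1 if x >= 0.37 else -1, initial=[0.0, 1.0], budget=12
)
print(threshold, n_labels)
```

In a real application the oracle is a human annotator and the classifier is whatever model is being trained; only the loop structure carries over.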

4 Three Scenarios of Active Learning
- Pool-based: pick one point from an unlabeled data pool, typically by uncertainty:
  - closest to the decision boundary;
  - highest entropy in the predicted label.
- Query synthesis: aggressively search the unlabeled data space for a data point to query. The synthesized point can be unrealistic for a human to label.
- Sequential labeling: unlabeled data arrive sequentially, and the learner decides for each point whether to query its label (mellow learner).
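The two uncertainty criteria named for the pool-based scenario can be written down directly. The sketch below assumes the classifier exposes either predicted class probabilities or a signed score whose magnitude grows with distance from the decision boundary; the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain_by_entropy(pool_probs):
    """Index of the pool point whose predicted label has the highest entropy."""
    return max(range(len(pool_probs)), key=lambda i: entropy(pool_probs[i]))


def closest_to_boundary(scores):
    """Index of the pool point whose signed score is closest to zero."""
    return min(range(len(scores)), key=lambda i: abs(scores[i]))


# Three pool points with predicted class probabilities and signed scores.
pool_probs = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]
scores = [2.0, -0.1, 0.7]
print(most_uncertain_by_entropy(pool_probs), closest_to_boundary(scores))
```

For binary problems the two criteria usually agree, since the predicted probability is a monotone function of the score.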

5 Notations
- The unlabeled data space: $\mathbf{X}$
- The label space: a finite set $\mathbf{Y}$
- A mapping: $h: \mathbf{X} \to \mathbf{Y}$, $h \in \mathbf{H}$
- $h_n$: a hypothesis learned after seeing $n$ examples
- A pre-specified hypothesis space: $\mathbf{H}$
- An underlying distribution over $\mathbf{X} \times \mathbf{Y}$: $\mathbf{P}$
- Error of a hypothesis $h$: $\mathrm{err}(h) = \mathbf{P}[h(X) \neq Y]$
- The optimal hypothesis: $h^* = \arg\min_{h \in \mathbf{H}} \mathrm{err}(h)$
- Separable case: $\mathrm{err}(h^*) = 0$; non-separable otherwise
- Active learning: hope that $h_n \to h^*$ as $n$ grows.

6 A Toy Example
Setting: $\mathbf{X}$ is one-dimensional; $\mathbf{Y}$ is binary; $\mathbf{H}$ is the set of linear separators (thresholds); the data are linearly separable; $\epsilon$ is a pre-defined error rate.
- Passive learning: $O(1/\epsilon)$ randomly labeled points.
- Binary search: $O(\log 1/\epsilon)$ examples.
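The binary-search bound can be checked with a short sketch. It assumes the points are uniformly distributed on $[0, 1]$ (so the error of a threshold is its distance from the true one) and that the learner may query any point; `binary_search_threshold` and `oracle` are hypothetical names.

```python
def binary_search_threshold(oracle, epsilon, lo=0.0, hi=1.0):
    """Locate a 1-D threshold to within epsilon by querying labels.

    Assumes oracle(x) returns +1 for x at or above the true threshold
    and -1 below it, with the threshold somewhere in (lo, hi].
    """
    queries = 0
    while hi - lo > epsilon:
        mid = (lo + hi) / 2.0
        queries += 1
        if oracle(mid) == 1:
            hi = mid  # threshold is at or below mid
        else:
            lo = mid  # threshold is above mid
    return (lo + hi) / 2.0, queries


estimate, queries = binary_search_threshold(
    lambda x: 1 if x >= 0.37 else -1, epsilon=0.001
)
print(estimate, queries)
```

Each query halves the interval that can contain the threshold, so the number of queries grows like $\log_2(1/\epsilon)$, whereas random labeling needs on the order of $1/\epsilon$ points before one lands within $\epsilon$ of the boundary.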

7 Separable Case
$\mathbf{H}_1 = \mathbf{H}$
For $t = 1, 2, \ldots$
  Receive an unlabeled point $x_t$;
  If there is disagreement in $\mathbf{H}_t$ about $x_t$'s label:
    Query the label $y_t$ of $x_t$;
    $\mathbf{H}_{t+1} = \{h \in \mathbf{H}_t : h(x_t) = y_t\}$.
  Else:
    $\mathbf{H}_{t+1} = \mathbf{H}_t$.
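For threshold classifiers the procedure above has a compact form, because the set of hypotheses consistent with the labels seen so far is an interval of thresholds. The sketch below is an illustration under that assumption (thresholds on $[0, 1]$, noiseless labels); `cal_thresholds` and `oracle` are hypothetical names.

```python
import random


def cal_thresholds(stream, oracle):
    """Disagreement-based sequential active learning for 1-D thresholds.

    Hypotheses are h_w(x) = +1 if x >= w else -1. The current hypothesis
    set is kept as the interval of thresholds lo < w <= hi.
    """
    lo, hi = 0.0, 1.0
    n_queries = 0
    for x in stream:
        if lo < x < hi:
            # Some remaining thresholds label x as +1 and others as -1.
            n_queries += 1
            if oracle(x) == 1:
                hi = x  # keep only thresholds w <= x
            else:
                lo = x  # keep only thresholds w > x
        # Otherwise every remaining hypothesis agrees on x: no query,
        # and the hypothesis set is unchanged.
    return (lo, hi), n_queries


rng = random.Random(0)
stream = [rng.random() for _ in range(10000)]
(lo, hi), n_queries = cal_thresholds(stream, lambda x: 1 if x >= 0.37 else -1)
print(lo, hi, n_queries)
```

Only points falling between `lo` and `hi` cost a label, and that interval shrinks as labels arrive, so most of the stream is never queried.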

8 Separable Case

9 Separable Case
Label complexity: how many labels are needed?
$L(\epsilon, \delta) = t_0$ such that for all $t \ge t_0$, $P[\exists h \in \mathbf{H}_t : \mathrm{err}(h) > \epsilon] \le \delta$.
- $L(\epsilon, \delta) \le \tilde{O}(\theta d \log 1/\epsilon)$
- $\tilde{O}$ suppresses terms logarithmic in $d$, $\theta$ and $\log 1/\epsilon$.
- $d$ is the VC dimension of the hypothesis space.
- $\theta$ is a constant called the disagreement coefficient (discussed later).
In contrast, the label complexity of passive learning is $\Omega(d/\epsilon)$, so active learning yields an exponential improvement in the dependence on $1/\epsilon$.

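10 Separable Case vs. Non-separable Case
Separable case:
$\mathbf{H}_1 = \mathbf{H}$
For $t = 1, 2, \ldots$
  Receive an unlabeled point $x_t$;
  If there exists disagreement about $x_t$'s label:
    Query the label $y_t$ of $x_t$;
    $\mathbf{H}_{t+1} = \{h \in \mathbf{H}_t : h(x_t) = y_t\}$.
  Else:
    $\mathbf{H}_{t+1} = \mathbf{H}_t$.
Non-separable case:
$\mathbf{H}_1 = \mathbf{H}$
For $t = 1, 2, \ldots$
  Receive an unlabeled point $x_t$;
  If there exists disagreement about $x_t$'s label:
    Query the label $y_t$ of $x_t$;
  Else:
    Infer the label $y_t$ of $x_t$;
  $y_t$ may be wrong.
  $\mathbf{H}_{t+1} = \{h \in \mathbf{H}_t : \mathrm{err}_e(h) \le \mathrm{err}_e(h_t^*) + \Delta_t\}$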

11 Separable Case vs. Non-separable Case: Label Complexity
Separable case:
- Label complexity: $L(\epsilon, \delta) = t_0$ such that for all $t \ge t_0$, $P[\exists h \in \mathbf{H}_t : \mathrm{err}(h) > \epsilon] \le \delta$
- Active learning: $L(\epsilon, \delta) \le \tilde{O}(\theta d \log 1/\epsilon)$
- Passive learning: $\Omega(d/\epsilon)$
Non-separable case:
- Label complexity: $L(\epsilon, \delta) = t_0$ such that for all $t \ge t_0$, $P[\exists h \in \mathbf{H}_t : \mathrm{err}(h) > \epsilon + \nu] \le \delta$, where $\nu = \mathrm{err}(h^*)$
- Active learning: $L(\epsilon, \delta) \le \tilde{O}\left(\theta d \left(\log^2 1/\epsilon + \nu^2/\epsilon^2\right)\right)$
- Passive learning: $\Omega\left(d/\epsilon + d\nu^2/\epsilon^2\right)$

12 The Disagreement Coefficient $\theta$
- $d(h, h^*) = \mathbf{P}_{\mathbf{X}}[h(X) \neq h^*(X)]$
- $B(h^*, r) = \{h \in \mathbf{H} : d(h, h^*) \le r\}$
- $\mathrm{DIS}(B(h^*, r)) = \{x \in \mathbf{X} : \exists h, h' \in B(h^*, r) \text{ s.t. } h(x) \neq h'(x)\}$
- $\theta = \sup_{r > 0} \mathbf{P}[\mathrm{DIS}(B(h^*, r))] / r$
This coefficient measures how $\mathbf{P}[\mathrm{DIS}(B(h^*, r))]$ scales with $r$.

13 Active Clinical Trials for Personalized Medicine
Stanislav Minsker, Ying-Qi Zhao, and Guang Cheng. Journal of the American Statistical Association, 2016.

14 Introduction
Individualized treatment rules (ITRs):
- Find the optimal treatment for each patient.
- Each patient's characteristics include demographics, medical history, genetic or genomic information, etc.
Randomized clinical trials (RCTs):
- Focus only on the efficacy of treatments.
- Not efficient, as the sample size (cost) can be large.
Active clinical trials (ACTs):
- Exclude patients for whom the benefit of some treatment is clear.
- Select those whose optimal treatment is hard to determine.
- An instance of uncertainty sampling.

15 Problem Setting
Data $(X, A, R)$ follow a joint distribution $\mathbf{P}$:
- $X \in \mathbb{R}^p$ represents a patient case;
- $A \in \{1, -1\}$ represents a treatment decision: standard or alternative;
- $R \in \mathbb{R}$ is the treatment outcome; the larger the better.
An ITR $D: X \to A$ is a binary mapping, with $D^*(x) = \mathrm{sign}\{f^*(x)\}$, where
$f^*(x) = E[R \mid A = 1, X = x] - E[R \mid A = -1, X = x]$
- is called the contrast function;
- defines the optimal decision boundary;
- can be modeled with a regression model.
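The contrast function can be estimated by fitting one regression per treatment arm and subtracting. The sketch below is only an illustration: it assumes a single covariate and a straight-line model per arm, which is not the smoothed kernel estimator used in the paper, and it assigns the boundary case $f(x) = 0$ to treatment 1 by convention. All function names and the simulated data are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def estimate_contrast(data):
    """Estimate f(x) = E[R | A=1, X=x] - E[R | A=-1, X=x] from (x, a, r) triples."""
    lines = {}
    for arm in (1, -1):
        xs = [x for x, a, _ in data if a == arm]
        rs = [r for _, a, r in data if a == arm]
        lines[arm] = fit_line(xs, rs)
    (b_pos, s_pos), (b_neg, s_neg) = lines[1], lines[-1]
    return lambda x: (b_pos + s_pos * x) - (b_neg + s_neg * x)


def treatment_rule(f_hat, x):
    """Estimated ITR: the sign of the estimated contrast at x."""
    return 1 if f_hat(x) >= 0 else -1


# Simulated trial: treatment 1 helps when x > 0.5 and hurts otherwise.
rng = random.Random(0)
data = []
for _ in range(400):
    x, a = rng.random(), rng.choice((1, -1))
    data.append((x, a, a * (x - 0.5) + rng.gauss(0, 0.1)))
f_hat = estimate_contrast(data)
print(treatment_rule(f_hat, 0.9), treatment_rule(f_hat, 0.1))
```

In this simulation the true contrast is $2(x - 0.5)$, so the estimated rule should switch treatments near $x = 0.5$.
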

16 Active Clinical Trials
The key idea is to find the active set in each iteration: the set of patients whose optimal treatment is still uncertain.

17 Active Clinical Trials
Step 1: Initialization
- Randomly recruit patients and randomly treat them;
- Observe their outcomes (labels);
- Train the initial estimators.
Step 2: Active learning
- Find the patients in the active set;
- Randomly treat them;
- Observe their outcomes;
- Update the estimators;
- Repeat until the budget runs out, then output the final estimator.
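The two steps can be simulated end to end. The sketch below is a rough illustration and makes several assumptions that are not in the slides: one covariate, a straight-line regression per treatment arm as the estimator, a fixed tolerance `delta`, and an active set taken to be the patients whose estimated contrast is within `delta` of zero. The names `recruit`, `outcome` and `active_clinical_trial` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def estimate_contrast(data):
    """Difference of the per-arm regression fits, from (x, a, r) triples."""
    fits = {}
    for arm in (1, -1):
        xs = [x for x, a, _ in data if a == arm]
        rs = [r for _, a, r in data if a == arm]
        fits[arm] = fit_line(xs, rs)
    (b_pos, s_pos), (b_neg, s_neg) = fits[1], fits[-1]
    return lambda x: (b_pos + s_pos * x) - (b_neg + s_neg * x)


def active_clinical_trial(recruit, outcome, n_init, n_rounds, batch, delta, rng):
    # Step 1: recruit at random, treat at random, observe, fit.
    data = []
    for _ in range(n_init):
        x, a = recruit(), rng.choice((1, -1))
        data.append((x, a, outcome(x, a)))
    f_hat = estimate_contrast(data)
    # Step 2: enroll only patients in the active set, then refit.
    for _ in range(n_rounds):
        enrolled, attempts = 0, 0
        while enrolled < batch and attempts < 100 * batch:
            attempts += 1
            x = recruit()
            if abs(f_hat(x)) > delta:
                continue  # the better treatment looks clear: exclude
            a = rng.choice((1, -1))
            data.append((x, a, outcome(x, a)))
            enrolled += 1
        f_hat = estimate_contrast(data)
    return f_hat, data


rng = random.Random(1)
f_hat, data = active_clinical_trial(
    recruit=lambda: rng.random(),
    outcome=lambda x, a: a * (x - 0.5) + rng.gauss(0, 0.1),
    n_init=40, n_rounds=5, batch=20, delta=0.2, rng=rng,
)
print(len(data), f_hat(0.9) > 0, f_hat(0.1) < 0)
```

Here the budget is expressed as a fixed number of rounds and a batch size per round; a real trial would also need a rule for how the tolerance changes between rounds.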

18 Active Set
Define $F(f, \delta)$ to be the set of hypotheses that are $\delta$-close to $f$:
$F(f, \delta) = \{g : \|g - f\|_\infty \le \delta\}$
For each iteration $t$:
- Find $F(f_{t-1}, \delta)$;
- Active set: $AS_t = \{x : \exists f_1, f_2 \in F(f_{t-1}, \delta),\ \mathrm{sign}(f_2(x)) \neq \mathrm{sign}(f_1(x))\}$;
- Approximate $AS_t$ with a regular set $act_t$. The purpose of this approximation is to determine the active set based on intrinsic dimensions.
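If the sup-norm ball is taken over all functions, without further restricting $g$ to a model class (an assumption made here), membership in the active set reduces to a pointwise check: the ball contains functions taking every value between $f_{t-1}(x) - \delta$ and $f_{t-1}(x) + \delta$ at $x$, so two of them can differ in sign exactly when that range straddles zero. The sketch below uses `<=` for the boundary case, which depends on the convention chosen for $\mathrm{sign}(0)$; `in_active_set` is a hypothetical name.

```python
def in_active_set(f_prev, x, delta):
    """True if functions within sup-norm delta of f_prev can disagree in sign at x."""
    return abs(f_prev(x)) <= delta


# Example: previous contrast estimate 2 * (x - 0.5) with tolerance 0.2.
f_prev = lambda x: 2.0 * (x - 0.5)
grid = [i / 100 for i in range(101)]
active = [x for x in grid if in_active_set(f_prev, x, 0.2)]
print(min(active), max(active))
```

Patients far from the estimated decision boundary fall outside the set, which is what allows a trial to exclude those for whom the better treatment is already clear.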

19 Smoothed Kernel Estimator

20 Kernel Bandwidth

21 Kernel Bandwidth

22 Theoretic Bound With probability greater than , it holds:
𝐢 is a constant depending on kernel and X distribution 𝑁 is the total number of patients recruited. 𝑑 is the intrinsic dimension cardinality and π›Ύβˆˆ 1,𝑑 a constant.

23 Real Data Analysis
Two data sets:
- Nefazodone-CBASP Clinical Trial
- Twelve-Step Intervention on Stimulant Drug Use
Methods:
- AL-BV: active learning with smoothed kernel (non-parametric)
- AL-GP: active learning with Gaussian Process (parametric)
- OWS: passive learning with outcome weighted learning (hinge loss)
- OLS: passive learning with ordinary least squares loss

24 Nefazodone-CBASP Clinical Trials
- Patient predictors: the baseline HRSD (Hamilton Rating Scale for Depression) scores, the alcohol dependence, and the HAMA somatic anxiety scores.
- Three treatments: Nefazodone, CBASP, and the combination of both.
- Outcome: HRSD score (the higher, the worse).

25 Nefazodone-CBASP Clinical Trials

26 Twelve-Step Intervention on Stimulant Drug Use
- Patient predictors: age; average number of days per month of self-reported stimulant drug use in the 3 months prior to randomization; baseline alcohol use, drug use, employment status, medical status, and psychiatric status composite scores on the Addiction Severity Index (ASI).
- Treatments (to reduce stimulant drug use): treatment as usual (TAU), or TAU integrated with the Stimulant Abuser Groups to Engage in 12-Step intervention.
- Outcome: the number of days of self-reported stimulant drug use over the 3- to 6-month post-randomization period; a smaller value is preferable.

27 Twelve-Step Intervention on Stimulant Drug Use

28 Thank You!

