DUAL STRATEGY ACTIVE LEARNING
Presenter: Pinar Donmez (1)
Joint work with Jaime G. Carbonell (1) and Paul N. Bennett (2)
(1) Language Technologies Institute, Carnegie Mellon University
(2) Microsoft Research
Active Learning (Pool-based)
[diagram: the learning mechanism draws unlabeled data from the data source, sends label requests to an expert, adds the returned labeled data to its training set, learns a new model, and delivers output to the user]
Why Learn Actively?
- Billions of data points are waiting to be labeled; e.g., labeling articles/books with topics takes human time and effort
- The amount of textual media is growing fast (over ~1 billion new web pages are added every year), and new topics keep emerging => models must be re-trained again and again
- Large unlabeled datasets are often cheap to obtain
- Obtaining large LABELED datasets is expensive in time and money
- Running times on very large labeled datasets can be impractical
Two Different Trends in Active Learning
- Uncertainty Sampling: selects the example with the lowest certainty, e.g., the one closest to the decision boundary, or with maximum entropy, ...
- Density-based Sampling: considers the underlying data distribution; selects representatives of large clusters; aims to cover the input space quickly (e.g., representative sampling, active learning using pre-clustering)
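The two scoring styles can be sketched as follows. The entropy-based uncertainty score and the RBF-similarity density score below are illustrative choices (not the specific scores used later in the talk), applied to a toy 1-D pool:

```python
import math

def uncertainty_score(p_pos):
    """Binary entropy of P(y=1|x): maximal (1 bit) at p=0.5, i.e. at the boundary."""
    if p_pos in (0.0, 1.0):
        return 0.0
    return -(p_pos * math.log2(p_pos) + (1 - p_pos) * math.log2(1 - p_pos))

def density_score(x, pool, bandwidth=1.0):
    """Average RBF similarity of x to the pool: high inside dense regions."""
    return sum(math.exp(-((x - z) ** 2) / (2 * bandwidth ** 2)) for z in pool) / len(pool)

pool = [0.0, 0.1, 0.2, 5.0]                      # a dense cluster near 0, one outlier
probs = {0.0: 0.9, 0.1: 0.8, 0.2: 0.7, 5.0: 0.51}  # hypothetical P(y=1|x) per point

# Uncertainty sampling picks the point nearest the boundary (the outlier here) ...
most_uncertain = max(pool, key=lambda x: uncertainty_score(probs[x]))
# ... while density-based sampling picks a representative of the dense cluster.
densest = max(pool, key=lambda x: density_score(x, pool))
```

Note how the two criteria disagree on this pool: the most uncertain point is the isolated one, while the densest point sits in the middle of the cluster.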
Goal of this Work
- Find an active learning method that works well everywhere
- Some methods work best when very few instances have been sampled (density-based sampling); others work best after substantial sampling (uncertainty sampling)
- Combine the best of both worlds for superior performance
Main Features of DUAL
DUAL
- is dynamic rather than static
- is context-sensitive
- builds upon "Active Learning Using Pre-Clustering" (Nguyen & Smeulders, 2004)
- proposes a mixture model of density and uncertainty
DUAL's primary focus is to
- outperform static strategies over a large operating range
- improve learning in the later iterations rather than concentrating on the initial data labeling
Related Work

                        DUAL | AL with Pre-Clustering | Representative Sampling | COMB
Clustering               Yes |          Yes           |           Yes           |  No
Uncertainty + Density    Yes |          Yes           |           No            |  No
Dynamic                  Yes |          No            |           No            |  Yes
Active Learning with Pre-Clustering
We call it Density-Weighted Uncertainty Sampling (DWUS for short). Why?
- assumes a hidden clustering structure of the data
- calculates the posterior as P(y|x) = Σ_k P(y|k) P(k|x), since x and y are conditionally independent given cluster k (points in one cluster are assumed to share the same label)
- selection criterion (Eq. 1): x_s = argmax_x E[(ŷ − y)² | x] · p(x), i.e. an uncertainty score times a density score
Outline of DWUS
1. Cluster the data using the K-medoid algorithm to find the cluster centroids c_k
2. Estimate P(k|x) by a standard EM procedure
3. Model P(y|k) as a logistic regression classifier
4. Estimate P(y|x) using P(y|x) = Σ_k P(y|k) P(k|x)
5. Select an unlabeled instance using Eq. 1
6. Update the parameters of the logistic regression model (hence update P(y|k))
7. Repeat steps 3-5 until the stopping criterion is met
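The selection step (4-5) can be sketched as follows, assuming steps 1-3 have already produced the cluster posteriors P(k|x), the class models P(y|k), and per-point density scores; all numbers below are illustrative:

```python
# Illustrative, pre-computed quantities for 3 unlabeled points and 2 clusters.
P_k_given_x = {              # P(k | x), from the EM procedure (step 2)
    "x1": [0.9, 0.1],
    "x2": [0.5, 0.5],
    "x3": [0.1, 0.9],
}
P_y1_given_k = [0.95, 0.2]   # P(y=1 | k), from the logistic regression model (step 3)
density = {"x1": 0.8, "x2": 0.3, "x3": 0.7}   # p(x), density score per point

def posterior(x):
    """Step 4: P(y=1|x) = sum_k P(y=1|k) P(k|x)."""
    return sum(pk * py for pk, py in zip(P_k_given_x[x], P_y1_given_k))

def expected_error(x):
    """Uncertainty score: expected error of predicting the likelier class."""
    p = posterior(x)
    return min(p, 1 - p)

# Step 5 (Eq. 1): pick the point maximizing uncertainty * density.
chosen = max(density, key=lambda x: expected_error(x) * density[x])
```

Here the moderately uncertain but dense point wins over the maximally uncertain but sparse one, which is exactly the density-weighting effect.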
Notes on DWUS
- Posterior class distribution: P(y|x) = Σ_k P(y|k) P(k|x)
- P(k|x) is estimated by an EM procedure after the clustering
- p(x|k) is a multivariate Gaussian with the same σ for all clusters
- The parameters of the logistic regression model for P(y|k) are estimated by maximizing the likelihood on the labeled data
Motivation for DUAL
Strength of DWUS:
- favors higher-density samples close to the decision boundary => fast decrease in error
But DWUS exhibits diminishing returns! Why?
- Early iterations: many points are highly uncertain
- Later iterations: points with high uncertainty are no longer in dense regions
- DWUS wastes time picking instances with no direct effect on the error
How does DUAL do better?
- Runs DWUS until it estimates a cross-over: it monitors the change in expected error at each iteration to detect when DWUS is stuck in a local minimum
- After the cross-over (saturation) point, DUAL uses a mixture model of the two strategies
- The goal is to minimize the expected future error: if we knew the future error of Uncertainty Sampling (US) to be zero, we would force the mixture weight entirely onto the uncertainty score; in practice, we do not know it
More on DUAL
- After the cross-over, US does better => the uncertainty score should be given more weight
- That weight should reflect how well US performs; it can be calculated from the expected error of US on the unlabeled data*
- This yields the selection criterion for DUAL: x_s = argmax_x [ π̂ · E[(ŷ − y)² | x] + (1 − π̂) · E[(ŷ − y)² | x] · p(x) ], where π̂ grows as the estimated error of US shrinks
* US is allowed to choose data only from among the already sampled instances, and its error is calculated on the remaining unlabeled set
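A minimal sketch of the mixed criterion on two hypothetical points; the particular form of the weight pi (here 1 minus the estimated US error) is an assumption for illustration:

```python
def dual_score(uncertainty, density, pi):
    """DUAL mixture: pi weights pure uncertainty, (1 - pi) the density-weighted term."""
    return pi * uncertainty + (1 - pi) * uncertainty * density

# Hypothetical per-point (uncertainty, density) scores after the cross-over point.
points = {"a": (0.45, 0.2), "b": (0.30, 0.9)}

est_us_error = 0.1          # expected error of US on the remaining unlabeled set
pi = 1.0 - est_us_error     # assumption: low US error -> high weight on uncertainty

chosen = max(points, key=lambda x: dual_score(*points[x], pi))
# With a small pi the density-weighted term dominates instead:
chosen_low_pi = max(points, key=lambda x: dual_score(*points[x], 0.1))
```

The same two points yield opposite choices depending on pi, which is how DUAL shifts smoothly from density-weighted selection toward pure uncertainty sampling as US improves.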
A Simple Illustration I [figure: 2-D scatter of class-1 and class-2 points]
A Simple Illustration II [figure: 2-D scatter of class-1 and class-2 points]
A Simple Illustration III [figure: 2-D scatter of class-1 and class-2 points]
A Simple Illustration IV [figure: 2-D scatter of class-1 and class-2 points]
Experiments
- Initial training set size: 0.4% of the entire data (n+ = n−)
- Results are averaged over 4 runs; each run takes 100 iterations
- After the 40th iteration, DUAL outperforms*:
  - DWUS (p < 0.0001)
  - Representative Sampling (p < 0.0001) on all datasets
  - COMB (p < 0.0001) on 4 datasets, and p < 0.05 on Image and M-vs-N
  - US (p < 0.001) on 5 datasets
  - DS (p < 0.0001) on 5 datasets
* All significance results are based on a two-sided paired t-test on the classification error
Results: DUAL vs DWUS
Results: DUAL vs US
Results: DUAL vs DS
Results: DUAL vs COMB
Results: DUAL vs Representative Sampling
Failure Analysis
- The current estimate of the cross-over point is not accurate on the V-vs-Y dataset => simulate a better error estimator
- Currently, DUAL considers only the performance of US; but on Splice, DS is better => modify the selection criterion to weight whichever strategy currently performs best
Conclusion
- DUAL robustly combines density and uncertainty (and can be generalized to other active sampling methods that exhibit differential performance)
- DUAL leads to more effective performance than the individual strategies
- DUAL shows that the error of one method can be estimated using the data labeled by the other
- DUAL can be applied to multi-class problems, where the error is estimated either globally or at the class or instance level
Future Work
- Generalize DUAL to estimate which method is currently dominant, or use a relative-success weight
- Apply DUAL to more than two strategies to maximize the diversity of an ensemble
- Investigate better techniques to estimate the future classification error
THANK YOU!
- The error expectation for a given point x: E[(ŷ − y)² | x]
- Data density is estimated as a mixture of K Gaussians: p(x) = Σ_k P(k) p(x|k)
- An EM procedure estimates the cluster priors P(k)
- Likelihood: Π_i Σ_k P(k) p(x_i | k)
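The Gaussian-mixture density estimate can be sketched in 1-D as follows, with illustrative priors and centroids standing in for the EM and K-medoid output, and the shared σ noted on the DWUS slides:

```python
import math

def gaussian(x, mu, sigma):
    """1-D Gaussian density N(x; mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(x, priors, centroids, sigma):
    """p(x) = sum_k P(k) p(x|k), with the same sigma for all clusters."""
    return sum(pk * gaussian(x, ck, sigma) for pk, ck in zip(priors, centroids))

priors = [0.7, 0.3]       # P(k), mixture weights (would come from the EM procedure)
centroids = [0.0, 5.0]    # cluster centroids c_k (would come from K-medoid)

# Density is high near a heavy cluster centroid and low between clusters.
p_near = mixture_density(0.1, priors, centroids, sigma=1.0)
p_far = mixture_density(2.5, priors, centroids, sigma=1.0)
```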
Related Work
- Active Learning Using Pre-Clustering, Nguyen and Smeulders (ICML 2004): uniform combination of uncertainty and density; we use weighted scoring
- Representative Sampling, Xu et al. (ECIR 2003): selects cluster centroids in the SVM margin; only applicable in an SVM framework
- Online Choice of Active Learning Algorithms (COMB), Baram et al. (ICML 2003): decides which sampling method is optimal; we decide the optimal operating range for the sampling methods
Supervised Learning (Passive)
[diagram: the learning mechanism trains only on labeled data obtained from the data source via an expert and delivers output to the user; unlabeled data is left unused]
Semi-Supervised Learning (Passive)
[diagram: the learning mechanism trains on labeled data and additionally exploits the unlabeled data, then delivers output to the user]
Active Learning
1. Trains on an initially small training set
2. Chooses the most useful examples
3. Requests the labels of the chosen data
4. Aggregates the training data with the newly added examples and re-trains
5. Stops either (i) when a maximum number of labeling requests is reached, or (ii) when a desired performance level is reached
Goal I: make as few requests as possible
Goal II: achieve high performance with a small amount of labeled data
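The five steps above can be sketched as a generic pool-based loop; `oracle`, `train`, and `acquire` are placeholder callables supplied by the user, and the toy usage below just memorizes labels:

```python
import random

def active_learning_loop(pool, oracle, train, acquire, budget, target_acc=None, test=None):
    """Generic pool-based active learning: seed, then pick/label/re-train until stopping."""
    labeled = {}
    for x in random.sample(list(pool), 2):    # step 1: small initial training set
        labeled[x] = oracle(x)
        pool.remove(x)
    model = train(labeled)
    for _ in range(budget):                   # stop (i): max number of label requests
        x = acquire(model, pool)              # step 2: choose the most useful example
        labeled[x] = oracle(x)                # step 3: request its label
        pool.remove(x)
        model = train(labeled)                # step 4: aggregate and re-train
        if target_acc and test and test(model) >= target_acc:
            break                             # stop (ii): desired performance reached
    return model, labeled

# Toy usage: learn the sign of 1-D points; 'train' memorizes, 'acquire' picks at random.
random.seed(0)
pool = list(range(-5, 6))
model, labeled = active_learning_loop(
    pool,
    oracle=lambda x: x >= 0,
    train=lambda L: dict(L),
    acquire=lambda m, p: random.choice(p),
    budget=4,
)
```

A real acquisition function would plug in one of the scores from the earlier slides (uncertainty, density, or the DUAL mixture) instead of random choice.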