
DUAL STRATEGY ACTIVE LEARNING
Presenter: Pinar Donmez¹
Joint work with Jaime G. Carbonell¹ & Paul N. Bennett²
¹ Language Technologies Institute, Carnegie Mellon University
² Microsoft Research

Active Learning (Pool-based)
[Diagram: the learning mechanism sends a label request for a selected unlabeled example to the expert, adds the returned labeled example to its labeled data, learns a new model, and outputs it to the user.]

Why Learn Actively?
- Billions of data items are waiting to be labeled; e.g. labeling articles/books with topics takes time and effort for humans
- The size of textual media is growing fast, e.g. over ~1 billion new web pages are added every year
- New topics will emerge => models must be re-trained again and again
- Large unlabeled data is often cheap to obtain
- Obtaining large LABELED data is expensive in time and money
- Running times on large datasets can be impractical

Two Different Trends in Active Learning
- Uncertainty Sampling: selects the example with the lowest certainty, i.e. closest to the boundary, maximum entropy, ...
- Density-based Sampling: considers the underlying data distribution; selects representatives of large clusters; aims to cover the input space quickly; e.g. representative sampling, active learning using pre-clustering, etc.
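A rough sketch of how the two selection rules differ, in Python (not code from this work; it assumes a scikit-learn-style classifier with predict_proba and a precomputed density estimate for each unlabeled point):

```python
# Minimal illustration of the two query strategies above.
import numpy as np

def uncertainty_sampling(model, X_unlabeled):
    """Pick the point the current model is least certain about (maximum entropy)."""
    probs = model.predict_proba(X_unlabeled)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return int(np.argmax(entropy))

def density_based_sampling(densities):
    """Pick a representative of a dense region, ignoring the current model.
    `densities[i]` is an externally estimated density p(x_i) (assumption)."""
    return int(np.argmax(densities))
```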

Goal of this Work
- Find an active learning method that works well everywhere
- Some methods work best when very few instances are sampled (i.e. density-based sampling)
- Some work best after substantial sampling (i.e. uncertainty sampling)
- Combine the best of both worlds for superior performance

Main Features of DUAL
DUAL:
- is dynamic rather than static
- is context-sensitive
- builds upon the work titled "Active Learning with Pre-Clustering" (Nguyen & Smeulders, 2004)
- proposes a mixture model of density and uncertainty
DUAL's primary focus is to:
- outperform static strategies over a large operating range
- improve learning in the later iterations rather than concentrating on the initial data labeling

Related Work
                        DUAL   AL with Pre-Clustering   Representative Sampling   COMB
Clustering              Yes    Yes                      Yes                       No
Uncertainty + Density   Yes    Yes                      Yes                       No
Dynamic                 Yes    No                       No                        Yes

Active Learning with Pre-Clustering
We call it Density Weighted Uncertainty Sampling (DWUS in short). Why?
- assumes a hidden clustering structure of the data
- calculates the posterior P(y|x) by exploiting the fact that x and y are conditionally independent given the cluster k, since points in one cluster are assumed to share the same label [1], [2]
- selection criterion: an uncertainty score multiplied by a density score [3]
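The equations marked [1]-[3] were rendered as images in the original deck; a hedged reconstruction following Nguyen & Smeulders (2004), where U is the unlabeled pool and ŷ the current prediction, is roughly:

$$P(y \mid x) = \sum_{k=1}^{K} P(y \mid k)\, P(k \mid x), \qquad \text{using } p(y, x \mid k) = P(y \mid k)\, p(x \mid k)$$

$$x_s = \operatorname*{argmax}_{x \in U} \; \underbrace{E\big[(\hat{y}-y)^2 \mid x\big]}_{\text{uncertainty score}} \cdot \underbrace{p(x)}_{\text{density score}}$$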

Outline of DWUS
1. Cluster the data using the K-medoid algorithm to find the cluster centroids c_k
2. Estimate P(k|x) by a standard EM procedure
3. Model P(y|k) as a logistic regression classifier
4. Estimate P(y|x) using the posterior decomposition P(y|x) = Σ_k P(y|k) P(k|x)
5. Select an unlabeled instance using the selection criterion equation
6. Update the parameters of the logistic regression model (hence update P(y|k))
7. Repeat steps 3-5 until the stopping criterion is met
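A high-level Python sketch of this loop (not the authors' implementation; `kmedoids_centroids`, `em_cluster_posteriors`, and `fit_cluster_logreg` are hypothetical helpers standing in for steps 1-3, and `density[i]` approximates p(x_i)):

```python
import numpy as np

def dwus_loop(X, oracle, n_clusters, n_iters, density):
    """Density-Weighted Uncertainty Sampling, following the outline above."""
    labeled_idx, labels = [], []
    centroids = kmedoids_centroids(X, n_clusters)          # step 1: cluster centroids c_k
    p_k_given_x = em_cluster_posteriors(X, centroids)      # step 2: P(k|x) via EM
    for _ in range(n_iters):
        # step 3: P(y=1|k) for each cluster, fit from the labeled data seen so far
        p_y_given_k = fit_cluster_logreg(centroids, labeled_idx, labels, p_k_given_x)
        p_y_given_x = p_k_given_x @ p_y_given_k            # step 4: P(y=1|x) = sum_k P(y=1|k) P(k|x)
        y_hat = (p_y_given_x > 0.5).astype(float)
        # expected error E[(y_hat - y)^2 | x]: probability the current prediction is wrong
        expected_err = (1 - y_hat) * p_y_given_x + y_hat * (1 - p_y_given_x)
        score = expected_err * density                     # step 5: uncertainty x density
        score[labeled_idx] = -np.inf                       # never re-select a labeled point
        pick = int(np.argmax(score))
        labeled_idx.append(pick)
        labels.append(oracle(pick))                        # query the expert for the label
    return labeled_idx, labels                             # steps 6-7 happen on the next pass
```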

Notes on DWUS
- Posterior class distribution: P(y|x) = Σ_k P(y|k) P(k|x)
- P(y|k) is calculated via the logistic regression model
- P(k|x) is estimated using an EM procedure after the clustering
- p(x|k) is a multivariate Gaussian with the same σ for all clusters
- The logistic regression model is used to estimate the parameters
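A hedged sketch of these component models (the exact parameterization in the original slides may differ; in particular, applying the logistic model to the cluster representative c_k is an assumption):

$$P(y = 1 \mid k) = \frac{1}{1 + \exp\!\big(-(w^{\top} c_k + b)\big)}$$

$$P(k \mid x) = \frac{P(k)\, p(x \mid k)}{\sum_{k'} P(k')\, p(x \mid k')}, \qquad p(x \mid k) = \mathcal{N}\!\big(x \mid c_k, \sigma^2 I\big)$$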

Motivation for DUAL
Strength of DWUS:
- favors higher-density samples close to the decision boundary
- fast decrease in error
But DWUS exhibits diminishing returns! Why?
- Early iterations -> many points are highly uncertain
- Later iterations -> points with high uncertainty are no longer in dense regions
- DWUS wastes time picking instances with no direct effect on the error

How does DUAL do better?
- Runs DWUS until it estimates a cross-over point
- Monitors the change in expected error at each iteration to detect when DWUS is stuck in a local minimum
- DUAL uses a mixture model after the cross-over (saturation) point
- Our goal should be to minimize the expected future error
- If we knew the future error of Uncertainty Sampling (US) to be zero, then we would force all the weight onto US
- But in practice, we do not know it

More on DUAL
- After the cross-over, US does better => the uncertainty score should be given more weight
- The weight should reflect how well US performs; it can be calculated from the expected error of US on the unlabeled data*
- Finally, we have the following selection criterion for DUAL:
* US is allowed to choose data only from among the already sampled instances, and its error is calculated on the remaining unlabeled set
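The criterion formula itself was an image on the slide; a hedged reconstruction, writing π̂ for the weight derived from the estimated error of US on the unlabeled data:

$$x_s = \operatorname*{argmax}_{x \in U} \; \hat{\pi}\; E\big[(\hat{y}-y)^2 \mid x\big] \;+\; (1-\hat{\pi})\; E\big[(\hat{y}-y)^2 \mid x\big]\, p(x)$$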

A simple Illustration I

A simple Illustration II

A simple Illustration III

A simple Illustration IV

Experiments
- Initial training set size: 0.4% of the entire data (n+ = n-)
- The results are averaged over 4 runs; each run takes 100 iterations
- DUAL outperforms:
  - DWUS with p< significance* after the 40th iteration
  - Representative Sampling (p<0.0001) on all datasets
  - COMB (p<0.0001) on 4 datasets, and p<0.05 on Image and M-vs-N
  - US (p<0.001) on 5 datasets
  - DS (p<0.0001) on 5 datasets
* All significance results are based on a 2-sided paired t-test on the classification error

Results: DUAL vs DWUS

Results: DUAL vs US

Results: DUAL vs DS

Results: DUAL vs COMB

Results: DUAL vs Representative S.

Failure Analysis
- The current estimate of the cross-over point is not accurate on the V-vs-Y dataset => simulate a better error estimator
- Currently, DUAL only considers the performance of US; but on Splice, DS is better => modify the selection criterion:
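As an illustration only (not necessarily the paper's exact modification), the criterion could let whichever strategy currently has the lower estimated error supply the first term, with s_best(x) denoting that strategy's score (e.g. the density p(x) when DS is doing better):

$$x_s = \operatorname*{argmax}_{x \in U} \; \hat{\pi}\, s_{\text{best}}(x) \;+\; (1-\hat{\pi})\; E\big[(\hat{y}-y)^2 \mid x\big]\, p(x)$$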

Conclusion
- DUAL robustly combines density and uncertainty (and can be generalized to other active sampling methods which exhibit differential performance)
- DUAL leads to more effective performance than the individual strategies
- DUAL shows that the error of one method can be estimated using the data labeled by the other
- DUAL can be applied to multi-class problems where the error is estimated either globally, or at the class or instance level

Future Work
- Generalize DUAL to estimate which method is currently dominant, or use a relative success weight
- Apply DUAL to more than two strategies to maximize the diversity of an ensemble
- Investigate better techniques to estimate the future classification error

THANK YOU!

The error expectation for a given point:
Data density is estimated as a mixture of K Gaussians:
EM procedure to estimate P(K):
Likelihood:
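These formulas were images in the original slides; a hedged reconstruction of what they most likely state, for binary labels y ∈ {0,1}:

$$E\big[(\hat{y}-y)^2 \mid x\big] = \sum_{y \in \{0,1\}} (\hat{y}-y)^2\, P(y \mid x) \qquad \text{(probability that the current prediction } \hat{y} \text{ is wrong)}$$

$$p(x) = \sum_{k=1}^{K} P(k)\, \mathcal{N}\!\big(x \mid c_k, \sigma^2 I\big)$$

$$\text{E-step: } P(k \mid x_i) = \frac{P(k)\, \mathcal{N}(x_i \mid c_k, \sigma^2 I)}{\sum_{k'} P(k')\, \mathcal{N}(x_i \mid c_{k'}, \sigma^2 I)}, \qquad \text{M-step: } P(k) = \frac{1}{N}\sum_{i=1}^{N} P(k \mid x_i)$$

$$\mathcal{L} = \sum_{i=1}^{N} \log \sum_{k=1}^{K} P(k)\, \mathcal{N}\!\big(x_i \mid c_k, \sigma^2 I\big)$$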

Related Work
- Active Learning with Pre-Clustering, Nguyen and Smeulders (ICML, 2004): uniform combination of uncertainty and density; we use weighted scoring
- Representative Sampling, Xu et al. (ECIR, 2003): selects cluster centroids in the SVM margin; only applicable in an SVM framework
- Online Choice of Active Learning Algorithms (COMB), Baram et al. (ICML, 2003): decides which sampling method is optimal; we decide the optimal operating range for the sampling methods

Supervised Learning (Passive)
[Diagram: the expert labels data drawn from the data source; the learning mechanism trains on the labeled data and outputs a model to the user.]

Semi-Supervised Learning (Passive)
[Diagram: the learning mechanism trains on labeled data from the expert together with additional unlabeled data from the data source, and outputs a model to the user.]

Active Learning
1. Trains on an initially small training set
2. Chooses the most useful examples
3. Requests the labels of the chosen data
4. Aggregates the training data with the newly added examples and re-trains
5. Stops either
   i. when a maximum number of labeling requests is reached, or
   ii. when a desired performance level is reached
- Goal I: make as few requests as possible
- Goal II: achieve high performance with a small amount of data
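A minimal, self-contained sketch of this loop in Python (illustrative only; it uses scikit-learn logistic regression with uncertainty sampling as the query strategy, and `oracle` is a hypothetical function returning the true label of a requested index):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learn(X_pool, oracle, seed_idx, seed_labels, budget,
                 X_val=None, y_val=None, target_acc=None):
    labeled_idx, labels = list(seed_idx), list(seed_labels)
    model = LogisticRegression(max_iter=1000)
    for _ in range(budget):                                    # stop (i): max label requests
        model.fit(X_pool[labeled_idx], labels)                 # steps 1 and 4: (re-)train
        candidates = np.setdiff1d(np.arange(len(X_pool)), labeled_idx)
        probs = model.predict_proba(X_pool[candidates])[:, 1]
        pick = int(candidates[np.argmin(np.abs(probs - 0.5))]) # step 2: most uncertain example
        labeled_idx.append(pick)
        labels.append(oracle(pick))                            # step 3: request its label
        if target_acc is not None and X_val is not None:
            if model.score(X_val, y_val) >= target_acc:        # stop (ii): desired performance
                break
    return model, labeled_idx, labels
```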