Active and Proactive Machine Learning
Jaime Carbonell, Pinar Donmez, Jingrui He & Vamshi Ambati
Language Technologies Institute, Carnegie Mellon University (www.cs.cmu.edu/~jgc)
27 October 2010

Why is Active Learning Important?
- Labeled data volumes << unlabeled data volumes:
  - 1.2% of all proteins have known structures
  - < .01% of all galaxies in the Sloan Sky Survey have consensus type labels
  - < .0001% of all web pages have topic labels
  - << E-10% of all internet sessions are labeled as to fraudulence (malware, etc.)
  - < .0001 of all financial transactions are investigated w.r.t. fraudulence
- If labeling is costly, or limited, select the instances with maximal impact for learning

Is (Pro)Active Learning Relevant to Language Technologies?
- Text classification: by topic, genre, difficulty, …; in learning to rank search results
- Question answering: question-type classification; answer ranking
- Machine translation: selecting sentences to translate for low-density languages (LDLs); eliciting partial or full alignments

Active Learning
- Training data: a labeled set T = {(x_i, y_i)} (special case: binary labels y_i ∈ {0, 1})
- Functional space: a hypothesis class f ∈ F
- Fitness criterion: an objective, a.k.a. loss function, scoring how well f fits the labeled data
- Sampling strategy: a rule for selecting which unlabeled instance to query next
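The four components above fit a standard pool-based loop. Below is a minimal Python sketch, assuming a scikit-learn-style classifier and a caller-supplied scoring function; it is illustrative scaffolding, not code from the talk.

```python
import numpy as np

def active_learning_loop(model, X_pool, oracle, score_fn, budget, seed_idx):
    """Generic pool-based active learning: score the unlabeled pool,
    query the oracle for the best instance, retrain, repeat."""
    pool = [i for i in range(len(X_pool)) if i not in seed_idx]
    X_lab = [X_pool[i] for i in seed_idx]   # seed labels: in practice,
    y_lab = [oracle(i) for i in seed_idx]   # at least one per class
    for _ in range(budget):
        model.fit(np.array(X_lab), np.array(y_lab))
        # The sampling strategy lives entirely in score_fn.
        scores = [score_fn(model, X_pool[i]) for i in pool]
        best = pool.pop(int(np.argmax(scores)))
        X_lab.append(X_pool[best])
        y_lab.append(oracle(best))
    return model.fit(np.array(X_lab), np.array(y_lab))
```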

Sampling Strategies
- Random sampling (preserves the distribution)
- Uncertainty sampling (Lewis, 1996; Tong & Koller, 2000):
  - proximity to the decision boundary
  - maximal distance to labeled x's
- Density sampling (kNN-inspired; McCallum & Nigam, 2004)
- Representative sampling (Xu et al., 2003)
- Instability sampling (probability-weighted): x's that maximally change the decision boundary
- Ensemble strategies:
  - Boosting-like ensemble (Baram, 2003)
  - DUAL (Donmez & Carbonell, 2007): dynamically switches strategies from density-based to uncertainty-based by estimating the derivative of the expected residual error reduction
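Hedged sketches of three of the scores named above, for a probabilistic classifier; the cited papers use more refined formulations.

```python
import numpy as np

def uncertainty_score(model, x):
    # Proximity to the decision boundary: entropy of the predicted posterior.
    p = model.predict_proba(x.reshape(1, -1))[0]
    return -np.sum(p * np.log(p + 1e-12))

def density_score(x, X_unlabeled, k=10):
    # kNN-inspired density: inverse mean distance to the k nearest neighbors
    # (index 0 is x itself when x is a row of X_unlabeled, so skip it).
    d = np.sort(np.linalg.norm(X_unlabeled - x, axis=1))[1:k + 1]
    return 1.0 / (d.mean() + 1e-12)

def diversity_score(x, X_labeled):
    # Maximal distance to the already-labeled points.
    return np.min(np.linalg.norm(X_labeled - x, axis=1))
```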

Which point to sample? [Figure: grey = unlabeled, red = class A, brown = class B]

Density-Based Sampling: the centroid of the largest unsampled cluster

Uncertainty Sampling: the point closest to the decision boundary

Maximal Diversity Sampling: the point maximally distant from the labeled x's

Ensemble-Based Possibilities: uncertainty + diversity criteria; density + uncertainty criteria

Strategy Selection: No Universal Optimum
The optimal operating range differs across AL sampling strategies. How to get the best of both worlds? (Hint: ensemble methods, e.g. DUAL)

How does DUAL do better?
- Runs DWUS (density-weighted uncertainty sampling) until it estimates a cross-over point, monitoring the change in expected error at each iteration to detect when DWUS is stuck in a local minimum
- After the cross-over (saturation) point, DUAL uses a mixture model
- The goal is to minimize expected future error: if we knew the future error of uncertainty sampling (US) to be zero, we would put all the weight on US, but in practice we do not know it

More on DUAL [ECML 2007]
- After the cross-over, US does better, so the uncertainty score should be given more weight
- That weight should reflect how well US performs; it can be calculated from the expected error of US on the unlabeled data*
- This yields DUAL's selection criterion: a weighted combination of the uncertainty and density scores

* US is allowed to choose data only from among the already sampled instances, and its error is calculated on the remaining unlabeled set
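A schematic of that criterion, reading the weight as one minus US's estimated error; the exact estimator in the ECML 2007 paper is more involved, so treat this as a sketch.

```python
import numpy as np

def dual_select(pool, uncertainty, density, est_us_error):
    """DUAL-style selection after the cross-over point.

    est_us_error: estimated error of pure uncertainty sampling (US) on the
    remaining unlabeled set, in [0, 1]. If US were known to be perfect
    (error 0), all the weight would go to the uncertainty score.
    """
    pi = 1.0 - est_us_error   # the better US does, the more weight it gets
    scores = [pi * uncertainty(x) + (1.0 - pi) * density(x) for x in pool]
    return int(np.argmax(scores))
```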

Results: DUAL vs DWUS

Beyond DUAL
- Paired sampling with geodesic density estimation (Donmez & Carbonell, SIAM 2008)
- Active rank learning: for search results (Donmez & Carbonell, WWW 2008) and in general (Donmez & Carbonell, ICML 2008)
- Structure learning, e.g. inferring 3D protein structure from 1D sequence: remains an open problem

Active Sampling for RankSVM I
- Consider a candidate instance x
- Assume x is added to the training set with a hypothesized label
- The total loss on the pairs that include x is a sum of pairwise hinge terms, where n is the number of training instances with a different label than x
- Adding this term to the RankSVM objective gives the objective function to be minimized

Active Sampling for RankSVM II
- Let f be the current ranking function
- For each pair involving the candidate there are two possible cases: either the pair satisfies the margin (zero loss) or it violates it (positive hinge loss)
- In each case, take the derivative of the loss with respect to the ranker's score at the single point x

Active Sampling for RankSVM III
- Substitute the current ranker into the previous equation to estimate the derivative
- The magnitude of the total derivative estimates the ability of x to change the current ranker if added to the training set
- Finally, sample the candidate with the largest estimated magnitude
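A minimal sketch of this selection rule under a pairwise hinge loss: the derivative of each pair's hinge term with respect to the candidate's score is either 0 or ±1, so the magnitude of the total derivative reduces to counting margin-violating pairs. Averaging over hypothesized labels is a simplification, not the paper's exact estimator.

```python
import numpy as np

def rank_svm_sample(f, X_pool, X_train, y_train, labels=(0, 1)):
    """Pick the candidate whose addition would most change a pairwise
    hinge-loss ranker, measured by the magnitude of the loss derivative
    (i.e., the number of margin-violating pairs it would introduce)."""
    s_train = np.array([f(x) for x in X_train])

    def deriv_magnitude(x):
        s_x = f(x)
        total = 0.0
        for y in labels:                 # hypothesize each label for x
            mask = y_train != y          # pairs use opposite-label points
            sign = 1.0 if y == max(labels) else -1.0
            # Hinge derivative w.r.t. s_x is -sign on violated pairs, else 0.
            total += np.sum(sign * (s_x - s_train[mask]) < 1.0) / len(labels)
        return total

    return int(max(range(len(X_pool)), key=lambda i: deriv_magnitude(X_pool[i])))
```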

Active Sampling for RankBoost I
- Again, estimate how the current ranker would change if the candidate x were in the training set
- Estimate this change by the difference in ranking loss before and after x is added
- The ranking loss is the exponential pairwise loss of Freund et al. (2003)

Active Sampling for RankBoost II
- Compute the difference in ranking loss between the current and the enlarged training set
- The differential indicates how much the current ranker needs to change to compensate for the loss introduced by the new instance
- Finally, the instance with the highest loss differential is sampled
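A direct, unoptimized transcription of that rule (O(n²) per candidate); the exponential pairwise loss follows Freund et al. (2003), while averaging over hypothesized labels is an assumption.

```python
import numpy as np

def rankboost_loss(H, X, y):
    """Exponential ranking loss of Freund et al. (2003): sum over pairs
    (i, j) with y[i] > y[j] of exp(H(x_j) - H(x_i))."""
    s = np.array([H(x) for x in X])
    loss = 0.0
    for i in range(len(X)):
        for j in range(len(X)):
            if y[i] > y[j]:
                loss += np.exp(s[j] - s[i])
    return loss

def rankboost_sample(H, X_pool, X_train, y_train, labels=(0, 1)):
    """Query the candidate with the largest expected loss differential."""
    base = rankboost_loss(H, X_train, y_train)

    def differential(x):
        diffs = []
        for y in labels:                 # hypothesize each label for x
            X_ext = list(X_train) + [x]
            y_ext = list(y_train) + [y]
            diffs.append(abs(rankboost_loss(H, X_ext, y_ext) - base))
        return sum(diffs) / len(labels)

    return int(max(range(len(X_pool)), key=lambda i: differential(X_pool[i])))
```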

Results on TREC03

Active vs Proactive Learning

                   Active Learning                  Proactive Learning
Number of oracles  Individual (only one)            Multiple, with different capabilities, costs and areas of expertise
Reliability        Infallible (100% right)          Variable across oracles and queries, depending on difficulty, expertise, …
Reluctance         Indefatigable (always answers)   Variable across oracles and queries, depending on workload, certainty, …
Cost per query     Invariant (free or constant)     Variable across oracles and queries, depending on workload, difficulty, …

Note: "Oracle" ∈ {expert, experiment, computation, …}

Active Learning is Awesome, but … is it Enough?
Traditional active learning assumes a single perfect source: no labeling noise, no answer reluctance, a fixed labeling cost, all fixed over time. Going beyond that means multiple sources with differing expertise levels, task difficulty and ambiguity, a varying-cost model, and time-varying behavior; that is proactive learning [JMLR '09; KDD '09; SDM '10 (submitted); CIKM '08].

Scenario 1: Reluctance
- Two oracles: a reliable oracle, expensive but always answers with a correct label; and a reluctant oracle, cheap but may not respond to some queries
- Define a utility score for each query as the expected value of information at unit cost

How to estimate the reluctant oracle's probability of answering?
- Cluster the unlabeled data using k-means
- Ask the reluctant oracle for the label of each cluster centroid:
  - if a label is received, increase the estimated answer probability of nearby points
  - if no label is received, decrease it for nearby points
  - (the update indicator equals 1 when a label is received, -1 otherwise)
- The number of clusters depends on the clustering budget and the oracle's fee

Algorithm for Scenario 1
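A hedged sketch of the flavor described on the previous two slides: probe k-means centroids to estimate the reluctant oracle's answer probability, then query by expected value of information per unit cost. The function names, the Gaussian weighting, and the 0.5 prior are all assumptions, not the paper's exact algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_answer_prob(X, reluctant_oracle, n_clusters, sigma=1.0):
    """Probe cluster centroids; raise/lower P(answer) for nearby points."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    p = np.full(len(X), 0.5)                    # prior: answers half the time
    for c in km.cluster_centers_:
        answered = reluctant_oracle(c) is not None   # None = no response
        h = 1.0 if answered else -1.0                # the +1 / -1 indicator
        w = np.exp(-np.linalg.norm(X - c, axis=1) ** 2 / (2 * sigma ** 2))
        p = np.clip(p + 0.5 * h * w, 0.05, 0.95)     # nearby points move most
    return p

def scenario1_query(info_value, p_answer, costs):
    """Utility = expected value of information at unit cost."""
    utility = p_answer * info_value / costs
    return int(np.argmax(utility))
```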

Scenario 2: Fallibility
- Two oracles: one perfect but expensive; one fallible but cheap, which always answers
- The algorithm is similar to Scenario 1, with slight modifications
- During exploration, the fallible oracle provides the label together with its confidence
- If the confidence falls below a threshold, we do not use the label, but we still update the estimate

Scenario 3: Non-uniform Cost
- Uniform labeling cost: fraud detection, face recognition, etc.
- Non-uniform labeling cost: text categorization, medical diagnosis, protein structure prediction, etc.
- Two oracles: a fixed-cost oracle and a variable-cost oracle

Underlying Sampling Strategy
- Conditional-entropy-based sampling, weighted by a density measure
- Captures the information content of a close neighborhood: the density is computed over the close neighbors of x
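A sketch of one way to realize this score: predictive entropy times a kNN density term. The particular density form is an assumption.

```python
import numpy as np

def entropy_density_scores(model, X_pool, k=10):
    """Conditional-entropy sampling weighted by a local density measure:
    H(y | x) times the information content of x's close neighborhood."""
    P = model.predict_proba(X_pool)
    H = -np.sum(P * np.log(P + 1e-12), axis=1)    # conditional entropy per point
    # Density: inverse mean distance to the k nearest pool neighbors.
    D = np.linalg.norm(X_pool[:, None, :] - X_pool[None, :, :], axis=2)
    knn = np.sort(D, axis=1)[:, 1:k + 1]          # skip the self-distance
    density = 1.0 / (knn.mean(axis=1) + 1e-12)
    return H * density                            # query the argmax
```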

Results: Reluctance

Results: cost varies non-uniformly; the differences are statistically significant (p < 0.01)

Sequential Bayesian Filtering [SDM '10]
- Track the states of multiple sources (accuracy changing with time t) as each evolves over time
- Observations (noisy labels) arrive sequentially
- Goal: estimate the posterior distribution over each source's current accuracy

A Closer Look at the Model [SDM '10]: the filter alternates a predict step, which propagates the accuracy belief forward in time, and an update step, which incorporates the newly observed (noisy) label.

Predictor Selection [SDM '10]
- For each source, track the accuracy at the last time it was selected and the probability distribution over its current accuracy
- There is a chance that the accuracy might have increased since then
- Our belief about the accuracy diverges over time as the source goes unexplored (see the sketch below)
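A minimal grid-based stand-in for the filtering just described; the SDM '10 paper's actual state-space model and inference differ, and the diffusion width tau is an assumption.

```python
import numpy as np

GRID = np.linspace(0.01, 0.99, 99)      # discretized accuracy values

def predict_step(belief, tau=0.02):
    """Diffuse the belief: the longer a source goes unexplored, the more
    our estimate of its accuracy spreads out (it may have drifted)."""
    K = np.exp(-0.5 * ((GRID[:, None] - GRID[None, :]) / tau) ** 2)
    K /= K.sum(axis=0, keepdims=True)   # column-stochastic transition kernel
    b = K @ belief
    return b / b.sum()

def update_step(belief, correct):
    """Bayes update from one noisy label: likelihood is Bernoulli(accuracy)."""
    like = GRID if correct else (1.0 - GRID)
    b = belief * like
    return b / b.sum()
```

Calling predict_step once per elapsed round for every source, but update_step only for the source actually queried, reproduces the behavior above: beliefs about unexplored sources spread out until they are probed again.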

[Figure: a source's accuracy over time; red: true, blue: estimated, black: MLE]

Does Tracking Predictor Accuracy Actually Help in Proactive Learning? [SDM '10]

Proactive Learning in General
- Multiple experts (a.k.a. oracles): different areas of expertise, different costs, different reliabilities, different availability
- What question to ask and whom to query?
  - Joint optimization of query & oracle selection
  - Referrals among oracles (with referral fees)
  - Learn about oracle capabilities as well as solving the active learning problem at hand
  - Non-static oracle properties

Current Issues in Proactive Learning
- Large numbers of oracles [Donmez, Carbonell & Schneider, KDD 2009]: based on a multi-armed bandit approach
- Non-stationary oracles [Donmez, Carbonell & Schneider, SDM 2010]: expertise changes with time (improves or decays); exploration vs. exploitation tradeoff
- What if the labeled set is empty for some classes? Minority-class discovery (unsupervised) [He & Carbonell, NIPS 2007; SIAM 2008; SDM 2009]; after the first instance is discovered → proactive learning, or minority-class characterization [He & Carbonell, SIAM 2010]

Minority Classes vs Outliers
- Rare classes: a group of points; clustered; non-separable from the majority classes
- Outliers: a single point; scattered; separable

The Big Picture: raw data → feature extraction (relational, temporal) → feature representation → rare category detection on the unbalanced, unlabeled data set → learning in unbalanced settings → classifier

Minority Class Discovery Method
1. Calculate a problem-specific similarity
2-4. Score the unlabeled examples and query the top-scoring one
5. Relevance feedback: if the query reveals a new class, output it (step 6); otherwise increase t by 1 and, unless the budget is exhausted, repeat from the scoring step
(A schematic sketch follows below.)
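A schematic sketch loosely in the spirit of the nearest-neighbor-based discovery methods cited earlier (NIPS 2007, SDM 2009): query where the local density changes most abruptly, since compact minority clusters create such changes. The scoring function here is illustrative, not MALICE's actual criterion.

```python
import numpy as np

def rare_category_queries(X, oracle, budget, k=10):
    """Schematic rare-category discovery: query points whose local density
    at a tight scale far exceeds the density at a looser scale, skipping
    points already queried. oracle(i) returns the class label of X[i]."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    nn = np.sort(D, axis=1)                  # row i: sorted distances from i
    # Density contrast between neighborhood scales k and 2k.
    score = 1.0 / (nn[:, k] + 1e-12) - 1.0 / (nn[:, 2 * k] + 1e-12)
    discovered, queried = set(), set()
    for _ in range(budget):
        i = int(max(range(len(X)),
                    key=lambda j: -np.inf if j in queried else score[j]))
        queried.add(i)
        discovered.add(oracle(i))            # relevance feedback on the label
    return discovered
```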

Summary of Real Data Sets
- Abalone: 4,177 examples; 7-dimensional features; 20 classes; largest class 16.50%; smallest class 0.34%
- Shuttle: 4,515 examples; 9-dimensional features; 7 classes; largest class 75.53%; smallest class 0.13%

Results on Real Data Sets [Figures: Abalone and Shuttle; curves for MALICE, Interleave, and random sampling]

Active Learning for MT: the active learner selects sentences S from the source-language corpus; an expert translator supplies translations, producing (S, T) pairs; the sampled parallel corpus feeds the model trainer, which updates the MT system.

ACT Framework (Active Crowd Translation): sentence selection draws source sentences S from the source-language corpus; multiple crowd translators return (S, T1), (S, T2), …, (S, Tn); translation selection chooses among them, and the resulting parallel data feeds the model trainer and MT system.

Active Learning Strategy: Diminishing Density Weighted Diversity Sampling (see the sketch below)
Experiments: language pair Spanish-English; 20 iterations; batch size 1,000 sentences each; translation with Moses phrase-based SMT; development set 343 sentences; test set 506 sentences. [Graph: performance (BLEU) vs. data (thousands of words)]
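A hedged reading of the strategy's name: n-grams that are dense in the untranslated pool raise a sentence's score, and each n-gram's contribution diminishes once it has been covered by earlier selections. The decay constant and n-gram orders are assumptions; the published ACT formulation differs in detail.

```python
from collections import Counter

def sentence_score(sentence, pool_ngram_freq, seen_counts, decay=0.5):
    """Diminishing density-weighted diversity score for one sentence.
    pool_ngram_freq: Counter of n-gram frequencies over the untranslated pool.
    seen_counts: Counter of how often each n-gram was already selected."""
    words = sentence.split()
    ngrams = [tuple(words[i:i + n]) for n in (1, 2, 3)
              for i in range(len(words) - n + 1)]
    score = sum(pool_ngram_freq[g] * decay ** seen_counts[g] for g in ngrams)
    return score / max(len(ngrams), 1)      # normalize by sentence length

def select_batch(pool, pool_ngram_freq, batch_size):
    """Greedily pick a batch, decaying n-grams as they get covered."""
    seen, batch = Counter(), []
    for _ in range(batch_size):
        best = max(pool, key=lambda s: sentence_score(s, pool_ngram_freq, seen))
        batch.append(best)
        pool = [s for s in pool if s is not best]
        w = best.split()
        for n in (1, 2, 3):
            seen.update(tuple(w[i:i + n]) for i in range(len(w) - n + 1))
    return batch
```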

Translation Selection from AMT (Amazon Mechanical Turk): estimate each translator's reliability, then select among the crowd translations using that reliability.
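One plausible realization, a reliability-weighted vote in which identical translations pool their workers' reliability; the ACT work's actual reliability-estimation and selection formulas are not reproduced here, so treat this as an assumption-laden sketch.

```python
from collections import defaultdict

def select_translation(translations, reliability):
    """Pick among crowd translations of one source sentence.
    translations: worker -> translation string.
    reliability: worker -> estimated reliability in [0, 1]."""
    support = defaultdict(float)
    for worker, t in translations.items():
        support[t] += reliability[worker]   # agreeing workers pool support
    return max(support, key=support.get)

# Example: two workers agree, one dissents; reliable agreement wins.
picked = select_translation(
    {"w1": "the house is red", "w2": "the house is red", "w3": "red house"},
    {"w1": 0.8, "w2": 0.6, "w3": 0.9},
)
```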

Parting Thoughts
- Proactive learning: a new field, just started; new work and full details in the Donmez dissertation; applications abound: e-science (computational biology), finance, network security, language technologies (MT), …; theory still in the making (e.g. Liu Yang); open challenge: proactive structure learning
- Rare-class discovery and classification: dovetails with active/proactive learning; new work and full details in the Jingrui He dissertation

THANK YOU!

Specially Designed Exponential Families [Efron & Tibshirani, 1996]
- A favorable compromise between parametric and nonparametric density estimation
- Estimated density: $\hat{f}(x) = \hat{c}\, g(x) \exp(\hat{\theta}^\top t(x))$, with carrier density $g(x)$, normalizing parameter $\hat{c}$, parameter vector $\hat{\theta}$, and vector of sufficient statistics $t(x)$


SEDER Algorithm
- Carrier density: a kernel density estimator
- To decouple the estimation of the different parameters: decompose the parameter vector dimension by dimension, and relax the normalization constraint so that each dimension's parameters can be estimated independently

Parameter Estimation
- Theorem 3 [SDM 2009]: the maximum likelihood estimates of the decomposed parameters satisfy a set of coupled fixed-point conditions, one per dimension

Parameter Estimation (cont.)
- Solving those conditions yields closed-form estimates whose key quantity is a positive parameter in most cases

Scoring Function
- Plug the estimates into the estimated density
- Scoring function: the norm of the gradient of the estimated density, so the points where the density changes most sharply score highest (a numerical sketch follows below)
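SEDER derives this gradient in closed form; as a stand-in, here is a numerical version that works for any callable density estimate.

```python
import numpy as np

def gradient_norm_scores(density, X, eps=1e-4):
    """Score each point by the norm of the numerical gradient of the
    estimated density; large gradients flag sharp local changes, where
    compact minority classes border the majority class."""
    scores = np.empty(len(X))
    for i, x in enumerate(X):
        g = np.zeros(len(x))
        for d in range(len(x)):
            e = np.zeros(len(x))
            e[d] = eps
            g[d] = (density(x + e) - density(x - e)) / (2 * eps)  # central diff
        scores[i] = np.linalg.norm(g)
    return scores
```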

Summary of Real Data Sets (n = examples, d = dimensions, m = classes)

Data Set      n      d    m    Largest Class   Smallest Class
Ecoli         n/a    n/a  n/a  n/a             2.68%
Glass         n/a    n/a  n/a  n/a             4.21%
Page Blocks   n/a    n/a  n/a  n/a             0.51%
Abalone       4177   7    20   16.50%          0.34%
Shuttle       4515   9    7    75.53%          0.13%

Moderately skewed: Ecoli, Glass. Extremely skewed: Page Blocks, Abalone, Shuttle.

Moderately Skewed Data Sets [Figures: Ecoli and Glass; curves for MALICE]

GRADE: Full Prior Information
1. For each rare class c:
2. Calculate a class-specific similarity
3-5. Score the unlabeled examples and query the top-scoring one
6. Relevance feedback: if the queried example belongs to class c, output it (step 7); otherwise increase t by 1 and repeat from the scoring step

Results on Real Data Sets [Figures: Ecoli, Glass, Abalone, Shuttle; curves for MALICE]

Performance Measures
- MAP (Mean Average Precision): the mean of the average-precision (AP) values over all queries
- NDCG (Normalized Discounted Cumulative Gain): the impact of each relevant document is discounted as a function of its rank position
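Both measures in a few lines of Python; the exponential-gain NDCG shown here is one common convention among several.

```python
import numpy as np

def average_precision(relevance):
    """AP for one ranked list of binary relevance labels."""
    rel = np.asarray(relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(rankings):
    """MAP: the mean of AP over all queries."""
    return float(np.mean([average_precision(r) for r in rankings]))

def ndcg(gains, k=None):
    """NDCG: each document's gain is discounted by log2 of its rank."""
    g = np.asarray(gains, dtype=float)[:k]
    dcg = ((2 ** g - 1) / np.log2(np.arange(len(g)) + 2)).sum()
    ideal = np.sort(np.asarray(gains, dtype=float))[::-1][:k]
    idcg = ((2 ** ideal - 1) / np.log2(np.arange(len(ideal)) + 2)).sum()
    return float(dcg / idcg) if idcg > 0 else 0.0

# Example: ndcg([3, 2, 3, 0, 1, 2], k=5)
```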