Active Learning for Probabilistic Models
Lee Wee Sun
Department of Computer Science, National University of Singapore
LARC-IMS Workshop

Probabilistic Models in Networked Environments
Probabilistic graphical models are powerful tools in networked environments.
Example task: given some labeled nodes, what are the labels of the remaining nodes?
We may also need to learn the parameters of the model (later).
(Figure: labeling university web pages with a CRF.)

Active Learning
Given a budget of k queries, which nodes should we query to maximize performance on the remaining nodes?
What are reasonable performance measures with provable guarantees for greedy methods?
(Figure: labeling university web pages with a CRF.)

Entropy
First consider non-adaptive policies.
Chain rule of entropy, with Y_1 the selected variables and Y_2 the remaining target variables:
H(Y_1, Y_2) = H(Y_1) + H(Y_2 | Y_1)
The left-hand side is constant, so maximizing the entropy of the selected variables H(Y_1) minimizes the conditional entropy H(Y_2 | Y_1) of the target.

Greedy method – given the already selected set S, add the variable Y_i that maximizes H(Y_i | Y_S).
Near optimality: the greedy set S_k satisfies H(Y_{S_k}) >= (1 - 1/e) max_{|S| = k} H(Y_S), because of the submodularity (and monotonicity) of entropy.
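As a concrete illustration, here is a minimal Python sketch of this non-adaptive greedy entropy criterion, assuming the joint distribution is available as an explicit table from full labelings (tuples) to probabilities; the function names and the table representation are illustrative, not from the talk.

```python
import math

def marginal_entropy(joint, subset):
    """Shannon entropy H(Y_S) of the variables indexed by `subset`,
    obtained by marginalising an explicit joint table {labeling: prob}."""
    marg = {}
    for labeling, p in joint.items():
        key = tuple(labeling[i] for i in subset)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in marg.values() if p > 0.0)

def greedy_entropy_selection(joint, n_vars, budget):
    """Non-adaptively pick `budget` variables, each time adding the variable
    that most increases H(Y_S), i.e. has the largest H(Y_i | Y_S)."""
    selected = []
    for _ in range(budget):
        remaining = [i for i in range(n_vars) if i not in selected]
        best = max(remaining,
                   key=lambda i: marginal_entropy(joint, selected + [i]))
        selected.append(best)
    return selected
```

Maximizing H(Y_{S ∪ {i}}) = H(Y_S) + H(Y_i | Y_S) with H(Y_S) fixed is the same as maximizing the conditional entropy H(Y_i | Y_S), which is what the inner `max` does.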

Submodularity
Diminishing return property: for a set function f, sets A ⊆ B, and an element x not in B,
f(A ∪ {x}) - f(A) >= f(B ∪ {x}) - f(B)

Adaptive Policy
What about adaptive policies?
(Figure: non-adaptive selection of a fixed set vs. an adaptive policy tree of depth k.)

Let ρ be a path down the policy tree, and define the policy entropy as
H(π) = - Σ_ρ p(ρ) log p(ρ)
Then we can show that
H(Y_G) = H(π) + Σ_ρ p(ρ) H(Y_G | ρ)
where Y_G is the graph labeling.
This corresponds to the chain rule in the non-adaptive case: maximizing the policy entropy minimizes the conditional entropy.

Recap: the greedy algorithm is near-optimal in the non-adaptive case.
For the adaptive case, consider the greedy algorithm that selects the variable with the largest entropy conditioned on the observations so far.
Unfortunately, for the adaptive case we can show that, for every α > 0, there is a probabilistic model on which this greedy policy achieves less than α times the optimal policy entropy, so no constant factor approximation is possible.

Tsallis Entropy and Gibbs Error
In statistical mechanics, Tsallis entropy is a generalization of Shannon entropy:
S_q(p) = (1 - Σ_i p_i^q) / (q - 1)
Shannon entropy is the special case q → 1.
We call the case q = 2, which equals 1 - Σ_i p_i^2, the Gibbs error.
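A small sketch of these two quantities for a single discrete distribution (illustrative code, not from the slides):

```python
def tsallis_entropy(probs, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1), for q != 1."""
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)

def gibbs_error(probs):
    """Gibbs error: the q = 2 case, i.e. 1 - sum_i p_i^2."""
    # e.g. gibbs_error([0.5, 0.3, 0.2]) is approximately 0.62
    return 1.0 - sum(p * p for p in probs)
```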

Properties of Gibbs Error
Gibbs error is the expected error of the Gibbs classifier.
– Gibbs classifier: draw a labeling from the distribution and use that labeling as the prediction.
It is at most twice the Bayes (best possible) error.
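A quick numeric check of the factor-of-two bound on a hypothetical three-label posterior (the numbers are made up for illustration):

```python
p = [0.5, 0.3, 0.2]                      # hypothetical posterior over 3 labels
gibbs = 1.0 - sum(q * q for q in p)      # expected error of the Gibbs classifier (~0.62)
bayes = 1.0 - max(p)                     # error of the Bayes-optimal prediction (0.5)
assert gibbs <= 2 * bayes                # Gibbs error is at most twice the Bayes error
```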

Gibbs error is a lower bound to the Shannon entropy: Gibbs error <= entropy.
Maximizing the policy Gibbs error therefore maximizes a lower bound to the policy entropy.

Policy Gibbs error

Maximizing the policy Gibbs error minimizes the expected weighted posterior Gibbs error.
Each query makes progress on either the version space or the posterior Gibbs error.

Gibbs Error and Adaptive Policies
Greedy algorithm: select the node i with the largest Gibbs error conditioned on the observed labels.
Near-optimality holds for policy Gibbs error (in contrast to policy entropy).
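A minimal sketch of this adaptive greedy rule, again assuming an explicit joint table over labelings and an `oracle` callback that returns the true label of a queried node (all names are illustrative):

```python
def conditional_marginal(joint, observed, i):
    """Marginal P(Y_i | observed labels), from a joint table {labeling: prob}."""
    marg, z = {}, 0.0
    for labeling, p in joint.items():
        if all(labeling[j] == v for j, v in observed.items()):
            z += p
            marg[labeling[i]] = marg.get(labeling[i], 0.0) + p
    return {y: p / z for y, p in marg.items()}

def adaptive_greedy_gibbs(joint, n_vars, budget, oracle):
    """Adaptively query the node whose conditional marginal has the largest
    Gibbs error, then condition on the label the oracle returns."""
    observed = {}
    for _ in range(budget):
        candidates = [i for i in range(n_vars) if i not in observed]

        def conditional_gibbs_error(i):
            m = conditional_marginal(joint, observed, i)
            return 1.0 - sum(p * p for p in m.values())

        best = max(candidates, key=conditional_gibbs_error)
        observed[best] = oracle(best)    # query the annotator for node `best`
    return observed
```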

Proof idea:
– Show that the policy Gibbs error is the same as the expected version space reduction.
– The version space is the total probability of the remaining labelings of the unlabeled nodes (labelings consistent with the labeled nodes).
– The version space reduction function is adaptive submodular, which gives the required result for policy Gibbs error (using the result of Golovin and Krause).
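For concreteness, the version space mass under the same explicit-joint-table representation used in the sketches above (an illustrative sketch, not from the talk):

```python
def version_space(joint, observed):
    """Total probability of the labelings still consistent with the
    observed labels, i.e. the remaining version space mass."""
    return sum(p for labeling, p in joint.items()
               if all(labeling[j] == v for j, v in observed.items()))
```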

Adaptive Submodularity
Diminishing return property:
– Let Δ(x_i | ρ) be the expected change in version space when x_i is queried after path ρ and its label y is received.
– The function is adaptive submodular because Δ(x_i | ρ) >= Δ(x_i | ρ') whenever ρ' extends ρ.

Worst Case Version Space
Maximizing the policy Gibbs error maximizes the expected version space reduction.
Related greedy algorithm: select the least confident variable.
– Select the variable with the smallest maximum label probability.
This approximately maximizes the worst case version space reduction.
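A sketch of the least-confident rule, assuming per-node marginals of the form {node: {label: probability}} conditioned on the labels observed so far (illustrative names):

```python
def least_confident(marginals, labeled):
    """Return the unlabeled node whose marginal has the smallest maximum
    label probability, i.e. the node we are least confident about."""
    candidates = {i: m for i, m in marginals.items() if i not in labeled}
    return min(candidates, key=lambda i: max(candidates[i].values()))
```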

Let the objective be the worst case (over labelings) version space reduction.
The greedy strategy that selects the least confident variable achieves a (1 - 1/e) approximation to this objective, because the version space reduction function is pointwise submodular.

Pointwise Submodularity
Let V(S, y) be the version space remaining if y is the true labeling of all nodes and the subset S has been labeled.
1 - V(S, y) is pointwise submodular, since it is submodular for every labeling y.

Summary So Far …
Greedy algorithm | Criteria | Optimality | Property
Select maximum entropy variable | Entropy of selected variables | No constant factor approximation |
Select maximum Gibbs error variable | Policy Gibbs error (expected version space reduction) | 1 - 1/e | Adaptive submodular
Select least confident variable | Worst case version space reduction | 1 - 1/e | Pointwise submodular

Learning Parameters
Take a Bayesian approach:
– Put a prior over the parameters.
– Integrate the parameters away when computing the probability of a labeling.
This also works in the commonly encountered pool-based active learning scenario (independent instances, with no dependencies other than on the parameters).
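One common way to carry out this integration in practice is Monte Carlo averaging over posterior samples of the parameters; a hedged sketch, where the `likelihood` callback and the sample set are assumptions for illustration rather than part of the talk:

```python
def marginal_label_prob(y, x, theta_samples, likelihood):
    """Approximate P(y | x) with the parameters integrated out, by averaging
    the likelihood P(y | x, theta) over equally weighted posterior samples."""
    return sum(likelihood(y, x, theta) for theta in theta_samples) / len(theta_samples)
```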

Experiments
Named entity recognition with a Bayesian CRF on the CoNLL 2003 dataset.
The greedy algorithms perform similarly to one another and better than passive learning (random selection).

Weakness of Gibbs Error
A labeling is considered incorrect if even one component does not agree.

Generalized Gibbs Error
Generalize the Gibbs error to use a loss function L, e.g. Hamming loss, 1 - F-score, etc.
It reduces to the Gibbs error when L(y, y') = 1 - δ(y, y'), where
– δ(y, y') = 1 when y = y', and
– δ(y, y') = 0 otherwise.
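A sketch of the generalized Gibbs error for a distribution over labelings, reading it as the expected loss between two independent draws from the distribution; this reading is an assumption, chosen because it recovers the ordinary Gibbs error in the zero-one case above:

```python
def generalized_gibbs_error(dist, loss):
    """Expected loss L(y, y') between two independent draws y, y' from the
    distribution `dist` (a dict {labeling_tuple: probability})."""
    return sum(p * q * loss(y1, y2)
               for y1, p in dist.items() for y2, q in dist.items())

def hamming_loss(y1, y2):
    """Fraction of components on which the two labelings disagree."""
    return sum(a != b for a, b in zip(y1, y2)) / len(y1)

def zero_one_loss(y1, y2):
    """1 - delta(y, y'): recovers the ordinary Gibbs error."""
    return 0.0 if y1 == y2 else 1.0
```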

Generalized policy Gibbs error (to maximize).
(The defining equation involves the generalized Gibbs error and the remaining weighted generalized Gibbs error over labelings that agree with y on ρ.)

The generalized policy Gibbs error is the average of a quantity we call the generalized version space reduction function.
Unfortunately, this function is not adaptive submodular for arbitrary L.

However, the generalized version space reduction function is pointwise submodular.
– So it has a good approximation in the worst case.

Hedging against the worst case labeling may be too conservative.
We can instead hedge against the total generalized version space among the surviving labelings.

Call this the total generalized version space reduction function.
The total generalized version space reduction function is pointwise submodular.
– So it has a good approximation in the worst case.

Summary
Greedy algorithm | Criteria | Optimality | Property
Select maximum entropy variable | Entropy of selected variables | No constant factor approximation |
Select maximum Gibbs error variable | Policy Gibbs error (expected version space reduction) | 1 - 1/e | Adaptive submodular
Select least confident variable | Worst case version space reduction | 1 - 1/e | Pointwise submodular
Select variable that maximizes worst case generalized version space reduction | Worst case generalized version space reduction | 1 - 1/e | Pointwise submodular
Select variable that maximizes worst case total generalized version space reduction | Worst case total generalized version space reduction | 1 - 1/e | Pointwise submodular

Experiments
Text classification on the 20 Newsgroups dataset: classify 7 pairs of newsgroups.
We report the AUC for classification error.
Comparison: max Gibbs error vs. total generalized version space with Hamming loss.

Acknowledgements
Joint work with:
– Nguyen Viet Cuong (NUS)
– Ye Nan (NUS)
– Adam Chai (DSO)
– Chieu Hai Leong (DSO)