Active Learning

2 Learning from Examples  Passive learning: the learner receives a random set of labeled examples

3 Active Learning  Active learning: the learner chooses the specific examples to be labeled  The learner works harder, in order to use fewer labeled examples

4 Membership Queries  The learner constructs the query examples itself from basic units  There are problems that are only solvable under this setting (e.g., finite automata)  Problem: the constructed queries might fall in irrelevant regions of the input space

5 Selective Sampling  Two oracles are available: Sample – returns an unlabeled example drawn from the input distribution; Label – given an unlabeled example, returns its label  Query filtering – from the stream of unlabeled examples, choose the most informative ones and query only for their labels (see the sketch below)
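The following is a minimal sketch of such a query-filtering loop, assuming a stream-based setting; the oracle implementations, the `is_informative` predicate, and the budget are placeholders of my own, not part of the slides.

```python
# Minimal sketch of stream-based selective sampling (query filtering).
# The oracles sample()/label(x) and the is_informative predicate are
# assumptions for illustration (e.g., the QBC disagreement test shown later).
import random

def sample():
    """Unlabeled example drawn from the input distribution D."""
    return random.random()

def label(x):
    """Labeling oracle (here: a hidden threshold at 0.5)."""
    return int(x >= 0.5)

def selective_sampling(is_informative, update, budget=20, stream_len=10_000):
    labeled = []
    for _ in range(stream_len):
        x = sample()                      # cheap: unlabeled data
        if is_informative(x):             # query filter
            y = label(x)                  # expensive: ask for the label
            labeled.append((x, y))
            update(x, y)                  # shrink the version space / refit
            if len(labeled) >= budget:
                break
    return labeled
```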

6 Selecting the Most Informative Queries  Input: x ~ D (in R^d)  Concepts: c: X → {0,1}  Bayesian model: the target concept is drawn C ~ P  Version space: V_i = V(…), the set of concepts consistent with the first i labeled examples
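As a concrete illustration of this setting (my own toy hypothesis class, not from the slides), a finite version space can be maintained explicitly:

```python
# Illustrative only: a finite hypothesis class of threshold concepts on [0,1]
# and the version space V_i induced by the labeled examples seen so far.
def make_hypotheses(k=1000):
    # c_w(x) = 1 iff x >= w, for k+1 candidate thresholds w
    return [i / k for i in range(k + 1)]

def predict(w, x):
    return int(x >= w)

def version_space(hypotheses, labeled):
    """All thresholds consistent with every (x, y) pair seen so far."""
    return [w for w in hypotheses
            if all(predict(w, x) == y for x, y in labeled)]
```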

7 Selecting the Most Informative Queries  Instantaneous information gain from the i-th example:

8 Selecting the Most Informative Queries  For the next example x_i, let P_0 = Pr(c(x_i) = 0) under the version-space prior; the slide plots the gain G(x_i | V_i) as a function of P_0 (formula reconstructed below)
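The formula itself did not survive the transcript; a standard reconstruction consistent with the surrounding definitions (the binary entropy of the label under the version-space prior) is:

\[
  P_0 \;=\; \Pr_{c \sim P_{V_i}}\!\bigl[c(x_i) = 0\bigr], \qquad
  G(x_i \mid V_i) \;=\; H(P_0) \;=\; -P_0 \log_2 P_0 \;-\; (1-P_0)\log_2(1-P_0),
\]

which is maximized at P_0 = 1/2, i.e., when the query splits the version space into two equally probable halves.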

9 Example  X = [0,1]  The target threshold W ~ U[0,1]  V_i = [the largest x value labeled 0, the smallest x value labeled 1]

10 Example - Expected Prediction Error  The final predictor's error is proportional to the length of the VS segment: both W_final and W_target are selected uniformly from the VS (density p = 1/L), and the error of each such pair is |W_final - W_target|  Using n random examples the expected error decreases like 1/n, but by always querying the midpoint of the VS the expected error decreases exponentially (see the simulation sketch below)
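A small simulation of this claim under the stated assumptions (uniform threshold on [0,1], noise-free labels); the helper names and trial counts are my own:

```python
# Sketch: random labels vs. active (bisection) labels for a 1-D threshold.
# Assumes the setting of slides 9-10: W ~ U[0,1], c(x) = 1 iff x >= W, no noise.
import random

def vs_length_random(w, n):
    """Version-space length after n randomly drawn labeled examples."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        x = random.random()
        if x >= w: hi = min(hi, x)
        else:      lo = max(lo, x)
    return hi - lo            # expected value ~ 2/(n+1)

def vs_length_active(w, n):
    """Version-space length after n bisection (active) queries."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        x = (lo + hi) / 2     # always query the midpoint of the VS
        if x >= w: hi = x
        else:      lo = x
    return hi - lo            # exactly 2 ** (-n)

if __name__ == "__main__":
    trials = [random.random() for _ in range(2000)]
    n = 20
    rand_avg = sum(vs_length_random(w, n) for w in trials) / len(trials)
    act_avg = sum(vs_length_active(w, n) for w in trials) / len(trials)
    print(f"avg VS length with {n} random labels: {rand_avg:.4f}")
    print(f"avg VS length with {n} active labels: {act_avg:.6f}")
```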

11 Query by Committee (Seung, Opper & Sompolinsky 1992; Freund, Seung, Shamir & Tishby)  Uses three oracles: Gibbs(V, x) – draw h ← rand_P(V) and return h(x); Sample() – return an unlabeled example; Label(x) – return the label of x

12 Query by Committee (Seung, Opper & Sompolinsky 1992; Freund, Seung, Shamir & Tishby)
 While (t < T_n)
   x = Sample()
   y1 = Gibbs(V, x); y2 = Gibbs(V, x)
   If (y1 != y2) then
     Label(x) (and use it to learn and to get the new VS…)
     t = 0
     Update T_n
   else
     t = t + 1
   endif
 End
 Return Gibbs(V_n, x)
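A runnable toy version of this loop; the finite threshold class, the Gibbs implementation, and the T_n schedule are my own assumptions, and only the loop structure follows the slide:

```python
# Minimal, illustrative QBC loop over a finite version space of 1-D thresholds.
import random

W_TARGET = random.random()

def sample():                      # Sample oracle: unlabeled x ~ U[0,1]
    return random.random()

def label(x):                      # Label oracle (noise-free)
    return int(x >= W_TARGET)

def gibbs(V, x):                   # Gibbs oracle: h drawn uniformly from V
    w = random.choice(V)
    return int(x >= w)

def qbc(n_hyp=10_000, t_max_init=50):
    V = [i / n_hyp for i in range(n_hyp + 1)]        # initial version space
    t, t_max, n_queries = 0, t_max_init, 0
    while t < t_max:
        x = sample()
        if gibbs(V, x) != gibbs(V, x):               # committee of two disagrees
            y = label(x)                             # query the label
            V = [w for w in V if int(x >= w) == y]   # shrink the version space
            n_queries += 1
            t = 0
            t_max += 10                              # assumed T_n update schedule
        else:
            t += 1
    return V, n_queries

if __name__ == "__main__":
    V, q = qbc()
    print(f"queries used: {q}, remaining VS size: {len(V)}")
```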

13 QBC Finds Better Queries than Random  The probability of querying an example x which divides the VS into fractions F and 1-F is 2F(1-F)  Reminder: the information gain of such a query is H(F)  But this alone is not enough…
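Combining the two quantities on this slide, the expected information gain of an example that QBC actually queries can be written as follows (my reconstruction; it is not spelled out on the slide):

\[
  \Pr[\text{query } x] \;=\; 2F(1-F), \qquad
  \mathbb{E}\bigl[\text{gain} \mid \text{query}\bigr]
  \;=\; \frac{\mathbb{E}_x\!\bigl[\,2F(1-F)\,H(F)\,\bigr]}{\mathbb{E}_x\!\bigl[\,2F(1-F)\,\bigr]},
\]

so queried examples are biased toward splits with F near 1/2, where H(F) is large.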

14 Example  W is in [0,1]^2  Each query x is a line parallel to one of the axes  The error is proportional to the perimeter of the VS rectangle, so queries that repeatedly halve the area along one axis need not shrink the error

15  If, for a concept class C: VCdim(C) < ∞, and the expected information gain of queries made by QBC is uniformly lower bounded by some g > 0  Then, with probability larger than 1-δ over the target concepts, the sequence of examples, and the choices made by QBC: NSample is bounded; NLabel is proportional to log(NSample); and the error probability of Gibbs(V_QBC, x) is < ε

16 QBC will Always Stop  The information gain of all the samples (I_samples) grows ever more slowly as the number of samples grows (it is bounded by d·log(me/d))  The information gain from queries (I_queries) is lower bounded per query and thus grows linearly in the number of queries  Since I_samples ≥ I_queries, the time between two query events grows exponentially  The algorithm will therefore exceed the T_n bound and stop

17

18 I_samples – Cumulative Information Gain  The expected cumulative information gain (the formula is reconstructed after slide 19):

19 I_samples  Sauer's Lemma: the number of different label assignments to m examples is at most (em/d)^d  A uniform distribution over n outcomes has the maximum entropy, log(n)  Hence the maximum expected cumulative information gain is d·log(em/d)
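Putting slides 18 and 19 together, a standard reconstruction of the missing cumulative-gain bound (in my notation) is:

\[
  \mathbb{E}\bigl[I_{\text{samples}}(m)\bigr]
  \;\le\; \log_2 \bigl|\{(c(x_1),\dots,c(x_m)) : c \in C\}\bigr|
  \;\le\; \log_2 \Bigl(\tfrac{em}{d}\Bigr)^{d}
  \;=\; d \log_2\!\Bigl(\tfrac{em}{d}\Bigr),
  \qquad d = \mathrm{VCdim}(C),
\]

since the cumulative gain cannot exceed the entropy of the label sequence, which is maximized by a uniform distribution over the at most (em/d)^d realizable labelings given by Sauer's Lemma.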

20 The Error Probability  Definition: Pr(h(x) ≠ c(x)), where h, c ~ P_VS  This is exactly the probability of querying a sample in QBC  It is also the quantity that QBC's stopping condition controls
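The identity behind this slide, written out (my notation, with F the fraction of the version space that labels x as 0):

\[
  \Pr_{h,c \sim P_{V}}\bigl[h(x) \ne c(x)\bigr]
  \;=\; F(1-F) + (1-F)F
  \;=\; 2F(1-F)
  \;=\; \Pr\bigl[\text{QBC queries } x\bigr],
\]

so once queries become rare, the expected error of a Gibbs predictor drawn from the remaining version space is already small.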

21 Before We Go Further…  The basic intuition: gain more information by choosing examples that cut the VS into parts of similar size  This condition alone is not sufficient  If there exists a lower bound on the expected information gain, QBC will work  The error bound in QBC is based on the analogy between the problem definition and the Gibbs predictor, not on the VS cutting

22 But in Practice…  The information-gain lower bound is proved for linear separators when the sample distribution and the VS distribution are uniform  Is this setting realistic?  Gibbs can be implemented by sampling from convex bodies

23 Kernel QBC

24 What about Noise?  In practice labels might be noisy  Active learners are sensitive to noise since they try to minimize redundancy

25 Noise Tolerant QBC
 do
   Let x be a random instance
   θ1 = rand(posterior); θ2 = rand(posterior)
   If argmax_y p(y|x,θ1) ≠ argmax_y p(y|x,θ2) then
     ask for the label of x
     update the posterior
 until no labels were requested for t consecutive instances
 Return rand(posterior)
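An illustrative, runnable version of this loop; the threshold model, the flip-noise rate, and the patience parameter are my assumptions, not from the slide:

```python
# Sketch of noise-tolerant QBC with a discretized posterior over 1-D thresholds
# and a known label-flip rate ETA (all modeling choices are assumptions).
import random

ETA = 0.1                                    # label-flip probability (assumed)
W_TARGET = random.random()

def noisy_label(x):
    y = int(x >= W_TARGET)
    return 1 - y if random.random() < ETA else y

def sample_theta(thresholds, posterior):
    """rand(posterior): draw a threshold according to the current posterior."""
    return random.choices(thresholds, weights=posterior, k=1)[0]

def noise_tolerant_qbc(n_hyp=2000, patience=100):
    thresholds = [i / n_hyp for i in range(n_hyp + 1)]
    posterior = [1.0] * len(thresholds)      # unnormalized uniform prior
    quiet, n_queries = 0, 0
    while quiet < patience:
        x = random.random()                  # a random instance
        t1 = sample_theta(thresholds, posterior)
        t2 = sample_theta(thresholds, posterior)
        if int(x >= t1) != int(x >= t2):     # the two sampled models disagree
            y = noisy_label(x)               # ask for the (noisy) label
            n_queries += 1
            posterior = [p * ((1 - ETA) if int(x >= w) == y else ETA)
                         for w, p in zip(thresholds, posterior)]
            quiet = 0
        else:
            quiet += 1
    return sample_theta(thresholds, posterior), n_queries

if __name__ == "__main__":
    w_hat, q = noise_tolerant_qbc()
    print(f"target={W_TARGET:.3f} estimate={w_hat:.3f} queries={q}")
```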

26 SVM Active Learning with Applications to Text Classification, Tong & Koller (2001)  Setting: pool-based active learning  Aim: fast reduction of the VS's size  Heuristics for identifying the query that (approximately) halves the VS: Simple Margin – choose as the next query the pool point closest to the current separator, min_x |w_i · Φ(x)|; MinMax Margin – choose the point maximizing min(m+, m-), to get the most balanced split; Ratio Margin – like MinMax but using the ratio of the two resulting margins, to get an (approximately) equal split (a sketch of Simple Margin follows below)
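A minimal pool-based sketch of the Simple Margin heuristic, assuming scikit-learn and a synthetic linearly separable pool rather than the Reuters data used in the paper; the dataset, kernel, seed-set construction, and label budget are illustrative choices:

```python
# Sketch of the Simple Margin query strategy from slide 26, on synthetic data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(1000, 20))
w_true = rng.normal(size=20)
y_pool = (X_pool @ w_true > 0).astype(int)

# Seed set with both classes so the first SVM can be trained.
pos, neg = np.where(y_pool == 1)[0], np.where(y_pool == 0)[0]
labeled = [int(pos[0]), int(pos[1]), int(neg[0]), int(neg[1])]
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

clf = SVC(kernel="linear", C=10.0)
for _ in range(30):                                   # label budget
    clf.fit(X_pool[labeled], y_pool[labeled])
    # Simple Margin: query the pool point closest to the current hyperplane.
    margins = np.abs(clf.decision_function(X_pool[unlabeled]))
    pick = unlabeled[int(np.argmin(margins))]
    labeled.append(pick)
    unlabeled.remove(pick)

clf.fit(X_pool[labeled], y_pool[labeled])
print(f"accuracy on the full pool with {len(labeled)} labels:",
      clf.score(X_pool, y_pool))
```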

27 SVM Active Learning with Applications to Text Classification, Tong & Koller (2001)  The VS in SVM is a set of unit weight vectors  (The data must be separable in the feature space)  By duality, points in the feature space F correspond to hyperplanes in the weight space W
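Written out (a reconstruction consistent with the definitions in Tong & Koller):

\[
  \mathcal{V} \;=\; \Bigl\{\, w \in \mathcal{W} \;\Bigm|\; \|w\| = 1,\;
  y_i\,\bigl(w \cdot \Phi(x_i)\bigr) > 0 \ \ \forall i \,\Bigr\},
\]

so each labeled point (x_i, y_i) removes the half of the unit sphere on the wrong side of the hyperplane w · Φ(x_i) = 0, which is the sense in which points in F act as hyperplanes in W.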

28 SVM Active Learning with Applications to Text Classification Tong & Koller (2001)

29 Results  Reuters and newsgroups data  Each document is represented by a 10^5-dimensional vector of word frequencies

30 Results

31 Results

32 What’s next…  Theory meets practice  New methods (other than cutting the VS)…  The generative setting (“Committee-Based Sampling for Training Probabilistic Classifiers”, Engelson, 1995)  Interesting applications

33