Active Learning
2 Learning from Examples
- Passive learning: the learner receives a random set of labeled examples.
3 Active Learning
- Active learning: the learner chooses the specific examples to be labeled.
- The learner works harder in order to use fewer examples.
4 Membership Queries
- The learner constructs the examples itself from basic units.
- Some problems are learnable only under this setting (e.g., finite automata).
- Problem: the constructed queries might fall in irrelevant regions of the input space.
5 Selective Sampling
- Two oracles are available:
  - Sample: returns unlabeled examples drawn from the input distribution.
  - Label: given an unlabeled example, returns its label.
- Query filtering: from the set of unlabeled examples, choose the most informative ones and query for their labels (a minimal loop is sketched below).
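To make the query-filtering step concrete, here is a minimal Python sketch of a pool-based selective-sampling loop. The helper names (`sample`, `label`, `informativeness`, `fit`) are illustrative placeholders, not oracles defined on the slides.

```python
import numpy as np

def selective_sampling(sample, label, informativeness, fit, n_queries, pool_size=1000):
    """Query filtering: draw a pool with the Sample oracle, ask the Label oracle
    only for the most informative candidate, and refit after every query."""
    X_labeled, y_labeled, model = [], [], None
    for _ in range(n_queries):
        pool = np.array([sample() for _ in range(pool_size)])   # unlabeled examples ~ D
        scores = informativeness(model, pool)                   # e.g. expected info gain; must handle model=None
        x = pool[int(np.argmax(scores))]                        # most informative candidate
        X_labeled.append(x)
        y_labeled.append(label(x))                              # the only call that costs a label
        model = fit(np.array(X_labeled), np.array(y_labeled))
    return model
```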
6 Selecting the Most Informative Queries
- Input: examples x ∈ X ⊆ R^d, drawn from a distribution D.
- Concepts: c : X → {0,1}.
- Bayesian model: the target concept is drawn from a prior P over the concept class C.
- Version space: V_i = the set of concepts consistent with the first i labeled examples.
7 Selecting the Most Informative Queries
- Instantaneous information gain from the i-th example: the amount by which its label shrinks the version space, -log( P(V_i) / P(V_{i-1}) ).
8 Selecting the Most Informative Queries
- For the next example x_i, let P_0 = Pr(c(x_i) = 0) under the current version space.
- The expected information gain G(x_i | V_i) is the binary entropy H(P_0); it is maximized when P_0 = 1/2, i.e., when the example splits the version space into two equally probable parts.
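A quick numerical sketch of the curve behind this slide, assuming the expected gain is the binary entropy of P_0 as stated above (the probe values are arbitrary):

```python
import numpy as np

def expected_gain(p0):
    """Binary entropy H(P0) in bits: the expected information gain of querying x_i
    when Pr(c(x_i) = 0) = P0 under the current version space."""
    p0 = np.clip(p0, 1e-12, 1 - 1e-12)
    return -(p0 * np.log2(p0) + (1 - p0) * np.log2(1 - p0))

for p0 in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(p0, round(float(expected_gain(p0)), 3))   # maximal (1 bit) at P0 = 0.5
```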
9 Example
- X = [0,1]; the target is a threshold W ~ Uniform[0,1], labeling points below W with 0 ("-") and points above it with 1 ("+").
- V_i = [the largest x value labeled 0, the smallest x value labeled 1].
10 Example - Expected Prediction Error
- The final predictor's error is proportional to the length L of the version-space segment:
  - Both W_final and W_target are selected uniformly from the version space (density 1/L).
  - The error of such a pair is |W_final - W_target|.
- Using n random examples, the expected error decreases like 1/n.
- By always querying the point that cuts the version space in the middle, the expected error decreases exponentially in the number of queries (a small simulation follows).
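A small simulation sketch of this comparison on the threshold class (trial counts and the choice of predictor are illustrative): passive learning draws n random labeled points, while the active learner always queries the midpoint of the current version space.

```python
import numpy as np

rng = np.random.default_rng(0)

def passive_error(n, trials=2000):
    """Expected |W_final - W_target| when the version space is bounded by n random labeled points."""
    errs = []
    for _ in range(trials):
        w = rng.random()                              # W_target ~ Uniform[0, 1]
        xs = rng.random(n)
        lo = max(xs[xs < w], default=0.0)             # largest point labeled 0
        hi = min(xs[xs >= w], default=1.0)            # smallest point labeled 1
        errs.append(abs(rng.uniform(lo, hi) - w))     # W_final drawn uniformly from the VS
    return np.mean(errs)

def active_error(n, trials=2000):
    """Expected error when each of the n queries bisects the current version space."""
    errs = []
    for _ in range(trials):
        w = rng.random()
        lo, hi = 0.0, 1.0
        for _ in range(n):
            mid = (lo + hi) / 2.0                     # query the midpoint of the VS
            lo, hi = (mid, hi) if mid < w else (lo, mid)
        errs.append(abs(rng.uniform(lo, hi) - w))
    return np.mean(errs)

for n in (5, 10, 20):
    print(n, round(passive_error(n), 4), round(active_error(n), 6))  # ~1/n vs ~2^-n
```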
11 Query by Committee (Seung, Opper & Sompolinsky 1992; Freund, Seung, Shamir & Tishby)
- Oracles used:
  - Gibbs(V, x): draw a random hypothesis h from the prior P restricted to the version space V and return h(x).
  - Sample(): return an unlabeled example drawn from the input distribution.
  - Label(x): return the label of x.
12 Query by Committee (Seung, Opper & Sompolinsky 1992; Freund, Seung, Shamir & Tishby)
  While (t < T_n):
    x = Sample()
    y1 = Gibbs(V, x)
    y2 = Gibbs(V, x)
    If (y1 != y2) then
      query Label(x), use it to learn and to obtain the new version space
      t = 0
      update T_n
    else
      t = t + 1
  Return Gibbs(V_n, x)
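A runnable sketch of this loop for the 1-D threshold class from slide 9, where Gibbs(V, x) simply draws a random threshold from the current version-space interval. The fixed patience t_max stands in for the growing bound T_n; it and the other constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def qbc_threshold(w_target, t_max=30):
    """Query by Committee on thresholds over [0, 1]: the version space is an
    interval [lo, hi]; query only when two Gibbs hypotheses disagree."""
    lo, hi = 0.0, 1.0
    n_labels, t = 0, 0
    while t < t_max:
        x = rng.random()                        # Sample(): unlabeled example
        y1 = x >= rng.uniform(lo, hi)           # Gibbs(V, x), first committee member
        y2 = x >= rng.uniform(lo, hi)           # Gibbs(V, x), second committee member
        if y1 != y2:                            # disagreement -> informative query
            if x >= w_target:                   # Label(x)
                hi = min(hi, x)
            else:
                lo = max(lo, x)
            n_labels, t = n_labels + 1, 0
        else:
            t += 1
    return rng.uniform(lo, hi), n_labels        # final Gibbs prediction, labels used

w_hat, n_labels = qbc_threshold(w_target=0.37)
print(round(abs(w_hat - 0.37), 5), n_labels)
```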
13 QBC Finds Better Queries than Random
- The probability that QBC queries an example x which divides the version space into fractions F and 1-F is 2F(1-F).
- Reminder: the expected information gain of such an example is H(F).
- But this alone is not enough…
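A small numerical sketch of the combined effect, under the simplifying assumption (true for the threshold example) that a random candidate splits the version space at a fraction F roughly uniform on (0, 1): weighting the gain H(F) by QBC's query probability 2F(1-F) raises the average gain per issued query.

```python
import numpy as np

def H(f):
    """Binary entropy in bits."""
    f = np.clip(f, 1e-12, 1 - 1e-12)
    return -(f * np.log2(f) + (1 - f) * np.log2(1 - f))

F = np.linspace(0.005, 0.995, 199)               # possible VS split fractions
q = 2 * F * (1 - F)                              # prob. that QBC turns x into a query
print("avg gain of a random candidate:", round(float(H(F).mean()), 3))            # ~0.72 bits
print("avg gain of a QBC query:       ", round(float((q * H(F)).sum() / q.sum()), 3))  # ~0.84 bits
```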
14 Example
- W lies in [0,1]^2; each example X is a line parallel to one of the axes.
- The error is proportional to the perimeter of the version-space rectangle, so queries that halve the rectangle's area (e.g., always along the same axis) need not shrink the error quickly.
15 If for a concept class C:
- VCdim(C) < ∞, and
- the expected information gain of the queries made by QBC is uniformly lower bounded by some g > 0,
then, with probability larger than 1-δ over the target concept, the sequence of examples, and the choices made by QBC:
- N_Sample is bounded,
- N_Label is proportional to log(N_Sample), and
- the error probability of Gibbs(V_QBC, x) is less than ε.
16 QBC Will Always Stop
- The cumulative information gain of all the samples (I_samples) grows more and more slowly as the number of samples m grows (it is bounded by roughly d·log(em/d)).
- The cumulative information gain from the queries (I_queries) is lower bounded per query, and thus grows linearly in the number of queries.
- Since I_samples ≥ I_queries, the time between two query events grows exponentially.
- Therefore the algorithm will eventually exceed the T_n bound and stop.
18 I_samples - Cumulative Information Gain
- The expected cumulative information gain after m samples, E[-log P(V_m)], equals the entropy of the distribution over the possible labelings of those m examples.
19 I_samples
- Sauer's Lemma: the number of distinct labelings of m examples is at most (em/d)^d.
- The uniform distribution over n possible labelings has the maximum entropy, log(n).
- Hence the maximum expected cumulative information gain is d·log(em/d).
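A quick numerical check of how slowly this bound grows; d = 10 is an arbitrary illustrative VC dimension.

```python
import numpy as np

d = 10                                             # illustrative VC dimension
for m in (100, 1_000, 10_000, 100_000):
    bound = d * np.log2(np.e * m / d)              # max cumulative info gain, in bits
    print(m, round(bound, 1), round(bound / m, 4)) # total and per-example gain -> 0
```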
20 The Error Probability
- Definition: Pr(h(x) ≠ c(x)), where h and c are drawn independently from the prior restricted to the version space.
- This is exactly the probability that QBC queries a sample (two independent Gibbs hypotheses disagree on x).
- It is therefore the quantity that QBC's stopping condition controls.
21 Before We Go Further…
- The basic intuition: gain more information by choosing examples that cut the version space into parts of similar size.
- This condition alone is not sufficient.
- If there exists a lower bound on the expected information gain, QBC will work.
- The error bound of QBC is based on the analogy between the problem definition and the Gibbs prediction rule, not on the version-space cutting itself.
22 But in Practice…
- The guarantees are proved for linear separators when both the sample distribution and the distribution over the version space are uniform. Is this setting realistic?
- Gibbs can be implemented by sampling from convex bodies (a hit-and-run sketch follows).
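One standard way to sample approximately uniformly from this convex body is a hit-and-run random walk. The sketch below is an illustrative implementation of that idea for the version space {w : ||w|| <= 1, y_i (w . x_i) >= 0}, not the specific sampler used in the papers behind these slides; `w0` is assumed to be a strictly feasible starting point (e.g., a scaled-down separating hyperplane).

```python
import numpy as np

rng = np.random.default_rng(0)

def hit_and_run_gibbs(X, y, w0, n_steps=200):
    """Approximate Gibbs oracle: random walk inside the version space
    {w : ||w|| <= 1, y_i (w . x_i) >= 0}, starting from a strictly feasible w0."""
    w = np.asarray(w0, dtype=float).copy()
    s = y[:, None] * X                          # rows s_i = y_i x_i; constraints s_i . w >= 0
    for _ in range(n_steps):
        d = rng.normal(size=w.shape)
        d /= np.linalg.norm(d)                  # uniformly random direction
        a, b = s @ w, s @ d                     # s_i . (w + t d) = a_i + t b_i >= 0
        lo, hi = -np.inf, np.inf
        for ai, bi in zip(a, b):
            if bi > 1e-12:
                lo = max(lo, -ai / bi)
            elif bi < -1e-12:
                hi = min(hi, -ai / bi)
        c1, c0 = 2 * (w @ d), w @ w - 1.0       # ||w + t d||^2 <= 1  <=>  t^2 + c1 t + c0 <= 0
        disc = np.sqrt(max(c1 * c1 - 4 * c0, 0.0))
        lo, hi = max(lo, (-c1 - disc) / 2), min(hi, (-c1 + disc) / 2)
        w = w + rng.uniform(lo, hi) * d         # uniform point on the feasible chord
    return w
```

Calling this sampler twice and checking whether the two returned hypotheses disagree on a candidate x reproduces the filtering test that QBC applies.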
23 Kernel QBC
24 What about Noise?
- In practice, labels may be noisy.
- Active learners are sensitive to noise, since they try to minimize redundancy.
25 Noise-Tolerant QBC
  do
    Let x be a random instance.
    θ1 = rand(posterior)
    θ2 = rand(posterior)
    If argmax_y p(y|x,θ1) ≠ argmax_y p(y|x,θ2) then
      ask for the label of x
      update the posterior
  until no labels were requested for t consecutive instances.
  Return rand(posterior)
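A runnable sketch of this variant on the noisy 1-D threshold problem, maintaining the posterior over the threshold on a discretized grid; the noise rate, grid size, and patience are illustrative choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 1001)       # discretized parameter space for the threshold
post = np.ones_like(grid) / grid.size    # uniform prior
w_true, eps = 0.37, 0.1                  # true threshold and label-noise rate (illustrative)

def noisy_label(x):
    y = int(x >= w_true)
    return y ^ (rng.random() < eps)      # flip the label with probability eps

def predict(theta, x):
    return int(x >= theta)               # argmax_y p(y | x, theta)

t, labels_used = 0, 0
while t < 200:                           # stop after 200 consecutive unqueried instances
    x = rng.random()
    th1, th2 = rng.choice(grid, size=2, p=post)        # two independent posterior draws
    if predict(th1, x) != predict(th2, x):             # committee disagreement
        y = noisy_label(x)                             # ask for the (noisy) label
        lik = np.where((grid <= x) == (y == 1), 1 - eps, eps)  # p(y | x, w) per grid point
        post *= lik
        post /= post.sum()                             # Bayesian posterior update
        labels_used, t = labels_used + 1, 0
    else:
        t += 1

w_hat = rng.choice(grid, p=post)         # Return rand(posterior)
print(round(abs(w_hat - w_true), 4), labels_used)
```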
26 SVM Active Learning with Applications to Text Classification (Tong & Koller, 2001)
- Setting: pool-based active learning.
- Aim: fast reduction of the version space's size.
- Heuristics for identifying the query that (approximately) halves the version space:
  - Simple Margin: choose as the next query the pool point closest to the current separator, i.e., min |w_i · Φ(x)| (sketched below).
  - MinMax Margin: choose the point maximizing min(m+, m-), the smaller of the two margins obtained by tentatively labeling it +1 or -1, to get a maximally balanced split.
  - Ratio Margin: choose the point maximizing min(m+/m-, m-/m+), to get an equal split.
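A minimal pool-based sketch of the Simple Margin heuristic using scikit-learn's SVC; the toy 2-D dataset, kernel, and query budget are illustrative stand-ins, not the text-classification setup from the following slides.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy linearly separable pool (hypothetical data).
X_pool = rng.normal(size=(500, 2))
y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)

# Seed with one example from each class so the first SVM can be trained.
labeled = [int(np.argmax(y_pool == 0)), int(np.argmax(y_pool == 1))]

for _ in range(20):
    clf = SVC(kernel="linear", C=10.0).fit(X_pool[labeled], y_pool[labeled])
    margins = np.abs(clf.decision_function(X_pool))    # |w . phi(x) + b| for every pool point
    margins[labeled] = np.inf                          # never re-query a labeled point
    labeled.append(int(np.argmin(margins)))            # Simple Margin: closest to the separator

clf = SVC(kernel="linear", C=10.0).fit(X_pool[labeled], y_pool[labeled])
print("pool accuracy after 22 labels:", clf.score(X_pool, y_pool))
```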
27 SVM Active Learning with Applications to Text Classification (Tong & Koller, 2001)
- In SVM, the version space is a set of unit weight vectors (the data must be separable in the feature space).
- Duality: points in the feature space F correspond to hyperplanes in the weight space W.
28 SVM Active Learning with Applications to Text Classification (Tong & Koller, 2001)
29 Results
- Reuters and newsgroups data.
- Each document is represented by a 10^5-dimensional vector of word frequencies.
30 Results
31 Results
32 What's Next…
- Theory meets practice.
- New methods (other than cutting the version space).
- The generative setting ("Committee-Based Sampling for Training Probabilistic Classifiers", Dagan & Engelson, 1995).
- Interesting applications.