Active Learning
2 Learning from Examples
- Passive learning: the learner receives a random set of labeled examples.
3 Active Learning
- Active learning: the learner chooses the specific examples to be labeled.
- The learner works harder in order to use fewer examples.
4 Membership Queries
- The learner constructs the examples itself from basic units.
- Some problems are learnable only under this setting (e.g., finite automata).
- Problem: the constructed queries might fall in irrelevant regions of the input space.
5 Selective Sampling
- Two oracles are available:
  - Sample: returns unlabeled examples drawn from the input distribution.
  - Label: given an unlabeled example, returns its label.
- Query filtering: from the set of unlabeled examples, choose the most informative ones and query for their labels (a minimal loop is sketched below).
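To make the query-filtering step concrete, here is a minimal Python sketch of a pool-based selective-sampling loop. The helper names (`sample`, `label`, `informativeness`, `fit`) are illustrative placeholders, not oracles defined on the slides.

```python
import numpy as np

def selective_sampling(sample, label, informativeness, fit, n_queries, pool_size=1000):
    """Query filtering: draw a pool with the Sample oracle, ask the Label oracle
    only for the most informative candidate, and refit after every query."""
    X_labeled, y_labeled, model = [], [], None
    for _ in range(n_queries):
        pool = np.array([sample() for _ in range(pool_size)])   # unlabeled examples ~ D
        scores = informativeness(model, pool)                   # e.g. expected info gain; must handle model=None
        x = pool[int(np.argmax(scores))]                        # most informative candidate
        X_labeled.append(x)
        y_labeled.append(label(x))                              # the only call that costs a label
        model = fit(np.array(X_labeled), np.array(y_labeled))
    return model
```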
6 Selecting the Most Informative Queries
- Input: examples x ∈ X ⊆ R^d, drawn from a distribution D.
- Concepts: c : X → {0,1}.
- Bayesian model: the target concept is drawn from a prior P over the concept class C.
- Version space: V_i = the set of concepts consistent with the first i labeled examples.
7 Selecting the Most Informative Queries
- Instantaneous information gain from the i-th example: the amount by which its label shrinks the version space, -log( P(V_i) / P(V_{i-1}) ).
8 Selecting the Most Informative Queries
- For the next example x_i, let P_0 = Pr(c(x_i) = 0) under the current version space.
- The expected information gain G(x_i | V_i) is the binary entropy H(P_0); it is maximized when P_0 = 1/2, i.e., when the example splits the version space into two equally probable parts.
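A quick numerical sketch of the curve behind this slide, assuming the expected gain is the binary entropy of P_0 as stated above (the probe values are arbitrary):

```python
import numpy as np

def expected_gain(p0):
    """Binary entropy H(P0) in bits: the expected information gain of querying x_i
    when Pr(c(x_i) = 0) = P0 under the current version space."""
    p0 = np.clip(p0, 1e-12, 1 - 1e-12)
    return -(p0 * np.log2(p0) + (1 - p0) * np.log2(1 - p0))

for p0 in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(p0, round(float(expected_gain(p0)), 3))   # maximal (1 bit) at P0 = 0.5
```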
9 Example
- X = [0,1]; the target is a threshold W ~ Uniform[0,1], labeling points below W with 0 ("-") and points above it with 1 ("+").
- V_i = [the largest x value labeled 0, the smallest x value labeled 1].
10 Example - Expected Prediction Error
- The final predictor's error is proportional to the length L of the version-space segment:
  - Both W_final and W_target are selected uniformly from the version space (density 1/L).
  - The error of such a pair is |W_final - W_target|.
- Using n random examples, the expected error decreases like 1/n.
- By always querying the point that cuts the version space in the middle, the expected error decreases exponentially in the number of queries (a small simulation follows).
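A small simulation sketch of this comparison on the threshold class (trial counts and the choice of predictor are illustrative): passive learning draws n random labeled points, while the active learner always queries the midpoint of the current version space.

```python
import numpy as np

rng = np.random.default_rng(0)

def passive_error(n, trials=2000):
    """Expected |W_final - W_target| when the version space is bounded by n random labeled points."""
    errs = []
    for _ in range(trials):
        w = rng.random()                              # W_target ~ Uniform[0, 1]
        xs = rng.random(n)
        lo = max(xs[xs < w], default=0.0)             # largest point labeled 0
        hi = min(xs[xs >= w], default=1.0)            # smallest point labeled 1
        errs.append(abs(rng.uniform(lo, hi) - w))     # W_final drawn uniformly from the VS
    return np.mean(errs)

def active_error(n, trials=2000):
    """Expected error when each of the n queries bisects the current version space."""
    errs = []
    for _ in range(trials):
        w = rng.random()
        lo, hi = 0.0, 1.0
        for _ in range(n):
            mid = (lo + hi) / 2.0                     # query the midpoint of the VS
            lo, hi = (mid, hi) if mid < w else (lo, mid)
        errs.append(abs(rng.uniform(lo, hi) - w))
    return np.mean(errs)

for n in (5, 10, 20):
    print(n, round(passive_error(n), 4), round(active_error(n), 6))  # ~1/n vs ~2^-n
```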
11 Query by Committee (Seung, Opper & Sompolinsky 1992; Freund, Seung, Shamir & Tishby)
- Oracles used:
  - Gibbs(V, x): draw a random hypothesis h from the prior P restricted to the version space V and return h(x).
  - Sample(): return an unlabeled example drawn from the input distribution.
  - Label(x): return the label of x.
12 Query by Committee (Seung, Opper & Sompolinsky 1992; Freund, Seung, Shamir & Tishby)
  While (t < T_n):
    x = Sample()
    y1 = Gibbs(V, x)
    y2 = Gibbs(V, x)
    If (y1 != y2) then
      query Label(x), use it to learn and to obtain the new version space
      t = 0
      update T_n
    else
      t = t + 1
  Return Gibbs(V_n, x)
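A runnable sketch of this loop for the 1-D threshold class from slide 9, where Gibbs(V, x) simply draws a random threshold from the current version-space interval. The fixed patience t_max stands in for the growing bound T_n; it and the other constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def qbc_threshold(w_target, t_max=30):
    """Query by Committee on thresholds over [0, 1]: the version space is an
    interval [lo, hi]; query only when two Gibbs hypotheses disagree."""
    lo, hi = 0.0, 1.0
    n_labels, t = 0, 0
    while t < t_max:
        x = rng.random()                        # Sample(): unlabeled example
        y1 = x >= rng.uniform(lo, hi)           # Gibbs(V, x), first committee member
        y2 = x >= rng.uniform(lo, hi)           # Gibbs(V, x), second committee member
        if y1 != y2:                            # disagreement -> informative query
            if x >= w_target:                   # Label(x)
                hi = min(hi, x)
            else:
                lo = max(lo, x)
            n_labels, t = n_labels + 1, 0
        else:
            t += 1
    return rng.uniform(lo, hi), n_labels        # final Gibbs prediction, labels used

w_hat, n_labels = qbc_threshold(w_target=0.37)
print(round(abs(w_hat - 0.37), 5), n_labels)
```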
13 QBC Finds Better Queries than Random
- The probability that QBC queries an example x which divides the version space into fractions F and 1-F is 2F(1-F).
- Reminder: the expected information gain of such an example is H(F).
- But this alone is not enough…
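A small numerical sketch of the combined effect, under the simplifying assumption (true for the threshold example) that a random candidate splits the version space at a fraction F roughly uniform on (0, 1): weighting the gain H(F) by QBC's query probability 2F(1-F) raises the average gain per issued query.

```python
import numpy as np

def H(f):
    """Binary entropy in bits."""
    f = np.clip(f, 1e-12, 1 - 1e-12)
    return -(f * np.log2(f) + (1 - f) * np.log2(1 - f))

F = np.linspace(0.005, 0.995, 199)               # possible VS split fractions
q = 2 * F * (1 - F)                              # prob. that QBC turns x into a query
print("avg gain of a random candidate:", round(float(H(F).mean()), 3))            # ~0.72 bits
print("avg gain of a QBC query:       ", round(float((q * H(F)).sum() / q.sum()), 3))  # ~0.84 bits
```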
14 Example
- W lies in [0,1]^2; each example X is a line parallel to one of the axes.
- The error is proportional to the perimeter of the version-space rectangle, so queries that halve the rectangle's area (e.g., always along the same axis) need not shrink the error quickly.
15 If for a concept class C:
- VCdim(C) < ∞, and
- the expected information gain of the queries made by QBC is uniformly lower bounded by some g > 0,
then, with probability larger than 1-δ over the target concept, the sequence of examples, and the choices made by QBC:
- N_Sample is bounded,
- N_Label is proportional to log(N_Sample), and
- the error probability of Gibbs(V_QBC, x) is less than ε.
16 QBC Will Always Stop
- The cumulative information gain of all the samples (I_samples) grows more and more slowly as the number of samples m grows (it is bounded by roughly d·log(em/d)).
- The cumulative information gain from the queries (I_queries) is lower bounded per query, and thus grows linearly in the number of queries.
- Since I_samples ≥ I_queries, the time between two query events grows exponentially.
- Therefore the algorithm will eventually exceed the T_n bound and stop.
18 I_samples - Cumulative Information Gain
- The expected cumulative information gain after m samples, E[-log P(V_m)], equals the entropy of the distribution over the possible labelings of those m examples.
19 I_samples
- Sauer's Lemma: the number of distinct labelings of m examples is at most (em/d)^d.
- The uniform distribution over n possible labelings has the maximum entropy, log(n).
- Hence the maximum expected cumulative information gain is d·log(em/d).
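A quick numerical check of how slowly this bound grows; d = 10 is an arbitrary illustrative VC dimension.

```python
import numpy as np

d = 10                                             # illustrative VC dimension
for m in (100, 1_000, 10_000, 100_000):
    bound = d * np.log2(np.e * m / d)              # max cumulative info gain, in bits
    print(m, round(bound, 1), round(bound / m, 4)) # total and per-example gain -> 0
```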
20 The Error Probability
- Definition: Pr(h(x) ≠ c(x)), where h and c are drawn independently from the prior restricted to the version space.
- This is exactly the probability that QBC queries a sample (two independent Gibbs hypotheses disagree on x).
- It is therefore the quantity that QBC's stopping condition controls.
21 Before We Go Further…
- The basic intuition: gain more information by choosing examples that cut the version space into parts of similar size.
- This condition alone is not sufficient.
- If there exists a lower bound on the expected information gain, QBC will work.
- The error bound of QBC is based on the analogy between the problem definition and the Gibbs prediction rule, not on the version-space cutting itself.
22 But in Practice…
- The guarantees are proved for linear separators when both the sample distribution and the distribution over the version space are uniform. Is this setting realistic?
- Gibbs can be implemented by sampling from convex bodies (a hit-and-run sketch follows).
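One standard way to sample approximately uniformly from this convex body is a hit-and-run random walk. The sketch below is an illustrative implementation of that idea for the version space {w : ||w|| <= 1, y_i (w . x_i) >= 0}, not the specific sampler used in the papers behind these slides; `w0` is assumed to be a strictly feasible starting point (e.g., a scaled-down separating hyperplane).

```python
import numpy as np

rng = np.random.default_rng(0)

def hit_and_run_gibbs(X, y, w0, n_steps=200):
    """Approximate Gibbs oracle: random walk inside the version space
    {w : ||w|| <= 1, y_i (w . x_i) >= 0}, starting from a strictly feasible w0."""
    w = np.asarray(w0, dtype=float).copy()
    s = y[:, None] * X                          # rows s_i = y_i x_i; constraints s_i . w >= 0
    for _ in range(n_steps):
        d = rng.normal(size=w.shape)
        d /= np.linalg.norm(d)                  # uniformly random direction
        a, b = s @ w, s @ d                     # s_i . (w + t d) = a_i + t b_i >= 0
        lo, hi = -np.inf, np.inf
        for ai, bi in zip(a, b):
            if bi > 1e-12:
                lo = max(lo, -ai / bi)
            elif bi < -1e-12:
                hi = min(hi, -ai / bi)
        c1, c0 = 2 * (w @ d), w @ w - 1.0       # ||w + t d||^2 <= 1  <=>  t^2 + c1 t + c0 <= 0
        disc = np.sqrt(max(c1 * c1 - 4 * c0, 0.0))
        lo, hi = max(lo, (-c1 - disc) / 2), min(hi, (-c1 + disc) / 2)
        w = w + rng.uniform(lo, hi) * d         # uniform point on the feasible chord
    return w
```

Calling this sampler twice and checking whether the two returned hypotheses disagree on a candidate x reproduces the filtering test that QBC applies.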
23 Kernel QBC
24 What about Noise?
- In practice, labels may be noisy.
- Active learners are sensitive to noise, since they try to minimize redundancy.
25 Noise-Tolerant QBC
  do
    Let x be a random instance.
    θ1 = rand(posterior)
    θ2 = rand(posterior)
    If argmax_y p(y|x,θ1) ≠ argmax_y p(y|x,θ2) then
      ask for the label of x
      update the posterior
  until no labels were requested for t consecutive instances.
  Return rand(posterior)
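A runnable sketch of this variant on the noisy 1-D threshold problem, maintaining the posterior over the threshold on a discretized grid; the noise rate, grid size, and patience are illustrative choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 1001)       # discretized parameter space for the threshold
post = np.ones_like(grid) / grid.size    # uniform prior
w_true, eps = 0.37, 0.1                  # true threshold and label-noise rate (illustrative)

def noisy_label(x):
    y = int(x >= w_true)
    return y ^ (rng.random() < eps)      # flip the label with probability eps

def predict(theta, x):
    return int(x >= theta)               # argmax_y p(y | x, theta)

t, labels_used = 0, 0
while t < 200:                           # stop after 200 consecutive unqueried instances
    x = rng.random()
    th1, th2 = rng.choice(grid, size=2, p=post)        # two independent posterior draws
    if predict(th1, x) != predict(th2, x):             # committee disagreement
        y = noisy_label(x)                             # ask for the (noisy) label
        lik = np.where((grid <= x) == (y == 1), 1 - eps, eps)  # p(y | x, w) per grid point
        post *= lik
        post /= post.sum()                             # Bayesian posterior update
        labels_used, t = labels_used + 1, 0
    else:
        t += 1

w_hat = rng.choice(grid, p=post)         # Return rand(posterior)
print(round(abs(w_hat - w_true), 4), labels_used)
```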
26 SVM Active Learning with Applications to Text Classification (Tong & Koller, 2001)
- Setting: pool-based active learning.
- Aim: fast reduction of the version space's size.
- Heuristics for identifying the query that (approximately) halves the version space:
  - Simple Margin: choose as the next query the pool point closest to the current separator, i.e., min |w_i · Φ(x)| (sketched below).
  - MinMax Margin: choose the point maximizing min(m+, m-), the smaller of the two margins obtained by tentatively labeling it +1 or -1, to get a maximally balanced split.
  - Ratio Margin: choose the point maximizing min(m+/m-, m-/m+), to get an equal split.
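A minimal pool-based sketch of the Simple Margin heuristic using scikit-learn's SVC; the toy 2-D dataset, kernel, and query budget are illustrative stand-ins, not the text-classification setup from the following slides.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy linearly separable pool (hypothetical data).
X_pool = rng.normal(size=(500, 2))
y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)

# Seed with one example from each class so the first SVM can be trained.
labeled = [int(np.argmax(y_pool == 0)), int(np.argmax(y_pool == 1))]

for _ in range(20):
    clf = SVC(kernel="linear", C=10.0).fit(X_pool[labeled], y_pool[labeled])
    margins = np.abs(clf.decision_function(X_pool))    # |w . phi(x) + b| for every pool point
    margins[labeled] = np.inf                          # never re-query a labeled point
    labeled.append(int(np.argmin(margins)))            # Simple Margin: closest to the separator

clf = SVC(kernel="linear", C=10.0).fit(X_pool[labeled], y_pool[labeled])
print("pool accuracy after 22 labels:", clf.score(X_pool, y_pool))
```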
27 SVM Active Learning with Applications to Text Classification (Tong & Koller, 2001)
- In SVM, the version space is a set of unit weight vectors (the data must be separable in the feature space).
- Duality: points in the feature space F correspond to hyperplanes in the weight space W.
28 SVM Active Learning with Applications to Text Classification (Tong & Koller, 2001)
29 Results
- Reuters and newsgroups data.
- Each document is represented by a 10^5-dimensional vector of word frequencies.
30 Results
31 Results
32 What's Next…
- Theory meets practice.
- New methods (other than cutting the version space).
- The generative setting ("Committee-Based Sampling for Training Probabilistic Classifiers", Dagan & Engelson, 1995).
- Interesting applications.