Data Mining: Practical Machine Learning Tools and Techniques, by I. H. Witten, E. Frank and M. A. Hall
6.9: Semi-Supervised Learning
Rodney Nielsen, Human Intelligence & Language Technologies Lab
Many/most of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall

Implementation: Real Machine Learning Schemes
- Decision trees: from ID3 to C4.5 (pruning, numeric attributes, ...)
- Classification rules: from PRISM to RIPPER and PART (pruning, numeric data, ...)
- Association rules: frequent-pattern trees
- Extending linear models: support vector machines and neural networks
- Instance-based learning: pruning examples, generalized exemplars, distance functions

Implementation: Real Machine Learning Schemes
- Numeric prediction: regression/model trees, locally weighted regression
- Bayesian networks: learning and prediction, fast data structures for learning
- Clustering: hierarchical, incremental, probabilistic, Bayesian
- Semisupervised learning: clustering for classification, co-training

Semisupervised Learning
- Semisupervised learning attempts to use unlabeled data as well as labeled data; the aim is to improve classification performance
- Why try to do this? Unlabeled data is often plentiful and labeling data can be expensive
  - Web mining: classifying web pages
  - Text mining: identifying names in text
  - Video mining: classifying people in the news
- Leveraging the large pool of unlabeled examples would be very attractive

Clustering for Classification
- Idea: use Naïve Bayes on the labeled examples and then apply EM
  - Build a Naïve Bayes model on the labeled data
  - Until convergence:
    - Label the unlabeled data based on class probabilities ("Expectation" step)
    - Train a new Naïve Bayes model based on all the data ("Maximization" step)
- Essentially the same as EM for clustering, with fixed cluster-membership probabilities for the labeled data and #clusters = #classes
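
A minimal sketch of this loop, assuming scikit-learn's MultinomialNB over count features. MultinomialNB cannot fit fractional class memberships directly, so the M-step below approximates them with hard labels weighted by confidence; all variable names are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_lab, y_lab, X_unlab, n_iter=10):
    # Build an initial Naive Bayes model on the labeled data only.
    model = MultinomialNB().fit(X_lab, y_lab)
    X_all = np.vstack([X_lab, X_unlab])
    for _ in range(n_iter):
        # E-step: probabilistically label the unlabeled data; labeled
        # examples keep fixed one-hot membership probabilities.
        probs_u = model.predict_proba(X_unlab)
        probs_l = np.eye(len(model.classes_))[np.searchsorted(model.classes_, y_lab)]
        probs = np.vstack([probs_l, probs_u])
        # M-step: retrain on all the data, approximating fractional labels
        # by each example's most probable class, weighted by its confidence.
        y_hard = model.classes_[probs.argmax(axis=1)]
        model = MultinomialNB().fit(X_all, y_hard, sample_weight=probs.max(axis=1))
    return model
```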

Comments
- Has been applied successfully to document classification
  - Certain phrases are indicative of classes
  - Some of these phrases occur only in the unlabeled data, some in both sets
  - EM can generalize the model by taking advantage of co-occurrence of these phrases
- Refinement 1: reduce the weight of the unlabeled data
- Refinement 2: allow multiple clusters per class

Co-training
- Method for learning from multiple views (multiple sets of attributes), e.g.:
  - First set of attributes describes the content of a web page
  - Second set of attributes describes the links pointing to the web page
- Until stopping criteria:
  - Step 1: build a model from each view
  - Step 2: use the models to assign labels to the unlabeled data
  - Step 3: select the unlabeled examples that were most confidently predicted (often preserving the ratio of classes)
  - Step 4: add those examples to the training set
- Assumption: the views are independent
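
A sketch of one round-based variant of these steps, assuming scikit-learn logistic regression as the per-view learner; the pool handling, growth rate k, and array-based views are illustrative choices, not a reference implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1, X2, y, X1_u, X2_u, rounds=10, k=5):
    """X1/X2: two views of the labeled data; X1_u/X2_u: views of the unlabeled pool."""
    h1 = h2 = None
    for _ in range(rounds):
        # Step 1: build a model from each view.
        h1 = LogisticRegression(max_iter=1000).fit(X1, y)
        h2 = LogisticRegression(max_iter=1000).fit(X2, y)
        for view in (0, 1):
            if len(X1_u) == 0:
                break
            h, Xv = (h1, X1_u) if view == 0 else (h2, X2_u)
            # Steps 2-3: label the pool, keep the k most confident predictions.
            probs = h.predict_proba(Xv)
            picks = np.argsort(probs.max(axis=1))[-k:]
            y_new = h.classes_[probs[picks].argmax(axis=1)]
            # Step 4: add those examples (both views) to the training set.
            X1 = np.vstack([X1, X1_u[picks]])
            X2 = np.vstack([X2, X2_u[picks]])
            y = np.concatenate([y, y_new])
            keep = np.setdiff1d(np.arange(len(X1_u)), picks)
            X1_u, X2_u = X1_u[keep], X2_u[keep]
    return h1, h2
```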

EM and Co-training
- Like EM for semisupervised learning, but the view is switched in each iteration of EM
  - Uses all the unlabeled data (probabilistically labeled) for training
- Has also been used successfully with support vector machines
  - Using logistic models fit to the output of the SVMs to estimate a class probability distribution
- Co-training sometimes also seems to work when the views are chosen randomly!
  - Why? Maybe a co-trained classifier is more robust

Self-Training
  L ← L0
  Until stopping-criteria:
    h(x) ← f(L)
    U* ← select(U, h)
    L ← L0 + U*
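
A direct transcription of this loop, assuming f is any trainer returning a model and select is a selection policy such as those on the next slide; all names are illustrative.

```python
import numpy as np

def self_train(f, select, X_lab, y_lab, X_unlab, rounds=10):
    h = f(X_lab, y_lab)                    # h(x) <- f(L), with L = L0 initially
    for _ in range(rounds):                # until stopping-criteria
        X_sel, y_sel = select(X_unlab, h)  # U* <- select(U, h)
        X = np.vstack([X_lab, X_sel])      # L <- L0 + U*
        y = np.concatenate([y_lab, y_sel])
        h = f(X, y)                        # h(x) <- f(L)
    return h
```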

Example Selection
- Probability
- Probability ratio or probability margin
- Entropy
- Or several other possibilities (e.g., search Burr Settles' Active Learning tutorial)
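
Sketches of these scores, assuming an (n_examples x n_classes) class-probability matrix from a model's predict_proba; higher means more confidently labeled.

```python
import numpy as np

def confidence_scores(probs, method="margin"):
    top = np.sort(probs, axis=1)[:, ::-1]     # per-row probabilities, descending
    if method == "probability":               # highest class probability
        return top[:, 0]
    if method == "margin":                    # gap between the top two classes
        return top[:, 0] - top[:, 1]
    if method == "entropy":                   # negated entropy: low entropy = confident
        return np.sum(probs * np.log(probs + 1e-12), axis=1)
    raise ValueError(method)
```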

Stopping Criteria
- T rounds
- Repeat until convergence
- Use held-out validation data, or k-fold cross-validation

Seed Data vs. Seed Classifier
- Training on seed data does not necessarily result in a classifier that perfectly labels the seed data
- Training on data output by a seed classifier does not necessarily result in the same classifier

Indelibility
Indelible:
  L ← L0
  Until stopping-criteria:
    h(x) ← f(L)
    U* ← select(U, h)
    L ← L + U*
    U ← U - U*
Original: Y(U) can change
  L ← L0
  Until stopping-criteria:
    h(x) ← f(L)
    U* ← select(U, h)
    L ← L0 + U*

Persistence
Indelible:
  L ← L0
  Until stopping-criteria:
    h(x) ← f(L)
    U* ← select(U, h)
    L ← L + U*
    U ← U - U*
Persistent: X(L) can't change
  L ← L0
  Until stopping-criteria:
    h(x) ← f(L)
    U* ← U* + select(U, h)
    L ← L0 + U*
    U ← U - U*

Throttling
Throttled: select the k examples from U with the greatest confidence
  L ← L0
  Until stopping-criteria:
    h(x) ← f(L)
    U* ← select(U, h, k)
    L ← L0 + U*
Original (threshold): select all examples from U with confidence > θ
  L ← L0
  Until stopping-criteria:
    h(x) ← f(L)
    U* ← select(U, h, θ)
    L ← L0 + U*

Balanced
Balanced (and throttled): select k+ positive and k- negative examples; often k+ = k-, or they are proportional to N+ and N-
  L ← L0
  Until stopping-criteria:
    h(x) ← f(L)
    U* ← select(U, h, k)
    L ← L0 + U*
Throttled: select the k examples from U with the greatest confidence
  L ← L0
  Until stopping-criteria:
    h(x) ← f(L)
    U* ← select(U, h, k)
    L ← L0 + U*

Preselection
Preselect a subset of U: select examples from U', a (typically random) subset of U
  L ← L0
  Until stopping-criteria:
    h(x) ← f(L)
    U' ← select(U, φ)
    U* ← select(U', h, θ)
    L ← L0 + U*
Original: test all of U; select examples from all of U
  L ← L0
  Until stopping-criteria:
    h(x) ← f(L)
    U* ← select(U, h, θ)
    L ← L0 + U*
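
The variants on the last few slides differ only in the select() policy. Hedged sketches of those policies, assuming a per-example confidence vector like the one computed earlier and, for the balanced variant, binary 0/1 predictions; all returned values are indices into U.

```python
import numpy as np

def select_threshold(conf, theta):
    # Original: every example whose confidence exceeds a threshold.
    return np.where(conf > theta)[0]

def select_throttled(conf, k):
    # Throttled: the k most confident examples.
    return np.argsort(conf)[-k:]

def select_balanced(conf, y_pred, k_pos, k_neg):
    # Balanced: the k+ most confident positives and k- most confident negatives.
    pos = np.where(y_pred == 1)[0]
    neg = np.where(y_pred == 0)[0]
    return np.concatenate([pos[np.argsort(conf[pos])[-k_pos:]],
                           neg[np.argsort(conf[neg])[-k_neg:]]])

def preselect(n_unlabeled, m, seed=0):
    # Preselection: score only a random subset U' of U.
    rng = np.random.default_rng(seed)
    return rng.choice(n_unlabeled, size=min(m, n_unlabeled), replace=False)
```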

Co-training
- X = X1 × X2: two different views of the data
- x = (x1, x2); i.e., each instance is comprised of two distinct sets of features and values
- Assume each view is sufficient for correct classification

Co-Training Algorithm
[Table 1 from Blum and Mitchell, 1998]

Companionbots
- Perceptive, emotive, conversational healthcare companion robots

Elderly and Depression
- Depression
  - Leading cause of disability, M/F, all ages, worldwide (WHO)
  - Doubles the cost of care for chronic diseases
- Stats for 65+
  - Will double in number by 2030
  - 12-20%
  - 50-58% of hospital patients
  - 36-50% of healthcare expenditures

Companionbots Architecture
[Architecture diagram: sensory input (audio, vision, location, force/touch, distance measurement, radar, IR) feeds fundamental recognition (speech, object, emotion), understanding (language, scenario, situation, emotion, environment), prediction, and user modeling & history tracking; goal and behavior managers drive natural language generation, text-to-speech, expression/gesture/locomotion managers, and mechatronic control, with tools for question answering, information retrieval/extraction, and document summarization]
- Instance selection for co-training in emotion recognition

Multimodal Emotion Recognition
- Vision
- Speech
- Language
- Example input: "Why does this always have to happen to me?"

Co-Training Emotion Recognition
- Given: a set L of labeled training examples and a set U of unlabeled training examples
- Create a pool U' of examples by choosing u examples at random from U
- Loop for k iterations:
  - Use L to train a classifier h1 that considers only the vision view
  - Use L to train a classifier h2 that considers only the speech view
  - Use L to train a classifier h3 that considers only the language view
  - Allow h1 to label p1 positive and n1 negative examples from U'
  - Allow h2 to label p2 positive and n2 negative examples from U'
  - Allow h3 to label p3 positive and n3 negative examples from U'
  - Add these self-labeled examples to L
  - Randomly choose examples from U to replenish U'
(Adapted from Blum & Mitchell, 1998)

Semisupervised & Active Learning
- Most common strategy for instance selection: based on class probability estimates
- Semisupervised learning: select the k instances with the highest class probabilities
- Active learning: select the k instances with the lowest class probabilities

Active Learning
- Usually an abundance of unlabeled data
  - How much should you label?
  - Which instances should you label? Does it matter?
  - Can the learner benefit from selective labeling?
- Active learning: incrementally requests labels for key instances

Learning Paradigms
[Diagram contrasting supervised learning (all instances labeled), unsupervised learning (no labels), and active learning (mostly unlabeled instances, with labels obtained by random sampling or by query)]

Active Learning Applications
- Speech recognition
  - 10 minutes to annotate the words in 1 minute of speech
  - 7 hours to annotate the phonemes in 1 minute of speech
- Named entity recognition
  - Half an hour for a simple newswire article
  - PhD-level expertise needed for a bioinformatics article
- Image annotation

Face/Pedestrian/Object Detection

Heuristic Active Learning Algorithm
- Start with unlabeled data
- Randomly pick a small number of examples to have labeled
- Repeat:
  - Train a classifier on the labeled data
  - Query the unlabeled example that:
    - is closest to the boundary,
    - has the least certainty, or
    - minimizes overall uncertainty
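
A compact pool-based version of this loop, assuming scikit-learn logistic regression; `oracle` is a hypothetical stand-in for the human annotator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_pool, oracle, n_seed=10, n_queries=30, seed=0):
    rng = np.random.default_rng(seed)
    lab = list(rng.choice(len(X_pool), n_seed, replace=False))  # random seed set
    y = [oracle(i) for i in lab]
    for _ in range(n_queries):
        h = LogisticRegression(max_iter=1000).fit(X_pool[lab], y)
        conf = h.predict_proba(X_pool).max(axis=1)
        conf[lab] = np.inf                  # never re-query labeled instances
        i = int(np.argmin(conf))            # least certain / closest to boundary
        lab.append(i)
        y.append(oracle(i))                 # ask the oracle for the label
    return LogisticRegression(max_iter=1000).fit(X_pool[lab], y), lab
```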

Two Gaussians
a) Two classes with Gaussian distributions
b) Logistic regression on 30 random labeled examples: 70% accuracy
c) Logistic regression on 30 examples chosen by active learning: 90% accuracy

Space of Active Learning
- Query types / sampling method:
  - Membership query synthesis
  - Stream-based selective sampling
  - Pool-based active learning
- Selection strategies:
  - Uncertainty sampling
  - Query by committee
  - Expected model change
  - Variance reduction
  - Estimated error reduction
  - Density-weighted methods

Active Learning Query Types
[Figure from Burr Settles, 2009, Active Learning tutorial]

Membership Query Synthesis
- Dynamically construct query instances based on expected informativeness
- Applications
  - Character recognition
  - Robot scientist: find the optimal growth medium for a yeast
    - 3x decrease in cost vs. the next-cheapest strategy
    - 100x decrease in cost vs. random selection

Stream-based Selective Sampling
- Informativeness measure
- Region of uncertainty / version space
- Applications
  - Part-of-speech tagging
  - Sensor scheduling
  - IR ranking
  - Word sense disambiguation (WSD)

Pool-based Active Learning
- Informativeness measure
- Applications
  - Cancer diagnosis
  - Text classification
  - Information extraction
  - Image classification & retrieval
  - Video classification & retrieval
  - Speech recognition

Pool-based Active Learning Loop
[Figure from Burr Settles, 2009, Active Learning tutorial]

Questions?

Instance Sampling in Active Learning
- Query types / sampling method:
  - Membership query synthesis
  - Stream-based selective sampling
  - Pool-based active learning
- Selection strategies:
  - Uncertainty sampling
  - Query by committee
  - Expected model change
  - Variance reduction
  - Estimated error reduction
  - Density-weighted methods

Uncertainty Sampling
- Select examples based on confidence in the prediction:
  - Least confident
  - Margin sampling
  - Entropy-based models
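
A sketch of the three measures as a query rule, assuming class probabilities from a model's predict_proba; in contrast to the semisupervised selection earlier, the least certain instance is the one queried.

```python
import numpy as np

def query_uncertain(probs, method="least_confident"):
    top = np.sort(probs, axis=1)[:, ::-1]
    if method == "least_confident":
        scores = 1.0 - top[:, 0]                      # low top probability
    elif method == "margin":
        scores = -(top[:, 0] - top[:, 1])             # small margin = uncertain
    else:                                             # "entropy"
        scores = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return int(np.argmax(scores))                     # index of instance to query
```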

Query by Committee
- Train a committee of hypotheses representing different regions of the version space
- Obtain some measure of (dis)agreement on the instances in the dataset (e.g., vote entropy)
- Assume the most informative instance is the one on which the committee has the most disagreement
- Goal: minimize the version space
- No agreement on the size of the committee, but even 2-3 members provide good results
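
A sketch of vote-entropy disagreement, assuming each committee member exposes a scikit-learn-style predict(); all names are illustrative.

```python
import numpy as np

def vote_entropy(committee, X):
    votes = np.stack([h.predict(X) for h in committee])    # (members, n) label votes
    classes = np.unique(votes)
    # Fraction of the committee voting for each class, per instance.
    frac = np.stack([(votes == c).mean(axis=0) for c in classes], axis=1)
    return -np.sum(frac * np.log(frac + 1e-12), axis=1)    # high = most disagreement

# Query the instance with maximal committee disagreement:
# query_idx = int(np.argmax(vote_entropy(committee, X_pool)))
```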

Competing Hypotheses
[Figure from Burr Settles, 2009, Active Learning tutorial]

Expected Model Change
- Query the instance that would result in the largest expected change in h, based on the current model and an expectation over the possible labels
- E.g., the instance that would result in the largest gradient-descent step in the model parameters
- Prefer the instance x that leads to the most significant change in the model
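
One common instantiation is "expected gradient length": weight the norm of the training gradient each candidate would induce by the current model's label distribution. A sketch for a softmax-style classifier, assuming a fitted model with predict_proba; the gradient formula is the standard softmax cross-entropy gradient, not code from the slides.

```python
import numpy as np

def expected_gradient_length(model, X_pool):
    probs = model.predict_proba(X_pool)          # (n, K) label distribution per instance
    scores = np.zeros(len(X_pool))
    for i, (x, p) in enumerate(zip(X_pool, probs)):
        for k, p_k in enumerate(p):              # expectation over possible labels
            err = p.copy()
            err[k] -= 1.0                        # dLoss/dlogits if the true label were k
            grad = np.outer(err, x)              # gradient w.r.t. the (K x d) weights
            scores[i] += p_k * np.linalg.norm(grad)
    return int(np.argmax(scores))                # query the largest expected change
```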

Expected Model Change
- What learning algorithms does this work for?
- What are the issues?
  - Can be computationally expensive for large datasets and feature spaces
  - Can be led astray if features aren't properly scaled
    - How do you properly scale the features?

Estimated Error Reduction
- Other models approximate the goal of minimizing future error by minimizing a proxy (e.g., uncertainty, ...)
- Estimated error reduction attempts to directly minimize E[error]
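
A sketch of why this is expensive (see the complexity notes on the next slide): one model retraining per candidate instance per possible label. Assumes scikit-learn; expected future error is approximated here as expected 0/1 error over the remaining pool.

```python
import numpy as np
from sklearn.base import clone

def query_min_expected_error(model, X_lab, y_lab, X_pool):
    probs = model.predict_proba(X_pool)
    scores = np.zeros(len(X_pool))
    for i in range(len(X_pool)):
        for k, c in enumerate(model.classes_):   # expectation over candidate labels
            m = clone(model).fit(np.vstack([X_lab, X_pool[i:i+1]]),
                                 np.append(y_lab, c))
            p = m.predict_proba(np.delete(X_pool, i, axis=0))
            scores[i] += probs[i, k] * np.sum(1.0 - p.max(axis=1))
    return int(np.argmin(scores))                # query the expected-error minimizer
```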

Estimated Error Reduction
- Often computationally prohibitive
  - Binary logistic regression would be O(|U| |L| G), where G is the number of gradient-descent iterations to convergence
  - Conditional random fields would be O(T |Y|^(T+2) |U| |L| G), where T is the number of instances in the sequence

Variance Reduction
- Regression problems: E[error²] = noise + bias² + variance
  - The learner can't change the noise or bias, so minimize the variance
- The Fisher information ratio is used for classification

Outlier Phenomenon
- Uncertainty sampling and query-by-committee might be hindered by querying many outliers

Density-Weighted Methods
- Uncertainty sampling and query-by-committee might be hindered by querying many outliers
- Density-weighted methods overcome this potential problem by also considering whether the example is representative of the input distribution
- Tend to work better than the base selection strategies on their own
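
A sketch of the information-density idea: scale a base informativeness score by the instance's average similarity to the rest of the pool, so isolated outliers stop winning. The cosine-similarity choice and the beta exponent are illustrative assumptions.

```python
import numpy as np

def density_weighted_scores(informativeness, X_pool, beta=1.0):
    # Cosine similarity of every instance to every other instance in the pool.
    Xn = X_pool / (np.linalg.norm(X_pool, axis=1, keepdims=True) + 1e-12)
    density = (Xn @ Xn.T).mean(axis=1)           # average similarity = representativeness
    return informativeness * density**beta       # query the argmax of the weighted score
```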

Diversity
- Naïve selection by the earlier methods results in selecting examples that are very similar to one another
- Must factor this in and look for diversity in the queries

Active Learning Empirical Results
- Appears to work well, barring publication bias
[Figure from Burr Settles, 2009, Active Learning tutorial]

Labeling Costs
- Are all labels created equal?
  - Generating labels by experiments
  - Some instances are easier to label (e.g., shorter sentences)
  - Can pre-label data for a small savings
  - Experimental problems
- Value of information (VOI)
  - Considers labeling costs and estimated misclassification costs
  - Critical to the goal of active learning
  - Divide informativeness by cost?

Batch Mode Active Learning

Active Learning Evaluation
- Learning curves for text classification: baseball vs. hockey. Curves plot classification accuracy as a function of the number of documents queried, for two selection strategies: uncertainty sampling (active learning) and random sampling (passive learning). The active learning approach is superior here because its learning curve dominates that of random sampling.
[From Burr Settles, 2009, Active Learning tutorial]

Active Learning Evaluation
- We can conclude that an active learning algorithm is superior to some other approach (e.g., a random baseline like traditional passive supervised learning) if it dominates the other for most or all of the points along their learning curves.
[From Burr Settles, 2009, Active Learning tutorial]