Machine Learning CSE 454

Administrivia PS1 due next Tues 10/13. Project proposals also due then. Group meetings with Dan; signup out shortly

Class Overview Network Layer Document Layer Crawling Indexing Content Analysis Query processing Other Cool Stuff

© Daniel S. Weld 4 Today’s Outline Brief supervised learning review Evaluation Overfitting Ensembles Learners: The more the merrier Co-Training (Semi) Supervised learning with few labeled training ex

Types of Learning Supervised (inductive) learning Training data includes desired outputs Semi-supervised learning Training data includes a few desired outputs Unsupervised learning Training data doesn’t include desired outputs Reinforcement learning Rewards from sequence of actions

© Daniel S. Weld 6 Supervised Learning Inductive learning or “Prediction”: Given examples of a function (X, F(X)), predict the function F(X) for new examples X Classification: F(X) = discrete Regression: F(X) = continuous Probability estimation: F(X) = Probability(X)

Classifier Hypothesis: function for labeling examples [Figure: training points marked with labels + and -, plus several unlabeled points marked ?]

© Daniel S. Weld 8 Bias Which hypotheses will you consider? Which hypotheses do you prefer?

Naïve Bayes Probabilistic classifier: P(C_i | Example) Bias? Assumes all features are conditionally independent given the class Therefore, we only need to know P(e_j | C_i) for each feature and category © Daniel S. Weld 9

10 Naïve Bayes for Text Modeled as generating a bag of words for a document in a given category Assumes that word order is unimportant, only cares whether a word appears in the document Smooth probability estimates with Laplace m-estimates, assuming a uniform distribution over words (p = 1/|V|) and m = |V| Equivalent to a virtual sample of seeing each word in each category exactly once.
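To make this concrete, here is a minimal sketch of a bag-of-words Naïve Bayes classifier with Laplace smoothing; the tiny corpus, class names, and function names are illustrative, not from the slides.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(docs):
    """docs: list of (list_of_words, category). Returns priors and smoothed word likelihoods."""
    vocab = {w for words, _ in docs for w in words}
    class_counts = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)          # word_counts[c][w] = count of w in class c
    for words, c in docs:
        word_counts[c].update(words)
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    likelihoods = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        # Laplace m-estimate with m = |V| and p = 1/|V|: one virtual occurrence of each word per category
        likelihoods[c] = {w: (word_counts[c][w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, likelihoods

def classify(words, priors, likelihoods):
    scores = {}
    for c in priors:
        scores[c] = log(priors[c]) + sum(log(likelihoods[c][w]) for w in words if w in likelihoods[c])
    return max(scores, key=scores.get)

docs = [("the county seat is here".split(), "County"),
        ("the country has a language and population".split(), "Country")]
priors, likelihoods = train_nb(docs)
print(classify("county seat".split(), priors, likelihoods))   # -> County
```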

Naïve Bayes Country vs. County
Features: Population, Seat, Language
Probability(Seat | County) = ??
Probability(Seat | Country) = ??
Pop.  Seat  Lang.  Class
Y     Y     N      County
Y     Y     Y      County
Y     N     Y      Country
N     N     Y      Country

Naïve Bayes Country vs. County
Probability(Seat | County) = 2 / 2 = 1.0
Probability(Seat | Country) = ??
Pop.  Seat  Lang.  Class
Y     Y     N      County
Y     Y     Y      County
Y     N     Y      Country
N     N     Y      Country

Naïve Bayes Country vs. County (with Laplace smoothing)
Probability(Seat | County) = (2 + 1) / (2 + 2) = 0.75
Probability(Seat | Country) = (0 + 1) / (2 + 2) = 0.25
Pop.  Seat  Lang.  Class
Y     Y     N      County
Y     Y     Y      County
Y     N     Y      Country
N     N     Y      Country

Probabilities: Important Detail! Any more potential problems here? P(spam | E_1 … E_n) = ∏_i P(spam | E_i) We are multiplying lots of small numbers: danger of underflow! (e.g., a product as small as 7E-18) Solution? Use logs and add! p_1 * p_2 = e^(log(p_1) + log(p_2)) Always keep in log form
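A tiny sketch of why log space matters (the numbers are illustrative, not from the slide):

```python
import math

probs = [1e-5] * 80                      # 80 features, each with probability 1e-5

naive = 1.0
for p in probs:
    naive *= p                           # underflows to 0.0 in double precision

log_score = sum(math.log(p) for p in probs)   # stays representable

print(naive)       # 0.0
print(log_score)   # about -921.0
```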

Multi-Class Categorization Pick the category with max probability Create many 1-vs-other classifiers Classes = City, County, Country Classifier 1 = {City} vs. {County, Country} Classifier 2 = {County} vs. {City, Country} Classifier 3 = {Country} vs. {City, County} 15
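A minimal sketch of the 1-vs-other construction; the base learner and data interfaces are placeholders, not the course's code.

```python
def train_one_vs_other(examples, classes, train_binary):
    """examples: list of (features, label). train_binary(data) returns a scoring function f(features) -> P(positive)."""
    classifiers = {}
    for c in classes:
        relabeled = [(x, 1 if y == c else 0) for x, y in examples]   # c vs. everything else
        classifiers[c] = train_binary(relabeled)
    return classifiers

def predict(classifiers, x):
    # Pick the category whose binary classifier is most confident
    return max(classifiers, key=lambda c: classifiers[c](x))
```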

Multi-Class Categorization Use a hierarchical approach (wherever a hierarchy is available) [Figure: class hierarchy: Entity splits into Person and Location; Person splits into Scientist and Artist; Location splits into City, County, and Country] 16

© Daniel S. Weld 17 Today’s Outline Brief supervised learning review Evaluation Overfitting Ensembles Learners: The more the merrier Co-Training (Semi) Supervised learning with few labeled training ex

Experimental Evaluation Question: How do we estimate the performance of a classifier on unseen data? Can't just look at accuracy on training data – this will yield an over-optimistic estimate of performance Solution: Cross-validation Note: this is sometimes called estimating how well the classifier will generalize © Daniel S. Weld 18

Evaluation: Cross Validation Partition examples into k disjoint sets Now create k training sets Each set is the union of all the partitions except one So each set has (k-1)/k of the original training data [Figure: k folds; each fold is held out in turn as the test set while the rest are used to train]

Cross-Validation (2) Leave-one-out Use if < 100 examples (rough estimate) Hold out one example, train on remaining examples 10-fold Use if you have 100's of examples M of N fold Repeat M times: divide data into N folds, do N-fold cross-validation
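A minimal k-fold cross-validation sketch; the train and accuracy functions are placeholders for whatever learner is being evaluated.

```python
import random

def cross_validate(examples, k, train, accuracy):
    """Average test accuracy over k folds. train(data) -> model; accuracy(model, data) -> float."""
    examples = examples[:]
    random.shuffle(examples)
    folds = [examples[i::k] for i in range(k)]       # k disjoint sets
    scores = []
    for i in range(k):
        test = folds[i]
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(train_set)                     # trained on (k-1)/k of the data
        scores.append(accuracy(model, test))
    return sum(scores) / k
```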

© Daniel S. Weld 21 Today’s Outline Brief supervised learning review Evaluation Overfitting Ensembles Learners: The more the merrier Co-Training (Semi) Supervised learning with few labeled training ex Clustering No training examples

Overfitting Definition Hypothesis H is overfit when ∃ H' such that H has smaller error than H' on the training examples, but H has bigger error than H' on the test examples Causes of overfitting Noisy data, or Training set is too small Large number of features Big problem in machine learning One solution: Validation set

© Daniel S. Weld 23 Overfitting [Figure: accuracy on training data vs. on test data as a function of model complexity (e.g., number of nodes in decision tree)]

Validation/Tuning Set Split data into train and validation set Score each model on the tuning set, use it to pick the ‘best’ model [Figure: data split into Train, Tune, and Test portions]
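A hedged sketch of picking a model with a tuning set; the candidate models and fit/score functions are placeholders.

```python
def select_model(candidate_models, train_set, tune_set, fit, score):
    """fit(model, data) -> fitted model; score(fitted, data) -> accuracy. Returns the best fitted model."""
    best, best_score = None, float("-inf")
    for m in candidate_models:
        fitted = fit(m, train_set)
        s = score(fitted, tune_set)       # never look at the test set while choosing
        if s > best_score:
            best, best_score = fitted, s
    return best
```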

© Daniel S. Weld 25 Early Stopping [Figure: accuracy on training, validation, and test data vs. model complexity (e.g., number of nodes in decision tree); "Remember this and use it as the final classifier" points to the model where validation accuracy peaks]

Extra Credit Ideas Different types of models Support Vector Machines (SVMs), widely used in web search Tree-augmented naïve Bayes Feature construction © Daniel S. Weld 26

Support Vector Machines Which one is the best hypothesis?

Support Vector Machines Largest distance to neighboring data points SVMs in Weka: SMO

Tree Augmented Naïve Bayes (TAN) [Friedman, Geiger & Goldszmidt 1997] [Figure: Bayes net with a Class Value node pointing to feature nodes F_1 … F_N, plus a tree of dependency edges among the features] Models a limited set of dependencies Guaranteed to find the best structure Runs in polynomial time

Construct Better Features Key to machine learning is having good features In industrial data mining, large effort devoted to constructing appropriate features Ideas?? © Daniel S. Weld 30

Possible Feature Ideas Look at capitalization (may indicate a proper noun) Look for commonly occurring sequences E.g. New York, New York City Limit to 2-3 consecutive words Keep all that meet a minimum threshold (e.g. occur at least 5 or 10 times in the corpus) © Daniel S. Weld 31
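A rough sketch of these two feature ideas; the tokenization and thresholds are simplified assumptions.

```python
from collections import Counter

def ngram_features(tokenized_docs, n=2, min_count=5):
    """Return n-grams of consecutive words that occur at least min_count times in the corpus."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {ng for ng, c in counts.items() if c >= min_count}

def capitalization_feature(token):
    # May indicate a proper noun
    return token[:1].isupper()
```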

Properties of Text Word frequencies have a skewed distribution `The’ and `of’ account for 10% of all words Six most common words account for 40% Zipf’s Law: rank * probability = c E.g., c = 0.1 From [Croft, Metzler & Strohman 2010]
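A quick sketch of checking Zipf's law on a token stream (the corpus and function name are placeholders): for the top-ranked words, rank * probability should be roughly constant, around 0.1 for English.

```python
from collections import Counter

def zipf_constants(tokens, top=10):
    """Return (word, rank * probability) for the top-ranked words; under Zipf's law these are roughly constant."""
    counts = Counter(tokens)
    total = sum(counts.values())
    ranked = counts.most_common(top)
    return [(word, rank * count / total) for rank, (word, count) in enumerate(ranked, start=1)]
```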

Associated Press Corpus `AP89’ From [Croft, Metzler & Strohman 2010]

Middle Ground Very common words → bad features Language-based stop list: words that bear little meaning Subject-dependent stop lists Very rare words are also bad features Drop words appearing less than k times in the corpus

35 Stop lists Language-based stop list: words that bear little meaning Subject-dependent stop lists From Peter Brusilovsky, Univ. of Pittsburgh, INFSCI 2140

36 Stemming Are there different index terms? retrieve, retrieving, retrieval, retrieved, retrieves… Stemming algorithm: (retrieve, retrieving, retrieval, retrieved, retrieves) → retriev Strips prefixes or suffixes (-s, -ed, -ly, -ness) Morphological stemming Copyright © Weld

37 Stemming Continued Can reduce vocabulary by ~1/3 C, Java, Perl, Python, C# versions available Criterion for removing a suffix: does "a document is about w_1" mean the same as "a document is about w_2"? Problems: sand / sander & wand / wander Commercial SEs use giant in-memory tables Copyright © Weld
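As an aside, a hedged example of applying an off-the-shelf stemmer; this uses NLTK's Porter stemmer, which is not necessarily the implementation the slides have in mind.

```python
from nltk.stem import PorterStemmer   # pip install nltk

stemmer = PorterStemmer()
words = ["retrieve", "retrieving", "retrieval", "retrieved", "retrieves"]
print([stemmer.stem(w) for w in words])
# All forms collapse to a common stem such as 'retriev', shrinking the vocabulary.
# Known failure modes from the slide: pairs like sand/sander and wand/wander can be handled inconsistently.
```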

© Daniel S. Weld 38 Today’s Outline Brief supervised learning review Evaluation Overfitting Ensembles Learners: The more the merrier Co-Training (Semi) Supervised learning with few labeled training ex

© Daniel S. Weld 39 Ensembles of Classifiers Traditional approach: Use one classifier Alternative approach: Use lots of classifiers Approaches: Cross-validated committees Bagging Boosting Stacking

© Daniel S. Weld 40 Voting

© Daniel S. Weld 41 Ensembles of Classifiers Assume errors are independent (suppose 30% error) Majority vote over an ensemble of 21 classifiers Probability that the majority is wrong = area under the binomial distribution for ≥ 11 classifiers in error If the individual error is 0.3, this area is tiny: an order of magnitude improvement! [Figure: binomial distribution over the number of classifiers in error]
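A short sketch of that calculation, assuming independent errors as the slide does (function name is illustrative):

```python
from math import comb

def majority_error(n=21, p=0.3):
    """Probability that a majority of n independent classifiers, each with error p, is wrong."""
    k_min = n // 2 + 1                      # 11 out of 21
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

print(majority_error())   # roughly 0.026, versus 0.3 for a single classifier
```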

© Daniel S. Weld 42 Constructing Ensembles Partition examples into k disjoint equiv classes Now create k training sets Each set is union of all equiv classes except one So each set has (k-1)/k of the original training data Now train a classifier on each set Cross-validated committees Holdout

© Daniel S. Weld 43 Ensemble Construction II Generate k sets of training examples For each set Draw m examples randomly (with replacement) From the original set of m examples Each training set corresponds to 63.2% of original (+ duplicates) Now train classifier on each set Intuition: Sampling helps algorithm become more robust to noise/outliers in the data Bagging
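A minimal bagging sketch; the base learner and majority-vote combination are generic placeholders, not this course's code.

```python
import random
from collections import Counter

def bagging(examples, k, train):
    """Train k classifiers, each on a bootstrap sample (m draws with replacement from m examples)."""
    m = len(examples)
    return [train([random.choice(examples) for _ in range(m)]) for _ in range(k)]

def vote(classifiers, x):
    # Majority vote over the ensemble's predictions
    return Counter(c(x) for c in classifiers).most_common(1)[0][0]
```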

© Daniel S. Weld 44 Ensemble Creation III Maintain prob distribution over set of training ex Create k sets of training data iteratively: On iteration i Draw m examples randomly (like bagging) But use probability distribution to bias selection Train classifier number i on this training set Test partial ensemble (of i classifiers) on all training exs Modify distribution: increase P of each error ex Create harder and harder learning problems... “Bagging with optimized choice of examples” Boosting
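A rough sketch of that weighted-resampling loop. It follows the generic recipe on the slide with a simplified reweighting rule (doubling the weight of misclassified examples); it is not a full AdaBoost implementation, and the interfaces are assumptions.

```python
import random

def boost(examples, k, train):
    """examples: list of (x, y). train(sample) -> classifier (a function of x). Returns k classifiers."""
    weights = [1.0] * len(examples)
    ensemble = []
    for _ in range(k):
        # Like bagging, but the draw is biased by the current distribution over examples
        sample = random.choices(examples, weights=weights, k=len(examples))
        ensemble.append(train(sample))
        # Test the partial ensemble and increase the weight of examples it still gets wrong
        for i, (x, y) in enumerate(examples):
            preds = [c(x) for c in ensemble]
            majority = max(set(preds), key=preds.count)
            if majority != y:
                weights[i] *= 2.0
    return ensemble
```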

© Daniel S. Weld 45 Ensemble Creation IV Stacking Train several base learners Next train a meta-learner Learns when base learners are right / wrong Now the meta-learner arbitrates Train using cross-validated committees Meta-learner inputs = base learner predictions Training examples = ‘test set’ from cross validation
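A small sketch of stacking with cross-validated predictions as the meta-learner's training inputs; the learner interfaces are placeholders.

```python
def stack(examples, base_trainers, meta_trainer, k=5):
    """examples: list of (x, y). Each trainer maps a training list to a predict function."""
    folds = [examples[i::k] for i in range(k)]
    meta_examples = []
    # Build the meta-learner's training set from held-out ('test set') predictions
    for i in range(k):
        held_out = folds[i]
        rest = [e for j, f in enumerate(folds) if j != i for e in f]
        models = [t(rest) for t in base_trainers]
        meta_examples += [([m(x) for m in models], y) for x, y in held_out]
    meta = meta_trainer(meta_examples)
    base = [t(examples) for t in base_trainers]      # retrain base learners on all data
    return lambda x: meta([m(x) for m in base])      # the meta-learner arbitrates
```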

© Daniel S. Weld 46 Today’s Outline Brief supervised learning review Evaluation Overfitting Ensembles Learners: The more the merrier Co-Training (Semi) Supervised learning with few labeled training ex

© Daniel S. Weld 47 Co-Training Motivation Learning methods need labeled data Lots of (x, f(x)) pairs Hard to get… (who wants to label data?) But unlabeled data is usually plentiful… Could we use this instead? Semi-supervised learning

© Daniel S. Weld 48 Co-training Suppose: Have little labeled data + lots of unlabeled Each instance has two parts: x = [x1, x2] x1, x2 conditionally independent given f(x) Each half can be used to classify the instance: ∃ f1, f2 such that f1(x1) ~ f2(x2) ~ f(x) Both f1, f2 are learnable: f1 ∈ H1, f2 ∈ H2, ∃ learning algorithms A1, A2

Co-training Example © Daniel S. Weld 49 [Figure: three linked web pages: Prof. Domingos's home page ("Students: Parag, …; Projects: SRL, Data mining; I teach a class on data mining"), the CSE 546: Data Mining course page ("Course Description: …, Topics: …, Homework: …"), and Jesse's student page ("Classes taken: 1. Data mining, 2. Machine learning; Research: SRL")]

© Daniel S. Weld 50 Without Co-training f1(x1) ~ f2(x2) ~ f(x) A1 learns f1 from x1, A2 learns f2 from x2 [Figure: a few labeled instances [x1, x2] are fed to A1 and A2, which produce f1 and f2; the unlabeled instances are never touched] Combine f1 and f2 with an ensemble into f'? Bad!! Not using the unlabeled instances!

© Daniel S. Weld 51 Co-training f1(x1) ~ f2(x2) ~ f(x) A1 learns f1 from x1, A2 learns f2 from x2 [Figure: A1 learns f1 from the few labeled instances [x1, x2]; f1 labels the unlabeled instances, producing lots of labeled instances; A2 trains on those to produce hypothesis f2]

© Daniel S. Weld 52 Observations Can apply A1 to generate as much training data as one wants If x1 is conditionally independent of x2 given f(x), then the errors in the labels produced by A1 will look like random noise to A2 !!! Thus no limit to the quality of the hypothesis A2 can make

© Daniel S. Weld 53 Co-training f1(x1) ~ f2(x2) ~ f(x) A1 learns f1 from x1, A2 learns f2 from x2 [Figure: the process iterates: each hypothesis labels lots of unlabeled instances for the other learner, and f1 and f2 keep improving]
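A compact sketch of the co-training loop just described; the train/predict interfaces, confidence handling, and number of rounds are illustrative assumptions rather than the original Blum & Mitchell procedure.

```python
def co_train(labeled, unlabeled, train1, train2, rounds=10, per_round=5):
    """labeled: list of ((x1, x2), y). unlabeled: list of (x1, x2).
    train1/train2 fit on one view and return predict(view) -> (label, confidence)."""
    labeled = labeled[:]
    unlabeled = unlabeled[:]
    for _ in range(rounds):
        f1 = train1([(x[0], y) for x, y in labeled])
        f2 = train2([(x[1], y) for x, y in labeled])
        # Move the unlabeled examples the classifiers are most confident about into the labeled pool
        scored = [(max(f1(x[0])[1], f2(x[1])[1]), x) for x in unlabeled]
        scored.sort(reverse=True, key=lambda s: s[0])
        for _, x in scored[:per_round]:
            label1, conf1 = f1(x[0])
            label2, conf2 = f2(x[1])
            labeled.append((x, label1 if conf1 >= conf2 else label2))
            unlabeled.remove(x)
    return f1, f2
```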

© Daniel S. Weld 54 It really works! Learning to classify web pages as course pages x1 = bag of words on a page x2 = bag of words from all anchors pointing to a page Naïve Bayes classifiers 12 labeled pages 1039 unlabeled