Slides for “Data Mining” by I. H. Witten and E. Frank

Chapter 7: Engineering the input and output
- Attribute selection
  - Scheme-independent, scheme-specific
- Attribute discretization
  - Unsupervised, supervised, error- vs. entropy-based
- Nominal-to-binary conversion
- Dirty data
  - Data cleansing
  - Robust regression
  - Anomaly detection
- Meta-learning
  - Bagging
  - Boosting
  - Stacking
  - Error-correcting output codes

Just apply a learner? No!
- Scheme/parameter selection: treat the selection process as part of the learning process
- Modifying the input:
  - Attribute selection
  - Discretization
  - Data cleansing
  - Transformations
- Modifying the output: combine models to improve performance
  - Bagging
  - Boosting
  - Stacking
  - Error-correcting output codes
  - Bayesian model averaging

Attribute selection
- Adding a random (i.e. irrelevant) attribute can significantly degrade C4.5's performance
  - Problem: attribute selection deeper in the tree is based on smaller and smaller amounts of data
- Instance-based learning (IBL) is very susceptible to irrelevant attributes
  - The number of training instances required increases exponentially with the number of irrelevant attributes
- Naïve Bayes doesn't have this problem
- Relevant attributes can also be harmful!

Scheme-independent attribute selection
- Filter approach: assess attributes based on general characteristics of the data
- One method: find a subset of attributes that suffices to separate all the instances
- Another method: use a different learning scheme (e.g. C4.5, 1R) to select attributes
- IBL-based attribute weighting techniques are also applicable (but can't find redundant attributes)
- CFS: uses correlation-based evaluation of attribute subsets

Attribute subsets for weather data

Searching the attribute space
- The number of attribute subsets is exponential in the number of attributes
- Common greedy approaches:
  - Forward selection
  - Backward elimination
- More sophisticated strategies:
  - Bidirectional search
  - Best-first search: can find the optimum solution
  - Beam search: approximation to best-first search
  - Genetic algorithms

Scheme-specific selection
- Wrapper approach to attribute selection: implement a "wrapper" around the learning scheme
- Evaluation criterion: cross-validation performance
- Time consuming:
  - Greedy approach: k attributes ⇒ roughly k² × the learning time
  - Prior ranking of attributes ⇒ linear in k
- For learning decision tables, scheme-specific attribute selection is essential
- Can operate efficiently for decision tables and Naïve Bayes
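As a concrete illustration of the wrapper idea, here is a minimal Python sketch of greedy forward selection in which each candidate attribute subset is scored by the cross-validation accuracy of the chosen learner. The decision-tree learner, the 5-fold cross-validation and the stopping rule are illustrative choices, not the book's (Weka's) implementation; the loop also shows why the greedy approach costs roughly k² learning runs for k attributes.

```python
# Greedy forward selection with a wrapper (cross-validation) evaluation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier  # any learning scheme can be wrapped

def forward_select(X, y, learner=None, cv=5):
    learner = learner if learner is not None else DecisionTreeClassifier()
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # score every one-attribute extension of the current subset
        scores = {a: cross_val_score(learner, X[:, selected + [a]], y, cv=cv).mean()
                  for a in remaining}
        best_attr = max(scores, key=scores.get)
        if scores[best_attr] <= best_score:   # no improvement: stop (greedy)
            break
        best_score = scores[best_attr]
        selected.append(best_attr)
        remaining.remove(best_attr)
    return selected, best_score
```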

Attribute discretization (numeric attributes only)
- Avoids the normality assumption in Naïve Bayes and clustering
- 1R uses a simple discretization scheme
- C4.5 performs local discretization
- Global discretization can be advantageous because it is based on more data
- Apply the learner to the k-valued discretized attribute, or to k – 1 binary attributes that code the cut points

Discretization: unsupervised
- Determine intervals without knowing the class labels
  - When clustering, this is the only possible way!
- Two strategies:
  - Equal-interval binning
  - Equal-frequency binning (also called histogram equalization)
- Inferior to supervised schemes in classification tasks
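A small sketch of the two strategies using NumPy; the choice of k = 3 bins is arbitrary, and the temperature values are taken from the example slide further below.

```python
# Unsupervised discretization: equal-interval vs. equal-frequency binning.
import numpy as np

def equal_width_bins(values, k):
    edges = np.linspace(values.min(), values.max(), k + 1)
    return np.digitize(values, edges[1:-1])   # bin indices 0..k-1, equal-width intervals

def equal_frequency_bins(values, k):
    edges = np.quantile(values, np.linspace(0, 1, k + 1))
    return np.digitize(values, edges[1:-1])   # roughly len(values)/k values per bin

temperature = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])
print(equal_width_bins(temperature, 3))
print(equal_frequency_bins(temperature, 3))
```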

Discretization: supervised
- Entropy-based method:
  - Build a decision tree with pre-pruning on the attribute being discretized
  - Use entropy as the splitting criterion
  - Use the minimum description length (MDL) principle as the stopping criterion
- Works well: the state of the art
- To apply the MDL principle:
  - The "theory" is the splitting point (log2(N – 1) bits) plus the class distribution in each subset
  - Compare the description lengths before and after adding the splitting point

Example: temperature attribute

  Temperature  64   65   68   69   70   71   72   72   75   75   80   81   83   85
  Play         Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No

Formula for MDLP
- N instances
- Original set: k classes, entropy E
- First subset: k1 classes, entropy E1
- Second subset: k2 classes, entropy E2
- A candidate split is accepted only if

  gain > log2(N – 1) / N + [log2(3^k – 2) – k·E + k1·E1 + k2·E2] / N

- This criterion results in no discretization intervals for the temperature attribute
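Putting the quantities on this slide together, the MDL acceptance test can be written directly in code. This is only a sketch: it assumes class labels encoded as small non-negative integers, with `left` and `right` being the labels of the instances below and above a candidate cut point. As the slide notes, for the 14-instance temperature example no candidate cut point passes this test, so the attribute is left as a single interval.

```python
# MDL stopping criterion for entropy-based discretization: a candidate cut
# point is accepted only if its information gain exceeds the cost of
# encoding the cut point plus the per-subset class distributions.
import numpy as np

def entropy(labels):
    counts = np.bincount(labels)
    p = counts[counts > 0] / len(labels)
    return float(-(p * np.log2(p)).sum())

def accept_cut(labels, left, right):
    N = len(labels)
    k, k1, k2 = (len(np.unique(x)) for x in (labels, left, right))
    E, E1, E2 = entropy(labels), entropy(left), entropy(right)
    gain = E - (len(left) / N) * E1 - (len(right) / N) * E2
    threshold = (np.log2(N - 1) / N
                 + (np.log2(3 ** k - 2) - k * E + k1 * E1 + k2 * E2) / N)
    return gain > threshold
```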

Supervised discretization: other methods
- Replace the top-down procedure by a bottom-up one
- Replace the MDL criterion by a chi-squared test
- Use dynamic programming to find the optimum k-way split for a given additive criterion
  - With the entropy criterion this requires quadratic time (in the number of instances)
  - With the error rate it can be done in linear time

Error-based vs. entropy-based
- Question: could the best discretization ever have two adjacent intervals with the same class?
- Wrong answer: "No", because if it did, we could:
  - collapse the two intervals
  - free up an interval
  - use it somewhere else
  (this is what error-based discretization will do)
- Right answer: surprisingly, yes
  (and entropy-based discretization can do it)

Error-based vs. entropy-based
- A 2-class, 2-attribute problem
- [Figure: the true model and the best discretization of attributes a1 and a2]
- Entropy-based discretization can detect the change of class distribution

The converse of discretization
- Make nominal values into "numeric" ones
- Indicator attributes (used by IB1)
  - Make no use of potential ordering information
- Code an ordered nominal attribute into binary ones (used by M5')
  - Can be used for any ordered attribute
  - Better than coding the ordering into an integer (which implies a metric)
- In general: code a subset of attributes as binary
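Two simple versions of these codings in Python; the attribute values and their ordering are illustrative, and the second function implements one common form of the ordered binary ("thermometer") coding, which may differ in detail from the coding M5' actually uses.

```python
# Turning a nominal attribute into numeric ones.
import numpy as np

def indicator_coding(values, categories):
    """One 0/1 indicator attribute per category; ignores any ordering."""
    return np.array([[int(v == c) for c in categories] for v in values])

def ordered_binary_coding(values, ordered_categories):
    """k-1 cumulative 0/1 attributes for an ordered nominal attribute;
    avoids implying a numeric metric, unlike coding the order as an integer."""
    rank = {c: i for i, c in enumerate(ordered_categories)}
    k = len(ordered_categories)
    return np.array([[int(rank[v] > j) for j in range(k - 1)] for v in values])

temps = ["cool", "mild", "hot", "mild"]
print(indicator_coding(temps, ["cool", "mild", "hot"]))       # 3 indicator columns
print(ordered_binary_coding(temps, ["cool", "mild", "hot"]))  # 2 cumulative columns
```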

Automatic data cleansing
- To improve a decision tree: remove misclassified instances, then re-learn!
- Better (of course!): have a human expert check the misclassified instances
- Attribute noise vs. class noise:
  - Attribute noise should be left in the training set (don't train on a clean set and test on a dirty one)
  - Systematic class noise (e.g. one class consistently substituted for another): leave in the training set
  - Unsystematic class noise: eliminate from the training set, if possible

Robust regression
- A "robust" statistical method is one that addresses the problem of outliers
- To make regression more robust:
  - Minimize absolute error, not squared error
  - Remove outliers (e.g. the 10% of points farthest from the regression plane)
  - Minimize the median instead of the mean of the squares (copes with outliers in both the x and y directions)
    - Finds the narrowest strip covering half the observations
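For a small one-dimensional data set such as the Belgian phone-call example below, the least-median-of-squares criterion can be sketched by brute force: try the line through every pair of points and keep the one with the smallest median squared residual. This only illustrates the criterion (practical LMS algorithms are far more efficient); comparing the result with an ordinary least-squares fit such as `np.polyfit(x, y, 1)` shows how strongly outliers pull the latter.

```python
# Least median of squares, by brute force over lines through pairs of points.
import numpy as np
from itertools import combinations

def lms_line(x, y):
    best, best_median = None, np.inf
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue                          # vertical line: skip
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        residuals = y - (slope * x + intercept)
        med = np.median(residuals ** 2)       # median, not mean, of squared errors
        if med < best_median:
            best, best_median = (slope, intercept), med
    return best
```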

Example: least median of squares
[Figure: number of international phone calls from Belgium, 1950–1973]

Detecting anomalies
- Visualization can help to detect anomalies
- Automatic approach: use a committee of different learning schemes, e.g.
  - a decision tree
  - a nearest-neighbor learner
  - a linear discriminant function
- Conservative approach: delete only instances incorrectly classified by all of them
- Problem: this might sacrifice instances of small classes
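A sketch of the conservative committee approach using scikit-learn, with cross-validated predictions so that no model judges its own training instances. The committee members are stand-ins (logistic regression substitutes for a linear discriminant function), and the 5-fold setting is arbitrary.

```python
# Committee-based filtering: drop an instance only if every scheme misclassifies it.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def filter_suspects(X, y, cv=5):
    committee = [DecisionTreeClassifier(),
                 KNeighborsClassifier(n_neighbors=1),
                 LogisticRegression(max_iter=1000)]   # stand-in for a linear discriminant
    wrong_by_all = np.ones(len(y), dtype=bool)
    for clf in committee:
        preds = cross_val_predict(clf, X, y, cv=cv)
        wrong_by_all &= (preds != y)          # conservative: must be wrong for every member
    keep = ~wrong_by_all
    return X[keep], y[keep]
```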

“Meta” learning schemes
- Basic idea: build different "experts" and let them vote
- Advantage: often improves predictive performance
- Disadvantage: produces output that is very hard to analyze
- Schemes:
  - Bagging
  - Boosting
  - Stacking
  - Error-correcting output codes
- These apply to both classification and numeric prediction

Bagging
- Combines predictions by voting/averaging
  - Simplest way! Each model receives equal weight
- "Idealized" version:
  - Sample several training sets of size n (instead of just having one training set of size n)
  - Build a classifier for each training set
  - Combine the classifiers' predictions
- If the learning scheme is unstable, this almost always improves performance
  - A small change in the training data can make a big change in the model (e.g. decision trees)

Bias-variance decomposition
- Used to analyze how much any specific training set affects performance
- Assume infinitely many classifiers, built from different training sets of size n
- For any learning scheme:
  - Bias = expected error of the combined classifier on new data
  - Variance = expected error due to the particular training set used
- Total expected error: bias + variance

More on bagging
- Bagging works because it reduces variance by voting/averaging
  - In some pathological situations the overall error might increase
  - Usually, the more classifiers the better
- Problem: we only have one dataset!
- Solution: generate new training sets of size n by sampling from it with replacement
- Can help a lot if the data is noisy

Bagging classifiers

Model generation:
  Let n be the number of instances in the training data.
  For each of t iterations:
    Sample n instances from the training set (with replacement).
    Apply the learning algorithm to the sample.
    Store the resulting model.

Classification:
  For each of the t models:
    Predict the class of the instance using the model.
  Return the class that is predicted most often.
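The procedure translates almost line for line into Python. This sketch assumes a numeric feature matrix `X` and integer class labels `y`, and uses a scikit-learn decision tree as the unstable base learner; scikit-learn also ships a ready-made `BaggingClassifier`.

```python
# Bagging: t bootstrap samples, one model per sample, majority vote to classify.
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

class SimpleBagger:
    def __init__(self, base=None, t=10, seed=0):
        self.base = base if base is not None else DecisionTreeClassifier()
        self.t = t
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        n = len(y)
        self.models_ = []
        for _ in range(self.t):
            idx = self.rng.integers(0, n, size=n)   # sample n instances with replacement
            self.models_.append(clone(self.base).fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        votes = np.array([m.predict(X) for m in self.models_])  # shape (t, n_instances)
        # return the class predicted most often for each instance
        return np.array([np.bincount(col).argmax() for col in votes.T])
```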

Boosting
- Also uses voting/averaging
- Weights models according to their performance
- Iterative: new models are influenced by the performance of previously built ones
  - Encourages the new model to become an "expert" for instances misclassified by earlier models
  - Intuitive justification: models should be experts that complement each other
- Several variants exist

AdaBoost.M1

Model generation:
  Assign equal weight to each training instance.
  For t iterations:
    Apply the learning algorithm to the weighted dataset; store the resulting model.
    Compute the model's error e on the weighted dataset.
    If e = 0 or e ≥ 0.5: terminate model generation.
    For each instance in the dataset:
      If classified correctly by the model: multiply the instance's weight by e / (1 – e).
    Normalize the weights of all instances.

Classification:
  Assign weight = 0 to all classes.
  For each of the t (or fewer) models:
    Add –log(e / (1 – e)) to the weight of the class this model predicts.
  Return the class with the highest weight.
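A direct Python transcription of this pseudocode, sketched with a depth-1 decision tree (a decision stump) as the weak learner via scikit-learn's `sample_weight` support. Integer class labels 0..c-1 are assumed, and the weak learner and iteration count are illustrative choices; scikit-learn's `AdaBoostClassifier` is the ready-made version.

```python
# AdaBoost.M1: reweight instances each round, weight models by -log(e/(1-e)).
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, t=10, base=None):
    base = base if base is not None else DecisionTreeClassifier(max_depth=1)
    n = len(y)
    w = np.full(n, 1.0 / n)                      # equal weight to each training instance
    models, alphas = [], []
    for _ in range(t):
        m = clone(base).fit(X, y, sample_weight=w)
        wrong = m.predict(X) != y
        e = w[wrong].sum() / w.sum()             # model's error on the weighted dataset
        if e == 0 or e >= 0.5:                   # terminate model generation
            break
        w[~wrong] *= e / (1 - e)                 # shrink weights of correctly classified instances
        w /= w.sum()                             # normalize
        models.append(m)
        alphas.append(np.log((1 - e) / e))       # = -log(e / (1 - e))
    return models, alphas

def adaboost_predict(models, alphas, X, n_classes):
    class_weights = np.zeros((len(X), n_classes))
    for m, a in zip(models, alphas):
        class_weights[np.arange(len(X)), m.predict(X)] += a  # add a to the predicted class
    return class_weights.argmax(axis=1)          # class with the highest weight
```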

More on boosting
- Boosting needs weights … but boosting can also be applied without weights:
  - Resample with probability determined by the weights
  - Disadvantage: not all instances are used
  - Advantage: if error > 0.5, can resample again
- Boosting stems from computational learning theory
- Theoretical result: training error decreases exponentially
- Also works if the base classifiers are not too complex and their error doesn't become too large too quickly

More on boosting
- Continue boosting after the training error reaches 0?
- Puzzling fact: the generalization error continues to decrease!
  - Seems to contradict Occam's Razor
- Explanation: consider the margin (confidence), not the error
  - Margin: difference between the estimated probability for the true class and that for the nearest other class (between –1 and 1)
- Boosting works with weak learners; the only condition is that the error doesn't exceed 0.5
- LogitBoost: a more sophisticated boosting scheme

Stacking
- To combine the predictions of base learners, don't vote; use a meta learner
  - Base learners: level-0 models
  - Meta learner: level-1 model
  - Predictions of the base learners are input to the meta learner
- Base learners are usually different learning schemes
- Can't use predictions on the training data to generate data for the level-1 model!
  - Instead use a cross-validation-like scheme
- Hard to analyze theoretically: "black magic"
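A minimal sketch of stacking with scikit-learn: cross-validated predictions of the level-0 models form the training data for the level-1 model, and the level-0 models are then refit on all the data. The particular base learners, the logistic-regression meta learner and the 5-fold setting are illustrative; class probabilities could be used instead by asking `cross_val_predict` for `predict_proba` output, and scikit-learn also provides a built-in `StackingClassifier`.

```python
# Stacking: level-1 model trained on cross-validated level-0 predictions.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def fit_stacker(X, y, cv=5):
    level0 = [DecisionTreeClassifier(), KNeighborsClassifier(), GaussianNB()]
    # one column of held-out predictions per base learner
    meta_X = np.column_stack([cross_val_predict(m, X, y, cv=cv) for m in level0])
    level1 = LogisticRegression(max_iter=1000).fit(meta_X, y)  # relatively global, smooth model
    level0 = [m.fit(X, y) for m in level0]                     # refit base learners on all the data
    return level0, level1

def predict_stacker(level0, level1, X):
    meta_X = np.column_stack([m.predict(X) for m in level0])
    return level1.predict(meta_X)
```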

More on stacking
- If the base learners can output class probabilities, use those as input to the meta learner instead
- Which algorithm to use for the meta learner?
  - In principle, any learning scheme
  - Prefer a "relatively global, smooth" model
    - The base learners do most of the work
    - This reduces the risk of overfitting
- Stacking can be applied to numeric prediction too

Error-correcting output codes
- Multiclass problem ⇒ binary problems
- Simple scheme: one-per-class coding

  class   class vector
  a       1000
  b       0100
  c       0010
  d       0001

- Idea: use error-correcting codes instead
  - Base classifiers predict 1011111; true class = ??
- Use code words that have a large Hamming distance between any pair
  - Can correct up to (d – 1)/2 single-bit errors

  class   class vector
  a       1111111
  b       0000111
  c       0011001
  d       0101010
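The 7-bit code from the table translates into a small encode/decode sketch: one binary classifier is trained per code-word bit, and a prediction is decoded to the class whose code word is nearest in Hamming distance. Logistic regression as the base classifier and integer labels 0..3 for classes a..d are assumptions made for the example; decoding the slide's bit string 1011111 against these code words gives class a (distance 1).

```python
# ECOC: one binary classifier per bit, decode by minimum Hamming distance.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

CODE = np.array([[1, 1, 1, 1, 1, 1, 1],    # class a
                 [0, 0, 0, 0, 1, 1, 1],    # class b
                 [0, 0, 1, 1, 0, 0, 1],    # class c
                 [0, 1, 0, 1, 0, 1, 0]])   # class d

def fit_ecoc(X, y, base=None):
    base = base if base is not None else LogisticRegression(max_iter=1000)
    # bit j defines a binary problem: classes whose j-th code bit is 1 vs. the rest
    return [clone(base).fit(X, CODE[y, j]) for j in range(CODE.shape[1])]

def predict_ecoc(models, X):
    bits = np.column_stack([m.predict(X) for m in models])        # (n_instances, n_bits)
    hamming = (bits[:, None, :] != CODE[None, :, :]).sum(axis=2)  # distance to each code word
    return hamming.argmin(axis=1)                                 # index of the nearest class
```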

More on ECOCs
- Two criteria:
  - Row separation: minimum distance between rows
  - Column separation: minimum distance between columns (and the columns' complements)
  - Why? Because if columns are identical, the base classifiers will likely make the same errors
  - Error correction is weakened if errors are correlated
- 3 classes ⇒ only 2³ = 8 possible columns (and 4 out of the 8 are complements)
  - Cannot achieve both row and column separation
  - ECOCs only work for problems with more than 3 classes

Exhaustive ECOCs
- Exhaustive code for k classes:
  - Columns comprise every possible k-bit string …
  - … except for complements and the all-zero/all-one strings
  - Each code word contains 2^(k–1) – 1 bits
- Class 1: code word is all ones
- Class 2: 2^(k–2) zeroes followed by 2^(k–2) – 1 ones
- Class i: alternating runs of 2^(k–i) 0s and 1s
  - The last run is one short

  Exhaustive code, k = 4
  class   class vector
  a       1111111
  b       0000111
  c       0011001
  d       0101010
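The run-length description above is equivalent to taking, as columns, every k-bit string that starts with a 1 except the all-ones string. A short sketch that generates the code this way reproduces the k = 4 table.

```python
# Generate the exhaustive error-correcting code for k classes.
import numpy as np

def exhaustive_code(k):
    columns = range(2 ** (k - 1), 2 ** k - 1)      # k-bit values 10...0 through 11...10
    return np.array([[(c >> (k - 1 - i)) & 1 for c in columns]
                     for i in range(k)])           # one row (code word) per class

print(exhaustive_code(4))
# rows: 1111111, 0000111, 0011001, 0101010 -- the code words in the table above
```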

More on ECOCs
- More classes ⇒ exhaustive codes become infeasible
  - The number of columns increases exponentially
- Random code words have good error-correcting properties on average!
- There are sophisticated methods for generating ECOCs with just a few columns
- ECOCs don't work with the nearest-neighbor (NN) classifier
  - But they do work if different attribute subsets are used to predict each output bit