Preventing Overfitting
Problem: we don't want these algorithms to fit to "noise". The generated tree may overfit the training data:
–Too many branches, some of which may reflect anomalies due to noise or outliers
–The result is poor accuracy on unseen samples

Avoid Overfitting in Classification
Two approaches to avoid overfitting:
–Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
–Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the "best pruned tree".

Reduced-error pruning: break the samples into a training set and a test set. The tree is induced completely on the training set. Working backwards from the bottom of the tree, the subtree starting at each nonterminal node is examined. If the error rate on the test cases improves by pruning it, the subtree is removed. The process continues until no improvement can be made by pruning a subtree. The error rate of the final tree on the test cases is used as an estimate of the true error rate.
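In code, this procedure can be sketched as follows. This is a minimal sketch under assumptions not in the slides: a tree node is either a class label (a leaf) or a nested dict with illustrative field names ("attr", "branches", "majority"); the helper functions are likewise hypothetical.

def classify(tree, example):
    # Follow branches until a leaf label is reached; fall back to the
    # node's majority class for unseen attribute values.
    while isinstance(tree, dict):
        tree = tree["branches"].get(example.get(tree["attr"]), tree["majority"])
    return tree

def errors(tree, samples):
    # Count misclassified (example, label) pairs.
    return sum(1 for x, y in samples if classify(tree, x) != y)

def reduced_error_prune(tree, test_set):
    # Work backwards from the bottom: prune the children first, then replace
    # this subtree by its majority-class leaf if that does not increase the
    # error on the test cases that reach this node.
    if not isinstance(tree, dict):
        return tree
    for value, subtree in list(tree["branches"].items()):
        subset = [(x, y) for x, y in test_set if x.get(tree["attr"]) == value]
        tree["branches"][value] = reduced_error_prune(subtree, subset)
    leaf = tree["majority"]
    if errors(leaf, test_set) <= errors(tree, test_set):
        return leaf
    return tree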

Decision Tree Pruning:

physician fee freeze = n:
|   adoption of the budget resolution = y: democrat (151.0)
|   adoption of the budget resolution = u: democrat (1.0)
|   adoption of the budget resolution = n:
|   |   education spending = n: democrat (6.0)
|   |   education spending = y: democrat (9.0)
|   |   education spending = u: republican (1.0)
physician fee freeze = y:
|   synfuels corporation cutback = n: republican (97.0/3.0)
|   synfuels corporation cutback = u: republican (4.0)
|   synfuels corporation cutback = y:
|   |   duty free exports = y: democrat (2.0)
|   |   duty free exports = u: republican (1.0)
|   |   duty free exports = n:
|   |   |   education spending = n: democrat (5.0/2.0)
|   |   |   education spending = y: republican (13.0/2.0)
|   |   |   education spending = u: democrat (1.0)
physician fee freeze = u:
|   water project cost sharing = n: democrat (0.0)
|   water project cost sharing = y: democrat (4.0)
|   water project cost sharing = u:
|   |   mx missile = n: republican (0.0)
|   |   mx missile = y: democrat (3.0/1.0)
|   |   mx missile = u: republican (2.0)

Simplified Decision Tree:

physician fee freeze = n: democrat (168.0/2.6)
physician fee freeze = y: republican (123.0/13.9)
physician fee freeze = u:
|   mx missile = n: democrat (3.0/1.1)
|   mx missile = y: democrat (4.0/2.2)
|   mx missile = u: republican (2.0/1.0)

Evaluation on training data (300 items):

        Before Pruning        After Pruning
        Size   Errors         Size   Errors       Estimate
         25    8 ( 2.7%)        7    13 ( 4.3%)   ( 6.9%)   <<

Evaluation of Classification Systems
[Figure: actual vs. predicted classes, illustrating true positives, false positives, and false negatives]
Training Set: examples with class values, used for learning.
Test Set: examples with class values, used for evaluating.
Evaluation: hypotheses are used to infer the classification of examples in the test set; the inferred classification is compared to the known classification.
Accuracy: percentage of examples in the test set that are classified correctly.
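As a minimal sketch of this evaluation step (the classifier and data below are placeholders, not from the slides):

def accuracy(classifier, test_set):
    # test_set is a list of (example, known_class) pairs.
    correct = sum(1 for example, known in test_set if classifier(example) == known)
    return correct / len(test_set)

# Example: a trivial classifier that always predicts '+'.
test_set = [({"f": 1}, '+'), ({"f": 2}, '-'), ({"f": 3}, '+')]
print(accuracy(lambda example: '+', test_set))   # 2/3, about 0.67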

Model Evaluation
Analytic goal: achieve understanding
–Exploratory evaluation: understand a novel area of study
–Experimental evaluation: support or refute some models
Engineering goal: solve a practical problem
Estimator of classifiers: accuracy
–Accuracy: how well does a model classify?
–Higher accuracy does not necessarily imply better performance on the target task

Confusion Metrics
Entries are counts of correct classifications and counts of errors:

                        Actual class
                          +            -
  Predicted   Y      A (True +)    B (False +)
  class       N      C (False -)   D (True -)

Other evaluation metrics:
–True positive rate (TP) = A/(A+C) = 1 - false negative rate
–False positive rate (FP) = B/(B+D) = 1 - true negative rate
–Sensitivity = true positive rate
–Specificity = true negative rate
–Positive predictive value (PPV) = A/(A+B)
–Recall = A/(A+C) = true positive rate = sensitivity
–Precision = A/(A+B) = PPV
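A small sketch (the helper name and the counts are assumptions for illustration) that derives these metrics from the four confusion-matrix entries A, B, C, D:

def confusion_metrics(A, B, C, D):
    # A = true positives, B = false positives, C = false negatives, D = true negatives.
    return {
        "TP rate (sensitivity, recall)": A / (A + C),
        "FP rate (1 - specificity)":     B / (B + D),
        "precision (PPV)":               A / (A + B),
        "accuracy":                      (A + D) / (A + B + C + D),
    }

print(confusion_metrics(A=40, B=10, C=5, D=45))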

Probabilistic Interpretation of the CM
–Class distribution: the prior probabilities P(+) and P(-), approximated by class frequencies; defined for a particular training set.
–Confusion matrix: the likelihoods P(Y|+) and P(Y|-), approximated using error frequencies; defined for a particular classifier.
–Posterior probabilities: P(+|Y) and P(-|N).
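A brief sketch of how the posterior P(+|Y) follows from the priors and likelihoods via Bayes' rule (the numbers below are illustrative, not from the slides):

def posterior_pos_given_yes(p_pos, p_yes_given_pos, p_yes_given_neg):
    p_neg = 1.0 - p_pos
    evidence = p_yes_given_pos * p_pos + p_yes_given_neg * p_neg   # P(Y)
    return p_yes_given_pos * p_pos / evidence

print(posterior_pos_given_yes(p_pos=0.1, p_yes_given_pos=0.8, p_yes_given_neg=0.2))
# about 0.31: even a fairly accurate classifier yields a modest posterior when positives are rare.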

More Than Accuracy
Costs and benefits:
–Medical diagnosis: the cost of falsely indicating "cancer" is different from the cost of missing a true cancer case
–Fraud detection: the cost of falsely challenging a customer is different from the cost of leaving fraud undetected
–Customer segmentation: the benefit of not contacting a non-buyer is different from the benefit of contacting a buyer

Model Evaluation within Context
Must take costs and class distributions into account. Calculate the expected profit:

  profit = P(+) * [TP * B(Y,+) + (1 - TP) * C(N,+)] + P(-) * [(1 - FP) * B(N,-) + FP * C(Y,-)]

where B(·,·) are the benefits of correct classification and C(·,·) are the costs of incorrect classification. Choose the classifier that maximises the expected profit.
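The formula translates directly into code; all benefit/cost values and rates in this sketch are made-up numbers for illustration:

def expected_profit(p_pos, tp_rate, fp_rate, b_y_pos, c_n_pos, b_n_neg, c_y_neg):
    p_neg = 1.0 - p_pos
    return (p_pos * (tp_rate * b_y_pos + (1.0 - tp_rate) * c_n_pos)
            + p_neg * ((1.0 - fp_rate) * b_n_neg + fp_rate * c_y_neg))

# Costs are entered as negative values, e.g. missing a positive costs 50.
print(expected_profit(p_pos=0.1, tp_rate=0.8, fp_rate=0.2,
                      b_y_pos=100.0, c_n_pos=-50.0, b_n_neg=0.0, c_y_neg=-5.0))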

Lift & Cumulative Response Curves
Lift = P(+ | Y) / P(+): how much better the targeting is with the model than without it.
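A sketch of the lift of a scoring model when the top fraction of scored examples is targeted; the scores and labels below are made up for illustration:

def lift_at(scores, labels, fraction):
    n = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    base_rate = sum(labels) / len(labels)                 # P(+)
    targeted_rate = sum(y for _, y in ranked[:n]) / n     # P(+ | Y)
    return targeted_rate / base_rate

scores = [0.9, 0.8, 0.75, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,    1,   0,   0,   1,   0]
print(lift_at(scores, labels, fraction=0.25))   # lift of 2.0 in the top quarter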

Parametric Models: parametrically summarise the data.

Contributory Models: retain the training data points; each potentially affects the estimation at a new point.

Neural Networks
Advantages:
–prediction accuracy is generally high
–robust: works when training examples contain errors
–output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
–fast evaluation of the learned target function
Criticism:
–long training time
–difficult to understand the learned function (weights)
–not easy to incorporate domain knowledge

A Neuron
The n-dimensional input vector x is mapped to the output y by means of the scalar product and a nonlinear function mapping.
[Figure: inputs x0, x1, …, xn multiplied by the weight vector w (w0, w1, …, wn), combined in a weighted sum with a bias term, then passed through the activation function f to produce the output y]

Network Training
The ultimate objective of training:
–obtain a set of weights that makes almost all the tuples in the training data classified correctly
Steps:
–Initialize the weights with random values
–Feed the input tuples into the network one by one
–For each unit: compute the net input to the unit as a linear combination of all the inputs to the unit; compute the output value using the activation function; compute the error; update the weights and the bias
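A minimal single-unit (sigmoid neuron) version of these steps; the learning rate, epoch count, and toy OR data are illustrative assumptions, not from the slides:

import math
import random

def train_neuron(data, n_inputs, lr=0.5, epochs=1000):
    w = [random.uniform(-0.5, 0.5) for _ in range(n_inputs)]   # initialize weights
    b = random.uniform(-0.5, 0.5)                              # and the bias
    for _ in range(epochs):
        for x, target in data:                                 # feed tuples one by one
            net = sum(wi * xi for wi, xi in zip(w, x)) + b     # net input (linear combination)
            out = 1.0 / (1.0 + math.exp(-net))                 # output via activation function
            err = (target - out) * out * (1.0 - out)           # error term for a sigmoid unit
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]   # update the weights
            b += lr * err                                      # update the bias
    return w, b

# Toy example: learn the OR function.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_neuron(data, n_inputs=2)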

Multi-Layer Perceptron
[Figure: input vector x_i feeding the input nodes, hidden nodes connected to them by weights w_ij, and output nodes producing the output vector]
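A sketch of a forward pass through this layered structure with one hidden layer; the weights and biases are arbitrary placeholder numbers:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    # weights[j][i] connects input i to unit j (the w_ij of the figure).
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

x = [0.5, -1.0, 0.25]                                    # input vector x_i
hidden = layer(x, weights=[[0.2, -0.4, 0.1],
                           [0.7,  0.3, -0.6]], biases=[0.0, 0.1])
output = layer(hidden, weights=[[1.0, -1.0]], biases=[0.2])
print(output)                                            # output vector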

Instance-Based Methods
Instance-based learning:
–Store the training examples and delay the processing ("lazy evaluation") until a new instance must be classified
Typical approaches:
–k-nearest neighbor: instances represented as points in a Euclidean space
–Locally weighted regression: constructs a local approximation
–Case-based reasoning: uses symbolic representations and knowledge-based inference

The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-dimensional space. The nearest neighbors are defined in terms of Euclidean distance. The target function can be discrete- or real-valued. For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to x_q.
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
[Figure: query point x_q among '+' and '-' training examples]
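A minimal k-NN sketch along these lines (Euclidean distance, majority vote among the k nearest training examples); the data points are illustrative:

import math
from collections import Counter

def knn_classify(x_q, training_data, k=3):
    neighbors = sorted(training_data, key=lambda ex: math.dist(x_q, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training_data = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'), ((0.9, 1.3), '+'),
                 ((3.0, 3.5), '-'), ((3.2, 3.0), '-')]
print(knn_classify((1.1, 1.0), training_data, k=3))   # -> '+'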

Discussion on the k-NN Algorithm
The k-NN algorithm for continuous-valued target functions:
–calculate the mean value of the k nearest neighbors
Distance-weighted nearest neighbor algorithm:
–weight the contribution of each of the k neighbors according to its distance to the query point x_q, giving greater weight to closer neighbors
–similarly for real-valued target functions
Robust to noisy data by averaging the k nearest neighbors.
Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes.
–To overcome it, stretch the axes or eliminate the least relevant attributes.
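A sketch of the distance-weighted variant for a real-valued target: each of the k nearest neighbors contributes with weight 1/d^2, one common choice consistent with "greater weight to closer neighbors"; the data are illustrative:

import math

def knn_regress(x_q, training_data, k=3):
    neighbors = sorted(training_data, key=lambda ex: math.dist(x_q, ex[0]))[:k]
    num = den = 0.0
    for x, y in neighbors:
        d = math.dist(x_q, x)
        if d == 0.0:
            return y                   # exact match: return its target value
        w = 1.0 / (d * d)
        num += w * y
        den += w
    return num / den

data = [((1.0,), 2.0), ((2.0,), 3.9), ((3.0,), 6.1), ((4.0,), 8.0)]
print(knn_regress((2.5,), data, k=3))   # weighted average, about 4.84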