Business Systems Intelligence: 5. Classification 2

Business Systems Intelligence: 5. Classification 2 Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee)

Acknowledgments These notes are based (heavily) on those provided by the authors to accompany “Data Mining: Concepts & Techniques” by Jiawei Han and Micheline Kamber. Some slides are also based on trainer’s kits provided by SAS. More information about the book is available at: www-sal.cs.uiuc.edu/~hanj/bk2/ and information on SAS is available at: www.sas.com

Classification & Prediction Today we will look at: What are classification & prediction? Issues regarding classification and prediction Classification techniques: Case-based reasoning (k-nearest neighbour algorithm) Decision tree induction Bayesian classification Neural networks Support vector machines (SVM) Classification based on association rule mining concepts Other classification methods Prediction Classification accuracy

Classification Classification predicts categorical class labels. Typical applications: {CreditHistory, Salary} -> CreditApproval (Yes/No); {Temp, Humidity} -> Rain (Yes/No). Mathematically, a classifier is a learned function y = f(x) that maps a tuple of attribute values x to one of a fixed set of class labels y.

Linear Classification A binary classification problem: the data above the red line belongs to class ‘x’ and the data below the red line belongs to class ‘o’. Examples of linear classifiers: SVM, perceptron, probabilistic classifiers.

Discriminative Classifiers Advantages Prediction accuracy is generally high Robust, works when training examples contain errors Fast evaluation of the learned target function Criticism Long training time Difficult to understand the learned function (weights) Not easy to incorporate domain knowledge

Artificial Neural Networks A biologically inspired classification technique Formed from interconnected layers of simple artificial neurons ANN history: 1943: McCulloch & Pitts (electronic neuron model) 1959: Rosenblatt (Perceptron) 1959: Widrow & Hoff (ADALINE and MADALINE) 1969: Minsky & Papert (Perceptrons) 1974: Werbos (Backprop) 1982: John Hopfield In 1959, Bernard Widrow and Marcian Hoff of Stanford developed models they called ADALINE and MADALINE. MADALINE was the first neural network to be applied to a real-world problem: an adaptive filter that eliminates echoes on phone lines, still in commercial use. In 1982 several events caused a renewed interest. John Hopfield of Caltech presented a paper to the National Academy of Sciences; his approach was not simply to model brains but to create useful devices. With clarity and mathematical analysis, he showed how such networks could work and what they could do. Yet Hopfield's biggest asset was his charisma: he was articulate, likeable, and a champion of a dormant technology.

An Artificial Neuron The n-dimensional input vector x is mapped to the output y by means of a scalar product with the weights and a nonlinear mapping f: y = f(w0·x0 + w1·x1 + … + wn·xn + bias)
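
As a hedged illustration (not part of the original slides), here is a minimal Python sketch of that computation, assuming a sigmoid activation; the weight, input and bias values are invented:

```python
# Minimal sketch of a single artificial neuron with a sigmoid activation.
import math

def neuron_output(x, w, bias):
    """Weighted sum of the inputs plus bias, squashed by a sigmoid."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1.0 / (1.0 + math.exp(-net))   # maps the net input into (0, 1)

# Example with a 3-input neuron (illustrative values only)
print(neuron_output(x=[0.5, -1.0, 2.0], w=[0.4, 0.3, -0.2], bias=0.1))
```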

ANN: Multi-Layer Perceptrons (MLPs) Multi-Layer Perceptrons (MLPs) are one of the best known ANN types Composed of layers of fully interconnected artificial neurons Training involves repeatedly presenting a series of training cases to the network and adjusting neurons’ weights and biases to minimise classification error Typically the backpropagation of error algorithm is used for training

MLP Example Remember our surfing example: an MLP can be built and trained to perform classification for this problem. (Figure: an input layer with Wind Speed, Wind Direction, Temperature, Wave Size and Wave Period, a hidden layer, and an output layer predicting Good Surf.)
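
A hedged sketch (not from the slides) of how such an MLP might be built with scikit-learn; the feature values and labels below are invented purely for illustration:

```python
# Tiny MLP for the surfing example using scikit-learn (illustrative data).
from sklearn.neural_network import MLPClassifier

# Columns: wind speed, wind direction (degrees), temperature, wave size, wave period
X = [[10, 180, 15, 2.0, 12],
     [25,  90, 12, 0.5,  6],
     [ 8, 200, 18, 2.5, 14],
     [30,  60, 10, 0.8,  5]]
y = ["good", "bad", "good", "bad"]          # the Good Surf output

mlp = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.predict([[12, 170, 16, 1.8, 11]]))
```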

Network Training The ultimate objective of training is to obtain a set of weights that classifies almost all of the tuples in the training data correctly. Steps: Initialize the weights with random values Feed the input tuples into the network one by one For each unit: compute the net input to the unit as a linear combination of all the inputs to the unit, compute the output value using the activation function, compute the error, and update the weights and the bias
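
As a hedged, simplified sketch of those steps (not the slides' own code), here is a delta-rule training loop for a single sigmoid unit on a toy problem:

```python
# Illustrative training loop for one sigmoid unit using the delta rule.
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=1000, lr=0.5):
    n = len(data[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n)]      # initialize weights randomly
    b = random.uniform(-0.5, 0.5)                           # ... and the bias
    for _ in range(epochs):
        for x, target in data:                              # feed tuples one by one
            net = sum(wi * xi for wi, xi in zip(w, x)) + b  # net input (linear combination)
            out = sigmoid(net)                              # output via activation function
            delta = (target - out) * out * (1 - out)        # error times sigmoid gradient
            w = [wi + lr * delta * xi for wi, xi in zip(w, x)]   # update weights
            b += lr * delta                                 # update bias
    return w, b

# Toy example: learn the logical AND function
print(train([([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]))
```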

Summary of ANN Classification Strengths Fast classification Very good generalization capacity Weaknesses No explanation capability – black box Training can be slow – eager learning Retraining is difficult Lots of other network types, but MLP is probably the most common

Support Vector Machines (SVM) In classification problems we try to create decision boundaries between classes. A choice must be made between the possible boundaries. (Figure: several candidate boundaries separating Class 1 from Class 2.)

SVMs (cont…) The decision boundary should be as far away from the data of both classes as possible. (Figure: the boundary between Class 1 and Class 2 with margin m.)

Margins (Figure: boundaries with a large margin and a small margin; the training points lying on the margin are the support vectors.)

Linear Support Vector Machine Given a set of points xi with labels yi ∈ {-1, +1}, the SVM finds a separating hyperplane w·x + b = 0 defined by the pair (w, b), where w is the normal to the plane and b determines its distance from the origin. Here x is the feature vector, b the bias, y the class label, and the margin width is 2/||w||.
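
As a hedged reconstruction of the formula the slide alludes to (the original equation did not survive extraction), the standard hard-margin primal formulation is:

```latex
% Standard hard-margin linear SVM primal problem
\min_{\mathbf{w},\,b} \; \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,N
```

Maximising the margin 2/||w|| between the supporting hyperplanes w·x + b = ±1 is equivalent to minimising ||w||.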

SVMs: The Clever Bit! What about when classes are not linearly separable? Kernel functions and the kernel trick are used to transform the data into a different, linearly separable feature space. (Figure: a mapping f(·) takes points from the input space to the feature space.)

SVMs: The Clever Bit! (cont...) What if the data is not linearly separable? Project the data to a high-dimensional space where it is linearly separable and then use a linear SVM (using kernels). (Figure: a toy example with points at (0,0), (1,0) and (0,1) labelled +1 and -1 that becomes separable after the mapping.)
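
A hedged sketch (not from the slides) of the idea in code: an RBF-kernel SVM separating the XOR pattern, which no single straight line can separate:

```python
# Kernel SVM on data that is not linearly separable (the XOR pattern).
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]                              # XOR labels

clf = SVC(kernel="rbf", gamma=2.0, C=10.0)    # the kernel trick does the mapping implicitly
clf.fit(X, y)
print(clf.predict(X))                         # should recover [0, 1, 1, 0]
```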

SVM Example Example of Non-linear SVM

SVM Example (cont…) Results

Summary of SVM Classification Strengths Over-fitting is not common Works well with high dimensional data Fast classification Good generalization capacity Weaknesses Retraining is difficult No explanation capability Slow training At the cutting edge of machine learning

SVM vs. ANN SVM: relatively new concept; nice generalization properties; hard to learn – learned in batch mode using quadratic programming techniques; using kernels it can learn very complex functions. ANN: quite old; generalizes well but doesn’t have a strong mathematical foundation; can easily be learned in incremental fashion; to learn complex functions use a multilayer perceptron (not that trivial).

SVM Related Links http://svm.dcs.rhbnc.ac.uk/ http://www.kernel-machines.org/ C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998. SVMlight – Software (in C) http://ais.gmd.de/~thorsten/svm_light BOOK: An Introduction to Support Vector Machines N. Cristianini and J. Shawe-Taylor Cambridge University Press

Association-Based Classification Several methods for association-based classification: ARCS: Quantitative association mining and clustering of association rules (Lent et al’97) It beats C4.5 in (mainly) scalability and also accuracy Associative classification (Liu et al’98) It mines high support and high confidence rules in the form of “cond_set => y”, where y is a class label CAEP (Classification by aggregating emerging patterns) (Dong et al’99) Emerging patterns (EPs): the itemsets whose support increases significantly from one class to another Mine EPs based on minimum support and growth rate

What Is Prediction? Prediction is similar to classification: first, construct a model; second, use the model to predict an unknown value. The major method for prediction is regression: linear and multiple regression, non-linear regression. Prediction is different from classification: classification predicts a categorical class label, while prediction models continuous-valued functions.

Regression Analysis and Log-Linear Models in Prediction Linear regression: Y = α + βX. The two parameters, α and β, specify the line and are estimated from the data at hand, using the least-squares criterion on the known values Y1, Y2,…, X1, X2,…. Multiple regression: Y = b0 + b1X1 + b2X2. Many nonlinear functions can be transformed into the above. Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g. p(a, b, c, d) = αab βac χad δbcd
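
A hedged illustration (not from the slides) of fitting Y = α + βX by least squares with NumPy; the data points are invented:

```python
# Least-squares fit of a straight line to invented data.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

beta, alpha = np.polyfit(X, Y, deg=1)       # slope and intercept minimising squared error
print(f"Y ~ {alpha:.2f} + {beta:.2f} X")    # roughly Y ~ 0.30 + 1.94 X for this data
```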

Prediction: Numerical Data

Prediction: Categorical Data

Concerns Over Classification Techniques When choosing a technique for a specific classification problem we must consider the following issues: Classification accuracy Training speed Classification speed Danger of over-fitting Generalisation capacity Implications for retraining Explanation capability

Evaluating Classification Accuracy During development, and in testing before deploying a classifier in the wild, we need to be able to quantify the performance of the classifier How accurate is the classifier? When the classifier is wrong, how is it wrong? Useful to decide on which classifier (which parameters) to use and to estimate what the performance of the system will be

Evaluating Classifiers (cont…) How we do this depends on how much data is available If there is unlimited data available then there is no problem Usually we have less data than we would like so we have to compromise Use hold-out testing sets Cross validation K-fold cross validation Leave-one-out validation Parallel live test

Hold-Out Testing Sets Split the available data into a training set and a test set. Train the classifier on the training set and evaluate it on the test set. A couple of drawbacks: we may not have enough data, and we may happen upon an unfortunate split. (Figure: the total number of available examples divided into a training set and a test set.)
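
A hedged sketch (not from the slides) of a hold-out split with scikit-learn; the data set and the 70/30 split ratio are chosen purely for illustration:

```python
# Hold-out evaluation: train on one part of the data, test on the rest.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", clf.score(X_test, y_test))
```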

K-Fold Cross Validation Divide the entire data set into k folds. For each of the k experiments, use the kth fold for testing and everything else for training. (Figure: the available examples divided into folds, with a different fold acting as the test set in experiments k = 0, 1, 2, 3.)

K-Fold Cross Validation (cont…) The accuracy of the system is calculated as the average accuracy (or error) across the k folds. The main advantages of k-fold cross validation are that every example is used in testing at some stage and the problem of an unfortunate split is avoided. Any value can be used for k; 10 is most common, but it depends on the data set.
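
A hedged sketch (not from the slides) of 10-fold cross-validation with scikit-learn, averaging accuracy across the folds; the iris data set is used purely for illustration:

```python
# 10-fold cross-validation: every example is used for testing exactly once.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```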

Leave-One-Out Cross Validation Extreme case of k-fold cross validation: with N data examples, perform N experiments, each with N-1 training cases and 1 test case. (Figure: the sequence of experiments k = 0, 1, 2, …, each holding out a single example as the test case.)

Classifier Accuracy The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. It is often also referred to as the recognition rate. The error rate (or misclassification rate) is simply 1 minus the accuracy.

False Positives Vs False Negatives While it is useful to generate the simple accuracy of a classifier, sometimes we need more When is the classifier wrong? False positives vs false negatives Related to type I and type II errors in statistics Often there is a different cost associated with false positives and false negatives Think about diagnosing diseases

Confusion Matrix Device used to illustrate how a classifier is performing in terms of false positives and false negatives. Gives us more information than a single accuracy figure. Allows us to think about the cost of mistakes. Can be extended to any number of classes.

                               Classifier Result
                               Class A (yes)        Class B (no)
Expected   Class A (yes)       true positive (tp)   false negative (fn)
Result     Class B (no)        false positive (fp)  true negative (tn)
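
A hedged sketch (not from the slides) of computing a two-class confusion matrix with scikit-learn; the expected and predicted labels are invented:

```python
# Confusion matrix for a two-class problem (rows: expected, columns: predicted).
from sklearn.metrics import confusion_matrix

expected  = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]

print(confusion_matrix(expected, predicted, labels=["yes", "no"]))
```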

Other Accuracy Measures Sometimes a simple accuracy measure is not enough; measures built from the confusion-matrix counts, such as precision, recall (sensitivity) and specificity, give a more detailed picture (see below).
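
The slide's formulas did not survive extraction; as a hedged reconstruction, the standard measures built from the confusion-matrix counts tp, fp, tn and fn are:

```latex
\text{precision} = \frac{tp}{tp + fp}, \qquad
\text{recall (sensitivity)} = \frac{tp}{tp + fn}, \qquad
\text{specificity} = \frac{tn}{tn + fp}
```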

ROC Curves Receiver Operating Characteristic (ROC) curves were originally used to make sense of noisy radio signals Can be used to help us talk about classifier performance and determine the best operating point for a classifier

ROC Curves (cont…) Consider how the relationship between true positives and false positives can change as the operating point moves; we need to choose the best operating point. (Figure: a ROC curve plotting true positives against false positives, with both axes running up to 1.0.)

ROC Curves (cont…) ROC curves can be used to compare classifiers: the greater the area under the curve, the more accurate the classifier. (Figure: ROC curves for two classifiers plotted as true positives against false positives.)
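
A hedged sketch (not from the slides) of computing a ROC curve and the area under it with scikit-learn; the labels and classifier scores are invented:

```python
# ROC curve and AUC from true labels and classifier confidence scores.
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]   # confidence for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("false positive rates:", fpr)
print("true positive rates:", tpr)
print("area under the curve:", roc_auc_score(y_true, y_scores))
```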

Over-Fitting When we train a classifier we are trying to learn a function approximated by the training data we happen to use. What if the training data doesn’t cover the whole problem space? We can learn the training data too closely, which hampers the ability to generalise. This problem is known as overfitting. Depending on the type of classifier used there are different approaches to avoiding it.

Ensembles In order to improve classification accuracy we can aggregate the results of an ensemble of classifiers. (Figure: classifiers 0 to n feeding their predictions into an aggregation step.)

Bagging Given a set S of s samples Generate a bootstrap sample T from S Cases in S may not appear in T or may appear more than once Repeat this sampling procedure, getting a sequence of k independent training sets A corresponding sequence of classifiers C1,C2,…,Ck is constructed for each of these training sets, by using the same classification algorithm

Bagging (cont…) To classify an unknown sample X, let each classifier predict or vote. The bagged classifier C* counts the votes and assigns X to the class with the most votes.
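
A hedged sketch (not from the slides) of bagging with scikit-learn, where C* is the majority vote over classifiers trained on bootstrap samples; the iris data set is used purely for illustration:

```python
# Bagging: train k classifiers on bootstrap samples and combine them by voting.
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
bagged = BaggingClassifier(n_estimators=10, random_state=0)  # k = 10; default base learner is a decision tree
bagged.fit(X, y)
print(bagged.predict(X[:5]))
```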

Boosting Technique — Algorithm Assign every example an equal weight 1/N. For t = 1, 2, …, T do: Obtain a hypothesis (classifier) h(t) under the current weights w(t) Calculate the error of h(t) and re-weight the examples based on the error (each classifier is dependent on the previous ones, and samples that are incorrectly predicted are weighted more heavily) Normalize w(t+1) so the weights sum to 1 Output a weighted combination of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
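
A hedged sketch (not from the slides) using scikit-learn's AdaBoost, which follows the re-weighting scheme described above; the iris data set is used purely for illustration:

```python
# Boosting: each new classifier focuses on the examples the previous ones got wrong.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)  # T = 50 weak hypotheses
boosted.fit(X, y)
print(boosted.predict(X[:5]))
```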

Summary Classification is an extensively studied problem, mainly in statistics and machine learning. Classification is probably one of the most widely used data mining techniques. Scalability is still an important issue for database applications. Research directions: classification of non-relational data, e.g. text, spatial and multimedia data.

Questions?