Data mining and statistical learning - lecture 12: Neural networks (NN) and Multivariate Adaptive Regression Splines (MARS)

Neural networks (NN) and Multivariate Adaptive Regression Splines (MARS)
- Different types of neural networks
- Considerations in neural network modelling
- Multivariate adaptive regression splines

Feed-forward neural network
[Diagram: input layer x_1, ..., x_p; hidden layer z_1, ..., z_M; output layer f_1, ..., f_K]

Terminology
- Feed-forward network: nodes in one layer are connected to the nodes in the next layer
- Recurrent network: nodes in one layer may be connected to nodes in a previous layer or within the same layer

Multilayer perceptrons
- Any number of inputs and any number of outputs
- One or more hidden layers, each with any number of units
- Linear combinations of the outputs from one layer form the inputs to the following layer
- Sigmoid activation functions in the hidden layers
[Diagram: inputs x_1, ..., x_p; hidden units z_1, ..., z_M; outputs f_1, ..., f_K]
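A minimal numpy sketch of the forward pass through such a network; the sizes p = 3, M = 4, K = 2 and the random weights are hypothetical placeholders:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
p, M, K = 3, 4, 2
alpha0, alpha = rng.normal(size=M), rng.normal(size=(M, p))  # hidden biases and weights
beta0, beta = rng.normal(size=K), rng.normal(size=(K, M))    # output biases and weights

x = rng.normal(size=p)              # one input vector
z = sigmoid(alpha0 + alpha @ x)     # hidden-layer outputs z_1, ..., z_M
f = beta0 + beta @ z                # outputs f_1, ..., f_K (identity activation)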

Parameters in a multilayer perceptron
- C_1, C_2: combination functions
- g, σ: activation functions
- α_0m, β_0k: biases of the hidden and output units
- α_im, β_jk: weights of the connections

Least squares fitting of neural networks
Consider a simple perceptron (no hidden layer). Find the weights and biases minimizing the error function, for regression the sum of squared errors R = Σ_k Σ_i (y_ik - f_k(x_i))^2.
[Diagram: inputs x_1, ..., x_p; outputs f_1, ..., f_K]
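Assuming the sum-of-squared-errors criterion above, a minimal sketch of the error function and one gradient-descent step for the no-hidden-layer case (array shapes and step size are illustrative):

import numpy as np

def sse(W, b, X, Y):
    # no hidden layer: f(x) = W x + b; sum over outputs and observations
    F = X @ W.T + b
    return np.sum((Y - F) ** 2)

def gradient_step(W, b, X, Y, lr=0.01):
    F = X @ W.T + b
    E = F - Y                           # residuals, shape (N, K)
    W_new = W - lr * 2 * E.T @ X        # dR/dW
    b_new = b - lr * 2 * E.sum(axis=0)  # dR/db
    return W_new, b_new

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(50, 3)), rng.normal(size=(50, 2))
W, b = np.zeros((2, 3)), np.zeros(2)
for _ in range(100):
    W, b = gradient_step(W, b, X, Y)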

Alternative measures of fit
- For regression we normally use the sum of squared errors as the measure of fit
- For classification we use either squared error or cross-entropy (deviance), and the corresponding classifier is argmax_k f_k(x)
- The measure of fit can also be adapted to specific distributions, such as the Poisson distribution

Combination and activation functions
- Combination function
  - Linear combination: a weighted sum of the inputs plus a bias
  - Radial combination: the squared distance between the input vector and a centre (weight) vector
- Activation function in the hidden layer: identity or sigmoid
- Activation function in the output layer: softmax or identity
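A sketch of standard versions of these functions (the stability shift inside softmax is an implementation detail, not from the slides):

import numpy as np

def linear_combination(x, w, b):   # b + sum_j w_j x_j
    return b + w @ x

def radial_combination(x, c):      # squared distance to a centre vector c
    return np.sum((x - c) ** 2)

def identity(t):
    return t

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def softmax(t):
    e = np.exp(t - np.max(t))      # shift for numerical stability
    return e / e.sum()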

Ordinary radial basis function networks (ORBF)
- Input and output layers and one hidden layer
- Hidden layer: combination function = radial, activation function = exponential or softmax
- Output layer: combination function = linear, activation function = any, normally the identity
[Diagram: inputs x_1, ..., x_p; hidden units z_1, ..., z_M; outputs f_1, ..., f_K]
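A sketch of an ORBF forward pass with exponential activation; the per-centre width parameter is an assumption (other parametrizations exist):

import numpy as np

def orbf_forward(x, centres, widths, beta0, beta):
    # Hidden layer: radial combination + exponential activation (Gaussian bumps)
    r2 = np.sum((centres - x) ** 2, axis=1)   # squared distance to each centre
    z = np.exp(-r2 / (2 * widths ** 2))       # one hidden output per centre
    return beta0 + beta @ z                   # linear output layer, identity activation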

Issues in neural network modelling
- Preliminary training: learning with different initial weights, since multiple local minima are possible
- Scaling of the inputs is important (standardization)
- The number of nodes in the hidden layer(s)
- The choice of activation function in the output layer: identity for interval-scaled targets, softmax for nominal targets
The first two points are illustrated in the sketch below.
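A hedged illustration using scikit-learn; the data, network size, and number of restarts are arbitrary:

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(1).normal(size=(100, 3)) * [1, 100, 0.01]  # inputs on very different scales
y = (X[:, 0] + X[:, 1] / 100 > 0).astype(int)

Xs = StandardScaler().fit_transform(X)   # standardize inputs before training

# Preliminary training: refit with different random initial weights and keep
# the best fit, since the error surface has multiple local minima
fits = [MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000,
                      random_state=s).fit(Xs, y) for s in range(5)]
best = max(fits, key=lambda m: m.score(Xs, y))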

Overcoming over-fitting
1. Early stopping
2. Adding a penalty term: objective function = error function + penalty term
Both remedies are sketched below.
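Both remedies as they appear in scikit-learn's MLPRegressor, shown as configuration only (the specific hyperparameter values are placeholders; fit with net.fit(X, y)):

from sklearn.neural_network import MLPRegressor

# 1. Early stopping: hold out part of the training data and stop training
#    when the validation score no longer improves
net1 = MLPRegressor(hidden_layer_sizes=(8,), early_stopping=True,
                    validation_fraction=0.2, max_iter=2000)

# 2. Penalty term: alpha weights an L2 penalty added to the error function
#    (weight decay)
net2 = MLPRegressor(hidden_layer_sizes=(8,), alpha=0.01, max_iter=2000)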

MARS: Multivariate Adaptive Regression Splines
An adaptive procedure for regression that can be regarded as a generalization of stepwise linear regression.

Reflected pair of functions with a knot at the value x_1
The pair consists of (X - x_1)_+ and (x_1 - X)_+, where (.)_+ denotes the positive part.
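In code, the reflected pair is just a pair of hinge functions:

import numpy as np

def hinge_pos(X, knot):   # (X - knot)_+
    return np.maximum(X - knot, 0.0)

def hinge_neg(X, knot):   # (knot - X)_+
    return np.maximum(knot - X, 0.0)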

Reflected pairs of functions with knots at the values x_1 and x_2
[Figure: two reflected pairs, one with a knot at x_1 and one with a knot at x_2]

MARS with a single input X taking the values x_1, ..., x_N
- Form the collection of basis functions C = {(X - x_i)_+, (x_i - X)_+ : i = 1, ..., N}
- Construct models of the form f(X) = β_0 + Σ_m β_m h_m(X), where each h_m(X) is a function in C or a product of two or more such functions
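A sketch of the basis collection C and a least squares fit for a fixed choice of basis functions; the toy data and the choice of the first two knots are illustrative:

import numpy as np

x = np.array([0.2, 0.5, 0.9])             # observed values x_1, ..., x_N of the input
C = []
for knot in x:                             # one reflected pair per observed value
    C.append(lambda t, k=knot: np.maximum(t - k, 0.0))
    C.append(lambda t, k=knot: np.maximum(k - t, 0.0))

def design_matrix(t, funcs):
    # column of ones for beta_0 plus one column per basis function h_m
    return np.column_stack([np.ones_like(t)] + [h(t) for h in funcs])

# Least squares fit of f(X) = beta_0 + sum_m beta_m h_m(X)
t = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * t)                  # toy response
H = design_matrix(t, C[:4])                # the reflected pairs for knots x_1 and x_2
beta, *_ = np.linalg.lstsq(H, y, rcond=None)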

MARS model with a single input X taking the values x_1 and x_2
[Figure: fitted piecewise-linear MARS functions built from the reflected pairs with knots at x_1 and x_2]

MARS: Multivariate Adaptive Regression Splines
- At each stage we consider as a new basis-function pair all products of a function already in the model with one of the reflected pairs in the set C
- Although each basis function depends only on a single X_j, it is treated as a function over the entire input space
A simplified sketch of this forward stage follows.
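This sketch greedily adds the product pair that most reduces the residual sum of squares; note that real MARS additionally forbids an input from appearing more than once in a product and caps the interaction order, which the sketch omits:

import numpy as np

def mars_forward(X, y, knots, n_steps):
    basis = [lambda X: np.ones(len(X))]     # start with the constant term
    N, p = X.shape
    for _ in range(n_steps):
        best = None
        for h in list(basis):               # existing function in the model
            for j in range(p):
                for k in knots[j]:          # candidate reflected pair
                    pair = (lambda X, h=h, j=j, k=k: h(X) * np.maximum(X[:, j] - k, 0.0),
                            lambda X, h=h, j=j, k=k: h(X) * np.maximum(k - X[:, j], 0.0))
                    H = np.column_stack([b(X) for b in basis] + [pair[0](X), pair[1](X)])
                    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
                    sse = np.sum((y - H @ beta) ** 2)
                    if best is None or sse < best[0]:
                        best = (sse, pair)
        basis.extend(best[1])               # add the winning pair
    return basis

rng = np.random.default_rng(0)
X = rng.uniform(size=(40, 2))
y = np.maximum(X[:, 0] - 0.5, 0) + rng.normal(0, 0.05, 40)
knots = [np.unique(X[:, j]) for j in range(2)]
basis = mars_forward(X, y, knots, n_steps=2)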

MARS: Multivariate Adaptive Regression Splines - model selection
- MARS models typically overfit the data, so a backward deletion procedure is applied
- The size of the model is determined by generalized cross-validation (GCV)
- An upper limit can be set on the order of interaction
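One standard form of the criterion (from Hastie et al.) is GCV(λ) = Σ_i (y_i - f̂_λ(x_i))^2 / (1 - M(λ)/N)^2, where M(λ) = r + cK is the effective number of parameters for r basis functions and K knots, with c about 3. A direct translation:

import numpy as np

def gcv(y, y_hat, n_basis, n_knots, c=3.0):
    N = len(y)
    M = n_basis + c * n_knots            # effective number of parameters
    return np.sum((y - y_hat) ** 2) / (1.0 - M / N) ** 2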

The MARS model can be viewed as a generalization of the classification and regression tree (CART).

Some characteristics of different learning methods

Characteristic                                       Neural networks   Trees   MARS
Natural handling of data of “mixed” type             Poor              Good    Good
Handling of missing values                           Poor              Good    Good
Robustness to outliers in input space                Poor              Good    Poor
Insensitive to monotone transformations of inputs    Poor              Good    Poor
Computational scalability (large N)                  Poor              Good    Good
Ability to deal with irrelevant inputs               Poor              Good    Good
Ability to extract linear combinations of features   Good              Poor    Poor
Interpretability                                     Poor              Fair    Good
Predictive power                                     Good              Poor    Fair

Separating hyperplane

Optimal separating hyperplane - support vector classifier
Find the hyperplane that creates the biggest margin between the training points for class 1 and class -1.
[Figure: two point clouds separated by a hyperplane, with the margin as the slab between the dashed boundaries]

Formulation of the optimization problem
- Signed distance from a point x to the decision boundary {x: x^T β + β_0 = 0}: (x^T β + β_0) / ||β||
- y = 1 for one of the groups and y = -1 for the other

Two equivalent formulations of the optimization problem
- Maximize M over β, β_0 with ||β|| = 1, subject to y_i (x_i^T β + β_0) ≥ M for all i
- Equivalently, minimize ||β|| subject to y_i (x_i^T β + β_0) ≥ 1 for all i; the margin is then M = 1/||β||
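A small demonstration with scikit-learn: for (nearly) separable data and a large cost parameter, the fitted coefficients β give the margin slab width 2/||β|| (the data and constants are arbitrary):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

svm = SVC(kernel="linear", C=1e6).fit(X, y)  # large C approximates the separable case
beta = svm.coef_.ravel()
margin = 2.0 / np.linalg.norm(beta)          # width of the margin slab
print(margin, svm.support_)                  # the support vectors define the margin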

Characteristics of the support vector classifier
- Points well inside their class boundary do not play a big role in shaping the decision boundary
- Compare linear discriminant analysis (LDA), for which the decision boundary is determined by the covariance matrix of the class distributions and their centroids

Support vector machines using basis expansions (polynomials, splines)
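For example, a degree-3 polynomial expansion is obtained implicitly through the kernel trick, so the enlarged feature space is never formed explicitly (toy data for illustration):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # nonlinearly separable classes

svm_poly = SVC(kernel="poly", degree=3, C=1.0).fit(X, y)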

Characteristics of support vector machines
- The dimension of the enlarged feature space can be very large
- Overfitting is prevented by a built-in shrinkage of the beta coefficients
- Irrelevant inputs can create serious problems

The SVM as a penalization method
- Misclassification: f(x) < 0 when y = 1, or f(x) > 0 when y = -1
- Loss function (hinge loss): Σ_i [1 - y_i f(x_i)]_+
- Loss function + penalty: Σ_i [1 - y_i f(x_i)]_+ + (λ/2) ||β||^2
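The hinge loss and the penalized objective in code (λ and the array shapes are placeholders):

import numpy as np

def hinge_loss(y, f):
    # [1 - y f(x)]_+ : zero for points well inside their class boundary
    return np.maximum(1.0 - y * f, 0.0)

def objective(y, f, beta, lam):
    # loss + penalty: sum of hinge losses plus (lam / 2) * ||beta||^2
    return hinge_loss(y, f).sum() + 0.5 * lam * np.sum(beta ** 2)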

The SVM as a penalization method
- Minimizing the loss function + penalty is equivalent to fitting a support vector machine to the data
- The penalty factor λ is a function of the constant that provides an upper bound on the sum of the slack variables in the support vector formulation

Some characteristics of different learning methods

Characteristic                                       Neural networks   SVM    Trees   MARS
Natural handling of data of “mixed” type             Poor              Poor   Good    Good
Handling of missing values                           Poor              Poor   Good    Good
Robustness to outliers in input space                Poor              Poor   Good    Poor
Insensitive to monotone transformations of inputs    Poor              Poor   Good    Poor
Computational scalability (large N)                  Poor              Poor   Good    Good
Ability to deal with irrelevant inputs               Poor              Poor   Good    Good
Ability to extract linear combinations of features   Good              Good   Poor    Poor
Interpretability                                     Poor              Poor   Fair    Good
Predictive power                                     Good              Good   Poor    Fair