Classification, Part 2 BMTRY 726 4/11/14

The 3 Approaches

(1) Discriminant functions:
- Find a function f(x) that maps each point x directly into a class label
- NOTE: in this case, probabilities play no role

(2) Linear (Quadratic) Discriminant Analysis
- Solve the inference problem using the prior class probabilities P(C = k) and the class-conditional densities P(x | C = k)
- Use Bayes' theorem to find the posterior class probabilities P(C = k | x)
- Use the posteriors to make the optimal decision

(3) Logistic Regression
- Solve the inference problem by determining the posterior class probabilities P(C = k | x) directly
- Use the posteriors to make the optimal decision

Logistic Regression

Probably the most commonly used linear classifier (certainly one we all know).

If the outcome is binary, we can describe the relationship between our features x and the probability of our outcome as a linear relationship on the logit scale:

log[ P(C = 1 | x) / (1 − P(C = 1 | x)) ] = β0 + β'x

Using this we define the posterior probability of being in either of the two classes as

P(C = 1 | x) = exp(β0 + β'x) / (1 + exp(β0 + β'x))
P(C = 2 | x) = 1 / (1 + exp(β0 + β'x))
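To make the binary model concrete, here is a minimal sketch of fitting it by maximum likelihood. It is not part of the original slides and assumes statsmodels is available; the data are simulated stand-ins for real features.

```python
import numpy as np
import statsmodels.api as sm

# Simulated (hypothetical) data: 200 observations, 2 features, binary outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_logit = 0.5 + X @ np.array([1.0, -2.0])
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-true_logit))).astype(int)

# Fit log[p/(1-p)] = b0 + b1*x1 + b2*x2 by maximum likelihood
fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(fit.params)                       # estimated b0, b1, b2

# Posterior probabilities P(C = 1 | x)
p_hat = fit.predict(sm.add_constant(X))
```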

Logistic Regression for K > 2

But what happens if we have more than two classes? Examples:

(1) An ordinal outcome (e.g., a Likert scale) describing attitudes toward the safety of epidurals during labor among expectant mothers
- Features may include age, ethnicity, level of education, socio-economic status, parity, etc.

(2) The goal is to distinguish between several different types of lung tumors (both malignant and non-malignant):
- small cell lung cancer
- non-small cell lung cancer
- granulomatosis
- sarcoidosis
In this case features may be pixel values from a CT scan image.

Logistic Regression for K > 2

In the first example, the outcome is ordinal.

One possible option is to fit a cumulative logit model:

logit[ P(C ≤ j | X = x) ] = β0j + βj'x

The model for P(C ≤ j | X = x) is just a logit model for a binary response: the response takes value 1 if y ≤ j and 0 if y ≥ j + 1.
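As a quick illustration of the "binary response at a fixed cut point" view (not from the slides, assuming statsmodels and a toy ordinal outcome):

```python
import numpy as np
import statsmodels.api as sm

# Toy ordinal outcome y in {1, ..., 5} and a single feature x
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = np.clip(np.round(3 + x + rng.normal(size=200)), 1, 5)

# Cumulative logit at cut point j: binary response z = 1{y <= j}
j = 2
z = (y <= j).astype(int)
fit_j = sm.Logit(z, sm.add_constant(x)).fit(disp=0)
print(fit_j.params)   # intercept b_{0j} and slope for this single cut point
```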

Logistic Regression for K > 2

It is of greater interest, however, to model all K – 1 cumulative logits in a single model. This leads us to the proportional odds model:

logit[ P(C ≤ j | X = x) ] = β0j + β'x,  j = 1, …, K – 1

Notice the intercept β0j is allowed to vary as j increases; however, the other model parameters β remain constant. Does this make sense given the name of the model?

Logistic Regression for K > 2

Assumptions for the proportional odds model:
- Intercepts are increasing with increasing j
- Models share the same rate of increase with increasing j
- Odds ratios are proportional to the distance between x1 and x2, and the proportionality constant is the same for each logit

So for j < k, the curve for P(C ≤ k | X = x) is equivalent to the curve for P(C ≤ j | X = x) shifted (β0k – β0j)/β units in the x direction.

Odds ratios used to interpret the model are cumulative odds ratios.
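A hedged sketch of fitting the proportional odds model, not from the slides; it assumes a recent statsmodels (≥ 0.13) that provides OrderedModel, and uses simulated data in place of the epidural-attitude example.

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Simulated stand-in for a 5-level ordinal attitude outcome with two features
rng = np.random.default_rng(2)
X = pd.DataFrame({"age": rng.normal(30, 5, 300), "parity": rng.integers(0, 4, 300)})
latent = 0.05 * X["age"] - 0.3 * X["parity"] + rng.logistic(size=300)
y = pd.cut(latent, bins=5, labels=False)   # crude ordinal outcome for illustration

# Proportional odds: logit P(C <= j | x) = b_{0j} + b'x, with a common slope b
fit = OrderedModel(y, X, distr="logit").fit(method="bfgs", disp=0)
print(fit.params)   # slopes for age and parity plus K - 1 threshold (intercept) terms
```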

Logistic Regression for K > 2

In the second example, our class categories are not ordinal. We can, however, fit a multinomial logit model.

The model includes K – 1 logit models, each contrasting one class with a reference class (say, class K):

log[ P(C = k | X = x) / P(C = K | X = x) ] = β0k + βk'x,  k = 1, …, K – 1

Logistic Regression for K > 2

We can estimate the posterior probabilities of each of our K classes from the multinomial logit model:

P(C = k | X = x) = exp(β0k + βk'x) / (1 + Σ_{l=1}^{K–1} exp(β0l + βl'x)),  k = 1, …, K – 1
P(C = K | X = x) = 1 / (1 + Σ_{l=1}^{K–1} exp(β0l + βl'x))

When K = 2, this reduces to a single linear function (i.e. a single logistic regression).
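The posterior formula above can be computed directly; here is a small sketch (not from the slides) that takes the K – 1 linear predictors and returns all K posterior probabilities, illustrating the K = 2 reduction.

```python
import numpy as np

def multinomial_posteriors(eta):
    """eta: (n, K-1) array of linear predictors b_{0k} + b_k'x, k = 1..K-1,
    with class K as the reference. Returns an (n, K) array of posteriors."""
    expo = np.exp(eta)
    denom = 1 + expo.sum(axis=1, keepdims=True)
    return np.hstack([expo / denom, 1 / denom])

# With K = 2 there is a single linear predictor: ordinary logistic regression
eta = np.array([[0.3], [-1.2], [2.0]])
print(multinomial_posteriors(eta))   # columns: P(C=1|x), P(C=2|x); rows sum to 1
```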

Logistic Regression for K > 2

Though we've used the last category as the reference, since the data have no natural ordering we could reference any category we choose.

Unlike the cumulative logit and proportional odds models, all parameters (intercepts and slopes) vary across the K – 1 logits.

Logistic Regression for K > 2

As in the case of the ordinal models, it makes more sense to fit these models simultaneously. As a result, there are some assumptions and constraints we must impose:

(1) The different classes in the data represent a multinomial distribution
- Constraint: all posterior probabilities must sum to 1
- In order to achieve this, all models are fit simultaneously

(2) Independence of Irrelevant Alternatives:
- The relative odds between any two outcomes are independent of the number and nature of the other outcomes being simultaneously considered
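A minimal sketch of fitting the multinomial (baseline-category) logit model, not from the slides; it assumes statsmodels' MNLogit and uses random stand-in data rather than real CT features.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical unordered outcome with K = 3 classes (e.g., lesion types coded 0, 1, 2)
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))        # stand-in for image-derived features
y = rng.integers(0, 3, size=300)     # toy labels for illustration only

# Baseline-category logit: K - 1 = 2 coefficient vectors, each contrasting
# one class against the reference class; all logits are fit simultaneously
fit = sm.MNLogit(y, sm.add_constant(X)).fit(disp=0)
print(fit.params)                    # one column of coefficients per non-reference class

post = fit.predict(sm.add_constant(X))   # posterior P(C = k | x); rows sum to 1
```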

Logistic Regression vs. LDA

Both LDA and logistic regression represent models of the log-posterior odds between classes k and K that are linear functions of x.

LDA:
log[ P(C = k | x) / P(C = K | x) ] = log(πk/πK) − ½(μk + μK)'Σ⁻¹(μk − μK) + x'Σ⁻¹(μk − μK)
                                   = α0k + αk'x

Logistic regression:
log[ P(C = k | x) / P(C = K | x) ] = β0k + βk'x

Logistic Regression vs. LDA

The posterior conditional density of class k for both LDA and logistic regression can be written in the linear logit form:

P(C = k | x) = exp(β0k + βk'x) / (1 + Σ_{l=1}^{K–1} exp(β0l + βl'x))

The joint density for both can be written in the same way:

P(x, C = k) = P(x) P(C = k | x)

Both methods produce linear decision boundaries that classify observations. So what's the difference?

Logistic Regression vs. LDA

The difference lies in how the linear coefficients are estimated.

LDA: Parameters are fit by maximizing the full log-likelihood based on the joint density

P(x, C = k) = φ(x; μk, Σ) πk

- recall here φ is the Gaussian density function

Logistic regression: In this case the marginal density P(X = x) is arbitrary, and parameters are estimated by maximizing the conditional multinomial likelihood
- although ignored, we can think of this marginal density as being estimated in a nonparametric, unrestricted fashion
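To see the estimation difference in practice, here is a small comparison sketch (not from the slides, assuming scikit-learn): both fits give a linear boundary, but LDA uses the joint Gaussian likelihood while logistic regression uses only the conditional likelihood, so the coefficients generally differ.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# Two Gaussian classes with a common covariance (the situation LDA assumes)
rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([2, 2], 1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)

lda = LinearDiscriminantAnalysis().fit(X, y)   # maximizes the full (joint) likelihood
lr = LogisticRegression().fit(X, y)            # maximizes the conditional likelihood
print(lda.coef_, lda.intercept_)
print(lr.coef_, lr.intercept_)
```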

Logistic Regression vs. LDA

So this means…

(1) LR makes fewer assumptions about the distribution of the data (a more general approach)

(2) But LR "ignores" the marginal distribution P(X = x)
- Including additional distributional assumptions provides more information about the parameters, allowing for more efficient estimation (i.e. lower variance)
- If the Gaussian assumption is correct, we could lose up to 30% efficiency
- OR we need 30% more data for the conditional likelihood to do as well as the full likelihood

Logistic Regression vs. LDA

(3) If observations are far from the decision boundary (i.e. probably NOT Gaussian), they still influence estimation of the common covariance matrix
- i.e. LDA is not robust to outliers

(4) If the data in a two-class model can be perfectly separated by a hyperplane, the LR parameters are undefined, but the LDA coefficients are still well defined (the marginal likelihood avoids this degeneracy)
- e.g. one particular feature has all of its density in one class

There are advantages and disadvantages to both methods:
- LR is thought of as more robust because it makes fewer assumptions
- In practice they tend to perform similarly
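A small illustration of point (4), not from the slides (assuming scikit-learn): with perfectly separable data the unpenalized logistic MLE does not exist, so the coefficient grows without bound as the penalty is relaxed, while LDA stays well behaved.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# Perfectly separable one-feature data: x < 0 is always class 0, x > 0 always class 1
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# sklearn's default L2 penalty keeps the fit finite; relaxing it (large C)
# lets the coefficient blow up, reflecting the undefined unpenalized MLE
lr_weak = LogisticRegression(C=1e6, max_iter=10_000).fit(X, y)
lda = LinearDiscriminantAnalysis().fit(X, y)    # still well defined
print(lr_weak.coef_, lda.coef_)
```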

Problems with Both Methods

There are a few problems with both methods…

(1) As with regression, we have a hard time including a large number of covariates, especially if n is small

(2) A linear boundary may not really be an appropriate choice for separating our classes

So what can we do?

LDA and logistic regression work well if the classes are linearly separable… but what if they aren't? Linear boundaries may be almost useless.

Nonlinear test statistics

The optimal decision boundary may not be a hyperplane → nonlinear test statistic
[Figure: a nonlinear boundary separating the accept-H0 region from the H1 region]

Multivariate statistical methods are a big industry:
- Neural Networks
- Support Vector Machines
- Kernel density methods

Artificial Neural Networks (ANNs)

Central idea: extract linear combinations of the inputs as derived features, and then model the outcome (classes) as a nonlinear function of these features.

Huh!? Really, they are nonlinear statistical models, but with pieces that are already familiar to us.

Biologic Neurons

The idea for neural networks came from biology, more specifically the brain…

- Input signals come from the axons of other neurons, which connect to dendrites (input terminals) at the synapses
- If a sufficient excitatory signal is received, the neuron fires and sends an output signal along its axon
- The firing of the neuron occurs when a threshold excitation is reached

Brains versus Computers: Some numbers

- There are approximately 10 billion neurons in the human cortex, compared with tens of thousands of processors in the most powerful parallel computers
- Each biological neuron is connected to several thousand other neurons, similar to the connectivity in powerful parallel computers
- The lack of processing units can be compensated by speed: typical operating speeds of biological neurons are measured in milliseconds, while a silicon chip can operate in nanoseconds
- The human brain is extremely energy efficient, using far less energy per operation per second than the best computers today
- Brains have been evolving for tens of millions of years; computers have been evolving for tens of decades

ANNs

Non-linear (mathematical) models of an artificial neuron.

[Diagram: input signals x1, …, xp enter with synaptic weights w1, …, wp; their weighted sum passes through an activation/threshold function g to produce the output signal O]
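A tiny sketch of the artificial neuron in the diagram (not from the slides): a weighted sum of the inputs passed through an activation/threshold function, here a sigmoid.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def neuron_output(x, w, g=sigmoid):
    """Single artificial neuron: output O = g(sum_i w_i * x_i)."""
    return g(np.dot(w, x))

# Toy example with p = 3 inputs and arbitrary synaptic weights
x = np.array([0.5, -1.2, 2.0])
w = np.array([0.8, 0.1, -0.4])
print(neuron_output(x, w))
```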

ANNs

A neural network is a 2-stage classification (or regression) model. It can be represented as a network diagram:
- for classification, the output units represent the K classes
- the kth unit models the probability of being in the kth class

[Network diagram: inputs X1, …, Xp feed a hidden layer Z1, …, ZM, which feeds the outputs Y1, …, YK]

ANNs

The Zm represent derived features created from linear combinations of the X's:

Zm = σ(α0m + αm'X),  m = 1, …, M

The Y's are modeled as a function of linear combinations of the Zm:

Tk = β0k + βk'Z,  k = 1, …, K
Yk = fk(X) = gk(T)

σ is called the activation function.

ANNs

The activation function, σ, could be any function we choose. In practice, there are only a few that are frequently used.
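Putting the two stages together, here is a minimal forward-pass sketch (not from the slides). It assumes a sigmoid activation σ for the derived features and a softmax output function gk so the K outputs behave like class probabilities; the weights are random placeholders.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def ann_forward(X, alpha0, alpha, beta0, beta):
    """One-hidden-layer ANN forward pass for K-class classification.
    Z_m = sigmoid(alpha_{0m} + alpha_m'X)  -> M derived features
    T_k = beta_{0k} + beta_k'Z             -> K linear combinations of the Z's
    g_k(T) = softmax                       -> posterior class probabilities"""
    Z = sigmoid(alpha0 + X @ alpha)         # shape (n, M)
    T = beta0 + Z @ beta                    # shape (n, K)
    expT = np.exp(T - T.max(axis=1, keepdims=True))
    return expT / expT.sum(axis=1, keepdims=True)   # rows sum to 1

# Toy dimensions: p = 4 inputs, M = 3 hidden units, K = 2 classes, random weights
rng = np.random.default_rng(5)
X = rng.normal(size=(5, 4))
probs = ann_forward(X, rng.normal(size=3), rng.normal(size=(4, 3)),
                    rng.normal(size=2), rng.normal(size=(3, 2)))
print(probs)
```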

ANNs

ANNs are based on simpler classifiers called perceptrons.
- The original single-layer perceptron used the hard-threshold sign function, but this lacks flexibility, making separation of classes difficult
- It was later adapted to use the sigmoid function
- Note this should be familiar (think back to logistic regression)

ANNs are an adaptation of the original single-layer perceptron that includes multiple layers (and have hence also been referred to as multi-layer perceptrons). Use of the sigmoid function also links them with multinomial logistic regression.

Next Class…
- How do you fit an ANN?
- What are the issues with ANNs?
- Software