Machine Learning Math Essentials Part 2

Gaussian distribution
Most commonly used continuous probability distribution; also known as the normal distribution. Two parameters define a Gaussian:
- Mean μ: location of the center
- Variance σ²: width of the curve

Gaussian distribution: in one dimension
p( x ) = ( 1 / sqrt( 2πσ² ) ) exp( -( x - μ )² / ( 2σ² ) )
- The exponential term causes the pdf to decrease as the distance from the center μ increases; σ² controls the width of the curve.
- The normalizing constant 1 / sqrt( 2πσ² ) ensures that the distribution integrates to 1.
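
A minimal MATLAB sketch (not from the original slides; mu, sigma2, and x are arbitrary example values) that evaluates the two pieces of this pdf at a single point:

mu     = 0;                                         % mean: location of center
sigma2 = 1;                                         % variance: width of curve
x      = 0.5;                                       % point at which to evaluate the pdf
normConst = 1 / sqrt( 2 * pi * sigma2 );            % normalizing constant
falloff   = exp( -( x - mu )^2 / ( 2 * sigma2 ) );  % decreases with distance from center
p = normConst * falloff                             % pdf value at x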

Gaussian distribution: examples with different parameters
μ = 0, σ² = 1      μ = 2, σ² = 1      μ = 0, σ² = 5      μ = -2, σ² = 0.3
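
The curves themselves were a figure; the MATLAB sketch below (an added illustration, not part of the slide) plots the four settings listed above so the effect of μ and σ² on location and width is visible:

params = [  0  1.0 ;   % mu =  0, sigma^2 = 1
            2  1.0 ;   % mu =  2, sigma^2 = 1
            0  5.0 ;   % mu =  0, sigma^2 = 5
           -2  0.3 ];  % mu = -2, sigma^2 = 0.3
xs = linspace( -6, 6, 400 );
figure; hold on;
for i = 1 : size( params, 1 )
    m  = params( i, 1 );
    s2 = params( i, 2 );
    plot( xs, exp( -( xs - m ).^2 / ( 2 * s2 ) ) / sqrt( 2 * pi * s2 ) );
end
legend( '\mu = 0, \sigma^2 = 1', '\mu = 2, \sigma^2 = 1', '\mu = 0, \sigma^2 = 5', '\mu = -2, \sigma^2 = 0.3' );
hold off;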

Multivariate Gaussian distribution: in d dimensions
- x and μ are now d-dimensional vectors; μ gives the center of the distribution in d-dimensional space.
- σ² is replaced by Σ, the d x d covariance matrix. Σ contains the pairwise covariances of every pair of features; the diagonal elements of Σ are the variances σ² of the individual features.
- Σ describes the distribution's shape and spread.
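
The density formula itself is not in the transcript; the sketch below (an added illustration, reusing the μ and Σ of the 3-d sampling example later in this deck) evaluates the standard d-dimensional Gaussian pdf p( x ) = exp( -0.5 ( x - μ )' Σ^-1 ( x - μ ) ) / sqrt( ( 2π )^d det( Σ ) ):

mu    = [ 2; 1; 1 ];
Sigma = [ 0.25 0.30 0.10; 0.30 1.00 0.70; 0.10 0.70 2.00 ];
x     = [ 2; 1; 1 ];              % evaluate at the center, where the density is largest
d     = length( mu );
dx    = x - mu;
p = exp( -0.5 * dx' * ( Sigma \ dx ) ) / sqrt( (2*pi)^d * det( Sigma ) )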

Multivariate Gaussian distribution: covariance
Covariance measures the tendency for two variables to deviate from their means in the same (or opposite) directions at the same time:
cov( x, y ) = E[ ( x - μx )( y - μy ) ]
(Figure: two scatter plots, one with no covariance and one with high positive covariance.)
Notes: covariance is the expected value of two random variables considered simultaneously. When working in higher-dimensional spaces there is a whole matrix of covariances (the covariance matrix Σ above). The expectation can be written as a summation (for discrete distributions) or as an integration (for continuous distributions).
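
A small MATLAB sketch (example data, not from the slides) that estimates this covariance from samples:

rng( 1 );
x = randn( 1000, 1 );
y = 0.8 * x + 0.3 * randn( 1000, 1 );                 % y tends to deviate in the same direction as x
cxy = mean( ( x - mean( x ) ) .* ( y - mean( y ) ) )  % sample estimate of cov( x, y ), close to 0.8
C   = cov( x, y )                                     % built-in: the full 2 x 2 covariance matrix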

Multivariate Gaussian distribution: in two dimensions (two figure-only slides)

Multivariate Gaussian distribution: in three dimensions
MATLAB code to draw 1000 samples from the 3-d Gaussian and plot them:

rng( 1 );
mu    = [ 2; 1; 1 ];
sigma = [ 0.25 0.30 0.10; 0.30 1.00 0.70; 0.10 0.70 2.00 ];
x = randn( 1000, 3 );               % standard normal samples
x = x * chol( sigma );              % chol( sigma )' * chol( sigma ) = sigma, so cov( x ) is approximately sigma
x = x + repmat( mu', 1000, 1 );     % shift the center to mu
scatter3( x( :, 1 ), x( :, 2 ), x( :, 3 ), '.' );

Vector projection
Orthogonal projection of y onto x:
- Can take place in any space of dimensionality ≥ 2.
- The unit vector in the direction of x is x / || x ||.
- The length of the projection of y in the direction of x is || y || cos( θ ), where θ is the angle between x and y.
- The orthogonal projection of y onto x is the vector
  projx( y ) = x || y || cos( θ ) / || x || = [ ( x · y ) / || x ||² ] x    (using the dot product alternate form)
(Figure: vectors y and x, the angle θ between them, and projx( y ).)
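
A short MATLAB sketch of the dot-product form, with made-up vectors x and y:

x = [ 3; 1 ];
y = [ 2; 2 ];
projxy = ( dot( x, y ) / norm( x )^2 ) * x   % orthogonal projection of y onto x
dot( x, y - projxy )                         % residual is orthogonal to x, so this is ~0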

Linear models
There are many types of linear models in machine learning; they are common in both classification and regression. A linear model consists of a vector w in d-dimensional feature space. The vector w attempts to capture the strongest gradient (rate of change) in the output variable, as seen across all training samples. Different linear models optimize w in different ways. A point x in feature space is mapped from d dimensions to a scalar (1-dimensional) output z by projecting it onto w and adding a bias term w0 (cf. Lecture 5b):
z = w0 + w · x = w0 + Σj wj xj
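
A minimal MATLAB sketch with hypothetical values for w0, w, and x, showing the mapping from a d-dimensional point to the scalar z:

w0 = -1.0;                   % bias term
w  = [ 0.5; -2.0; 1.5 ];     % model vector (hypothetical values)
x  = [ 1.0;  0.2; 0.8 ];     % a point in 3-dimensional feature space
z  = w0 + w' * x             % scalar output: projection onto w plus the bias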

Linear models (continued)
The projection output z is typically transformed to a final predicted output y by some function φ:
- example: for logistic regression, φ is the logistic function
- example: for linear regression, φ( z ) = z
Models are called linear because they are a linear function of the model vector components w1, …, wd. Key feature of all linear models: no matter what φ is, a constant value of z is transformed to a constant value of y, so decision boundaries remain linear even after the transform.
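
Continuing the sketch above with a hypothetical value of z, the two example choices of φ look like this in MATLAB:

z = 0.3;                             % projection output from a linear model
y_logistic = 1 / ( 1 + exp( -z ) )   % logistic regression: phi( z ) is the logistic function
y_linear   = z                       % linear regression: phi( z ) = z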

Geometry of projections (sequence of figure-only slides, the last illustrating the margin; slides thanks to Greg Shakhnarovich, CS195-5, Brown Univ., 2006)

From projection to prediction
positive margin → class 1
negative margin → class 0
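
In code this is just a sign test (sketch with a hypothetical margin value):

z = 0.3;                           % signed margin produced by the linear model
predictedClass = double( z > 0 )   % 1 for a positive margin, 0 for a negative one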

Logistic regression in two dimensions
Interpreting the model vector of coefficients. From MATLAB: B = [ 13.0460 -1.9024 -0.4047 ], so w0 = B( 1 ) and w = [ w1 w2 ] = B( 2 : 3 ).
- w0 and w define the location and orientation of the decision boundary: -w0 / || w || is the signed distance of the decision boundary from the origin, and the decision boundary is perpendicular to w.
- The magnitude of w defines the gradient of the probabilities between 0 and 1.
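
A sketch that recovers this geometry from the coefficients quoted above (the computed quantities are standard linear-model geometry, not taken from the transcript):

B  = [ 13.0460 -1.9024 -0.4047 ];
w0 = B( 1 );
w  = B( 2 : 3 );
distFromOrigin = -w0 / norm( w )   % signed distance of the decision boundary from the origin, along w
unitNormal     = w / norm( w )     % the decision boundary is perpendicular to this direction
steepness      = norm( w )         % larger || w || means a steeper transition of probabilities from 0 to 1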

Logistic function in d dimensions (figure-only slide; slide thanks to Greg Shakhnarovich, CS195-5, Brown Univ., 2006)

Decision boundary for logistic regression (figure-only slide; slide thanks to Greg Shakhnarovich, CS195-5, Brown Univ., 2006)