Neural Computation Prof. Nathan Intrator


Neural Computation 0368-4149-01 Prof. Nathan Intrator Tuesday 16:00-19:00, Schreiber 7 Office hours: Wed 4-5, nin@tau.ac.il

Outline Goals for neural learning - unsupervised; goals for statistical/computational learning; PCA; ICA; Exploratory Projection Pursuit; search for non-Gaussian distributions; practical implementations

Statistical Approach to Unsupervised Learning Understanding the nature of data variability; modeling the data (sometimes with a very flexible model); understanding the nature of the noise; applying prior knowledge; extracting features based on: prior knowledge, class prediction, unsupervised learning

Principal Component Analysis. Włodzisław Duch, SCE, NTU, Singapore http://www.ntu.edu.sg/home/aswduch

Neuronal Goal We look for axes which minimise projection errors and maximise the variance after projection: map n-dimensional vectors to m-dimensional vectors, m < n. Ex: transform from 2 to 1 dimension.

Algorithm (cont’d) Preserve as much of the variance as possible. (Figure: rotate the axes so that one direction carries more information (variance) and the other less, then project onto the high-variance direction.)

Linear transformations - example 2D vectors X in a unit circle with mean (1,1); Y = AX, A a 2x2 matrix. The shape is elongated, rotated, and the mean is shifted.

Invariant distances Euclidean distance is not invariant to general linear transformations. It is invariant only for orthonormal matrices A^T A = I, which make rigid rotations without stretching or shrinking distances. Idea: standardize the data in some way to create invariant distances.

Data standardization For each vector X^(j)T = (X_1^(j), ..., X_d^(j)), j = 1..n, calculate the mean of each component: X̄_i = (1/n) Σ_j X_i^(j); n – number of vectors, d – their dimension. X̄ – the vector of mean feature values (averages over rows).

Standard deviation Calculate the standard deviation of each feature: σ_i^2 = (1/n) Σ_j (X_i^(j) - X̄_i)^2. Variance = square of the standard deviation (std), the average squared deviation from the mean value. Transform X => Z, the standardized data vectors: Z_i^(j) = (X_i^(j) - X̄_i)/σ_i.
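
The standardization step above is easy to sketch in numpy; this is a minimal illustration (the names X, Xbar, sigma, Z are illustrative, not from the slides):

```python
import numpy as np

# X holds n vectors of dimension d, one vector per column (d x n), as on the slides
rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=2.0, size=(3, 100))

Xbar = X.mean(axis=1, keepdims=True)    # vector of mean feature values (d x 1)
sigma = X.std(axis=1, keepdims=True)    # per-feature standard deviation
Z = (X - Xbar) / sigma                  # standardized data: zero mean, unit variance

print(np.round(Z.mean(axis=1), 3))      # ~0 for every feature
print(np.round(Z.std(axis=1), 3))       # ~1 for every feature
```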

Std data Standardized data have zero mean and unit variance. Standardize the data after making the data transformation. Effect: the data become invariant to scaling only (diagonal transformations); distances are invariant and the data distribution is the same. How to make the data invariant to any linear transformation?

Terminology (Covariance) How two dimensions vary from the mean with respect to each other. cov(X,Y) > 0: dimensions increase together; cov(X,Y) < 0: one increases, one decreases; cov(X,Y) = 0: dimensions are uncorrelated (independent only in special cases, e.g. jointly Gaussian).

Terminology (Covariance Matrix) Contains the covariance values between all possible pairs of dimensions. Example for three dimensions (x,y,z): C = [[cov(x,x), cov(x,y), cov(x,z)], [cov(y,x), cov(y,y), cov(y,z)], [cov(z,x), cov(z,y), cov(z,z)]] (always symmetric); cov(x,x) = variance of component x.
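
A small numpy sketch of the three-dimensional case described above (the toy data and names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
# 200 samples of three features x, y, z; y is correlated with x
x = rng.normal(size=200)
y = 0.8 * x + 0.2 * rng.normal(size=200)
z = rng.normal(size=200)
data = np.vstack([x, y, z])             # rows = dimensions, columns = samples

C = np.cov(data)                        # 3 x 3 covariance matrix, C[i, j] = cov(dim_i, dim_j)
print(np.allclose(C, C.T))              # True: the covariance matrix is always symmetric
print(np.round(C[0, 0], 3), np.round(np.var(x, ddof=1), 3))   # cov(x,x) equals the variance of x
```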

Properties of the Cov matrix Can be used for creating a distance that is not sensitive to linear transformations. Can be used to find directions which maximize the variance. Determines a Gaussian distribution uniquely (up to a shift).

Data standardization example For our example Y = AX, assuming the X means = 1 and variances = 1: the transformed mean is A·(1,1)^T and the covariance is C_Y = A C_X A^T (check it!). How to make this invariant?

Covariance matrix Variance (spread around the mean value) + correlation between features. C_X = (1/(n-1)) X X^T is d x d, where X is the d x n matrix of vectors shifted to their means. The covariance matrix is symmetric, C_ij = C_ji, and positive semi-definite. Diagonal elements are the variances (square of the std), s_i^2 = C_ii; the Pearson correlation coefficient is r_ij = C_ij/(s_i s_j). A spherical distribution of data has C = I (unit matrix); elongated ellipsoids have large off-diagonal elements, i.e. strong correlations between features.

Mahalanobis distance Linear combinations of features lead to rotations and scaling of the data. The Mahalanobis distance, D_M(x, x')^2 = (x - x')^T C_X^{-1} (x - x'), is invariant to linear transformations.
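
A quick numerical check of this invariance claim; the helper name mahalanobis2 and the 2x2 matrix A are made up for the illustration:

```python
import numpy as np

def mahalanobis2(x1, x2, C):
    """Squared Mahalanobis distance (x1 - x2)^T C^{-1} (x1 - x2)."""
    d = x1 - x2
    return d @ np.linalg.solve(C, d)

rng = np.random.default_rng(2)
X = rng.normal(size=(2, 500))               # 2 x n data, columns are samples
A = np.array([[2.0, 1.0], [0.0, 1.0]])      # some invertible linear transformation
Y = A @ X                                   # transformed data

CX, CY = np.cov(X), np.cov(Y)
x1, x2 = X[:, 0], X[:, 1]

# Same distance before and after the linear transformation (up to rounding)
print(mahalanobis2(x1, x2, CX))
print(mahalanobis2(A @ x1, A @ x2, CY))
```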

Principal components How to avoid correlated features? Correlations => the covariance matrix is non-diagonal! Solution: diagonalize it, then use the transformation that makes it diagonal to de-correlate the features: Y = Z^T X, where the columns of Z are the eigenvectors of C_X. In matrix form X, Y are d x n and Z, C_X, C_Y are d x d. C – a symmetric, positive semi-definite matrix (X^T C X >= 0 for ||X|| > 0); its eigenvectors are orthonormal and its eigenvalues are all non-negative. Z – the matrix of orthonormal eigenvectors (which exists because C is real and symmetric); it transforms X into Y with diagonal C_Y, i.e. decorrelated features.
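
A minimal sketch of this diagonalization step in numpy, using eigh for the symmetric covariance matrix (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
A = np.array([[2.0, 1.0], [0.5, 1.5]])
X = A @ rng.normal(size=(2, 1000))          # correlated 2D data, columns are samples
X = X - X.mean(axis=1, keepdims=True)       # shift vectors to their means

CX = np.cov(X)
lam, Z = np.linalg.eigh(CX)                 # eigenvalues (ascending) and orthonormal eigenvectors
Y = Z.T @ X                                 # transformed data

CY = np.cov(Y)
print(np.round(CY, 3))                      # ~diagonal with entries lam: features are decorrelated
```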

Matrix form The eigenproblem for the C matrix in matrix form: C_X Z = Z Λ, where Λ = diag(λ_1, ..., λ_d).

Principal components PCA: old idea, C. Pearson (1901), H. Hotelling (1933). Y – the principal components, the vectors X transformed using the eigenvectors of C_X. The covariance matrix of the transformed vectors is diagonal => ellipsoidal distribution of data. Result: PCs are linear combinations of all features, providing new uncorrelated features, with a diagonal covariance matrix whose entries are the eigenvalues. Small λ_i => small variance => the data change little in direction Y_i. PCA minimizes the reconstruction error of the C matrix: the Z_i vectors for large λ_i are sufficient to get C_X ≈ Σ_i λ_i Z_i Z_i^T (summing over the large eigenvalues), because vectors for small eigenvalues have a very small contribution to the covariance matrix.

Two components for visualization Diagonalization methods: see Numerical Recipes, www.nr.com. New coordinate system: axes ordered according to variance = size of the eigenvalue. The first k dimensions account for the fraction Σ_{i=1..k} λ_i / Σ_{i=1..d} λ_i of all the variance (note that the λ_i are variances); frequently 80-90% is sufficient for a rough description.
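
A sketch of the "first k dimensions account for a fraction of the variance" computation; the helper and its threshold are illustrative:

```python
import numpy as np

def components_for_variance(C, target=0.9):
    """Smallest k such that the k largest eigenvalues of C cover `target` of the total variance."""
    lam = np.linalg.eigvalsh(C)[::-1]        # eigenvalues, largest first
    frac = np.cumsum(lam) / lam.sum()        # cumulative variance fraction
    return int(np.searchsorted(frac, target)) + 1, frac

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 300)) * np.array([[3.0], [2.0], [1.0], [0.3], [0.1]])
k, frac = components_for_variance(np.cov(X), target=0.9)
print(k, np.round(frac, 3))                  # e.g. 2 components already cover ~90% here
```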

Solving for Eigenvalues & Eigenvectors Vectors x having the same direction as Ax are called eigenvectors of A (A is an n by n matrix). In the equation Ax = λx, λ is called an eigenvalue of A. Ax = λx => (A - λI)x = 0. How to calculate x and λ: calculate det(A - λI), which yields a polynomial of degree n; determine the roots of det(A - λI) = 0, the roots are the eigenvalues λ; solve (A - λI)x = 0 for each λ to obtain the eigenvectors x.
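
As a sanity check of this recipe, the characteristic polynomial of a small 2x2 example can be compared with numpy's eigensolver (the matrix A is arbitrary):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# For a 2x2 matrix, det(A - lambda*I) = lambda^2 - trace(A)*lambda + det(A)
coeffs = [1.0, -np.trace(A), np.linalg.det(A)]
roots = np.roots(coeffs)                  # eigenvalues as roots of the characteristic polynomial
lam, vecs = np.linalg.eig(A)              # the same eigenvalues, plus the eigenvectors

print(np.sort(roots), np.sort(lam))       # both give 1 and 3
print(vecs)                               # columns are eigenvectors x satisfying A x = lambda x
```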

PCA properties PC Analysis (PCA) may be achieved by: a transformation making the covariance matrix diagonal; projecting the data on a line for which the sum of squares of distances from the original points to their projections is minimal; an orthogonal transformation to new variables that have stationary variances. True covariance matrices are usually not known, they are estimated from data. This works well on single-cluster data; more complex structure may require local PCA, performed separately for each cluster. PCA is useful for: finding new, more informative, uncorrelated features; reducing dimensionality: rejecting low-variance features, reconstructing covariance matrices from low-dimensional data.

PCA Wisconsin example Wisconsin Breast Cancer data: collected at the University of Wisconsin Hospitals, USA. 699 cases, 458 (65.5%) benign (red), 241 malignant (green). 9 features, quantized 1, 2 .. 10, describing cell properties, e.g.: Clump Thickness, Uniformity of Cell Size, Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses. 2D scatterograms do not show any structure no matter which subspaces are taken!

Example cont. PCA gives useful information already in 2D. Taking the first PCA component of the standardized data: if (Y_1 > 0.41) then benign else malignant; 18 errors / 699 cases = 97.4% accuracy. The transformed vectors are not standardized, their std's are given below. Eigenvalues converge slowly, but the classes are separated well.
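
For readers who want to reproduce the flavour of this example, the sketch below uses scikit-learn; note that sklearn ships the Diagnostic Wisconsin set (569 cases, 30 features), not the 699-case, 9-feature data described on the slide, so the threshold (0.41) and the 18-error result will not carry over:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_breast_cancer(return_X_y=True)        # Diagnostic variant, not the 699-case set
Z = StandardScaler().fit_transform(X)             # standardize the features first
Y1 = PCA(n_components=1).fit_transform(Z)[:, 0]   # projection onto the first principal component

# One-threshold rule on the first component, in the spirit of "if Y1 > 0.41 then benign";
# the sign of a principal component is arbitrary, so both orientations are checked.
pred = (Y1 > 0.0).astype(int)
acc = max(np.mean(pred == y), np.mean(pred != y))
print(f"accuracy of a single-component threshold rule: {acc:.3f}")
```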

PCA disadvantages Useful for dimensionality reduction, but: the largest variance determines which components are used, and does not guarantee an interesting viewpoint for clustering data; the meaning of features is lost when linear combinations are formed. Analysis of the coefficients in Z_1 and other important eigenvectors may show which original features are given much weight. PCA may also be done in an efficient way by performing a singular value decomposition of the standardized data matrix. PCA is also called the Karhunen-Loève transformation. Many variants of PCA are described in A. Webb, Statistical Pattern Recognition, J. Wiley 2002.
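
A sketch of the SVD route mentioned above: the singular values of the standardized data matrix give the same eigenvalues as the covariance matrix (data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))   # n x d data matrix (rows = samples)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)                 # standardized data matrix

U, S, Vt = np.linalg.svd(Xs, full_matrices=False)         # economy SVD: Xs = U S Vt
components = Vt                                           # rows of Vt are the principal directions
eigvals_from_svd = S**2 / (Xs.shape[0] - 1)

eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xs, rowvar=False)))[::-1]
print(np.allclose(eigvals_from_svd, eigvals))             # True: both routes agree
```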

Exercise (will be part of Ex. 1) How would you efficiently calculate the PCA of data where the dimensionality d is much larger than the number of vector observations n?

2 skewed distributions PCA transformation for 2D data: the first component will be chosen along the largest-variance line, both clusters will strongly overlap, and no interesting structure will be visible. In fact, projection onto the axis orthogonal to the first PCA component has much more discriminating power. Discriminant coordinates should be used to reveal class structure.

Hebb Rule Linear neuron: v = w · u. Hebb rule: τ_w dw/dt = v u. Similar to LTP (but not quite…)

Hebb Rule Averaging the Hebb rule over the inputs gives the correlation rule: τ_w dw/dt = Q w, where Q = <u u^T> is the correlation matrix of u.

Hebb Rule Hebb rule with a threshold = covariance rule: τ_w dw/dt = C w, where C = <(u - <u>)(u - <u>)^T> is the covariance matrix of u. Note that <(v - <v>)(u - <u>)> would be unrealistic because it predicts LTP when both u and v are low.

Hebb Rule Main problem with the Hebb rule: it’s unstable… Two solutions: bounded weights; normalization of either the activity of the postsynaptic cells or the weights.

BCM rule Hebb rule with a sliding threshold: τ_w dw/dt = v u (v - θ_v), with the threshold θ_v itself sliding (it tracks v²). The BCM rule implements competition: when a synaptic weight grows, it raises v and therefore θ_v, making it more difficult for other weights to grow.
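
A toy simulation of this competition, assuming the common textbook form of the BCM rule (τ_w dw/dt = v u (v - θ), with θ relaxing toward v²); the two input patterns, time constants and initial weights are all made up for the illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
patterns = np.array([[1.0, 0.0],          # two input patterns, presented at random
                     [0.0, 1.0]])

w = np.array([0.50, 0.55])                # slightly asymmetric start
theta = 0.0
dt, tau_w, tau_theta = 0.1, 50.0, 5.0     # the threshold adapts faster than the weights

for _ in range(20000):
    u = patterns[rng.integers(2)]
    v = w @ u
    w += dt / tau_w * v * u * (v - theta)       # Hebbian above threshold, depressing below
    theta += dt / tau_theta * (v**2 - theta)    # sliding threshold tracks <v^2>

print(np.round(w, 2))   # typically selective: one weight grows (~2), the other decays toward 0
```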

Weight Normalization Subtractive normalization: τ_w dw/dt = v u - v (n · u) n / N_u, where n = (1, ..., 1) and N_u is the number of inputs; the subtracted term keeps the sum of the weights Σ_b w_b constant.

Weight Normalization Multiplicative normalization (Oja's rule): τ_w dw/dt = v u - α v² w. The (squared) norm of the weights converges to 1/α.

Hebb Rule Convergence properties: use an eigenvector decomposition, w(t) = Σ_m c_m(t) e_m, where the e_m are the eigenvectors of Q.

Hebb Rule (Figure: the eigenvectors e_1 and e_2 of Q, with λ_1 > λ_2.)

Hebb Rule The equations decouple because the e_m are the eigenvectors of Q: τ_w dc_m/dt = λ_m c_m.

Hebb Rule Solving the decoupled equations gives c_m(t) = c_m(0) exp(λ_m t / τ_w), so w(t) = Σ_m c_m(0) exp(λ_m t / τ_w) e_m: the component along the eigenvector with the largest eigenvalue grows fastest.

Hebb Rule The weights line up with the first eigenvector and the postsynaptic activity, v, converges toward the projection of u onto the first eigenvector (unstable PCA).
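
A quick simulation of the averaged (correlation-based) Hebb rule, illustrating exactly this: the direction of w converges to the leading eigenvector of Q while its norm keeps growing. The input statistics and step sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(7)
A = np.array([[1.0, 0.6], [0.6, 0.5]])
U = A @ rng.normal(size=(2, 2000))          # zero-mean input samples u, one per column
Q = U @ U.T / U.shape[1]                    # correlation matrix of u

w = rng.normal(size=2) * 0.01
dt, tau_w = 0.01, 1.0
for _ in range(2000):
    w += dt / tau_w * Q @ w                 # averaged Hebb rule: tau_w dw/dt = Q w

lam, e = np.linalg.eigh(Q)
e1 = e[:, -1]                               # eigenvector with the largest eigenvalue
print(np.abs((w / np.linalg.norm(w)) @ e1)) # ~1: the direction aligns with e1
print(np.linalg.norm(w))                    # but the norm keeps growing (unstable PCA)
```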

Hebb Rule Non-zero-mean distribution: correlation vs. covariance rule.

Hebb Rule First eigenvector: [1, -1]. Limiting weight growth affects the final state. (Figure: weight trajectories plotted as w_1/w_max vs. w_2/w_max.)

Hebb Rule Normalization also affects the final state. Ex: multiplicative normalization. In this case, the Hebb rule extracts the first eigenvector but keeps the norm constant (stable PCA).
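
The multiplicatively normalized rule above is usually written as Oja's rule, τ_w dw/dt = v u - α v² w; a minimal online simulation (with arbitrarily chosen inputs and constants) shows the stable-PCA behaviour:

```python
import numpy as np

rng = np.random.default_rng(8)
A = np.array([[1.0, 0.6], [0.6, 0.5]])
U = A @ rng.normal(size=(2, 20000))         # stream of zero-mean inputs

w = rng.normal(size=2) * 0.1
alpha, dt, tau_w = 1.0, 0.005, 1.0
for u in U.T:
    v = w @ u
    w += dt / tau_w * (v * u - alpha * v**2 * w)   # Hebbian growth + multiplicative decay

Q = U @ U.T / U.shape[1]
e1 = np.linalg.eigh(Q)[1][:, -1]
print(np.abs(w @ e1))                       # ~1: aligned with the first eigenvector
print(np.linalg.norm(w))                    # ~1/sqrt(alpha): the norm stays bounded (stable PCA)
```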

Hebb Rule Normalization also affects the final state. Ex: subtractive normalization.

Hebb Rule With subtractive normalization the subtracted term cancels any growth along the direction n = (1, ..., 1), i.e. the sum of the weights stays constant.

Hebb Rule The constraint does not affect the other eigenvector: the weights converge to the second eigenvector (the weights need to be bounded to guarantee stability…)

Ocular Dominance Column One unit with one input from the right and one from the left eye: v = w_R u_R + w_L u_L. The input correlation matrix is Q = [[q_s, q_d], [q_d, q_s]], with s: same eye, d: different eyes.

Ocular Dominance Column The eigenvectors are e_1 = (1, 1)/√2 with eigenvalue q_s + q_d and e_2 = (1, -1)/√2 with eigenvalue q_s - q_d.

Ocular Dominance Column Since q_d is likely to be positive, q_s + q_d > q_s - q_d. As a result, the weights will converge toward the first eigenvector, which mixes the right and left eyes equally. No ocular dominance…

Ocular Dominance Column To get ocular dominance we need subtractive normalization.

Ocular Dominance Column Note that the weights will be proportional to e_2 or -e_2 (i.e. the right and left eye are equally likely to dominate at the end). Which one wins depends on the initial conditions. Check that!

Ocular Dominance Column Ocular dominance column: a network with multiple output units and lateral connections.

Ocular Dominance Column Simplified model

Ocular Dominance Column If we use subtractive normalization and no lateral connections, we’re back to the one-cell case. Ocular dominance is determined by the initial weights, i.e., it is purely stochastic. This is not what’s observed in V1. Lateral weights could help by making sure that neighboring cells have similar ocular dominance.

Ocular Dominance Column Lateral weights are equivalent to feedforward weights: with lateral weights M the steady-state activity is v = W u + M v, i.e. v = K W u with K = (I - M)^{-1}, so the lateral connections act like an effective feedforward weight matrix K W.

Ocular Dominance Column

Ocular Dominance Column We first project the weight vectors of each cortical unit, (w_iR, w_iL), onto the eigenvectors of Q.

Ocular Dominance Column There are two eigenvectors, corresponding to the sum and difference weights w_+ ∝ w_R + w_L and w_- ∝ w_R - w_L, with eigenvalues q_s + q_d and q_s - q_d.

Ocular Dominance Column

Ocular Dominance Column Once again we use subtractive normalization, which holds w_+ constant. Consequently, the equation for w_-, τ_w dw_-/dt ∝ (q_s - q_d) K w_-, is the only one we need to worry about.

Ocular Dominance Column If the lateral weights are translation invariant, K w_- is a convolution. This is easier to solve in the Fourier domain.

Ocular Dominance Column The sine function with the highest Fourier coefficient (i.e. the fundamental) grows the fastest.

Ocular Dominance Column In other words, the eigenvectors of K are sine functions and the eigenvalues are the Fourier coefficients of K.

Ocular Dominance Column The dynamics is dominated by the sine function with the highest Fourier coefficient, i.e., the fundamental of K(x) (note that w_- is not normalized along the x dimension). This results in an alternation of right and left columns with a periodicity corresponding to the frequency of the fundamental of K(x).

Ocular Dominance Column If K is a Gaussian kernel, the fundamental is the DC term and w ends up being constant, i.e., no ocular dominance columns (one of the eyes dominates all the cells). If K is a Mexican-hat kernel, w will show ocular dominance columns with the same frequency as the fundamental of K. Not that intuitive anymore…
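
A sketch of the Fourier argument above, using a difference of Gaussians as a stand-in for the Mexican hat (kernel widths and units are arbitrary): the Gaussian kernel peaks at the DC term, while the Mexican hat peaks at a non-zero frequency that would set the column spacing.

```python
import numpy as np

x = np.arange(-32, 32)                      # cortical positions, arbitrary units

def gaussian(x, s):
    return np.exp(-x**2 / (2 * s**2))

K_gauss = gaussian(x, 4.0)                               # purely excitatory kernel
K_mexhat = gaussian(x, 3.0) - 0.5 * gaussian(x, 6.0)     # center-surround ("Mexican hat")

for name, K in [("Gaussian", K_gauss), ("Mexican hat", K_mexhat)]:
    coeffs = np.real(np.fft.fft(np.fft.ifftshift(K)))    # Fourier coefficients of the kernel
    freqs = np.fft.fftfreq(len(K))
    kmax = np.argmax(coeffs)                              # fastest-growing spatial frequency
    print(f"{name}: dominant frequency = {abs(freqs[kmax]):.3f}")
# Gaussian -> 0.000 (DC wins: no alternation); Mexican hat -> non-zero frequency,
# which sets the periodicity of the ocular dominance columns
```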

Ocular Dominance Column Simplified model

Ocular Dominance Column Simplified model: weight matrices for the right and left eyes, W^R and W^L, analysed through their sum and difference, W^R + W^L and W^R - W^L.