2.4 Nonnegative Matrix Factorization
NMF casts matrix factorization as a constrained optimization problem that seeks to factor the original matrix into the product of two nonnegative matrices. Motivation: the nonnegative factors are easier to interpret and have been found to give better results in information retrieval and clustering.
2.4 Nonnegative Matrix Factorization
Definition: given a nonnegative matrix X and a rank k, find nonnegative matrices W and H that minimize ||X - WH||^2, so that X is approximated by the product WH with W >= 0 and H >= 0.
Solution: there are three general classes of algorithms for constructing a nonnegative matrix factorization: multiplicative update, alternating least squares, and gradient descent.
Procedure - Multiplicative Update Algorithm
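The update formulas themselves appeared as a figure in the original slides. As a reference, here is a minimal MATLAB sketch of the standard Lee-Seung multiplicative updates; the rank k, iteration count, and random initialization are illustrative assumptions, not the book's code.

```matlab
% Minimal sketch of multiplicative-update NMF (Lee-Seung style updates); the
% initialization and stopping rule here are illustrative choices.
function [W, H] = nmf_mu(X, k, maxiter)
    [n, p] = size(X);
    W = rand(n, k);                                % random nonnegative start
    H = rand(k, p);
    for it = 1:maxiter
        H = H .* (W' * X) ./ (W' * W * H + eps);   % update H; eps guards against 0/0
        W = W .* (X * H') ./ (W * (H * H') + eps); % update W
    end
end
```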
Weaknesses: the multiplicative update algorithm tends to be more sensitive to initialization, and it has also been shown to be slow to converge.
Procedure - Alternating Least Squares
In each iteration we have a least squares step, where we solve for one of the factor matrices, followed by another least squares step to solve for the other one. In between, we ensure nonnegativity by setting any negative elements to zero, as in the sketch below.
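A minimal MATLAB sketch of these two alternating steps, with an assumed rank and iteration count rather than the book's implementation:

```matlab
% Sketch of alternating least squares NMF: solve for H given W, clip negatives to
% zero, then solve for W given H and clip again.
function [W, H] = nmf_als(X, k, maxiter)
    [n, ~] = size(X);
    W = rand(n, k);                      % random nonnegative start for W
    for it = 1:maxiter
        H = max(0, W \ X);               % least squares for H, then enforce H >= 0
        W = max(0, (H' \ X')');          % least squares for W, then enforce W >= 0
    end
end
```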
Example 2.4
We factor the term-document matrix termdoc into a nonnegative product of two matrices W and H, where W is 6 x 3 and H is 3 x 5, using the multiplicative update option of the NMF function.
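A sketch of this call, assuming the 6 x 5 term-document matrix is already in a workspace variable termdoc and that the Statistics Toolbox function nnmf is used (the exact function in the book's code is not reproduced here):

```matlab
% Rank-3 factorization with the multiplicative update algorithm; termdoc is assumed
% to be a 6-by-5 nonnegative term-document matrix in the workspace.
[W, H] = nnmf(termdoc, 3, 'algorithm', 'mult');   % W is 6-by-3, H is 3-by-5
```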
2.5 Factor Analysis
The factor analysis model writes each observed variable as a linear combination of d common factors plus an error term: x_i = λ_i1 f_1 + … + λ_id f_d + ε_i, or x = Λf + ε in matrix form. The λ_ij in this model are called the factor loadings, and the error terms ε_i are called the specific factors. The sum of the squared loadings for x_i, λ_i1^2 + … + λ_id^2, is the communality of x_i.
2.5 Factor Analysis
Assumptions: E[ε] = 0, E[f] = 0, and E[x] = 0; the error terms ε_i are uncorrelated with each other; and the common factors f_j are uncorrelated with the specific factors. Under these assumptions the sample covariance (or correlation) matrix is of the form Σ = ΛΛ^T + Ψ, where Ψ is a diagonal matrix representing E[εε^T]. The variance of ε_i is called the specificity of x_i, so the matrix Ψ is also called the specificity matrix.
2.5 Factor Analysis
Both Λ and f are unknown and must be estimated, and the estimates are not unique. Once an initial estimate is obtained, other solutions can be found by rotating Λ. The goal of some rotations is to make the structure of Λ more interpretable by driving the λ_ij close to one or zero. Factor rotation methods can be either orthogonal or oblique: the orthogonal rotation methods include quartimax, varimax, orthomax, and equimax, while the promax and procrustes rotations are oblique.
2.5 Factor Analysis
We often want to transform the observations using the estimated factor analysis model, either for plotting purposes or for further analysis such as clustering or classification. We can think of the observations as being transformed to the factor space; the transformed values are called factor scores, similarly to PCA. The factor scores are really estimates and depend on the method that is used. The MATLAB Statistics Toolbox uses the maximum likelihood method to obtain the factor loadings and implements the various rotation methods mentioned earlier.
Example 2.5
Dataset: stockreturns consists of 100 observations representing the percent change in stock prices for 10 companies. It turns out that the first four companies can be classified as technology, the next three as financial, and the last three as retail. We use factor analysis to see if there is any structure in the data that supports this grouping.
Example 2.5
We estimate the factor loadings and plot the matrix Lam (the factor loadings) in Figure 2.4.
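A sketch of the computation, assuming the stockreturns data set shipped with the Statistics Toolbox (which loads a matrix named stocks) and a three-factor model; the promax rotation and the simple loading plot below are assumptions rather than the book's exact calls:

```matlab
% Fit a three-factor model to the stock return data and keep the loadings in Lam.
load stockreturns                                      % provides the 100-by-10 matrix 'stocks'
[Lam, Psi] = factoran(stocks, 3, 'rotate', 'promax');  % oblique promax rotation
plot(Lam(:,1), Lam(:,2), 'o')                          % loadings on the first two factors
xlabel('Factor 1'), ylabel('Factor 2')
```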
2.6 Fisher’s Linear Discriminant
It is known as Fisher’s linear discriminant or mapping (FLD) [Duda and Hart, 1973] and is one of the tools used for pattern recognition and supervised learning. The goal of LDA is to reduce the dimensionality to 1-D (a linear projection) so that the projected observations are well separated.
2.6 Fisher’s Linear Discriminant
One approach to building a classifier with high-dimensional data is to project the observations onto a line in such a way that the observations are well separated in this lower-dimensional space. The linear separability (and thus the classification) of the observations is greatly affected by the position or orientation of this line.
2.6 Fisher’s Linear Discriminant
In LDA we seek a linear mapping that maximizes the linear class separability in the new representation of the data. Definitions: we consider a set of n p-dimensional observations x_1, …, x_n, with n_1 samples labeled as belonging to class 1 (λ_1) and n_2 samples labeled as belonging to class 2 (λ_2). We denote the set of observations in the i-th class by Λ_i.
2.6 Fisher’s Linear Discriminant
The p-dimensional sample mean for class i is m_i = (1/n_i) Σ_{x in Λ_i} x, and the scatter of the projected observations serves as our measure of the standard deviations.
2.6 Fisher’s Linear Discriminant
We use Equation 2.14 to measure the separation of the projected means for the two classes, |w^T m_1 − w^T m_2|, and we use the scatter s_i^2 = Σ_{x in Λ_i} (w^T x − w^T m_i)^2 as our measure of the standard deviations. The LDA is defined as the vector w that maximizes the function J(w) = |w^T m_1 − w^T m_2|^2 / (s_1^2 + s_2^2) (Equation 2.16).
2.6 Fisher’s Linear Discriminant
The solution to the maximization of Equation 2.16 is w = S_W^{-1}(m_1 − m_2), where S_W is the within-class scatter matrix.
2.6 Fisher’s Linear Discriminant
Example 2.6: we generate some observations that are multivariate normal using the mvnrnd function and plot them as points in the top panel of Figure 2.7.
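A sketch of the kind of data and projection used in this example; the means, covariance, and sample sizes below are illustrative assumptions, not the values from the book.

```matlab
% Two multivariate normal classes and their Fisher projection (illustrative values).
n  = 100;
X1 = mvnrnd([2 2],  eye(2), n);      % class 1 (lambda_1)
X2 = mvnrnd([-2 -2], eye(2), n);     % class 2 (lambda_2)
m1 = mean(X1)';  m2 = mean(X2)';
SW = (n-1)*cov(X1) + (n-1)*cov(X2);  % within-class scatter matrix
w  = SW \ (m1 - m2);                 % Fisher's linear discriminant direction
p1 = X1*w;  p2 = X2*w;               % projected 1-D observations for each class
```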
2.7 Intrinsic Dimensionality
The intrinsic dimensionality is defined as the smallest number of dimensions or variables needed to model the data without loss. Approaches: we describe several local estimators — nearest neighbor, correlation dimension, and maximum likelihood — followed by a global method based on packing numbers.
2.7.1 Nearest Neighbor Approach
Definitions: let r_{k,x} represent the distance from x to the k-th nearest neighbor of x. The average k-th nearest neighbor distance is r̄_k = (1/n) Σ_{i=1}^{n} r_{k,x_i}, and its expected value is approximately C_n k^{1/d}, where C_n is independent of k.
2.7.1 Nearest Neighbor Approach
Taking logarithms, we then obtain log r̄_k ≈ (1/d) log k + log C_n, so the intrinsic dimensionality d can be estimated from the slope of a least squares fit of log r̄_k against log k.
2.7.1 Nearest Neighbor Approach
Pettis et al. [1979] found that their algorithm works better if potential outliers are removed before estimating the intrinsic dimensionality, where outliers are identified as observations with unusually large nearest neighbor distances.
Procedure - Intrinsic Dimensionality
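The full procedure appeared as a figure in the original slides. Below is a minimal sketch of the core regression step only (average k-th nearest neighbor distances followed by a log-log least squares fit), leaving out the outlier removal and iterative refinement of Pettis et al.

```matlab
% Nearest neighbor estimate of intrinsic dimensionality: slope of log rbar_k vs log k.
function dhat = idnn(X, kmax)
    D    = sort(squareform(pdist(X)), 2);      % sorted distances from each point
    rbar = mean(D(:, 2:kmax+1), 1);            % average k-th NN distance, k = 1..kmax
    P    = polyfit(log(1:kmax), log(rbar), 1); % least squares fit; slope is about 1/d
    dhat = 1 / P(1);
end
```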
Example 2.7
We first generate some data to illustrate the functionality of the algorithm: points are randomly chosen along a helix. For this data set, the dimensionality is 3, but the intrinsic dimensionality is 1. The resulting estimate of the intrinsic dimensionality is 1.14.
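The helix equations from the example are not reproduced here; the sketch below uses a hypothetical parametrization of a 3-D helix sampled at random positions and feeds it to the nearest neighbor estimator sketched above.

```matlab
% Hypothetical helix (not the exact equations of Example 2.7): 3-D data on a 1-D manifold.
theta = unifrnd(0, 4*pi, 1000, 1);            % random positions along the curve
X     = [cos(theta), sin(theta), 0.1*theta];  % helix coordinates
dhat  = idnn(X, 10);                          % estimate should be close to 1
```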
2.7.2 Correlation Dimension
The correlation dimension estimator is based on the assumption that the number of observations in a hypersphere with radius r is proportional to r^d. Given a set of observations S_n = {x_1, …, x_n}, the correlation sum C(r) is the proportion of pairs of observations that lie within distance r of each other, C(r) = 2/(n(n−1)) Σ_{i<j} I(‖x_i − x_j‖ < r), which is also proportional to r^d, so we can use it to estimate the intrinsic dimensionality d.
2.7.2 Correlation Dimension
The correlation dimension is given by d = lim_{r→0} log C(r) / log r (Equation 2.24). Since we have a finite sample, arriving at the limit of zero in Equation 2.24 is not possible. The intrinsic dimensionality can instead be estimated by calculating C(r) for two values of r and finding the ratio d̂ = [log C(r_2) − log C(r_1)] / [log r_2 − log r_1]. We use the data set from Section 2.7.4 to illustrate the use of the correlation dimension.
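A minimal sketch of this two-radius estimate, with r_1 and r_2 supplied by the user:

```matlab
% Correlation dimension estimate from the correlation sums at two radii r1 < r2.
function dhat = corrdim(X, r1, r2)
    dists = pdist(X);                           % all pairwise distances
    C1    = mean(dists < r1);                   % proportion of pairs within r1
    C2    = mean(dists < r2);                   % proportion of pairs within r2
    dhat  = (log(C2) - log(C1)) / (log(r2) - log(r1));
end
```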
2.7.3 Maximum Likelihood Approach
We have observations x_1, …, x_n residing in a p-dimensional space R^p, with x_i = g(y_i), where the y_i are sampled from an unknown smooth density f with support on R^d. We assume that f(x) is approximately constant within a small hypersphere S_x(r) of radius r around x.
2.7.3 Maximum Likelihood Approach
The observations falling within distance t of x are treated as a Poisson process N_x(t), whose rate λ(t) at dimensionality d is λ(t) = f(x) V(d) d t^(d−1), where V(d) is the volume of the unit hypersphere in R^d. Levina and Bickel provide a maximum likelihood estimator of the intrinsic dimensionality: for a given r and 0 < t < r, λ_x(t) = N_x(t)/N_x(r) is viewed as a distribution function in which the parameter d is unknown, and d is estimated by maximum likelihood.
2.7.3 Maximum Likelihood Approach
A more convenient way to arrive at the estimate is to fix the number of neighbors k instead of the radius r. Levina and Bickel recommend obtaining an estimate of the intrinsic dimensionality at each observation x_i; the final estimate is obtained by averaging these results.
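A sketch of this fixed-k estimator averaged over the observations; the 1/(k−1) normalization follows Levina and Bickel's original formulation and is an assumption about the exact variant used in the book.

```matlab
% Levina-Bickel style maximum likelihood estimate with a fixed number of neighbors k.
function dhat = idmle(X, k)
    D    = sort(squareform(pdist(X)), 2);     % sorted neighbor distances per point
    T    = D(:, 2:k+1);                       % T(:,j) = distance to j-th nearest neighbor
    dloc = (k-1) ./ sum(log(T(:,k) ./ T(:,1:k-1)), 2);  % local estimate at each x_i
    dhat = mean(dloc);                        % average the local estimates
end
```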
2.7.4 Estimation Using Packing Numbers
The r-covering number N(r) of a set S = {x_1, …, x_n} is the smallest number of hyperspheres with radius r needed to cover all observations x_i in the data set S. For a d-dimensional data set, N(r) is proportional to r^(−d), and the exponent d is called the capacity dimension.
2.7.4 Estimation Using Packing Numbers
Because it is impossible in practice to find the r-covering number, Kegl uses the r-packing number M(r) instead, defined as the maximum number of observations x_i in the data set S whose pairwise distances all exceed r. The covering and packing numbers sandwich each other: N(r) ≤ M(r) ≤ N(r/2).
2.7.4 Estimation Using Packing Numbers
The packing estimate of the intrinsic dimensionality is found from the packing numbers at two radii r_1 < r_2: d̂_pack = −[log M(r_2) − log M(r_1)] / [log r_2 − log r_1].
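A sketch of this estimate; the packing number M(r) is approximated greedily here, so the result depends on the order in which the observations are visited, and the radii r_1, r_2 are user-supplied.

```matlab
% Packing-number estimate of intrinsic dimensionality from two radii r1 < r2.
function dhat = idpack(X, r1, r2)
    dhat = -(log(packnum(X, r2)) - log(packnum(X, r1))) / (log(r2) - log(r1));
end

function M = packnum(X, r)
    % Greedily build an r-separated subset of the observations and count its size.
    centers = X(1, :);
    for i = 2:size(X, 1)
        if all(sqrt(sum((centers - X(i,:)).^2, 2)) > r)
            centers = [centers; X(i, :)];   %#ok<AGROW>
        end
    end
    M = size(centers, 1);
end
```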
Example 2.8 use function from Dimensionality Reduction Toolbox called generate_data that will allow us to generate data based on a helix The data are plotted in Figure 2.9, where we see that the data fall on a 1-D manifold along a helix the packing number estimator providing the best answer with this artificial data set.