Published by Ἑνώχ Βασιλόπουλος. Modified over 6 years ago.
1
AN ALGORITHM FOR THE PRINCIPAL COMPONENT ANALYSIS OF LARGE DATA SETS
NATHAN HALKO, PER-GUNNAR MARTINSSON, YOEL SHKOLNISKY, AND MARK TYGERT
PRESENTER: ZHAOCHONG LIU
2
OUTLINE
Introduction
Algorithm
Out-of-core computation
Computational costs
Examples
Conclusion
3
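Introduction
The algorithm is intended for data sets that are too large to be stored in the random-access memory (RAM) of a typical computer system.
The principal component analysis is computed as a low-rank approximation in the form of a singular value decomposition (SVD).
The goal is to minimize the total number of times that the algorithm has to access each entry of the matrix A being approximated.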
4
Algorithm
1. Form G, a real n × l matrix with independent and identically distributed entries.
6
Algorithm
2. Using a pivoted QR decomposition, form H = Q R
Q: a real m × ((i+1)l) matrix whose columns are orthonormal
R: a real ((i+1)l) × ((i+1)l) matrix
7
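Algorithm
3. Compute the n × ((i+1)l) product matrix T = Aᵀ Q
4. Form an SVD of T: T = Ṽ Σ̃ Wᵀ
Ṽ: a real n × ((i+1)l) matrix whose columns are orthonormal
Σ̃: a real diagonal ((i+1)l) × ((i+1)l) matrix with nonnegative entries
W: a real ((i+1)l) × ((i+1)l) matrix whose columns are orthonormal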
8
Algorithm
5. Compute the m × ((i+1)l) product matrix Ũ = Q W
6. Retrieve the leftmost m × k block U of Ũ, the leftmost n × k block V of Ṽ, and the leftmost uppermost k × k block Σ of Σ̃, where T = Ṽ Σ̃ Wᵀ is the SVD from step 4
Then A ≈ U Σ Vᵀ.
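The six steps can be sketched in NumPy as follows. This is a minimal in-memory sketch under stated assumptions, not the presenters' implementation: the entries of G are taken to be standard Gaussian, H is assumed to be built from the blocks H(0) = A G and H(j) = A (Aᵀ H(j−1)) for j = 1, ..., i, and NumPy's unpivoted QR is swapped in for the pivoted QR named in step 2 (only Q is used afterwards). The function name `randomized_pca` and the `seed` parameter are hypothetical.

```python
import numpy as np


def randomized_pca(A, k, l, i, seed=0):
    """Return U (m x k), s (k,), V (n x k) with A ~ U @ diag(s) @ V.T.

    Assumes (i + 1) * l <= min(m, n) and l >= k.
    """
    m, n = A.shape
    rng = np.random.default_rng(seed)

    # Step 1 (assumed form): Gaussian G and blocks H(0) = A G, H(j) = A (A^T H(j-1)).
    G = rng.standard_normal((n, l))
    blocks = [A @ G]
    for _ in range(i):
        blocks.append(A @ (A.T @ blocks[-1]))
    H = np.hstack(blocks)  # m x (i+1)l

    # Step 2: orthonormal basis Q for the range of H.
    # Unpivoted QR is used here in place of the pivoted QR on the slide.
    Q, _ = np.linalg.qr(H)  # Q is m x (i+1)l

    # Step 3: T = A^T Q, an n x (i+1)l matrix.
    T = A.T @ Q

    # Step 4: SVD of T, T = V_tilde @ diag(s_tilde) @ W^T.
    V_tilde, s_tilde, Wt = np.linalg.svd(T, full_matrices=False)

    # Step 5: U_tilde = Q W.
    U_tilde = Q @ Wt.T

    # Step 6: keep the leading k columns and singular values.
    return U_tilde[:, :k], s_tilde[:k], V_tilde[:, :k]
```

A call such as `U, s, V = randomized_pca(A, k=5, l=7, i=2)` returns factors for which `(U * s) @ V.T` approximates A. Steps 1 and 3 are the only ones that touch A, which is why they are the ones reorganized for out-of-core use.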
9
Out-of-core Computation
Computation with on-the-fly evaluation of matrix entries
The matrix does not fit in memory.
We have access to a computational routine that can evaluate each entry of the matrix individually.
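One way to picture the on-the-fly setting: the products with A and with Aᵀ are accumulated one row of A at a time, so A is never stored. This is a minimal sketch, assuming a hypothetical callable `entry(r, c)` that returns the entry of A in row r and column c; the function names are illustrative only.

```python
import numpy as np


def product_on_the_fly(entry, m, n, G):
    """Compute A @ G while holding only one row of A in memory."""
    H = np.empty((m, G.shape[1]))
    for r in range(m):
        # Evaluate row r of A on demand, use it, then discard it.
        row = np.array([entry(r, c) for c in range(n)], dtype=float)
        H[r] = row @ G
    return H


def transpose_product_on_the_fly(entry, m, n, Q):
    """Compute A.T @ Q as a sum of outer products of rows of A and rows of Q."""
    T = np.zeros((n, Q.shape[1]))
    for r in range(m):
        row = np.array([entry(r, c) for c in range(n)], dtype=float)
        T += np.outer(row, Q[r])
    return T
```

Each call evaluates every entry of A exactly once, so the number of calls made by the algorithm determines how often each entry is accessed.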
10
Out-of-core Computation
Computation with storage on disk
Retrieve as many rows of the matrix from disk as will fit in memory.
Calculate the products of the retrieved rows of the matrix and the columns of G.
Repeat until all rows have been processed.
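The retrieve, multiply, repeat loop can be sketched with a memory-mapped file. This is a minimal sketch, assuming the matrix was saved to disk with `np.save` as a `.npy` file; the function name `blocked_product` and the parameter `rows_per_block` are hypothetical, and the storage format used by the authors is not stated on the slide.

```python
import numpy as np


def blocked_product(path, G, rows_per_block):
    """Compute A @ G for a matrix A stored on disk, one block of rows at a time."""
    # Memory-map the file: rows are read from disk only when sliced.
    A_disk = np.load(path, mmap_mode="r")
    m = A_disk.shape[0]
    H = np.empty((m, G.shape[1]))
    for start in range(0, m, rows_per_block):
        stop = min(start + rows_per_block, m)
        # Copy the current block of rows from disk into RAM.
        block = np.array(A_disk[start:stop])
        # Products of the retrieved rows with the columns of G.
        H[start:stop] = block @ G
    return H
```

Choosing `rows_per_block` as large as memory allows keeps the number of disk reads small; each row of A is read once per product.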
11
Computational costs
Costs with on-the-fly evaluation of matrix entries
12
Computational cost
The costs of steps one through six are evaluated in turn and then summed to give the total cost.
13
Computational cost
Cost with storage on disk
The cost of step one is obtained by bounding the flops needed for each of its equations and adding up the bounds.
Steps two, four, five, and six do not use the matrix A, so their costs are the same as in the on-the-fly case.
Step three uses A, so its cost is evaluated separately.
Summing the costs of the steps gives the total.
14
Numerical Example
First example:
Apply the algorithm to the m × n matrix A = E S F
16
Numerical Example
Second example:
Same setup as the first example, but with different singular values.
18
Numerical Example
Third example:
Apply the algorithm with k = 3 to an m × 1000 matrix whose rows are independent and identically distributed realizations of the random vector
21
Measured Data
Apply the algorithm to the PCA of images of faces, with k = 50.
The data form a 393,216 × 102,042 matrix whose columns consist of images.
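A quick storage estimate shows why a matrix of this size calls for out-of-core computation. This is a minimal sketch, assuming dense storage at 8 bytes per entry (double precision) or 4 bytes per entry (single precision); the function name is hypothetical, and the precision actually used is not stated on the slide.

```python
def matrix_storage_bytes(m, n, bytes_per_entry):
    """Bytes needed to store a dense m x n matrix."""
    return m * n * bytes_per_entry


m, n = 393_216, 102_042
entries = m * n                                 # 40,124,547,072 entries
double_gb = matrix_storage_bytes(m, n, 8) / 1e9  # about 321 GB
single_gb = matrix_storage_bytes(m, n, 4) / 1e9  # about 160 GB
print(f"{entries:,} entries, {double_gb:.0f} GB double, {single_gb:.0f} GB single")
```

Either figure is well beyond the RAM of a typical computer system, which matches the motivation given in the introduction.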
24
Last Example
This data set consists of 10,000 two-dimensional images.
Each image is 129 pixels wide and 129 pixels high.
k = 250
28
Conclusion
This paper describes techniques for the principal component analysis of data sets that are too large to be stored in random-access memory, and illustrates the performance of the methods on data from various sources, including standard test sets and numerical simulations.
The core steps of the procedures parallelize easily, which suits the widespread availability of multicore and distributed processing.
29
Pros and Cons
Pros: The algorithm is described very clearly. The numerical examples are comprehensive.
Cons: There is no comparison in the evaluation of the examples.
30
Thank you!