NATHAN HALKO, PER-GUNNAR MARTINSSON, YOEL SHKOLNISKY, AND MARK TYGERT

AN ALGORITHM FOR THE PRINCIPAL COMPONENT ANALYSIS OF LARGE DATA SETS
NATHAN HALKO, PER-GUNNAR MARTINSSON, YOEL SHKOLNISKY, AND MARK TYGERT
PRESENTER: ZHAOCHONG LIU

OUTLINE
Introduction
Algorithm
Out-of-core computation
Computational costs
Examples
Conclusion

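Introduction
The algorithm is intended for data sets that are too large to be stored in the random-access memory (RAM) of a typical computer system.
It computes the principal components through a low-rank singular value decomposition (SVD).
The goal is to minimize the total number of times that the algorithm has to access each entry of the matrix A being approximated.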

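Algorithm
1. Using a random number generator, form G: a real n × l matrix with independent and identically distributed Gaussian entries of zero mean and unit variance.
Compute the m × l matrices H(0) = A G, H(1) = A (A^T H(0)), ..., H(i) = A (A^T H(i-1)), and form the m × ((i+1)l) matrix H = ( H(0) | H(1) | ... | H(i) ).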

Algorithm
2. Using a pivoted QR-decomposition, form H = Q R, where
Q is a real m × ((i+1)l) matrix whose columns are orthonormal, and
R is a real ((i+1)l) × ((i+1)l) matrix.

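Algorithm
3. Compute the n × ((i+1)l) product matrix T = A^T Q.
4. Form an SVD of T: T = Ṽ Σ̃ W^T, where Ṽ is a real n × ((i+1)l) matrix whose columns are orthonormal, W is a real ((i+1)l) × ((i+1)l) orthogonal matrix, and Σ̃ is a real diagonal ((i+1)l) × ((i+1)l) matrix with nonnegative entries.

Algorithm
5. Compute the m × ((i+1)l) product matrix Ũ = Q W.
6. Retrieve the leftmost m × k block U of Ũ, the leftmost n × k block V of Ṽ, and the leftmost uppermost k × k block Σ of Σ̃.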
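
The six steps can be sketched in NumPy as follows. This is a minimal in-memory illustration, not the authors' implementation. It assumes that step 1 builds H from the blocks A G, A (A^T A G), and so on, that T = A^T Q and Ũ = Q W (as the stated matrix dimensions suggest), and that (i+1)l is at most min(m, n). It also swaps in NumPy's unpivoted QR for the pivoted QR-decomposition named in step 2. The function name and the demo matrix are hypothetical.

```python
import numpy as np

def randomized_pca(A, k, l, i, seed=0):
    """Return U (m x k), s (k,), V (n x k) such that A ~ U @ diag(s) @ V.T."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Step 1 (assumed form): Gaussian G, then the blocks A G, A A^T A G, ...
    G = rng.standard_normal((n, l))
    blocks = [A @ G]
    for _ in range(i):
        blocks.append(A @ (A.T @ blocks[-1]))
    H = np.hstack(blocks)                      # m x ((i+1) l)
    # Step 2: H = Q R with orthonormal columns in Q.
    # Assumption: unpivoted QR stands in for the pivoted QR of the slides.
    Q, _ = np.linalg.qr(H)
    # Step 3: T = A^T Q, an n x ((i+1) l) matrix.
    T = A.T @ Q
    # Step 4: thin SVD, T = V_t @ diag(s_t) @ W_T.
    V_t, s_t, W_T = np.linalg.svd(T, full_matrices=False)
    # Step 5: U_t = Q W.
    U_t = Q @ W_T.T
    # Step 6: keep the leading k columns and singular values.
    return U_t[:, :k], s_t[:k], V_t[:, :k]

# Hypothetical demo: a 60 x 40 matrix of exact rank 5.
demo_rng = np.random.default_rng(1)
A_demo = demo_rng.standard_normal((60, 5)) @ demo_rng.standard_normal((5, 40))
U, s, V = randomized_pca(A_demo, k=5, l=8, i=1)
err = np.linalg.norm(A_demo - U @ np.diag(s) @ V.T, 2)
```

Because the demo matrix has exact rank 5 and k = 5, the spectral-norm error `err` should be at the level of rounding error relative to the norm of `A_demo`.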

Out-of-core Computation
Computation with on-the-fly evaluation of matrix entries
The matrix does not fit in memory.
We have access to a computational routine that can evaluate each entry of the matrix individually.
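
A small sketch of this mode, assuming the routine is a Python callable `entry(r, c)` (a hypothetical interface): the product A G is accumulated one row at a time, so only a single row of A is ever held in memory. The Hilbert-like entry routine is only an illustration.

```python
import numpy as np

def product_on_the_fly(entry, m, n, G):
    """Compute A @ G where A[r, c] = entry(r, c), one row of A at a time."""
    H = np.empty((m, G.shape[1]))
    for r in range(m):
        # Evaluate row r of A on demand; it is discarded after use.
        row = np.array([entry(r, c) for c in range(n)], dtype=float)
        H[r] = row @ G
    return H

def hilbert_entry(r, c):
    """Hypothetical entry routine: a Hilbert-like matrix."""
    return 1.0 / (r + c + 1)

m_demo, n_demo = 6, 4
G_demo = np.random.default_rng(0).standard_normal((n_demo, 3))
H_demo = product_on_the_fly(hilbert_entry, m_demo, n_demo, G_demo)
```

The same loop structure applies to products with A^T, accumulating contributions row by row instead of writing one output row per pass.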

Out-of-core Computation
Computation with storage on disk
Retrieve as many rows of the matrix from disk as will fit in memory.
Calculate the products of the retrieved rows of the matrix and the columns of G.
Then repeat with the remaining rows.
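
The block-of-rows loop might look like the sketch below. It assumes the matrix is stored on disk as raw float64 values in row-major order and uses `numpy.memmap` to read a few rows at a time. The file layout, the function name, and the `rows_per_block` parameter are assumptions made for illustration.

```python
import os
import tempfile
import numpy as np

def product_from_disk(path, shape, G, rows_per_block):
    """Compute A @ G for a row-major float64 matrix stored in the file `path`."""
    m, n = shape
    A = np.memmap(path, dtype=np.float64, mode="r", shape=(m, n))
    H = np.empty((m, G.shape[1]))
    for start in range(0, m, rows_per_block):
        stop = min(start + rows_per_block, m)
        block = np.array(A[start:stop])    # copy only these rows into memory
        H[start:stop] = block @ G
    del A                                  # release the memory map
    return H

# Hypothetical demo with a small matrix written to a temporary file.
disk_rng = np.random.default_rng(0)
A_small = disk_rng.standard_normal((7, 4))
G_small = disk_rng.standard_normal((4, 3))
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "A.bin")
    A_small.tofile(demo_path)
    H_small = product_from_disk(demo_path, A_small.shape, G_small, rows_per_block=3)
```

In practice `rows_per_block` would be chosen so that one block of rows, together with G and the output, fits in the available RAM.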

Computational costs
Costs with on-the-fly evaluation of matrix entries

Computational cost
The slide lists the cost of each of steps one through six, followed by their sum (the formulas are not preserved in this transcript).
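
As a rough guide, and not a reconstruction of the slide's formulas, the standard dense-matrix flop estimates for the six steps are given below. They write p = (i+1)l and assume that p is at most min(m, n), that A is treated as a dense array, and that step one applies A or A^T a total of 2i+1 times to blocks of l columns. The cost of evaluating the entries of A, or of reading them from disk, comes on top of these counts.

```latex
\begin{align*}
\text{Step 1 (products with } A \text{ and } A^{T}\text{)} &: O\big((i+1)\,l\,m\,n\big) \\
\text{Step 2 (QR of the } m \times p \text{ matrix } H\text{)} &: O(m\,p^{2}) \\
\text{Step 3 (product } A^{T} Q\text{)} &: O(m\,n\,p) \\
\text{Step 4 (SVD of the } n \times p \text{ matrix } T\text{)} &: O(n\,p^{2}) \\
\text{Step 5 (product } Q\,W\text{)} &: O(m\,p^{2}) \\
\text{Step 6 (extracting blocks)} &: O\big(k\,(m+n)\big) \text{ copies, no arithmetic} \\
\text{Total} &: O\big(m\,n\,p + (m+n)\,p^{2}\big), \qquad p = (i+1)\,l
\end{align*}
```

Under these assumptions the products with A and A^T in steps one and three dominate, which is consistent with the stated aim of limiting how often each entry of A is accessed.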

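Computational cost
Cost with storage on disk
Step one: the slide bounds the number of flops needed for each equation and gives the total (the formulas are not preserved in this transcript).
Steps two, four, five, and six do not use the matrix A, so their costs are the same as in the on-the-fly case.
Step three and the overall sum are also given on the slide (the formulas are not preserved in this transcript).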

Numerical Example
First example:
Apply the algorithm to the m × n matrix A = E S F.
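
A test matrix of the form A = E S F can be built with known singular values, which makes the accuracy of an approximation easy to check. The sketch below assumes E and F are orthogonal and S is zero off its main diagonal. It draws E and F from QR factorizations of Gaussian matrices, and the sizes and singular values are hypothetical, not those of the presentation.

```python
import numpy as np

def matrix_with_singular_values(m, n, sigma, seed=0):
    """Build the m x n matrix A = E S F whose nonzero singular values are sigma."""
    rng = np.random.default_rng(seed)
    # Assumption: random orthogonal factors, obtained from QR of Gaussian matrices.
    E, _ = np.linalg.qr(rng.standard_normal((m, m)))
    F, _ = np.linalg.qr(rng.standard_normal((n, n)))
    S = np.zeros((m, n))
    idx = np.arange(len(sigma))
    S[idx, idx] = sigma                    # sigma on the main diagonal
    return E @ S @ F

# Hypothetical choice: six singular values decaying by powers of ten.
sigma_demo = 10.0 ** -np.arange(6)
A_test = matrix_with_singular_values(30, 20, sigma_demo)
computed = np.linalg.svd(A_test, compute_uv=False)
```

Here `computed[:6]` should agree with `sigma_demo` up to rounding, and the remaining singular values should be numerically zero.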


Numerical Example
Second example:
Same construction as in the first example, but with different singular values.


Numerical Example
Third example:
Apply the algorithm with k = 3 to an m × 1000 matrix whose rows are independent and identically distributed realizations of the random vector


Measured Data
Apply the algorithm to the PCA of images of faces, with k = 50.
The data form a 393,216 × 102,042 matrix whose columns consist of images.


Last Example
This data set consists of 10,000 two-dimensional images.
Each image is 129 pixels wide and 129 pixels high; k = 250.


Conclusion
This paper describes techniques for the principal component analysis of data sets that are too large to be stored in random-access memory, and illustrates the performance of the methods on data from various sources, including standard test sets and numerical simulations.
The core steps of the procedures parallelize easily, which suits them to widespread multicore and distributed processing.

Pros and Cons
Pros: the algorithm is described very clearly; the numerical examples are comprehensive.
Cons: the evaluation of the examples includes no comparisons.

Thank you!
