NATHAN HALKO, PER-GUNNAR MARTINSSON, YOEL SHKOLNISKY, AND MARK TYGERT


1 AN ALGORITHM FOR THE PRINCIPAL COMPONENT ANALYSIS OF LARGE DATA SETS
NATHAN HALKO, PER-GUNNAR MARTINSSON, YOEL SHKOLNISKY, AND MARK TYGERT
PRESENTER: ZHAOCHONG LIU

2 OUTLINE
Introduction
Algorithm
Out-of-core computation
Computational costs
Examples
Conclusion

3 Introduction
The algorithm is intended for data sets that are too large to be stored in the random-access memory (RAM) of a typical computer system.
The principal component analysis is computed via a singular value decomposition (SVD).
The goal is to minimize the total number of times that the algorithm has to access each entry of the matrix A being approximated.

4 Algorithm 1. Form G, a real n × l matrix with independent and identically distributed entries.

5 Algorithm

6 Algorithm 2. Using a pivoted QR decomposition, form H = Q R, where
Q is a real m × ((i+1)l) matrix whose columns are orthonormal, and
R is a real ((i+1)l) × ((i+1)l) matrix.

7 Algorithm 3. Compute the n × ((i+1)l) product matrix T = A^T Q
4. Form an SVD of T: T = Ṽ Σ̃ W^T

8 Algorithm 5. Compute the m × ((i+1)l) product matrix Ũ = Q W
6. Retrieve the leftmost m × k block U of Ũ, the leftmost n × k block V of Ṽ, and the leftmost uppermost k × k block Σ of Σ̃
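
The six steps can be sketched in NumPy as follows. This is a minimal in-memory illustration under stated assumptions, not the authors' implementation: the Gaussian entries of G and the construction of H from the blocks A G, A A^T A G, and so on are assumptions (the slide that defined H is not preserved here), NumPy's unpivoted `np.linalg.qr` stands in for the pivoted QR decomposition, and the name `randomized_pca` and its parameters are hypothetical.

```python
import numpy as np

def randomized_pca(A, k, l, i=1, seed=0):
    """Return U, s, V with A approximately equal to U @ np.diag(s) @ V.T.

    Assumes (i + 1) * l <= min(A.shape) and k <= (i + 1) * l.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape

    # Step 1: random n x l matrix G (Gaussian entries are an assumption),
    # then H = (A G | A A^T A G | ...), an m x ((i+1) l) matrix.
    G = rng.standard_normal((n, l))
    blocks = [A @ G]
    for _ in range(i):
        blocks.append(A @ (A.T @ blocks[-1]))
    H = np.hstack(blocks)

    # Step 2: orthonormal basis Q for the columns of H (unpivoted QR here).
    Q, _ = np.linalg.qr(H)

    # Step 3: T = A^T Q, an n x ((i+1) l) matrix.
    T = A.T @ Q

    # Step 4: SVD of T, T = V_tilde @ diag(s_tilde) @ W^T.
    V_tilde, s_tilde, Wt = np.linalg.svd(T, full_matrices=False)

    # Step 5: U_tilde = Q W.
    U_tilde = Q @ Wt.T

    # Step 6: keep the leading k columns and singular values.
    return U_tilde[:, :k], s_tilde[:k], V_tilde[:, :k]
```

The approximation follows from A ≈ Q Q^T A = Q T^T, so substituting the SVD of T gives A ≈ (Q W) Σ̃ Ṽ^T.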

9 Out-of-core Computation
Computation with on-the-fly evaluation of matrix entries: the matrix does not fit in memory, but we have access to a computational routine that can evaluate each entry of the matrix individually.
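
A minimal sketch of this idea, assuming the entry routine can be called with a row index and an array of column indices. The routine `entry` below is a made-up example, and `product_on_the_fly` is a hypothetical helper name; neither comes from the slides.

```python
import numpy as np

def entry(j, k):
    """Hypothetical entry routine: A[j, k] = 1 / (1 + j + k)."""
    return 1.0 / (1.0 + j + k)

def product_on_the_fly(entry, m, n, G):
    """Compute H = A @ G while holding only one row of A in memory."""
    H = np.empty((m, G.shape[1]))
    cols = np.arange(n)
    for j in range(m):
        row = entry(j, cols)   # evaluate row j of A; it is never stored
        H[j] = row @ G         # row j of the product
    return H
```

Each entry of A is evaluated once per product, so the number of products with A drives how often the routine is called.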

10 Out-of-core Computation
Computation with storage on disk: retrieve as many rows of the matrix from disk as will fit in memory, calculate the products of the retrieved rows of the matrix and the columns of G, then repeat with the next rows until the whole matrix has been processed.
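
The row-block loop might look like the sketch below. It assumes the matrix is available as an array-like object that supports row slicing (for example a NumPy memory map); `blocked_product` and `rows_per_block` are hypothetical names chosen for illustration.

```python
import numpy as np

def blocked_product(A_disk, G, rows_per_block):
    """Compute A @ G by loading rows_per_block rows of A at a time."""
    m = A_disk.shape[0]
    H = np.empty((m, G.shape[1]))
    for start in range(0, m, rows_per_block):
        stop = min(start + rows_per_block, m)
        block = np.asarray(A_disk[start:stop])  # only these rows are read
        H[start:stop] = block @ G               # matching rows of the product
    return H
```

For a matrix stored as raw float64 values in row-major order, `np.memmap(path, dtype=np.float64, mode="r", shape=(m, n))` gives such an object, and `rows_per_block` would be chosen so that one block fits comfortably in RAM.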

11 Computational costs Costs with on-the-fly evaluation of matrix entries

12 Computational costs
Costs of steps one through six, and their sum: [formulas not preserved in the transcript]
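
As a rough guide to where the work goes, the sketch below tallies leading-order operation counts per step for a dense A. These are not the slide's cost formulas: the constants are textbook-style approximations, the SVD term is an order of magnitude only, `c_entry` (the cost of evaluating one entry of A) is a hypothetical parameter, and the count of 2i + 1 products in step one assumes H is built from the blocks A G, A A^T A G, and so on.

```python
def estimated_flops(m, n, l, i, c_entry=0):
    """Leading-order operation counts per step (rough estimates only)."""
    L = (i + 1) * l  # number of columns of H and Q
    return {
        # Step 1: 2i + 1 products of A or A^T with an l-column matrix.
        "step1_products": (2 * i + 1) * 2 * m * n * l,
        # Steps 1 and 3 together touch every entry of A 2i + 2 times.
        "entry_evaluations": (2 * i + 2) * m * n * c_entry,
        # Step 2: QR of an m x L matrix.
        "step2_qr": 2 * m * L**2,
        # Step 3: T = A^T Q.
        "step3_product": 2 * m * n * L,
        # Step 4: SVD of an n x L matrix (order of magnitude only).
        "step4_svd": n * L**2,
        # Step 5: Q times an L x L matrix.
        "step5_product": 2 * m * L**2,
    }
```

Under these assumptions the terms proportional to m n dominate when m and n are much larger than (i + 1) l, which is why limiting the number of passes over A matters.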

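13 Computational costs Costs with storage on disk
Step one: each equation needs at most [formula not preserved in the transcript] flops; the total is [formula not preserved in the transcript].
Steps two, four, five, and six do not use the matrix A, so their costs are the same as with on-the-fly evaluation.
Step three and the overall sum: [formulas not preserved in the transcript]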

14 Numerical Example First example:
Apply the algorithm to the m × n matrix A = E S F
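
A test matrix of this form can be built with known singular values, which makes the accuracy of an approximation easy to check. The sketch below is an illustration under assumptions: it takes E and F to be random orthogonal matrices and S to be zero off its main diagonal (the transcript does not say how E, S, and F were chosen), and `make_test_matrix` is a hypothetical helper name.

```python
import numpy as np

def make_test_matrix(m, n, singular_values, seed=0):
    """Build A = E S F with prescribed singular values on the diagonal of S."""
    rng = np.random.default_rng(seed)
    # Orthogonal factors obtained from QR of random square matrices (assumption).
    E, _ = np.linalg.qr(rng.standard_normal((m, m)))
    F, _ = np.linalg.qr(rng.standard_normal((n, n)))
    # S is m x n and zero except for the leading diagonal entries.
    S = np.zeros((m, n))
    r = len(singular_values)
    S[np.arange(r), np.arange(r)] = singular_values
    return E @ S @ F
```

Because E and F are orthogonal, the nonzero singular values of A are exactly the values placed on the diagonal of S.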

15 Numerical Example

16 Numerical Example Second example:
Same as the first one, but the singular values are different.

17 Numerical Example

18 Numerical Example Third example:
Apply the algorithm with k = 3 to an m × 1000 matrix whose rows are independent and identically distributed realizations of the random vector

19 Numerical Example

20 Numerical Example

21 Measured Data Apply the algorithm with k = 50 to compute the PCA of images of faces.
The data form a 393,216 × 102,042 matrix whose columns consist of images.

22 Measured Data

23 Measured Data

24 Last Example This data set consists of 10,000 two-dimensional images.
Each image is 129 pixels wide and 129 pixels high. k = 250

25 Last Example

26

27

28 Conclusion This paper describes techniques for the principal component analysis of data sets that are too large to be stored in random-access memory, and illustrates the performance of the methods on data from various sources, including standard test sets and numerical simulations. The core steps of the procedures parallelize easily, which makes them well suited to widespread multicore and distributed processing.

29 Pros and Cons Pros: The algorithm is described very clearly.
The numerical examples are comprehensive. Cons: There is no comparison in the evaluation of the examples.

30 Thank you!

