Published by Henry Doyle. Modified over 8 years ago.
SVD: Singular Value Decomposition
Wolf Dan
Outline
- Short reminder: the analyzing process
- Mathematical definition of the SVD
- How do we use SVD analysis on gene expression data?
- Experiments using SVD
- SVDMAN: application
The analyzing process
The famous Red-Green images!
Processing:
Once we are convinced of the quality of all the data we have, we come to the crucial step. The approach to be taken differs according to the experimental design.
Common themes:
- Which genes are differentially expressed between my two samples?
- Which genes have a similar expression profile over time?
- Which experiments have similar gene expression patterns?
[Figure panels: Gene-I expression across sample types; Gene-I expression across time points; expression of genes at a particular time point (Gene: 1 -> i, Time: 1 -> 8, Replicates: 1 -> 3)]
- Are these two gene profiles similar? = Clustering of genes.
- Is the overall gene expression for these two experiments similar? = Clustering of experiments.
- Are these two gene profiles different? = Differential expression of genes between conditions:
  1. Fold change (assuming most genes don't change)
  2. t-test, Z-test, signal to noise (comparing with Wt experiments)
- Identify the genes that change the most (significantly changing genes):
  1. Fold change (assuming most genes don't change)
  2. Z-score
Analysis:
- Clustering is an exploratory analysis tool: we attempt to look for natural groups in the data.
  Q: What are the common 'patterns' of gene expression in my dataset?
- PCA / SVD (in our case): What are the main profiles (parent patterns, eigenvectors) of gene expression? This gives you a set of 'base profiles' from which you can reconstruct each gene profile.
Analysis (cont)
You then look at these base profiles to see:
- What do they look like?
- How many are there?
- What is the contribution of each base profile in determining the final gene expression profile?
- Based on the relative contribution of each base profile, can I group the genes into clusters?
Score plot:
Score plot in two dimensions:
Post processing:
- Class prediction and classifier construction. The goal is to find a list of genes whose expression lets us predict the type (cancer or normal) of a sample.
- Reconstruction of gene regulatory networks. The goal is to learn the dependencies in gene expression and construct a graph (usually undirected).
Mathematical definition of the SVD
Mathematical definition of the SVD
Let $X$ denote an $m \times n$ matrix of real-valued data with rank $r$, where $m \ge n$. The equation for the singular value decomposition of $X$ is the following:
\[ X = U S V^T \]
where $U$ is an $m \times n$ matrix, $S$ is an $n \times n$ diagonal matrix, and $V^T$ is also an $n \times n$ matrix.
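The shapes quoted in this definition correspond to what NumPy calls the "thin" SVD. The following is a minimal sketch, assuming a small made-up matrix in place of real expression data; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4                      # m rows, n columns, with m >= n
X = rng.normal(size=(m, n))      # made-up data, for illustration only

# full_matrices=False returns the thin SVD: U is m x n, Vt is n x n.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(s)                   # n x n diagonal matrix of singular values

X_rebuilt = U @ S @ Vt           # X = U S V^T
print(U.shape, S.shape, Vt.shape)
print(np.allclose(X, X_rebuilt))
```

NumPy returns the singular values as a 1-D array sorted in decreasing order, which is why `np.diag` is needed to recover the matrix $S$.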
Mathematical definition of the SVD
Writing $A$ for the data matrix:
- $V$: the columns are eigenvectors of $A^T A$ and form an orthonormal basis for the gene transcriptional responses.
- $S$: diagonal; its $r$ nonzero singular values are the square roots of the nonzero eigenvalues of both $A A^T$ and $A^T A$.
- $U$: the columns are eigenvectors of $A A^T$ and form an orthonormal basis for the assay expression profiles, so that $u_i \cdot u_j = 1$ for $i = j$ and $0$ otherwise.
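These relations between the SVD factors and the eigen-decompositions of $A^T A$ and $A A^T$ can be checked numerically. This is a small sketch on a random matrix (an assumption made purely for illustration), not a procedure taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))                  # made-up data matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# Squared singular values equal the eigenvalues of A^T A.
# eigvalsh returns eigenvalues in ascending order, so reverse them.
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]
squares_match = np.allclose(s**2, eigvals)

# Columns of V are eigenvectors of A^T A; columns of U are eigenvectors of A A^T.
# Multiplying by s**2 scales column k by s_k^2.
v_is_eigvec = np.allclose(A.T @ A @ V, V * s**2)
u_is_eigvec = np.allclose(A @ A.T @ U, U * s**2)

# Orthonormality: u_i . u_j = 1 if i == j, 0 otherwise (same for v).
u_orthonormal = np.allclose(U.T @ U, np.eye(3))
v_orthonormal = np.allclose(V.T @ V, np.eye(3))

print(squares_match, v_is_eigvec, u_is_eigvec, u_orthonormal, v_orthonormal)
```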
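Matrix Approximation
Let $A$ be an $m \times n$ matrix such that $\mathrm{Rank}(A) = r$. If $s_1 \ge s_2 \ge \dots \ge s_r$ are the singular values of $A$, then $B$, the rank-$q$ approximation of $A$ that minimizes $\|A - B\|_F$, is
\[ B = \sum_{i=1}^{q} s_i \, u_i v_i^T \]
where $u_i$ and $v_i$ are the $i$-th columns of $U$ and $V$.
Proof: S. J. Leon, Linear Algebra with Applications, 5th Edition, p. 414 [Will]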
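A rank-$q$ approximation of this kind can be built by keeping only the first $q$ singular triplets. The sketch below assumes a random test matrix, and `rank_q_approximation` is a hypothetical helper name. It also checks the standard fact that the Frobenius error equals the square root of the sum of the discarded squared singular values.

```python
import numpy as np

def rank_q_approximation(A, q):
    """Return the truncated-SVD approximation of A using the q largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Scale the first q columns of U by s_1..s_q, then multiply by the first q rows of V^T.
    B = (U[:, :q] * s[:q]) @ Vt[:q, :]
    return B, s

rng = np.random.default_rng(2)
A = rng.normal(size=(8, 5))                  # made-up data matrix
q = 2
B, s = rank_q_approximation(A, q)

err = np.linalg.norm(A - B, "fro")
expected_err = np.sqrt(np.sum(s[q:] ** 2))   # energy in the discarded modes
print(np.linalg.matrix_rank(B), np.isclose(err, expected_err))
```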
How do we use SVD analysis on gene expression data ?
Gene expression database: a conceptual view
[Figure labels: Samples; Genes; Gene expression levels; Sample annotations; Gene annotations; Gene expression matrix]
In the case of microarray data:
- $x_{ij}$ is the expression level of the $i$-th gene in the $j$-th assay.
- The elements of the $i$-th row of $X$ form the $n$-dimensional vector $g_i$, which we refer to as the transcriptional response of the $i$-th gene.
- The elements of the $j$-th column of $X$ form the $m$-dimensional vector $a_j$, which we refer to as the expression profile of the $j$-th assay.
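Definitions:
- Our base formula, with $A$ the gene expression matrix (genes as rows, time points as columns):
  \[ A = U S V^T \]
- We define $X_i(t)$, $i = 1, \dots, r$, to be the first $r$ rows of the matrix $V^T$: the characteristic modes.
- The temporal variation of any gene $j$:
  \[ A_j(t) = \sum_{i=1}^{r} U_{j,i} \, s_i \, X_i(t) \]
- The contribution of the first $k$ modes to the temporal pattern of a gene:
  \[ C_j(k) = \sum_{i=1}^{k} \left( U_{j,i} \, s_i \right)^2 \]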
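As a sketch of these definitions, the characteristic modes are the rows of $V^T$, the coefficient of mode $i$ for gene $j$ is $U_{j,i} s_i$, and $C_j(k)$ is the running sum of the squared coefficients. The example uses made-up data and assumes each gene profile has been scaled to unit sum of squares, so that $C_j(r) = 1$. That normalization is an assumption of this example, not something stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(3)
n_genes, n_times = 50, 8
A = rng.normal(size=(n_genes, n_times))            # made-up expression matrix
A /= np.linalg.norm(A, axis=1, keepdims=True)      # assumption: unit-norm gene profiles

U, s, Vt = np.linalg.svd(A, full_matrices=False)
modes = Vt                       # row i is the characteristic mode X_i(t)
coeffs = U * s                   # coeffs[j, i] = U_{j,i} * s_i

# Each gene profile is a linear combination of the modes.
rebuilt = coeffs @ modes

# C[j, k-1] = C_j(k): contribution of the first k modes to gene j.
C = np.cumsum(coeffs**2, axis=1)
print(np.allclose(rebuilt, A), C[0])
```

With unit-norm profiles, $C_j(k)$ rises from 0 to 1 as $k$ grows, so it reads directly as the fraction of a gene's pattern captured by the first $k$ modes.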
Experiments using SVD
Experiments using SVD
We will discuss two experiments:
- Experiment 1: "Fundamental patterns underlying gene expression profiles: Simplicity from complexity", Holter et al. (2000)
- Experiment 2: "Dynamic modeling of gene expression data", Holter et al. (2001)
Experiment 1:
SVD analysis of the published data sets from:
- yeast cdc15 cell cycle
- yeast sporulation
- serum-treated human fibroblasts
Exp 1: running the SVD analysis
What do we learn from the table:
- Random data sets yield similar singular values, because all characteristic modes contribute about equally.
- The actual gene expression data sets yield singular values of sufficiently different magnitude that, in most cases, only the first few modes are required to capture the essential features of the expression data.
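The contrast between random and structured data can be reproduced on synthetic matrices. The sketch below does not use the published data sets: the "structured" matrix is an assumed toy model (two sinusoidal patterns mixed with random weights plus small noise), and `relative_variance` is a hypothetical helper name.

```python
import numpy as np

def relative_variance(A):
    """Fraction of the total squared signal carried by each mode."""
    s = np.linalg.svd(A, compute_uv=False)
    return s**2 / np.sum(s**2)

rng = np.random.default_rng(4)
n_genes, n_times = 200, 12
t = np.linspace(0.0, 2.0 * np.pi, n_times, endpoint=False)

# Toy structured data: every gene is a mix of two underlying patterns plus noise.
patterns = np.vstack([np.sin(t), np.cos(t)])
weights = rng.normal(size=(n_genes, 2))
structured = weights @ patterns + 0.1 * rng.normal(size=(n_genes, n_times))

# Random data of the same shape.
random_data = rng.normal(size=(n_genes, n_times))

top2_structured = relative_variance(structured)[:2].sum()
top2_random = relative_variance(random_data)[:2].sum()
print(top2_structured, top2_random)
```

In this toy setting the first two modes carry nearly all of the structured signal, while for the random matrix the signal is spread roughly evenly across all 12 modes.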
Fig. 1. Characteristic modes ($X_i(t)$) for the gene expression and random data sets (panels A to D).
Holter, Neal S. et al. (2000) Proc. Natl. Acad. Sci. USA 97, 8409-8414. Copyright ©2000 by the National Academy of Sciences.
What do we learn from the figures:
- For the gene expression data sets, the contribution of each mode to the final gene expression profile progressively diminishes from the lower to the higher order modes; for the random data set, the contributions are approximately equal.
- The structure of the two dominant modes is rather simple for all of the gene expression data sets.
- The major features of the overall genetic response of the cells are contained in a combination of just a few different patterns.
What do we learn from the figures (cont):
- Fig A: the shapes of the first two dominant modes do not change significantly upon removal of the last three time points, revealing their robustness.
- Fig B and C: two characteristic modes make a significantly greater contribution to the final profiles than the others.
Reconstruction of expression profiles:
Expression profiles for the yeast cell cycle data, reconstructed from the characteristic modes (14 characteristic modes in total).
Left to right: microarrays reconstructed from the first 1, 2, 3, 4, 5, and all characteristic modes, respectively.
Reconstruction of expression profiles:
What do we learn from the figures:
- A representation comprising just the first two modes captures many of the essential features of the overall array of expression patterns.
- The remaining modes describe minor elements in the patterns, which may be attributable to small-scale fluctuations and experimental noise.
- The analysis uncovers an underlying simplicity in the genetic response patterns of cells.
- However, it does not imply that other patterns of gene expression lack significance.
Fig. 5. Plot of the coefficients for characteristic mode 1 against the coefficients for characteristic mode 2.
Holter, Neal S. et al. (2000) Proc. Natl. Acad. Sci. USA 97, 8409-8414. Copyright ©2000 by the National Academy of Sciences.
From Holter, et al. PNAS 98: 1693-1698.
What do we learn from the figures:
- The coefficients are a measure of the contribution of each mode to the structure of the expression profile of a given gene.
- The data points are fairly densely concentrated near the perimeter of a circle or an ellipse, with the interior rather sparsely populated.
- By contrast, the coefficients for a random data set describe a filled circle.
- The concentration of points near the perimeter of the circle or ellipse simply reflects the relative importance of the first two modes.
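The "points near the perimeter" observation can be illustrated with synthetic data. The sketch below assumes that each gene profile is scaled to unit norm before the SVD, so that a gene fully described by the first two modes lands on the unit circle. The periodic toy data (cosines with random phases plus noise) and the helper name `first_two_coefficients` are assumptions of this example, not the published data or method.

```python
import numpy as np

def first_two_coefficients(A):
    """Coefficients of modes 1 and 2 for each unit-normalized gene profile."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :2] * s[:2]

rng = np.random.default_rng(5)
n_genes, n_times = 300, 12
t = np.linspace(0.0, 2.0 * np.pi, n_times, endpoint=False)

# Toy periodic data: one oscillation per gene, each with its own phase, plus noise.
phase = rng.uniform(0.0, 2.0 * np.pi, size=n_genes)
periodic = np.cos(t[None, :] - phase[:, None]) + 0.1 * rng.normal(size=(n_genes, n_times))
random_data = rng.normal(size=(n_genes, n_times))

# Distance of each gene from the origin in the (mode 1, mode 2) coefficient plane.
radius_periodic = np.linalg.norm(first_two_coefficients(periodic), axis=1)
radius_random = np.linalg.norm(first_two_coefficients(random_data), axis=1)
print(np.median(radius_periodic), np.median(radius_random))
```

For the periodic toy data the radii cluster just below 1 (the perimeter), whereas for random data they fall well inside the circle, since the first two modes carry only a small share of each profile.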
What do we learn from the figures (cont):
- Expression profiles clustered by more conventional methods correspond well to groups of genes with similar coefficients.
- The plot reveals that previously identified clusters appear in adjacent sectors on the perimeter of the circle, in the order of their temporal progression in the cell cycle and in the course of sporulation.
What do we learn from the regularities?
- Most genes undergo either just one or just two "changes of expression phase": a majority of the genes transition from active to inactive, or from inactive to active, at most once or twice.
- Although there are more complex expression patterns, these are sufficiently few that they do not dominate the system's overall response.
What do we learn from the regularities? (cont)
- The observation, for both the cell cycle and fibroblast data, that the points fall near the perimeter of a circle rather than an ellipse means that the contributions of the two dominant modes are roughly equal.
- The observation that the perimeter is fairly evenly populated for these two data sets implies that the coefficients vary continuously.
- For the cell cycle data, most of the cell cycle-regulated genes tend to be expressed for roughly the same length of time.
Implications for the underlying mode of transcriptional regulation:
- The cell cycle progression is a smooth function, with roughly equal numbers of genes being activated and inactivated per unit time and a regular succession in time of gene expression peaks (synthesis (S) phase and mitosis (M)).
- The smooth evolution of gene expression patterns in time is consistent with the operation of such a subtle and continuous regulatory system.
In summary:
- The complex "music of the genes" is orchestrated through a few simple underlying patterns of gene expression change.
- The music produced by the set of strings is then entirely specified by the contributions of each of the characteristic modes.
Experiment 2:
- Describe a time evolution of gene expression levels that reflects the magnitude of the connectivities between genes.
- Use a time translation matrix to predict future expression levels of genes based on their expression levels at some initial time.
Experiment 2:
- We deduce the time translation matrix by modeling the gene expression data sets within a linear framework, using the characteristic modes.
- The resulting time translation matrix provides a measure of the relationships among the modes and governs their time evolution.
Experiment 2:
- The problem: the number of time points is smaller than the number of genes, and thus the problem is underdetermined.
- The solution: the inverse problem is mathematically well defined and tractable if one considers the causal relationships among the $r$ characteristic modes obtained by SVD, where $r$ is one less than the number of time points.
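Definitions:
- The expression levels of the $r$ modes at time $t$:
  \[ Y(t) = \left( X_1(t), X_2(t), \dots, X_r(t) \right)^T \]
- Our linear model is:
  \[ Y(t + \Delta t) = M \cdot Y(t) \]
- The time step $\Delta t$ is chosen to be the highest common factor among all of the experimentally measured time intervals: $t_j = n_j \Delta t$, with $n_j$ an integer.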
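How we determine M:
- $Z(t_0) = Y(t_0)$
- For any integer $k$:
  \[ Z(t_0 + k \Delta t) = M^k \cdot Y(t_0) \]
- The $r^2$ coefficients of $M$ are chosen to minimize the cost function (equation not preserved in this transcript), which compares the predicted $Z(t_j)$ with the measured $Y(t_j)$.
The outcome of this analysis is that the gene expression data set can be reexpressed precisely by using:
- the $r$ specific coefficients for each gene
- the $r \times r$ time translation matrix $M$
- the initial values of each of the $r$ modes.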
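To make the idea of fitting a time translation matrix concrete, here is a minimal sketch on synthetic modes. It swaps in a plain one-step least-squares fit (each measured $Y(t + \Delta t)$ is regressed on $Y(t)$ via the pseudo-inverse) and assumes equally spaced time points. This is not necessarily the cost function used in the original study. The matrix `M_true` and all data are made up for illustration.

```python
import numpy as np

# Made-up 2 x 2 time translation matrix: a slowly decaying rotation.
theta = 2.0 * np.pi / 10.0
M_true = 0.95 * np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])

# Simulate the two modes at equally spaced times: Y(t + dt) = M Y(t).
n_steps = 10
Y = np.empty((2, n_steps))
Y[:, 0] = [1.0, 0.0]
for k in range(1, n_steps):
    Y[:, k] = M_true @ Y[:, k - 1]

# One-step least squares (an assumption of this sketch):
# minimize ||Y[:, 1:] - M Y[:, :-1]||_F over M, solved with the pseudo-inverse.
M_fit = Y[:, 1:] @ np.linalg.pinv(Y[:, :-1])

# Predict from the initial values only: Z(t0 + k dt) = M^k Y(t0).
Z = np.empty_like(Y)
Z[:, 0] = Y[:, 0]
for k in range(1, n_steps):
    Z[:, k] = M_fit @ Z[:, k - 1]

print(np.allclose(M_fit, M_true), np.allclose(Z, Y))
```

With noise-free synthetic data the fit recovers `M_true` exactly. With real, noisy, or unevenly spaced measurements, a fit that compares $Z(t_j)$ with $Y(t_j)$ over the whole trajectory would generally give a different matrix.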
Experiment 2:
Determine $M$, the $r \times r$ time translation matrix, for three different data sets of gene expression profiles:
- yeast cell cycle (CDC15), using the first 12 equally spaced time points
- yeast sporulation, which has 7 time points
- human fibroblast, which has 13 time points
Verifying the accuracy of M:
- By showing that the temporal evolution of the modes is reproduced well.
- By showing that the reconstructed gene expression patterns are virtually indistinguishable from the experimental data.
Experiment 2:
The averages of the experimental measurements (circles) and the predicted expression patterns (lines) of the six clusters.
The first two characteristic modes for the (a) cdc15, (b) sporulation, and (c) fibroblast data sets. The circles correspond to the measured data, and the lines show the approximations based on the best-fit 2 × 2 time translation matrices.
Fig. 3. A reconstruction of the expression profiles for the cdc15 (Left), sporulation (Center), and fibroblast (Right) data sets: (A) using the 2 × 2 time translation matrix; (B) using linear combinations of the top 2 modes; (C) the experimental data.
Holter, Neal S. et al. (2001) Proc. Natl. Acad. Sci. USA 98, 1693-1698. Copyright ©2001 by the National Academy of Sciences.
In summary:
- The results suggest that the causal links between the modes, and thence the genes, involve just a few essential connections.
- Any additional connections among the genes must therefore provide redundancy in the network.
- It may be impossible to determine detailed connectivities among genes with just the microarray data, because the number of genes greatly exceeds the number of contributing modes.
In summary:
- The authors have shown that it is possible to accurately describe the interactions among the characteristic modes.
- An interaction model with only two connections reconstructs the key features of the gene expression in the simplest cases with good fidelity.
SVDMAN: singular value decomposition analysis of microarray data
References
- [Holter]: Neal S. Holter, et al., "Fundamental patterns underlying gene expression profiles: Simplicity from complexity," Proc. Natl. Acad. Sci. USA, 10.1073/pnas.150242097, 2000 (preprint). Available online at www.pnas.org/doi/10.1073/pnas.150242097
- [Holter]: Neal S. Holter, Amos Maritan, Marek Cieplak, Nina V. Fedoroff, and Jayanth R. Banavar, "Dynamic modeling of gene expression data," Proc. Natl. Acad. Sci. USA 98, 1693-1698, 2001.
- [Will]: Todd Will, "Introduction to the Singular Value Decomposition," Davidson College, http://www.davidson.edu/math/will/svd/index.html
- Michael E. Wall, Patricia A. Dyck, and Thomas S. Brettin, "SVDMAN--singular value decomposition analysis of microarray data," http://public.lanl.gov/mewall/svdman/
- Michael E. Wall, Andreas Rechtsteiner, and Luis M. Rocha, "Singular value decomposition and principal component analysis," in A Practical Approach to Microarray Data Analysis, D.P. Berrar, W. Dubitzky, M. Granzow, eds., pp. 91-109, Kluwer: Norwell, MA (2003). LANL LA-UR-02-4001.
Questions?
Thank you all, and have a great summer vacation.