1
Gene Network Inference From Microarray Data
2
Copyright notice: Many of the images in this PowerPoint presentation are from other people. The copyright belongs to the original authors. Thanks!
3
Gene Network Inference
4
Level of Biochemical Detail
Detailed models require lots of data! Highly detailed biochemical models are only feasible for very small systems that have been studied extensively. Example (Arkin et al. (1998), Genetics 149(4)): the lysis-lysogeny switch in phage Lambda, with 5 genes and 67 parameters based on 50 years of research; the stochastic simulation required a supercomputer.
5
Example: Lysis-Lysogeny
Arkin et al. (1998), Genetics 149(4)
6
Level of Biochemical Detail
In-depth biochemical simulation of, e.g., a whole cell is infeasible (so far). Less detailed network models are useful when data is scarce and/or the network structure is unknown. Once the network structure has been determined, we can refine the model.
7
Boolean or Continuous? Boolean networks (Kauffman (1993), The Origins of Order) assume ON/OFF gene states. They allow analysis at the network level, provide useful insights into network dynamics, and there are algorithms for network inference from binary data. [Diagram: two inputs A and B feeding a node C, with C = A AND B.]
8
Boolean or Continuous? The Boolean abstraction is a poor fit to real data.
It cannot model important concepts: amplification of a signal; subtraction and addition of signals; compensation for a smoothly varying environmental parameter (e.g. temperature, nutrients); varying dynamical behavior (e.g. cell cycle period). Feedback control: negative feedback is used to stabilize expression, but it causes oscillation in a Boolean model.
9
Deterministic or Stochastic?
The use of concentrations assumes individual molecules can be ignored. There are known examples (in prokaryotes) where stochastic fluctuations play an essential role (e.g. lysis-lysogeny in Lambda). This requires stochastic simulation (Arkin et al. (1998), Genetics 149(4)) or modeling molecule counts (e.g. Petri nets, Goss and Peccoud (1998), PNAS 95(12):6750-5). Significantly increases model complexity.
10
Deterministic or Stochastic?
Eukaryotes: larger cell volume, typically longer half-lives; few known stochastic effects. Yeast: 80% of the transcriptome is expressed at only a few mRNA copies/cell (Holstege et al. (1998), Cell 95). Human: 95% of the transcriptome is expressed at <5 copies/cell (Velculescu et al. (1997), Cell 88).
11
Spatial or Non-Spatial
Spatiality introduces additional complexity: intercellular interactions, spatial differentiation, cell compartments, cell types. Spatial patterns also provide more data, e.g. stripe formation in Drosophila: Mjolsness et al. (1991), J. Theor. Biol. 152. Few (no?) large-scale spatial gene expression data sets are available so far.
12
Data Requirements: Lower Bounds from Information Theory
How many bits of information are needed just to specify the connection pattern of a network? There are N² possible connections between N nodes, so N² bits are needed to specify which connections are present or absent. Each "data point" (array) provides O(N) bits of information, so O(N) data points are needed.
13
Effect of Limited Connectivity
Assume only K inputs per gene (on average). Then there are NK connections out of N² possible, i.e. roughly C(N², NK) possible connection patterns. The number of bits needed to fully specify the connection pattern is about log C(N², NK) ≈ NK·log(N/K); since each data point provides O(N) bits, O(K·log(N/K)) data points are needed.
14
Comparison with clustering
Use pairwise correlation comparisons as a stand-in for clustering. As the number of genes increases, the number of false positives will increase as well, so we need to use a more stringent correlation test. If we want to keep the same correlation cutoff value r, we need to increase the number of data points as N increases. O(log(N)) data points are needed.
15
Summary
Fully connected: O(N) data points (thousands)
Connectivity K: O(K·log(N/K)) data points (hundreds?)
Clustering: O(log(N)) data points (tens)
Additional constraints reduce data requirements: choice of regulatory functions, limited connectivity. Network inference is feasible, but does require much more data than clustering (a small numeric illustration follows below).
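To make these orders of magnitude concrete, here is a small illustrative calculation in Python; the values N = 6000 and K = 10 are assumed (roughly yeast-sized), not numbers taken from the slides:

```python
import math

N = 6000   # assumed number of genes (yeast-scale); illustrative only
K = 10     # assumed average number of regulatory inputs per gene

# Fully connected network: O(N) data points
fully_connected = N

# Limited connectivity: O(K * log(N / K)) data points
limited_connectivity = K * math.log2(N / K)

# Clustering / pairwise correlation: O(log(N)) data points
clustering = math.log2(N)

print(f"Fully connected    : ~{fully_connected} data points (thousands)")
print(f"Connectivity K={K}  : ~{limited_connectivity:.0f} data points (hundreds)")
print(f"Clustering         : ~{clustering:.0f} data points (tens)")
```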
16
Reverse Engineering Gene Network Methods
Boolean networks
Relevance networks (co-expression networks)
Bayesian networks
Graphical Gaussian models
Differential equations
17
Gene Networks: reverse engineering
Dynamical gene networks:
discrete models: Boolean networks, Bayesian networks, Petri nets
continuous models: neural networks, differential equations
Static gene networks:
statistical correlation analysis, graph theory approaches
18
Problems
Static models: require less data but give lower accuracy.
Dynamical models: require more data but give higher accuracy; must deal with noise and time delays; master equations.
Problem: scarcity of time series data, i.e. a dimensionality problem: the number of genes typically far exceeds the number of time points for which data are available, making the problem an ill-posed one.
19
Gene Co-expression Relation
The relations among n gene expression profiles can be represented by an n×n symmetric correlation (e.g. Pearson correlation) matrix M. Coexistence of collectivity and noise: M = Mn + Mc. The strong-correlation part Mc indicates modular collectivity; the weak-correlation part Mn indicates "noise" between unrelated genes.
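As a concrete illustration, such a correlation matrix (and a crude strong/weak split into Mc and Mn) can be computed with numpy; the random data and the 0.5 cutoff below are placeholders, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples = 50, 20                    # arbitrary sizes for illustration
expr = rng.normal(size=(n_genes, n_samples))   # rows = genes, columns = arrays

# n x n symmetric Pearson correlation matrix M
M = np.corrcoef(expr)

# Rough split into strong ("collective") and weak ("noise") parts: M = Mn + Mc
cutoff = 0.5                                   # illustrative threshold
Mc = np.where(np.abs(M) >= cutoff, M, 0.0)     # strong correlations: modular collectivity
Mn = M - Mc                                    # weak correlations: noise between unrelated genes
print(M.shape, np.count_nonzero(Mc) - n_genes) # number of strong off-diagonal correlations
```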
20
Relevance networks (Butte and Kohane, 2000)
1. Choose a measure of association A(.,.)
2. Define a threshold value tA
3. For all pairs of domain variables (X,Y), compute their association A(X,Y)
4. Connect by an undirected edge those variable pairs (X,Y) whose association A(X,Y) exceeds the predefined threshold value tA
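A minimal sketch of these four steps, assuming absolute Pearson correlation as the association measure A(.,.) and an arbitrary threshold tA:

```python
import numpy as np

def relevance_network(expr, t_A=0.8):
    """Return undirected edges (i, j) whose |Pearson correlation| exceeds t_A.

    expr: 2-D array, rows = variables (genes), columns = observations."""
    A = np.abs(np.corrcoef(expr))            # steps 1 and 3: association for all pairs
    n = A.shape[0]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if A[i, j] > t_A]               # step 4: keep pairs above the threshold t_A
    return edges

# Example with random data standing in for expression profiles
rng = np.random.default_rng(1)
edges = relevance_network(rng.normal(size=(30, 15)), t_A=0.8)
print(len(edges), "edges")
```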
21
Relevance networks (Butte and Kohane, 2000)
22
Determining the Threshold by Random Matrix Theory
Construct a series of correlation matrices with different cutoff values: for a given cutoff, correlation coefficients with absolute value below the cutoff are set to zero, and only those with absolute value beyond the cutoff are kept. Calculate the nearest-neighbour spacing distribution (NNSD) of the eigenvalues of this series of correlation matrices. Determine the cutoff threshold by testing goodness-of-fit to the Poisson distribution using a Chi-square test.
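A simplified sketch of the idea, with the caveat that it skips the spectral unfolding a full random-matrix-theory analysis would use; it thresholds the correlation matrix, computes the nearest-neighbour spacings of its eigenvalues, and applies a chi-square test against the Poisson (exponential) spacing form:

```python
import numpy as np
from scipy import stats

def nnsd_poisson_chi2(M, cutoff, bins=10):
    """Chi-square statistic comparing the NNSD of a thresholded correlation
    matrix against the Poisson (exponential) spacing distribution.
    Simplified sketch: a proper RMT analysis would unfold the spectrum first."""
    C = np.where(np.abs(M) >= cutoff, M, 0.0)      # keep only strong correlations
    eig = np.sort(np.linalg.eigvalsh(C))
    s = np.diff(eig)
    s = s / s.mean()                               # normalise to unit mean spacing
    # Observed counts vs. exponential expectation exp(-s) in equal-width bins
    edges = np.linspace(0.0, s.max(), bins + 1)
    obs, _ = np.histogram(s, bins=edges)
    exp_prob = np.diff(1.0 - np.exp(-edges))       # bin probabilities under Poisson statistics
    exp_counts = exp_prob / exp_prob.sum() * obs.sum()
    chi2, p = stats.chisquare(obs, exp_counts)
    return chi2, p
```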
23
Yeast Gene Co-expression Network at Cutoff 0.77
Red represents the major functional category of each module, while purple, yellow and tan represent other functional categories, which are often clustered into sub-modules. Genes in lavender participate in processes closely related to genes in red. White nodes are unknown genes, while black nodes are genes whose functional links to other genes are not currently understood. Green nodes are genes in metabolic processes, which are influenced by many biological processes. LightCyan nodes in Module 15 are genes involved in cell cycle regulation and related processes.
24
Graphical Gaussian Models
GGMs are undirected probabilistic graphical models that allow the identification of conditional independence relations among the nodes, under the assumption of a multivariate Gaussian distribution of the data. The inference of GGMs is based on a (stable) estimation of the covariance matrix of this distribution. A high correlation coefficient C_ik between two nodes may indicate a direct interaction, but may also arise from indirect effects. The strength of a direct interaction is measured by the partial correlation coefficient π_ik, which describes the correlation between nodes X_i and X_k conditional on all the other nodes in the network.
25
Graphical Gaussian Models
[Diagram: a direct interaction between nodes 1 and 2 corresponds to a strong partial correlation π12.] Partial correlation, i.e. correlation conditional on all other domain variables: Corr(X1, X2 | X3, …, Xn). But usually: #observations < #variables.
26
Graphical Gaussian Models
To infer a GGM, one typically employs the following procedure. From the given data, the empirical covariance matrix is computed and inverted, and the partial correlations ρ_ik are computed. The distribution of |ρ_ik| is inspected, and edges (i, k) corresponding to significantly small values of |ρ_ik| are removed from the graph. The critical step in this procedure is the stable estimation of the covariance matrix and its inverse. Schäfer and Strimmer (2005) propose a novel covariance matrix estimator regularized by a shrinkage approach, after extensively exploring alternative regularization methods based on bagging.
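A compact sketch of this procedure, substituting scikit-learn's Ledoit-Wolf shrinkage estimator for the specific Schäfer-Strimmer estimator (so the regularization differs from the cited paper):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def partial_correlations(data):
    """Partial correlation matrix from a shrinkage covariance estimate.

    data: 2-D array, rows = observations (arrays), columns = variables (genes)."""
    cov = LedoitWolf().fit(data).covariance_   # regularised covariance estimate
    prec = np.linalg.inv(cov)                  # precision (inverse covariance) matrix
    d = np.sqrt(np.diag(prec))
    pcor = -prec / np.outer(d, d)              # rho_ik = -P_ik / sqrt(P_ii * P_kk)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Edges (i, k) with small |rho_ik| are removed; the remaining edges form the GGM.
```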
27
Further drawbacks: relevance networks and graphical Gaussian models can extract undirected edges only. Bayesian networks promise to extract at least some directed edges. But can we trust these edge directions? It may be better to learn undirected edges than to learn directed edges with false orientations.
28
Bayesian networks (BN) in brief
Graphs in which nodes represent random variables. (Lack of) arcs represent conditional independence assumptions. Present and absent arcs provide a compact representation of joint probability distributions. BNs have a complicated notion of independence, which takes into account the directionality of the arcs.
29
Bayes' Rule: we can rearrange the conditional probability formula
to get P(A|B)·P(B) = P(A,B), but by symmetry we can also get P(B|A)·P(A) = P(A,B). It follows that P(A|B) = P(B|A)·P(A) / P(B). The power of Bayes' rule is that in many situations where we want to compute P(A|B) it turns out that it is difficult to do so directly, yet we might have direct information about P(B|A). Bayes' rule enables us to compute P(A|B) in terms of P(B|A).
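For instance, with made-up numbers P(B|A) = 0.9, P(A) = 0.1 and P(B) = 0.2 (assumed purely for illustration): P(A|B) = P(B|A)·P(A) / P(B) = (0.9 × 0.1) / 0.2 = 0.45.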
30
Bayesian networks: a marriage between graph theory and probability theory. A directed acyclic graph (DAG) represents conditional independence relations. The Markov assumption leads to a factorization of the joint probability distribution: P(X1, …, Xn) = ∏_i P(Xi | parents(Xi)). [Diagram: example DAG with nodes A-F and directed edges.]
31
Simple Bayesian network example, from “Bayesian Networks Without Tears” article P(hear your dog bark as you get home) = P(hb) = ?
32
We need prior probabilities for the root nodes and, for non-root nodes, conditional probabilities that consider all possible values of the parent nodes.
33
Major benefit of BN We can know P(hb) based only on the conditional probabilities of hb and its parent node. We don’t need to know/include all the ancestor probabilities between hb and the root nodes.
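A small sketch of this computation for the dog-bark network; the probability values are illustrative placeholders rather than figures quoted from the "Bayesian Networks Without Tears" article:

```python
# Hypothetical priors and conditional probability tables (illustrative values only)
P_fo = 0.15                      # P(family-out)
P_bp = 0.01                      # P(bowel-problem)
P_do_given = {                   # P(dog-out | family-out, bowel-problem)
    (True, True): 0.99, (True, False): 0.90,
    (False, True): 0.97, (False, False): 0.30,
}
P_hb_given = {True: 0.70, False: 0.01}   # P(hear-bark | dog-out)

# P(do) is obtained by summing over its parents...
P_do = sum(P_do_given[(fo, bp)]
           * (P_fo if fo else 1 - P_fo)
           * (P_bp if bp else 1 - P_bp)
           for fo in (True, False) for bp in (True, False))

# ...but P(hb) then depends only on its parent do: P(hb) = sum over do of P(hb|do) P(do)
P_hb = P_hb_given[True] * P_do + P_hb_given[False] * (1 - P_do)
print(f"P(dog-out) = {P_do:.3f}, P(hear-bark) = {P_hb:.3f}")
```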
34
Independence assumptions
Source of savings in # of values needed From our simple example: are ‘family-out’ and ‘hear-bark’ independent, i.e. P(hb|fo)=P(hb)? Intuition might say they are not independent…
35
Independence assumptions
…but in fact they can be assumed to be (conditionally) independent if some conditions are met. The conditions are encoded by the presence/absence and direction of arrows between nodes. Knowing whether the dog is or is not in the house is all that is needed to know the probability of hearing a bark, so given "dog-out", whether the family is in or out no longer matters. This kind of independence assumption is what allows savings in how many numbers must be specified for the probabilities.
36
Learning Bayesian Belief Networks
1. The network structure is given in advance and all the variables are fully observable in the training examples ==> trivial case: just estimate the conditional probabilities (see the counting sketch below).
2. The network structure is given in advance but only some of the variables are observable in the training data ==> similar to learning the weights for the hidden units of a neural net: gradient ascent procedure.
3. The network structure is not known in advance ==> use a heuristic search or constraint-based technique to search through potential structures.
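For the fully observed case, "just estimate the conditional probabilities" amounts to counting; a minimal sketch with one hypothetical binary parent-child pair:

```python
from collections import Counter

# Fully observed training examples: (parent_value, child_value); hypothetical data
samples = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0)]

pair_counts = Counter(samples)
parent_counts = Counter(p for p, _ in samples)

# Maximum-likelihood estimate of P(child = c | parent = p)
cpt = {(p, c): pair_counts[(p, c)] / parent_counts[p]
       for (p, c) in pair_counts}
print(cpt)   # e.g. P(child=1 | parent=1) = 3/4
```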
37
BN from microarray data: "Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data," Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N, Nature Genetics, June 2003.
38
Results of SSR article Expression data set, from other researchers circa 2000, is for genes of yeast subjected to various kinds of stress Compiled list of 466 candidate regulators Applied analysis to 2355 genes in all 173 arrays of yeast data set This gave automatic inference of 50 modules of genes All modules were analyzed with external data sources to check functional coherence of gene products and validity of regulatory program Three novel hypotheses suggested by method were tested in bio lab and found to be accurate
41
Differential Equations
Typically uses linear differential equations to model the gene trajectories: dx_i(t)/dt = a_{i,0} + a_{i,1}·x_1(t) + a_{i,2}·x_2(t) + … + a_{i,n}·x_n(t). Several reasons for that choice: the lower number of parameters means we are less likely to overfit the data, yet the model is sufficient to capture complex interactions between the genes.
42
Small Network Example [Diagram: four genes x1-x4 connected by activating (+) and inhibiting (-) edges]
dx1(t)/dt = w11·x1(t)
dx2(t)/dt = w23·x3(t) + w24·x4(t)
dx3(t)/dt = w31·x1(t) + w33·x3(t)
dx4(t)/dt = w41·x1(t) + w43·x3(t) + w44·x4(t)
where the coefficients w_ij give the sign and strength of each interaction.
43
Small Network Example: each edge in the diagram corresponds to one interaction coefficient w_ij in the equations above (+ for activation, - for inhibition).
44
Small Network Example: in addition to the interaction coefficients, the model can include constant coefficients (the a_{i,0} terms), which do not correspond to any edge in the diagram.
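To make the linear model concrete, the following sketch simulates a four-gene network of this shape with scipy; the coefficient values and signs are assumptions chosen for illustration, not taken from the slides:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Assumed interaction coefficients W[i, j] = effect of gene j on gene i (illustrative values)
W = np.array([
    [-0.4,  0.0,  0.0,  0.0],   # x1 regulated only by itself (decay)
    [ 0.0,  0.0,  0.8, -0.5],   # x2 driven by x3 (+) and x4 (-)
    [ 0.6,  0.0, -0.3,  0.0],   # x3 driven by x1 and itself
    [ 0.5,  0.0,  0.7, -0.2],   # x4 driven by x1, x3 and itself
])

def dxdt(t, x):
    return W @ x                 # linear model: dx_i/dt = sum_j w_ij x_j

sol = solve_ivp(dxdt, t_span=(0, 10), y0=[1.0, 0.5, 0.2, 0.1],
                t_eval=np.linspace(0, 10, 50))
print(sol.y.shape)               # trajectories of the four genes over 50 time points
```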
45
Issues with Differential Equations
Even under the simplest linear model, there are m(m+1) unknown parameters to estimate: m(m-1) directional effects, m self effects, and m constant effects. The number of data points is m·n, and typically n << m (few time points). To avoid overfitting, extra constraints must be incorporated into the model, such as: smoothness of the equations; sparseness of the network (few non-null interaction coefficients), as in the sketch below.
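One common way to impose the sparseness constraint is an L1 penalty on the interaction coefficients; a sketch using scikit-learn's Lasso on finite-difference derivative estimates (both the derivative estimate and the regularization strength alpha are assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_sparse_network(X, t, alpha=0.05):
    """Fit dx_i/dt ~ a_i0 + sum_j a_ij x_j with an L1 (sparsity) penalty.

    X: genes x timepoints expression matrix; t: 1-D array of time points."""
    dX = np.gradient(X, t, axis=1)         # crude finite-difference derivative estimate
    A = []
    for i in range(X.shape[0]):            # one regression per gene (row of A)
        model = Lasso(alpha=alpha, fit_intercept=True).fit(X.T, dX[i])
        A.append(model.coef_)              # many coefficients are driven to exactly zero
    return np.array(A)
```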
46
Collins et al., PNAS
Using SVD to obtain a family of possible solutions
Using robust regression to choose among them
47
Goal is to use as few measurements as possible. By this method (with exact measurements): M = O(log(N))
48
If the system is near a steady state, the dynamics can be approximated by a linear system of differential equations: dx_i(t)/dt = -λ_i·x_i(t) + Σ_j W_ij·x_j(t) + b_i + ξ_i, where x_i = concentration of mRNA (reflects the expression level of gene i), λ_i = self-degradation rates, b_i = external stimuli, ξ_i = noise, and W_ij = type and strength of the effect of the jth gene on the ith gene.
49
Assumptions made: no time dependence in the connections (so W is not time-dependent), and they are not changed by the experiments; the system is near a steady state; noise is discarded, so exact measurements are assumed and the derivatives dx_i/dt can be calculated exactly enough.
50
The system becomes Ẋ = A·X + B, with A = W + diag(-λ_i). The derivatives Ẋ are computed using several measurements of the data for X (e.g. using interpolation). Goal: deduce W (or A) from the rest. If M = N, one can compute (X^T)^{-1} directly, but mostly M << N (and this is our goal: M = O(log N)).
51
Therefore, use SVD (to find the least-squares solution): X^T = U·diag(w_1,…,w_N)·V^T.
Here U and V are orthogonal (U^T = U^{-1}) and the w_i are the singular values of X. Suppose the zero singular values come first, so w_i = 0 for i = 1…L and w_i ≠ 0 for i = L+1…N.
52
Then the least-squares (L2) solution to the problem is A_0 = (Ẋ - B)·U·diag(1/w_j)·V^T,
with 1/w_j replaced by 0 if w_j = 0. So this formula tries to match every data point as closely as possible to the solution.
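A sketch of this least-squares step with numpy's SVD; the function name and the tolerance for treating a singular value as zero are my own choices, not from the paper:

```python
import numpy as np

def l2_solution(X, Xdot, B, tol=1e-10):
    """Least-squares (L2) connectivity estimate A0 for Xdot = A X + B.

    X, Xdot, B: N x M arrays (N genes, M measurements);
    1/w_j is replaced by 0 when the singular value w_j is (near) zero."""
    U, w, Vt = np.linalg.svd(X.T, full_matrices=False)    # SVD of X^T
    w_inv = np.where(w > tol * w.max(), 1.0 / w, 0.0)     # drop (near-)zero singular values
    X_pinv = (U * w_inv) @ Vt                             # pseudoinverse of X (M x N)
    return (Xdot - B) @ X_pinv                            # A0 (N x N)
```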
53
But all possible solutions are A = A_0 + C·V^T,
with C = (c_ij)_{N×N}, where c_ij = 0 if j > L and otherwise c_ij is just a scalar coefficient. How to choose from this family of solutions? The least-squares method tries to match every data point as closely as possible, which yields a not-so-sparse matrix with a lot of small entries.
54
Based on prior biological knowledge, impose constraints on the solutions, e.g. when we know that 2 genes are related, the solution must reflect this in the matrix. Work from the assumption that normal gene networks are sparse, and look for the matrix that is most sparse: thus, search for c_ij that maximize the number of zero entries in A.
55
So: get as many zero entries as you can, and therefore a sparse matrix; the non-zero entries form the connections. Fit as many measurements as you can, exactly: "robust regression" (so you assume exact measurements).
56
Do this using L1 regression. Thus, when considering A = A_0 + C·V^T,
we want to "minimize" A. The L1 regression idea is then to look for the solution C for which Σ_ij |a_ij| is minimal. This causes as many zeros as possible. Implementation was done using the simplex method (a linear programming method); a sketch follows below.
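A sketch of this L1 step for a single row of A, using scipy's linprog (an LP solver) in place of a hand-coded simplex routine; the function and variable names are illustrative, not from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def sparsest_row(a0_row, null_basis):
    """Pick, from the family a = a0_row + c @ null_basis, the row with minimal L1 norm.

    a0_row: length-N least-squares solution for one gene.
    null_basis: (L x N) basis of the null space of X^T (rows of V^T with w_j = 0)."""
    L, N = null_basis.shape
    if L == 0:
        return a0_row                       # solution is unique; nothing to choose
    # LP variables: [c (L), t (N)]; minimise sum(t) subject to |a0 + c @ null_basis| <= t
    obj = np.concatenate([np.zeros(L), np.ones(N)])
    A_ub = np.block([[ null_basis.T, -np.eye(N)],
                     [-null_basis.T, -np.eye(N)]])
    b_ub = np.concatenate([-a0_row, a0_row])
    bounds = [(None, None)] * L + [(0, None)] * N
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    c = res.x[:L]
    return a0_row + c @ null_basis          # many entries end up exactly zero
```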
57
Results: Mc = O(log(N))
Better than using SVD alone, without the robust regression step.
58
Thus, to reverse-engineer a network of N genes, we "only" need Mc = O(log N) experiments.
Then Mc << N, and the computational cost will be O(N^4). (Brute-force methods would have a cost of O(N!/(k!(N-k)!)) with k non-zero entries.)
59
Discussion. Advantages:
Little data needed, in comparison with neural networks and Bayesian models
No prior knowledge needed
Easy to parallelize, as it recovers the connectivity matrix row by row (gene by gene)
Also applicable to protein networks