1
cs726: Modeling regulatory networks in cells using Bayesian networks
Golan Yona, Department of Computer Science, Cornell University
2
cs726 Outline
– Regulatory networks
– Expression data
– Bayesian networks: what, why, how
– Learning networks from expression data
– Using Bayesian networks to analyze expression data (Friedman et al.)
3
cs726 Regulatory networks (figure: KEGG regulatory pathways)
5
Metabolic pathways (figure: KEGG metabolic pathways)
7
Expression Arrays
Measure the expression levels of thousands of genes in a cell simultaneously, under specific conditions (e.g. the cell cycle).
Each cell has the same genomic data, but different subsets of proteins are expressed in different cells, and in the same cell under different conditions.
Protein levels are controlled through
– transcription initiation
– mRNA transcription
– mRNA transport
– splicing
– post-translational modifications
– degradation of mRNA and proteins.
Microarrays measure the level of mRNA, thus providing indirect evidence about the control of protein levels.
8
cs726 Micro-spotting pin
10
Some genes are over-expressed (red) and some are under-expressed (green), measured with respect to a control group of genes (“fixed” genes).
Different pathways are activated under different conditions.
11
cs726 Goals
Recover protein interactions and sub-networks that correspond to regulatory networks in the cell.
Basic assumption: some genes depend on others, while others exhibit independence or conditional independence.
The means: Bayesian networks, which can model the statistical dependencies between different variables (genes). This is different from clustering analysis: it is applicable when the dependency between genes is “local”.
Problems: the data is noisy, partial, and sometimes misleading (translation, activation); there is not enough of it to ensure statistically significant models; and there is the issue of time scale.
12
cs726 Bayesian Networks
A compromise between the assumption of complete dependency and complete conditional independence (Naïve Bayes).
Less constraining, yet still tractable.
We know something about the statistical dependencies between features, but not necessarily about the type of the underlying distributions.
13
cs726 Example (figure: a Bayesian network over car-related variables: engine temp., coolant temp., fan speed, oil temp., oil pressure in engine, air pressure in tire, smoke)
14
cs726 Bayesian Networks
Also called belief nets: a graph description of the dependencies and independencies between variables.
Each node corresponds to a variable (gene). The graph is directed and acyclic. The variables are discrete: a variable can take on a value from a set of values {a1, a2, …}, e.g. on/off, with probabilities P(a_i) satisfying Σ_i P(a_i) = 1.
A link joining node A to node C is directional and represents the set of conditional probabilities P(c_j | a_i), suggesting causality (e.g. the probability that C is on given that A is off).
The network is described in terms of its nodes and edges and the conditional probability distributions associated with each node.
15
cs726 Network structure
For every node A:
– The parents of A are the set of immediate predecessors of A.
– The children of A are the set of immediate successors of A.
– B is a descendant of A if there is a directed path from A to B.
Network assertions:
– The value of a variable depends on its parents.
– A variable is conditionally independent of its non-descendants given its parents.
Conditional probability example: the table for Coolant temp. given its parents, Engine temp. and Fan speed:
Coolant temp. | Eng. cold, Fan fast | Eng. cold, Fan slow | Eng. hot, Fan fast | Eng. hot, Fan slow
High          | 0.1                 | 0.4                 | 0.6                | 0.9
Low           | 0.9                 | 0.6                 | 0.4                | 0.1
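A minimal Python sketch (not from the original slides; the names and representation are illustrative) of how the table above can be stored and queried as the conditional probability distribution attached to the Coolant temp. node:

# Conditional probability table for Coolant temp. given (Engine temp., Fan speed),
# taken from the table above. The dictionary representation is illustrative only.
coolant_cpt = {
    ("cold", "fast"): {"high": 0.1, "low": 0.9},
    ("cold", "slow"): {"high": 0.4, "low": 0.6},
    ("hot",  "fast"): {"high": 0.6, "low": 0.4},
    ("hot",  "slow"): {"high": 0.9, "low": 0.1},
}

def p_coolant(value, engine_temp, fan_speed):
    # P(Coolant temp. = value | Engine temp. = engine_temp, Fan speed = fan_speed)
    return coolant_cpt[(engine_temp, fan_speed)][value]

print(p_coolant("high", "hot", "slow"))  # 0.9, the last column of the table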
16
cs726 Calculating the probability of an assignment
The network describes the joint probability distribution of all variables (some conditionally independent and some not). It depends on the structure!
The probability of a specific assignment of values y_1, y_2, …, y_n to the variables Y_1, Y_2, …, Y_n factors over the network:
P(y_1, y_2, …, y_n) = Π_i P(y_i | parents(Y_i))
This is the likelihood of the data given the model. All you need to know is the conditional probability distribution of each variable given its parents.
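As a concrete illustration of the factorization (a minimal sketch, not code from the paper; the toy three-node network and the root priors are assumptions):

# Joint probability of one assignment as the product of P(node | parents).
# The tiny car network and the prior probabilities below are illustrative only.
parents = {"Engine": [], "Fan": [], "Coolant": ["Engine", "Fan"]}
cpds = {
    "Engine":  {(): {"hot": 0.3, "cold": 0.7}},
    "Fan":     {(): {"fast": 0.5, "slow": 0.5}},
    "Coolant": {("cold", "fast"): {"high": 0.1, "low": 0.9},
                ("cold", "slow"): {"high": 0.4, "low": 0.6},
                ("hot",  "fast"): {"high": 0.6, "low": 0.4},
                ("hot",  "slow"): {"high": 0.9, "low": 0.1}},
}

def joint_probability(assignment):
    # P(y_1, ..., y_n) = product over i of P(y_i | parents(Y_i))
    prob = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[node])
        prob *= cpds[node][parent_values][value]
    return prob

print(joint_probability({"Engine": "hot", "Fan": "slow", "Coolant": "high"}))
# = P(hot) * P(slow) * P(high | hot, slow) = 0.3 * 0.5 * 0.9 = 0.135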
17
cs726 Learning Bayesian networks from data
Given a data set with specific assignments to the variables (on/off for each gene), how can we find the most probable network structure that explains the data (the best “match” to the data)? How do we quantify a match?
Note that there are two aspects of the network that we need to learn:
– Structure (nodes, edges)
– Conditional probability distributions
Common strategy: assign a score to each network G.
18
cs726 Common strategy: assign a score to each network G, and pick the network that maximizes the score:
Score(G : D) = log P(D | G) + log P(G)
where the first term is the (log) likelihood and the second is the (log) prior over structures.
19
cs726 Learning
The likelihood of the data given the model, P(D | G), is estimated by averaging over all possible assignments of parameters (conditional probabilities) to G, i.e. a summation over all possible assignments of the conditional probabilities. The major contribution comes from the parameter set estimated from the data.
Given a specific structure, for every node we look up its parents and calculate the empirical conditional probability distribution.
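A minimal sketch of the last step (my own illustration, not the authors' code; the maximum-likelihood plug-in score and all names are assumptions): estimate each node's empirical conditional distribution from the discretized data, then use it to score a candidate structure.

from collections import Counter, defaultdict
from math import log

def empirical_cpds(data, parents):
    # Estimate P(node = v | parent values) by counting configurations in the data.
    # data:    list of samples, each a dict gene -> discrete value (e.g. "on"/"off")
    # parents: dict gene -> list of parent genes (the candidate structure G)
    counts = defaultdict(Counter)
    for sample in data:
        for node, pa in parents.items():
            counts[(node, tuple(sample[p] for p in pa))][sample[node]] += 1
    cpds = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        cpds[key] = {value: c / total for value, c in counter.items()}
    return cpds

def log_likelihood(data, parents, cpds):
    # log P(D | G, empirical parameters): sum of log P(y_i | parents) over samples.
    ll = 0.0
    for sample in data:
        for node, pa in parents.items():
            ll += log(cpds[(node, tuple(sample[p] for p in pa))][sample[node]])
    return ll

# Toy usage: two genes, Y regulated by X.
data = [{"X": "on", "Y": "on"}, {"X": "on", "Y": "on"}, {"X": "off", "Y": "off"}]
G = {"X": [], "Y": ["X"]}
print(log_likelihood(data, G, empirical_cpds(data, G)))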
20
cs726 Model selection
The second term (the log prior) is a measure of the complexity of the model (through uncertainty).
Occam's razor: entia non sunt multiplicanda praeter necessitatem (entities should not be multiplied beyond necessity).
The MDL principle.
In the papers discussed here this term is ignored.
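For reference (this concrete form is not on the slide, but it is a standard way the MDL-style complexity penalty is instantiated): the BIC/MDL score of a structure G on N samples is
Score_BIC(G : D) = log P(D | G, Θ_G) - (log N / 2) * dim(G)
where Θ_G are the maximum-likelihood parameter estimates and dim(G) is the number of free parameters in G, so larger, denser networks pay a larger penalty.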
21
cs726 In search of the best network
In theory: test different structures, calculate the probability of the assignments to the variables under each network structure, and output the network that maximizes the likelihood of the data given the network.
Impossible in practice: the number of possible network structures over n genes grows super-exponentially with n. For the yeast genome with 6000 genes it is > 10^5,000,000.
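A back-of-the-envelope check of that magnitude (my own reasoning, not from the slide): every subset of the n(n-1)/2 possible edges under a fixed gene ordering is a distinct acyclic network, so there are at least 2^(n(n-1)/2) structures.

from math import log10

n = 6000                       # roughly the number of yeast genes
edge_slots = n * (n - 1) // 2  # possible edges under a fixed gene ordering
# Lower bound on the number of structures: 2 ** edge_slots.
# Decimal exponent of that bound:
print(edge_slots * log10(2))   # about 5.4 million, so well above 10**5,000,000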
22
cs726 Possible solution
Apply a heuristic greedy local search: start with a random network and locally improve it by testing perturbations of the current structure.
Test one edge at a time, by adding, removing, or reversing the edge, and checking its effect on the score. If the score improves, accept the change.
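A minimal sketch of this greedy loop (illustrative only, not the authors' implementation; it assumes a score function over structures, e.g. built from the empirical log-likelihood sketched earlier, and for simplicity starts from the empty network rather than a random one):

import random

def is_acyclic(parents):
    # Check that the parent sets define a DAG (depth-first search for cycles).
    state = {}  # node -> "visiting" or "done"
    def visit(node):
        if state.get(node) == "done":
            return True
        if state.get(node) == "visiting":
            return False          # found a cycle
        state[node] = "visiting"
        ok = all(visit(p) for p in parents[node])
        state[node] = "done"
        return ok
    return all(visit(n) for n in parents)

def greedy_search(genes, score, n_steps=1000):
    # Greedy local search: add, remove, or reverse one edge at a time and
    # keep the change only if the score improves.
    parents = {g: [] for g in genes}          # empty network as the start point
    best = score(parents)
    for _ in range(n_steps):
        a, b = random.sample(genes, 2)
        candidate = {g: list(ps) for g, ps in parents.items()}
        if a in candidate[b]:                 # edge a -> b exists
            move = random.choice(["remove", "reverse"])
            candidate[b].remove(a)
            if move == "reverse":
                candidate[a].append(b)        # now b -> a
        else:
            candidate[b].append(a)            # add edge a -> b
        if not is_acyclic(candidate):
            continue
        new_score = score(candidate)
        if new_score > best:                  # accept only improving moves
            parents, best = candidate, new_score
    return parents, best

# Example: score = lambda G: log_likelihood(data, G, empirical_cpds(data, G))
#          structure, best_score = greedy_search(list(G), score)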
23
cs726 How to learn from expression data
Two types of features are extracted from the learned networks.
First: a gene Y is in the Markov blanket of X (the two genes are involved in the same biological process; no other gene mediates the dependence). There remains the problem of unobserved variables that can mediate the interaction.
Second: a gene X is an ancestor of Y (assessed over all the networks that are learned).
24
cs726 Application to the yeast cell cycle data
Expression level measurements for 6177 genes at different time points over six cell cycles, altogether 76 measurements per gene.
Only 800 genes vary during the cell cycle, and 250 of them cluster into 8 fairly distinct classes. Networks are learned for the 800 genes.
Confidence values are based on the set of networks learned from different bootstrap sets.
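A minimal sketch of the bootstrap confidence estimate (my own illustration, not the authors' code; it assumes a learn_network(data) routine, e.g. the greedy search above, and a feature predicate such as "X is an ancestor of Y"):

import random

def bootstrap_confidence(data, learn_network, feature, n_bootstrap=200):
    # Confidence of a feature = fraction of bootstrap networks in which it holds.
    # data:          list of samples (dicts gene -> discretized expression level)
    # learn_network: maps a data set to a structure (dict gene -> list of parents)
    # feature:       predicate taking a structure and returning True/False
    hits = 0
    for _ in range(n_bootstrap):
        resample = [random.choice(data) for _ in data]  # sample with replacement
        if feature(learn_network(resample)):
            hits += 1
    return hits / n_bootstrap

def is_ancestor(x, y, parents):
    # True if gene x is an ancestor of gene y in the given structure.
    frontier, seen = list(parents[y]), set()
    while frontier:
        node = frontier.pop()
        if node == x:
            return True
        if node not in seen:
            seen.add(node)
            frontier.extend(parents[node])
    return False

# Example, using gene names that appear later in the slides:
# conf = bootstrap_confidence(data, learn_network,
#                             lambda G: is_ancestor("CLN2", "RNR3", G))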
25
cs726 Typical sub-network
26
cs726 Biological significance Order relations: there are a few dominant genes that appear before many others, e.g. genes that are involved in cell cycle control and initiation.
27
cs726 Most are nuclear proteins, but there are also cytoplasmic and membrane proteins (budding and sporulation) and some DNA repair proteins (a prerequisite for transcription). RSR1 is an initiator of signal transduction cascades in the cell.
28
cs726 Biological significance Markov connection: functionally related
29
cs726 Most pairs have similar functions (sometimes verified through transitivity). Some are physically adjacent on the chromosome. Some relations cannot be detected directly from expression data.
Detecting conditional independence: when a group of genes is expressed similarly, but one is a parent of all the others and there are no connections among the others, the parent is a control gene (e.g. CLN2, an early cell cycle control gene, which controls RNR3, SVS1, SRO4 and RAD51, genes that are otherwise functionally unrelated).
30
cs726 Conclusions
A powerful tool, but:
– not enough data
– computational problems
– learning algorithms
– the authors decompose networks into basic elements again
Many possible extensions.