Using Bayesian Networks to Analyze Whole-Genome Expression Data Nir Friedman Iftach Nachman Dana Pe’er Institute of Computer Science, The Hebrew University.


Using Bayesian Networks to Analyze Whole-Genome Expression Data
Nir Friedman, Iftach Nachman, Dana Pe’er
Institute of Computer Science, The Hebrew University of Jerusalem

Biological Background
• Similar DNA... yet different expression of proteins
• The expression profile depends on:
  ◦ Tissue
  ◦ External conditions
  ◦ Growth stages
  ◦ ...
• Gene expression is responsible for cell activity (including regulation of expression)
[Figure: DNA → RNA → Protein]
• cDNA microarray: DNA hybridization measures the abundance of RNA
• Recently developed technologies allow parallel measurement of the expression levels of thousands of genes/proteins
• This allows biologists to view the cell as a complete system

Big Challenge
• Extracting meaningful information from the expression data
• Infer regulatory mechanisms
• Reveal function of proteins
• Experiment planning

Prior Work
• Clustering of expression data
  ◦ Groups together genes with similar expression patterns
  ◦ Disadvantage: does not reveal structural relations between genes
• Boolean Networks
  ◦ Deterministic models of the logical interactions between genes
  ◦ Disadvantages: deterministic, impractical for real data
We suggest a probabilistic framework capable of learning complex relations between genes.

Bayesian Networks
• A Bayesian Network (BN) is a graphical representation of a probability distribution.
• Advantages:
  ◦ Compact & intuitive representation
  ◦ Captures causal relationships
  ◦ Efficient model learning
  ◦ Deals with noisy data
  ◦ Integration of prior knowledge
  ◦ Effective inference for experiment planning
[Figure: example network over Gene A, Gene B, Gene C, Gene D and Gene E, with a conditional probability table P(A | E, B); the table layout is not recoverable from the extraction]
• Qualitative part: directed acyclic graph (DAG)
  ◦ Nodes: random variables of interest
  ◦ Edges: direct (causal) influence
• Quantitative part: local probability models, i.e. a set of conditional probability distributions

Preliminary Experiments
• Data from Spellman et al. (Mol. Bio. of the Cell, 1998)
  ◦ http://genome-www.stanford.edu/cell-cycle
  ◦ Contains 76 samples covering the whole yeast genome
  ◦ Different methods for synchronizing the cell cycle in yeast
  ◦ Time series at intervals of a few minutes (5-20 min)
  ◦ Spellman et al. identified 800 cell-cycle regulated genes and clustered them
  ◦ 250 of these genes were combined into 8 clusters
We took these 250 genes and
• Discretized expression into three levels
• Ran a 100-fold bootstrap using our sparse learning algorithm
• Computed confidence in predictions
Evaluation
• Pairs with 80% confidence were evaluated against the original clustering:
  ◦ 70% of these were intra-cluster
  ◦ The rest show interesting inter-cluster relations
Biological Insight
• M. Linial (Life Sciences, Hebrew U.) examined the relations
• Most relations involved unknown/putative proteins, but we can guess functions based on homologies
• ... and they mostly make a lot of biological sense
• Only 3 pairs were considered suspicious

Future Directions & Work
To get better results, we need
• More data! Publicly available gene expression experiments are extremely small.
• Frequent samples: current sampling is far below the rate of the regulation process
• External variables: we want to relate regulation to external events (stimuli, temperature, nutrient levels, etc.)
We plan to improve modeling by
• More suitable local distribution models
• Correct handling of hidden variables: can we recognize hidden causes of coordinated regulation events?
• Improving computational efficiency
• Incorporating prior knowledge: we need to incorporate a large mass of biological knowledge, and insight from sequence/structure databases
• Learning from interventions: how to learn causality from knockout experiments? How to plan such experiments? Related issues have been examined in the BN literature.

References
• N. Friedman, I. Nachman, and D. Pe’er. Learning of Bayesian Network structure from massive datasets: The “sparse candidate algorithm”. HUJI tech report CS99-3. (Submitted)
• N. Friedman, M. Goldszmidt, and A. Wyner. Data Analysis with Bayesian Networks: A Bootstrap Approach. HUJI tech report CS99-4. (Submitted)
• N. Friedman, M. Linial, I. Nachman, and D. Pe’er. Using Bayesian Networks to analyze whole genome expression data: A Preliminary Investigation. HUJI tech report CS99-6. (In preparation)
• D. Heckerman. A tutorial on learning with Bayesian Networks. In Learning in Graphical Models, MIT Press, 1998.
• J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco, Calif., 1988.
• Spellman et al. Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Mol. Bio. of the Cell, vol. 9, December 1998.

Bayesian Networks for Gene Expression
We want to apply methods for learning Bayesian networks to analyze gene expression experiments.
• Measured expression level of each gene
• Random variables affecting one another
• Possible extensions: random variables that measure
  ◦ External stimuli
  ◦ Environment parameters (temperature, nutrients, pH, etc.)
  ◦ Biological factors

Learning Bayesian Networks
[Figure: a Learner takes Data + Prior information and produces a network over the nodes E, D, B, A, C]
• Efficient algorithms exist for learning a BN from data.
• Learning a BN can:
  ◦ Reveal the underlying structure of the domain
  ◦ Direct relations between variables
  ◦ Find causal influence
  ◦ Discover hidden variables

Issues:
• Massive number of variables (thousands)
• Small number of samples (dozens)
• Sparse networks (only a small number of genes directly affect one another)
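
The sparseness issue above can be made concrete by counting parameters. The following is a minimal sketch, assuming every gene is discretized into three levels (as in the preliminary experiments) and using a hypothetical five-gene parent structure; the structure, the function names and the gene labels are illustrative assumptions, not the network from the slides.

```python
def free_parameters(parents, levels=3):
    """Count free parameters of a discrete Bayesian network.

    parents maps each variable to the list of its parents. A node with
    k parents has levels**k parent configurations, and each configuration
    needs (levels - 1) free probabilities.
    """
    return sum((levels - 1) * levels ** len(pa) for pa in parents.values())


def full_joint_parameters(n_vars, levels=3):
    """Free parameters of an unrestricted joint table over n_vars variables."""
    return levels ** n_vars - 1


# Hypothetical sparse structure (assumption): E and B influence A,
# and A influences C and D.
toy_parents = {"E": [], "B": [], "A": ["E", "B"], "C": ["A"], "D": ["A"]}

sparse_count = free_parameters(toy_parents)
full_count = full_joint_parameters(len(toy_parents))
print(sparse_count, full_count)
```

With only dozens of samples, keeping the number of parents per gene small is what keeps the number of parameters to estimate manageable; the count grows exponentially in the number of parents, not in the number of genes.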
Technical Challenges
Crucial Aspects:
• Computational complexity
• Statistical significance of features in learned models
To address these issues we developed:
• Sparse Candidate algorithm: an efficient heuristic search that relies on sparseness
  1. Choose a candidate set for direct influence for each gene
  2. Find the optimal BN constrained on the candidates
  3. Iteratively improve the candidate set
  [Figure labels: "candidates", "parents in BN"]
• Bootstrap confidence estimates
  ◦ Use resampling to generate perturbations of the training data
  ◦ Use the number of times a feature is repeated among the networks learned from these datasets to estimate the confidence of Bayesian network features

Network Learned
[Figure: learned network, with a legend of confidence ranges: 0.9-1.0, 0.8-0.9, 0.7-0.8, 0.6-0.7, 0.5-0.6, 0.4-0.5, 0.0-0.4]
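
The bootstrap idea above can be sketched in a few lines. This is a simplified stand-in, not the authors' implementation: it assumes candidates are ranked by pairwise mutual information, it skips the constrained structure search entirely, and the "feature" whose confidence is estimated is "gene h is a candidate for gene g" rather than an edge in a learned BN. The 0.5 discretization threshold, the function names and the toy data are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def discretize(value, threshold=0.5):
    """Map an expression value to -1, 0 or +1 (three levels).
    The threshold is an illustrative assumption."""
    if value > threshold:
        return 1
    if value < -threshold:
        return -1
    return 0


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def candidate_sets(samples, k):
    """For each gene, pick the k other genes with the highest mutual information.
    samples is a list of dicts mapping gene name -> discrete level."""
    genes = sorted(samples[0])
    columns = {g: [s[g] for s in samples] for g in genes}
    result = {}
    for g in genes:
        scored = sorted(((mutual_information(columns[g], columns[h]), h)
                         for h in genes if h != g), reverse=True)
        result[g] = {h for _, h in scored[:k]}
    return result


def bootstrap_confidence(samples, k=1, n_boot=100, seed=0):
    """Fraction of bootstrap replicates in which h is a candidate for g."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        # Resample the experiments with replacement to perturb the data.
        resampled = [rng.choice(samples) for _ in samples]
        for g, cands in candidate_sets(resampled, k).items():
            counts.update((g, h) for h in cands)
    return {pair: c / n_boot for pair, c in counts.items()}


# Toy data (assumption): B tracks A exactly, C varies independently of A.
raw = [{"A": [-1.0, 0.0, 1.0][i % 3],
        "B": [-1.2, 0.0, 1.2][i % 3],
        "C": [-1.0, 0.0, 1.0][(i // 3) % 3]} for i in range(30)]
samples = [{g: discretize(v) for g, v in row.items()} for row in raw]
confidence = bootstrap_confidence(samples, k=1, n_boot=100)
```

Pairs that keep reappearing across replicates get a confidence close to 1, while pairs that depend on a few particular samples get a low value. Thresholding these frequencies corresponds to the "pairs with 80% confidence" step in the evaluation.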

