A Factor Graph Model for Minimal Gene Set Enrichment Analysis
Diana Uskat, Computational Biology, Gene Center Munich


Motivation

Problem outline:
- Single-gene analysis of microarray experiments entails a large multiple testing problem.
- Even after appropriate multiple testing correction, the result is usually a long list of differentially expressed genes.
- Interpreting such a list by hand is difficult.

Possible improvement: gene set enrichment analysis
1. Group genes into biologically meaningful categories (Gene Ontology, KEGG pathways, transcription factor targets).
2. Use a statistical method to find the categories that are enriched for differentially expressed genes.

[Figure: cutout of the Gene Ontology graph, from Ontologizer by S. Bauer, J. Gagneur, P. N. Robinson (NAR 2010)]
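Step 2 does not name a specific test. One common choice for a single category is a one-sided hypergeometric (Fisher-type) test, sketched here as an illustration only. The function name and all numbers are hypothetical and are not taken from the slides.

```python
from math import comb

def hypergeom_enrichment_pvalue(n_genes, n_de, category_size, n_de_in_category):
    """One-sided p-value P(X >= n_de_in_category).

    X is the number of differentially expressed (DE) genes in a category of
    `category_size` genes drawn without replacement from `n_genes` genes,
    of which `n_de` are DE.
    """
    upper = min(n_de, category_size)
    total = comb(n_genes, category_size)
    tail = sum(
        comb(n_de, k) * comb(n_genes - n_de, category_size - k)
        for k in range(n_de_in_category, upper + 1)
    )
    return tail / total

# Hypothetical toy numbers: 1000 genes, 50 of them DE, a category of 20 genes.
p_enriched = hypergeom_enrichment_pvalue(1000, 50, 20, 8)    # 8 DE genes in the category
p_background = hypergeom_enrichment_pvalue(1000, 50, 20, 1)  # only 1 DE gene in the category
print(p_enriched, p_background)
```

Running such a test once per category is exactly what produces the second multiple testing problem that the following slides address.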

Motivation

Established methods:
- GSEA (Subramanian, Tamayo)
- topGO (Alexa)
- globaltest (Goeman, Mansmann)
- GOstats (Falcon, Gentleman)

Drawbacks:
- There are often thousands of overlapping categories, and genes can belong to multiple categories. This creates a difficult new multiple testing problem.
- Group testing often returns a large number of significant categories, which makes it difficult to identify the biologically relevant ones.

Minimal Gene Set Enrichment

Idea (Bauer, Gagneur et al., Nucleic Acids Research 2010):
- Search for a sparse explanation, i.e. a minimal number of categories that explain the data sufficiently well.
- Use a simple probabilistic graphical model relating categories and genes, and do Bayesian inference on the marginal posterior of each category.

[Figure: two copies of a bipartite graph with categories T1, T2, T3 above genes E1, E2, E3. An edge means, for example, "gene E3 is an element of category T3"; coloured nodes are "on". One panel shows a correct explanation, the other a correct minimal explanation.]
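The difference between a correct and a correct minimal explanation can be made concrete with a noise-free toy search. This is only a sketch. The category memberships are hypothetical (the edges of the slide's figure are not recoverable), and the exhaustive search stands in for the Bayesian inference the talk actually uses.

```python
from itertools import combinations

def minimal_explanations(category_members, active_genes):
    """Return all smallest sets of categories whose union is exactly the active genes."""
    names = sorted(category_members)
    target = set(active_genes)
    for size in range(len(names) + 1):
        hits = [
            combo
            for combo in combinations(names, size)
            if set().union(*(category_members[c] for c in combo)) == target
        ]
        if hits:  # the first size with a hit is the minimal one
            return hits
    return []

# Hypothetical memberships: T3 alone covers what T1 and T2 cover together.
categories = {
    "T1": {"E1", "E2"},
    "T2": {"E2", "E3"},
    "T3": {"E1", "E2", "E3"},
}
print(minimal_explanations(categories, {"E1", "E2", "E3"}))  # [('T3',)]
```

Here {T1, T2} also explains all three active genes, but {T3} does so with fewer categories. Exhaustive search grows exponentially with the number of categories, which is one reason to prefer probabilistic inference with a sparsity-favouring prior.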

Minimal Gene Set Enrichment: the model

[Figure: three-layer graph with categories T1, T2, T3, genes E1, E2, E3, and observations (data) D1, D2, D3.]

A Bayesian network factorization of the full posterior into likelihood and prior terms.
[Equation: posterior, likelihood, prior; the formula itself is not preserved in the transcript.]

Main trick: use a prior that favours sparse solutions.
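Judging from the three-layer graph (categories T, gene states E, observations D) and the labels posterior, likelihood and prior, the factorization presumably has the following shape. This is a reconstruction under that assumption, not the slide's original equation. The per-gene products and the parent-set notation pa(i) are also assumed.

```latex
\Pr(T, E \mid D) \;\propto\;
  \underbrace{\Pr(D \mid E)}_{\text{likelihood}} \cdot
  \underbrace{\Pr(E \mid T)\,\Pr(T)}_{\text{prior}}
  \;=\; \prod_i \Pr(D_i \mid E_i) \cdot \prod_i \Pr\bigl(E_i \mid T_{\mathrm{pa}(i)}\bigr) \cdot \Pr(T)
```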

Our method: Factor Graphs

Graphical model (Kschischang et al., IEEE 2001):
- Bipartite graph with factor nodes and variable nodes.
- Each factor node encodes a function of its neighbouring variables.
- Marginal distributions can be computed efficiently with the sum-product algorithm (if the factor graph is a tree).

[Figure: the three-layer graph of categories T1, T2, T3, genes E1, E2, E3 and observations D1, D2, D3.]

Factor Graphs (continued): factor nodes f1, f2, f3 are added to the graph. Pr(D|E) is given by the dataset.

Factor Graphs (continued): factor nodes g1 to g6 are added to the graph. A gene state E is only active if at least one of its parent categories is active.

Factor Graphs (continued): a factor node fT is added to the graph. [The formula defining fT is not preserved in the transcript.]
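The roles of the three kinds of factors can be illustrated on a graph small enough to enumerate. The sketch below makes several assumptions that are not in the slides: the edges in `parents`, the values in `f`, a single g-factor per gene (the slide shows six g nodes) that forces E to be the logical OR of its parent categories, and an independent Bernoulli prior with probability `prior_on` standing in for fT.

```python
from itertools import product

categories = ["T1", "T2", "T3"]
genes = ["E1", "E2", "E3"]
# Hypothetical edges: which categories each gene belongs to.
parents = {"E1": ["T1", "T3"], "E2": ["T1", "T2", "T3"], "E3": ["T2", "T3"]}
# f-factors: (Pr(D_i | E_i = 0), Pr(D_i | E_i = 1)); made-up values in which
# the data suggest E1 and E2 are active and E3 is not.
f = {"E1": (0.1, 0.9), "E2": (0.2, 0.8), "E3": (0.9, 0.1)}
prior_on = 0.2  # assumed sparse prior: each category is "on" with probability 0.2

def g(e, parent_states):
    """Assumed g-factor: E is active exactly when at least one parent is active."""
    return 1.0 if e == int(any(parent_states)) else 0.0

def category_marginals():
    """Posterior Pr(T_j = 1 | D) by brute-force enumeration of all category states."""
    weight_on = {name: 0.0 for name in categories}
    total = 0.0
    for t_states in product((0, 1), repeat=len(categories)):
        t = dict(zip(categories, t_states))
        w = 1.0
        for state in t_states:  # prior factor
            w *= prior_on if state else 1.0 - prior_on
        for gene in genes:  # sum out each E_i given its parents
            w *= sum(f[gene][e] * g(e, [t[p] for p in parents[gene]]) for e in (0, 1))
        total += w
        for name in categories:
            if t[name]:
                weight_on[name] += w
    return {name: weight_on[name] / total for name in categories}

marginals = category_marginals()
print(marginals)
```

With these toy numbers T1, which contains exactly the two genes that look active, receives most of the posterior mass. Enumeration is only feasible for a handful of categories, which is where message passing comes in.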

Estimation Methods for Factor Graphs

Computation of the posterior for T, E:
- Message-passing algorithm: the sum-product algorithm.
- It stops at the correct result after one round if the graph has a tree structure.
- There are no guarantees if the graph has cycles (e.g. oscillation may occur); however, it works well in practice.

Principle:
- Start in the leaf nodes.
- Message propagation:
  - variable to factor node: product of the other incoming messages ("Product")
  - factor to variable node: factor times incoming messages, summed over the factor's other variables ("Sum")
- Termination: compute the marginal distributions of the variable nodes.
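On a tree, a single sweep from the leaves gives exact marginals. The sketch below checks this on a hypothetical three-variable chain (fA, A, fAB, B, fBC, C) with made-up factor tables. It is not the talk's graph and not the authors' implementation.

```python
from itertools import product

# Hypothetical tree-shaped factor graph: fA - A - fAB - B - fBC - C (all binary).
f_a = [0.3, 0.7]                    # f_a[a]
f_ab = [[0.9, 0.1], [0.2, 0.8]]     # f_ab[a][b]
f_bc = [[0.6, 0.4], [0.1, 0.3]]     # f_bc[b][c]

def normalise(values):
    total = sum(values)
    return [v / total for v in values]

def sum_product_marginal_b():
    """Marginal of B from messages sent inwards from the leaves."""
    # Leaf factor fA -> variable A: the factor itself.
    msg_fa_to_a = f_a
    # Variable A -> factor fAB: product of A's other incoming messages (only fA).
    msg_a_to_fab = msg_fa_to_a
    # Factor fAB -> variable B: multiply by the incoming message, sum over a.
    msg_fab_to_b = [sum(f_ab[a][b] * msg_a_to_fab[a] for a in (0, 1)) for b in (0, 1)]
    # Leaf variable C -> factor fBC: the unit message.
    msg_c_to_fbc = [1.0, 1.0]
    # Factor fBC -> variable B: sum over c.
    msg_fbc_to_b = [sum(f_bc[b][c] * msg_c_to_fbc[c] for c in (0, 1)) for b in (0, 1)]
    # Termination: the marginal is the normalised product of all incoming messages.
    return normalise([msg_fab_to_b[b] * msg_fbc_to_b[b] for b in (0, 1)])

def brute_force_marginal_b():
    """Reference result: sum the full joint over A and C."""
    weights = [0.0, 0.0]
    for a, b, c in product((0, 1), repeat=3):
        weights[b] += f_a[a] * f_ab[a][b] * f_bc[b][c]
    return normalise(weights)

bp = sum_product_marginal_b()
exact = brute_force_marginal_b()
print(bp, exact)
```

The two results agree because the chain has no cycles. On a graph with cycles the same message updates would have to be iterated, with no guarantee of convergence.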

Application: Yeast Salt Stress

Categories: transcription factors (with their targets) instead of GO categories.

Given:
- A list of transcription factors with their target genes
- A list of genes (with their p-values) from a yeast salt stress experiment

Question: Which transcription factors are active during salt stress?

Task: Find a set of transcription factors that are most likely to be active.

[Figure: bipartite graph of transcription factors TF1, TF2 and genes g1 to g5. An edge means, for example, "g2 is a target of TF2".]


Results

- ~2,000 genes, 118 transcription factors.
- Graph obtained from the re-analysis of the Harbison TF binding data (Nature 2004) by MacIsaac et al. (BMC Bioinformatics 2006).

Figure legend:
- Previously known transcription factors involved in salt stress (Capaldi et al., Nat. Gen. 2008; Wu and Chen, Bioinform. Biol. Insights 2009)
- Differentially phosphorylated transcription factors (Soufi et al., Mol. Biosyst. 2009)

Transcription factors shown: YML081W, DAL81, STB4, HSF1, UME6, SNT2, RGT1, MET28, MSN2, GAL4, SKO1

Summary and Outlook

- Lists of (meaningful) gene sets are better than lists of genes.
- The search for biologically meaningful explanations requires a new minimal model (MGSE) for gene set enrichment analysis.
- We use factor graphs for parameter estimation.
- Wide application to GO analysis, TF-target analysis and pathway enrichment.

To do: scalability and speed.

Acknowledgments

Gene Center Munich: Achim Tresch, Theresa Niederberger, Björn Schwalb, Sebastian Dümcke

Collaborating partners:
- Gene Center Munich: Patrick Cramer, Christian Miller, Daniel Schulz, Dietmar Martin, Andreas Mayer
- EMBL Heidelberg: Julien Gagneur

(Talk, Nov. 2009, working group conference of the GMDS "AG Statistische Methoden in der Bioinformatik", Munich)