Combining heterogeneous data to reverse engineer regulatory networks Allan Tucker School of Information Systems Computing and Mathematics, Brunel University,

Presentation transcript:

Combining heterogeneous data to reverse engineer regulatory networks Allan Tucker School of Information Systems Computing and Mathematics, Brunel University, London. UB8 3PH

Intelligent Data Analysis
IDA attempts to deal with the data explosion by discovering patterns and knowledge from data.
Typical analysis tasks: clustering, classification, feature selection, prediction and forecasting, structure identification.

Bayesian Networks
An IDA method for modelling a domain using probabilities.
Easily interpreted by non-statisticians.
Can be used to combine existing knowledge with data.
Essentially, they use independence assumptions to model the joint distribution of a domain.
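The role of the independence assumptions can be illustrated with a minimal sketch. The three-gene chain A -> B -> C and every probability in it are invented for illustration and do not come from the talk; the point is only that the joint distribution is stored as a product of small conditional tables.

```python
from itertools import product

# Hypothetical three-gene chain A -> B -> C with binary states (0 = low, 1 = high).
# All probabilities are made up for illustration.
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}  # p_c_given_b[b][c]


def joint(a, b, c):
    """P(A=a, B=b, C=c), assuming C is independent of A given B."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]


def marginal_c(c):
    """P(C=c), obtained by summing the joint over A and B."""
    return sum(joint(a, b, c) for a, b in product((0, 1), repeat=2))


# The factorised joint is a proper distribution: it sums to one.
total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))
```

The factorisation needs 1 + 2 + 2 = 5 free parameters here, against 7 for an unrestricted joint over three binary variables; the saving grows quickly with the number of genes.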

Informative Priors
To build BNs we can also use prior structures and probabilities, which are then updated with data.
These priors are usually uniform (equal probability).
Informative priors are used to incorporate existing knowledge into BNs.
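As a rough illustration of how a prior is updated with data, the sketch below uses a Dirichlet prior over a categorical parameter, which is one common choice and an assumption here rather than something stated in the talk. The pseudo-counts and observed counts are hypothetical.

```python
def posterior_mean(prior_counts, data_counts):
    """Posterior mean of a categorical parameter under a Dirichlet prior.

    prior_counts are Dirichlet pseudo-counts; data_counts are observed counts.
    """
    totals = [p + d for p, d in zip(prior_counts, data_counts)]
    norm = sum(totals)
    return [t / norm for t in totals]


# Hypothetical data: a gene observed 3 times "low" and 7 times "high".
data = [3, 7]

# Uniform prior: equal pseudo-counts, so the data dominate.
uniform = posterior_mean([1, 1], data)

# Informative prior: existing knowledge suggests "low" is more likely.
informative = posterior_mean([8, 2], data)
```

With the uniform prior the estimate for "low" is 4/12; with the informative prior it is pulled up to 11/20, showing how prior knowledge shifts the result when data are scarce.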

Microarray Data
A major source of data on gene expression activity.
The technology takes measurements over thousands of genes simultaneously.
Gene Regulatory Networks (GRNs) model how genes interact.
Eliciting reliable GRNs from data is key to understanding biological mechanisms.

But...
Reliability issues surround microarray gene expression data.
Mechanisms in different systems and species.
Can we build GRN models that have enhanced performance, based on a richer and/or broader collection of data than a single microarray dataset?

The talk
Incorporating literature priors
Consensus networks
Models of Increasing Complexity
Interspecies analysis

Literature-based priors
Information about biomedical concepts such as genes is summarized using concept profiling (Jelier et al., 2007; Schuemie et al., 2007a).
Concept profiling combines information from several databases, including Entrez Gene, UniProt, and the Saccharomyces Genome Database.
A concept profile is a vector of concepts with weights.
Each weight represents the uncertainty between the occurrence of one concept and another.
Steele, E., Tucker, A., 't Hoen, P.A.C. and Schuemie, M.J. (2009), Literature-Based Priors for Gene Regulatory Networks, Bioinformatics 25 (14) :

Literature-based priors
Perform Pearson correlation on the concept profiles of genes to create a literature matrix.
Translate correlations into probabilities using confidence scores. Each score represents the probability that a particular correlation was not drawn from the distribution of random gene-pair correlations.
This is not equal to the probability that an edge exists; see Segal et al. (2002) and Efron (2007).
Incorporate as a prior into the BIC score: $\mathrm{BIC} = w \log P(S) + \log P(S \mid D) - k \log(n)$
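The pipeline on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the confidence score is the empirical fraction of random gene-pair correlations that fall below the observed one, and that the complexity penalty k log(n) is subtracted from the score. The function names and toy profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant weight vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def confidence(r, null_correlations):
    """Fraction of random gene-pair correlations strictly below r (assumed form)."""
    below = sum(1 for v in null_correlations if v < r)
    return below / len(null_correlations)


def prior_weighted_score(log_prior, log_fit, k, n, w):
    """w * log P(S) + fit term - k * log(n); the penalty form is an assumption."""
    return w * log_prior + log_fit - k * math.log(n)


# Hypothetical concept profiles for two genes (weights over four concepts).
profile_a = [0.9, 0.1, 0.4, 0.0]
profile_b = [0.8, 0.2, 0.5, 0.1]
r = pearson(profile_a, profile_b)

# Hypothetical null distribution of correlations between random gene pairs.
null = [-0.2, 0.0, 0.1, 0.7]
edge_confidence_score = confidence(r, null)
```

Setting `w = 0` recovers a score with no literature influence, while larger `w` lets the prior term weigh more heavily against the data fit.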

The Experiments
Test our approach on synthetic networks generated using differential equations, and on yeast and E. coli studies with known regulatory structures.
Report on ROC analysis:
True Positives: links that are correctly identified
False Positives: links that are incorrectly identified
False Negatives: links that are missed
True Negatives: links that are correctly left out
Also report predictive power using cross-validation (CV).
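The four link counts used in the ROC analysis can be computed from a known network and a learnt one as in the sketch below. It assumes directed links without self-loops; the node names and edge sets are hypothetical.

```python
def confusion_counts(nodes, true_edges, predicted_edges):
    """Count TP, FP, FN, TN over all possible directed links (no self-loops)."""
    true_edges, predicted_edges = set(true_edges), set(predicted_edges)
    possible = len(nodes) * (len(nodes) - 1)
    tp = len(predicted_edges & true_edges)   # correctly identified links
    fp = len(predicted_edges - true_edges)   # incorrectly identified links
    fn = len(true_edges - predicted_edges)   # missed links
    tn = possible - tp - fp - fn             # links correctly left out
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


def roc_point(counts):
    """Return (false positive rate, true positive rate) for one learnt network."""
    tpr = counts["TP"] / (counts["TP"] + counts["FN"])
    fpr = counts["FP"] / (counts["FP"] + counts["TN"])
    return fpr, tpr


# Hypothetical four-gene example.
nodes = ["A", "B", "C", "D"]
known = {("A", "B"), ("B", "C"), ("C", "D")}
learnt = {("A", "B"), ("B", "C"), ("A", "D")}
counts = confusion_counts(nodes, known, learnt)
fpr, tpr = roc_point(counts)
```

Sweeping a confidence threshold on the learnt edges and plotting the resulting (FPR, TPR) pairs would trace out the ROC curve.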

Yeast and E. coli Network Analysis
Issues with circularity when validating.

Predictive accuracy 

Literature Priors Conclusions
A literature prior weight of between 0.4 and 0.6 appears to be the best choice for identifying relevant regulatory edges on human data for mechanisms involving muscular dystrophy.
Higher prior weights lead to the inclusion of too many edges (literature associations that are not of a regulatory nature).
This is a lower weight than the optimum prior weights found for yeast and E. coli, perhaps because there is less literature on the human organism, whereas yeast and E. coli are both well studied.

Consensus Bayesian Networks
Different platforms involve different biases: e.g. oligonucleotide platforms estimate the absolute value of expression, whereas cDNA platforms measure relative differences between genes.
Previous research established that comparing datasets using standard normalisation is not straightforward.
This is an attempt to combine multiple microarray data sources through post-learning aggregation.
Steele, E. and Tucker, A., “Consensus and Meta-analysis regulatory networks for combining multiple microarray gene expression datasets”, Journal of Biomedical Informatics 41(6), pp , 2008

Consensus Bayes Networks

Consensus Bayesian Networks
Bootstrap each dataset to generate robust networks with confidence estimates.
Threshold the confidence and generate a PDAG (due to equivalence classes).
The consensus looks for edges with enough support in the input networks.
Edge direction is based on voting among the inputs; an edge is left undirected if there is no consensus or if cycles cannot be resolved.
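The steps above can be sketched in simplified form. This is an illustration under stated assumptions, not the published algorithm: networks are plain sets of directed edges, equivalence classes (PDAGs) and cycle resolution are omitted, and the function names, cutoff, and support threshold are hypothetical.

```python
from collections import Counter


def edge_confidence(bootstrap_networks):
    """Fraction of bootstrap networks that contain each directed edge."""
    counts = Counter(e for net in bootstrap_networks for e in set(net))
    return {e: c / len(bootstrap_networks) for e, c in counts.items()}


def threshold(confidences, cutoff):
    """Keep edges whose bootstrap confidence reaches the cutoff."""
    return {e for e, c in confidences.items() if c >= cutoff}


def consensus(networks, min_support):
    """Combine per-dataset networks; returns (directed, undirected) edge sets.

    An unordered pair is kept if at least min_support networks link it in
    either direction. Direction is decided by majority vote; ties stay
    undirected. Self-loops are assumed absent.
    """
    support, votes = Counter(), Counter()
    for net in networks:
        for pair in {frozenset(e) for e in net}:
            support[pair] += 1
        for edge in set(net):
            votes[edge] += 1
    directed, undirected = set(), set()
    for pair, s in support.items():
        if s < min_support:
            continue
        u, v = sorted(pair)
        if votes[(u, v)] > votes[(v, u)]:
            directed.add((u, v))
        elif votes[(v, u)] > votes[(u, v)]:
            directed.add((v, u))
        else:
            undirected.add((u, v))
    return directed, undirected


# Hypothetical bootstrap results for one dataset.
boot = [[("A", "B")], [("A", "B"), ("B", "C")], [("B", "C")], [("A", "B")]]
robust = threshold(edge_confidence(boot), cutoff=0.6)

# Hypothetical thresholded networks from three datasets.
nets = [
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("C", "B")},
    {("B", "A"), ("C", "D")},
]
directed, undirected = consensus(nets, min_support=2)
```

In this toy input, A -> B wins the vote two to one, B and C are linked in two networks but with opposite directions so the edge stays undirected, and C -> D is dropped for lack of support.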

Consensus Bayes Networks

E. coli

Yeast

Weighting networks
Steele, E. and Tucker, A., Selecting and Weighting Data for Building Consensus Gene Regulatory Networks, Advances in Intelligent Data Analysis VIII: 8th International Symposium on Intelligent Data Analysis (IDA 2009). Lecture Notes in Computer Science, volume 5772: , 2009

c) Models of Increasing Complexity
Specification of three muscle differentiation datasets.
Anvar, S.Y., 't Hoen, P.A.C. and Tucker, A. (2010), The Identification of Informative Genes from Multiple Datasets with Increasing Complexity, BMC Bioinformatics 11 : 32

MIC
Select one dataset for training; the others become test sets.
Score the mean and variance of the sum of squared errors (SSE) using CV and the independent test sets.
Use these to rank genes.
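One way to turn the mean and variance of SSE into a ranking is sketched below. The slide does not say how the two statistics are combined, so ordering by mean and then by variance is an assumption made for illustration; the gene labels and error values are hypothetical.

```python
from statistics import mean, pvariance


def rank_genes(sse_by_gene):
    """Rank genes by mean SSE, breaking ties by variance (assumed criterion).

    sse_by_gene maps a gene to its SSE values, one per evaluation
    (cross-validation on the training set plus each independent test set).
    """
    summary = {g: (mean(v), pvariance(v)) for g, v in sse_by_gene.items()}
    return sorted(summary, key=lambda g: summary[g])


# Hypothetical SSE values: [CV on training set, test set 1, test set 2].
sse = {
    "g1": [0.10, 0.12, 0.11],   # low and consistent error
    "g2": [0.05, 0.40, 0.30],   # good on CV, poor on independent data
    "g3": [0.20, 0.20, 0.20],   # moderate but stable error
}
ranking = rank_genes(sse)
```

A gene such as `g2`, which looks best under cross-validation alone, drops to the bottom once the independent test sets are taken into account.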

MIC - Datasets
All are concerned with the differentiation of cells into the muscle (myogenic) lineage.
The in vitro system mimics the formation of new muscle fibres in vivo.
Cao uses embryonic fibroblasts; the others use a tumor cell line that has the potential to differentiate into different lineages (mainly muscle and bone).
Cao uses MyoD and MyoG to force cell differentiation (the others use serum starvation).
Sartorelli includes different treatments that affect timing and efficiency.

MIC
Select genes using one dataset (black) at a time, and compare the average CV error rate of a BN classifier learnt on that dataset with its error rate when validated on the other two datasets independently (grey).
Cao does well on CV but overfits.
Tomczak does well on both.

MIC
Select 100 informative genes (KS test) and 50 uninformative genes.
Train a BN classifier on Tomczak and test on Sartorelli.
Rank genes according to average error rate.
Score the average improvement or deterioration of the myogenesis-related, top 100, and 50 randomly selected genes in Sartorelli.
Compare our method with rankings generated by the concordance model.
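The KS test mentioned for gene selection rests on the two-sample Kolmogorov-Smirnov statistic, sketched here with the standard library only. How the talk's experiments grouped the samples or set the selection cutoff is not stated, so the two groups and their values below are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    for point in sorted(set(xs + ys)):
        cdf_x = bisect_right(xs, point) / len(xs)
        cdf_y = bisect_right(ys, point) / len(ys)
        gap = max(gap, abs(cdf_x - cdf_y))
    return gap


# Hypothetical expression values of one gene in two groups of samples.
undifferentiated = [1.0, 2.0, 3.0, 4.0]
differentiated = [3.0, 4.0, 5.0, 6.0]
d = ks_statistic(undifferentiated, differentiated)
```

Genes with a large statistic separate the two groups well and would be candidates for the informative set; genes with a statistic near zero would fall into the uninformative set.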

MIC Conclusions
Genes from the pool of differentially expressed genes that are highly predictive and consistent across independent datasets are more likely to be fundamentally involved in the biological process under study.
Results imply that gene regulatory networks identified in simpler systems can be used to model more complex biological systems.

MIC Conclusions
E.g. muscle differentiation: the myogenesis-related network is difficult to derive from in vivo experiments due to the presence of multiple cell types and higher biological variation.
But it may become evident after initial training of the network on the cleaner in vitro experiments.

Inter-species Mechanisms

Summary
Explored a number of novel techniques for building more reliable GRNs:
Incorporating exogenous knowledge in the form of BN priors constructed from biological abstracts
Consensus algorithms for post-learning aggregation of data / networks
Models of increasing complexity for identifying genes that are more confidently associated with a biological process
Future work: extending MIC to inter-organism mechanisms

Thanks
Dr Emma Steele, previously Brunel
Mr Yahya Anvar & Dr Peter-Bram 't Hoen, Leiden University Medical School, Netherlands