Genomic Data Integration

Presentation transcript:

Genomic Data Integration Curtis Huttenhower 07-10-10 Harvard School of Public Health Department of Biostatistics

A Definition of Integrative Data Mining. [Diagram: prior knowledge links genes to functions (gene → function); genomic data link genes to data and data to functions (gene → data → function).]

Machine Learning for Data Integration. [Figure: 100Ms of gene pairs, some known positive (+), some known negative (-), and some unknown (?), are scored across 1Ks of datasets. Each dataset contributes evidence such as high vs. low correlation, colocalized vs. not colocalized, or similar vs. dissimilar, summarized as frequency distributions. The evidence is combined into a probability per pair, e.g. P(G2-G5|Data) = 0.85.]
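The posterior on this slide, P(G2-G5|Data), can be illustrated with a naive Bayes combination of per-dataset evidence. This is only a minimal sketch: it assumes datasets are conditionally independent given the relationship status, and the prior and likelihood values are hypothetical, not taken from the talk.

```python
from math import prod


def posterior_functional_relationship(prior, evidence):
    """Combine per-dataset evidence for one gene pair with naive Bayes.

    prior    -- assumed P(pair is functionally related) before seeing data
    evidence -- list of (P(obs | related), P(obs | unrelated)), one per dataset
    """
    like_related = prod(p_rel for p_rel, _ in evidence)
    like_unrelated = prod(p_unrel for _, p_unrel in evidence)
    numerator = prior * like_related
    return numerator / (numerator + (1.0 - prior) * like_unrelated)


# Hypothetical likelihoods for one gene pair in three datasets, e.g. a
# high-correlation bin, a colocalization flag, and a similarity bin.
evidence = [(0.6, 0.2), (0.5, 0.3), (0.4, 0.5)]
posterior = posterior_functional_relationship(prior=0.05, evidence=evidence)
print(f"P(related | data) = {posterior:.3f}")
```

In practice the likelihood tables would be learned from the known positive and negative gene pairs, one table per dataset.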

Machine Learning for Data Integration Jansen 2003 Troyanskaya 2003 Functional Relationship Golub 1999 Butte 2000 Whitfield 2002 Hansen 1998

Alternative Data Integration Frameworks Lee 2004 Lanckriet 2004 Aerts 2006

Functional Networks. Examples: global interaction network, metabolism network, conserved network, kidney network. Resources: function.princeton.edu/hefalmp, string-db.org, funcnet.eu, homes.esat.kuleuven.be/~bioiuser/endeavour

Biological Networks: Clusters, Hubs, Bottlenecks, and Flow

Biological Networks: Network Motifs (Milo 2002, Alon 2007). Bi-fan: WGD and evolvability. Feedback: memory. Positive auto-regulation: delay. Negative auto-regulation: speed + stability. Coherent feed-forward: filter. Incoherent feed-forward: pulse. Tools: www.weizmann.ac.il/mcb/UriAlon/groupNetworkMotifSW.html, mavisto.ipk-gatersleben.de, theinf1.informatik.uni-jena.de/~wernicke/motifs

Predicting Gene Function. Predicted relationships between genes (edge confidence: high to low). Network shown: cell cycle genes.

Predicting Gene Function (Huttenhower 2009, Pena-Castillo 2008, Rodrigues 2007). Predicted relationships between genes (edge confidence: high to low). Network shown: cell cycle genes. These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
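One way to turn edges into a per-gene score is guilt by association: compare a gene's average connection to the known process genes with its average connection to the whole genome. The sketch below is an assumption about how such a score could be computed, not the method used in the cited papers; the gene names and edge weights are hypothetical.

```python
def process_participation_score(gene, process_genes, weights, all_genes):
    """Ratio of a gene's mean edge weight to a process vs. to the genome.

    weights maps frozenset({gene_a, gene_b}) to a predicted relationship
    confidence in [0, 1]; missing pairs are treated as weight 0.
    """
    def mean_weight(targets):
        others = [g for g in targets if g != gene]
        if not others:
            return 0.0
        total = sum(weights.get(frozenset((gene, g)), 0.0) for g in others)
        return total / len(others)

    within = mean_weight(process_genes)
    background = mean_weight(all_genes)
    return within / background if background > 0 else 0.0


# Hypothetical network: B and C are annotated to the process of interest.
weights = {
    frozenset(("A", "B")): 0.9, frozenset(("A", "C")): 0.8,
    frozenset(("A", "D")): 0.1, frozenset(("B", "C")): 0.7,
    frozenset(("B", "D")): 0.1, frozenset(("C", "D")): 0.1,
}
all_genes = ["A", "B", "C", "D"]
score_a = process_participation_score("A", ["B", "C"], weights, all_genes)
score_d = process_participation_score("D", ["B", "C"], weights, all_genes)
print(score_a, score_d)
```

A ratio above 1 means the gene is more strongly tied to the process than to the genomic background, which is what makes the association specific rather than merely strong.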

Comprehensive Validation of Computational Predictions (Hess, 2009; Hibbs, 2009). Genomic data and prior knowledge feed computational predictions of gene function (SPELL, Hibbs et al 2007; bioPIXIE, Myers et al 2005; MEFIT). Genes predicted to function in mitochondrion organization and biogenesis are tested in laboratory experiments: petite frequency, growth curves, confocal microscopy. New known functions for correctly predicted genes are used for retraining.

Evaluating the Performance of Computational Predictions (Huttenhower, 2009). Genes involved in mitochondrion organization and biogenesis: 106 original GO annotations; 135 under-annotations; 82 novel confirmations, first iteration; 17 novel confirmations, second iteration. 340 total: >3x previously known genes in ~5 person-months.

Evaluating the Performance of Computational Predictions (Huttenhower, 2009). Genes involved in mitochondrion organization and biogenesis: 106 original GO annotations; 95 under-annotations; 40 confirmed under-annotations; 80 novel confirmations, first iteration; 17 novel confirmations, second iteration. 340 total: >3x previously known genes in ~5 person-months. Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

Functional Mapping: Mining Integrated Networks. Predicted relationships between genes (edge confidence: high to low). Network shown: chemotaxis. The strength of these relationships indicates how cohesive a process is.

Functional Mapping: Mining Integrated Networks. Predicted relationships between genes (edge confidence: high to low). Network shown: chemotaxis.

Functional Mapping: Mining Integrated Networks. Predicted relationships between genes (edge confidence: high to low). Networks shown: chemotaxis and flagellar assembly. The strength of these relationships indicates how associated two processes are.

Functional Mapping: Associations Between Gene Sets. Processes shown: hydrogen transport, electron transport, cellular respiration, protein processing, peptide metabolism, cell redox homeostasis, aldehyde metabolism, energy reserve metabolism, vacuolar protein catabolism, negative regulation of protein metabolism, organelle fusion, protein depolymerization, organelle inheritance. Edges: associations between processes (moderately strong to very strong).

Functional Mapping: Associations Between Gene Sets. Processes shown: hydrogen transport, electron transport, cellular respiration, protein processing, peptide metabolism, cell redox homeostasis, aldehyde metabolism, energy reserve metabolism, vacuolar protein catabolism, negative regulation of protein metabolism, organelle fusion, protein depolymerization, organelle inheritance. Edges: associations between processes (moderately strong to very strong). Borders: data coverage of processes (sparsely covered to well covered).

Functional Mapping: Associations Between Gene Sets. Processes shown: hydrogen transport, electron transport, cellular respiration, protein processing, peptide metabolism, cell redox homeostasis, aldehyde metabolism, energy reserve metabolism, vacuolar protein catabolism, negative regulation of protein metabolism, organelle fusion, protein depolymerization, organelle inheritance. Edges: associations between processes (moderately strong to very strong). Nodes: cohesiveness of processes (below baseline, baseline (genomic background), very cohesive). Borders: data coverage of processes (sparsely covered to well covered).
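The node and edge quantities in these maps can be sketched as averages over a weighted gene network, normalized by the genomic background. The definitions below are an assumed simplification for illustration (mean edge weight divided by the all-pairs baseline); the gene identifiers and weights are hypothetical.

```python
from itertools import combinations, product


def mean_edge(pairs, weights):
    """Mean predicted-relationship weight over gene pairs (missing = 0)."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(weights.get(frozenset(p), 0.0) for p in pairs) / len(pairs)


def cohesiveness(process, weights, baseline):
    """Within-process mean edge weight relative to the genomic background."""
    return mean_edge(combinations(sorted(process), 2), weights) / baseline


def association(proc_a, proc_b, weights, baseline):
    """Between-process mean edge weight relative to the genomic background."""
    pairs = [(a, b) for a, b in product(sorted(proc_a), sorted(proc_b)) if a != b]
    return mean_edge(pairs, weights) / baseline


# Hypothetical five-gene network with two small processes.
weights = {
    frozenset(("g1", "g2")): 0.9, frozenset(("g3", "g4")): 0.8,
    frozenset(("g1", "g3")): 0.6, frozenset(("g1", "g4")): 0.5,
    frozenset(("g2", "g3")): 0.7, frozenset(("g2", "g4")): 0.6,
    frozenset(("g1", "g5")): 0.1, frozenset(("g2", "g5")): 0.1,
    frozenset(("g3", "g5")): 0.1, frozenset(("g4", "g5")): 0.1,
}
all_genes = ["g1", "g2", "g3", "g4", "g5"]
process_a, process_b = {"g1", "g2"}, {"g3", "g4"}

baseline = mean_edge(combinations(all_genes, 2), weights)
node_score = cohesiveness(process_a, weights, baseline)
edge_score = association(process_a, process_b, weights, baseline)
print(baseline, node_score, edge_score)
```

Values near 1 correspond to the "baseline (genomic background)" level in the legend; values well above 1 correspond to very cohesive processes or very strong associations.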

Functional Maps: Focused Data Summarization. Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

Functional Maps: Focused Data Summarization. How can a biologist take advantage of all this data to study their favorite gene, pathway, or disease without losing information? Functional mapping connects very large collections of genomic data to: specific predicted molecular interactions; pathway, process, or disease associations; underlying experimental results and functional activities in data.

Efficient Computation for Biological Discovery. Massive datasets and genomes require efficient algorithms and implementations. Sleipnir is a C++ library for computational functional genomics. It provides data types for biological entities (microarray data, interaction data, genes and gene sets, functional catalogs, etc.), network communication and parallelization, and efficient machine learning algorithms, both generative (Bayesian) and discriminative (SVM). It is fully documented at huttenhower.sph.harvard.edu/sleipnir. It is also fast: a microbial data integration computation takes <3 hrs.

Thanks! Curtis Huttenhower Harvard School of Public Health Department of Biostatistics http://huttenhower.sph.harvard.edu

Meta-Analysis for Data Integration (Evangelou 2007).

Meta-Analysis for Data Integration (Evangelou 2007). Simple regression: all datasets are equally accurate. Random effects: variation within and among datasets and interactions.
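The contrast between the two models can be sketched numerically. The example below assumes each dataset reports an effect size with a within-dataset variance. It uses an unweighted mean for the "equally accurate" case and the DerSimonian-Laird estimator for the random-effects case. DerSimonian-Laird is a standard choice picked here for illustration, not one confirmed by the slides, and the input numbers are hypothetical.

```python
def simple_mean(effects):
    """Pooled effect when every dataset is treated as equally accurate."""
    return sum(effects) / len(effects)


def random_effects(effects, variances):
    """DerSimonian-Laird random-effects pooling.

    Returns (pooled_effect, tau2), where tau2 estimates the variance
    among datasets on top of the within-dataset variances.
    """
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / total_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    c = total_w - sum(w * w for w in weights) / total_w
    k = len(effects)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Re-weight each dataset by its total (within + among) variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    return pooled, tau2


# Hypothetical effect sizes and within-dataset variances for three datasets.
effects = [0.2, 0.8, 0.5]
variances = [0.01, 0.04, 0.02]
unweighted = simple_mean(effects)
pooled, tau2 = random_effects(effects, variances)
print(unweighted, pooled, tau2)
```

When the datasets agree with one another, tau2 shrinks to zero and the random-effects estimate reduces to ordinary inverse-variance weighting.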