Answering biological questions using large genomic data collections Curtis Huttenhower 10-05-09 Harvard School of Public Health Department of Biostatistics.

Presentation transcript:

Answering biological questions using large genomic data collections Curtis Huttenhower Harvard School of Public Health Department of Biostatistics

A Definition of Computational Functional Genomics 2 Genomic data Prior knowledge Data ↓ Function ↓ Function Gene ↓ Gene ↓ Function

MEFIT: A Framework for Functional Genomics 3 Related Gene Pairs: BRCA1 BRCA2 0.9; BRCA1 RAD51 0.8; RAD51 TP … Plot: Frequency against correlation (High Correlation, Low Correlation). MEFIT

MEFIT: A Framework for Functional Genomics 4 Related Gene Pairs: BRCA1 BRCA2 0.9; BRCA1 RAD51 0.8; RAD51 TP … Unrelated Gene Pairs: BRCA2 SOX2 0.1; RAD51 FOXP2 0.2; ACTR1 H6PD 0.15; … Plot: Frequency against correlation (High Correlation, Low Correlation). MEFIT
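
The slide above contrasts how often related and unrelated gene pairs show high or low correlation in a dataset. One plausible way to turn those frequencies into a probability is Bayes' rule over binned correlations. The sketch below is only an illustration under stated assumptions, not the MEFIT implementation: the two-bin split at 0.5, the pseudocount, the 0.5 prior, and all function names are hypothetical. The correlation values are the ones shown on the slide (the elided RAD51 pair is left out).

```python
from collections import Counter

# Assumption: two bins split at 0.5; a real analysis would likely use more bins.
BIN_EDGES = (0.5,)

def bin_index(corr, edges=BIN_EDGES):
    """Return the index of the bin that a correlation value falls into."""
    return sum(corr >= e for e in edges)

def bin_frequencies(values, n_bins=len(BIN_EDGES) + 1, pseudocount=1.0):
    """Estimate P(bin | class) from observed correlations, with smoothing."""
    counts = Counter(bin_index(v) for v in values)
    total = len(values) + pseudocount * n_bins
    return [(counts[b] + pseudocount) / total for b in range(n_bins)]

def posterior_related(corr, p_related, p_unrelated, prior=0.5):
    """Bayes' rule: P(related | correlation bin) for a single dataset."""
    b = bin_index(corr)
    num = p_related[b] * prior
    return num / (num + p_unrelated[b] * (1.0 - prior))

# Correlations listed on the slide for each class of gene pair.
related = [0.9, 0.8]
unrelated = [0.1, 0.2, 0.15]

p_related = bin_frequencies(related)
p_unrelated = bin_frequencies(unrelated)

high = posterior_related(0.85, p_related, p_unrelated)
low = posterior_related(0.10, p_related, p_unrelated)
print(f"P(related | corr=0.85) = {high:.2f}")
print(f"P(related | corr=0.10) = {low:.2f}")
```

With so few pairs the estimates are dominated by the pseudocount; the point is only that a highly correlated pair ends up more likely to be related than a weakly correlated one.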

MEFIT: A Framework for Functional Genomics 5 Golub 1999 Butte 2000 Whitfield 2002 Hansen 1998 Functional Relationship

MEFIT: A Framework for Functional Genomics 6 Golub 1999 Butte 2000 Whitfield 2002 Hansen 1998 Functional Relationship Biological Context Functional area Tissue Disease …

Functional Interaction Networks 7 MEFIT. Global interaction network; Autophagy network; Vacuolar transport network; Translation network. Currently have data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases.

Predicting Gene Function 8 Cell cycle genes Predicted relationships between genes High Confidence Low Confidence

Predicting Gene Function 9 Predicted relationships between genes High Confidence Low Confidence Cell cycle genes

Predicting Gene Function 10 Legend: Predicted relationships between genes (High Confidence, Low Confidence). These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
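
As a rough illustration of the idea on this slide, a gene's edges to known process genes can be compared with its edges to everything else. The sketch below assumes an undirected weighted network stored as a dictionary and scores a gene by the difference of the two averages; the scoring rule, the function names, and the toy gene names and weights are all hypothetical and not taken from the talk.

```python
from statistics import mean

def edge(network, a, b):
    """Look up an undirected edge weight; missing edges count as 0."""
    return network.get(frozenset((a, b)), 0.0)

def participation_score(network, gene, process_genes, all_genes):
    """Mean edge weight to the process minus mean edge weight to other genes."""
    inside = [edge(network, gene, g) for g in process_genes if g != gene]
    outside = [edge(network, gene, g) for g in all_genes
               if g != gene and g not in process_genes]
    return mean(inside) - mean(outside)

# Toy network: weights are made-up confidences in [0, 1].
network = {
    frozenset(("cc1", "cc2")): 0.9,
    frozenset(("cand", "cc1")): 0.8,
    frozenset(("cand", "cc2")): 0.7,
    frozenset(("cand", "other")): 0.1,
    frozenset(("byst", "cc1")): 0.1,
    frozenset(("byst", "cc2")): 0.2,
    frozenset(("byst", "other")): 0.6,
}
all_genes = ["cc1", "cc2", "cand", "byst", "other"]
cell_cycle = {"cc1", "cc2"}

cand_score = participation_score(network, "cand", cell_cycle, all_genes)
byst_score = participation_score(network, "byst", cell_cycle, all_genes)
print(f"cand: {cand_score:+.2f}, byst: {byst_score:+.2f}")
```

Subtracting the background average is one simple way to capture "specifically": a gene that is strongly connected to everything would not score highly for any single process.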

Functional Associations Between Contexts 11 Legend: Predicted relationships between genes (High Confidence, Low Confidence). Cell cycle genes. The average strength of these relationships indicates how cohesive a process is.

Functional Associations Between Contexts 12 Predicted relationships between genes High Confidence Low Confidence Cell cycle genes

Functional Associations Between Contexts 13 Legend: Predicted relationships between genes (High Confidence, Low Confidence). Cell cycle genes; DNA replication genes. The average strength of these relationships indicates how associated two processes are.
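
The two averages described on these slides can be sketched directly: cohesiveness as the mean edge weight among a process's own genes, and association as the mean edge weight between the genes of two processes. This is a minimal sketch, assuming an undirected weighted network with missing edges treated as 0; the function names, gene names, and weights are hypothetical, and the talk may normalize these quantities differently.

```python
from itertools import combinations, product
from statistics import mean

def edge(network, a, b):
    """Look up an undirected edge weight; missing edges count as 0."""
    return network.get(frozenset((a, b)), 0.0)

def cohesiveness(network, genes):
    """Average edge weight over all pairs of genes within one process."""
    return mean(edge(network, a, b) for a, b in combinations(genes, 2))

def association(network, genes_a, genes_b):
    """Average edge weight over all pairs spanning two processes."""
    return mean(edge(network, a, b)
                for a, b in product(genes_a, genes_b) if a != b)

# Toy network: weights are made-up confidences in [0, 1].
network = {
    frozenset(("c1", "c2")): 0.9,
    frozenset(("c1", "c3")): 0.8,
    frozenset(("c2", "c3")): 0.7,
    frozenset(("d1", "d2")): 0.6,
    frozenset(("c1", "d1")): 0.5,
    frozenset(("c2", "d1")): 0.4,
}
cell_cycle = ["c1", "c2", "c3"]
dna_replication = ["d1", "d2"]

cc_cohesion = cohesiveness(network, cell_cycle)
cc_dr_assoc = association(network, cell_cycle, dna_replication)
print(f"cell cycle cohesiveness: {cc_cohesion:.2f}")
print(f"cell cycle / DNA replication association: {cc_dr_assoc:.2f}")
```

Either number is only meaningful relative to a baseline, such as the same average taken over randomly chosen gene sets from the whole genome.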

Functional Associations Between Processes 14 Edges: Associations between processes (Very Strong, Moderately Strong). Processes: Hydrogen Transport; Electron Transport; Cellular Respiration; Protein Processing; Peptide Metabolism; Cell Redox Homeostasis; Aldehyde Metabolism; Energy Reserve Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Organelle Fusion; Protein Depolymerization; Organelle Inheritance

Functional Associations Between Processes 15 Edges: Associations between processes (Very Strong, Moderately Strong). Borders: Data coverage of processes (Well Covered, Sparsely Covered). Processes: Hydrogen Transport; Electron Transport; Cellular Respiration; Protein Processing; Peptide Metabolism; Cell Redox Homeostasis; Aldehyde Metabolism; Energy Reserve Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Organelle Fusion; Protein Depolymerization; Organelle Inheritance

Functional Associations Between Processes 16 Edges: Associations between processes (Very Strong, Moderately Strong). Nodes: Cohesiveness of processes (Below Baseline (genomic background), Very Cohesive). Borders: Data coverage of processes (Well Covered, Sparsely Covered). Processes: Hydrogen Transport; Electron Transport; Cellular Respiration; Protein Processing; Peptide Metabolism; Cell Redox Homeostasis; Aldehyde Metabolism; Energy Reserve Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Organelle Fusion; Protein Depolymerization; Organelle Inheritance. Genes: AHP1, DOT5, GRX1, GRX2, …; APE3, LAP4, PAI3, PEP4, …

HEFalMp: Predicting human gene function 17 HEFalMp

HEFalMp: Predicting human genetic interactions 18 HEFalMp

HEFalMp: Analyzing human genomic data 19 HEFalMp

HEFalMp: Understanding human disease 20 HEFalMp

Validating Human Predictions 21 Autophagy. Luciferase (Negative control); ATG5 (Positive control); LAMP2; RAB11A. Not Starved (Autophagic). Predicted novel autophagy proteins: 5½ of 7 predictions currently confirmed. With Erin Haley, Hilary Coller

Comprehensive Validation of Computational Predictions 22 Genomic data; Prior knowledge. Computational Predictions of Gene Function: MEFIT; SPELL (Hibbs et al 2007); bioPIXIE (Myers et al 2005). Genes predicted to function in mitochondrion organization and biogenesis. Laboratory Experiments: Petite frequency; Growth curves; Confocal microscopy. New known functions for correctly predicted genes. Retraining. With David Hess, Amy Caudy

Evaluating the Performance of Computational Predictions. Genes involved in mitochondrion organization and biogenesis: Original GO Annotations; 135 Under-annotations; 82 Novel Confirmations, First Iteration; 17 Novel Confirmations, Second Iteration. 340 total: >3x previously known genes in ~5 person-months

Evaluating the Performance of Computational Predictions. Genes involved in mitochondrion organization and biogenesis: Original GO Annotations; 95 Under-annotations; 40 Confirmed Under-annotations; 80 Novel Confirmations, First Iteration; 17 Novel Confirmations, Second Iteration. 340 total: >3x previously known genes in ~5 person-months. Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

Functional Maps: Focused Data Summarization 25 Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

Functional Maps: Focused Data Summarization 26 How can a researcher take advantage of all this data to study his/her favorite gene/pathway/disease without losing information? Functional mapping: Very large collections of genomic data; Specific predicted molecular interactions; Pathway, process, or disease associations; Underlying experimental results and functional activities in data

Thanks! 27 Interested? I’m accepting students and postdocs! Hilary Coller Erin Haley Tsheko Mutungu Olga Troyanskaya Matt Hibbs Chad Myers David Hess Edo Airoldi Florian Markowetz Shuji Ogino Charlie Fuchs

Next Steps: Microbial Communities 29
Data integration is off to a great start in humans
– Complex communities of distinct cell types
– Very sparse prior knowledge, concentrated in a few specific areas
– Variation across populations
– Critical to understand mechanisms of disease

Next Steps: Microbial Communities 30
What about microbial communities?
– Complex communities of distinct species/strains
– Very sparse prior knowledge, concentrated in a few specific species/strains
– Variation across populations
– Critical to understand mechanisms of disease

Next Steps: Microbial Communities 31 ~120 available expression datasets; ~70 species. Weskamp et al 2004; Flannick et al 2006; Kanehisa et al 2008; Tatusov et al 1997. Data integration works just as well in microbes as it does in humans. We know an awful lot about some microorganisms and almost nothing about others. Purely sequence-based and purely network-based tools for function transfer both fall short. We need data integration to take advantage of both and mine out useful biology!

Next Steps: Functional Metagenomics 32
Metagenomics: data analysis from environmental samples
– Microflora: environment includes us!
Another data integration problem
– Must include datasets from multiple organisms
Another context-specificity problem
– Now “context” can also mean “species”
What questions can we answer?
– How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
– What’s shared within community X? What’s different? What’s unique?
– What’s perturbed in disease state Y? One organism, or many? Host interactions?
– Current methods annotate ~50% of synthetic data, <5% of environmental data