Large scale genomic data mining Curtis Huttenhower 11-14-09 Harvard School of Public Health Department of Biostatistics.



Greatest Biological Discoveries?

Are We There Yet?
How much biology is out there? How much have we found? How fast are we finding it?
[Figure panels: Species Diversity of Environmental Samples (Schloss and Handelsman, 2006); Human Proteins with Annotated Biological Roles; Age-Adjusted Citation Rates for Major Sequencing Projects. Other labels: # Distinct Roles; Matt Hibbs]

Are We There Yet?
How much biology is out there? Lots!
How much have we found? Not nearly all.
How fast are we finding it? Not fast enough.
[Figure panels: Species Diversity of Environmental Samples (Schloss and Handelsman, 2006); Human Proteins with Annotated Biological Roles; Age-Adjusted Cost per Citation for Major Sequencing Projects. Other labels: # Distinct Roles; Matt Hibbs]
Our job is to create computational microscopes: to ask and answer specific biomedical questions using millions of experimental results.

Outline
1. Methodology: Algorithms for mining genome-scale datasets
2. Microscopic: Microbial communities and functional metagenomics
3. Macroscopic: Functional genomic data in a large prospective cohort

A Framework for Functional Genomics
[Figure: each dataset scores gene pairs (high vs. low similarity, high vs. low correlation). Known related pairs (G1-G2 +, G4-G9 +, …) and unrelated pairs (G3-G6 -, G7-G8 -, …) give per-dataset frequency histograms (high vs. low correlation; coloc. vs. not coloc.; similar vs. dissim.), which are combined into P(G2-G5|Data) for an unlabeled pair (G2-G5 ?). Scale: Ms of gene pairs × 1Ks of datasets]
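One way to read this slide is as a naive Bayes classifier over binned dataset scores. The sketch below is a minimal illustration under that assumption, not the talk's actual implementation: the dataset names, the two-bin discretization, the pseudocount, and the 0.5 prior are all hypothetical.

```python
from collections import defaultdict


def train(examples, n_bins, pseudocount=1.0):
    """Learn P(bin | related) and P(bin | unrelated) for each dataset.

    examples: list of (observations, label); observations maps a dataset
    name to a bin index, label is True for a known related gene pair.
    """
    counts = defaultdict(lambda: {True: [pseudocount] * n_bins,
                                  False: [pseudocount] * n_bins})
    for observations, label in examples:
        for dataset, bin_index in observations.items():
            counts[dataset][label][bin_index] += 1
    # Normalize counts into conditional probability tables.
    return {
        dataset: {label: [c / sum(row) for c in row]
                  for label, row in by_label.items()}
        for dataset, by_label in counts.items()
    }


def posterior(model, observations, prior=0.5):
    """P(related | data) for one gene pair, treating datasets as independent."""
    p_related, p_unrelated = prior, 1.0 - prior
    for dataset, bin_index in observations.items():
        if dataset not in model:
            continue  # dataset never seen in training: no evidence either way
        p_related *= model[dataset][True][bin_index]
        p_unrelated *= model[dataset][False][bin_index]
    return p_related / (p_related + p_unrelated)


# Hypothetical training pairs: bin 1 = high correlation / colocalized.
training = [
    ({"coexpression": 1, "colocalization": 1}, True),   # like G1-G2 (+)
    ({"coexpression": 1, "colocalization": 1}, True),   # like G4-G9 (+)
    ({"coexpression": 0, "colocalization": 0}, False),  # like G3-G6 (-)
    ({"coexpression": 0, "colocalization": 1}, False),  # like G7-G8 (-)
]
model = train(training, n_bins=2)
p_high = posterior(model, {"coexpression": 1, "colocalization": 1})
p_low = posterior(model, {"coexpression": 0, "colocalization": 0})
```

The per-dataset tables play the role of the frequency histograms on the slide; a pair that is highly correlated and colocalized gets a higher posterior than one that is neither.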

A Framework for Functional Genomics
[Figure labels: Functional Relationship; Golub 1999; Butte 2000; Whitfield 2002; Hansen 1998]

Predicted Functional Interaction Networks
[Figure panels: Global interaction network; Metabolism network; Fibroblast network; Colon cancer network]
Currently have data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases.

Functional Mapping: Mining Integrated Networks
The average strength of the predicted relationships among a set of genes indicates how cohesive a process is.
[Figure: predicted relationships between cell cycle genes, shaded from low to high confidence]

Functional Mapping: Mining Integrated Networks
[Figure: predicted relationships between cell cycle genes, shaded from low to high confidence]

Functional Mapping: Mining Integrated Networks
The average strength of the predicted relationships between two gene sets indicates how associated the two processes are.
[Figure: predicted relationships between cell cycle genes and DNA replication genes, shaded from low to high confidence]

Functional Mapping: Scoring Functional Associations
How can we formalize these relationships? Any two sets of genes G1 and G2 in a network can be compared using four measures:
– Edges between their genes
– Edges within each set
– The background edges incident to each set
– The baseline of all edges in the network
Stronger connections between the sets increase association. Stronger within-set connections or nonspecific background connections decrease association.
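The slide does not give a formula, so the following is only one plausible way to combine the four measures: a ratio that rises with between-set weight and falls with within-set and background weight. The combination `(between / background) * (baseline / within)`, the fully connected dictionary representation, and all gene names and weights are assumptions made for illustration.

```python
from itertools import combinations


def make_network(genes, default, overrides):
    """Fully connected weighted network: frozenset({a, b}) -> weight."""
    weights = {frozenset(pair): default for pair in combinations(genes, 2)}
    for (a, b), weight in overrides.items():
        weights[frozenset((a, b))] = weight
    return weights


def mean_weight(weights, pairs):
    """Average weight over a collection of unordered gene pairs."""
    pairs = set(pairs)  # count each unordered pair once
    return sum(weights[p] for p in pairs) / len(pairs) if pairs else 0.0


def functional_association(weights, genes, set1, set2):
    """Hypothetical association score built from the four slide measures."""
    set1, set2 = set(set1), set(set2)
    union = set1 | set2
    between = mean_weight(
        weights, (frozenset((a, b)) for a in set1 for b in set2 if a != b))
    within = mean_weight(
        weights, [frozenset(p) for s in (set1, set2)
                  for p in combinations(sorted(s), 2)])
    background = mean_weight(
        weights, (frozenset((a, b)) for a in union for b in genes if a != b))
    baseline = mean_weight(weights, weights.keys())
    if within == 0 or background == 0:
        raise ValueError("within-set and background weights must be nonzero")
    return (between / background) * (baseline / within)


genes = ["a", "b", "c", "d", "e", "f"]
within_edges = {("a", "b"): 0.5, ("c", "d"): 0.5}
between_edges = {("a", "c"): 0.8, ("a", "d"): 0.8,
                 ("b", "c"): 0.8, ("b", "d"): 0.8}
strong = make_network(genes, 0.1, {**within_edges, **between_edges})
weak = make_network(genes, 0.1, within_edges)
fa_strong = functional_association(strong, genes, {"a", "b"}, {"c", "d"})
fa_weak = functional_association(weak, genes, {"a", "b"}, {"c", "d"})
```

In this toy network the two sets score above 1 when their between-set edges are strong and below 1 when those edges sit at the background level, matching the direction of the effects the slide describes.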

Functional Mapping: Bootstrap p-values
Scoring functional associations (FAs) is great… …but how do you interpret an association score?
– For gene sets of arbitrary sizes?
– In arbitrary graphs?
– Each with its own bizarre distribution of edges?
Empirically! For any graph, compute FA scores for many randomly chosen gene sets of different sizes.
– The null distribution is approximately normal with mean 1.
– The standard deviation is asymptotic in the sizes of both gene sets.
– This maps FA scores to p-values for any gene sets and underlying graph.
[Figures: histograms of FAs for random sets, by # genes; null distribution σs for one graph]
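The empirical procedure can be sketched as follows: sample random gene sets, score them, summarize the null by its mean and standard deviation, and convert an observed score to a p-value with a normal approximation. The toy network, the simplified `toy_score` function, the disjoint sampling scheme, and the one-sided test are all assumptions of this sketch; the slides do not specify them.

```python
import math
import random
import statistics
from itertools import combinations


def null_parameters(score_fn, genes, size1, size2, n_samples=500, seed=0):
    """Mean and standard deviation of scores for random disjoint gene sets."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        sample = rng.sample(genes, size1 + size2)
        scores.append(score_fn(sample[:size1], sample[size1:]))
    return statistics.mean(scores), statistics.stdev(scores)


def normal_p_value(score, mean, std):
    """One-sided P(null >= score) under a normal approximation."""
    z = (score - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2))


# Toy network with random edge weights (hypothetical data).
weight_rng = random.Random(1)
genes = list(range(30))
weights = {frozenset(pair): weight_rng.random()
           for pair in combinations(genes, 2)}
overall = sum(weights.values()) / len(weights)


def toy_score(set1, set2):
    """Simplified stand-in score: mean between-set weight over network mean."""
    between = [weights[frozenset((a, b))] for a in set1 for b in set2]
    return (sum(between) / len(between)) / overall


null_mean, null_std = null_parameters(toy_score, genes, 5, 5)
p_value = normal_p_value(null_mean + 3 * null_std, null_mean, null_std)
```

Because the stand-in score is normalized by the network-wide mean, its null distribution centers near 1, as the slide states for FA scores. In practice the null parameters would be tabulated per graph and per pair of set sizes.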

Functional Mapping: Functional Associations Between Processes
[Figure: network of processes. Edges: associations between processes (moderately strong to very strong). Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance]

Functional Mapping: Functional Associations Between Processes
[Figure: network of processes. Edges: associations between processes (moderately strong to very strong). Borders: data coverage of processes (sparsely covered to well covered). Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance]

Functional Mapping: Functional Associations Between Processes
[Figure: network of processes. Edges: associations between processes (moderately strong to very strong). Nodes: cohesiveness of processes (below baseline, i.e. genomic background, to very cohesive). Borders: data coverage of processes (sparsely covered to well covered). Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance]

Functional Maps: Focused Data Summarization
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

Functional Maps: Focused Data Summarization
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
[Diagram labels: Functional mapping; Very large collections of genomic data; Specific predicted molecular interactions; Pathway, process, or disease associations; Underlying experimental results and functional activities in data]

HEFalMp: Predicting Human Gene Function

HEFalMp: Predicting Human Genetic Interactions

HEFalMp: Analyzing Human Genomic Data

HEFalMp: Understanding Human Disease

Outline
1. Methodology: Algorithms for mining genome-scale datasets
2. Microscopic: Microbial communities and functional metagenomics
3. Macroscopic: Functional genomic data in a large prospective cohort

Microbial Communities and Functional Metagenomics
With Jacques Izard, Wendy Garrett
Metagenomics: data analysis from environmental samples
– Microflora: environment includes us!
Pathogen collections of “single” organisms form similar communities
Another data integration problem
– Must include datasets from multiple organisms
What questions can we answer?
– What pathways/processes are present, over-enriched, or under-enriched in a newly sequenced microbe/community?
– What’s shared within community X? What’s different? What’s unique?
– How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
– Current functional methods annotate ~50% of synthetic data, <5% of environmental data
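A standard way to ask whether a pathway is over-enriched in a set of genes is a hypergeometric tail test. The slides do not say which test is used, so this is a generic sketch; the function name and the example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(genome_size, pathway_size, sample_size, overlap):
    """P(X >= overlap) when sample_size genes are drawn without replacement
    from genome_size genes, of which pathway_size belong to the pathway."""
    upper = min(pathway_size, sample_size)
    total = comb(genome_size, sample_size)
    tail = sum(
        comb(pathway_size, k) * comb(genome_size - pathway_size, sample_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 4,000 genes, 40 in the pathway, 200 genes observed
# in a new community, 8 of which fall in the pathway.
p_over = enrichment_p_value(4000, 40, 200, 8)
```

Under-enrichment can be tested the same way with the lower tail, and many pathways tested at once would need a multiple-testing correction.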

Data Integration for Microbial Communities
~350 available expression datasets, ~25 species
– Data integration should work just as well in microbes as it does in yeast and humans
– We know an awful lot about some microorganisms and almost nothing about others
– Sequence-based and network-based tools for function transfer both work in isolation
– We can use data integration to leverage both and mine out additional biology
[Figure citations: Weskamp et al 2004; Flannick et al 2006; Kanehisa et al 2008; Tatusov et al 1997]

Functional Maps for Functional Metagenomics
[Figure: a network of genes YG1–YG17 mapped onto a network of groups KO1–KO9. Table columns: KO1: YG1, YG2, YG3; KO2: YG4; KO3: YG6; … and ECG1, ECG2; PAG1; ECG3, PAG2; …]

Functional Maps for Functional Metagenomics

Validating Orthology-Based Functional Mapping
Does unweighted data integration predict functional relationships? What is the effect of “projecting” through an orthologous space?
[Figure: four panels of log(Precision/Random) vs. Recall, evaluated against KEGG and GO, comparing unsupervised integration with individual datasets]
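The evaluation axes on this slide can be computed from a ranked list of predicted gene pairs and a gold standard such as KEGG or GO. The sketch below assumes base-2 logarithms and takes "Random" to be the fraction of positives among all evaluated pairs; neither detail is stated on the slide, and the scores are made up.

```python
import math


def log_precision_recall(scored_pairs):
    """Return (recall, log2(precision / random)) at each true positive.

    scored_pairs: list of (score, is_positive) for predicted gene pairs.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[0], reverse=True)
    n_positive = sum(1 for _, label in ranked if label)
    random_precision = n_positive / len(ranked)  # precision of a random ranking
    curve, true_positives = [], 0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_positives += 1
            precision = true_positives / rank
            curve.append((true_positives / n_positive,
                          math.log2(precision / random_precision)))
    return curve


# Hypothetical predictions: (confidence, pair is in the gold standard).
predictions = [(0.9, True), (0.8, True), (0.7, False),
               (0.6, True), (0.2, False), (0.1, False)]
curve = log_precision_recall(predictions)
```

Values above 0 mean the ranking beats random at that recall level, which makes panels with different gold-standard densities comparable.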

Validating Orthology-Based Functional Mapping
[Figure: network of genes YG1–YG17 partitioned into a holdout set (an uncharacterized “genome”) and random subsets (characterized “genomes”)]

Validating Orthology-Based Functional Mapping

Validating Orthology-Based Functional Mapping
Can subsets of the yeast genome predict a held-out subset’s functional maps? Can subsets of the yeast genome predict a held-out subset’s interactome?
What have we learned?
– Yeast is incredibly well-curated
– KEGG tends to be more specific than GO
– Predicting interactomes by projecting through functional maps works decently in the absolute best case
[Figure panels: KEGG; GO]

Functional Maps for Functional Metagenomics
Now, what happens if you do this for characterized microbes?
– ~10 (somewhat) well-characterized species
– 1-35 datasets each
– Integrate within species
– Evaluate using KEGG
– Then cross-validate by holding out species
[Figure: log(Precision/Random) vs. Recall for unsupervised integrations, evaluated against KEGG]
Check back soon for more results, preliminary data on metagenomes

Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
Sleipnir: C++ library for computational functional genomics
– Data types for biological entities: microarray data, interaction data, genes and gene sets, functional catalogs, etc.
– Network communication, parallelization
– Efficient machine learning algorithms: generative (Bayesian) and discriminative (SVM)
– And it’s fully documented!
It’s also speedy: improves on Bayes Net Toolbox by ~22x in memory usage and up to >100x in runtime.

Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
Sleipnir: C++ library for computational functional genomics
– Data types for biological entities: microarray data, interaction data, genes and gene sets, functional catalogs, etc.
– Network communication, parallelization
– Efficient machine learning algorithms: generative (Bayesian) and discriminative (SVM)
– And it’s fully documented!
[Figure: original vs. current processing time; values shown: 8 hours, 1 minute, 30 years, 2 months, 18 hours, 2.5 hours]

Outline
1. Methodology: Algorithms for mining genome-scale datasets
2. Microscopic: Microbial communities and functional metagenomics
3. Macroscopic: Functional genomic data in a large prospective cohort

Current Work: Molecular Mechanisms in a Colorectal Cancer Cohort
With Shuji Ogino, Charlie Fuchs
Cohorts: Health Professionals Follow-Up Study, Nurses’ Health Study
– ~3,100 gastrointestinal subjects
– ~3,800 tissue samples
– ~1,450 colon cancer samples
– ~1,150 CpG island methylation
– ~1,200 LINE-1 methylation
– ~700 TMA immunohistochemistry
– ~2,100 cancer mutation tests
– ~775 gene expression
LINE-1 Methylation
– Repetitive element making up ~20% of mammalian genomes
– Very easy to assay methylation level (%)
– Good proxy for whole-genome methylation level
DASL Gene Expression
– Gene expression analysis from paraffin blocks
– Thanks to Todd Golub, Yujin Hoshida

Molecular Subtypes of Colorectal Cancer: Stem Cell Programs and Proliferation
[Figure: nonnegative matrix factorization of a genes × tumors matrix into clusters C1–C4. Gene group annotations: Chr. 19 rearrangement, membrane receptors/channels; HSC signature; Neural/ESC signature; Angiogenesis, proliferation; BRCA interactors, chrom. stability factors; Cell cycle regulation]
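For readers unfamiliar with nonnegative matrix factorization, here is a minimal pure-Python sketch using the Lee and Seung multiplicative updates for squared error. The talk does not say which NMF variant or software was used; the toy genes × tumors matrix, the rank, the iteration count, and the argmax cluster assignment are all assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, rank, n_iter=500, seed=0, eps=1e-9):
    """Factor nonnegative v (genes x tumors) as w @ h with w, h >= 0."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        numer, denom = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * numer[i][j] / (denom[i][j] + eps)
              for j in range(n_cols)] for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        numer, denom = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * numer[i][j] / (denom[i][j] + eps)
              for j in range(rank)] for i in range(n_rows)]
    return w, h


# Toy expression matrix: rows are genes, columns are tumors (made-up values).
v = [[5, 5, 0, 0],
     [4, 4, 0, 0],
     [0, 0, 3, 3],
     [0, 0, 6, 6]]
w, h = nmf(v, rank=2)
# Assign each tumor to the factor with the largest coefficient in h.
clusters = [max(range(2), key=lambda k: h[k][j]) for j in range(4)]
```

Columns of `w` act as gene programs and rows of `h` give each tumor's loading on them, which is how a factorization like this yields tumor clusters such as C1–C4.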

Molecular Subtypes of Colorectal Cancer: Stem Cell Programs and Proliferation
[Figure labels: Neural Stem Cell Signature; Hematopoietic Stem Cell Signature; Embryonic Stem Cell Signature; Chr. 19q; BAX; CD133 + Bcl-X(L); CD44 + CD166. Citation: Subramanian et al]
Hypotheses?
Two main pathways to proliferation:
– HSC program + BAX
– ESC/NSC program
Two main pathways to deregulation:
– Angiogenesis + chrom. instability
– Cell cycle disruption (MSI?)
Note that these regulatory programs do not appear to correspond with demographics or common pathologic markers… Testing now for correlation with outcome.

Epigenetics of Colorectal Cancer: LINE-1 methylation levels
Lower LINE-1 methylation associates with poor colon cancer prognosis.
LINE-1 methylation varies remarkably between individuals… …but it is highly correlated within individuals.
[Figure: ρ = 0.718, p < 0.01; Ogino et al, 2008]
What does it all mean?? What is the biological mechanism linking LINE-1 methylation to colon cancer?
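The slide reports ρ without saying which correlation it is. Assuming it denotes Spearman's rank correlation, a within-individual comparison could be computed as below; the paired methylation percentages are invented for illustration.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start..end, 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical LINE-1 methylation (%) in two samples from each individual.
sample_a = [62.1, 55.4, 70.3, 48.9, 66.0, 59.2]
sample_b = [60.5, 57.0, 68.8, 51.2, 61.7, 63.4]
rho = spearman(sample_a, sample_b)
```

A rank-based measure is a reasonable default here because methylation percentages are bounded and the outliers the talk asks about would dominate a Pearson estimate.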

Epigenetics of Colorectal Cancer: LINE-1 methylation levels
Lower LINE-1 methylation associates with poor colon cancer prognosis.
LINE-1 methylation varies remarkably between individuals… …but it is highly correlated within individuals.
[Figure: ρ = 0.718, p < 0.01; Ogino et al, 2008]
– This suggests a genetic effect.
– This suggests a copy number variation.
– This suggests linkage to a cancer-related pathway.
Is anything different about these outliers? What is the biological mechanism linking LINE-1 methylation to colon cancer?

Epigenetics of Colorectal Cancer: LINE-1 methylation levels
What is the biological mechanism linking LINE-1 methylation to colon cancer?
Preliminary Data
– 10 genes differentially expressed even using simple methods
– 1/3 are from the same family with known GI tumor prognostic value
– 1/3 are X-chromosome testis/cancer-specific antigens
– 1/2 fall in the same cytogenetic band, which is also a known CNV hotspot
– HEFalMp links to a cascade of antigens/membrane receptors/TFs
– Cell adhesion p-value ≈ 0, moderate correlation in many cancer arrays
– GSEA pulls out a wide range of proliferation up (E2F), immune response down; need to regress out prognosis confounds
Check back in a couple of months!

Outline
1. Methodology: Algorithms for mining genome-scale datasets
– Bayesian system for genomic data integration
– HEFalMp system for human data analysis and integration
– Functional mapping to statistically summarize large data collections
2. Microscopic: Microbial communities and functional metagenomics
– Integration for microbial communities and metagenomics
– Network alignment and mapping for microbial community analysis
– Sleipnir software for efficient large scale data mining
3. Macroscopic: Functional genomic data in a large prospective cohort
– Demographic/molecular/genomic data for ~1,000 colorectal cancers
– Ongoing analysis of gene activity and LINE-1 methylation

Thanks!
Hilary Coller, Erin Haley, Tsheko Mutungu, Olga Troyanskaya, Matt Hibbs, Chad Myers, David Hess, Edo Airoldi, Florian Markowetz, Shuji Ogino, Charlie Fuchs, Jacques Izard, Wendy Garrett
Interested? We’re recruiting students and postdocs! Biostatistics Department