1
Genomic Data Integration
Curtis Huttenhower, Harvard School of Public Health, Department of Biostatistics
2
A Definition of Integrative Data Mining
[Diagram labels: Prior knowledge; Genomic data; Gene → Function; Gene → Data → Function; Function →]
3
Machine Learning for Data Integration
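[Figure: 100Ms of gene pairs (G1-G2, G4-G9, G3-G6, G7-G8, ...) labeled +, -, or ?, scored across 1Ks of datasets (example scores 0.9, 0.7, 0.1, 0.2, 0.8, 0.5, 0.05, 0.6). Per-dataset frequency distributions over correlation (high to low), colocalization (coloc. vs. not coloc.), and similarity (similar vs. dissimilar) are combined (+ =) into a probability for an unlabeled pair: P(G2-G5|Data) = 0.85]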
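As a rough illustration of how per-dataset evidence could be combined into a value like P(G2-G5|Data), the sketch below uses a naive Bayes update. The slide does not state which model is used, so the independence assumption, the function name posterior_related, the prior, and all of the likelihood values are hypothetical.

```python
from math import prod

def posterior_related(prior, likelihoods):
    """Naive Bayes posterior that a gene pair is functionally related.

    prior: P(related) before looking at any dataset (assumed value).
    likelihoods: one (P(obs | related), P(obs | unrelated)) tuple per dataset,
        e.g. read from that dataset's frequency histogram for the observed bin.
    Datasets are assumed conditionally independent given the relationship.
    """
    p_rel = prior * prod(l_rel for l_rel, _ in likelihoods)
    p_unrel = (1.0 - prior) * prod(l_unrel for _, l_unrel in likelihoods)
    return p_rel / (p_rel + p_unrel)

# Hypothetical evidence for one gene pair observed in three datasets.
observations = [
    (0.6, 0.2),  # high expression correlation
    (0.7, 0.3),  # colocalized
    (0.5, 0.4),  # similar profiles
]
p = posterior_related(prior=0.1, likelihoods=observations)
print(f"P(related | data) = {p:.3f}")
```

Each dataset contributes only a likelihood ratio, so adding a dataset is one more multiplication per gene pair, which is one reason this kind of model can scale to many datasets.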
4
Machine Learning for Data Integration
[Figure: Functional Relationship. Citations: Jansen 2003; Troyanskaya 2003; Golub 1999; Butte 2000; Whitfield 2002; Hansen 1998]
5
Alternative Data Integration Frameworks
Lee 2004 Lanckriet 2004 Aerts 2006
6
Functional Networks
Global interaction network; Metabolism network; Conserved network; Kidney network
function.princeton.edu/hefalmp
string-db.org
funcnet.eu
homes.esat.kuleuven.be/~bioiuser/endeavour
7
Biological Networks: Clusters, Hubs, Bottlenecks, and Flow
8
Biological Networks: Network Motifs
Bi-fan: WGD and evolvability
Feedback: memory
Positive auto-regulation: delay
Negative auto-regulation: speed + stability
Coherent feed-forward: filter
Incoherent feed-forward: pulse
mavisto.ipk-gatersleben.de
theinf1.informatik.uni-jena.de/~wernicke/motifs
Milo 2002; Alon 2007
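To make the feed-forward motifs concrete, here is a minimal sketch that enumerates feed-forward loops in a small signed, directed network and labels each one coherent or incoherent. It is a brute-force illustration, not the algorithm of the tools linked on the slide; the node names, edge signs, and function names are made up.

```python
from itertools import permutations

def feed_forward_loops(edges):
    """Return every (x, y, z) with directed edges x->y, y->z, and x->z."""
    edge_set = set(edges)
    nodes = {n for edge in edge_set for n in edge}
    return sorted(
        (x, y, z)
        for x, y, z in permutations(nodes, 3)
        if (x, y) in edge_set and (y, z) in edge_set and (x, z) in edge_set
    )

def classify_ffl(signs, x, y, z):
    """Coherent if the direct path x->z has the same sign as x->y->z."""
    direct = signs[(x, z)]
    indirect = signs[(x, y)] * signs[(y, z)]
    return "coherent" if direct == indirect else "incoherent"

# Hypothetical regulatory network: +1 = activation, -1 = repression.
signs = {
    ("X", "Y"): +1, ("Y", "Z"): +1, ("X", "Z"): +1,
    ("A", "B"): +1, ("B", "C"): -1, ("A", "C"): +1,
}
loops = feed_forward_loops(signs)
labels = {loop: classify_ffl(signs, *loop) for loop in loops}
for loop, label in labels.items():
    print(loop, label)
```

Checking all ordered triples is cubic in the number of nodes, so this only suits toy networks; dedicated motif finders use sampling or smarter enumeration.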
9
Predicting Gene Function
Predicted relationships between genes (confidence: high to low). Cell cycle genes.
11
Predicting Gene Function
Predicted relationships between genes (confidence: high to low). Cell cycle genes.
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
Huttenhower 2009; Pena-Castillo 2008; Rodrigues 2007
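One simple way to turn edge confidences into a per-gene score for a process of interest is to compare a gene's average connection to the process genes against its average connection to everything else. The slide does not give the actual formula, so the ratio below, the function names, and the toy weights are assumptions for illustration only.

```python
def mean_weight(weights, gene, others):
    """Average edge weight between `gene` and each gene in `others`."""
    vals = [
        weights[frozenset((gene, o))]
        for o in others
        if o != gene and frozenset((gene, o)) in weights
    ]
    return sum(vals) / len(vals) if vals else 0.0

def process_specificity(weights, gene, process_genes, all_genes):
    """Ratio of a gene's mean connection to the process vs. to all genes."""
    inside = mean_weight(weights, gene, process_genes)
    background = mean_weight(weights, gene, all_genes)
    return inside / background if background > 0 else 0.0

# Hypothetical predicted-relationship confidences (undirected edges).
weights = {
    frozenset(("A", "B")): 0.9, frozenset(("A", "C")): 0.8,
    frozenset(("A", "D")): 0.5, frozenset(("B", "C")): 0.7,
    frozenset(("B", "D")): 0.1, frozenset(("C", "D")): 0.1,
}
all_genes = {"A", "B", "C", "D"}
process = {"B", "C"}  # stand-in for the annotated process genes

score_a = process_specificity(weights, "A", process, all_genes)
score_d = process_specificity(weights, "D", process, all_genes)
print(f"A: {score_a:.2f}, D: {score_d:.2f}")
```

A score above 1 means the gene is more strongly connected to the process than to the genome at large; a score below 1 means the opposite.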
12
Comprehensive Validation of Computational Predictions
Hess, 2009; Hibbs, 2009
Inputs: genomic data, prior knowledge
Computational predictions of gene function: SPELL (Hibbs et al 2007), bioPIXIE (Myers et al 2005), MEFIT
Genes predicted to function in mitochondrion organization and biogenesis
Laboratory experiments: petite frequency, growth curves, confocal microscopy
New known functions for correctly predicted genes
Retraining
13
Evaluating the Performance of Computational Predictions
Huttenhower, 2009
Genes involved in mitochondrion organization and biogenesis:
106 Original GO Annotations
135 Under-annotations
82 Novel Confirmations, First Iteration
17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months
14
Evaluating the Performance of Computational Predictions
Huttenhower, 2009
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.
Genes involved in mitochondrion organization and biogenesis:
106 Original GO Annotations
95 Under-annotations
40 Confirmed Under-annotations
80 Novel Confirmations, First Iteration
17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months
15
Functional Mapping: Mining Integrated Networks
Predicted relationships between genes (confidence: high to low). Chemotaxis.
The strength of these relationships indicates how cohesive a process is.
16
Functional Mapping: Mining Integrated Networks
Predicted relationships between genes (confidence: high to low). Chemotaxis.
17
Functional Mapping: Mining Integrated Networks
Predicted relationships between genes (confidence: high to low). Chemotaxis; Flagellar assembly.
The strength of these relationships indicates how associated two processes are.
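The two ideas on these slides, cohesiveness within a process and association between two processes, can be sketched as averages over edge weights. The slides do not specify the statistic actually used, so plain means are an assumption here, the two gene sets are assumed disjoint, and the weights attached to the gene names are invented.

```python
from itertools import combinations, product

def mean_edge(weights, pairs):
    """Average weight over the gene pairs that have a predicted edge."""
    vals = [weights[frozenset(p)] for p in pairs if frozenset(p) in weights]
    return sum(vals) / len(vals) if vals else 0.0

def cohesiveness(weights, genes):
    """Mean edge weight among the genes of one process."""
    return mean_edge(weights, combinations(sorted(genes), 2))

def association(weights, set_a, set_b):
    """Mean edge weight between two (disjoint) processes."""
    pairs = [(a, b) for a, b in product(sorted(set_a), sorted(set_b)) if a != b]
    return mean_edge(weights, pairs)

# Hypothetical confidences; only the process names come from the slides.
chemotaxis = {"cheA", "cheY"}
flagellar_assembly = {"fliC", "flgE"}
weights = {
    frozenset(("cheA", "cheY")): 0.9, frozenset(("fliC", "flgE")): 0.8,
    frozenset(("cheA", "fliC")): 0.6, frozenset(("cheA", "flgE")): 0.5,
    frozenset(("cheY", "fliC")): 0.7, frozenset(("cheY", "flgE")): 0.6,
}

within = cohesiveness(weights, chemotaxis)
between = association(weights, chemotaxis, flagellar_assembly)
baseline = cohesiveness(weights, chemotaxis | flagellar_assembly)
print(f"cohesiveness={within:.2f} association={between:.2f} baseline={baseline:.2f}")
```

Comparing both numbers against a genome-wide baseline (here just the mean over all listed genes) is what lets a process be called more or less cohesive than background.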
18
Functional Mapping: Associations Between Gene Sets
Processes: Hydrogen Transport; Electron Transport; Cellular Respiration; Protein Processing; Peptide Metabolism; Cell Redox Homeostasis; Aldehyde Metabolism; Energy Reserve Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Organelle Fusion; Protein Depolymerization; Organelle Inheritance
Edges (associations between processes): Moderately Strong to Very Strong
19
Functional Mapping: Associations Between Gene Sets
Processes: Hydrogen Transport; Electron Transport; Cellular Respiration; Protein Processing; Peptide Metabolism; Cell Redox Homeostasis; Aldehyde Metabolism; Energy Reserve Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Organelle Fusion; Protein Depolymerization; Organelle Inheritance
Edges (associations between processes): Moderately Strong to Very Strong
Borders (data coverage of processes): Sparsely Covered to Well Covered
20
Functional Mapping: Associations Between Gene Sets
Processes: Hydrogen Transport; Electron Transport; Cellular Respiration; Protein Processing; Peptide Metabolism; Cell Redox Homeostasis; Aldehyde Metabolism; Energy Reserve Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Organelle Fusion; Protein Depolymerization; Organelle Inheritance
Edges (associations between processes): Moderately Strong to Very Strong
Nodes (cohesiveness of processes): Below Baseline; Baseline (genomic background); Very Cohesive
Borders (data coverage of processes): Sparsely Covered to Well Covered
21
Functional Maps: Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?
22
Functional Maps: Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
Functional mapping:
Very large collections of genomic data
Specific predicted molecular interactions
Pathway, process, or disease associations
Underlying experimental results and functional activities in data
23
Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
Sleipnir: C++ library for computational functional genomics
Data types for biological entities: microarray data, interaction data, genes and gene sets, functional catalogs, etc.
Network communication, parallelization
Efficient machine learning algorithms: generative (Bayesian) and discriminative (SVM)
And it’s fully documented! huttenhower.sph.harvard.edu/sleipnir
It’s also speedy: microbial data integration computation takes <3hrs.
24
Thanks! Curtis Huttenhower Harvard School of Public Health
Department of Biostatistics
26
Meta-Analysis for Data Integration
Evangelou 2007 + =
27
Meta-Analysis for Data Integration
Evangelou 2007
Simple regression: All datasets are equally accurate
Random effects: Variation within and among datasets and interactions
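The contrast between treating all datasets as equally accurate and modeling variation within and among datasets can be sketched for a single gene pair as follows. The slide does not name an estimator, so the choices here (Fisher z-transformed correlations, an unweighted mean for the equal-accuracy case, and the DerSimonian-Laird estimate of between-dataset variance for the random-effects case) are assumptions, as are the function names and input values.

```python
from math import atanh, tanh

def pooled_equal(correlations):
    """Treat every dataset as equally accurate: average the Fisher z-scores."""
    zs = [atanh(r) for r in correlations]
    return tanh(sum(zs) / len(zs))

def pooled_random_effects(correlations, sample_sizes):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimate.

    Returns (pooled correlation, tau^2), where tau^2 is the estimated
    variance among datasets on the Fisher z scale.
    """
    zs = [atanh(r) for r in correlations]
    variances = [1.0 / (n - 3) for n in sample_sizes]  # within-dataset variance
    w = [1.0 / v for v in variances]
    z_fixed = sum(wi * zi for wi, zi in zip(w, zs)) / sum(w)
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, zs))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(zs) - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]
    z_re = sum(wi * zi for wi, zi in zip(w_star, zs)) / sum(w_star)
    return tanh(z_re), tau2

# Hypothetical correlations for one gene pair in three datasets.
rs = [0.9, 0.1, 0.5]
ns = [50, 50, 50]
equal = pooled_equal(rs)
random_effects, tau2 = pooled_random_effects(rs, ns)
print(f"equal={equal:.3f} random_effects={random_effects:.3f} tau2={tau2:.3f}")
```

When datasets disagree strongly, tau^2 grows and the random-effects weights flatten toward equal weighting; when they agree, tau^2 shrinks to zero and larger datasets dominate.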