Genomic Data Manipulation

1 Genomic Data Manipulation
Curtis Huttenhower; TA: Fah Sathirapongsasuti
Office hours: R 2:00-3:00; MW 3:30-5:20; Office hours: F 3:30-5:20
Harvard School of Public Health, Department of Biostatistics

2 Genomic Data Manipulation
1/3 methods: Quantitative methods (mini-stats); Programming (Python)
1/3 applications: DNA sequence data; Microarrays; Proteomics and metabolomics; Interaction networks
1/3 papers and projects: Journal club; Final project
Grading: 10% participation; 50% problem sets (10); 15% presentations; 25% final project

3 What tools enable biological discoveries?
Our job is to create computational microscopes: to ask and answer specific biomedical questions using millions of experimental results.

4 A computational definition of functional genomics
[Figure: diagram with the labels Prior knowledge, Genomic data, Gene, Data, and Function.]

5 A framework for functional genomics
[Figure: 100Ms of gene pairs (G1 to G9, marked +, - or ?) against 1Ks of datasets, with per-dataset values between 0.05 and 0.9 and an example posterior P(G2-G5|Data) = 0.85. Histogram panels, each with a Frequency axis: Correlation (High to Low), shown twice; "Let." vs. "Not let."; "Similar" vs. "Dissim."; Similarity (High to Low).]
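A posterior such as P(G2-G5|Data) = 0.85 is the kind of quantity a naive Bayes integration over many datasets would produce. The sketch below illustrates that idea only; the slide does not state the exact method, and the prior, the per-dataset likelihoods, and the function name `posterior_functional_relationship` are all hypothetical.

```python
import math


def posterior_functional_relationship(prior, likelihoods):
    """Naive Bayes posterior that a gene pair is functionally related.

    prior: P(related) before seeing any data (hypothetical value).
    likelihoods: one (P(obs | related), P(obs | unrelated)) tuple per
        dataset, assuming datasets are conditionally independent given
        the relationship status.
    """
    # Work in log space so thousands of datasets do not underflow.
    log_rel = math.log(prior)
    log_unrel = math.log(1.0 - prior)
    for p_rel, p_unrel in likelihoods:
        log_rel += math.log(p_rel)
        log_unrel += math.log(p_unrel)
    # Normalize the two unnormalized log scores into a probability.
    shift = max(log_rel, log_unrel)
    rel = math.exp(log_rel - shift)
    unrel = math.exp(log_unrel - shift)
    return rel / (rel + unrel)


# Hypothetical evidence for one gene pair from three datasets.
evidence = [(0.9, 0.2), (0.7, 0.5), (0.8, 0.6)]
posterior = posterior_functional_relationship(0.05, evidence)
print(f"P(related | data) = {posterior:.3f}")
```

With no evidence the function returns the prior unchanged, which is a quick sanity check on the normalization.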

6 Functional network prediction and analysis
[Figure panels: Global interaction network; Carbon metabolism network; Extracellular signaling network; Gut community network.]

7 Predicting gene function
[Figure: network of predicted relationships between genes, with a confidence scale from High to Low. Label: Cell cycle genes.]

8 Predicting gene function
[Figure: network of predicted relationships between genes, with a confidence scale from High to Low. Label: Cell cycle genes.]

9 Predicting gene function
[Figure: network of predicted relationships between genes, with a confidence scale from High to Low. Label: Cell cycle genes.]
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
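One way to turn edge confidences into a per-gene score is to compare a gene's average confidence to known process genes with its average confidence to all of its partners. This is a minimal sketch under that assumption, not the scoring used in the slides; the gene names, confidences, and the function `process_specificity` are hypothetical.

```python
def process_specificity(gene, edges, process_genes):
    """Ratio of mean edge confidence to process genes vs. to all partners.

    edges: dict mapping (gene_a, gene_b) tuples to a confidence in [0, 1].
    process_genes: set of genes already known to act in the process.
    Returns 0.0 when the gene has no edges to any process gene.
    """
    in_process, all_weights = [], []
    for (a, b), confidence in edges.items():
        if gene not in (a, b):
            continue
        partner = b if a == gene else a
        all_weights.append(confidence)
        if partner in process_genes:
            in_process.append(confidence)
    if not in_process:
        return 0.0
    mean_in = sum(in_process) / len(in_process)
    mean_all = sum(all_weights) / len(all_weights)
    return mean_in / mean_all if mean_all > 0 else 0.0


# Hypothetical network: G2 and G3 are the known process genes.
edges = {
    ("G1", "G2"): 0.9, ("G1", "G3"): 0.8, ("G1", "G4"): 0.1, ("G1", "G5"): 0.2,
    ("G4", "G2"): 0.2, ("G4", "G3"): 0.1, ("G4", "G5"): 0.7,
}
known = {"G2", "G3"}
score_g1 = process_specificity("G1", edges, known)
score_g4 = process_specificity("G4", edges, known)
print(score_g1, score_g4)
```

A ratio above 1 means the gene is connected more strongly to the process than to its partners in general, which is one reading of "specifically" here.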

10 Comprehensive validation of computational predictions
With David Hess, Amy Caudy
Inputs: Genomic data; Prior knowledge
Computational predictions of gene function: SPELL (Hibbs et al. 2007); bioPIXIE (Myers et al. 2005); MEFIT
Genes predicted to function in mitochondrion organization and biogenesis
Laboratory experiments: Petite frequency; Growth curves; Confocal microscopy
Retraining: New known functions for correctly predicted genes
Could go (-)

11 Evaluating the performance of computational predictions
Genes involved in mitochondrion organization and biogenesis:
106 original GO annotations
135 under-annotations
82 novel confirmations, first iteration
17 novel confirmations, second iteration
340 total: >3x previously known genes in ~5 person-months
Could go (-)
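The totals on this slide can be checked with a few lines of arithmetic. The counts below are copied from the slide; the variable names are only for illustration.

```python
# Gene counts for mitochondrion organization and biogenesis, from the slide.
counts = {
    "original GO annotations": 106,
    "under-annotations": 135,
    "novel confirmations, first iteration": 82,
    "novel confirmations, second iteration": 17,
}

total = sum(counts.values())
# Fold increase over the genes that were annotated before the study.
fold = total / counts["original GO annotations"]
print(total, round(fold, 2))
```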

12 Evaluating the performance of computational predictions
Genes involved in mitochondrion organization and biogenesis:
106 original GO annotations
95 under-annotations
40 confirmed under-annotations
80 novel confirmations, first iteration
17 novel confirmations, second iteration
340 total: >3x previously known genes in ~5 person-months
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.
Could go (-)

14 Fig. 3. Comparison of Sargasso Sea scaffolds to Crenarchaeal clone 4B7.
Predicted proteins from 4B7 and the scaffolds showing significant homology to 4B7 by tBLASTx are arrayed in positional order along the x and y axes. Colored boxes represent BLASTp matches scoring at least 25% similarity and with an E-value better than 1e-5. Black vertical and horizontal lines delineate scaffold borders.
J C Venter et al. Science 2004;304:66-74. Published by AAAS.
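The two thresholds in the caption (at least 25% similarity, E-value better than 1e-5) amount to a simple filter over BLASTp hits. The sketch below assumes the hits have already been parsed into dictionaries; the field names, the example hits, and the function `passes_thresholds` are hypothetical and do not reflect any particular BLAST output format.

```python
MIN_SIMILARITY = 25.0   # percent similarity, from the figure caption
MAX_EVALUE = 1e-5       # hits must have an E-value better (smaller) than this


def passes_thresholds(hit):
    """Return True if a parsed BLASTp hit meets both caption thresholds."""
    return hit["similarity"] >= MIN_SIMILARITY and hit["evalue"] < MAX_EVALUE


# Hypothetical parsed hits between 4B7 proteins and scaffold proteins.
hits = [
    {"query": "4B7_001", "subject": "scaf_A_12", "similarity": 62.0, "evalue": 1e-40},
    {"query": "4B7_002", "subject": "scaf_A_13", "similarity": 24.9, "evalue": 1e-12},
    {"query": "4B7_003", "subject": "scaf_B_02", "similarity": 31.5, "evalue": 1e-3},
    {"query": "4B7_004", "subject": "scaf_B_03", "similarity": 25.0, "evalue": 9e-6},
]

kept = [hit for hit in hits if passes_thresholds(hit)]
for hit in kept:
    print(hit["query"], hit["subject"])
```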

16 Fig. 7. Phylogenetic tree of rhodopsinlike genes in the Sargasso Sea data along with all homologs of these genes in GenBank.
The sequences are colored according to the type of sample in which they were found: blue, cultured species; yellow, sequences from uncultured organisms in other environmental samples; and red, sequences from uncultured species in the Sargasso Sea. The tree was divided into what we propose are distinct subfamilies of sequences, which are labeled on the right. The tree was constructed as follows: (i) All homologs of halorhodopsin were identified in the predicted proteins from the Sargasso Sea assemblies using BLASTp searches with representatives of previously identified halorhodopsinlike protein families as query sequences. (ii) All sequences greater than 75 amino acids in length were aligned to each other using CLUSTALw, and a neighbor-joining phylogenetic tree was inferred using the protdist and neighbor programs of Phylip.
J C Venter et al. Science 2004;304:66-74. Published by AAAS.
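The length cutoff in step (ii) is easy to reproduce before handing sequences to an aligner. The sketch below covers only that filtering step, in plain Python; the CLUSTALw alignment and the Phylip protdist and neighbor steps are not shown, and the function names and example sequences are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def longer_than(records, min_length=75):
    """Keep sequences strictly longer than min_length amino acids."""
    return {n: s for n, s in records.items() if len(s) > min_length}


# Hypothetical input: one 60-residue and one 80-residue protein.
example = ">seq_short\n" + "M" * 60 + "\n>seq_long\n" + "M" * 40 + "\n" + "K" * 40 + "\n"
kept = longer_than(read_fasta(example))
print(sorted(kept))
```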

17 Aerobic, microaerobic and anaerobic communities

18 Model of microbial biomarkers
