Genomic Data Manipulation Thinking about data visually Curtis Huttenhower chuttenh@hsph.harvard.edu http://huttenhower.sph.harvard.edu/bio508 01-27-14 Harvard School of Public Health Department of Biostatistics
The usual suspects
Bar plot = discrete # of discrete values
Stripchart = discrete # of small # of continuous values
Boxplot = discrete # of large # of continuous values
Histogram = discretized bins of counts
Density plot = continuous interpolation of counts
Scatter plot = pairs of continuous values
Line plot = function of continuous values
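To make "discretized bins of counts" concrete, here is a minimal sketch of how a histogram turns continuous values into counts. The function name `histogram_counts`, the equal-width binning rule, and the sample values are illustrative assumptions, not something taken from the slides.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    if not values:
        return [0] * n_bins
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: put everything in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land one past the end, so clip it into the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical measurements, for illustration only.
sample = [0.0, 0.5, 1.5, 2.5, 2.6, 4.0]
for bin_index, count in enumerate(histogram_counts(sample, 4)):
    print(bin_index, "#" * count)
```

A density plot would replace these hard bin edges with a smooth interpolation of the same counts.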
Small changes, big differences
Boxplots can be decorated as...
Beeswarm plots = mashup of boxplot + stripchart
Violin plots = mashup of boxplot + density plot
Scatter plots can be decorated as...
Sunflower plot = mashup of scatter + histogram
2D density plot = mashup of scatter + density
Fig. 3. Comparison of Sargasso Sea scaffolds to Crenarchaeal clone 4B7. Predicted proteins from 4B7 and the scaffolds showing significant homology to 4B7 by tBLASTx are arrayed in positional order along the x and y axes. Colored boxes represent BLASTp matches scoring at least 25% similarity and with an e value of better than 1e-5. Black vertical and horizontal lines delineate scaffold borders. J C Venter et al. Science 2004;304:66-74 Published by AAAS
Only one of many ways to think about DNA sequence data...
(Almost) everything can be clustered into a tree, even DNA sequences Fig. 7. Phylogenetic tree of rhodopsinlike genes in the Sargasso Sea data along with all homologs of these genes in GenBank. The sequences are colored according to the type of sample in which they were found: blue, cultured species; yellow, sequences from uncultured organisms in other environmental samples; and red, sequences from uncultured species in the Sargasso Sea. The tree was divided into what we propose are distinct subfamilies of sequences, which are labeled on the right. The tree was constructed as follows: (i) All homologs of halorhodopsin were identified in the predicted proteins from the Sargasso Sea assemblies using BLASTp searches with representatives of previously identified halorhodopsinlike protein families as query sequences. (ii) All sequences greater than 75 amino acids in length were aligned to each other using CLUSTALw, and a neighbor-joining phylogenetic tree was inferred using the protdist and neighbor programs of Phylip. J C Venter et al. Science 2004;304:66-74 Published by AAAS
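As a rough illustration of clustering sequences into a tree, the sketch below repeatedly merges the two closest clusters of aligned sequences. It uses average-linkage clustering on Hamming distances, which is a deliberately simpler stand-in and not the neighbor-joining method (Phylip protdist and neighbor) used for the figure. The function names and the four toy sequences are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def average_linkage_tree(seqs):
    """Build a nested-tuple tree from a dict of name -> aligned sequence."""
    if not seqs:
        raise ValueError("need at least one sequence")
    # Each cluster is (subtree, list of member names); start with one leaf per sequence.
    clusters = [(name, [name]) for name in seqs]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            members_i, members_j = clusters[i][1], clusters[j][1]
            # Average pairwise distance between the two clusters.
            total = sum(hamming(seqs[a], seqs[b]) for a in members_i for b in members_j)
            dist = total / (len(members_i) * len(members_j))
            if best is None or dist < best[0]:
                best = (dist, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Toy aligned DNA sequences, invented for illustration only.
toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTGTCCGA", "d": "TTGTCCGT"}
print(average_linkage_tree(toy))
```

Any pairwise distance can be swapped in for `hamming`, which is one reason almost everything can be drawn as a tree.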
Aerobic, microaerobic and anaerobic communities But not every tree is a clustering
Model of microbial biomarkers Why are networks so popular in biology?
Don’t be afraid to get creative when representing data!
Figure labels (movie charts): Fast and Furious 6 (!?!), Man of Steel, Hunger Games, Iron Man 3, Thor; Hunger Games, Avengers, Dark Knight Rises, Twilight XXVII
http://xach.com/moviecharts/2013.html
Wordles
Looking at data – it’s not just fun, it’s important, too!
Anscombe's quartet
Four 11-pair datasets with the same... X mean, X standard deviation, Y mean, Y standard deviation, correlation, and regression coefficients
μ(x)=9 σ²(x)=11 (σ(x)≈3.32) μ(y)=7.5 σ²(y)≈4.1 (σ(y)≈2.03) ρ=0.816 y=3+0.5x
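The shared summary statistics can be checked directly. The sketch below hard-codes the standard published Anscombe values and computes the mean, sample variance, Pearson correlation, and least-squares line for each dataset. The names `ANSCOMBE` and `summarize` are illustrative choices, not from the slides.

```python
# Anscombe's quartet: datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return mean, sample variance, correlation, and regression line for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "var_x": sxx / (n - 1),  # sample variance
        "mean_y": mean_y,
        "var_y": syy / (n - 1),
        "r": sxy / (sxx * syy) ** 0.5,  # Pearson correlation
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


for name, (xs, ys) in ANSCOMBE.items():
    stats = summarize(xs, ys)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four rows print nearly identical numbers, yet scatter plots of the four datasets look nothing alike, which is the point of the slide.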