Genomic Data Manipulation: Thinking about data visually


Curtis Huttenhower (chuttenh@hsph.harvard.edu)
http://huttenhower.sph.harvard.edu/bio508
01-27-14
Harvard School of Public Health, Department of Biostatistics

The usual suspects
Bar plot = discrete number of discrete values
Stripchart = discrete number of small numbers of continuous values
Boxplot = discrete number of large numbers of continuous values
Histogram = discretized bins of counts
Density plot = continuous interpolation of counts
Scatter plot = pairs of continuous values
Line plot = function of continuous values
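To make the histogram versus density plot distinction concrete, here is a minimal sketch in plain Python. The function names, the sample values, the bin edges, and the bandwidth are all assumptions chosen for illustration; they do not come from the slides. The histogram counts values in equal-width bins, while the density estimate averages one Gaussian bump per data point (a basic kernel density estimate).

```python
import math

def histogram(values, n_bins, lo, hi):
    """Count how many values fall into each of n_bins equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # Clamp the index so that v == hi lands in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts

def gaussian_kde(values, bandwidth):
    """Return a function estimating density as the average of Gaussian bumps."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density

# Illustrative data only (hypothetical measurements).
values = [1.2, 1.9, 2.1, 2.4, 2.5, 3.3, 3.8, 4.0, 4.1, 6.7]

counts = histogram(values, n_bins=3, lo=0.0, hi=9.0)   # discretized counts
density = gaussian_kde(values, bandwidth=0.5)          # smooth curve

print(counts)
print(round(density(2.3), 3), round(density(8.0), 3))
```

The histogram result depends on where the bin edges fall, and the density curve depends on the bandwidth, so both choices are worth varying before trusting the shape of either plot.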

Small changes, big differences
Boxplots can be decorated as...
Beeswarm plots = mashup of boxplot + stripchart
Violin plots = mashup of boxplot + density plot
Scatter plots can be decorated as...
Sunflower plot = mashup of scatter + histogram
2D density plot = mashup of scatter + density

Fig. 3. Comparison of Sargasso Sea scaffolds to Crenarchaeal clone 4B7. Predicted proteins from 4B7 and the scaffolds showing significant homology to 4B7 by tBLASTx are arrayed in positional order along the x and y axes. Colored boxes represent BLASTp matches scoring at least 25% similarity and with an E-value better than 1e-5. Black vertical and horizontal lines delineate scaffold borders.
J. C. Venter et al., Science 2004;304:66-74. Published by AAAS.

Only one of many ways to think about DNA sequence data...

(Almost) everything can be clustered into a tree, even DNA sequences
Fig. 7. Phylogenetic tree of rhodopsin-like genes in the Sargasso Sea data along with all homologs of these genes in GenBank. The sequences are colored according to the type of sample in which they were found: blue, cultured species; yellow, sequences from uncultured organisms in other environmental samples; and red, sequences from uncultured species in the Sargasso Sea. The tree was divided into what we propose are distinct subfamilies of sequences, which are labeled on the right. The tree was constructed as follows: (i) All homologs of halorhodopsin were identified in the predicted proteins from the Sargasso Sea assemblies using BLASTp searches with representatives of previously identified halorhodopsin-like protein families as query sequences. (ii) All sequences greater than 75 amino acids in length were aligned to each other using CLUSTALW, and a neighbor-joining phylogenetic tree was inferred using the protdist and neighbor programs of Phylip.
J. C. Venter et al., Science 2004;304:66-74. Published by AAAS.

But not every tree is a clustering
[Figure: aerobic, microaerobic and anaerobic communities]

Why are networks so popular in biology?
[Figure: model of microbial biomarkers]

Don’t be afraid to get creative when representing data!
[Figure: movie charts from http://xach.com/moviecharts/2013.html; labeled films: Fast and Furious 6 (!?!), Man of Steel, Hunger Games, Iron Man 3, Thor, Avengers, Dark Knight Rises, Twilight XXVII]

Wordles

Looking at data – it’s not just fun, it’s important, too!
Anscombe's quartet: four 11-pair datasets with the same...
X mean, X variance, Y mean, Y variance, correlation, and regression coefficients
μ(x)=9 σ²(x)=11 μ(y)=7.5 σ²(y)≈4.1 ρ=0.816 y=3+0.5x
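The point of Anscombe's quartet can be checked numerically. The sketch below uses the published quartet values (Anscombe, 1973) and a hypothetical helper, `summarize`, written for this illustration; it computes the mean, sample variance, Pearson correlation, and least-squares line for each of the four datasets. Note that the shared spread values of 11 for x and roughly 4.1 for y are sample variances.

```python
import math

def summarize(xs, ys):
    """Mean, sample variance, Pearson correlation, and least-squares fit of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "var_x": sxx / (n - 1),
        "mean_y": mean_y,
        "var_y": syy / (n - 1),
        "r": sxy / math.sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }

# Datasets I-III share the same x values; dataset IV has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

ANSCOMBE = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summaries = {name: summarize(xs, ys) for name, (xs, ys) in ANSCOMBE.items()}

for name, s in summaries.items():
    print(name, {key: round(value, 3) for key, value in s.items()})
```

The four rows of output agree to about two decimal places, yet scatter plots of the same data look nothing alike: one is roughly linear, one is curved, one is a tight line with a single outlier, and one has every x value but one stacked at 8.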