

The Computational Biology of Genetically Diverse Assemblages
Allen Rodrigo 1, Frederic Bertels 1, Mehul Rathod 2, Sean Irvine 2, John Cleary 2,3, Peter Tsai 1
1 The Allan Wilson Centre for Molecular Ecology and Evolution and the Bioinformatics Institute New Zealand, University of Auckland
2 NetValue Ltd
3 Department of Computer Science, University of Waikato

Metagenomics
The study of the genetics of diverse assemblages of (micro)organisms from natural environments is called metagenomics.
Metagenomic studies:
– Utilise new high-throughput sequencing technologies
– Typically include unknown organisms and novel genes
– Generate large amounts of genetic data
– Can be performed in a range of environments
– Require significant computational resources and new algorithms
– Have the potential to revolutionise the way we think about the genetic makeup of the environment

The New Icons

Source: J. Craig Venter Institute

Preliminary Results of the GOS Study
2000 new protein “types”
– Many viral proteins
– New occurrences of proteins in previously unrecorded taxonomic groups
>6000 new open reading frames (potential protein-coding sequences)

Metagenomics of Communities at Neighbouring Thermal Vents
[Figure: rarefaction curves for higher taxa and for species. Source: Huber et al, 2007, Science 318]

The Marine Viromes Project

Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, et al. (2006) The Marine Viromes of Four Oceanic Regions. PLoS Biol 4(11): e368 doi: /journal.pbio

Community Comparisons
If the primary purpose is to relate community structure to environment, space or time, then:
– We need to quantify the similarities between different communities
– So that we can relate these similarities to the environmental, temporal or spatial similarities

Community Comparisons
The bottleneck in these analyses is the identification of each sequence in the sample.
Sequences may be amplicons of single loci or environmental shotgun sequences.

New Sequencing Technologies
Roche, Illumina, and Applied Biosystems have released next-generation sequencers that produce large quantities of sequence information.
– Millions of shotgun fragments, each between 25 nt and 250 nt long
– nt in a single run (within days)
Other technologies will follow.

Community Comparisons
The bottleneck in these analyses is the identification of each sequence in the sample.
The challenge is to either:
– Find algorithms that can speed up this process
– Free ourselves of the process

Identifying The Species Present
Using BLAST takes time. However, new tools are now available.
We used SLIMSearch:
– Proprietary search algorithm based on word matching
– Disclosure: I am on the SAB!

Identifying The Species Present
Simulations:
– Select 60 genomes at random from the set of 546 fully sequenced bacterial genomes
– Compute the number of reads for each of the 60 genomes following a log-normal distribution
   – 250 nt reads, 0.7x coverage (distributed over the 60 genomes using a log-normal distribution with mean = 2, standard deviation = 3.3)
   – Approx. 600,000 reads
– Set the error rate at 0.5%
   – Reads are generated by random selection from the genome followed by mutation (90% indels, 10% substitutions)
– Time SLIMSearch and BLASTN with each read set as query against the 546 genomes
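The simulation recipe on this slide can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' simulator: the function names are hypothetical, the slide's "mean = 2, standard deviation = 3.3" are assumed to be the parameters of the underlying normal distribution, an indel is modelled as a single-base insertion or deletion with equal probability, and the conversion from 0.7x coverage to a total read count is left out.

```python
import random

def allocate_reads(n_genomes, total_reads, mu=2.0, sigma=3.3, rng=random):
    """Split total_reads across genomes in proportion to log-normal weights."""
    # Assumption: mu and sigma describe the underlying normal distribution.
    weights = [rng.lognormvariate(mu, sigma) for _ in range(n_genomes)]
    total = sum(weights)
    counts = [int(total_reads * w / total) for w in weights]
    # Reads lost to rounding down go to the most abundant genome.
    counts[counts.index(max(counts))] += total_reads - sum(counts)
    return counts

def sample_read(genome, length=250, rng=random):
    """Cut one read of the given length from a random position."""
    start = rng.randrange(len(genome) - length + 1)
    return genome[start:start + length]

def mutate(read, error_rate=0.005, indel_fraction=0.9, rng=random):
    """Apply per-base errors: mostly indels, the rest substitutions."""
    out = []
    for base in read:
        if rng.random() >= error_rate:
            out.append(base)                      # no error at this base
        elif rng.random() < indel_fraction:
            if rng.random() < 0.5:                # insertion after the base
                out.append(base)
                out.append(rng.choice("ACGT"))
            # otherwise a deletion: the base is dropped
        else:                                     # substitution
            out.append(rng.choice([b for b in "ACGT" if b != base]))
    return "".join(out)
```

A run in the spirit of the slide would call `allocate_reads(60, 600000)` and then draw that many mutated reads from each chosen genome before timing the two search tools.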

Identifying The Species Present
BLASTN = 68.6 hrs
SLIMSearch = 6 mins (roughly 686 times faster)
Computer configuration (TAHI): 2 x dual-core Opteron 2212 (2.0 GHz), 8 GB RAM, 1 TB (2 x 500 GB), Debian AMD64 4.0 (Etch), Dell PowerEdge 1435

What About Identifying The Species Present?

Community Comparisons
The bottleneck in these analyses is the identification of each sequence in the sample.
Sequences may be amplicons of single loci or environmental shotgun sequences.
The challenge is to either:
– Find algorithms that can speed up this process
– Free ourselves of the process

Identification-Free Comparisons
We have chosen to explore the use of alignment-free methods. These can be classed into two broad types:
– Similarity of word frequency spectra
– Compression-type procedures

Similarity of Word Frequency Spectra
Define a word length, k.
For each taxon/sequence, identify the frequencies of all possible k-words.
Compare frequency spectra between pairs using an appropriate distance metric.
– Metrics tend to differ in how they normalise word frequencies, the distances used, and how expected frequencies are calculated.
Dates back to Blaisdell (1986).
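As a minimal sketch of the word-frequency idea, the code below counts overlapping k-words over the alphabet ACGT, normalises them to relative frequencies, and compares two spectra with Manhattan or Euclidean distance. The function names are hypothetical, and dividing by the total word count is only one of the normalisation choices the slide alludes to. No expected-frequency correction is applied.

```python
from collections import Counter
from itertools import product
import math

def word_spectrum(seq, k):
    """Relative frequency of every possible k-word, in lexicographic order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    # Words containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]

def manhattan(p, q):
    """Sum of absolute differences between two spectra."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Square root of the summed squared differences between two spectra."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

For two sequences `s1` and `s2`, `manhattan(word_spectrum(s1, 4), word_spectrum(s2, 4))` gives a word-length-4 distance. The spectrum has 4^k entries, so it grows quickly with k.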

Compression-based Methods
Some sophisticated maths, but a very simple idea.
What is the “compressibility” of two datasets when they are combined, relative to the sum of their individual “compressibilities”?
– How much shared information is there between two datasets?
Previous work has shown some nice phylogenetic properties.
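One way to make the compressibility question concrete is the normalised compression distance (NCD), sketched here with Python's standard `zlib` compressor. NCD is a well-known measure of this family and is used only as an illustration. It is not necessarily the distance or any of the compressors used in this work, and the helper names are hypothetical.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression)."""
    return len(zlib.compress(data, 9))

def ncd(x: str, y: str) -> float:
    """Normalised compression distance between two sequences.

    Near 0 when the sequences share most of their information,
    near 1 when compressing them together saves almost nothing.
    """
    bx, by = x.encode(), y.encode()
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Note that zlib only looks back about 32 KB, so this particular compressor cannot exploit similarity between inputs much longer than that. Results depend strongly on the compressor chosen.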

Alignment-free Comparisons
We applied word-frequency and compression algorithms to datasets consisting of:
(a) complete 16S rDNA sequences of 35 bacteria spanning a wide range of phyla and with a range of GC-contents, from the Ribosomal Database Project (Maidak et al, 1997);
(b) the same 16S rDNA sequences, cut into random short fragments of length 250 (+/-50), each with 3X coverage, using the program READSIM (source: readsim/welcome.html) with a relatively high error rate of approximately 4%;
(c) full genomes of the same bacteria as in (a).

Alignment-free Comparisons
Pairwise ML distances between the original sequences were obtained with PAUP* using models of substitution obtained with ModelTest.
22 compression algorithms used
– Ferragina et al. (2007)
– Distances computed using the Universal Compression Dissimilarity distance
Frequencies of k-words were compared using Manhattan or Euclidean distances.

Compression Algorithms: Distance comparisons with complete 16S rDNA

Word Algorithms: Distance comparisons with complete 16S rDNA
Panels: A) Manhattan, word length 4; B) Euclidean, word length 4; C) Euclidean, word length 6; D) Manhattan, word length 6; E) Manhattan, word length 8; F) Euclidean, word length 8; G) Manhattan, word length 7; H) Euclidean, word length 5; I) Manhattan, word length 5; J) Euclidean, word length 7

Compression Algorithms: Distance comparisons with short-read 16S rDNA

Word Algorithms: Distance comparisons with short-read 16S rDNA
Panels: A) Manhattan, word length 4; B) Euclidean, word length 4; C) Euclidean, word length 6; D) Manhattan, word length 6; E) Manhattan, word length 8; F) Euclidean, word length 8; G) Manhattan, word length 7; H) Euclidean, word length 5; I) Manhattan, word length 5; J) Euclidean, word length 7

Compression Algorithms: Distance comparisons with complete genomes

Word Algorithms: Distance comparisons with complete genomes
Panels: A) Manhattan, word length 4; B) Euclidean, word length 4; C) Euclidean, word length 6; D) Manhattan, word length 6; E) Manhattan, word length 8; F) Euclidean, word length 8; G) Manhattan, word length 7; H) Euclidean, word length 5; I) Manhattan, word length 5; J) Euclidean, word length 7

Problems and Challenges
It appears that we are able to use compression and word-frequency methods with a single locus.
With whole genomes, these methods break down.
– Lateral gene transfer
– GC content differences across the genome
– Numbers of repeats

Can we use alignment-free methods to quantify the similarity of communities for which only a single locus has been sequenced?
Simulations
– 100 communities
– Each with 10 randomly selected bacterial species’ 16S rRNA
– Log-normal species frequency distribution
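The community simulation might look something like the sketch below, which draws 10 species from a pool of single-locus sequences, assigns log-normal relative abundances, and summarises each community by an abundance-weighted k-word spectrum. All names are hypothetical. The log-normal parameters (mu = 0, sigma = 1) and the weighted-spectrum summary are assumptions made for illustration, since the slide does not give them.

```python
import random
from collections import Counter
from itertools import product

def make_community(pool, n_species=10, mu=0.0, sigma=1.0, rng=random):
    """Pick n_species names from pool; return {name: relative abundance}."""
    members = rng.sample(sorted(pool), n_species)
    # Assumed log-normal parameters; the slide does not state them.
    weights = [rng.lognormvariate(mu, sigma) for _ in members]
    total = sum(weights)
    return {name: w / total for name, w in zip(members, weights)}

def community_spectrum(community, pool, k=4):
    """Abundance-weighted mean of the members' k-word frequency spectra."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    spectrum = dict.fromkeys(words, 0.0)
    for name, abundance in community.items():
        seq = pool[name]
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        n = sum(counts[w] for w in words)
        if n == 0:
            continue  # sequence shorter than k or without ACGT words
        for w in words:
            spectrum[w] += abundance * counts[w] / n
    return [spectrum[w] for w in words]

def manhattan(p, q):
    """Sum of absolute differences between two community spectra."""
    return sum(abs(a - b) for a, b in zip(p, q))
```

Here `pool` is a dictionary mapping species names to their 16S sequences. Repeating `make_community` 100 times and computing `manhattan` on every pair of community spectra yields a distance matrix comparable in shape to the one the slide describes.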

Alignment-free Community Comparisons

Provisional Conclusions
Alignment-free methods hold promise for the rapid estimation of pairwise distances between amplicons and NGS reads from single species or communities.
They work less well with whole genomes.
Advances in search/identification strategies may remove the need for these fast methods.

Acknowledgements NZ-France Dumont D’Urville Fund