ASSEMBLY AND ALIGNMENT-FREE METHOD OF PHYLOGENY RECONSTRUCTION FROM NGS DATA Huan Fan, Anthony R. Ives, Yann Surget-Groba and Charles H. Cannon.

Traditional methods for building phylogeny
- Requirements: high coverage, assembly, detection of putative orthologous genes, alignment
- Phylogeny is built from a tiny portion of the whole genome
- Genome-scale multiple sequence alignment is difficult

Alignment-free methods for building phylogeny
- Typically from assembled genomes
- De novo assembly with short reads?
- Mainly on closely related prokaryotic genomes
- No confidence assessment (e.g. bootstrapping)

Overview
- Assembly and Alignment-Free method (AAF)
- Calculate phylogenetic distances using whole-genome short-read sequencing data
- Method validation: genome complexity, different genome sizes, sequencing errors, range of sequencing coverage
- 12 mammal species
- 21 tropical tree species
- Comparison with andi

AAF method
- Calculate pairwise genetic distances between samples from the number of evolutionary changes between their genomes, represented by the number of k-mers that differ between genomes.
- Reconstruct phylogenetic relationships among the genomes from the pairwise distance matrix.

AAF method - Evolutionary model
- The probability that no mutation occurs within a given k-mer between species A and B is exp(−kd).
- If only substitutions occur and all k-mers are unique, all species have the same total number of k-mers, n_t, and the maximum likelihood estimate of exp(−kd) is n_s/n_t.
- Mutations decrease the number of shared k-mers, n_s, between species relative to the total number of k-mers, n_t.
- Insertion of length l: loss of (k − 1) or gain of (l + k − 1) k-mers
- Deletion of length l: loss of (l + k − 1) or gain of (k − 1) k-mers
- Greater effect
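The estimate above can be turned into a distance by solving exp(−kd) = n_s/n_t for d, giving d = ln(n_t/n_s)/k. The following is a minimal Python sketch of that step, not the AAF implementation itself. The function names are hypothetical, n_t is assumed here to be the smaller of the two k-mer counts, and reverse complements are ignored for brevity.

```python
import math


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(kmers_a, kmers_b, k):
    """Distance d solving exp(-k*d) = n_s/n_t.

    Assumption: n_t is taken as the smaller of the two k-mer counts.
    """
    n_s = len(kmers_a & kmers_b)            # shared k-mers
    n_t = min(len(kmers_a), len(kmers_b))   # total k-mers (assumed definition)
    if n_s == 0:
        return math.inf                     # no shared k-mers: distance undefined
    return math.log(n_t / n_s) / k


def distance_matrix(seqs, k):
    """Pairwise distances for a dict mapping sample name -> sequence."""
    sets = {name: kmers(seq, k) for name, seq in seqs.items()}
    return {a: {b: kmer_distance(sets[a], sets[b], k) for b in sets}
            for a in sets}
```

The resulting matrix is the input a distance-based tree-building step would take; that step is not shown here.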

K-mer sensitivity and homoplasy
- No assembly -> not all indels identified
- A k-mer may cover multiple substitutions
- Shorter k-mers -> better sensitivity
- Shorter k-mers -> same k-mers from evolutionarily different regions (homoplasy)

K-mer homoplasy
- k = 15, genome size > 5 × 10^8 => same k-mers occur randomly in other species
- May incorrectly inflate the proportion of shared k-mers
- The optimal k for phylogenetic reconstruction is the k that is just large enough to greatly reduce k-mer homoplasy for a given genome size
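To see why k = 15 is too short for genomes of this size, a rough calculation helps. The sketch below assumes a uniformly random genome with equal base frequencies, which is a simplification and not the model used in the slides. Under that assumption, the chance that a given k-mer appears at least once in a genome of size G is about 1 − exp(−G/4^k). The function names and the 1% threshold are hypothetical.

```python
import math


def chance_kmer_probability(k, genome_size):
    """P(a given k-mer occurs at least once in a random genome).

    Assumes independent, equiprobable bases (Poisson approximation).
    """
    return -math.expm1(-genome_size / 4 ** k)


def smallest_k(genome_size, max_probability=0.01):
    """Smallest k whose chance-occurrence probability is below a threshold.

    The threshold is an illustrative choice, not a value from the slides.
    """
    k = 1
    while chance_kmer_probability(k, genome_size) > max_probability:
        k += 1
    return k
```

With a genome of 5 × 10^8 bases, the probability at k = 15 is far from negligible under this model, while a few extra bases of k-mer length reduce it sharply because 4^k grows by a factor of 4 per base.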

p_h: prediction of the ratio n_s/n_t
- Large genomes and small k: p_h = 1, all possible k-mers occur in both species.
- GC content: this problem is exacerbated if GC content is biased, which will inflate the average similarity in genomic k-mer composition.
- Sufficiently large k will overcome homoplasy, regardless of the evolutionary distance between species.

Mathematical prediction

Random ancestral sequence

Real (non-random) sequence

Assembly-free
- Sampling error caused by low genome coverage: the actual number of k-mers will be under-represented given low sequencing coverage
- Sequencing errors: loss of true k-mers and gain of false k-mers
- Filtering = remove singletons
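Singleton filtering can be sketched in a few lines of Python. This is an illustration of the idea only: it counts k-mers directly across reads and keeps those seen at least twice, on the reasoning that a k-mer created by a sequencing error is unlikely to recur. The function names and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_singletons(counts, min_count=2):
    """Keep only k-mers observed at least min_count times."""
    return {kmer for kmer, n in counts.items() if n >= min_count}
```

The trade-off is the one named on the slide: filtering removes false k-mers but also discards true k-mers that happened to be sampled only once, which matters most at low coverage.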

Sequencing errors
- p = observed/true
- Coverage of 5-8 is sufficient to observe all true k-mers when filtering
- => Tip corrections
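A back-of-the-envelope model suggests why coverage in this range is enough. The sketch below is not the slides' derivation: it assumes read start positions are Poisson distributed, that reads have a fixed length, and that there are no sequencing errors. The read length is a hypothetical input that the slides do not specify. A true k-mer survives singleton filtering if at least two reads cover it completely.

```python
import math


def prob_kmer_retained(coverage, read_length, k, min_count=2):
    """P(a true k-mer is seen at least min_count times).

    Assumptions: Poisson-distributed read starts, fixed read length,
    no sequencing errors.
    """
    # Expected number of reads that contain the whole k-mer.
    lam = coverage * (read_length - k + 1) / read_length
    p_below = sum(math.exp(-lam) * lam ** i / math.factorial(i)
                  for i in range(min_count))
    return 1.0 - p_below
```

Under these assumptions the retained fraction rises steeply with coverage, so distances estimated from low-coverage samples need a correction for the true k-mers that were never observed.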

Filter only singletons?

Bootstrapping
- Nonparametric bootstrap
  1) Resample original reads with replacement
  2) "Block bootstrap": take rows with probability 1/k
- OR two-stage parametric bootstrap
- Estimate the variances in distances between species caused by sampling and evolutionary variation
- Independent of genome size
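The two nonparametric steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the AAF code: "rows" is assumed to mean the rows of a k-mer table (one row per k-mer), and both function names are hypothetical. Keeping each row with probability 1/k is one way to account for the fact that adjacent k-mers overlap and are therefore not independent.

```python
import random


def resample_reads(reads, rng):
    """Stage 1: draw len(reads) reads with replacement."""
    return [rng.choice(reads) for _ in reads]


def block_bootstrap_rows(rows, k, rng):
    """Stage 2: keep each k-mer table row with probability 1/k.

    Assumption: rows are the per-k-mer rows of a shared k-mer table.
    """
    return [row for row in rows if rng.random() < 1.0 / k]
```

A replicate would use a seeded generator such as `random.Random(0)`, rebuild the k-mer table from the resampled reads, subsample its rows, and recompute the distance matrix and tree; repeating this gives support values for each node.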

Bushbaby (galago)

Tarsier

Recently published phylogeny of primates

Assembled genomes, k=19

Assembled genomes, k=21

Simulated reads

Real data – tropical trees Intsia palembanica

Advantages
- Low coverage requirements
- Low computational demands: 12 primates, 25 GB RAM, 12 threads
Limitations
- Loss of k-mer sensitivity
- Deep nodes
- Location of mutations

Distance computing for 73 Escherichia strains
- AAF: 1 h 48 min
- andi: 21 min

AAF andi