ASSEMBLY AND ALIGNMENT-FREE METHOD OF PHYLOGENY RECONSTRUCTION FROM NGS DATA Huan Fan, Anthony R. Ives, Yann Surget-Groba and Charles H. Cannon.


Traditional methods for building phylogeny. Requirements: high coverage; assembly; detection of putative orthologous genes; alignment. The phylogeny is built from a tiny portion of the whole genome. Genome-scale multiple sequence alignment is difficult.

Alignment-free methods for building phylogeny. Typically work from assembled genomes (de novo assembly with short reads?). Applied mainly to closely related prokaryotic genomes. No confidence assessment (e.g. bootstrapping).

Overview. Assembly and Alignment-Free method (AAF): calculate phylogenetic distances using whole-genome short-read sequencing data. Method validation: genome complexity; different genome sizes; sequencing errors; range of sequencing coverage. 12 mammal species. 21 tropical tree species. Comparison with andi.

AAF method. Calculate the pairwise genetic distance between each pair of samples from the number of evolutionary changes between their genomes, which is represented by the number of k-mers that differ between the genomes. Phylogenetic relationships among the genomes are then reconstructed from the pairwise distance matrix.
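The k-mer comparison described on this slide can be sketched as follows. This is a minimal illustration, not the AAF implementation: the function names are hypothetical, the inputs are plain sequences rather than reads, reverse complements are ignored, and taking the smaller of the two k-mer counts as the total is an assumption made for the example.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_and_total(seq_a, seq_b, k):
    """Return (n_s, n_t): shared k-mers and the smaller of the two k-mer counts.

    Using min() for n_t is an assumption of this sketch.
    """
    a = kmer_set(seq_a, k)
    b = kmer_set(seq_b, k)
    return len(a & b), min(len(a), len(b))


# Two toy sequences that differ by one substitution (T -> A at position 5).
n_s, n_t = shared_and_total("ACGTTGCA", "ACGTAGCA", k=3)
print(n_s, n_t)
```

A single substitution removes the k overlapping k-mers that cover it, which is why the count of differing k-mers tracks the number of changes.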

AAF method - Evolutionary model. The probability that no mutation will occur within a given k-mer between species A and B is exp(−kd), where d is the distance between them. If only substitutions occur and all k-mers are unique, then all species have the same total number of k-mers, n_t, and the maximum likelihood estimate of exp(−kd) is n_s/n_t. Mutations decrease the number of shared k-mers, n_s, between species relative to the total number of k-mers, n_t. Insertion of length l: loss of (k − 1) or gain of (l + k − 1) k-mers. Deletion of length l: loss of (l + k − 1) or gain of (k − 1) k-mers. Greater effect.
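Setting exp(−kd) equal to its estimate n_s/n_t and solving for d gives d = −(1/k) ln(n_s/n_t). A small sketch of that arithmetic follows; the function name is hypothetical, and returning infinity when no k-mers are shared is a choice made for this example.

```python
import math


def distance_from_kmers(n_shared, n_total, k):
    """Solve exp(-k * d) = n_shared / n_total for d."""
    if n_total <= 0 or n_shared > n_total:
        raise ValueError("need 0 <= n_shared <= n_total and n_total > 0")
    if n_shared <= 0:
        return math.inf  # no shared k-mers: distance is unbounded (sketch choice)
    return math.log(n_total / n_shared) / k


# Example: half of the 21-mers are shared between two genomes.
print(distance_from_kmers(500_000, 1_000_000, k=21))
```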

K-mer sensitivity and homoplasy. No assembly -> not all indels are identified. Sensitivity is lost if a k-mer covers multiple substitutions. Shorter k-mers -> better sensitivity. Shorter k-mers -> the same k-mers arise from evolutionarily different regions (homoplasy).

K-mer homoplasy. With k = 15 and genome size > 5×10^8, the same k-mers occur randomly in other species, which may incorrectly inflate the proportion of shared k-mers. The optimal k for phylogenetic reconstruction is the k which is just large enough to greatly reduce k-mer homoplasy for a given genome size.
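The dependence on genome size can be illustrated with a back-of-the-envelope model. This is not the formula from the talk: it assumes a random genome with equal base frequencies, a single strand, and independence between positions, so that a given k-mer appears by chance with probability about 1 − exp(−G/4^k) in a genome of G positions. The function names and the 1% threshold are hypothetical.

```python
import math


def chance_kmer_probability(genome_size, k):
    """P(a given k-mer occurs at least once by chance) under a uniform random model."""
    return -math.expm1(-genome_size / 4 ** k)


def smallest_k(genome_size, max_probability=0.01):
    """Smallest k whose chance-occurrence probability is at most max_probability."""
    k = 1
    while chance_kmer_probability(genome_size, k) > max_probability:
        k += 1
    return k


print(chance_kmer_probability(5e8, 15))  # roughly 0.37 for k = 15
print(smallest_k(5e8))                   # 18 under the 1% threshold
```

Biased GC content makes real genomes less uniform than this model assumes, so the k needed in practice may be larger.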

Prediction of the ratio n_s/n_t. Large genomes and small k: p_h = 1, i.e. all possible k-mers occur in both species. This problem is exacerbated if GC content is biased, which will inflate the average similarity in genomic k-mer composition. Sufficiently large k will overcome homoplasy, regardless of the evolutionary distance between species.

Mathematical prediction

Random ancestral sequence

Real (non-random) sequence

Assembly-free. Sampling error caused by low genome coverage: the actual number of k-mers will be under-represented given low sequencing coverage. Sequencing errors: loss of true k-mers and gain of false k-mers. Filtering = remove singletons.
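Singleton filtering can be sketched as counting every k-mer across the reads and keeping only those seen at least twice. The function names and the `min_count` parameter are hypothetical, and reverse complements are ignored for brevity.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count occurrences of each k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_singletons(counts, min_count=2):
    """Keep k-mers observed at least min_count times."""
    return {kmer for kmer, c in counts.items() if c >= min_count}


# The third read carries a sequencing error in its last base.
reads = ["ACGTA", "ACGTA", "ACGTT"]
kept = filter_singletons(count_kmers(reads, k=4))
print(sorted(kept))  # the erroneous k-mer CGTT is dropped
```

The same filter also discards true k-mers that happen to be sampled only once, which is why it interacts with sequencing coverage.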

Sequencing errors: p = observed/true. A coverage of 5-8 is sufficient to observe all true k-mers when filtering => tip corrections.
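How coverage interacts with singleton filtering can be illustrated with a simple Poisson model. This is an assumption of the sketch, not necessarily the model behind the tip corrections: reads are taken to start uniformly at random, sequencing errors are ignored, and a true k-mer is counted as observed only if it is sampled at least twice. The read length of 100 and the function names are hypothetical.

```python
import math


def kmer_coverage(read_coverage, read_length, k):
    """Expected number of times a genomic k-mer is sampled."""
    return read_coverage * (read_length - k + 1) / read_length


def prob_survives_filter(read_coverage, read_length, k):
    """P(a true k-mer is seen at least twice) under a Poisson sampling model."""
    lam = kmer_coverage(read_coverage, read_length, k)
    return 1.0 - math.exp(-lam) * (1.0 + lam)


for coverage in (1, 2, 5, 8):
    print(coverage, round(prob_survives_filter(coverage, read_length=100, k=21), 3))
```

The fraction p of true k-mers that survive is less than one at low coverage, and a correction based on p would rescale the observed k-mer counts accordingly.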

Filter only singletons?

Bootstrapping. Nonparametric bootstrap: 1) resample the original reads with replacement; 2) "block bootstrap": take rows with probability 1/k. OR two-stage parametric bootstrap: estimate the variances in distances between species caused by sampling and evolutionary variation; independent of genome size.
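The two nonparametric steps can be sketched as below. This is one reading of the slide rather than the AAF code: "rows" is taken to mean rows of a k-mer table (one row per k-mer), and each row is kept independently with probability 1/k, reflecting that overlapping k-mers are not independent. All names are hypothetical.

```python
import random


def resample_reads(reads, rng):
    """Step 1: draw len(reads) reads with replacement."""
    return rng.choices(reads, k=len(reads))


def subsample_rows(kmer_rows, k, rng):
    """Step 2: keep each k-mer table row independently with probability 1/k."""
    return [row for row in kmer_rows if rng.random() < 1.0 / k]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
reads = ["ACGTA", "CGTAC", "GTACG", "TACGT"]
print(resample_reads(reads, rng))

rows = [("ACGT", 1, 1), ("CGTA", 1, 0), ("GTAC", 0, 1), ("TACG", 1, 1)]
print(subsample_rows(rows, k=4, rng=rng))
```

Each bootstrap replicate would then be turned into a distance matrix and a tree, and node support read off as the fraction of replicate trees containing that node.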

Bushbaby (galago)

Tarsier

Recently published phylogeny of primates

Assembled genomes, k=19

Assembled genomes, k=21

Simulated reads

Real data – tropical trees Intsia palembanica

Advantages: low coverage requirements; low computational demands (12 primates: 25 GB RAM, 12 threads). Limitations: loss of k-mer sensitivity; deep nodes; location of mutations.

Distance computing for 73 Escherichia strains: AAF 32 + 76 min = 1 h 48 min; andi 21 min.

AAF andi
