ASSEMBLY AND ALIGNMENT-FREE METHOD OF PHYLOGENY RECONSTRUCTION FROM NGS DATA
Huan Fan, Anthony R. Ives, Yann Surget-Groba and Charles H. Cannon
Traditional methods for building phylogeny. Requirements: high coverage; assembly; detection of putative orthologous genes; alignment. The phylogeny is built from a tiny portion of the whole genome, and genome-scale multiple sequence alignment is difficult.
Alignment-free methods for building phylogeny: typically work from assembled genomes (de novo assembly with short reads?); mainly applied to closely related prokaryotic genomes; no confidence assessment (e.g. bootstrapping).
Overview. Assembly and Alignment-Free method (AAF): calculate phylogenetic distances using whole-genome short-read sequencing data. Method validation: genome complexity; different genome sizes; sequencing errors; range of sequencing coverage; 12 mammal species; 21 tropical tree species. Comparison with andi.
AAF method: calculate the pairwise genetic distance between each pair of samples from the number of evolutionary changes between their genomes, which is represented by the number of k-mers that differ between the genomes. Phylogenetic relationships among the genomes are then reconstructed from the pairwise distance matrix.
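The k-mer comparison described above can be illustrated with a minimal sketch. It assumes short toy sequences held in memory and forward-strand k-mers only; the function names `kmer_set` and `shared_and_differing` and the example sequences are hypothetical and are not taken from the AAF software.

```python
from itertools import combinations


def kmer_set(seq, k):
    """Return the set of distinct k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_and_differing(a, b):
    """Return (number of shared k-mers, number of k-mers found in only one set)."""
    return len(a & b), len(a ^ b)


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    genomes = {"A": "ACGTACGGTCA", "B": "ACGTACCGTCA", "C": "TTGTACGGTCA"}
    k = 4
    sets = {name: kmer_set(seq, k) for name, seq in genomes.items()}
    for x, y in combinations(sorted(sets), 2):
        shared, differing = shared_and_differing(sets[x], sets[y])
        print(x, y, "shared:", shared, "differing:", differing)
```

For real short-read data the k-mers would be counted across all reads of a sample rather than from one contiguous sequence, and a dedicated k-mer counter would be needed for memory reasons.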
AAF method - Evolutionary model. The probability that no mutation will occur within a given k-mer between species A and B is exp(−kd), where d is the evolutionary distance between them. If only substitutions occur and all k-mers are unique, then all species have the same total number of k-mers, n_t, and the maximum likelihood estimate of exp(−kd) is n_s/n_t. Mutations decrease the number of shared k-mers, n_s, between species relative to the total number of k-mers, n_t. Indels of length l have a greater effect than substitutions. Insertion: loss of (k − 1) or gain of (l + k − 1) k-mers. Deletion: loss of (l + k − 1) or gain of (k − 1) k-mers.
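Solving exp(−kd) = n_s/n_t for d gives d = −(1/k) ln(n_s/n_t). The sketch below evaluates that expression under the slide's substitution-only assumption. The function name `aaf_distance` and its argument names are hypothetical, and how n_t is chosen when the two species have different k-mer totals is not covered here.

```python
import math


def aaf_distance(n_shared, n_total, k):
    """Distance estimate d = -(1/k) * ln(n_shared / n_total).

    Assumes the substitution-only model: every k-mer is unique and both
    species have the same total number of k-mers, n_total.
    """
    if n_total <= 0 or k <= 0:
        raise ValueError("n_total and k must be positive")
    if not 0 <= n_shared <= n_total:
        raise ValueError("n_shared must lie between 0 and n_total")
    if n_shared == 0:
        return math.inf  # no shared k-mers: the distance is unbounded
    return -math.log(n_shared / n_total) / k


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the call.
    print(aaf_distance(n_shared=800_000, n_total=1_000_000, k=21))
```

Because the logarithm is divided by k, the same drop in n_s/n_t corresponds to a smaller per-site distance when k is larger.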
K-mer sensitivity and homoplasy. No assembly -> not all indels are identified. Sensitivity is lost if a k-mer covers multiple substitutions. Shorter k-mers -> better sensitivity. Shorter k-mers -> the same k-mers arise from evolutionarily different regions (homoplasy).
K-mer homoplasy. With k = 15 and genome size > 5×10^8, the same k-mers occur randomly in other species, which may incorrectly inflate the proportion of shared k-mers. The optimal k for phylogenetic reconstruction is the k that is just large enough to greatly reduce k-mer homoplasy for a given genome size.
p_h: prediction of the ratio n_s/n_t. For large genomes and small k, p_h = 1: all possible k-mers occur in both species. This problem is exacerbated if GC content is biased, which inflates the average similarity in genomic k-mer composition. A sufficiently large k will overcome homoplasy, regardless of the evolutionary distance between species.
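A rough calculation shows why k = 15 is too short for a genome of this size. There are 4^15 ≈ 1.07×10^9 possible 15-mers, which is the same order of magnitude as the number of positions in a 5×10^8 bp genome. The sketch below assumes a uniformly random, single-stranded sequence with independent positions. That is a simplifying assumption for illustration, not the model used by the authors, and the function names are hypothetical.

```python
import math


def chance_kmer_probability(genome_size, k):
    """Probability that one specific k-mer occurs at least once by chance.

    Assumes each of the ~genome_size positions independently starts a
    uniformly random k-mer, so p = 1 - (1 - 4**-k) ** genome_size.
    """
    n_possible = 4 ** k
    # expm1/log1p keep precision when 1 / n_possible is tiny.
    return -math.expm1(genome_size * math.log1p(-1.0 / n_possible))


def smallest_k_below(genome_size, threshold, k_max=64):
    """Smallest k whose chance-occurrence probability drops below threshold."""
    for k in range(1, k_max + 1):
        if chance_kmer_probability(genome_size, k) < threshold:
            return k
    raise ValueError("no k up to k_max satisfies the threshold")


if __name__ == "__main__":
    genome_size = 5 * 10**8
    print(chance_kmer_probability(genome_size, 15))
    print(smallest_k_below(genome_size, 0.01))
```

Under these assumptions a given 15-mer has roughly a 0.37 chance of appearing somewhere in a 5×10^8 bp random genome. Real genomes are not random, and biased GC content makes chance matches more likely, so this should be read as a lower bound on the problem rather than a rule for choosing k.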
Mathematical prediction
Random ancestral sequence
Real (non-random) sequence
Assembly-free. Sampling error caused by low genome coverage: the actual number of k-mers will be under-represented given low sequencing coverage. Sequencing errors: loss of true k-mers and gain of false k-mers. Filtering = remove singletons.
Sequencing errors: p = observed/true. Coverage of 5-8 is sufficient to observe all true k-mers when filtering => tip corrections.
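Singleton filtering can be sketched as follows: count every k-mer across all reads of a sample and keep only those seen at least twice, on the reasoning that a k-mer created by a sequencing error is unlikely to recur. The function names and the toy reads are hypothetical, and the sketch ignores reverse complements.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count occurrences of each k-mer across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_singletons(counts, min_count=2):
    """Keep k-mers observed at least min_count times; min_count=2 drops singletons."""
    return {kmer for kmer, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    # Two identical reads plus one read carrying a single-base "error".
    reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
    kept = filter_singletons(count_kmers(reads, k=4))
    print(sorted(kept))
```

In this toy input the k-mers that occur only in the third read are discarded. At low coverage, true k-mers that happen to be sampled once are discarded as well, which is the loss of true k-mers noted on the slide.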
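One way to see why moderate coverage is enough is to ask how often a true k-mer is seen at least twice and therefore survives singleton filtering. The sketch below assumes the number of times a k-mer is sampled is Poisson distributed with mean equal to the k-mer coverage, and uses the common approximation that reads of length L at read coverage c give k-mer coverage c(L − k + 1)/L. Both the Poisson model and the read length of 100 are assumptions made for illustration and do not come from the slides.

```python
import math


def kmer_coverage(read_coverage, read_length, k):
    """Approximate k-mer coverage implied by a given read coverage."""
    return read_coverage * (read_length - k + 1) / read_length


def prob_retained(lam, min_count=2):
    """P(X >= min_count) for X ~ Poisson(lam): chance a true k-mer passes the filter."""
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(min_count))
    return 1.0 - below


if __name__ == "__main__":
    for read_cov in (2, 5, 8):
        lam = kmer_coverage(read_cov, read_length=100, k=21)
        print(read_cov, round(lam, 2), round(prob_retained(lam), 4))
```

Under this model most true k-mers are retained once the k-mer coverage reaches about 5, with a small remainder lost to the filter.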
Filter only singletons?
Bootstrapping. Nonparametric bootstrap: 1) resample original reads with replacement; 2) "block bootstrap": take rows with probability 1/k. OR two-stage parametric bootstrap: estimate the variances in distances between species caused by sampling and evolutionary variation; independent of genome size.
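The two nonparametric steps can be sketched as below. The sketch assumes that "rows" refers to the rows of a k-mer presence table with one row per k-mer and one column per species, and it implements only the row-sampling step as stated on the slide. The table layout, function names, and toy data are hypothetical.

```python
import random


def resample_reads(reads, rng):
    """Step 1: draw len(reads) reads with replacement from the original reads."""
    return rng.choices(reads, k=len(reads))


def block_bootstrap(kmer_table, k, rng):
    """Step 2: keep each row of the k-mer table with probability 1/k.

    Overlapping k-mers are not independent (a single site is covered by up
    to k k-mers), which is the assumed motivation for the 1/k thinning.
    """
    return {kmer: row for kmer, row in kmer_table.items() if rng.random() < 1.0 / k}


def shared_count(kmer_table, i, j):
    """Number of k-mers present in both species i and j (column indices)."""
    return sum(1 for row in kmer_table.values() if row[i] and row[j])


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the sketch is repeatable
    # Toy presence table: k-mer -> (present in species 0, present in species 1).
    table = {"ACGT": (1, 1), "CGTA": (1, 0), "GTAC": (1, 1), "TACG": (0, 1)}
    replicate = block_bootstrap(table, k=4, rng=rng)
    print(len(replicate), shared_count(replicate, 0, 1))
```

Each replicate table would then be turned into a distance matrix and a tree, and node support read off from how often a grouping recurs across replicates.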
Bushbaby (galago)
Tarsier
Recently published phylogeny of primates
Assembled genomes, k=19
Assembled genomes, k=21
Simulated reads
Real data – tropical trees Intsia palembanica
Advantages: low coverage requirements; low computational demands (12 primates: 25 GB RAM, 12 threads). Limitations: loss of k-mer sensitivity; deep nodes; location of mutations.
Distance computation for 73 Escherichia strains: AAF = 1 h 48 min; andi = 21 min.
AAF andi