Published by Vivian Bradford. Modified over 8 years ago.
1
ASSEMBLY AND ALIGNMENT-FREE METHOD OF PHYLOGENY RECONSTRUCTION FROM NGS DATA
Huan Fan, Anthony R. Ives, Yann Surget-Groba and Charles H. Cannon
2
Traditional methods for building phylogeny
Requirements:
High coverage
Assembly
Detection of putative orthologous genes
Alignment
Phylogeny from a tiny portion of the whole genome
Genome-scale multi-sequence alignment is difficult
3
Alignment-free methods for building phylogeny
Typically from assembled genomes
De novo assembly with short reads?
Mainly on closely related prokaryotic genomes
No confidence assessment (e.g. bootstrapping)
4
Overview
Assembly and Alignment-Free method (AAF)
Calculate phylogenetic distances using whole-genome short-read sequencing data
Method validation
Genome complexity
Different genome sizes
Sequencing errors
Range of sequencing coverage
12 mammal species
21 tropical tree species
Comparison with andi
5
AAF method
Calculate pairwise genetic distances between each sample using the number of evolutionary changes between their genomes, which are represented by the number of k-mers that differ between genomes.
Phylogenetic relationships among the genomes are then reconstructed from the pairwise distance matrix.
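The k-mer comparison behind these pairwise distances can be sketched as follows. This is a minimal illustration, not the AAF implementation: the function names `kmers` and `shared_kmer_counts` are hypothetical, each sample is assumed to be a single string, and reverse complements and read-level filtering are ignored.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_counts(samples, k):
    """Map each pair of sample names to (n_shared, n_first, n_second).

    samples is a dict of name -> sequence (hypothetical input format).
    """
    sets = {name: kmers(seq, k) for name, seq in samples.items()}
    return {
        (a, b): (len(sets[a] & sets[b]), len(sets[a]), len(sets[b]))
        for a, b in combinations(sorted(sets), 2)
    }


if __name__ == "__main__":
    demo = {"A": "ACGTACGGT", "B": "ACGTACGCT"}
    print(shared_kmer_counts(demo, 4))
```

The resulting counts are the raw material for a pairwise distance matrix; turning them into distances requires an evolutionary model.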
6
AAF method - Evolutionary model
The probability that no mutation will occur within a given k-mer between species A and B is exp(−kd).
Mutations will decrease the number of shared k-mers, n_s, between species relative to the total number of k-mers, n_t.
If only substitutions occurred and all k-mers are unique, then all the species will have the same total number of k-mers, n_t, and the maximum likelihood estimate of exp(−kd) is n_s/n_t.
Insertion: loss of (k − 1) or gain of (l + k − 1) k-mers
Deletion: loss of (l + k − 1) or gain of (k − 1) k-mers
Greater effect
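Setting exp(−kd) equal to its estimate n_s/n_t and solving for d gives d = −(1/k) ln(n_s/n_t). The sketch below only rearranges that relation; the function name `kmer_distance` is hypothetical, and which total to use for n_t when the two species have different k-mer counts is left to the caller.

```python
import math


def kmer_distance(n_shared, n_total, k):
    """Solve exp(-k * d) = n_shared / n_total for d."""
    if not 0 < n_shared <= n_total:
        raise ValueError("need 0 < n_shared <= n_total")
    return -math.log(n_shared / n_total) / k


if __name__ == "__main__":
    # Half of the k-mers shared with k = 21 (made-up counts).
    print(kmer_distance(500_000, 1_000_000, 21))
```

When no k-mers are shared the estimate is undefined, which is why the sketch rejects `n_shared == 0` rather than returning infinity.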
7
K-mer sensitivity and homoplasy
No assembly -> not all indels identified
If a k-mer covers multiple substitutions, they are not all identified
Shorter k-mers -> better sensitivity
Shorter k-mers -> same k-mers from evolutionarily different regions (homoplasy)
8
K-mer homoplasy
k = 15, genome size > 5 × 10^8 => same k-mers occur randomly in other species
May incorrectly inflate the proportion of shared k-mers
The optimal k for phylogenetic reconstruction is the k that is just large enough to greatly reduce k-mer homoplasy for a given genome size
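A rough way to see why k = 15 is too short for a genome of this size is to ask how likely a given k-mer is to appear by chance. The sketch below assumes a uniformly random genome with equal base frequencies and a Poisson approximation; it is a simplified stand-in, not the prediction used in the talk, and the function name is hypothetical.

```python
import math


def random_kmer_hit_probability(genome_size, k):
    """Approximate P(a given k-mer occurs at least once by chance).

    Assumes independent, equiprobable bases (a simplification).
    """
    positions = genome_size - k + 1
    expected_hits = positions / 4 ** k
    return -math.expm1(-expected_hits)  # 1 - exp(-expected_hits)


if __name__ == "__main__":
    for k in (15, 19, 21, 25):
        print(k, random_kmer_hit_probability(5 * 10**8, k))
```

Under these assumptions the chance of a random match is substantial at k = 15 for a 5 × 10^8 bp genome and falls off quickly as k grows, since the number of possible k-mers quadruples with each added base.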
9
p_h
Prediction of the ratio n_s/n_t
Large genomes and small k: p_h = 1, all possible k-mers occur in both species.
This problem is exacerbated if GC content is biased, which will inflate the average similarity in genomic k-mer composition.
Sufficiently large k will overcome homoplasy, regardless of the evolutionary distance between species.
10
Mathematical prediction
11
Random ancestral sequence
12
Real (non-random) sequence
13
Assembly-free
Sampling error caused by low genome coverage
The actual number of k-mers will be under-represented given low sequencing coverage
Sequencing errors
Loss of true k-mers and gain of false k-mers
Filtering = remove singletons
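Singleton filtering can be sketched as counting every k-mer across the reads and keeping only those seen more than once, on the reasoning that a k-mer created by a sequencing error is unlikely to recur. The function names and the list-of-strings read format are assumptions for illustration, not the AAF code.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count occurrences of every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_singletons(counts):
    """Keep only k-mers observed more than once."""
    return {kmer for kmer, n in counts.items() if n > 1}


if __name__ == "__main__":
    reads = ["ACGTA", "ACGTA", "ACGTT"]  # last read ends in a made-up error
    print(sorted(filter_singletons(count_kmers(reads, 4))))
```

The trade-off is visible in the slide itself: at low coverage, true k-mers that happen to be sampled only once are discarded along with the erroneous ones.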
14
Sequencing errors
p = observed/true
Coverage of 5-8 is sufficient to observe all true k-mers when filtering => tip corrections
15
Filter only singletons?
17
Bootstrapping
Nonparametric bootstrap
1) Resample original reads with replacement
2) “Block bootstrap”: take rows with probability 1/k
OR
Two-stage parametric bootstrap
Estimate the variances in distances between species caused by sampling and evolutionary variation
Independent of genome size
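Step 1 of the nonparametric bootstrap, resampling the original reads with replacement, can be sketched as below. It covers only that step; the block bootstrap and the two-stage parametric bootstrap are not shown, and the function names and seed handling are assumptions for illustration.

```python
import random


def bootstrap_reads(reads, rng):
    """Draw len(reads) reads with replacement from the original reads."""
    return rng.choices(reads, k=len(reads))


def bootstrap_replicates(reads, n_replicates, seed=0):
    """Build several resampled read sets from one sample."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    return [bootstrap_reads(reads, rng) for _ in range(n_replicates)]


if __name__ == "__main__":
    for replicate in bootstrap_replicates(["r1", "r2", "r3", "r4"], 3):
        print(replicate)
```

Each replicate would then go through the same k-mer counting and distance calculation as the original reads, so that the spread of the resulting distances reflects sampling variation.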
18
Bushbaby (galago)
19
Tarsier
20
Recently published phylogeny of primates
21
Assembled genomes, k=19
22
Assembled genomes, k=21
23
Simulated reads
25
Real data – tropical trees
Intsia palembanica
26
Advantages
Low coverage requirements
Low computational demands
12 primates: 25 GB RAM, 12 threads
Limitations
Loss of k-mer sensitivity
Deep nodes
Location of mutations
27
Distance computing for 73 Escherichia strains
AAF: 32 + 76 min = 1 h 48 min
andi: 21 min
28
AAF andi