The use of short-read next-generation sequences to recover the evolutionary histories in multi-individual samples. Systematic biology presentation, Yuantong Ding, Dec. 6
Outline: Background; Workflow; Sequence comparison; Tree comparison; Summary & future work
Can short reads successfully recover phylogeny? Next-generation sequencing (NGS): low-cost, high-throughput, short-read. Multi-individual sample → short reads → reconstructed sequences → phylogeny? Background | Workflow | Sequence comparison | Tree comparison | Summary
Simulation process
Original genealogy and original haplotypes: simulated by SerialSimCoal with a coalescent model.
Short reads: simulated by MetaSim with the 454 error model.
Mapping (consensus sequence and short reads): alignment built by SHRiMP and SSAHA.
Reconstructed haplotypes: reconstructed by ShoRAH.
NJ trees: built by PAUP*.
Comparisons: compare tree topology; compare the number and similarity of haplotypes.
6 parameters used: effective population size N; sample size n; mutation rate μ; sequence length l; number of short reads Sr_N; length of short reads Sr_l.
Table columns: N, n, μ, l, Sr_N, Sr_l.
All 486 combinations of these parameters were simulated.
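A full factorial design like the one above can be enumerated with a Cartesian product. The sketch below is only illustrative: the level labels are placeholders, and the level counts (five parameters with three levels, one with two) are an assumption chosen because 3^5 × 2 = 486; the slide's actual values and level counts may differ.

```python
from itertools import product

# Placeholder levels only. The real values are not given here; the counts
# (five parameters x 3 levels, one parameter x 2 levels) are an assumption
# that happens to yield 486 combinations.
levels = {
    "N": ("low", "mid", "high"),      # effective population size
    "n": ("low", "mid", "high"),      # sample size
    "mu": ("low", "mid", "high"),     # mutation rate
    "l": ("short", "long"),           # sequence length
    "Sr_N": ("low", "mid", "high"),   # number of short reads
    "Sr_l": ("low", "mid", "high"),   # length of short reads
}


def parameter_grid(levels):
    """Return one dict per combination of parameter levels."""
    names = list(levels)
    return [dict(zip(names, combo)) for combo in product(*levels.values())]


combos = parameter_grid(levels)
print(len(combos))
```

Each dict in `combos` would then drive one simulation run.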
Different numbers of haplotypes
Similar sequences
Can reconstructed haplotypes still capture some phylogenetic information?
When the number of haplotypes differs, it is impossible to recover the true phylogenetic tree.
Approach 1: assuming the true number of haplotypes in the sample is known, select the most similar reconstructed sequences to build the phylogenetic tree, then calculate the symmetric difference.
Approach 2: cluster (k-means) the reconstructed haplotypes into n groups, build a tree from the consensus sequence of each group, then calculate tree balance statistics.
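Two of the steps above, picking the most similar reconstructed sequence and collapsing a group to a consensus, can be sketched as follows. This assumes aligned sequences of equal length and uses Hamming distance as the similarity measure, which is an assumption and not necessarily the measure used in the study; the k-means step itself is not shown, and all function names and example sequences are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Count mismatching positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def most_similar(true_haps, reconstructed):
    """For each true haplotype, pick the closest reconstructed haplotype."""
    return [min(reconstructed, key=lambda r: hamming(t, r)) for t in true_haps]


def consensus(seqs):
    """Majority base per column; ties go to the base seen first in the column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


# Toy data, invented for illustration only.
true_haps = ["ACGTAC", "ACGTTT"]
reconstructed = ["ACGTAC", "ACGAAC", "ACGTTA", "TCGATT"]

picked = most_similar(true_haps, reconstructed)
group_consensus = consensus(["ACGT", "ACGA", "ACGT"])
print(picked, group_consensus)
```

Note that this simple selection allows two true haplotypes to pick the same reconstructed sequence; a one-to-one matching would need an extra step.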
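Method for tree comparison
Symmetric difference (for rooted, labeled trees): the number of clades found in only one of the two trees. Example: tree 1 has tips A, B, C with clades (BC) and (ABC); tree 2 has tips B, A, C with clades (AC) and (ABC); symmetric difference = 2, because (BC) and (AC) each appear in only one tree.
Tree balance statistics (for rooted, unlabeled trees): N_i is the number of internal nodes between tip i and the root, and N̄ is the mean of N_i over all tips. Example for a five-tip tree: for i = A, N_A = 2, and N̄ = (sum of N_i over the five tips)/5 = 2.4.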
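Both statistics can be computed from a simple nested-tuple representation of a rooted tree. The sketch below is a minimal illustration, not the analysis code used for these results. The three-tip trees match the example above; the five-tip tree is a hypothetical shape chosen only because it reproduces N_A = 2 and N̄ = 2.4, and the root is assumed to count as an internal node.

```python
def tips(tree):
    """Return the set of tip labels under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(tips(child) for child in tree))


def clades(tree):
    """Return one tip set per internal node, root included."""
    if isinstance(tree, str):
        return set()
    found = {tips(tree)}
    for child in tree:
        found |= clades(child)
    return found


def symmetric_difference(t1, t2):
    """Number of clades present in exactly one of the two rooted trees."""
    return len(clades(t1) ^ clades(t2))


def tip_depths(tree, depth=0):
    """Map each tip to the number of internal nodes on its path to the root."""
    if isinstance(tree, str):
        return {tree: depth}
    depths = {}
    for child in tree:
        depths.update(tip_depths(child, depth + 1))
    return depths


def mean_depth(tree):
    """N-bar: the mean of N_i over all tips."""
    depths = tip_depths(tree)
    return sum(depths.values()) / len(depths)


tree1 = ("A", ("B", "C"))                  # clades (BC), (ABC)
tree2 = ("B", ("A", "C"))                  # clades (AC), (ABC)
five_tip = (("A", "B"), ("C", ("D", "E")))  # hypothetical five-tip shape

sd = symmetric_difference(tree1, tree2)
n_a = tip_depths(five_tip)["A"]
n_bar = mean_depth(five_tip)
print(sd, n_a, n_bar)
```

Because the symmetric difference compares clades as label sets, it needs labeled tips, whereas N̄ depends only on tree shape and so also applies to unlabeled trees.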
Different topology of the most-similar-sequence tree
Different balance statistics of the k-means cluster tree
Table columns: n; N̄ (org, rec, P); I_c (org, rec, P).
Summary & future work
Reconstruction typically failed to estimate the correct number of haplotypes; consequently, it was not possible to recover the true phylogenetic trees.
Even assuming the true number of haplotypes is known, the chance of recovering the true tree topology is still small.
Future work: other reconstruction methods, using multiple reference sequences when mapping…
References
Anderson, C.N.K., Ramakrishnan, U. et al. Serial SimCoal: A population genetic model for data from multiple populations and points in time. Bioinformatics 21.
Johnson, P.L., Slatkin, M. Inference of population genetic parameters in metagenomics: a clean look at messy data. Genome Res 16.
Richter, D.C., Ott, F. et al. MetaSim: A Sequencing Simulator for Genomics and Metagenomics. PLoS ONE 3.
Suzuki, S., Ono, N., Furusawa, C., Ying, B.-W., Yomo, T. Comparison of Sequence Reads Obtained from Three Next-Generation Sequencing Platforms. PLoS ONE 6, e
Zagordi, O., Bhattacharya, A. et al. ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data. BMC Bioinformatics 12, 119.
David, M., Dzamba, M. et al. SHRiMP2: Sensitive yet Practical Short Read Mapping. Bioinformatics 27, 7.
Ning, Z., Cox, A.J., Mullikin, J.C. SSAHA: a fast search method for large DNA databases. Genome Research.