Population sequencing using short reads: HIV as a case study
Vladimir Jojic et al., PSB 13 (2008)
Presenter: Yong Li
Overview
Background
  – Population sequencing & metagenomics
  – Pyrosequencing & classical sequencing
The problem and the challenge
  – Low concentration; short reads; sequencing errors
The model
  – Sequence & frequency reads
The EM algorithm
Validation
Background
Population sequencing & metagenomics
  – Multiple strains vs. multiple species
  – HIV drug resistance from rare variants
Pyrosequencing & classical (chromatogram-based) sequencing
  – Ultra-deep sequencing, 454 sequencing
  – Short reads; high error rate; homopolymers
  – Sensitivity: 0.1% (pyrosequencing) vs. 20% (classical)
To clone or not to clone?
  – Two protocols to detect mutational variants
  – Cloning bias; stoichiometry
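To make the sensitivity figures concrete, the sketch below computes how likely a rare variant is to show up in at least k reads at a given coverage. It assumes each read samples a strain independently (binomial sampling) and ignores sequencing errors; the function name and the threshold k are hypothetical and not taken from the paper.

```python
from math import comb

def prob_at_least(k, depth, freq):
    """Probability that a variant at relative abundance `freq` appears
    in at least `k` of `depth` independent reads (binomial model)."""
    # Sum the probabilities of seeing the variant fewer than k times,
    # then take the complement.
    below = sum(
        comb(depth, i) * freq**i * (1 - freq) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - below

# A variant at 0.1% abundance, 1000-fold coverage, seen at least once.
p_seen_once = prob_at_least(1, 1000, 0.001)
# Requiring two supporting reads (a hypothetical guard against errors)
# lowers that probability.
p_seen_twice = prob_at_least(2, 1000, 0.001)
```

Under these assumptions, 1000-fold coverage gives roughly a 63% chance of seeing a 0.1% variant at all, which is why detecting such variants depends on ultra-deep sequencing.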
Clonal amplification. Source: Wang et al., Genome Res. 17, 2007
The computational problem
Given:
  – 454 sequencing reads
Find:
  – The population sequences (epitome), reconstructed from the reads
  – The relative quantity of each sequence
Statistical model
The statistical model (1) Indel frequency Sequencing error parameter
The statistical model (2)
Hidden variables:
Model parameters:
Observed variables: reads x_t, t = 1…T
EM algorithm?
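As an illustration of how EM applies here, the sketch below handles a much reduced version of the problem. It assumes the strain sequences are already known, every read comes with a known start position, and the only errors are substitutions at a fixed rate. Only the strain frequencies are estimated. The full model described in the talk also infers the sequences and handles indels, so this is not the authors' algorithm; the function name and parameters are hypothetical.

```python
import math

def em_strain_frequencies(reads, strains, p_err=0.01, n_iter=50):
    """Estimate strain frequencies by EM.

    reads   -- list of (start, sequence) pairs, start is 0-based
    strains -- list of equal-length strain sequences (assumed known)
    p_err   -- assumed per-base substitution error rate
    """
    n_strains = len(strains)
    pi = [1.0 / n_strains] * n_strains  # uniform initial frequencies

    # Log-likelihood of each read under each strain. These do not depend
    # on pi, so they are computed once.
    loglik = []
    for start, seq in reads:
        row = []
        for strain in strains:
            ref = strain[start:start + len(seq)]
            mismatches = sum(a != b for a, b in zip(seq, ref))
            matches = len(seq) - mismatches
            row.append(mismatches * math.log(p_err / 3)
                       + matches * math.log(1 - p_err))
        loglik.append(row)

    for _ in range(n_iter):
        counts = [0.0] * n_strains
        for row in loglik:
            # E-step: posterior probability that each strain produced
            # this read. Subtracting the row maximum avoids underflow.
            top = max(row)
            weights = [pi[s] * math.exp(row[s] - top) for s in range(n_strains)]
            total = sum(weights)
            for s in range(n_strains):
                counts[s] += weights[s] / total
        # M-step: frequencies are the normalized expected read counts.
        pi = [c / len(reads) for c in counts]
    return pi

# Toy example: two strains that differ at two positions, nine reads from
# the first strain and one from the second.
toy_strains = ["ACGTACGTAC", "ACGTTCGAAC"]
toy_reads = [(2, "GTACGTA")] * 9 + [(2, "GTTCGAA")]
toy_freqs = em_strain_frequencies(toy_reads, toy_strains)
```

In this reduced setting the hidden variable is the strain each read came from, the model parameters are the strain frequencies, and the observed variables are the reads.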
Computational tricks
One tau
Clustering of reads
Initialization
Determining the number of strains, S
  – Trials
Validation
Data is partially simulated
  – e is composed of real HIV variants
  – Artificial values for
  – x generated from the probabilistic model itself, with 1% substitution, 2% insertion, 0.5% deletion
Two datasets
  – 1. Varied strain frequencies and coverage
  – 2. Varied mutation density
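A minimal sketch of how reads like these could be simulated, using the error rates quoted above. It assumes errors are applied independently at each base, strains are picked according to their frequencies, and read start positions are uniform. The function names, the read length, and the toy strains are hypothetical; this is not the generator used in the paper.

```python
import random

BASES = "ACGT"

def corrupt(fragment, rng, p_sub=0.01, p_ins=0.02, p_del=0.005):
    """Apply independent per-base insertion, deletion and substitution errors."""
    out = []
    for base in fragment:
        if rng.random() < p_ins:
            # Spurious base inserted before the current one.
            out.append(rng.choice(BASES))
        r = rng.random()
        if r < p_del:
            # Current base is dropped.
            continue
        if r < p_del + p_sub:
            # Current base is replaced by a different one.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate_reads(strains, freqs, n_reads, read_len, seed=0):
    """Return a list of (strain_index, noisy_read) pairs."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        # Pick a strain in proportion to its frequency.
        s = rng.choices(range(len(strains)), weights=freqs)[0]
        # Pick a uniform start position that leaves room for a full read.
        start = rng.randrange(len(strains[s]) - read_len + 1)
        fragment = strains[s][start:start + read_len]
        reads.append((s, corrupt(fragment, rng)))
    return reads

# Toy usage: two made-up strains at 90% and 10% abundance.
toy_reads = simulate_reads(["ACGTACGTACGT", "TTTTCCCCGGGG"], [0.9, 0.1],
                           n_reads=20, read_len=5, seed=3)
```

Keeping the true strain index with each read makes it possible to score a reconstruction against the known answer, which is the point of using partially simulated data.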
Discussion
High sensitivity compared with the chromatogram-based approach
  – 0.1% relative abundance
May be applied to metagenomic sequencing
Needs validation using real data
Needs comparison with other methods
Questions?