Alexander Zelikovsky Computer Science Department

Reconstruction of HCV Quasispecies Spectrum from 454 Life Science Reads
Alexander Zelikovsky Computer Science Department Georgia State University

Outline
HCV Quasispecies
454 Life Sciences Pyrosequencing System
Quasispecies Spectrum Reconstruction Problem
Solution Method
Experimental Results

HCV Quasispecies
Many viruses (e.g., HIV, HCV) encode their genome in RNA rather than DNA.
RNA viruses are unable to detect and repair mistakes during replication because their polymerases lack the proofreading activity of DNA polymerases.
Mutations are passed down to descendants, producing a family of related variants of the ancestral genome, referred to as a quasispecies.
Which types of mutations will result in virulent quasispecies?

454 Life Sciences Pyrosequencing
Pyrosequencing = Sequencing by Synthesis.
GS FLX Titanium:
Divides the source genetic material into reads ( bp)
Sequences the reads
Assembles the reads into the original genome via software
The software was originally designed to sequence a single organism.
We need software that assembles reads into multiple genomes!

Quasispecies Spectrum Reconstruction Problem
Given a collection of 454 reads taken from a quasispecies population of unknown size and distribution, reconstruct the quasispecies spectrum:
quasispecies sequences
quasispecies frequencies

Algorithmic Flow

Solution Overview
Align reads to the consensus sequence.
Build a read graph representing the space of all possible read assemblies in the problem instance.
Find candidate quasispecies corresponding to paths in the read graph.
Estimate candidate frequencies by maximum likelihood using EM.

Alignment
Every position of each sequence in the population is covered by multiple reads.
A consensus (or reference) sequence is available.
Alignment to the consensus with minimum score:
is unique
minimizes Hamming distances
penalizes indels more than mismatches

454 Sequencing Error Detection
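454 pyrosequencing error rate = 0.04% of bp.
Each signal intensity is rounded to the nearest integer to give the number of copies of that nucleotide: T:1.1, A:0.1, C:0.9, G:0.1, T:1.6, A:0.0, C:0.4, G:1.0 => TCTTG.
An error occurs when the signal intensity differs from the true value by more than 0.5.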
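The rounding rule on this slide can be sketched in a few lines. This is only an illustration of the example above, not the 454 base caller: the function name `decode_flowgram` and the (nucleotide, intensity) input format are assumptions made for this sketch.

```python
import math

def decode_flowgram(flows):
    """Turn (nucleotide, intensity) pairs into a base sequence.

    Assumption for this sketch: each intensity is rounded to the nearest
    integer, and that integer is the number of copies of the nucleotide.
    """
    bases = []
    for nucleotide, intensity in flows:
        run_length = math.floor(intensity + 0.5)  # round half up
        bases.append(nucleotide * run_length)
    return "".join(bases)

# The flow values from the slide.
flows = [("T", 1.1), ("A", 0.1), ("C", 0.9), ("G", 0.1),
         ("T", 1.6), ("A", 0.0), ("C", 0.4), ("G", 1.0)]
print(decode_flowgram(flows))  # TCTTG
```

A signal that is off by more than 0.5 rounds to the wrong run length, which is why homopolymer runs are where insertions and deletions appear.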

Processing of 454 reads
Deletions in reads: D
Reference: …AGGGTGCGAAG…
Read: …AGGGT-CGAAG… => …AGGGTDCGAAG…
Insertions into the reference: ignore them or expand the reference.
Imputation of missing values
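A minimal sketch of this read processing, assuming the read and the reference are given as two gapped strings of equal length from a pairwise alignment. The function name `normalize_aligned_read` is hypothetical, and only the "ignore insertions" option is shown.

```python
def normalize_aligned_read(aligned_ref, aligned_read):
    """Project an aligned read onto reference coordinates.

    Deletions in the read ('-' in the read) become 'D'.
    Insertions relative to the reference ('-' in the reference) are ignored.
    """
    out = []
    for ref_base, read_base in zip(aligned_ref, aligned_read):
        if ref_base == "-":
            continue  # inserted base in the read: skip it
        out.append("D" if read_base == "-" else read_base)
    return "".join(out)

print(normalize_aligned_read("AGGGTGCGAAG", "AGGGT-CGAAG"))  # AGGGTDCGAAG
```

After this step every read has one symbol per reference position, so reads can be compared by Hamming distance.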

Imputation of missing values
Allele X in position n and allele Y in position m are a “tag box” for m if the correlation r²(Xn, Ym) ≥ 0.8.
For each missing value, choose the allele value based on the top associated “tag boxes”.
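The tag-box test can be sketched as a squared correlation between two indicator variables. This assumes reads are equal-length strings in reference coordinates with N for a missing value; the function name `allele_r2` and this input format are assumptions of the sketch, and the imputation step itself is not shown.

```python
def allele_r2(reads, n, x, m, y):
    """Squared correlation between 'allele x at position n' and
    'allele y at position m', over reads with both positions known."""
    pairs = [(r[n] == x, r[m] == y)
             for r in reads if r[n] != "N" and r[m] != "N"]
    total = len(pairs)
    if total == 0:
        return 0.0
    p_x = sum(a for a, _ in pairs) / total
    p_y = sum(b for _, b in pairs) / total
    p_xy = sum(a and b for a, b in pairs) / total
    denom = p_x * (1 - p_x) * p_y * (1 - p_y)
    if denom == 0:
        return 0.0  # one of the alleles never varies: no information
    return (p_xy - p_x * p_y) ** 2 / denom

reads = ["AC", "AC", "GT", "GT"]
is_tag_box = allele_r2(reads, 0, "A", 1, "C") >= 0.8
```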

Read Graph: Vertices
Subread = completely contained in some read with ≤ n mismatches.
Superread = not a subread => the vertex in the read graph.
The frequency of the superread is the number of its subreads plus one.
454 reads with 1 mismatch: 9161 (27764)
454 reads with 6 mismatches: 1942 (27764)
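One way to sketch the subread/superread split is a greedy pass from the longest read to the shortest. This is a simplification chosen for illustration, not necessarily the procedure used for the numbers above: reads are assumed to be (start, sequence) pairs in reference coordinates, and each subread is counted toward the first superread that contains it.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def build_superreads(reads, max_mismatches=0):
    """Greedy sketch: returns a list of (start, sequence, frequency)."""
    superreads = []  # each entry is [start, sequence, frequency]
    # Longest reads first, so a container is always seen before its subreads.
    for start, seq in sorted(reads, key=lambda r: -len(r[1])):
        for entry in superreads:
            s_start, s_seq, _ = entry
            offset = start - s_start
            contained = offset >= 0 and offset + len(seq) <= len(s_seq)
            if contained and hamming(
                    seq, s_seq[offset:offset + len(seq)]) <= max_mismatches:
                entry[2] += 1  # count this read as a subread
                break
        else:
            superreads.append([start, seq, 1])  # no container: new superread
    return [tuple(e) for e in superreads]

reads = [(0, "ACGTAC"), (1, "CGTA"), (2, "GTAC"), (2, "GTTCAA")]
print(build_superreads(reads))
```

Raising `max_mismatches` merges more reads into fewer superreads, which matches the direction of the counts on the slide.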

Read Graph: Edges
Edge between two vertices if there is an overlap between the superreads and they agree on their overlap with ≤ m mismatches.
Auxiliary vertices: source and sink
Transitive reduction
If no mismatches => any source-sink path is a distinct candidate sequence
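The edge rule and the transitive reduction can be sketched as follows. Superreads are assumed to be (start, sequence) pairs in reference coordinates; the function names are hypothetical, and the auxiliary source and sink vertices are left out to keep the sketch short.

```python
def overlap_edges(superreads, max_mismatches=0):
    """Return index pairs (u, v) where v starts inside u, extends past it,
    and the two agree on the overlap within max_mismatches."""
    edges = set()
    for u, (su, qu) in enumerate(superreads):
        for v, (sv, qv) in enumerate(superreads):
            eu, ev = su + len(qu), sv + len(qv)
            if su < sv < eu < ev:
                overlap_u = qu[sv - su:]
                overlap_v = qv[:eu - sv]
                mismatches = sum(a != b for a, b in zip(overlap_u, overlap_v))
                if mismatches <= max_mismatches:
                    edges.add((u, v))
    return edges

def transitive_reduction(edges):
    """Drop every edge (u, w) that is implied by a longer path u -> ... -> w."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)

    def reachable(src, dst):
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(succ.get(node, ()))
        return False

    return {(u, w) for u, w in edges
            if not any(v != w and reachable(v, w) for v in succ[u])}

superreads = [(0, "ACGTAC"), (2, "GTACGG"), (4, "ACGGTT")]
edges = transitive_reduction(overlap_edges(superreads))
```

In this example the edge from the first to the third superread is removed because the path through the second superread already implies it.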

Edge Cost
Choose the most probable source-sink path through each vertex.
Cost measures the uncertainty that two superreads belong to the same quasispecies.
Overhang is the difference in start positions of two overlapping superreads.

Candidate Path Selection
The Shortest Path per vertex
No correct candidate quasispecies are lost with increasing steepness!
The Max Bandwidth Path per vertex (k=0)
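A max bandwidth (widest) path maximizes the smallest edge value along the path. The sketch below finds such a source-sink path through a chosen vertex in a DAG. The edge "width" values are hypothetical inputs, since the transcript does not preserve how they are computed, and the function names are assumptions of this sketch.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def widest_paths_from(source, edges):
    """edges: {(u, v): width}. Returns (best width, parent) maps from source."""
    preds = {}
    for u, v in edges:
        preds.setdefault(v, set()).add(u)
        preds.setdefault(u, set())
    width, parent = {source: float("inf")}, {source: None}
    # static_order() yields every vertex after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        for u in preds[v]:
            if u in width:
                cand = min(width[u], edges[(u, v)])  # bottleneck via u
                if cand > width.get(v, float("-inf")):
                    width[v], parent[v] = cand, u
    return width, parent

def trace(parent, node):
    """Follow parent links from node back to the start vertex."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def widest_path_through(vertex, source, sink, edges):
    _, fwd = widest_paths_from(source, edges)
    reversed_edges = {(v, u): w for (u, v), w in edges.items()}
    _, bwd = widest_paths_from(sink, reversed_edges)
    head = trace(fwd, vertex)[::-1]  # source ... vertex
    tail = trace(bwd, vertex)[1:]    # vertex's successors ... sink
    return head + tail

edges = {("s", "a"): 5, ("s", "b"): 3, ("a", "c"): 2,
         ("b", "c"): 4, ("c", "t"): 6, ("a", "t"): 1}
print(widest_path_through("c", "s", "t", edges))  # ['s', 'b', 'c', 't']
```

Running this once per vertex gives one candidate path per vertex; identical paths can then be merged.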

Path to Sequence
Each path corresponds to a candidate sequence.
1. Build a coarse sequence out of the path’s superreads. For each position: the >70% majority base if it exists, otherwise N.
2. Replace the coarse sequence with the weighted consensus obtained on all reads. For read r of length l and sequence s of length L, the weight is the probability that r belongs to s with k mismatches, where t is the number of allowed mutations per sequence.
3. Select unique sequences out of the constructed sequences.
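The coarse-sequence step can be sketched as a per-position majority vote. The input format (one list of observed bases per position) and the function name `coarse_sequence` are assumptions of this sketch, and every observation is weighted equally here, which may differ from the original method.

```python
from collections import Counter

def coarse_sequence(columns, threshold=0.7):
    """columns[i] is the list of bases that the path's superreads place at
    position i. A base is kept only if its share exceeds the threshold."""
    out = []
    for bases in columns:
        if not bases:
            out.append("N")  # no coverage at this position
            continue
        base, count = Counter(bases).most_common(1)[0]
        out.append(base if count / len(bases) > threshold else "N")
    return "".join(out)

columns = [list("AAAA"), list("AACC"), list("GGGT"), []]
print(coarse_sequence(columns))  # ANGN
```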

Path to Sequence - mismatches
No mismatches in superreads and overlaps => all sequences are unique In data provided by P. Balfe, there are 156 unique sequences out of 940 paths. Repetition of the same sequence = implicit evidence that the sequence is among quasispecies.

Expectation Maximization
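Bipartite graph:
Q_q is a candidate with frequency f_q
R_r is a read with observed frequency o_r
The edge connecting read r with candidate q has weight h_{q,r} = probability that read r is produced by quasispecies q with k mismatches
E step: compute the expected share of each read attributed to each candidate, given the current frequencies.
M step: re-estimate each candidate frequency from those expected shares.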
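The update formulas are not preserved in this transcript, so the sketch below uses the standard EM updates for a mixture with known weights, which is an assumption: the E step sets p(q,r) = f_q h(q,r) / Σ f_q' h(q',r), and the M step sets f_q proportional to Σ_r p(q,r) o_r. The function name `em_frequencies` and the list-of-lists layout of `h` are hypothetical.

```python
def em_frequencies(h, observed, iterations=100):
    """h[q][r]: probability that read r is produced by candidate q.
    observed[r]: observed frequency (count) of read r.
    Returns the estimated candidate frequencies (they sum to 1)."""
    n_q, n_r = len(h), len(observed)
    f = [1.0 / n_q] * n_q  # start from uniform frequencies
    for _ in range(iterations):
        new_f = [0.0] * n_q
        for r in range(n_r):
            norm = sum(f[q] * h[q][r] for q in range(n_q))
            if norm == 0:
                continue  # read not explained by any candidate
            for q in range(n_q):
                p = f[q] * h[q][r] / norm    # E step: share of read r for q
                new_f[q] += p * observed[r]  # accumulate expected counts
        total = sum(new_f)
        f = [x / total for x in new_f]       # M step: renormalize
    return f

# Two candidates, three reads; the third read fits both candidates equally.
h = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]
observed = [30, 10, 20]
print(em_frequencies(h, observed))
```

In this toy example the ambiguous read is split according to the evidence from the unambiguous ones, and the estimate settles near 0.75 and 0.25.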

Simulations
44 real quasispecies sequences (1739 bp long) from the E1E2 region of Hepatitis C virus (von Hahn et al. (2006))
The quasispecies population:
4 population sizes: 10, 20, 30, 40 sequences.
3 population distributions: geometric, skewed normal, uniform.
Simulated reads:
Number of reads is { 20K, 40K, 60K, 80K, 100K }
The read length distribution is N(μ, 400); μ is varied from 200 to 500.

Results for Simulation
PPV = TP/(TP+FP) and Sensitivity = TP/(TP+FN)
PPV and sensitivity for shortest path and max bandwidth path per vertex candidate selection on a geometric population of 10 quasispecies with 60K reads.
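The two measures can be computed directly from the sets of reconstructed and true sequences. Counting a reconstructed sequence as a true positive only on an exact match is an assumption of this sketch, as is the function name.

```python
def ppv_and_sensitivity(predicted, true):
    """Return (PPV, sensitivity) for sets of sequences, using exact matches."""
    predicted, true = set(predicted), set(true)
    tp = len(predicted & true)   # reconstructed and real
    fp = len(predicted - true)   # reconstructed but not real
    fn = len(true - predicted)   # real but missed
    ppv = tp / (tp + fp) if predicted else 0.0
    sensitivity = tp / (tp + fn) if true else 0.0
    return ppv, sensitivity

ppv, sens = ppv_and_sensitivity({"AAA", "AAC", "AAG"},
                                {"AAA", "AAC", "ATT", "GGG"})
```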

More Experimental Results
Sensitivity for shortest path/max bandwidth path on geometric and uniform populations of 10 quasispecies with 100K and 1M reads.

Experimental Results
Relative entropy (RE) vs. number of quasispecies (60K reads, geometric), number of reads (10 qsp under geometric distribution), and population type (60K reads, 10 qsp).
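The relative entropy formula is not preserved in this transcript. A minimal sketch, assuming the usual Kullback-Leibler form RE(P || Q) = Σ p_i ln(p_i / q_i), with P the true frequencies and Q the estimated ones over the same ordered list of sequences (direction and log base are assumptions):

```python
import math

def relative_entropy(p, q):
    """Kullback-Leibler divergence of q from p, in nats.

    Terms with p_i == 0 contribute nothing. The sketch assumes q_i > 0
    wherever p_i > 0; otherwise the divergence is infinite and this
    function raises ZeroDivisionError.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

true_freqs = [0.5, 0.5]
estimated = [0.25, 0.75]
print(relative_entropy(true_freqs, estimated))
```

A value of 0 means the estimated frequencies match the true ones exactly; larger values mean a worse frequency estimate.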

454-READ HCV Data (Courtesy P. Balfe)
30927 reads from a 5.2 Kbp long region from the E1E2 region of Hepatitis C virus (Peter Balfe)
Segemehl software: 27764 reads
45230 deletions in the reads (approximately reads has at least one deletion)
2416 N’s in total (2033 reads with at least one N)
70641 insertions in the reads ( reads has at least one insertion)

HCV Data: Real reads

Experimental Results

Experimental Results
We remove all insertions (vs. the consensus) in the reads.
We allow up to 6 mismatches to cluster reads and up to 15 mismatches in overlaps.
Out of 940 candidate max-bandwidth paths per vertex, there are only 156 distinct sequence candidates (that is, differing in at least one bp).
If we allow up to 6 mismatches, then there are only 58 distinct candidates.

NJ Tree for 12 most Frequent quasispecies

Distance Between Top 12 Qsp for Different Combinations of Mismatches

Statistical Validation
Columns: rank; Frequency; % of Instance for k=0, k=1, k=2, k=10; Average Frequency for k=0, k=1, k=2, k=10
1 0.0888 0.44 0.52 0.64 0.93 0.0338 0.0375 0.0433 0.0618
2 0.0266 0.34 0.46 0.59 0.69 0.0092 0.0133 0.0171 0.0199
3 0.0229 0.45 0.79 0.91 0.96 0.0093 0.0215 0.0239 0.0257
4 0.0228 0.39 0.43 0.87 0.0101 0.0112 0.0119 0.0231
5 0.0173 0.74 0.78 0.85 0.88 0.0189 0.02 0.021 0.0217
6 0.56 0.83 0.0145 0.0151 0.0203 0.0216
7 0.0167 0.68 0.72 0.75 0.97 0.0174 0.0185 0.0197
8 0.53 0.0136 0.0178 0.023
9 0.48 0.57 0.98 0.0132 0.014 0.0159 0.0241
10 0.51 0.66 0.82 0.0149 0.0177 0.0214 0.0246
Number of times and average frequency of repeating the 10 most frequent sequences with the set of reads reduced by 10%.

454-Error Correction
Trouble with stop codons: all constructed quasispecies sequences had them.
(Manual) Resolution method for quasispecies sequences (qsp):
Find the frame with the start codon and the motif MSTNP … for the qsp and the reference
Find the first stop-codon position in the qsp
Align the amino-acid translations of the qsp and the reference
In the alignment, go left from the stop codon until the alignment is correct, then find the first nucleotide monomer to the left
Try to extend and to reduce the monomer by one base and choose the variant that matches the reference
Resolution method for quasispecies sequences (qsp):
Do the same, but with reads with extended or reduced monomers

Dependence on mutation rate

Acknowledgements
PhD students (GSU, CS Department)
Irina Astrovskaya (graduating Fall 2010)
Kelly Westbrooks (graduated Spring 2009), now at Life Technologies, working on SOLID
Bassam Tork
Serghei Mangul
Ion Mandoiu (UConn, CS Department)

Thank you for your attention!
