VARiD: A Variation Detection Framework for Color-space and Letter-space platforms
By A.V. Dalca, S.M. Rumble, S. Levy, M. Brudno
Presented by Velian Pandeliev
VARiD Overview
- Purpose: Variation detection (SNPs, indels)
- Pitch: First to use both colour-space and letter-space data
- Principle: Hidden Markov Model with the Forward-Backward algorithm
- Platform: 454/Roche, Solexa, ABI SOLiD
- Pros: Can work with unconverted sets of both formats simultaneously
- Performance: Linear in the length of the reference, great on mixed-format data
ABI SOLiD Basics
- Reads bases two at a time
- Outputs one of four colours for each pair of adjacent bases, based on a transition state machine
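The colour for a pair of bases can be sketched with the standard SOLiD dibase rule: give each base a 2-bit code and XOR the codes of neighbouring bases. This agrees with the examples later in the slides (CA gives colour 1, AG gives colour 2). The function name `encode` is illustrative only and is not taken from VARiD.

```python
# 2-bit codes; the colour of a dinucleotide is the XOR of its two codes.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(seq):
    """Return the colours for each pair of consecutive bases in seq."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

colours = encode("ACGT")  # one colour per adjacent pair: AC, CG, GT
```

A sequence of n bases yields n - 1 colours, so translating back to letters needs one known starting base.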
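ABI SOLiD Properties
Read errors and SNPs present differently: a read error changes a single colour, whereas a SNP changes two adjacent colours.
[Figure: colour sequences for the reference, a read error, and a SNP]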
ABI SOLiD Properties A read error propagates through the rest of the sequence on translation to letter-space
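A small sketch of this behaviour, assuming the standard XOR-based dibase encoding. The reference string, the error position and the helper names are made up for illustration.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def encode(seq):
    """Colours for each pair of consecutive bases."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def decode(first_base, colours):
    """Translate colours to letters, starting from a known first base."""
    seq = [first_base]
    for c in colours:
        seq.append(BASE[CODE[seq[-1]] ^ c])
    return "".join(seq)

reference = "ACGTTGCA"  # illustrative sequence
ref_colours = encode(reference)

# Read error: a single colour is misread (index 2 here).
err_colours = list(ref_colours)
err_colours[2] ^= 1
error_read = decode(reference[0], err_colours)

# SNP: one donor base differs, which changes two adjacent colours.
snp_seq = reference[:3] + "A" + reference[4:]
snp_colours = encode(snp_seq)
n_changed = sum(a != b for a, b in zip(ref_colours, snp_colours))
```

After the single misread colour, every translated base in `error_read` differs from the reference, while the SNP changes exactly two colours and still decodes to a sequence with one substituted base.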
Consequences
- Colour-space encoding is better suited to calling SNPs than letter-space encoding
- In letter-space data, errors do not propagate through to the rest of the read
- Wouldn’t it be great to have a SNP calling framework that could use both kinds of data!?
VARiD
A Hidden Markov Model for Variation Detection
In general, HMMs have the following elements:
- States (hidden)
- Transitions (probabilities of reaching any particular state from the previous one)
- Emissions (observed outputs)
Building a Basic HMM
States: pairs of consecutive letter-space positions:
S = {AA, AT, AC, AG, TT, TA, TC, TG, CC, CA, CT, CG, GG, GA, GT, GC}
Building a Basic HMM
Transitions: since consecutive states share a nucleotide, probabilities are defined as follows:
$$P(\text{transition } WX \to YZ) = \begin{cases} \mathrm{frequency}(Z) & \text{if } X = Y \\ 0 & \text{if } X \neq Y \end{cases}$$
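The overlap rule can be sketched as a 16 x 16 matrix. The uniform base frequencies below are an assumption for illustration; the slides do not say which frequencies VARiD uses.

```python
from itertools import product

BASES = "ACGT"
STATES = ["".join(p) for p in product(BASES, repeat=2)]  # 16 dinucleotide states

def transition_matrix(freq):
    """T[i][j] = P(STATES[i] -> STATES[j]): frequency(Z) if X == Y, else 0."""
    return [[freq[yz[1]] if wx[1] == yz[0] else 0.0 for yz in STATES]
            for wx in STATES]

uniform = {b: 0.25 for b in BASES}  # assumed background frequencies
T = transition_matrix(uniform)
```

Each state has only four reachable successors, so every row sums to 1 whenever the frequencies do.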
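Building a Basic HMM
Emissions: a letter and a colour from donor reads at each state. E.g., for colour space:
$$P(\text{emission} = c \mid \text{state} = CA) = q(c \mid CA) = \begin{cases} 1 - 3\varepsilon & \text{if } c = 1 \\ \varepsilon & \text{if } c \in \{0, 2, 3\} \end{cases}$$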
Building a Basic HMM
Emissions: a letter and a colour from donor reads at each state. E.g., for letter space:
$$P(\text{emission} = n \mid \text{state} = CA) = q(n \mid CA) = \begin{cases} 1 - 3\xi & \text{if } n = A \\ \xi & \text{if } n \in \{C, G, T\} \end{cases}$$
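Building a Basic HMM
Emission probabilities from all reads: if E consists of colours $c_1, \dots, c_k$ and letters $n_1, \dots, n_m$, and reads are treated as independent given the state,
$$P(\text{emissions} = E \mid \text{state} = s) = \prod_{i=1}^{k} q(c_i \mid s) \prod_{j=1}^{m} q(n_j \mid s)$$
which combines colour-space and letter-space data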
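One plausible way to combine the two kinds of evidence is to treat every read as independent given the state and multiply the per-read terms q(c|s) and q(n|s). That independence assumption, the default error rates and the function names below are illustrative rather than taken from the paper. Logs are summed to avoid underflow at high coverage.

```python
import math

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def q_colour(c, state, eps):
    """q(c|state): 1 - 3*eps for the colour the state implies, eps otherwise."""
    expected = CODE[state[0]] ^ CODE[state[1]]
    return 1 - 3 * eps if c == expected else eps

def q_letter(n, state, xi):
    """q(n|state): 1 - 3*xi for the state's second letter, xi otherwise."""
    return 1 - 3 * xi if n == state[1] else xi

def log_emission(colours, letters, state, eps=0.01, xi=0.01):
    """Log-probability of all observations at one position (reads assumed independent)."""
    total = sum(math.log(q_colour(c, state, eps)) for c in colours)
    total += sum(math.log(q_letter(n, state, xi)) for n in letters)
    return total
```

For example, `log_emission([1, 1], ["A"], "CA")` scores two colour reads and one letter read against state CA in a single number.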
Building a Basic HMM Detecting variation is accomplished through finding the maximum likelihood state for each position in the genotype (the donor) and comparing it against the reference nucleotide.
Building a Basic HMM
By running the Forward-Backward algorithm on the HMM, a probability distribution over the possible states is obtained at each position, and a base is called (in bold).
Source: Dalca, A. & Brudno, M. (Poster)
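A minimal sketch of Forward-Backward on the basic 16-state model follows. It is not VARiD's implementation: the uniform initial distribution, uniform base frequencies, fixed error rates of 0.05 and the `(colours, letters)` observation format are all assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"
CODE = {b: i for i, b in enumerate(BASES)}
STATES = ["".join(p) for p in product(BASES, repeat=2)]

def emission(obs, state, eps=0.05, xi=0.05):
    """Probability of the colours and letters observed at one position."""
    colours, letters = obs
    expected = CODE[state[0]] ^ CODE[state[1]]
    p = 1.0
    for c in colours:
        p *= 1 - 3 * eps if c == expected else eps
    for n in letters:
        p *= 1 - 3 * xi if n == state[1] else xi
    return p

def transition(prev, cur, freq=0.25):
    """Consecutive states must overlap by one nucleotide."""
    return freq if prev[1] == cur[0] else 0.0

def normalise(v):
    total = sum(v)
    return [x / total for x in v]

def forward_backward(observations):
    """Return, for each position, the posterior distribution over STATES."""
    n = len(observations)
    # Forward pass, rescaled at every step to avoid underflow.
    fwd = [normalise([emission(observations[0], s) / len(STATES) for s in STATES])]
    for t in range(1, n):
        fwd.append(normalise([
            emission(observations[t], s)
            * sum(fwd[-1][i] * transition(p, s) for i, p in enumerate(STATES))
            for s in STATES]))
    # Backward pass, also rescaled.
    bwd = [[1.0] * len(STATES) for _ in range(n)]
    for t in range(n - 2, -1, -1):
        bwd[t] = normalise([
            sum(transition(s, nx) * emission(observations[t + 1], nx) * bwd[t + 1][j]
                for j, nx in enumerate(STATES))
            for s in STATES])
    return [normalise([f * b for f, b in zip(fwd[t], bwd[t])]) for t in range(n)]

def call_bases(observations):
    """Call the second letter of the most probable state at each position."""
    posteriors = forward_backward(observations)
    return "".join(STATES[max(range(len(STATES)), key=p.__getitem__)][1]
                   for p in posteriors)

# Illustrative pile-up for donor bases C, G, T, A; one colour read is wrong (the 0).
observations = [([1, 1, 1], ["C"]), ([3, 3, 0], ["G"]),
                ([1, 1], ["T", "T"]), ([3, 3], ["A"])]
calls = call_bases(observations)
```

Comparing `calls` against the reference base at each position is then the variation-detection step described in the slides.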
Extensions The HMM described above is quite simple and only calls a single nucleotide for each position. VARiD extends the model to detect heterozygous SNPs, as well as to handle indels.
Microindels
To deal with microindels (<5 bp) in the sample, gap states are required. E.g. [A - - - G] (would emit colour 2)
- 4 dummy ‘gap’ nucleotides are defined, one for each of A, C, G, T, so that a gap remembers the last real nucleotide
- [A - - - G] = {(A, gap-A), (gap-A, gap-A), (gap-A, gap-A), (gap-A, G)}, which emits colour 2
Microindels
Requires 24 more states:
- (X, gap-X) × 4
- (gap-X, gap-X) × 4
- (gap-X, Y) × 16
- Total (incl. orig.): 40 states
Heterozygous SNPs
For diploid samples, each state has to account for heterozygous differences.
Each state in VARiD’s HMM is a unique combination of two of the original 40 states (obtained by S × S).
40² = 1600 states!
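The state counts can be checked by enumerating the states directly. The labels `open_gap`, `extend_gap` and `close_gap` are hypothetical names for the three kinds of gap state listed in the slides.

```python
from itertools import product

BASES = "ACGT"
GAPS = ["gap-" + b for b in BASES]  # one dummy gap nucleotide per base

original = list(product(BASES, repeat=2))       # (X, Y): 16 states
open_gap = [(x, "gap-" + x) for x in BASES]     # (X, gap-X): 4 states
extend_gap = [(g, g) for g in GAPS]             # (gap-X, gap-X): 4 states
close_gap = list(product(GAPS, BASES))          # (gap-X, Y): 16 states

haploid = original + open_gap + extend_gap + close_gap
diploid = list(product(haploid, repeat=2))      # S x S

print(len(haploid), len(diploid))
```

The 40-fold growth in state count per position is a constant factor, so the run time stays linear in the length of the reference.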
Features
- Keeps track of quality scores and positions within a read to adjust the HMM error rates (ε, ξ) for greater accuracy
- Post-processing ensures that all heterozygous SNP calls are supported by enough reads
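The slides do not say how quality scores are turned into error rates. One common approach, shown here purely as an assumption, uses the standard Phred definition and splits the total error evenly over the three wrong symbols, so that the correct symbol keeps probability 1 - 3ε as in the basic emission model. Any position-within-read adjustment is left out.

```python
def phred_to_error(q):
    """Standard Phred definition: probability that the call is wrong."""
    return 10 ** (-q / 10)

def per_symbol_eps(q):
    """Assumed mapping: split the total error evenly over the 3 wrong symbols."""
    return phred_to_error(q) / 3

eps_q20 = per_symbol_eps(20)  # error rate per wrong colour for a Q20 call
```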
Features Source: Original paper
Features
- The first T in a read is NOT part of the sequence, and so NOT part of the genotype!
- VARiD eliminates the linker remnant without having to translate the read fully
VALiDation
- 260 kb from the human genome
- Sequenced with ABI SOLiD and 454/Roche
- Reference obtained through Sanger reads
- Artificial datasets created with varying amounts of coverage
- Tested with various aligners in colour space alone (against Corona), in letter space alone (against gigaBayes), and with a combination of data
VALiDation
Measures:
- True Positives (correctly identified SNPs)
- False Positives (SNPs not in the Sanger set)
- Precision (TP as a fraction of all predictions)
- Recall (TP as a fraction of Sanger set SNPs)
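The four measures can be computed from two sets of SNP positions, as in this short sketch. The positions in the usage lines are invented for illustration and are not results from the paper.

```python
def precision_recall(predicted, truth):
    """Return (TP, FP, precision, recall) for predicted vs. validated SNP positions."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)   # predictions confirmed by the truth set
    fp = len(predicted - truth)   # predictions absent from the truth set
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return tp, fp, precision, recall

# Made-up positions, for illustration only.
tp, fp, precision, recall = precision_recall({100, 250, 400, 900}, {100, 250, 500})
```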
VALiDation
Colour space only
In colour space, VARiD had slightly higher precision than the Corona caller on AB-mapped reads, with comparable but slightly lower recall. Using VARiD with SHRiMP produced higher recall but lower precision than VARiD with the AB mapper. (No significance statistics were presented.)
VALiDation
Letter space only
In letter space, gigaBayes + mosaik performed better than VARiD (using the same mosaik mapper) at low coverage, but fell behind at higher coverage. VARiD + SHRiMP did better than VARiD + mosaik at both low and high coverage, and clearly outperformed gigaBayes at 20x coverage.
VALiDation
Mixed space
VARiD’s true strength lies in being able to combine colour- and letter-space reads and to perform better on them than on cost-equivalent letter-only or colour-only data.
Issues
- No statistical significance presented on performance improvement
- Experimental size relatively small (260 kb)
- Not ideal for low-coverage data
- Would be interesting to see how VARiD performs on more diverse data sets (more/fewer SNPs, indels, etc.)
- Any more?
The End.
References
- Dalca, A.V., Rumble, S.M., Levy, S., Brudno, M. VARiD: A Variation Detection Framework for Color-space and Letter-space platforms (in progress)
- Dalca, A.V. & Brudno, M. VARiD: Variation Detection in Color-space and Letter-space (poster)
- Hidden Markov model. (2010, February 2). In Wikipedia, The Free Encyclopedia. Retrieved 13:24, February 10, 2010, from model&oldid=
- Rumble, S.M., Lacroute, P., Dalca, A.V., Fiume, M., Sidow, A. and Brudno, M. (2009) SHRiMP: Accurate mapping of short color-space reads. PLoS Comput Biol.