Download presentation
Presentation is loading. Please wait.
Published byAusten Hubbard Modified over 9 years ago
1
Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data Gabriel Ilie 1, Alex Zelikovsky 2 and Ion Măndoiu 1 1 CSE Department, University of Connecticut 2 CS Department, Georgia State University
2
MassCLEAVE assay for MS-based nucleic acid sequence analysis
3
Signed relative errors assumed to follow a normal distribution with mean 0, standard deviation σ for masses and σ’ for intensities Two types of error incurred when matching compomer c to peak of mass m and intensity i(m): – Relative mass error – Relative intensity error: Error model
4
Notations Σ= {A,C,G, T } = DNA alphabet CS σ (s) = compomer spectrum of a sequence s digested at cut base σ CS(s) = (CS σ (s) ) σϵΣ = the compomer spectra obtained by performing all four cleavage reactions on s MS σ = mass spectrum obtained by base-specific cleavage of the unknown target at cut base σ MS = (MS σ (s) ) σϵΣ = the MS spectra obtained by performing all four cleavage reactions m 0 = minimum detectable mass σ = standard deviation of signed relative measurement errors in the masses σ' = standard deviation of signed relative measurement errors in the intensities = user specified tolerance parameter
5
Problem formulation Given: Mass spectra MS Reference sequence r including position of PCR primers Maximum edit distance D Standard deviations σ and σ’, tolerance parameter Find: Target sequence t flanked by PCR primers that a.is within edit distance D of r, and b.yields a matching of compomers of CS(t) to masses of MS with minimum total relative error
6
Naïve Algorithm Exhaustive search – Generate all sequences within an edit distance of D of the reference, and – Compute the minimum total relative error for matching the compomers of each of these sequences to the masses in MS. The number of candidate sequences grows exponentially with D
7
3-Stage Algorithm 1.Identify regions of the reference sequence that are unambiguously supported by MS data – High probability to be present in the unknown target sequence 2.Branch-and-bound approach to fill in remaining gaps – Generates set of candidate sequences with compomers supported by MS data 3.Compute candidate sequences with minimum total relative error – Min-cost flow problem currently solved as linear program – With or without intensities
8
First stage: finding strongly supported regions of the reference Chebyshev’s inequality: A detectable compomer c ϵ CS σ (s) is strongly matched to mass m ϵ MS σ (s) if: where ε = σ / 0.5 is set based on a user specified tolerance
9
First stage: finding strongly supported regions of the reference A strong match between compomer c and mass m is unambiguous if: – c has multiplicity of 1 in reference – c can be strongly matched only to m – m can be strongly matched only to c The set M of unambiguous matches can be found efficiently by binary search
10
First stage: finding strongly supported regions of the reference which are normally distributed with mean 0 and standard deviation σ /i 0.5 If Chebyshev’s inequality fails for index i, match(c i, m i ) is removed from M (c 1, m 1 ),..., (c n, m n ) = unambiguous matches for cut base σ, indexed in non-decreasing order of relative errors We iteratively apply Chebyshev’s inequality with tolerance to the running means of signed relative errors,
11
First stage: finding strongly supported regions of the reference A position in the reference sequence has strong support if – All detectable compomers overlapping it can be strongly matched, and – At least one of these matches is in M (unambiguous + not removed) Positions in PCR primers automatically marked as having strong support
12
Second stage: generating candidate targets by branch-and-bound Reference regions with strong support assumed to be present in target Gaps filled one base at a time, in left-to-right order, using branch-and-bound – Choice order: reference base, substitutions, deletion, insertions – Chebyshev test with tolerance applied to running means of signed relative errors of closest matches Search pruned when test fails or more than D mutations
13
Third stage: scoring candidates by linear programming Objective: – Minimize total relative error Variables: – For each c ϵ CS σ and m ϵ MS σ, x c,m is set to 1 if c is matched to m, 0 otherwise (integrality follows from total unimodularity) Constraints: – No missing peaks: each detectable compomer c ϵ CS σ (t) must be matched to one mass in MS σ – No extraneous peaks: each mass m ϵ MS σ must be matched to at least one detectable compomer c ϵ CS σ (t)
14
LP w/o intensities
15
LP with intensities
16
Simulation setup Reference length: 100-500 bp Reference sequences/targets – D=1: 10 random references, all sequences at edit distance 1 used as targets – D=2,3: 100 random reference-target pairs Error free MS data: σ = σ’ = 0 Noisy MS data: σ = 0.0001, σ’ =0-1 Tolerance parameter: = 0.01
17
Precision and Recall actual target predicted target(s) tp (true positive) Prediction is unique & correct fp (false positive) Prediction is unique & incorrect fn (false negative) Prediction is not unique
18
Branch-and-bound vs. Naïve (F-measure for D=1, error free data, w/o intensities)
19
Branch-and-bound speed-up (D=1, error free data, w/o intensities)
20
Results on noisy data (F-measure, D=1, σ = 0.0001, w/o intensities)
21
Effect of the number of mutations (F-measure, σ = 0.0001, w/o intensities)
22
Do intensities help? (F-measure, σ = 0.0001, 1 substitution)
23
Do intensities help? (F-measure, σ = 0.0001)
24
Ongoing Work Experiments on EPLD clone data – Branch-and-bound relaxation + penalty in LP objective to handle missing/extraneous peaks – Intensity data normalization: correct for mass and base composition effects
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.