Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data Gabriel Ilie 1, Alex Zelikovsky 2 and Ion Măndoiu 1 1 CSE Department,

Presentation transcript:

Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data
Gabriel Ilie (1), Alex Zelikovsky (2) and Ion Măndoiu (1)
(1) CSE Department, University of Connecticut
(2) CS Department, Georgia State University

MassCLEAVE assay for MS-based nucleic acid sequence analysis

Error model
Two types of error are incurred when matching compomer c to a peak of mass m and intensity i(m):
– Relative mass error
– Relative intensity error
Signed relative errors are assumed to follow a normal distribution with mean 0 and standard deviation σ for masses and σ’ for intensities.

Notations
– Σ = {A, C, G, T} = DNA alphabet
– CS_σ(s) = compomer spectrum of a sequence s digested at cut base σ
– CS(s) = (CS_σ(s)), σ ∈ Σ = the compomer spectra obtained by performing all four cleavage reactions on s
– MS_σ = mass spectrum obtained by base-specific cleavage of the unknown target at cut base σ
– MS = (MS_σ), σ ∈ Σ = the mass spectra obtained by performing all four cleavage reactions
– m_0 = minimum detectable mass
– σ = standard deviation of signed relative measurement errors in the masses (when used as a subscript, σ instead denotes the cut base)
– σ’ = standard deviation of signed relative measurement errors in the intensities
– A user-specified tolerance parameter
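To make the notion of a compomer spectrum concrete, here is a minimal sketch. It assumes that base-specific cleavage cuts immediately after every occurrence of the cut base and that each fragment keeps its cut base; the function name `compomer_spectrum` and the tuple encoding of a compomer as (A, C, G, T) counts are illustrative choices, not taken from the slides. Fragment masses and the minimum detectable mass m_0 are left out.

```python
from collections import Counter

ALPHABET = "ACGT"

def compomer_spectrum(s, cut_base):
    """Return a Counter mapping each compomer (A, C, G, T counts) to its
    multiplicity, assuming cleavage right after every cut_base (assumption)."""
    fragments = []
    start = 0
    for i, base in enumerate(s):
        if base == cut_base:
            fragments.append(s[start:i + 1])  # fragment ends with the cut base
            start = i + 1
    if start < len(s):
        fragments.append(s[start:])  # trailing fragment without a cut base
    spectrum = Counter()
    for fragment in fragments:
        compomer = tuple(fragment.count(b) for b in ALPHABET)
        spectrum[compomer] += 1
    return spectrum

def all_compomer_spectra(s):
    """CS(s): one compomer spectrum per cut base."""
    return {cut: compomer_spectrum(s, cut) for cut in ALPHABET}
```

Because a compomer records only base counts, two different fragments such as "AC" and "CA" produce the same compomer, which is one reason the reconstruction problem is ambiguous.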

Problem formulation
Given:
– Mass spectra MS
– Reference sequence r, including the positions of the PCR primers
– Maximum edit distance D
– Standard deviations σ and σ’, and the tolerance parameter
Find: target sequence t flanked by the PCR primers that
a. is within edit distance D of r, and
b. yields a matching of compomers of CS(t) to masses of MS with minimum total relative error

Naïve Algorithm
Exhaustive search:
– Generate all sequences within edit distance D of the reference, and
– Compute the minimum total relative error for matching the compomers of each of these sequences to the masses in MS.
The number of candidate sequences grows exponentially with D.
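The candidate generation step of the exhaustive search can be sketched as follows. This is only an illustration of the enumeration, with hypothetical function names; scoring each candidate against the mass spectra is not shown.

```python
ALPHABET = "ACGT"

def one_edit_neighbors(s):
    """All sequences obtained from s by one insertion, deletion or substitution."""
    out = set()
    for i in range(len(s) + 1):
        for base in ALPHABET:
            out.add(s[:i] + base + s[i:])  # insertion before position i
        if i < len(s):
            out.add(s[:i] + s[i + 1:])  # deletion of position i
            for base in ALPHABET:
                if base != s[i]:
                    out.add(s[:i] + base + s[i + 1:])  # substitution
    return out

def within_edit_distance(ref, max_edits):
    """All sequences within edit distance max_edits of ref (ref included)."""
    seen = {ref}
    frontier = {ref}
    for _ in range(max_edits):
        # Expand only the sequences discovered in the previous round.
        frontier = {t for s in frontier for t in one_edit_neighbors(s)} - seen
        seen |= frontier
    return seen
```

Even for the two-base reference "AC", a single allowed edit already gives 19 candidates, and the set grows quickly with both the reference length and D.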

3-Stage Algorithm
1. Identify regions of the reference sequence that are unambiguously supported by MS data
– High probability to be present in the unknown target sequence
2. Branch-and-bound approach to fill in remaining gaps
– Generates a set of candidate sequences with compomers supported by MS data
3. Compute candidate sequences with minimum total relative error
– Min-cost flow problem, currently solved as a linear program
– With or without intensities

First stage: finding strongly supported regions of the reference
Based on Chebyshev’s inequality, a detectable compomer c ∈ CS_σ(s) is strongly matched to mass m ∈ MS_σ if the relative mass error of the match is at most ε in absolute value, where ε is the standard deviation σ divided by the square root of the user-specified tolerance.
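The threshold follows from Chebyshev’s inequality, P(|X − μ| ≥ kσ) ≤ 1/k²: choosing 1/k² equal to the tolerance gives a cutoff of σ divided by the square root of the tolerance. A minimal sketch of the resulting test is below. The name `tol` for the tolerance and the choice to normalize the error by the compomer mass are assumptions, since the slide's formula did not survive.

```python
import math

def chebyshev_threshold(sigma, tol):
    """epsilon = sigma / sqrt(tol); 'tol' is a hypothetical name for the
    user-specified tolerance."""
    return sigma / math.sqrt(tol)

def is_strong_match(compomer_mass, peak_mass, sigma, tol):
    """True if the signed relative mass error is within epsilon in absolute
    value. Normalizing by the compomer mass is an assumption."""
    relative_error = (peak_mass - compomer_mass) / compomer_mass
    return abs(relative_error) <= chebyshev_threshold(sigma, tol)
```

For example, with sigma = 0.0001 and tol = 0.01 the cutoff is about 0.001, so a peak at 1000.5 is a strong match for a compomer of mass 1000 while a peak at 1002 is not.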

First stage: finding strongly supported regions of the reference
A strong match between compomer c and mass m is unambiguous if:
– c has multiplicity 1 in the reference
– c can be strongly matched only to m
– m can be strongly matched only to c
The set M of unambiguous matches can be found efficiently by binary search.
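One way the binary search could look is sketched below, using the standard `bisect` module on the sorted peak list. The input layout (a dict from a compomer label to a `(mass, multiplicity)` pair), the precomputed cutoff `eps`, and the assumption that peak masses are distinct are all illustrative choices rather than details from the slides.

```python
from bisect import bisect_left, bisect_right
from collections import Counter

def unambiguous_matches(compomers, peak_masses, eps):
    """compomers: dict label -> (mass, multiplicity in the reference).
    Returns dict label -> peak mass for the unambiguous strong matches.
    Assumes distinct peak masses and relative error normalized by compomer mass."""
    peaks = sorted(peak_masses)
    windows = {}
    peak_hits = Counter()  # how many compomers strongly match each peak
    for label, (mass, _mult) in compomers.items():
        lo = bisect_left(peaks, mass * (1 - eps))
        hi = bisect_right(peaks, mass * (1 + eps))
        windows[label] = peaks[lo:hi]  # all peaks within the tolerance window
        for peak in windows[label]:
            peak_hits[peak] += 1
    matches = {}
    for label, window in windows.items():
        multiplicity = compomers[label][1]
        if multiplicity == 1 and len(window) == 1 and peak_hits[window[0]] == 1:
            matches[label] = window[0]
    return matches
```

Each compomer costs two binary searches, so the whole pass takes O((n + p) log p) time for n compomers and p peaks, plus the size of the windows.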

First stage: finding strongly supported regions of the reference
(c_1, m_1), ..., (c_n, m_n) = unambiguous matches for cut base σ, indexed in non-decreasing order of relative errors
We iteratively apply Chebyshev’s inequality with the user-specified tolerance to the running means of the signed relative errors (the mean over the first i matches), which are normally distributed with mean 0 and standard deviation σ / i^0.5.
If Chebyshev’s inequality fails for index i, match (c_i, m_i) is removed from M.
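A rough sketch of this filtering step follows. Several details are assumptions because the slide does not spell them out: matches are ordered by the absolute value of their relative error, a removed match is left out of later running means, and the tolerance is passed as `tol`.

```python
import math

def filter_by_running_mean(signed_errors, sigma, tol):
    """Keep the matches whose running mean passes the Chebyshev test.
    The mean of i errors has standard deviation sigma / sqrt(i), so the
    cutoff at step i is sigma / sqrt(i * tol)."""
    kept = []
    total = 0.0
    for error in sorted(signed_errors, key=abs):  # ordering by |error| is an assumption
        count = len(kept) + 1
        running_mean = (total + error) / count
        if abs(running_mean) <= sigma / math.sqrt(count * tol):
            kept.append(error)
            total += error
        # Otherwise the match is dropped and does not affect later means (assumption).
    return kept
```

Testing the running mean rather than each error alone tightens the cutoff as more matches accumulate, so a systematic bias in one direction is caught even when individual errors look acceptable.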

First stage: finding strongly supported regions of the reference
A position in the reference sequence has strong support if:
– All detectable compomers overlapping it can be strongly matched, and
– At least one of these matches is in M (unambiguous and not removed)
Positions in PCR primers are automatically marked as having strong support.

Second stage: generating candidate targets by branch-and-bound
Reference regions with strong support are assumed to be present in the target.
Gaps are filled one base at a time, in left-to-right order, using branch-and-bound:
– Choice order: reference base, substitutions, deletion, insertions
– Chebyshev test with the user-specified tolerance applied to running means of signed relative errors of closest matches
The search is pruned when the test fails or when more than D mutations have been introduced.
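The search skeleton can be sketched as below for a single gap. The spectrum-based Chebyshev test is abstracted into a caller-supplied predicate `is_supported(prefix)`, which is a hypothetical stand-in; the function name and the use of a set to merge duplicate candidates are also illustrative.

```python
ALPHABET = "ACGT"

def fill_gap(ref_gap, max_edits, is_supported):
    """Enumerate fillings of one gap by branch-and-bound.
    is_supported(prefix) stands in for the Chebyshev test on the MS data."""
    found = set()

    def extend(pos, prefix, edits):
        # Bound: too many mutations, or the partial candidate lost MS support.
        if edits > max_edits or not is_supported(prefix):
            return
        if pos == len(ref_gap):
            found.add(prefix)
        else:
            ref_base = ref_gap[pos]
            extend(pos + 1, prefix + ref_base, edits)  # keep the reference base
            for base in ALPHABET:
                if base != ref_base:
                    extend(pos + 1, prefix + base, edits + 1)  # substitution
            extend(pos + 1, prefix, edits + 1)  # deletion
        for base in ALPHABET:
            extend(pos, prefix + base, edits + 1)  # insertion

    extend(0, "", 0)
    return found
```

With a predicate that always returns True the search degenerates into the exhaustive enumeration, so any speed-up comes entirely from how early the support test rejects a prefix.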

Third stage: scoring candidates by linear programming
Objective:
– Minimize total relative error
Variables:
– For each c ∈ CS_σ(t) and m ∈ MS_σ, x_{c,m} is set to 1 if c is matched to m, and 0 otherwise (integrality follows from total unimodularity)
Constraints:
– No missing peaks: each detectable compomer c ∈ CS_σ(t) must be matched to exactly one mass in MS_σ
– No extraneous peaks: each mass m ∈ MS_σ must be matched to at least one detectable compomer c ∈ CS_σ(t)
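To illustrate the objective and the two constraints on a tiny instance, the sketch below enumerates all assignments by brute force. It is a stand-in for checking small cases, not the linear program or min-cost flow formulation from the slides, and it ignores intensities. The cost of a pair is taken to be the absolute relative mass error normalized by the compomer mass, which is an assumption.

```python
from itertools import product

def best_matching(compomer_masses, peak_masses):
    """Return (total_error, assignment) where assignment[i] is the index of the
    peak matched to compomer i, or None if no feasible matching exists."""
    best = None
    all_peaks = set(range(len(peak_masses)))
    # Each compomer picks exactly one peak ("no missing peaks").
    for assignment in product(range(len(peak_masses)), repeat=len(compomer_masses)):
        # Every peak must be used by some compomer ("no extraneous peaks").
        if set(assignment) != all_peaks:
            continue
        cost = sum(abs(peak_masses[j] - c) / c
                   for c, j in zip(compomer_masses, assignment))
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best
```

Several compomers may share one peak, which models compomers of nearly equal mass that appear as a single peak in the spectrum. The enumeration is exponential in the number of compomers, which is why a polynomial formulation is needed for real spectra.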

LP w/o intensities

LP with intensities

Simulation setup
– Reference length: bp
– Reference sequences/targets:
  – D=1: 10 random references, all sequences at edit distance 1 used as targets
  – D=2,3: 100 random reference-target pairs
– Error free MS data: σ = σ’ = 0
– Noisy MS data: σ = , σ’ =0-1
– Tolerance parameter: 0.01

Precision and Recall
The predicted target(s) are compared with the actual target:
– tp (true positive): prediction is unique and correct
– fp (false positive): prediction is unique and incorrect
– fn (false negative): prediction is not unique
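Given these counts, precision, recall and the F-measure reported in the result slides can be computed with the standard definitions, as in this small sketch. The function name and the convention of returning 0 when a denominator is zero are illustrative choices.

```python
def precision_recall_f(tp, fp, fn):
    """Standard definitions: precision = tp / (tp + fp),
    recall = tp / (tp + fn), F = harmonic mean of the two."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Under the definitions above, a run that returns several equally good candidates counts against recall only, while a unique but wrong answer counts against precision.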

Branch-and-bound vs. Naïve (F-measure for D=1, error free data, w/o intensities)

Branch-and-bound speed-up (D=1, error free data, w/o intensities)

Results on noisy data (F-measure, D=1, σ = , w/o intensities)

Effect of the number of mutations (F-measure, σ = , w/o intensities)

Do intensities help? (F-measure, σ = , 1 substitution)

Do intensities help? (F-measure, σ = )

Ongoing Work
Experiments on EPLD clone data:
– Branch-and-bound relaxation + penalty in LP objective to handle missing/extraneous peaks
– Intensity data normalization: correct for mass and base composition effects