Computing Xcorr exact p values

Jeff Howbert August 14, 2013

XCorr is not well calibrated
Relative ranking: among the matches for a single spectrum, the correct peptide gets a high rank
  XCorr performs well
Absolute ranking: among the (pooled) matches from all spectra, correct peptides get high ranks
  Usual basis for computing confidence measures of matches
  For this, scores must be calibrated: matches with similar scores from different spectra have a similar level of confidence
  XCorr performs much less well


Approaches to calibration problem
Post-process raw PSM scores (PeptideProphet, Percolator, etc.)
  Model discriminates correct from incorrect PSMs
  Built using all PSMs
  Uses features related to peptides and spectra other than MS/MS masses and intensities
Calculate p value for each PSM
  Relative to distribution of possible scores for that spectrum only
  Normalizes for differing number of peaks, candidate peptides, etc.
  Compare p values across spectra in place of raw PSM scores

Exact p values via dynamic programming
Given:
  An additive score function S
  A precursor mass m
Use dynamic programming to calculate counts of scores s_i (s_min ≤ s_i ≤ s_max) over all possible peptides with mass m
Counts provide a non-parametric null distribution of scores
Can calculate p value of a PSM by comparing its score to this distribution
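The last step above can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: it assumes integerized scores, a list `counts` where `counts[i]` is the number of candidate peptides with score `s_min + i`, and the convention that the p value is the fraction of peptides scoring at least as high as the observed match. The function name and arguments are hypothetical.

```python
def p_value_from_counts(counts, s_min, observed_score):
    """Fraction of the null distribution scoring >= observed_score.

    counts[i] holds the number of peptides with integer score s_min + i
    (an assumed layout for the dynamic programming output).
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("empty null distribution")
    # Index of the observed score; scores below s_min cover the whole tail.
    start = max(observed_score - s_min, 0)
    tail = sum(counts[start:])  # empty slice (score above s_max) gives 0
    return tail / total


# Toy distribution over scores -2..2; 3 of 10 peptides score >= 1.
p = p_value_from_counts([1, 2, 4, 2, 1], s_min=-2, observed_score=1)
```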

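Computing XCorr
[Figure: the observed spectrum is processed in three stages (stage 1: sqrt; stage 2: normalize regions; stage 3: cross-correlation penalty). The theoretical spectrum for VNIQEELGK contains, for each peptide bond, a b ion, a y ion, and neutral losses. The dot product of the two spectra is the XCorr score.]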

Refactoring XCorr
Move generation of b ion, y ion, and neutral loss peaks from the theoretical spectrum to processing of the observed spectrum
Multiply the observed spectrum with a b / y / neutral loss fingerprint centered at each of the masses from 1 to m
  Measures evidence for backbone cleavage at each mass position
Small modification of standard XCorr makes the refactored XCorr additive

Refactoring XCorr
[Figure: the observed spectrum after stage 3 is multiplied by the fingerprint of b / y / neutral losses centered at m_i = 347, giving the sum of evidence for cleavage at m_i = 347. Repeating this at every mass gives a vector of cleavage evidence. The theoretical spectrum for VNIQEELGK is reduced to binary markers of backbone cleavage. The dot product of the two vectors is the "XCorr" score.]
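A small sketch of why the refactored score is additive: once the cleavage evidence vector exists, a peptide's score is just the sum of evidence values at its backbone cleavage positions. Everything here is illustrative. The `evidence` mapping is made-up data, positions are taken to be integer prefix residue masses (an assumption; the slides do not say whether the proton mass is included), and the function names are hypothetical.

```python
# Nominal (integer) residue masses of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def cleavage_positions(peptide):
    """Integer prefix masses at each internal peptide bond."""
    positions = []
    mass = 0
    for aa in peptide[:-1]:  # no cleavage after the last residue
        mass += RESIDUE_MASS[aa]
        positions.append(mass)
    return positions


def refactored_score(evidence, peptide):
    """Dot product of binary cleavage markers with the evidence vector.

    evidence maps integer mass -> integerized cleavage evidence; masses
    that are absent contribute 0.
    """
    return sum(evidence.get(m, 0) for m in cleavage_positions(peptide))


# Made-up evidence values at three mass positions.
toy_evidence = {99: 3, 213: -1, 326: 5}
score = refactored_score(toy_evidence, "VNIQ")
```

Because each cleavage position contributes independently, the score of a peptide can be built up one residue at a time, which is what dynamic programming requires.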

Correlation of Crux and refactored XCorr

Dynamic programming on XCorr
Count of partial peptides with score s_i and mass m_j is the sum of counts of partial peptides with one less residue and score s_i - s_e, where s_e is the measure of cleavage evidence at m_j. Each term in the sum is weighted by p_a, the (position-specific) probability of occurrence of amino acid a.
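The recursion can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation from the talk: residue weights are position-independent (the slides use position-specific probabilities), evidence values are already integerized, positions are integer prefix residue masses, and the full-length position contributes no evidence because it is not a cleavage. All names are hypothetical.

```python
from collections import defaultdict


def score_distribution(evidence, target_mass, residues):
    """Weighted counts of scores over all residue sequences of a given mass.

    evidence[m]  integer cleavage evidence at prefix mass m (0..target_mass)
    residues     list of (integer residue mass, weight) pairs; use weight
                 1.0 for raw counts or p_a for probability weighting
    Returns {score: weighted count} for sequences summing to target_mass.
    """
    # dist[m][s] = weighted count of partial peptides with mass m, score s
    dist = [defaultdict(float) for _ in range(target_mass + 1)]
    dist[0][0] = 1.0  # the empty prefix
    for m in range(1, target_mass + 1):
        # Assumption: the end of the peptide is not a cleavage site.
        e = evidence[m] if m < target_mass else 0
        for res_mass, weight in residues:
            prev = m - res_mass
            if prev < 0:
                continue
            for s, c in dist[prev].items():
                dist[m][s + e] += weight * c
    return dict(dist[target_mass])


# Toy alphabet with residue masses 1 and 2, target mass 3.
toy = score_distribution([0, 1, 2, 0], 3, [(1, 1.0), (2, 1.0)])
```

In the toy case the three sequences (1+1+1, 1+2, 2+1) pass through prefix masses {1, 2}, {1}, and {2}, so they score 3, 1, and 2 respectively. The work is proportional to the mass, the number of distinct scores, and the alphabet size.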

Score distributions are dependent on spectrum and selected precursor mass

Transformation of XCorr p values
Sidak correction:
  Corrects for multiple testing (similar to Bonferroni)
  Accounts for the fact that database search considers hundreds of peptides for each spectrum
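For reference, the standard Sidak formula is p_corrected = 1 - (1 - p)^n, where n is the number of tests. The sketch below assumes n is the number of candidate peptides scored against the spectrum; how the talk's implementation chooses n is not stated on the slide. The function name is hypothetical.

```python
import math


def sidak_correct(p, n_candidates):
    """Sidak-corrected p value: 1 - (1 - p) ** n_candidates.

    Uses log1p / expm1 so that very small p values do not lose precision.
    """
    if p >= 1.0:
        return 1.0
    return -math.expm1(n_candidates * math.log1p(-p))


# A raw p value of 1e-6 with 100 candidate peptides is roughly 1e-4.
corrected = sidak_correct(1e-6, 100)
```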

Comparison of scoring functions experimental design
All experimental parameters matched as exactly as possible (e.g. missed cleavages, peptide search window, precursor isotopes, …)
Protein fasta file predigested to target peptide set with trypsin/P.
Decoy peptide set generated by shuffling target peptides; redundant/overlapping peptides removed.
Size ratio of target vs. decoy peptide sets ~ 1.0.
Spectrum file set up so every search engine is forced to find matches to an identical set of spectrum-charge combinations.
Scoring function performance compared via absolute ranking (q-value) curves, generated via target/decoy analysis.
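One common way to turn a pooled, ranked list of target and decoy PSMs into q values is sketched below. It is an illustration only: it assumes higher scores are better, estimates FDR as decoys / targets above each threshold (reasonable when the target and decoy sets are about the same size, as stated above), and ignores ties. The exact estimator used for the curves in this talk is not given, and the function name is hypothetical.

```python
def target_decoy_qvalues(scores, is_decoy):
    """q value per PSM from a pooled target/decoy ranking.

    scores    higher is better (e.g. XCorr, or -log of a p value)
    is_decoy  parallel list of booleans
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdrs = []
    targets = decoys = 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among everything ranked at or above this PSM.
        fdrs.append(min(1.0, decoys / max(targets, 1)))
    # q value = smallest FDR at which this PSM would still be accepted.
    qvalues = [0.0] * len(scores)
    running_min = 1.0
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        qvalues[order[rank]] = running_min
    return qvalues


q = target_decoy_qvalues([5.0, 4.0, 3.0, 2.0], [False, False, True, False])
```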

Spectrum-level performance of scoring functions
worm-PAnDA: MS1 high resolution, MS2 low resolution
yeast-01: MS1 / MS2 low resolution

Spectrum-level performance of scoring functions
human heart: 12 MudPIT runs, MS1 / MS2 low resolution

Peptide-level performance of scoring functions
worm-PAnDA: MS1 high resolution, MS2 low resolution
yeast-01: MS1 / MS2 low resolution

Peptide-level performance of scoring functions
human heart: 12 MudPIT runs, MS1 / MS2 low resolution

Other applications of XCorr exact p values
Produce better calibrated peptide-level p values
In hypothesis-driven approach, to calculate best spectrum match for each peptide in database (collaboration with MacCoss lab)
As model for dynamic programming on other score functions (collaboration with Bilmes lab)

Supplemental slides

Exact p values improve calibration of XCorr


Uses of XCorr exact p values
Use target and decoy exact p values as calibrated pseudo-scores to calculate q values in target/decoy framework
Use target exact p values:
  to calculate PSM q values without reference to decoys, using Benjamini-Hochberg formula
  in hypothesis-driven approach, to calculate best p value for SPMs (spectrum-peptide matches) for each candidate peptide
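The decoy-free route can be illustrated with the standard Benjamini-Hochberg step-up procedure, which converts a list of p values into q values. This is a generic sketch of that procedure, assuming one p value per PSM; the function name is hypothetical and nothing here is specific to the talk's code.

```python
def benjamini_hochberg_qvalues(p_values):
    """Benjamini-Hochberg q values (adjusted p values), in input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


q = benjamini_hochberg_qvalues([0.01, 0.04, 0.03, 0.5])
```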

Calculating XCorr exact p values
CHALLENGE: calculating distribution of scores for a single spectrum

Shotgun proteomics experiment: peptide-spectrum match (PSM) by database search
[Figure: 10-50 K tandem mass spectra are searched against 100 K+ predicted peptides (targets) and 100 K+ shuffled peptides (decoys). Each spectrum is scored against its candidate target peptides and candidate decoy peptides, giving one list of PSM scores for targets and one for decoys; the top match in each list is marked with *.]

XCorr exact p values via Weibull fitting
Fit Weibull curve to distribution of XCorr scores for all matches to a given spectrum (except top match)
Calculate p value for a given match from CDF of fitted curve

Fragment ion nomenclature
N-terminal (prefix) fragment = b ion
  b - H2O, b - NH3, a (b - CO), a - H2O, etc.
C-terminal (suffix) fragment = y ion
  y - H2O, y - NH3, etc.
NOTE: for a given peptide mass m, m_b + m_y = m + 2
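The note about complementary masses can be checked with nominal integer masses. The sketch assumes singly charged fragments, a peptide mass that includes one water (18), and a proton mass of 1; the helper names are hypothetical.

```python
# Nominal (integer) residue masses of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
WATER = 18
PROTON = 1


def peptide_mass(peptide):
    """Neutral peptide mass: residues plus one water (H on N-term, OH on C-term)."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def b_ion_mass(peptide, k):
    """Singly charged b ion made of the first k residues."""
    return sum(RESIDUE_MASS[aa] for aa in peptide[:k]) + PROTON


def y_ion_mass(peptide, k):
    """Singly charged y ion complementary to b_k (residues k onward)."""
    return sum(RESIDUE_MASS[aa] for aa in peptide[k:]) + WATER + PROTON


# For every cleavage site k of NDEMK, b_k + y_k equals m + 2.
m = peptide_mass("NDEMK")
sums = [b_ion_mass("NDEMK", k) + y_ion_mass("NDEMK", k) for k in range(1, 5)]
```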

Dynamic programming: exact count of peptides for each mass in a range
Given for this problem:
  Mass range in which to calculate exact counts, e.g. 1 to 2000
  A set of amino acids and their (integer) residue masses: m_G = 57, m_A = 71, … , m_W = 186
Dynamic programming works for problems where each answer (including the final one) can be calculated from results of smaller subproblems.
Need to define a recursion. Count of peptides with mass m is the sum of counts of peptides with one less residue:
  C(m) = sum over amino acids a of C(m - m_a)
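A minimal sketch of this counting recursion follows. It assumes the array is seeded with a count of 1 at mass 18 (the bare H and OH termini with no residues), and it counts sequences, so Leu and Ile (both 113) and Lys and Gln (both 128) are counted separately. The function name and the seeding convention are assumptions made for illustration.

```python
# Nominal residue masses for the 20 standard amino acids (L/I and K/Q repeat).
RESIDUE_MASSES = [57, 71, 87, 97, 99, 101, 103, 113, 113, 114,
                  115, 128, 128, 129, 131, 137, 147, 156, 163, 186]


def count_peptides_by_mass(max_mass, residue_masses=RESIDUE_MASSES, base_mass=18):
    """counts[m] = number of amino acid sequences with integer mass m."""
    counts = [0] * (max_mass + 1)
    counts[base_mass] = 1  # empty sequence: just H and OH
    for m in range(base_mass + 1, max_mass + 1):
        # Each sequence of mass m ends in some residue a; strip it off.
        counts[m] = sum(counts[m - r] for r in residue_masses
                        if m - r >= base_mass)
    return counts


counts = count_peptides_by_mass(2000)
# counts[18 + 128] covers K, Q, GA and AG.
```

Each mass is visited once and each visit loops over the amino acids, so the work grows with the mass range times the alphabet size.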

Dynamic programming: exact count of peptides for each mass in a range
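H - X1 - X2 - X3 - … - Xn-2 - Xn-1 - Xn - OH
[Figure: dynamic programming array of counts, running from m = 18 (count of 1) to m = 2000]
Computational complexity = O( m × |AA| ) ~ sec.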

Count of partial peptides with score s_i and mass m_j is the sum of counts of partial peptides with one less residue and score s_i - s_e, where s_e is the measure of cleavage evidence at m_j

Last column holds distribution of score counts for all peptides with m = 150
O( m × ( s_max - s_min ) × |AA| ) ~ 1 sec for m = 1500

Computational complexity = O( m × ( s_max - s_min ) × |AA| )
Runtime per spectrum ~ 1 sec for m = 1500

Integerizing XCorr
For dynamic programming to work, both masses and scores must be discretized (integerized), because they will be used as indices into the dynamic programming array.
There are ways to do this badly …

Incorporating dynamic programming / p values into database search
As follow-on to standard Crux processing. For each spectrum:
  Choose best PSM according to rank of XCorr scores
  Calculate p value of this PSM
OR
As integral part of database search. For each spectrum:
  Calculate p values of all PSMs
  Choose best PSM according to rank of p values

Top-ranked PSMs from XCorr and exact p values not necessarily the same

Exact p values improve calibration of XCorr: best PSM according to rank of XCorr scores

Exact p values improve calibration of XCorr: best PSM according to rank of p values

Fine-tuning dynamic programming on XCorr: 10000 random PSMs (all ranks)
Initial implementation of dynamic programming weighted all possible peptides equally
p values for shuffled tryptic peptides were not quite right for a null distribution

Adjusting counts according to amino acid probabilities made things worse

But adjusting counts using position-specific amino acid probabilities appropriate for tryptic peptides solved the problem

Top-ranked decoy PSMs from exact p values show proper null distribution

Using optimal bin offset improves search using exact p values

Calibrations from exact p values and Percolator are synergistic

Peptide backbone
H...-HN-CH(R_i-1)-CO-NH-CH(R_i)-CO-NH-CH(R_i+1)-CO-…OH
N-terminus on the left, C-terminus on the right; side chains R_i-1, R_i, R_i+1 belong to AA residues i-1, i, i+1

Peptide fragmentation
Collision induced dissociation (CID)
H+  H...-HN-CH(R_i-1)-CO | NH-CH(R_i)-CO-NH-CH(R_i+1)-CO-…OH
Prefix fragment to the left of the break, suffix fragment to the right
Peptides tend to fragment along the backbone
Fragments can also lose neutral chemical groups like NH3 and H2O

Example of spectrum graph for NDEMK

Prefix and suffix fragment ions

Example of conditional probability table
Pos(m): the region of the cleavage site in the peptide (0 = first fifth, 4 = last fifth).
Table: Prob(b | Pos(m), y)
Columns: Pos(m), y, b, Prob, Log Prob
[Table: probability values not shown. Rows enumerate the intensity levels (zero, low, medium, high) of y and b: zero/zero, zero/low, zero/medium, zero/high, low/zero, low/low, low/medium, low/high, medium/zero, medium/low, medium/medium, medium/high, high/zero]

Flanking amino acid equivalence sets
0   X-X (default)
1   Pro-X
2   X-Pro
3   Gly-X
4   X-Gly
5   Arg/Lys-X
6   His-X
7   X-His
8   Asp/Glu-X
9   X-Asp/Glu
10  Ile/Leu/Val-X
11  X-Ile/Leu/Val
12  Ser/Thr-X
13  X-Ser/Thr
14  Asn-X
15  X-Asn
