1 Computing XCorr exact p values
Jeff Howbert, August 14, 2013

2 XCorr is not well calibrated
Relative ranking: among the matches for a single spectrum, the correct peptide gets a high rank.
- XCorr performs well.
Absolute ranking: among the (pooled) matches from all spectra, correct peptides get high ranks.
- Usual basis for computing confidence measures of matches.
- For this, scores must be calibrated: matches with similar scores from different spectra have a similar level of confidence.
- XCorr performs much less well.

3 XCorr is not well calibrated

4 Approaches to calibration problem
Post-process raw PSM scores (PeptideProphet, Percolator, etc.)
- Model discriminates correct from incorrect PSMs.
- Built using all PSMs.
- Uses features related to peptides and spectra other than MS/MS masses and intensities.
Calculate a p value for each PSM
- Relative to the distribution of possible scores for that spectrum only.
- Normalizes for differing numbers of peaks, candidate peptides, etc.
- Compare p values across spectra in place of raw PSM scores.

5 Exact p values via dynamic programming
Given:
- An additive score function S
- A precursor mass m
Use dynamic programming to calculate counts of scores s_i (s_min ≤ s_i ≤ s_max) over all possible peptides with mass m.
- Counts provide a non-parametric null distribution of scores.
- Can calculate the p value of a PSM by comparing its score to this distribution.
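The last step, turning a table of score counts into a p value, can be sketched as follows. This is a minimal illustration, not the author's code: the function name and the convention that `counts[i]` holds the number of peptides with integer score `s_min + i` are assumptions made for the example.

```python
def p_value_from_counts(counts, s_min, observed_score):
    """Fraction of all candidate peptides scoring at least observed_score.

    counts[i] is the number of peptides with integer score s_min + i
    (an assumed layout for this sketch).
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("empty null distribution")
    # Index of the observed score in the count array; clamp below at 0.
    start = max(observed_score - s_min, 0)
    tail = sum(counts[start:])
    return tail / total


# Toy distribution over scores -2..2 with counts 1, 4, 6, 4, 1.
example_p = p_value_from_counts([1, 4, 6, 4, 1], s_min=-2, observed_score=1)
```

In the toy case, 5 of the 16 peptides score 1 or higher, so the p value is 5/16.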

6 Computing XCorr
Theoretical spectrum (example peptide VNIQEELGK): for each peptide bond, b ion, y ion, neutral losses.
Observed spectrum: stage 1, sqrt; stage 2, normalize regions; stage 3, cross-corr. penalty.
XCorr score: dot product of the two spectra.

7 Refactoring XCorr
- Move generation of b ion, y ion, and neutral loss peaks from the theoretical spectrum to the processing of the observed spectrum.
- Multiply the observed spectrum with a b/y/neutral loss fingerprint centered at each of the masses from 1 to m.
- This measures the evidence for backbone cleavage at each mass position.
- A small modification of standard XCorr makes the refactored XCorr additive.

8 Refactoring XCorr
Observed spectrum side: observed spectrum (after stage 3) × fingerprint of b / y / neutral losses centered at mi = 347 → sum of evidence for cleavage at mi = 347 → vector of cleavage evidence.
Theoretical spectrum side: VNIQEELGK → binary markers of backbone cleavage.
Dot product of the two → “XCorr” score.
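A simplified sketch of why the refactored score is additive. Everything here is an assumption made for illustration: integer nominal residue masses, singly charged fragments only, a fingerprint containing just the b and y peaks with unit weight (no neutral losses, no cross-correlation penalty), and the function names themselves. It is not the Crux implementation.

```python
# Integer nominal residue masses (Da).
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def cleavage_evidence(observed, residue_sum):
    """evidence[k] = observed intensity supporting a cleavage after prefix mass k.

    observed maps integer m/z bins to intensities. For prefix mass k the
    singly charged b ion is at k + 1 and the y ion at (residue_sum - k) + 19.
    """
    evidence = [0.0] * (residue_sum + 1)
    for k in range(1, residue_sum):
        b_mz = k + 1
        y_mz = (residue_sum - k) + 19
        evidence[k] = observed.get(b_mz, 0.0) + observed.get(y_mz, 0.0)
    return evidence


def refactored_score(peptide, evidence):
    """Sum the evidence at each backbone cleavage site of the peptide."""
    prefix = 0
    score = 0.0
    for aa in peptide[:-1]:  # no cleavage after the last residue
        prefix += RESIDUE_MASS[aa]
        score += evidence[prefix]
    return score


# Toy observed spectrum: the b2 and y7 peaks of VNIQEELGK.
observed = {214: 2.0, 816: 3.0}
residue_sum = sum(RESIDUE_MASS[a] for a in "VNIQEELGK")
evidence = cleavage_evidence(observed, residue_sum)
```

Because the evidence vector depends only on the observed spectrum and the precursor mass, it is computed once and reused for every candidate peptide; each candidate's score is then a plain sum over its cleavage positions, which is the property the dynamic programming relies on.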

9 Correlation of Crux and refactored XCorr

10 Dynamic programming on XCorr
Count of partial peptides with score si and mass mj is the sum of the counts of partial peptides with one less residue and score si – se, where se is the measure of cleavage evidence at mj and pa is the (position-specific) probability of occurrence of amino acid a.
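The recurrence described above can be written out as follows. This is a reconstruction from the wording of the slide, not a copy of the original equation; the symbol C for the (weighted) count and m_a for the mass of amino acid a are notation assumed here.

```latex
C(s_i, m_j) \;=\; \sum_{a \in AA} p_a \, C\bigl(s_i - s_e,\; m_j - m_a\bigr),
\qquad s_e = \text{cleavage evidence at } m_j
```

Setting every p_a to 1 gives plain counts in which all possible peptides are weighted equally.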

11 Score distributions are dependent on spectrum and selected precursor mass

12 Transformation of XCorr p values
Šidák correction:
- Corrects for multiple testing (similar to Bonferroni).
- Accounts for the fact that a database search considers hundreds of peptides for each spectrum.
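For reference, the standard Šidák formula is p' = 1 - (1 - p)^n. The sketch below assumes that n is the number of candidate peptides scored against the spectrum; the slide does not state exactly which n is used, and the function name is illustrative.

```python
import math


def sidak_correct(p_value, n_candidates):
    """Sidak-corrected p value: 1 - (1 - p)^n.

    Uses log1p/expm1 so that very small p values do not lose precision.
    n_candidates is assumed to be the number of peptides tested per spectrum.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    if p_value == 1.0:
        return 1.0
    return -math.expm1(n_candidates * math.log1p(-p_value))
```

For small p the result is close to the Bonferroni value n × p, which is why the two corrections are described as similar.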

13 Comparison of scoring functions experimental design
- All experimental parameters matched as exactly as possible (e.g. missed cleavages, peptide search window, precursor isotopes, …).
- Protein fasta file predigested to target peptide set with trypsin/P.
- Decoy peptide set generated by shuffling target peptides; redundant/overlapping peptides removed.
- Size ratio of target vs. decoy peptide sets ~ 1.0.
- Spectrum file set up so every search engine is forced to find matches to an identical set of spectrum-charge combinations.
- Scoring function performance compared via absolute ranking (q-value) curves, generated via target/decoy analysis.
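One common way to turn a pooled target/decoy ranking into q values is sketched below. It assumes a target:decoy size ratio of 1.0 (as in the design above), estimates the FDR at each threshold as decoys / targets, and takes the q value as the minimum FDR at that threshold or any more permissive one. The exact estimator used for the curves in this deck is not stated, so treat this as an illustrative assumption.

```python
def target_decoy_q_values(psms):
    """psms: list of (score, is_decoy) pairs; a higher score is better.

    Returns (score, q_value) for the target PSMs, best score first.
    """
    ranked = sorted(psms, key=lambda psm: -psm[0])
    fdrs = []
    targets = decoys = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among targets accepted at this threshold.
        fdrs.append(min(1.0, decoys / max(targets, 1)))
    # q value: smallest FDR at this threshold or any lower-scoring one.
    q_values = []
    running_min = 1.0
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()
    return [(score, q) for (score, is_decoy), q in zip(ranked, q_values)
            if not is_decoy]


example = target_decoy_q_values(
    [(5, False), (4, False), (3, True), (2, False), (1, False)]
)
```

When ranking by p value rather than by raw score, pass the negated p value (or -log10 p) as the score so that higher still means better.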

14 Spectrum-level performance of scoring functions
Panels: worm-PAnDA (MS1 high resolution, MS2 low resolution); yeast-01 (MS1 / MS2 low resolution).

15 Spectrum-level performance of scoring functions
Panel: human heart, 12 MudPIT runs (MS1 / MS2 low resolution).

16 Peptide-level performance of scoring functions
Panels: worm-PAnDA (MS1 high resolution, MS2 low resolution); yeast-01 (MS1 / MS2 low resolution).

17 Peptide-level performance of scoring functions
Panel: human heart, 12 MudPIT runs (MS1 / MS2 low resolution).

18 Other applications of XCorr exact p values
- Produce better calibrated peptide-level p values.
- In a hypothesis-driven approach, to calculate the best spectrum match for each peptide in the database (collaboration with MacCoss lab).
- As a model for dynamic programming on other score functions (collaboration with Bilmes lab).

19 Supplemental slides

20 Exact p values improve calibration of XCorr

21 p value

22 Uses of XCorr exact p values
Use target and decoy exact p values as calibrated pseudo-scores to calculate q values in the target/decoy framework.
Use target exact p values:
- to calculate PSM q values without reference to decoys, using the Benjamini-Hochberg formula
- in a hypothesis-driven approach, to calculate the best p value for SPMs (spectrum-peptide matches) for each candidate peptide
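The decoy-free route can be sketched with the standard Benjamini-Hochberg adjustment, where the q value for the PSM of rank r among n is the minimum over ranks k ≥ r of p_(k) · n / k. The function name is illustrative, and the sketch makes no claim about how the author handled ties or the choice of n.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg q values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotone q values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q_values[i] = running_min
    return q_values


example_q = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

This only gives meaningful q values if the p values are well calibrated, which is the point of computing them exactly.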

23 Calculating XCorr exact p values
CHALLENGE: calculating distribution of scores for a single spectrum

24 Shotgun proteomics experiment
Peptide-spectrum match (PSM) by database search
- 100 K+ predicted peptides (targets)
- 100 K+ shuffled peptides (decoys)
- 10-50 K tandem mass spectra
[Figure: two lists of example candidate peptides (targets and decoys) and two columns of PSM scores.]

25 XCorr exact p values via Weibull fitting
- Fit a Weibull curve to the distribution of XCorr scores for all matches to a given spectrum (except the top match).
- Calculate the p value for a given match from the CDF of the fitted curve.
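For reference, the p value read off a fitted Weibull curve is its upper tail. The three-parameter form below (shape k, scale λ, shift μ) is an assumption; the slide says only "Weibull curve" and does not give the parameterization.

```latex
F(x) = 1 - \exp\!\left[-\left(\frac{x - \mu}{\lambda}\right)^{k}\right], \quad x \ge \mu,
\qquad
p(x) = 1 - F(x) = \exp\!\left[-\left(\frac{x - \mu}{\lambda}\right)^{k}\right]
```

Unlike the dynamic programming counts, this p value depends on how well the parametric curve fits the tail of the observed scores.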

26 Fragment ion nomenclature
N-terminal (prefix) fragment = b ion
- b – H2O, b – NH3, a (b – CO), a – H2O, etc.
C-terminal (suffix) fragment = y ion
- y – H2O, y – NH3, etc.
NOTE: for a given peptide mass m, mb + my = m + 2
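The identity mb + my = m + 2 can be checked with integer nominal masses. The sketch assumes singly charged ions, a peptide mass equal to the residue sum plus 18 (water), b = prefix sum + 1 and y = suffix sum + 19; the function name is illustrative.

```python
# Integer nominal residue masses (Da).
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def b_y_ions(peptide):
    """Return (peptide_mass, [(b, y), ...]) with one pair per peptide bond."""
    residues = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(residues)
    peptide_mass = total + 18          # residues plus H and OH
    pairs = []
    prefix = 0
    for mass in residues[:-1]:
        prefix += mass
        b = prefix + 1                 # prefix fragment plus a proton
        y = (total - prefix) + 19      # suffix fragment plus water and a proton
        pairs.append((b, y))
    return peptide_mass, pairs


mass, pairs = b_y_ions("NDEMK")
```

Every (b, y) pair sums to the peptide mass plus 2, so a peak at one position implies where its complementary ion should appear.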

27 Dynamic programming: exact count of peptides for each mass in a range
Given for this problem:
- Mass range in which to calculate exact counts, e.g. 1 to 2000
- A set of amino acids and their (integer) residue masses: mG = 57, mA = 71, … , mW = 186
Dynamic programming works for problems where each answer (including the final one) can be calculated from the results of smaller subproblems.
- Need to define a recursion.
- Count of peptides with mass m is the sum of the counts of peptides with one less residue.

28 Dynamic programming: exact count of peptides for each mass in a range
H - X1 - X2 - X3 - … - Xn-2 - Xn-1 - Xn - OH
Count = 1 at m = 18; array extends to m = 2000.
O( m × |AA| ) ~ sec.
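A minimal sketch of this counting recursion, C(m) = Σ_a C(m - m_a). It assumes that the base case is a count of 1 at m = 18 (H plus OH with no residues), that peptides are ordered sequences, and that Ile and Leu are counted separately even though both have mass 113; the function name is illustrative.

```python
# Integer nominal residue masses for the 20 amino acids (113 appears twice: L, I).
RESIDUE_MASSES = [57, 71, 87, 97, 99, 101, 103, 113, 113, 114,
                  115, 128, 128, 129, 131, 137, 147, 156, 163, 186]

WATER = 18


def count_peptides_by_mass(max_mass, residue_masses=RESIDUE_MASSES):
    """counts[m] = number of residue sequences whose peptide mass is exactly m."""
    if max_mass < WATER:
        raise ValueError("max_mass must be at least 18")
    counts = [0] * (max_mass + 1)
    counts[WATER] = 1  # the empty sequence: H and OH only
    for m in range(WATER + 1, max_mass + 1):
        # Remove the last residue: every residue mass r leads back to m - r.
        counts[m] = sum(counts[m - r] for r in residue_masses if m - r >= WATER)
    return counts


counts = count_peptides_by_mass(200)
```

Each mass is visited once and each visit loops over the amino acids, which matches the O(m × |AA|) cost on the slide.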

29 Dynamic programming on XCorr
Count of partial peptides with score si and mass mj is the sum of the counts of partial peptides with one less residue and score si – se, where se is the measure of cleavage evidence at mj.

30 Dynamic programming on XCorr

31 Dynamic programming on XCorr
Last column holds the distribution of score counts for all peptides with m = 150.
O( m × ( smax – smin ) × | AA | ) ~ 1 sec for m = 1500

32 Dynamic programming on XCorr
Computational complexity = O( m × ( smax – smin ) × | AA | )
Runtime per spectrum ~ 1 sec for m = 1500
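The two-dimensional recursion can be sketched as below. Assumptions made for the example: masses are residue sums starting at 0 (so the empty prefix has mass 0 and score 0), `evidence[m]` is an integer cleavage-evidence value for mass m, all peptides are weighted equally, and the caller supplies score bounds wide enough for every reachable score. Names are illustrative.

```python
def score_count_distribution(target_mass, evidence, residue_masses, s_min, s_max):
    """Counts of peptides by total score, for peptides of residue mass target_mass.

    Returns a list where entry i is the number of residue sequences with
    score s_min + i. Requires s_min <= 0 <= s_max.
    """
    width = s_max - s_min + 1
    counts = [[0] * width for _ in range(target_mass + 1)]
    counts[0][0 - s_min] = 1  # empty prefix: mass 0, score 0
    for m in range(1, target_mass + 1):
        e = evidence[m]
        row = counts[m]
        for r in residue_masses:
            if r > m:
                continue
            for idx, c in enumerate(counts[m - r]):
                if c == 0:
                    continue
                new_idx = idx + e  # score grows by the evidence at mass m
                if not 0 <= new_idx < width:
                    raise ValueError("score bounds too narrow")
                row[new_idx] += c
    return counts[target_mass]


# Toy alphabet with residue masses 2 and 3; peptides of total mass 5.
toy = score_count_distribution(
    target_mass=5, evidence=[0, 0, 1, 2, 0, 0], residue_masses=[2, 3],
    s_min=0, s_max=3,
)
```

In the toy case the two sequences (2, 3) and (3, 2) pass through masses 2 and 3 respectively, so they score 1 and 2. The three nested loops give the O(m × (smax – smin) × |AA|) cost; weighting by amino acid probabilities would multiply `c` by p_a before adding it.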

33 Integerizing XCorr
For dynamic programming to work, both masses and scores must be discretized (integerized), because they will be used as indices into the dynamic programming array. There are ways to do this badly …
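One possible discretization is sketched below. The bin width of 1.0005079 Da is a value often used for low-resolution MS/MS binning, and the offset and score scale factor are hypothetical parameters chosen for illustration; none of these are taken from the slides, and the function names are made up for the example.

```python
import math


def integerize_mass(mass, bin_width=1.0005079, bin_offset=0.0):
    """Map a real-valued mass to an integer bin index (hypothetical scheme)."""
    return int(math.floor(mass / bin_width + bin_offset))


def integerize_score(score, scale=20.0):
    """Map a real-valued evidence score to an integer by scaling and rounding."""
    return int(round(score * scale))


lysine_bin = integerize_mass(128.09496)   # monoisotopic Lys residue mass
score_index = integerize_score(1.234)
```

A coarser score scale shrinks the dynamic programming array and speeds it up, at the cost of larger rounding error in the resulting p values.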

34 Incorporating dynamic programming / p values into database search
As a follow-on to standard Crux processing. For each spectrum:
- Choose the best PSM according to rank of XCorr scores.
- Calculate the p value of this PSM.
OR
As an integral part of the database search. For each spectrum:
- Calculate p values of all PSMs.
- Choose the best PSM according to rank of p values.

35 Top-ranked PSMs from XCorr and exact p values not necessarily the same

36 Exact p values improve calibration of XCorr
Best PSM according to rank of XCorr scores

37 Exact p values improve calibration of XCorr
Best PSM according to rank of p values

38 Fine-tuning dynamic programming on XCorr
10000 random PSMs (all ranks)
- Initial implementation of dynamic programming weighted all possible peptides equally.
- p values for shuffled tryptic peptides not quite right for a null distribution.

39 Fine-tuning dynamic programming on XCorr
10000 random PSMs (all ranks)
- Adjusting counts according to amino acid probabilities made things worse.

40 Fine-tuning dynamic programming on XCorr
10000 random PSMs (all ranks)
- But adjusting counts using position-specific amino acid probabilities appropriate for tryptic peptides solved the problem.

41 Top-ranked decoy PSMs from exact p values show proper null distribution

42 Using optimal bin offset improves search using exact p values

43 Calibrations from exact p values and Percolator are synergistic

44 Peptide backbone
N-terminus  H...-HN-CH(Ri-1)-CO-NH-CH(Ri)-CO-NH-CH(Ri+1)-CO-…OH  C-terminus
Residues, left to right: AA residue i-1, AA residue i, AA residue i+1

45 Peptide fragmentation
Collision induced dissociation (CID), H+:
H...-HN-CH(Ri-1)-CO | NH-CH(Ri)-CO-NH-CH(Ri+1)-CO-…OH
Prefix fragment | Suffix fragment
- Peptides tend to fragment along the backbone.
- Fragments can also lose neutral chemical groups like NH3 and H2O.

46 Example of spectrum graph for NDEMK

47 Prefix and suffix fragment ions

48 Example of conditional probability table
Pos(m): the region of the cleavage site in the peptide (0 = first fifth, 4 = last fifth).
Table: Prob(b | Pos(m), y)
Columns: Pos(m), y, b, Prob, Log Prob
Rows enumerate (y, b) intensity-level pairs: (zero, zero), (zero, low), (zero, medium), (zero, high), (low, zero), (low, low), (low, medium), (low, high), (medium, zero), (medium, low), (medium, medium), (medium, high), (high, zero)

49 Flanking amino acid equivalence sets
0 X-X (default)
1 Pro-X
2 X-Pro
3 Gly-X
4 X-Gly
5 Arg/Lys-X
6 His-X
7 X-His
8 Asp/Glu-X
9 X-Asp/Glu
10 Ile/Leu/Val-X
11 X-Ile/Leu/Val
12 Ser/Thr-X
13 X-Ser/Thr
14 Asn-X
15 X-Asn
