False-Discovery-Rate Aware Protein Inference by Generalized Protein Parsimony
Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
Peptide-Spectrum Matches
- Sigma49: 32,691 LTQ MS/MS spectra of 49 human protein standards; IPI Human
- Yeast: 162,420 LTQ MS/MS spectra from a yeast cell lysate; SGD
- X!Tandem E-value (no refinement), 1% FDR
Spectra used in: Zhang, B.; Chambers, M. C.; Tabb, D. L
Traditional Protein Parsimony
- Select the smallest set of proteins that explains all identified peptides
- A sensible principle; it implies eliminating equivalent and subset proteins
- Equivalent proteins are problematic: which one to choose?
- Peptides unique to a single protein force that protein into the solution
  - True for most tools, even probability-based ones
  - Bad consequences for FDR-filtered identifications
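The selection rule on this slide is a minimum set cover. As a minimal sketch (not the talk's software), the exhaustive search below finds one smallest covering set for a small example. The function name `traditional_parsimony` and the accessions P1 to P4 are hypothetical, and brute force is only practical for small components.

```python
from itertools import combinations


def traditional_parsimony(protein_peptides):
    """Return one smallest set of proteins covering every identified peptide.

    protein_peptides maps a protein accession to the set of identified
    peptides it contains. Subsets are tried in order of increasing size,
    so the first full cover found is a minimum cover.
    """
    all_peptides = set().union(*protein_peptides.values())
    proteins = sorted(protein_peptides)
    for size in range(len(proteins) + 1):
        for subset in combinations(proteins, size):
            covered = set().union(*(protein_peptides[p] for p in subset))
            if covered == all_peptides:
                return set(subset)
    return set(proteins)


# Hypothetical toy data: peptide "a" is unique to P1, so P1 is forced in.
example = {
    "P1": {"a", "b", "c"},
    "P2": {"b", "c"},       # subset of P1
    "P3": {"c", "d"},
    "P4": {"d"},            # subset of P3
}
selected = traditional_parsimony(example)
print(sorted(selected))
```

Note that {P1, P4} covers the peptides equally well here; the sketch simply returns the first minimum cover in sorted order. Any single peptide unique to one protein, true or false, forces that protein into every cover.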
Many proteins are easy
- Eliminate equivalent / dominated proteins
  - Sigma49: 277 → 60 proteins
  - Yeast: 1226 → 1085 proteins
- Many components have a single protein
  - Sigma49: 52 (3 multi-protein)
  - Yeast: 994 (43 multi-protein)
- "Unique" peptides force protein inclusion
  - Sigma49: 16 single-peptide proteins
  - Yeast: 476 single-peptide proteins
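The two reductions above can be sketched as follows: merge proteins with identical peptide sets, drop proteins whose peptide set is strictly contained in another's, then split what remains into connected components by shared peptides. The function names and the toy accessions are hypothetical. This is an illustration of the idea, not the talk's implementation.

```python
def collapse_and_prune(protein_peptides):
    """Merge equivalent proteins and drop dominated (contained) ones."""
    # Group proteins with identical peptide sets (equivalent proteins).
    groups = {}
    for protein, peptides in protein_peptides.items():
        groups.setdefault(frozenset(peptides), []).append(protein)
    # Keep a group only if no other peptide set strictly contains its own.
    kept = {}
    for peptides, members in groups.items():
        if not any(peptides < other for other in groups):
            kept[tuple(sorted(members))] = set(peptides)
    return kept


def components(groups):
    """Split protein groups into connected components via shared peptides."""
    remaining = dict(groups)
    result = []
    while remaining:
        name, peptides = remaining.popitem()
        members, seen = [name], set(peptides)
        grew = True
        while grew:  # keep absorbing groups that share a peptide
            grew = False
            for other in list(remaining):
                if remaining[other] & seen:
                    seen |= remaining.pop(other)
                    members.append(other)
                    grew = True
        result.append(sorted(members))
    return sorted(result)


# Hypothetical toy data: A and B are equivalent, C is contained in A.
data = {
    "A": {"p1", "p2", "p3"},
    "B": {"p1", "p2", "p3"},
    "C": {"p2"},
    "D": {"p3", "p4"},
    "E": {"p5"},
}
reduced = collapse_and_prune(data)
parts = components(reduced)
print(parts)
```

In this toy case the five proteins reduce to three groups in two components: one multi-protein component (the A/B group plus D) and one single-protein component (E).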
Must eliminate redundancy
- Contained proteins should not be selected
[Figure label: 37 distinct peptides]
Must eliminate redundancy
- Contained proteins should not be selected, even if they have some probability mass
- The number of sibling peptides matters less if they are shared
[Figure label: Single AA difference]
Must ignore some PSMs
- A single additional peptide should not force a protein into the solution
[Figure label: Single AA difference]
Example from Yeast
- "Inosine monophosphate dehydrogenase" 4-gene family
- Contained proteins should not be selected
- Single-peptide evidence for YML056C
Must ignore some PSMs
- Improving peptide identification sensitivity makes things worse!
- False PSMs don't cluster
[Figure labels: Proteins, PSMs, 10%, 2x]
Must ignore some PSMs
- Improving peptide identification sensitivity makes things worse!
- False PSMs don't cluster
- Select proteins to explain the true PSM %
[Figure labels: PSMs, 90%]
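A small simulation can illustrate why false PSMs that do not cluster inflate the protein list. All numbers below are hypothetical and are not taken from the talk's data. The sketch assumes each false PSM lands on a uniformly random protein in the database, while true PSMs pile up on a small set of real proteins.

```python
import random


def count_proteins(n_true_proteins, psms_per_true_protein,
                   n_false_psms, database_size, seed=0):
    """Return (number of true PSMs, number of extra proteins from false PSMs)."""
    rng = random.Random(seed)
    true_psms = n_true_proteins * psms_per_true_protein
    # Assumption: a false PSM hits a uniformly random database protein.
    false_hits = {rng.randrange(database_size) for _ in range(n_false_psms)}
    # Indices below n_true_proteins stand for the truly present proteins.
    false_proteins = {p for p in false_hits if p >= n_true_proteins}
    return true_psms, len(false_proteins)


# Hypothetical setup: 900 true PSMs on 100 proteins, plus 100 false PSMs
# (10% of all PSMs) scattered over a 6000-protein database.
true_psms, extra = count_proteins(100, 9, 100, 6000)
print(true_psms, extra)
```

Because the false PSMs rarely hit the same protein twice, nearly every one of them adds its own single-peptide protein. Under these assumptions a 10% false PSM rate roughly doubles the number of proteins that an explain-everything rule must select.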
Must ignore some PSMs
- How do we choose?
  - Maximize # peptides?
  - Minimize FDR (naïve model)?
  - Maximize # PSMs?
Generalized Protein Parsimony
- Weight peptides by number of PSMs
- Constrain unique peptides per protein
- Maximize explained peptides (PSMs)
- Match PSM filtering FDR to % uncovered PSMs
- Readily solved by branch-and-bound
- Permits complex protein/peptide constraints
- Reduces to traditional protein parsimony
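One way to read the bullets above is sketched below as a small branch-and-bound. The exact formulation is an assumption, not confirmed by the slides: maximize the PSM-weighted peptides explained, require every selected protein to have at least `min_unique` peptides that no other selected protein explains, and break ties by selecting fewer proteins. The function name, parameter names, and toy data are hypothetical.

```python
def generalized_parsimony(protein_peptides, psm_counts, min_unique=1):
    """Return (selected proteins, explained PSM count) under the assumed model."""
    def weight(peptides):
        return sum(psm_counts[p] for p in peptides)

    def feasible(chosen):
        # Each selected protein needs >= min_unique peptides of its own.
        for p in chosen:
            others = set().union(
                *(protein_peptides[q] for q in chosen if q != p))
            if len(protein_peptides[p] - others) < min_unique:
                return False
        return True

    # Consider heavily supported proteins first.
    proteins = sorted(protein_peptides,
                      key=lambda p: -weight(protein_peptides[p]))
    # suffix[i]: peptides still reachable from proteins[i:], used for bounds.
    suffix = [set() for _ in range(len(proteins) + 1)]
    for i in range(len(proteins) - 1, -1, -1):
        suffix[i] = suffix[i + 1] | protein_peptides[proteins[i]]

    # The empty selection is always feasible: weight 0, size 0.
    best = {"key": (0, 0), "chosen": []}

    def search(i, chosen, covered):
        # Optimistic bound: all remaining peptides get covered, no more proteins.
        bound = (weight(covered | suffix[i]), -len(chosen))
        if bound <= best["key"]:
            return  # cannot beat the best selection found so far
        if i == len(proteins):
            if feasible(chosen):
                best["key"], best["chosen"] = bound, list(chosen)
            return
        p = proteins[i]
        search(i + 1, chosen + [p], covered | protein_peptides[p])  # include
        search(i + 1, chosen, covered)                              # exclude

    search(0, [], set())
    return set(best["chosen"]), best["key"][0]


# Hypothetical toy data: P2 and P3 each rest on one low-count extra peptide.
toy = {"P1": {"a", "b", "c"}, "P2": {"c", "d"}, "P3": {"e"}}
counts = {"a": 10, "b": 8, "c": 6, "d": 1, "e": 1}
print(generalized_parsimony(toy, counts, min_unique=1))
print(generalized_parsimony(toy, counts, min_unique=2))
```

With `min_unique=1` every PSM must be explained and the sketch returns a minimum cover, which is the traditional parsimony answer. With `min_unique=2` the single-peptide additions are left out and 2 of the 26 PSMs stay uncovered.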
Match FDR to uncovered PSMs
[Figure annotation: Traditional parsimony at 1% FDR: 1085 (unique) proteins]
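The matching step can be sketched as a simple scan. For each candidate PSM-level FDR filter, compute the fraction of PSMs left uncovered by the selected proteins, then keep the filter whose uncovered fraction is closest to its own FDR. The run values below are hypothetical placeholders, not results from the talk, and the function names are illustrative.

```python
def uncovered_fraction(total_psms, explained_psms):
    """Fraction of filtered PSMs not explained by the selected proteins."""
    return (total_psms - explained_psms) / total_psms


def best_matching_filter(runs):
    """Pick the run whose uncovered fraction best matches its PSM FDR.

    runs: iterable of (psm_fdr, total_psms, explained_psms) tuples, one per
    candidate PSM filter, each obtained with the same protein criteria.
    """
    return min(runs,
               key=lambda r: abs(uncovered_fraction(r[1], r[2]) - r[0]))


# Hypothetical runs: (PSM FDR threshold, PSMs passing, PSMs explained).
runs = [
    (0.01, 1000, 970),    # 3% uncovered, filter claims 1% false
    (0.05, 1200, 1140),   # 5% uncovered, filter claims 5% false
    (0.10, 1500, 1275),   # 15% uncovered, filter claims 10% false
]
best = best_matching_filter(runs)
print(best)
```

The design choice here is to fix the protein criteria first and treat the PSM filter as the free parameter, so that the PSMs left unexplained are about as many as the filter expects to be false.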
Software
- Filter multi-acquisition identifications by FDR, E-value, or probability
- Rewrite PSMs to reflect parsimony analysis
  - PepXML, CSV, Excel
- Component-wise peptide-protein matrix
  - Selected, Dominant, Equivalent, Contained
- Selected protein accessions
  - …plus equivalents
Conclusions
- Many components are clear
  - It doesn't matter what technique is used
- Traditional techniques do not handle the second protein in a component well
  - A single additional peptide should not force a protein into the solution
- Explain only the true PSM %
  - Determine protein criteria first
  - Adjust the PSM filter until the explained peptides match