Supplemental Fig. 1: Gene-centric uniqueness calculation. [Figure: peptides mapped to protein accessions and gene names, each marked shared or unique; legend: identified by gene-unique peptides; identified by protein- and gene-unique peptides; not identified.] Peptides matching either one particular protein isoform (green circles, protein unique) or multiple protein isoforms with the same gene name (blue circles, gene unique) are classified as unique peptides. All others, namely peptides matching multiple protein isoforms with different gene names (red circles), are classified as shared. Shared peptides were discarded during protein inference, whereas both protein-unique and gene-unique peptides give rise to the identification of gene products.
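The classification rule in the caption above can be sketched in a few lines. This is a minimal illustration only: the accessions, gene names, and the helper names `classify_peptide` and `keep_for_inference` are hypothetical and do not come from the original work.

```python
from typing import Dict, Set


def classify_peptide(protein_accessions: Set[str], gene_of: Dict[str, str]) -> str:
    """Classify a peptide as 'protein unique', 'gene unique', or 'shared'."""
    if not protein_accessions:
        raise ValueError("a peptide must match at least one protein")
    if len(protein_accessions) == 1:
        # Exactly one isoform matches: unique on the protein level.
        return "protein unique"
    gene_names = {gene_of[accession] for accession in protein_accessions}
    if len(gene_names) == 1:
        # Several isoforms, but all belong to the same gene.
        return "gene unique"
    # Isoforms of different genes match: the peptide is shared.
    return "shared"


def keep_for_inference(label: str) -> bool:
    """Shared peptides are discarded; both kinds of unique peptide are kept."""
    return label != "shared"


# Hypothetical accessions and gene names, for illustration only.
gene_of = {"ISO_A1": "GENE_A", "ISO_A2": "GENE_A", "ISO_B1": "GENE_B"}
peptide_matches = {
    "pep1": {"ISO_A1"},
    "pep2": {"ISO_A1", "ISO_A2"},
    "pep3": {"ISO_A2", "ISO_B1"},
}
labels = {pep: classify_peptide(accs, gene_of) for pep, accs in peptide_matches.items()}
usable = sorted(pep for pep, label in labels.items() if keep_for_inference(label))
```

In this toy input, `pep1` and `pep2` would support the identification of a gene product, while `pep3` would be discarded.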
Supplemental Fig. 2: Search engine score normalization and variations in score cutoffs to reach 1% PCM FDR. (A, B) The local peptide-length-dependent score cutoffs at 5% PSM FDR used for the score normalization differ markedly between Mascot (A) and Andromeda (B). While the cutoffs determined for Mascot decrease at first and converge at ~17, the cutoffs used for Andromeda decrease continuously. (C, D) To illustrate how strongly data quality depends on technical and biological differences, we plotted the score histograms of length-normalized Mascot ion scores for (C) a dimethyl-labeled tryptic digest of human embryonic stem cells measured by low-resolution CID and (D) an unlabeled tryptic digest of the melanoma cell line A375 measured by HCD. To reach 1% PCM FDR, the labeled dataset had to be cut at 0.63, whereas the unlabeled dataset at [...] (E, F) Differences in data quality require different length-normalized score cutoffs to reach 1% PCM FDR. While the range of length-normalized score cutoffs is similar for Mascot (E) and Andromeda (F), the shape of the distribution varies.
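To make the notion of a dataset-specific score cutoff concrete, the following sketch finds the lowest score at which a target-decoy FDR estimate still stays within a chosen limit. It assumes the common estimate FDR = decoys / targets above the cutoff; the function name `score_cutoff_at_fdr` and the toy scores are hypothetical, and the length normalization itself is not reproduced because the caption does not give its formula.

```python
from itertools import groupby
from typing import List, Optional, Tuple


def score_cutoff_at_fdr(
    matches: List[Tuple[float, bool]], max_fdr: float = 0.01
) -> Optional[float]:
    """Return the lowest score cutoff with decoys / targets <= max_fdr.

    matches holds (score, is_decoy) pairs. Returns None if no cutoff qualifies.
    """
    ordered = sorted(matches, key=lambda match: match[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    # Walk from the best score downwards; tied scores are accepted together.
    for score, tied in groupby(ordered, key=lambda match: match[0]):
        for _, is_decoy in tied:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


# Toy data, for illustration only: (normalized score, is_decoy).
toy_matches = [(0.9, False), (0.8, False), (0.7, True), (0.6, False), (0.5, False)]
strict_cutoff = score_cutoff_at_fdr(toy_matches, max_fdr=0.2)
loose_cutoff = score_cutoff_at_fdr(toy_matches, max_fdr=0.25)
```

Two datasets with different score distributions would generally yield different cutoffs under the same FDR limit, which is the effect the histograms in panels C and D illustrate.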
Supplemental Fig. 3: Target and decoy PCM saturation. (A) In contrast to the saturation of proteins when accumulating multiple experiments, the number of unique target PCMs (blue) shows only a slight saturation effect. Furthermore, the number of unique decoy PCMs (red) increases linearly with the amount of data. (B) This is mirrored by the global PCM FDR. The sharp increase in the PCM FDR at ~250 experiments is due to an experiment comprising multiple LC-MS/MS raw files acquired while optimizing an acquisition method, which therefore contains highly redundant target PCMs but many random decoy PCMs.
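The accumulation described in the caption above can be sketched as a running union over experiments. The sketch assumes that the global PCM FDR is estimated as unique decoy PCMs divided by unique target PCMs; the function name `accumulate_pcms` and the toy identifiers are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def accumulate_pcms(
    experiments: Iterable[Tuple[Set[str], Set[str]]]
) -> List[Tuple[int, int, float]]:
    """Return (unique targets, unique decoys, global FDR) after each experiment."""
    seen_targets: Set[str] = set()
    seen_decoys: Set[str] = set()
    curve = []
    for target_pcms, decoy_pcms in experiments:
        # True targets tend to recur across experiments, random decoys do not.
        seen_targets |= target_pcms
        seen_decoys |= decoy_pcms
        fdr = len(seen_decoys) / len(seen_targets) if seen_targets else 0.0
        curve.append((len(seen_targets), len(seen_decoys), fdr))
    return curve


# Toy data, for illustration only: redundant targets, a new decoy each time.
toy_experiments = [
    ({"a", "b", "c", "d"}, {"x"}),
    ({"a", "b", "c", "e"}, {"y"}),
    ({"a", "b", "d", "e"}, {"z"}),
]
curve = accumulate_pcms(toy_experiments)
```

In this toy input the target count levels off while the decoy count grows with every experiment, so the global FDR estimate rises as data are added.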
Supplemental Fig. 4: R factor correction. [Figure annotation: r = # target / # decoy] (A) Using the number of decoy proteins from the classic TDS massively overestimates the number of false positive protein identifications (decoy proteins, red; target proteins, blue). (B) The R factor is calculated as the ratio between the number of target and decoy hits with a score below 3.6. At this score, the local ratio of forward and decoy hits is 1/5. (C) After applying the R factor correction, the decoy (red) protein distribution agrees better with the target (blue) protein distribution, which yields a more reasonable protein FDR estimate using the adjusted number of decoy proteins. The distribution of true protein hits (green dashed line), calculated as the difference between the distributions of target and decoy hits, is more sensible than for the standard decoy approach, although negative values are observed for low-scoring proteins.
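A minimal sketch of the R factor as defined in panel B follows. The ratio of target to decoy hits below a score threshold is taken from the caption; how the adjusted number of decoy proteins enters the FDR is an assumption here (decoy count multiplied by r), and the function names and toy scores are hypothetical.

```python
from typing import Sequence


def r_factor(
    target_scores: Sequence[float],
    decoy_scores: Sequence[float],
    score_threshold: float = 3.6,
) -> float:
    """Ratio of target to decoy hits scoring below score_threshold."""
    n_target = sum(1 for score in target_scores if score < score_threshold)
    n_decoy = sum(1 for score in decoy_scores if score < score_threshold)
    if n_decoy == 0:
        raise ValueError("no decoy hits below the threshold")
    return n_target / n_decoy


def corrected_protein_fdr(
    target_scores: Sequence[float],
    decoy_scores: Sequence[float],
    cutoff: float,
    r: float,
) -> float:
    """Assumed correction: FDR = r * decoys / targets at or above cutoff."""
    n_target = sum(1 for score in target_scores if score >= cutoff)
    n_decoy = sum(1 for score in decoy_scores if score >= cutoff)
    if n_target == 0:
        raise ValueError("no target hits at or above the cutoff")
    return r * n_decoy / n_target


# Toy protein scores, for illustration only.
toy_targets = [1.0, 2.0, 5.0, 6.0, 7.0, 8.0]
toy_decoys = [0.5, 1.0, 1.1, 1.5, 2.2, 2.5, 3.0, 3.2, 3.5, 3.55, 4.0, 5.5]
r = r_factor(toy_targets, toy_decoys)
classic_fdr = corrected_protein_fdr(toy_targets, toy_decoys, cutoff=4.0, r=1.0)
adjusted_fdr = corrected_protein_fdr(toy_targets, toy_decoys, cutoff=4.0, r=r)
```

Setting `r=1.0` reproduces the uncorrected decoy count, so the two calls show how much the assumed correction lowers the estimate for the same cutoff.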
Supplemental Fig. 5: Protein FDR estimation using the classic and picked TDS with the sum of best Q-scores as protein score. (A) When the sum of the best Q-scores of all PCMs matching a protein is used as protein score, the number of decoy proteins (red) of the classic TDS massively overestimates the number of false positive protein identifications. Furthermore, the target distribution (blue) shows no bimodal shape and is not well separated from the decoy distribution. (B) After applying the picked approach, the decoy (red) protein distribution superimposes on the target (blue) protein distribution, which allows a more accurate protein FDR estimation. (C) Comparing the performance of the picked (solid) and classic (dashed) approach when filtering the PCMs at various FDRs shows a similar trend as in Figure 3A. With increasing PCM q-value cutoffs, the number of true positive protein identifications (number of target proteins minus number of decoy proteins) increases and is comparable between the picked and classic approach. At roughly [...] PCM q-value cutoff, the number of true positive proteins starts to decrease and quickly drops to 0 for the classic approach, whereas the number of true positive protein IDs increases further and converges to a rather stable plateau in the picked approach. The slight decrease at the end is likely due to the accumulation of false positive PCMs, which further deteriorates the separation of decoy and target proteins. (D) The estimated protein FDR of the picked (solid) and classic (dashed) approach mirrors the trend seen in panel C. While the estimated protein FDR increases constantly with increasing PCM q-value cutoff and eventually reaches 100% in the classic approach, the estimate of the picked approach starts to rise much later and plateaus at roughly 10%.
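The caption above compares the classic and picked TDS without restating how the picked approach works. The sketch below rests on an assumption: a target protein and its decoy counterpart are treated as a pair, and only the higher-scoring member of each pair is kept (ties go to the target). The function names and toy scores are hypothetical; the true positive estimate (targets minus decoys) follows panel C.

```python
from typing import Dict, Tuple


def classic_counts(
    target_scores: Dict[str, float], decoy_scores: Dict[str, float], cutoff: float
) -> Tuple[int, int]:
    """Count target and decoy proteins at or above cutoff independently."""
    n_target = sum(1 for score in target_scores.values() if score >= cutoff)
    n_decoy = sum(1 for score in decoy_scores.values() if score >= cutoff)
    return n_target, n_decoy


def picked_counts(
    target_scores: Dict[str, float], decoy_scores: Dict[str, float], cutoff: float
) -> Tuple[int, int]:
    """Assumed picked rule: per target/decoy pair, keep only the better score."""
    n_target = n_decoy = 0
    for protein in set(target_scores) | set(decoy_scores):
        target = target_scores.get(protein)
        decoy = decoy_scores.get(protein)
        # The target wins if it has no decoy counterpart or scores at least as high.
        target_wins = decoy is None or (target is not None and target >= decoy)
        score = target if target_wins else decoy
        if score >= cutoff:
            if target_wins:
                n_target += 1
            else:
                n_decoy += 1
    return n_target, n_decoy


# Toy scores keyed by the target protein name, for illustration only.
toy_targets = {"A": 10.0, "B": 8.0, "C": 1.0}
toy_decoys = {"A": 1.5, "B": 2.0, "D": 2.5}
classic = classic_counts(toy_targets, toy_decoys, cutoff=1.0)
picked = picked_counts(toy_targets, toy_decoys, cutoff=1.0)
true_positives_picked = picked[0] - picked[1]
```

Under this assumed rule, decoys that pair with a confidently identified target no longer count against it, which is one way the decoy count could fall as evidence accumulates.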
Supplemental Fig. 6: Enlarged illustrations of the comparison of the classic and picked TDS from Figure 3. (A) Even when aggregating small numbers of experiments, the picked (solid) TDS outperforms the classic (dashed) TDS. While the number of target proteins (blue) is comparable (marginally higher for the classic approach), the difference between the numbers of decoy proteins (red) reported by the classic and picked approach starts to increase. (B) The overestimation of false positive proteins by the classic approach is particularly apparent when comparing the number of target (dashed blue) and decoy (dashed red) proteins at the end of the aggregation process. The number of decoy proteins increases more rapidly than the number of target proteins and approaches the same limit. (C) The picked approach shows the completely opposite effect. The number of decoy proteins reported by the picked approach (solid red) decreases because of new evidence (especially at ~1540 experiments) introduced by additional experiments. (D) The trend explained in panels B and C is mirrored by the estimated protein FDR in the picked (solid) and classic (dashed) TDS. While the protein FDR increases and approaches 100% in the classic approach, the picked approach shows a decrease, potentially reaching close to 0% when more data are added.
Supplemental Fig. 7: R factor FDR. (A) R factor correction produces more reasonable protein FDR curves than the standard decoy strategy. The agreement between the picked and R factor approach is not perfect, but it is better than the agreement between either of these approaches and the standard approach. (B) Number of true protein hits as a function of FDR for the standard, picked and R factor approach. Both the R factor and picked approaches perform better than the standard strategy, with the picked TDS consistently yielding higher coverage.
Supplemental Fig. 8: Comparison of best and sum Q-score protein scoring of the classic and picked TDS. (A) When the best Q-score is used to score proteins, the number of proteins identified at 1% protein FDR increases in both the picked (solid) and classic (dashed) approach, but the picked approach consistently reports higher numbers of proteins. (B) When the sum of the best Q-scores of all PCMs matching a protein is used, the number of proteins identified at 1% protein FDR first increases in both the picked (solid) and classic (dashed) approach, but then starts to decrease and breaks down at high PCM q-value cutoffs. The picked approach shows this behavior with a delay but also overestimates the number of false positive protein IDs using the decoy proteins. Especially at high PCM q-value cutoffs, the decoy and target protein distributions start to blend into each other (data not shown) and show almost no separation anymore.
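The two protein scoring schemes compared in the caption above can be sketched as follows. The sketch assumes that a higher Q-score is better and that each PCM first contributes its own best Q-score; the Q-score values are treated as given numbers because the caption does not define how they are computed. The function name `protein_scores` and the toy input are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def protein_scores(
    matches: Iterable[Tuple[str, str, float]]
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (best Q-score, sum of best Q-scores per PCM) for each protein.

    matches holds (protein, pcm, q_score) triples; a PCM may occur repeatedly.
    """
    best_per_pcm: Dict[str, Dict[str, float]] = defaultdict(dict)
    for protein, pcm, q_score in matches:
        current = best_per_pcm[protein].get(pcm)
        # Keep only the best Q-score observed for each PCM of a protein.
        if current is None or q_score > current:
            best_per_pcm[protein][pcm] = q_score
    best = {protein: max(pcms.values()) for protein, pcms in best_per_pcm.items()}
    summed = {protein: sum(pcms.values()) for protein, pcms in best_per_pcm.items()}
    return best, summed


# Toy matches, for illustration only: (protein, PCM, Q-score).
toy_matches = [
    ("P1", "pepA/2", 3.0),
    ("P1", "pepA/2", 5.0),
    ("P1", "pepB/2", 2.0),
    ("P2", "pepC/3", 4.0),
]
best_score, sum_score = protein_scores(toy_matches)
```

Because the summed score grows with every additional PCM, many low-scoring random matches can lift a decoy protein under the sum scheme, whereas the best-score scheme depends only on a protein's single strongest PCM.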