Published by Erik Armstrong. Modified over 8 years ago.
1
2015/06/03 Park, Hyewon
2
Introduction: Protein assembly transforms a list of identified peptides into a list of identified proteins. It is one step in the workflow: duplicate spectrum recognition, peptide charge state discernment, peptide identification, protein assembly, identification error rate assessment, sample comparison.
3
Introduction: As one peptide sequence can be mapped to multiple proteins in a database, naïve protein assembly can substantially overstate the number of proteins found in samples.
4
Introduction: Tools. DTASelect: groups together proteins with identical sets of identified peptides and uses a similarity score to describe the relationship between proteins with overlapping peptide identifications. DBParser: classifies and reports proteins in six hierarchical categories, using parsimony analysis. Statistical approach by Nesvizhskii et al.: computes probabilities that proteins are present in a sample on the basis of estimated peptide identification probabilities.
5
Introduction: Parsimony algorithm. Substantially reduces the number of proteins reported. Issues: no evaluation of the quality of the generated lists of protein identifications; correct protein identifications could be erroneously filtered out; algorithmically complex, because it has to describe the complex many-to-many relationships between identified peptides and the proteins that potentially explain their appearance.
6
Introduction: IDPicker. Estimates false discovery rates (FDR) from a reversed-sequence database search to control the quality of the peptide identifications. Uses efficient graph algorithms on the peptide-protein relationships to identify protein clusters with shared peptides and to derive the minimal list of proteins.
7
IDPicker: Designed to assemble confident, parsimonious protein identifications from raw spectral identifications. Three modules: (1) reads the unfiltered peptide identifications from an SQT file and applies an initial pass of filtering (typically an FDR of 25%); (2) groups these identifications into appropriate sets and filters peptides to the final FDR (typically 5%); (3) applies parsimony analysis to the discovered proteins and produces reports.
8
Deriving Error Estimates for Peptide Identification: Identification score thresholds are chosen to correspond to a user-specified FDR. FDR = (2R) / (F + R), where F is the number of peptide identifications derived from the forward-sequence database and R is the number of peptide identifications derived from the reverse-sequence database.
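A minimal sketch of how the formula above could be turned into a score threshold. The function names, the (score, is_reverse) input layout, and the assumption that a higher score is better are illustrative choices, not taken from IDPicker itself; tied scores are not treated specially.

```python
def estimate_fdr(forward, reverse):
    """FDR = 2R / (F + R), with F forward and R reverse identifications."""
    total = forward + reverse
    if total == 0:
        return 0.0
    return 2 * reverse / total


def score_threshold(ids, max_fdr):
    """Return the lowest score cutoff whose accepted set stays within max_fdr.

    ids: iterable of (score, is_reverse) pairs; higher score = better (assumption).
    Returns None when no cutoff meets the target.
    """
    forward = reverse = 0
    best = None
    # Walk from the best score downward, tracking the running FDR.
    for score, is_reverse in sorted(ids, key=lambda item: item[0], reverse=True):
        if is_reverse:
            reverse += 1
        else:
            forward += 1
        if estimate_fdr(forward, reverse) <= max_fdr:
            best = score
    return best


# Hypothetical identifications: (score, matched the reversed database?)
example = [(9.0, False), (8.0, False), (7.0, False),
           (6.0, True), (5.0, False), (4.0, True)]
print(score_threshold(example, 0.25))  # 7.0
```

Identifications scoring below the returned cutoff would then be dropped, as the next slide describes.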
9
Deriving Error Estimates for Peptide Identification: All peptides that pass this initial filter are given equal standing. Identifications below the threshold are removed entirely from consideration.
10
Bipartite Graph Analysis of the Peptide Identification Data: Collect all proteins that could account for the identified peptides. The protein-peptide mapping is very complex, but it can be modeled by a bipartite graph.
11
Bipartite Graph Analysis of the Peptide Identification Data: A bipartite graph is an undirected graph whose vertices can be partitioned into two sets such that no edge connects vertices in the same set.
12
Bipartite Graph Analysis of the Peptide Identification Data: The bipartite graph algorithm has four steps: Initialize, Collapse, Separate, Reduce.
13
Bipartite Graph Analysis of the Peptide Identification Data: Initialize. Represent the peptide identification data as a bipartite graph with two sets of vertices: proteins and peptides.
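The Initialize step might be sketched as two adjacency maps, one per vertex set. The input layout (a list of (peptide, protein) pairs) and all names here are assumptions for illustration only.

```python
from collections import defaultdict


def build_bipartite_graph(identifications):
    """Build both adjacency maps from (peptide, protein) pairs.

    An edge joins a peptide to every protein that could account for it;
    no edge ever joins two peptides or two proteins.
    """
    protein_to_peptides = defaultdict(set)
    peptide_to_proteins = defaultdict(set)
    for peptide, protein in identifications:
        protein_to_peptides[protein].add(peptide)
        peptide_to_proteins[peptide].add(protein)
    return dict(protein_to_peptides), dict(peptide_to_proteins)


# Hypothetical data: pepA maps to two proteins.
pairs = [("pepA", "P1"), ("pepA", "P2"), ("pepB", "P2"), ("pepC", "P3")]
proteins, peptides = build_bipartite_graph(pairs)
print(sorted(peptides["pepA"]))  # ['P1', 'P2']
```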
14
Bipartite Graph Analysis of the Peptide Identification Data: Collapse. Some protein vertices are connected to exactly the same set of peptide vertices.
16
Bipartite Graph Analysis of the Peptide Identification Data: Collapse. Define meta-protein and meta-peptide. Meta-protein: a group of proteins that are indiscernible on the basis of the available evidence, that is, connected to exactly the same set of peptides. Meta-peptide: a group of peptides that are indiscernible on the basis of the available evidence.
17
Bipartite Graph Analysis of the Peptide Identification Data: Collapse. After this step, the bipartite graph has two sets of vertices: meta-protein vertices and meta-peptide vertices.
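A rough sketch of the Collapse step, assuming the graph is held as a protein-to-peptide-set mapping. Treating a meta-peptide as a group of peptides connected to exactly the same meta-proteins is an assumption made here by symmetry with the meta-protein definition; the names are hypothetical.

```python
from collections import defaultdict


def collapse(protein_to_peptides):
    """Merge indiscernible vertices into meta-proteins and meta-peptides."""
    # Proteins connected to exactly the same peptide set form one meta-protein.
    by_peptides = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        by_peptides[frozenset(peptides)].append(protein)
    meta_proteins = {tuple(sorted(members)): set(peptides)
                     for peptides, members in by_peptides.items()}

    # Peptides connected to exactly the same meta-proteins form one meta-peptide.
    owners = defaultdict(set)
    for meta_protein, peptides in meta_proteins.items():
        for peptide in peptides:
            owners[peptide].add(meta_protein)
    by_owners = defaultdict(list)
    for peptide, metas in owners.items():
        by_owners[frozenset(metas)].append(peptide)
    meta_peptides = {tuple(sorted(members)): set(metas)
                     for metas, members in by_owners.items()}
    return meta_proteins, meta_peptides


# Hypothetical graph: P1 and P2 cannot be told apart by their peptides.
graph = {"P1": {"a", "b", "d"}, "P2": {"a", "b", "d"}, "P3": {"b", "c"}}
meta_proteins, meta_peptides = collapse(graph)
print(sorted(meta_proteins))  # [('P1', 'P2'), ('P3',)]
print(sorted(meta_peptides))  # [('a', 'd'), ('b',), ('c',)]
```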
18
Bipartite Graph Analysis of the Peptide Identification Data: Separate. Two proteins are independent with regard to protein assembly if they share no peptides, either directly or indirectly through other proteins.
19
Bipartite Graph Analysis of the Peptide Identification Data: Separate. Decompose the complex bipartite graph into independent subgraphs of proteins with shared peptides. This is achieved through depth-first search. Each connected component represents a meta-protein cluster.
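The Separate step could look roughly like the following iterative depth-first search. It is a sketch under the assumption that meta-proteins are given as a mapping to their peptide sets; the function and variable names are illustrative.

```python
def protein_clusters(protein_to_peptides):
    """Split proteins into connected components linked by shared peptides."""
    # Invert the mapping so each peptide lists its neighboring proteins.
    peptide_to_proteins = {}
    for protein, peptides in protein_to_peptides.items():
        for peptide in peptides:
            peptide_to_proteins.setdefault(peptide, set()).add(protein)

    seen = set()
    clusters = []
    for start in protein_to_peptides:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]  # explicit stack: depth-first search without recursion
        cluster = set()
        while stack:
            protein = stack.pop()
            cluster.add(protein)
            for peptide in protein_to_peptides[protein]:
                for neighbor in peptide_to_proteins[peptide]:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        stack.append(neighbor)
        clusters.append(cluster)
    return clusters


# Hypothetical graph: A and C share no peptide directly but are linked through B.
graph = {"A": {"p1", "p2"}, "B": {"p2", "p3"}, "C": {"p3"}, "D": {"p4"}}
print([sorted(c) for c in protein_clusters(graph)])  # [['A', 'B', 'C'], ['D']]
```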
20
Bipartite Graph Analysis of the Peptide Identification Data: Reduce. Generate a minimal list of meta-proteins for each meta-protein cluster, using a greedy set cover algorithm.
21
Bipartite Graph Analysis of the Peptide Identification Data: Set cover problem and the greedy set cover algorithm. The set cover problem is NP-complete, so the greedy algorithm is used as a heuristic approach.
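A minimal sketch of the textbook greedy set cover heuristic applied to one cluster: repeatedly pick the protein that explains the most still-unexplained peptides. The alphabetical tie-breaking and the names are assumptions for illustration, not details confirmed by the slides.

```python
def greedy_set_cover(protein_to_peptides):
    """Pick proteins greedily until every peptide is explained."""
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Sorting first makes ties resolve alphabetically (an arbitrary choice).
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & uncovered))
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


# Hypothetical cluster: B adds nothing once A and C are selected.
cluster = {"A": {"p1", "p2", "p3"}, "B": {"p2", "p3"}, "C": {"p3", "p4"}}
print(greedy_set_cover(cluster))  # ['A', 'C']
```

The greedy result is not guaranteed to be the smallest possible cover, which is the price of avoiding an exhaustive search.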
23
Bipartite Graph Analysis of the Peptide Identification Data: Reduce.
26
Bipartite Graph Analysis of the Peptide Identification Data: Reduce. A parsimonious protein list is generated.
27
Results & Discussion: Data sets. Sigma49: a human protein mixture. Yeast-Extract: a yeast whole cell extract. Serum-MARS: a human serum proteome. Databases. Swiss-Prot (SP): for human. IPI Human (IPI): for human. Saccharomyces Genome Database (SGD): for yeast. Compact species-specific subsets of Swiss-Prot: SPH for human, SPY for yeast.
28
Results & Discussion: Protein List Reduction. White bar: every protein counted separately, whether or not it can be distinguished from others on the basis of observed peptides. Gray bar: the result of grouping indiscernible proteins into meta-proteins. Black bar: meta-proteins that remain after parsimony analysis.
29
Results & Discussion: With Swiss-Prot, the protein counts were reduced the most. By grouping indiscernible proteins: Sigma49 39% reduced, Serum-MARS 24% reduced, Yeast-Extract 14% reduced. By parsimony analysis, in addition: Sigma49 51% reduced, Serum-MARS 44% reduced, Yeast-Extract 3% reduced.
30
Results & Discussion: Grouping indiscernible proteins and parsimony analysis can improve protein reporting.
31
Results & Discussion: Removing the redundancy in the protein list. In the Sigma49 runs, the average initial numbers of proteins were (SP, IPI, SPH) = (414, 161, 59). After the two reductions, the counts of meta-proteins were (SP, IPI, SPH) = (51, 49, 48). The known proteins in the original Sigma49 sample are 37, 32, and 40.
32
Results & Discussion: These list reduction strategies cause the resulting protein lists to converge to numbers far closer to the true number of proteins in the sample.
33
Results & Discussion: Improved Accuracy of Protein Identification. Reducing the size of protein lists is useful only if incorrect protein identifiers are the ones being removed.
34
Results & Discussion: Terms. True positive (TP): a meta-protein is counted as a TP if it includes one of the 49 proteins listed as part of the sample. False positive (FP): any meta-protein that is not a true positive. Precision: n_TP / (n_TP + n_FP). Recall: n_TP / n_P, where n_P is the number of all proteins in the sample, that is, 49 in this analysis. F1-measure: 2pr / (p + r), where p is precision and r is recall.
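The three measures can be computed as in this short sketch. The counts in the example (40 true positives, 10 false positives) are made up for illustration and are not results from the presentation; only n_P = 49 comes from the slide.

```python
def precision_recall_f1(n_tp, n_fp, n_p):
    """Precision, recall and F1 as defined on the slide.

    n_tp, n_fp: meta-proteins counted as true / false positives.
    n_p: number of proteins actually in the sample (49 for Sigma49).
    """
    reported = n_tp + n_fp
    precision = n_tp / reported if reported else 0.0
    recall = n_tp / n_p
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f1(n_tp=40, n_fp=10, n_p=49)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.816 0.808
```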
35
Results & Discussion: Terms. PEP1: retains all meta-proteins. PEP2: retains only meta-proteins matching at least two different peptide sequences. PEP1-PARS: applies parsimony analysis to PEP1. PEP2-PARS: applies parsimony analysis to PEP2.
36
Results & Discussion: Sigma49 data set. The bipartite graph approach is highly effective at removing false protein identifications while retaining true identifications.
37
Results & Discussion: SPH search. Parsimony analysis was less effective here. It is most useful in removing redundant homologous proteins, so it will be most powerful in processing data sets generated by searching multispecies databases such as Swiss-Prot and IPI.
38
Results & Discussion: Yeast-Extract data set. The F1-measure cannot be computed because the protein content is not defined. True positive (TP): a meta-protein is counted as a TP if it includes at least one protein with the “_YEAST” identifier. False positive (FP): any meta-protein that is not a true positive.
39
Results & Discussion: Yeast-Extract data set. PEP2 filtering lost a considerable number of true identifications. However, the removed yeast proteins could actually be false identifications, since this definition obviously overestimates the number of true positives.
40
Results & Discussion: IPI search on the Serum-MARS data set. The data set comprised a total of 194523 tandem mass spectra, and 350648 identifications resulted from the database search. IDPicker filtered these identifications down to 37246 to achieve a 5% FDR for identifications. The software found that 2605 different peptide sequences were represented. These 2605 peptides could be explained by as many as 472 proteins (including reversed sequences). These could be reduced to 339 distinguishable meta-proteins and subsequently to 189 meta-proteins after parsimony analysis.
41
Results & Discussion: IPI search on the Serum-MARS data set. IDPicker produces a tabular list of the proteins.
42
Results & Discussion: IPI search on the Serum-MARS data set. Association tables reveal which meta-proteins map to which meta-peptides.
43
Results & Discussion: IPI search on the Serum-MARS data set. A graphic illustrates the relationship among the five proteins and seven meta-peptides.
44
Results & Discussion: IPI search on the Serum-MARS data set. Association tables reveal which meta-proteins map to which meta-peptides (after parsimony analysis).
46
Results & Discussion: Grouping functionally related proteins by clustering proteins through their shared peptides. For the top five clusters, the number of proteins in each cluster is reported both with and without parsimony applied, in the form X -> Y, where X is the number of proteins before parsimony analysis and Y is the number of proteins after parsimony analysis.
47
Conclusion: The bipartite graph is a useful model for representing peptide identification data in LC-MS/MS proteomics. It provides efficiency, accuracy, and transparency in deriving a minimal protein list from peptide identifications. The bipartite graph analysis was highly efficient in removing false protein identifications while retaining true identifications. It groups functionally related proteins together by clustering proteins with shared sequences and thus helps users to examine results more efficiently.
49
IDPicker: In version 2.0. Multiple score combination. New partitioning strategy.
50
IDPicker: In version 2.0. Multiple score combination: improves peptide identification by combining multiple scores reported by database search engines. Users can specify which scoring metrics from their search results are to be included. S = w_1 s_1 + … + w_n s_n. The weights are either user defined (static) or determined automatically using a Monte Carlo simulation method (dynamic).
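A small sketch of the static case of the weighted sum above. The score names (score_a, score_b) and the dictionary layout are placeholders rather than real search-engine metrics, and the dynamic Monte Carlo weighting is not attempted here.

```python
def combined_score(scores, weights):
    """S = w_1 s_1 + ... + w_n s_n over the user-selected scoring metrics."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must name the same metrics")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical metrics and user-defined (static) weights.
weights = {"score_a": 1.0, "score_b": 0.5}
scores = {"score_a": 2.0, "score_b": 4.0}
print(combined_score(scores, weights))  # 4.0
```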
51
Results & Discussion: Combining multiple scores from a search engine.
52
IDPicker: In version 2.0. New partitioning strategy: NTT & Z state partitioning. Peptides with different NTT (number of tryptic termini) or peptide charge values are likely to produce scores in different ranges. Peptides are split into 9 separate classes based on NTT (0, 1, 2) and Z state (1+, 2+, 3+), with a distinct score threshold for each class.
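One way the per-class thresholds might be sketched: group identifications by (NTT, Z) and derive each class's cutoff independently with FDR = 2R / (F + R). The tuple layout, the function names, and the assumption that higher scores are better are all illustrative.

```python
from collections import defaultdict


def threshold_for_class(ids, max_fdr):
    """Lowest score cutoff keeping FDR = 2R / (F + R) within max_fdr, or None."""
    forward = reverse = 0
    best = None
    for score, is_reverse in sorted(ids, key=lambda item: item[0], reverse=True):
        if is_reverse:
            reverse += 1
        else:
            forward += 1
        if 2 * reverse / (forward + reverse) <= max_fdr:
            best = score
    return best


def partitioned_thresholds(ids, max_fdr):
    """ids: iterable of (score, is_reverse, ntt, charge) tuples."""
    classes = defaultdict(list)
    for score, is_reverse, ntt, charge in ids:
        classes[(ntt, charge)].append((score, is_reverse))
    # Each (NTT, Z) class gets its own cutoff.
    return {key: threshold_for_class(members, max_fdr)
            for key, members in classes.items()}


# Hypothetical identifications in two of the nine classes.
ids = [(5.0, False, 2, 2), (4.0, False, 2, 2), (3.0, True, 2, 2),
       (2.5, False, 1, 3), (2.0, False, 1, 3), (1.0, True, 1, 3)]
print(partitioned_thresholds(ids, 0.25))  # {(2, 2): 4.0, (1, 3): 2.0}
```

In this toy data a single global cutoff of 4.0 would discard every identification in the (1, 3) class, which is the motivation for class-specific thresholds.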
53
Results & Discussion: The effect of peptide partitioning was determined for three different search strategies: fully tryptic, semitryptic, and unconstrained. The following four peptide partition styles were tested for each database search strategy: (A) no partitioning, (B) Z state (1+, 2+, or 3+) only, (C) NTT (0, 1, or 2) only, (D) both Z state and NTT.
54
Results & Discussion: NTT & Z state partitioning in version 2.0. Data: tandem mass spectra from a whole cell lysate data set (“DLD1 LTQ”) and a human serum data set (“Serum Orbi”).
55
IDPicker: A pipeline of tools, designed to assemble confident, parsimonious protein identifications from raw spectral identifications. Three modules: (1) reads the unfiltered peptide identifications from an SQT file and applies an initial pass of filtering (typically an FDR of 25%); (2) groups these identifications into appropriate sets and filters peptides to the final FDR (typically 5%); (3) applies parsimony analysis to the discovered proteins and produces reports.
56
IDPicker: In version 2.0. A novel filter removes spurious protein identifications from multispecies database searches. It adds a new protein to the minimal list of protein identifications only if the protein contributes a specified number of distinct peptide identifications that are not already explained by other proteins.
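A hedged sketch of how such a filter could work. Combining it with greedy selection, the parameter name min_additional, and the alphabetical tie-breaking are assumptions for illustration; the slides do not say how IDPicker 2.0 implements the rule.

```python
def parsimony_with_min_additional(protein_to_peptides, min_additional=2):
    """Accept a protein only if it explains enough not-yet-explained peptides."""
    explained = set()
    accepted = []
    candidates = dict(protein_to_peptides)
    while candidates:
        # Candidate contributing the most peptides not explained so far.
        best = max(sorted(candidates),
                   key=lambda p: len(candidates[p] - explained))
        gain = len(candidates[best] - explained)
        if gain < min_additional:
            break  # no remaining protein contributes enough new evidence
        accepted.append(best)
        explained |= candidates.pop(best)
    return accepted


# Hypothetical data: B would add only one new peptide (p4) after A.
data = {"A": {"p1", "p2", "p3"}, "B": {"p3", "p4"}, "C": {"p5", "p6"}}
print(parsimony_with_min_additional(data, min_additional=2))  # ['A', 'C']
print(parsimony_with_min_additional(data, min_additional=1))  # ['A', 'C', 'B']
```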
58
Results & Discussion: In version 2.0. A novel filter to remove spurious protein identifications from multispecies database searches. Data sets: DLD1 LTQ, Serum Orbi. Database: Swiss-Prot. Search engine: MyriMatch.