Peptide & Protein Identification by MS/MS


1 Peptide & Protein Identification by MS/MS
This presentation introduces the basic concepts of MS/MS and how MS/MS spectra can be used for peptide and protein identification. Mass spectrometry has been used in biology since the late 1950s. However, it only came into widespread use in the late 1980s, once methods were developed that allowed the analysis of large, intact molecules (bigger than 1,000 Daltons). Two soft ionization techniques, electrospray ionization and matrix-assisted laser desorption/ionization (MALDI), led to a huge jump in popularity, as did the development of much more compact mass spectrometers (bench-top rather than filling a whole laboratory).

2 The Two Proteomics Philosophies
'Traditional' Proteomics
Protein separation, digestion and MS
Usually two-dimensional electrophoresis, spot picking and digestion
Allows pre-enrichment (organelles) and depletion (albumin removal)
Advantages: maintains isoforms, highly parallel, cheap
Disadvantages: lab-to-lab reproducibility, depth of coverage, non-automatable

'Shotgun' Proteomics
Digest down to peptides, multidimensional separation and MS
Usually ion-exchange separation followed by reverse-phase HPLC-MS
Allows modification-specific targeting (isolation of phosphopeptides etc.)
Advantages: depth of coverage, automatable
Disadvantages: uncertainty in identification, slow, expensive

Genomics began with the goal of sequencing entire genomes. To accomplish this task, two different sequencing approaches were developed. They can be thought of in the following way. Imagine that you have the complete works of an author, written in a language that you studied in school but never became fluent in. Moreover, the books are in such bad shape that they disintegrate if you open them. You have two alternatives. You can remove one page at a time, preserve it and decipher it. Or you can open all the books at once, pick up the fragments of paper and use the words on them to figure out how they fit together.

The page-by-page approach to sequencing the human genome was used by the public genome-sequencing consortium. This group first figured out how all the pages fit together and then deciphered all the words on each page. Finally, it assembled the pages back together to produce the whole genome. The advantage of this approach is that it is very precise. The disadvantage is that it takes a long time.

The biotechnology company Celera used the other method, called "whole genome shotgun sequencing," in its competing effort to sequence the human genome. This method is equivalent to figuring out what is written on all the fragments of paper from all of the volumes and then figuring out how they piece together. Doing this effectively requires starting with several copies of each volume so that overlaps among the fragments can be found. The number of original copies is referred to as "coverage." Producing a high-quality sequence by this method usually requires eight- to tenfold coverage. The disadvantage of this method is that you rarely get the whole sequence to line up. The advantage is that the portion of the sequence that does line up is acquired much more rapidly than via the page-by-page method.

3 Shotgun Proteomics: Peptide Separation
The concept of shotgun proteomics is shown above. Instead of separating the proteins, the entire cell extract is digested with proteases and the resulting complex peptide mixture is separated. The peptides are eluted from the final separation step, usually reversed-phase chromatography, directly into the mass spectrometer, where they are automatically subjected to MS/MS analysis. The peptides are identified in a similar way to how proteins are identified. At any moment, perhaps 10 peptides are entering the mass spectrometer. The instrument automatically picks the most intense one, isolates it (throwing away the other 9 peptides) and then smashes it into pieces. The mass of the intact peptide is used to search the database for all peptides with the same mass. The fragmentation spectra of all these candidate peptides are then calculated and compared to the experimentally observed fragments. The best-matching peptide sequence is then selected.
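The search step described above can be sketched in a few lines. This is a minimal illustration, not how SEQUEST or Mascot actually score spectra: the function names, the tolerances and the tiny peptide "database" are all assumptions made for the example, and the score is just a count of matched singly charged b- and y-ions.

```python
# Monoisotopic amino acid residue masses in Daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def fragment_mzs(seq):
    """Theoretical m/z values of singly charged b- and y-ions."""
    ions = []
    for i in range(1, len(seq)):
        b_ion = sum(RESIDUE_MASS[aa] for aa in seq[:i]) + PROTON
        y_ion = sum(RESIDUE_MASS[aa] for aa in seq[i:]) + WATER + PROTON
        ions.extend([b_ion, y_ion])
    return ions


def count_matches(theoretical, observed, tol=0.5):
    """Number of theoretical fragments found among the observed peaks."""
    return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)


def identify(precursor_mass, observed_peaks, database, mass_tol=1.0):
    """Return (score, peptide) for the best candidate, or None."""
    # Step 1: keep only database peptides with the right intact mass.
    candidates = [p for p in database
                  if abs(peptide_mass(p) - precursor_mass) <= mass_tol]
    # Step 2: score each candidate by its matching theoretical fragments.
    scored = [(count_matches(fragment_mzs(p), observed_peaks), p)
              for p in candidates]
    return max(scored) if scored else None


# Hypothetical database; GAGLGR has the same mass as GAGGLR but a
# different sequence, so only the fragments can tell them apart.
database = ["GAGGLR", "GAGLGR", "HYFEDR", "AEMK"]
spectrum = fragment_mzs("GAGGLR")  # idealized, noise-free spectrum
best = identify(peptide_mass("GAGGLR"), spectrum, database)
print(best)
```

The two isobaric candidates pass the mass filter together, which is why the fragment comparison is needed at all: the precursor mass narrows the list and the fragments choose the sequence.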

4 Are my Peptide ‘Hits’ Correct?

5 Threshold model
Before PeptideProphet was developed, a threshold model was the standard way of evaluating the peptides matched by a search of MS/MS spectra against a protein database. The threshold model sorts search results by a match score.
[Figure: spectra, scores, peptides and proteins, sorted by match score]

6 Set a threshold
Next, a threshold value was set. Different programs have different scoring schemes, so SEQUEST, Mascot, and X!Tandem use different thresholds. Different thresholds may also be needed for different charge states, sample complexity, and database size.
SEQUEST: XCorr > 2.5, dCn > 0.1
Mascot: Score > 45

7 Matches below threshold are dropped
Peptides that are identified with scores above the threshold are considered "correct" matches. Those with scores below the threshold are considered "incorrect". There is no gray area where something is possibly correct.
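As a rough sketch, the threshold model amounts to a sort followed by a hard yes/no filter. The cutoffs below are the SEQUEST values quoted on the slides; the `SequestHit` class and the three example hits, including their scores, are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SequestHit:
    peptide: str
    xcorr: float
    dcn: float


def passes_threshold(hit, min_xcorr=2.5, min_dcn=0.1):
    """Hard cutoff: a hit is either 'correct' or 'incorrect'."""
    return hit.xcorr > min_xcorr and hit.dcn > min_dcn


# Hypothetical search results (scores invented for the example).
hits = [
    SequestHit("KPTPEGDLEILLQK", xcorr=2.9, dcn=0.05),
    SequestHit("TPEVDDEALEK", xcorr=3.1, dcn=0.25),
    SequestHit("LSFNPTQLEEQCHI", xcorr=2.4, dcn=0.30),
]

# Sort by match score, then drop everything below the threshold.
ranked = sorted(hits, key=lambda h: h.xcorr, reverse=True)
accepted = [h.peptide for h in ranked if passes_threshold(h)]
print(accepted)
```

Note how a hit with XCorr 2.4 is discarded exactly like one with XCorr 0.5; that loss of information is what a probability model avoids.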

8 Creating a discriminant score
PeptideProphet starts with a discriminant score. If an application uses several scores (SEQUEST uses XCorr, ΔCn, and Sp scores; Mascot uses ion scores plus identity and homology thresholds), these are first converted to a single discriminant score.

9 Discriminant score for SEQUEST
For example, here's the formula to combine SEQUEST's scores into a discriminant score:

$$D = 8.4\,\frac{\ln(\mathrm{XCorr})}{\ln(\#\mathrm{AAs})} + 7.4\,\Delta C_n - 0.2\,\ln(\mathrm{rankSp}) - 0.3\,\Delta\mathrm{Mass} - 0.96$$

SEQUEST's XCorr (correlation score) is corrected for the length of the peptide. High correlation is rewarded. SEQUEST's ΔCn tells how far the top score is from the rest. Being far ahead of the others is rewarded. The top-ranked match by SEQUEST's Sp score has ln(rankSp) = 0. Lower-ranked matches are penalized. Poor mass accuracy (big ΔMass) is also penalized.
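The same combination can be written as a small function. This is a sketch built from the coefficients on the slide: the signs follow the description (rewarded terms are added, penalized terms are subtracted), the constant 0.96 is assumed to be subtracted, and taking the absolute value of ΔMass is an assumption so that errors in either direction are penalized. The function and parameter names are illustrative.

```python
import math


def sequest_discriminant(xcorr, num_aas, dcn, rank_sp, dmass):
    """Combine SEQUEST scores into one discriminant score D.

    xcorr   : SEQUEST cross-correlation score (> 0)
    num_aas : peptide length in amino acids (> 1)
    dcn     : delta Cn, gap between the top score and the runner-up
    rank_sp : rank of the match by preliminary Sp score (1 = best)
    dmass   : precursor mass error
    """
    length_corrected_xcorr = math.log(xcorr) / math.log(num_aas)
    return (8.4 * length_corrected_xcorr
            + 7.4 * dcn
            - 0.2 * math.log(rank_sp)
            - 0.3 * abs(dmass)  # assumption: sign of the error is ignored
            - 0.96)


# A strong and a weak hypothetical match for a 12-residue peptide.
strong = sequest_discriminant(xcorr=3.0, num_aas=12, dcn=0.3, rank_sp=1, dmass=0.0)
weak = sequest_discriminant(xcorr=1.2, num_aas=12, dcn=0.02, rank_sp=40, dmass=1.5)
print(round(strong, 2), round(weak, 2))
```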

10 Histogram of scores
Once PeptideProphet has calculated the discriminant scores for all the spectra in a sample, it makes a histogram of these discriminant scores. For example, in the sample shown here, 70 spectra have scores around 2.5.
[Figure: histogram; x-axis: discriminant score (D); y-axis: number of spectra in each bin]

11 Mixture of distributions
This histogram shows the distributions of correct and incorrect matches. PeptideProphet assumes that these distributions are standard statistical distributions. Using curve-fitting, PeptideProphet draws the correct and incorrect distributions.
[Figure: histogram with fitted "incorrect" and "correct" distributions; x-axis: discriminant score (D); y-axis: number of spectra in each bin]
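Once the two distributions are fitted, Bayes' rule turns a discriminant score into a probability of being correct. The sketch below assumes both distributions are Gaussian and uses invented parameters; the published PeptideProphet model fits a Gaussian to the correct matches and a gamma distribution to the incorrect ones, and estimates all parameters from the data with expectation-maximization, which is omitted here.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_correct(d, prior_correct, correct, incorrect):
    """P(correct | D = d) for a two-component mixture.

    correct, incorrect : (mean, sd) of each fitted distribution
    prior_correct      : fraction of all spectra that are correct matches
    """
    weight_correct = prior_correct * normal_pdf(d, *correct)
    weight_incorrect = (1.0 - prior_correct) * normal_pdf(d, *incorrect)
    return weight_correct / (weight_correct + weight_incorrect)


# Hypothetical fitted parameters (not taken from any real dataset).
CORRECT = (4.0, 1.5)
INCORRECT = (-1.0, 1.5)
PRIOR = 0.3

for d in (-1.0, 1.5, 2.5, 5.0):
    p = posterior_correct(d, PRIOR, CORRECT, INCORRECT)
    print(f"D = {d:5.1f}  ->  P(correct) = {p:.3f}")
```

Unlike a threshold, this gives every spectrum a graded probability, so a score in the overlap of the two curves is reported as uncertain rather than forced to "correct" or "incorrect".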

12 Discriminating power of Peptide Prophet
Sensitivity: fraction of all correct results passing filter
Error rate: fraction of results passing filter that are incorrect
[Figure: sensitivity and error rate for the SEQUEST thresholds (from literature) and for the probability model, with the ideal spot marked]
Improved discrimination: more identifications (for the same error rate)

13 Choose an error rate with PeptideProphet
A big advantage is that you can choose any error rate you like, such as 5% for inclusive searches, or 1% for extremely accurate searches.
[Figure labels: 5% error rate; 1% error rate; ideal spot: correctly identifies everything, with no error]
Keller et al., Anal Chem 2002
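Given a probability for each identification, the error rate and sensitivity at any cutoff can be estimated directly from the probabilities themselves, and the cutoff can then be chosen to meet a target error rate. The following is a sketch of that idea with invented probabilities and illustrative function names; it assumes the probabilities are accurate.

```python
def error_and_sensitivity(probs, cutoff):
    """Estimated error rate and sensitivity when accepting p >= cutoff."""
    passing = [p for p in probs if p >= cutoff]
    if not passing:
        return 0.0, 0.0
    # Expected number of wrong IDs among those accepted, as a fraction.
    error_rate = sum(1.0 - p for p in passing) / len(passing)
    # Expected correct IDs accepted / expected correct IDs overall.
    sensitivity = sum(passing) / sum(probs)
    return error_rate, sensitivity


def lowest_cutoff_for_error(probs, max_error):
    """Most inclusive probability cutoff whose error rate is <= max_error."""
    for cutoff in sorted(set(probs)):
        error_rate, _ = error_and_sensitivity(probs, cutoff)
        if error_rate <= max_error:
            return cutoff
    return None


# Hypothetical peptide probabilities for six spectra.
probs = [0.99, 0.98, 0.95, 0.90, 0.60, 0.20]
cutoff_5pct = lowest_cutoff_for_error(probs, 0.05)
error_rate, sensitivity = error_and_sensitivity(probs, cutoff_5pct)
print(cutoff_5pct, round(error_rate, 3), round(sensitivity, 3))
```

Tightening the target error rate raises the cutoff and lowers the sensitivity, which is the trade-off the curve on the slide traces out.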

14 False discovery rate (FDR)
There is always a risk of false positive identifications; how many can we tolerate?
More accepted identifications -> higher rate of false discoveries.
FDR = false IDs / total IDs
The threshold for accepting identifications depends on the type of study.
A peptide FDR of 0.01 is common.

15 Peptide FDR estimation
Search against an equal-size database of decoy (random/reversed) entries.
Decoy entries simulate random hits in the real dataset.
FDR = decoy hits / target hits ("target-decoy strategy")
Example: 10 decoy peptides and 20 target peptides are identified above the score cutoff -> FDR = 10 / 20 = 0.5 (50%)
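The target-decoy estimate is simple enough to compute by hand, but a short sketch makes the counting explicit. The data layout (a list of `(score, is_decoy)` pairs) and the scores are assumptions for the example; the counts reproduce the 10 decoy / 20 target case from the slide.

```python
def target_decoy_fdr(psms, cutoff):
    """FDR = decoy hits / target hits among matches scoring >= cutoff.

    psms : iterable of (score, is_decoy) pairs
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= cutoff and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= cutoff and is_decoy)
    return decoys / targets if targets else 0.0


# Hypothetical search results: 20 target and 10 decoy hits above the
# cutoff of 45, plus a few low-scoring hits that are filtered out.
psms = ([(50.0, False)] * 20 + [(50.0, True)] * 10
        + [(10.0, False)] * 3 + [(10.0, True)] * 5)

fdr = target_decoy_fdr(psms, cutoff=45.0)
print(f"FDR = {fdr:.2f}")
```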

16 Which proteins are present in the sample?
Nice, but … we want to know which proteins are present in the sample.

17 Protein Identification by MS/MS
[Figure: three levels of analysis]
Protein level: protein sample (A, B, C, D) -> protein identifications (A, B, C)
Peptide level: peptide mixture -> peptide identifications
MS/MS spectrum level: MS/MS spectra
Database search tools: SEQUEST, Mascot, SpectrumMill

18 Protein Identification
>sp|P02754|LACB_BOVIN BETA-LACTOGLOBULIN PRECURSOR (BETA-LG) (ALLERGEN BOS D 5) - Bos taurus (Bovine).
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI

TPEVDDEALEK : p = 0.96
TPEVDDEALEKFDK : p = 0.96
KPTPEGDLEILLQK : p = 0.83
LSFNPTQLEEQCHI : p = 0.65
LSFNPTQLEEQCHI : p = 0.76

sp|P02754|LACB_BOVIN Probability = ???

ProteinProphet software combines the probabilities of peptides assigned to MS/MS spectra to compute accurate probabilities that the corresponding proteins are present.

19 Non-random Grouping of Peptides
Correct peptide assignments tend to correspond to "multi-hit" proteins, those to which other correctly assigned peptides also correspond.
Incorrect peptide assignments tend to correspond to "single-hit" proteins, to which no other correctly assigned peptide corresponds.
The false positive error rate is therefore higher at the protein level than at the peptide level.
It is hard to distinguish correct single-hit proteins from the incorrect ones.

20 Amplification of False Positive Error Rate When Going from Peptide to Protein Level
[Figure: ten peptide assignments (Peptide 1 to Peptide 10), 5 of them correct (+), mapped to proteins. Prot A and Prot B are in the sample (enriched for 'multi-hit' proteins); the remaining proteins are not in the sample (enriched for 'single hits').]
Peptide level: 50% false positives
Protein level: 71% false positives
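The jump from 50% to 71% follows from simple counting. The sketch below assumes one peptide-to-protein mapping that is consistent with the totals on the slide (5 correct peptides sharing Prot A and Prot B, and 5 incorrect peptides each hitting a different protein); the exact assignment of peptides to proteins and the names Prot C to Prot G are hypothetical.

```python
# (peptide, protein, assignment_is_correct); mapping is hypothetical.
assignments = [
    ("Peptide 1", "Prot A", True),
    ("Peptide 2", "Prot A", True),
    ("Peptide 3", "Prot A", True),
    ("Peptide 4", "Prot B", True),
    ("Peptide 5", "Prot B", True),
    ("Peptide 6", "Prot C", False),
    ("Peptide 7", "Prot D", False),
    ("Peptide 8", "Prot E", False),
    ("Peptide 9", "Prot F", False),
    ("Peptide 10", "Prot G", False),
]


def false_positive_rates(assignments):
    """Return (peptide-level, protein-level) false positive fractions."""
    wrong_peptides = sum(1 for _, _, ok in assignments if not ok)
    peptide_fp = wrong_peptides / len(assignments)

    all_proteins = {protein for _, protein, _ in assignments}
    # A protein counts as truly present if any correct peptide maps to it.
    true_proteins = {protein for _, protein, ok in assignments if ok}
    protein_fp = len(all_proteins - true_proteins) / len(all_proteins)
    return peptide_fp, protein_fp


peptide_fp, protein_fp = false_positive_rates(assignments)
print(f"peptide level: {peptide_fp:.0%}, protein level: {protein_fp:.0%}")
```

The correct peptides collapse onto 2 proteins while every wrong peptide creates its own protein, so 5 of the 7 implicated proteins are false.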

21 Repeated Sequencing Events
>sp|P02754|LACB_BOVIN BETA-LACTOGLOBULIN PRECURSOR (BETA-LG) (ALLERGEN BOS D 5) - Bos taurus (Bovine).
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI

Same peptide sequenced multiple times:
PTPEGDLEILLQK : p = 0.81
PTPEGDLEILLQK : p = 0.62
PTPEGDLEILLQK : p = 0.95

P = ??
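One simple way to answer both "P = ??" questions is sketched below. It assumes that repeated identifications of the same peptide are not independent evidence, so only the highest probability per distinct sequence is kept, and that distinct peptides are independent, so the protein is present unless every one of its peptide assignments is wrong. The full ProteinProphet model is more involved (for example, it also adjusts peptide probabilities according to the other peptides found for the same protein), so treat this as an illustration rather than the software's exact calculation.

```python
def protein_probability(peptide_ids):
    """Probability that a protein is present, from its peptide IDs.

    peptide_ids : iterable of (peptide_sequence, probability) pairs
    """
    # Repeated sequencing events: keep the best probability per peptide.
    best = {}
    for sequence, prob in peptide_ids:
        best[sequence] = max(prob, best.get(sequence, 0.0))

    # P(protein) = 1 - P(all distinct peptide assignments are wrong).
    prob_all_wrong = 1.0
    for prob in best.values():
        prob_all_wrong *= (1.0 - prob)
    return 1.0 - prob_all_wrong


# The same peptide sequenced three times.
repeated = [("PTPEGDLEILLQK", 0.81),
            ("PTPEGDLEILLQK", 0.62),
            ("PTPEGDLEILLQK", 0.95)]

# Several different beta-lactoglobulin peptides.
lacb_bovin = [("TPEVDDEALEK", 0.96),
              ("TPEVDDEALEKFDK", 0.96),
              ("KPTPEGDLEILLQK", 0.83),
              ("LSFNPTQLEEQCHI", 0.65),
              ("LSFNPTQLEEQCHI", 0.76)]

print(round(protein_probability(repeated), 4))
print(round(protein_probability(lacb_bovin), 6))
```

Under these assumptions, three sightings of one peptide give no more confidence than the best single sighting, whereas four distinct peptides push the protein probability very close to 1.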

22 Protein Inference Problem
Degenerate peptides: correspond to more than a single entry in the protein database
[Figure: Peptide 1 maps to both Prot A and Prot B] Protein A or protein B? Or both?
In shotgun proteomics the connectivity between peptides and proteins is lost
Degenerate peptides are more prevalent with databases of higher eukaryotes due to the presence of:
Related protein family members
Alternative splice forms
Partial sequences
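Finding degenerate peptides is a matter of mapping each identified peptide back to every database entry that contains it. In this sketch the two protein sequences are invented so that one peptide occurs in both; the function names are illustrative, and deciding which protein a degenerate peptide actually came from is left open, as it is on the slide.

```python
def map_peptides_to_proteins(peptides, protein_db):
    """For each peptide, list every database protein containing it."""
    return {
        peptide: sorted(name for name, sequence in protein_db.items()
                        if peptide in sequence)
        for peptide in peptides
    }


def degenerate_peptides(mapping):
    """Peptides that correspond to more than one database entry."""
    return sorted(pep for pep, proteins in mapping.items()
                  if len(proteins) > 1)


# Hypothetical database: Prot A and Prot B share the stretch GAGGLR,
# as related family members or splice forms might.
protein_db = {
    "Prot A": "MKGAGGLRSTHYFEDRPL",
    "Prot B": "MTGAGGLRQQAEMKLL",
}
identified = ["GAGGLR", "HYFEDR", "AEMK"]

mapping = map_peptides_to_proteins(identified, protein_db)
for peptide, proteins in mapping.items():
    print(peptide, "->", proteins)
print("degenerate:", degenerate_peptides(mapping))
```

Here HYFEDR and AEMK each point to a single protein, while GAGGLR on its own cannot distinguish Prot A from Prot B.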

23 2D Gel-based Approach vs. Shotgun Approach
[Figure: workflow comparison for a protein mixture containing proteins A and B]
2D gel-based approach: protein mixture -> protein mixture separation (by MW and pI) -> separated proteins -> protein digestion -> MS/MS sequencing -> peptide identification -> identified proteins
Shotgun approach: protein mixture -> protein digestion -> peptide mixture -> peptide mixture separation -> MS/MS sequencing -> MS/MS spectra -> peptide identification -> peptide grouping -> implicated database proteins -> protein inference -> identified proteins
Example peptide sequences: GAGGLR, HYFEDR, AEMK

24 Protein ID False Positive Rates: ProteinProphet vs. Score Thresholds
Keep acquiring data and you will "identify" everything in the database!
[Figure labels: Washburn et al., 2001; Tirumalai et al., 2003; ProteinProphet; # spectra / protein]
Control datasets:
Halobacterium vs. Halo+Human (4 runs)
Halobacterium vs. Halo+Human (45 runs)
18 purified proteins vs. 18+Human (22 runs)

25 Protein Identifications: 375-run Experiment
Data filter | # ids | # non-single hits | # single-hits
Publ. Threshold model #
Publ. Threshold model #
ProteinProphet, p

