Proteomics
February 15, 2017
Dr. ir. Perry Moerland, Bioinformatics Laboratory, Academic Medical Center
p.d.moerland@amc.uva.nl
Graduate School ‘Bioinformatics’
Mass spectrometry: MALDI-TOF
MALDI-TOF MS: Matrix-Assisted Laser Desorption Ionization Time Of Flight Mass Spectrometry
  Ion source + mass analyzer
  Needs only femtomoles of peptide material
  Ionized peptides are accelerated
  Time-of-flight spectrum
Mass spectrum: intensity versus mass
(Figure from Lodish et al.)
MS ID: peptide mass fingerprint
Wet lab: gel -> protein -> digest -> peptides -> spectrum
In silico: protein in database -> digest -> theoretical spectrum
Compare the measured and theoretical spectra -> score
Database: http://www.uniprot.org/ (> 59,000,000 entries)
K = lysine, R = arginine (figure from Graves & Haystead)
We will come back later in this lecture to how the stick representation of a spectrum is obtained.
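The in silico side of this comparison can be sketched in a few lines. This is a minimal illustration, not Mascot's implementation: it assumes a trypsin-like rule (cleave C-terminally of K or R, but not before P), uses standard monoisotopic residue masses, and the function names and the 10 ppm default are chosen here for illustration.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide (free N- and C-terminus)


def digest(sequence):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        nxt = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide in Da."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed, theoretical, tol_ppm=10.0):
    """Number of observed masses within tol_ppm of any theoretical mass."""
    return sum(
        any(abs(o - t) <= t * tol_ppm * 1e-6 for t in theoretical)
        for o in observed
    )


if __name__ == "__main__":
    theoretical = [peptide_mass(p) for p in digest("AKRPGK")]
    print(digest("AKRPGK"), theoretical)
```

Counting matches this way is only the starting point; the scoring slides that follow explain why a raw count is not a sufficient score.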
Identification: peptide mass fingerprint
Why a fingerprint and not the intact protein mass?
  Pre-sequences (e.g. signal peptide)
  Post-translational modifications
  Alternative splicing
  Errors in databases: 1 in 1,000-10,000 bases (i.e. 1 in 300-3,000 amino acids)
  Measurement error: 10 ppm error on 100 kDa = 1 Da
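The measurement-error line is simple arithmetic: a relative error in parts per million scales with the mass. A small sketch, with function names chosen here for illustration:

```python
def ppm_to_da(mass_da, ppm):
    """Absolute mass error in Da for a relative error given in ppm."""
    return mass_da * ppm / 1e6


def ppm_error(observed_da, theoretical_da):
    """Relative error of an observed mass in ppm."""
    return (observed_da - theoretical_da) / theoretical_da * 1e6


if __name__ == "__main__":
    print(ppm_to_da(100_000, 10))  # intact protein of 100 kDa
    print(ppm_to_da(1_000, 10))    # peptide of 1 kDa
```

The same 10 ppm corresponds to a hundred times smaller absolute error on a 1 kDa peptide than on a 100 kDa protein, which is one reason peptides are easier to match than intact masses.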
Intact mass and protein identity
Why the intact mass can differ from the predicted mass (figure labels: Pi, X, Ac, GlcNAc):
  Variable/unknown modifications
  Known/fixed modifications
  Intron/exon boundaries
  N-terminal processing
  C-terminal processing
  Sequencing errors & polymorphisms
After digestion:
  Only a few modified peptides
  Most peptides will have the predicted mass
Identification: how to score?
Simply counting the number of matches is not enough
False positives, because
  MS has only limited mass accuracy
  Bias towards ‘heavy’ proteins
  …
False negatives, because
  Peptides can be post-translationally modified
  Sometimes digestion is not perfect: missed cleavages
The limited mass accuracy of MS can lead to both false positives and false negatives
Importance of mass accuracy
Monoisotopic masses (Da):
  1H    1.0078250
  12C   12
  14N   14.0030740
  16O   15.9949140
  31P   30.9737620
  32S   31.9720707
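One way to see why these decimals matter is to compute masses from elemental compositions. The sketch below uses the table values; the lysine/glutamine comparison is an example chosen here (it is not on the slide): both residues have nominal mass 128 and differ only in the second decimal.

```python
# Monoisotopic masses from the table above (Da).
MONOISOTOPIC = {
    "H": 1.0078250, "C": 12.0, "N": 14.0030740,
    "O": 15.9949140, "P": 30.9737620, "S": 31.9720707,
}


def formula_mass(composition):
    """Monoisotopic mass of an elemental composition, e.g. {'C': 2, 'H': 5}."""
    return sum(MONOISOTOPIC[element] * n for element, n in composition.items())


# Residue compositions (amino acid minus H2O).
LYS_RESIDUE = {"C": 6, "H": 12, "N": 2, "O": 1}
GLN_RESIDUE = {"C": 5, "H": 8, "N": 2, "O": 2}

if __name__ == "__main__":
    delta = formula_mass(LYS_RESIDUE) - formula_mass(GLN_RESIDUE)
    print(f"K - Q = {delta:.6f} Da")
```

With a mass accuracy worse than a few hundredths of a dalton, peptides that differ by a K/Q exchange cannot be told apart.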
Peptide mass fingerprinting: Mascot
Trade-off between false positives (specificity) and false negatives (sensitivity)
http://www.matrixscience.com
Peptide mass fingerprinting: missed cleavages
Allowing missed cleavages: higher sensitivity, lower specificity
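The trade-off can be made concrete by enumerating the peptides a search has to consider. A minimal sketch, assuming cleavage after K or R (not before P); the function names and the `max_missed` parameter are illustrative and not Mascot's own.

```python
def cleave(sequence):
    """Fully digested peptides: cut after K or R unless followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        nxt = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def with_missed_cleavages(sequence, max_missed=1):
    """All peptides spanning up to max_missed uncut cleavage sites."""
    parts = cleave(sequence)
    peptides = []
    for i in range(len(parts)):
        last = min(i + max_missed, len(parts) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(parts[i:j + 1]))
    return peptides


if __name__ == "__main__":
    for m in (0, 1, 2):
        print(m, with_missed_cleavages("AKGRCK", m))
```

Each extra allowed missed cleavage enlarges the list of theoretical masses: more true peptides can be matched, but random matches also become more likely.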
Scoring scheme: MASCOT
Score, Expect?
Scoring scheme: MOWSE/MASCOT (I)
For protein A: Mowse score = 50 / (N H)
  50: average protein of 50 kDa
  H: molecular weight of protein A
  N: product of match scores, each match score M (0 ≤ M ≤ 1) weighted inversely proportionally to peptide mass
Match scores: accumulate statistics on the size distribution of peptide masses as a function of protein mass
  http://www.matrixscience.com/help/scoring_help.html
  https://proteomicsresource.washington.edu/mascot/help/scoring_help.html
(Figure axis: # of matching proteins)
Note: this implicitly incorporates the number of matches
Warning: many scoring schemes (and there are tens of them) are very ad hoc and prone to false positives
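The formula on this slide can be tried out directly. The sketch below only implements Mowse score = 50 / (N H) as written above; how the individual match scores M are derived from the frequency statistics is not reproduced, so the values passed in are hypothetical.

```python
def mowse_score(match_scores, protein_mass_kda):
    """Mowse score = 50 / (N * H), with N the product of the match scores.

    match_scores: one value M per matched peptide, 0 < M <= 1 (hypothetical
    inputs; M = 0 is rejected here because it would make N zero).
    protein_mass_kda: H, the molecular weight of the protein in kDa.
    """
    n = 1.0
    for m in match_scores:
        if not 0.0 < m <= 1.0:
            raise ValueError("match scores must lie in (0, 1]")
        n *= m
    return 50.0 / (n * protein_mass_kda)


if __name__ == "__main__":
    print(mowse_score([0.5, 0.5], 50.0))       # two matches
    print(mowse_score([0.5, 0.5, 0.5], 50.0))  # three matches: higher score
    print(mowse_score([0.5, 0.5], 100.0))      # heavier protein: lower score
```

Because every M is at most 1, each additional match can only shrink N and raise the score, which is how the number of matches enters implicitly; dividing by H counteracts the bias towards heavy proteins.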
Scoring scheme: MOWSE/MASCOT (II)
Probabilistic: transformation of the Mowse score into the probability P that a match is a random event
Score: S = -10 log10(P), corrected for the size of the database
  Example: P = 1 / (20 x 5000) gives S = -10 log10(P) = 50
Expect (E-value): number of times you could expect to get this score or better by chance
E-value ≥ 1: completely random
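The worked number on this slide can be reproduced as follows. The sketch assumes that the factor 20 stands for a 1-in-20 (5%) significance level and 5000 for the number of database entries; that reading, and the function names, are assumptions made here.

```python
import math


def score_from_probability(p):
    """Probability-based score S = -10 * log10(P)."""
    return -10.0 * math.log10(p)


def significance_threshold(db_entries, alpha=0.05):
    """Score a match must exceed to be significant at level alpha,
    after correcting for the number of database entries (assumed reading)."""
    return score_from_probability(alpha / db_entries)


if __name__ == "__main__":
    print(score_from_probability(1 / (20 * 5000)))
    print(significance_threshold(5000))
    print(significance_threshold(500_000))  # larger database, higher threshold
```

The larger the database, the higher the score needed before a match stops looking like a random event.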
LC-MS/MS in 30 sec.
LC: chromatogram
MS: mass spectrum
Fragmentation
(Figure from Peng and Gygi)
MS/MS fragmentation
Breaks a selected peptide into smaller parts
Fragmentation spectrum: identifies individual amino acids
MS/MS peptide identification
  Sequence searching
  Sequence tag
  De novo
MS/MS ID: sequence searching
Workflow: database -> shortlist -> predict MS/MS spectra -> compare -> match (example: tubulin 353-370)
The computer generates a theoretical fragmentation spectrum: Mascot
Notes:
  Some daughter ions give a very dominant fragmentation pattern; there are good chemical reasons for this behaviour.
  In these cases no good contiguous sequence can be established with a high degree of confidence, and raw MS/MS searching is much better suited to identifying the parent.
  Raw MS/MS searching will be the future of fragmentation database searching.
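A theoretical fragmentation spectrum can be sketched as follows. This is an illustration under stated assumptions, not Mascot's predictor: only singly charged b- and y-ions are generated, with standard monoisotopic residue masses, and all names are chosen here.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def by_ions(peptide):
    """Return (b, y): m/z of singly charged b1..b(n-1) and y1..y(n-1)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b, y = [], []
    total = 0.0
    for m in masses[:-1]:            # N-terminal fragments
        total += m
        b.append(total + PROTON)
    total = WATER
    for m in reversed(masses[1:]):   # C-terminal fragments
        total += m
        y.append(total + PROTON)
    return b, y


if __name__ == "__main__":
    b, y = by_ions("PEPTIDE")
    print([round(x, 3) for x in b])
    print([round(x, 3) for x in y])
```

A search engine would compare such a list against the measured fragment peaks for every shortlisted database peptide and score the agreement.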
MS/MS ID: sequence tag
Format: 409.76 ----- T1 A2 G3 ----- 528.13 (mass1, internal sequence, mass2)
Workflow: database -> compare masses and internal sequence tag -> shortlist -> compare fragments -> match
Notes:
  A stretch of contiguous high peaks is easily found
  Peak-to-peak distance corresponds to an amino acid mass
  Predominantly y'' fragmentation at low-energy CID
  The C- and N-terminal masses correspond to a combination of amino acids (all permutations)
  2+ precursor charge state: fragment masses greater than the parent mass ensure that these are products of a molecule that is not singly charged
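Reading a tag from peak-to-peak distances can be sketched as below. The peak list is made up for illustration, the 0.02 Da tolerance is an arbitrary choice, and the function name is hypothetical; real tag extraction also has to pick the peaks and decide the reading direction.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_tag(peaks, tol=0.02):
    """Translate differences between consecutive peaks into residues.

    Ambiguous distances are reported as 'L/I'; distances that match
    no residue within tol are reported as '?'.
    """
    tag = []
    for low, high in zip(peaks, peaks[1:]):
        diff = high - low
        hits = [aa for aa, m in RESIDUE_MASS.items() if abs(diff - m) <= tol]
        tag.append("/".join(hits) if hits else "?")
    return tag


if __name__ == "__main__":
    # Hypothetical ladder of singly charged fragment peaks (same ion series).
    print(read_tag([400.0, 501.048, 572.085, 629.106]))
```

Leucine and isoleucine have identical masses, so a tag read this way can never distinguish them; the flanking masses (mass1 and mass2 in the format above) then constrain where the tag sits in the peptide.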
MS/MS ID: de novo (I)
De novo: for peptides/proteins not yet in a database or not identified by fingerprint techniques:
  unsequenced species
  modified peptides
Algorithms for extracting long stretches of peptide sequence are needed: high-resolution & very good quality spectra
MS/MS ID: de novo (II)
Problem: MS/MS spectra display a mixture of N- and C-terminal fragment ion series (a, b, c and x, y, z, respectively)
Solution: reduce the complexity of MS/MS spectra
  Protein digestion using a metalloendopeptidase with Lys-N specificity: cleavage N-terminally of lysine
  Electron-transfer-induced dissociation
  MS/MS spectrum is dominated by c-ions
  MS/MS spectrum has few gaps
In a gold standard dataset, 42% of peptides identified via database search can be identified de novo
Van Breukelen et al, Proteomics (2010)
MS/MS ID: validation (I)
Many MS/MS search engines are threshold-based
Probabilistic approaches: compute the probability that a match is correct:
  Mascot
  PeptideProphet
Nesvizhskii et al, Nature Methods (2007)
MS/MS ID: validation (II)
Many MS/MS search engines are threshold-based
Target-decoy searching:
  Database augmented with reversed or randomized sequences of the database
  Filter using various score cutoffs (FDR)
  http://www.matrixscience.com/help/decoy_help.html
Nesvizhskii et al, Nature Methods (2007)
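The target-decoy idea can be sketched as follows. The sketch assumes reversed sequences as decoys and estimates the FDR at a cutoff as the number of decoy hits divided by the number of target hits; other estimators exist, and the names, the `REV_` prefix and the toy scores are all hypothetical.

```python
def make_decoys(targets):
    """Build a decoy entry for every target by reversing its sequence."""
    return {"REV_" + name: seq[::-1] for name, seq in targets.items()}


def fdr_at_cutoff(matches, cutoff):
    """Estimated FDR among matches scoring >= cutoff.

    matches: list of (score, is_decoy) pairs from a search against the
    combined target + decoy database.
    """
    n_target = sum(1 for s, decoy in matches if s >= cutoff and not decoy)
    n_decoy = sum(1 for s, decoy in matches if s >= cutoff and decoy)
    return n_decoy / n_target if n_target else 0.0


def cutoff_for_fdr(matches, max_fdr=0.01):
    """Lowest score cutoff whose estimated FDR does not exceed max_fdr."""
    for cutoff in sorted({s for s, _ in matches}):
        if fdr_at_cutoff(matches, cutoff) <= max_fdr:
            return cutoff
    return None


if __name__ == "__main__":
    toy = [(90, False), (80, False), (70, False), (60, True),
           (50, False), (40, True), (30, False)]
    print(fdr_at_cutoff(toy, 50), cutoff_for_fdr(toy, 0.01))
```

Decoy hits are by construction wrong, so their count above a cutoff serves as an estimate of how many target hits above the same cutoff are wrong as well.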
Cellular changes in T-cell expression patterns upon infection with HIV
Analysis of human T cells upon HIV-1 infection
  1921 protein spots detected with 2D-DIGE
  288 spots differentially expressed at peak infection
  93 unique proteins identified via peptide mass fingerprinting
  188 unidentified spots
Goal: try to find candidate proteins in silico
Nandal et al, BMC Bioinformatics 16:25 (2015)
Recapitulation: 2D gel electrophoresis
Horizontal axis: separation by iso-electric point (pI)
Vertical axis: separation by molecular weight (Mw), lower = lighter
2D-DIGE: quantification of changes in protein expression using fluorescent labeling
In-depth mining of 2D-DIGE data
Step 1: Calculation of pI and Mw
http://web.expasy.org/compute_pi/
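How a theoretical pI can be obtained from a sequence is sketched below: the net charge is computed with the Henderson-Hasselbalch equation and the pH at which it is zero is found by bisection. The pKa values are illustrative textbook-style numbers chosen here, not the set used by the Compute pI/Mw tool, so results will differ from that tool. The Mw part is a plain sum of residue masses and is left out.

```python
# Illustrative pKa values (assumption; not those of Compute pI/Mw).
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
PKA_N_TERMINUS = 9.0
PKA_C_TERMINUS = 2.0


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH."""
    positive = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERMINUS))
    negative = 1.0 / (1.0 + 10 ** (PKA_C_TERMINUS - ph))
    for aa in sequence:
        if aa in PKA_POSITIVE:
            positive += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            negative += 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return positive - negative


def isoelectric_point(sequence, tol=1e-4):
    """pH at which the net charge is zero, found by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid    # still positively charged: pI is higher
        else:
            high = mid
    return (low + high) / 2.0


if __name__ == "__main__":
    for seq in ("KKKK", "DDDD", "ACDEFGHIK"):
        print(seq, round(isoelectric_point(seq), 2))
```

Bisection works here because the net charge decreases monotonically with pH.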
Step 2: Fitting calibration curves
Cubic smoothing splines
Step 3: Generation of candidate list
Properties and pI/Mw ranges of proteins are sent to TagIdent (http://web.expasy.org/tagident/), which returns a list of UniProt identifiers of proteins that are close to the predicted pI/Mw
Isoforms are disregarded, as STRING does not handle isoforms.
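The selection that this step delegates to TagIdent can be mimicked locally for illustration. The sketch below does not call TagIdent; it filters a made-up protein table by a pI window and a relative Mw window, and the window sizes, field names and identifiers are all hypothetical.

```python
def candidates_in_window(proteins, pi_estimate, mw_estimate,
                         pi_range=0.5, mw_range_fraction=0.2):
    """Identifiers of proteins whose theoretical pI and Mw are close
    to the values estimated for a gel spot."""
    selected = []
    for protein in proteins:
        pi_ok = abs(protein["pI"] - pi_estimate) <= pi_range
        mw_ok = abs(protein["Mw"] - mw_estimate) <= mw_range_fraction * mw_estimate
        if pi_ok and mw_ok:
            selected.append(protein["id"])
    return selected


if __name__ == "__main__":
    # Made-up entries; real candidates would carry UniProt identifiers.
    table = [
        {"id": "PROT_A", "pI": 5.2, "Mw": 42000},
        {"id": "PROT_B", "pI": 6.5, "Mw": 41000},
        {"id": "PROT_C", "pI": 5.0, "Mw": 80000},
    ]
    print(candidates_in_window(table, pi_estimate=5.0, mw_estimate=40000))
```

Wider windows tolerate larger calibration errors but return longer candidate lists, which makes the prioritization in the next step more important.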
Step 4: Prioritization of candidates
STRING
STRING: functional protein association network
STRING is a database of known and predicted protein interactions, derived from different sources
Interactions are visualized via graphs
Different line colors represent the types of evidence for the association
v9.1: >5,200,000 proteins from 1,133 organisms
STRING 10: database currently covers 9,643,763 proteins from 2,031 organisms
http://string-db.org/
STRING network: 2D-DIGE identified proteins
Visualisation of the STRING network: associations are represented by lines, and different line colours code for different evidence categories.
New proteins have to fit into this network as well as possible.
377 observed interactions compared to 66.2 expected interactions (P << 0.0001)
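One simple way to express "fits into the network" is to score each candidate by its associations with the proteins that were already identified. The sketch below is such a stand-in, not the scoring used in the study: it sums hypothetical association confidences between a candidate and identified proteins and ranks candidates by that sum.

```python
def rank_candidates(candidates, identified, edges):
    """Rank candidates by total association weight to identified proteins.

    identified: set of already identified protein names.
    edges: dict mapping frozenset({a, b}) to a confidence in [0, 1]
    (hypothetical values standing in for STRING association scores).
    """
    scores = {}
    for candidate in candidates:
        scores[candidate] = sum(
            weight
            for pair, weight in edges.items()
            if candidate in pair and (pair - {candidate}) & identified
        )
    return sorted(candidates, key=lambda c: scores[c], reverse=True)


if __name__ == "__main__":
    identified = {"P1", "P2"}
    edges = {
        frozenset({"X", "P1"}): 0.9,
        frozenset({"X", "P2"}): 0.7,
        frozenset({"Y", "P1"}): 0.4,
        frozenset({"Y", "Z"}): 0.99,  # ignored: Z is not an identified protein
    }
    print(rank_candidates(["Y", "Z", "X"], identified, edges))
```

Associations between two unidentified candidates are deliberately ignored here, since only links to the observed network count as evidence in this sketch.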
Step 5: gene expression-based filtering
gene expression ~ protein expression?
http://barcode.luhs.org/
Properties of candidate lists
  Post-translational modifications
  Hydrophobic proteins
Results
TPR: true-positive rate
With optimal settings for pI range and Mw range: ~44% of correct proteins in top-5
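A top-5 figure of this kind can be computed as sketched below, assuming it is read as the fraction of spots whose known protein appears among the first k ranked candidates. That reading, the function name and the toy data are assumptions made here.

```python
def top_k_rate(ranked_candidates, true_proteins, k=5):
    """Fraction of spots whose true protein is among the top-k candidates.

    ranked_candidates: dict spot -> list of candidates, best first.
    true_proteins: dict spot -> protein known to be in that spot.
    """
    hits = sum(
        1
        for spot, truth in true_proteins.items()
        if truth in ranked_candidates.get(spot, [])[:k]
    )
    return hits / len(true_proteins)


if __name__ == "__main__":
    ranked = {"spot1": ["A", "B", "C"], "spot2": ["D", "E", "F"]}
    truth = {"spot1": "B", "spot2": "F"}
    print(top_k_rate(ranked, truth, k=2), top_k_rate(ranked, truth, k=3))
```

Such an evaluation is only possible on spots whose protein is already known, for example those identified via peptide mass fingerprinting.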
Results: gene expression-based filtering
http://string-db.org/9_1/p/885743793
Further pointers
Catalogues:
  http://www.humanproteomemap.org/
  https://www.proteomicsdb.org/
  http://www.proteinatlas.org/
Galaxy: https://usegalaxy.org/
  Boekel et al, Nature Biotechnology, 33:2, 139 (2015)