Protein Identification via Database Searching
Attila Kertész-Farkas
Protein Structure and Bioinformatics Group, ICGEB, Trieste



Mass spectra analysis: biological sample → results report

Computational analysis of MS/MS
Two main approaches, plus a hybrid of the two:
– De novo sequencing
– Database searching
– Hybrid

De novo sequencing
Advantages:
– Can identify new peptides and proteins
– Able to discover (new) PTMs
– Independent of protein databases
Limitations:
– Requires MS/MS data of good quality
– No statistics-based validation

Database searching-based MS/MS (tandem mass spectra) identification
Pipeline: Input data → Peptide assignment (identification) → Validation → Protein inference → Quantitation → Interpretation
– Input data: data formats
– Peptide assignment: database searching
– Validation: statistical methods for validation
– Protein inference: protein assembling

Mass spectrum:
– Histogram of the mass-over-charge (m/z) values of the observed fragment ions.
– Spectrum normalization: intensity is usually scaled to the [0, 100] interval.
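The normalization step can be sketched as follows. This is a minimal illustration, assuming a spectrum is held as a list of (m/z, intensity) pairs and that scaling is relative to the most intense peak; the function name `normalize_intensities` is hypothetical.

```python
def normalize_intensities(peaks, top=100.0):
    """Scale intensities so that the most intense peak equals `top`.

    `peaks` is a list of (m/z, intensity) pairs (an assumed representation).
    """
    if not peaks:
        return []
    max_intensity = max(intensity for _, intensity in peaks)
    if max_intensity <= 0:
        # Degenerate spectrum: nothing to scale against.
        return [(mz, 0.0) for mz, _ in peaks]
    return [(mz, top * intensity / max_intensity) for mz, intensity in peaks]


spectrum = [(175.1, 20.0), (288.2, 80.0), (401.3, 40.0)]
print(normalize_intensities(spectrum))
# [(175.1, 25.0), (288.2, 100.0), (401.3, 50.0)]
```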

The most common formats are mzXML, MGF and DAT.

MGF file format
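As a rough illustration of the MGF layout: each spectrum sits between `BEGIN IONS` and `END IONS`, with `KEY=VALUE` header lines followed by "m/z intensity" peak lines. The sketch below parses such text. The record shown is made up for the example, and the parser `parse_mgf` is a hypothetical helper that ignores many details of real MGF files.

```python
def parse_mgf(text):
    """Parse MGF-formatted text into a list of spectra.

    Each spectrum is a dict with its KEY=VALUE headers under "params"
    and its (m/z, intensity) pairs under "peaks".
    """
    spectra, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                fields = line.split()
                current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra


# A made-up record, used only to exercise the parser.
example = """\
BEGIN IONS
TITLE=hypothetical spectrum 1
PEPMASS=523.28
CHARGE=2+
175.119 20.0
288.203 80.0
401.287 40.0
END IONS
"""

spectra = parse_mgf(example)
print(spectra[0]["params"]["PEPMASS"], len(spectra[0]["peaks"]))
# 523.28 3
```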

.mzXML file format

Spectra comparison: the experimental spectra (input data) are scored against the entries of the protein sequence DB, for example:
>IPI:IPI |SWISS-PROT:P01127
MNRCWALFLSLCCYLRLVSAEGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDKAELDLNMTR
SHSGGELESLARGRRSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGC
CNNRNVQCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQE
QRAKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA

Spectra comparison against another entry of the protein sequence DB:
>IPI:IPI |SWISS-PROT:P
MEICRGLRSHLITLLLFLFHSETICRPSGRKSSKMQAFRIWDVNQKTFYLRNNQLVAGYLQGPNVN
LEEKIDVVPIEPHALFLGIHGGKMCLSCVKSGDETRLQLEAVNITDLSENRKQDKRFAFIRSDSGPTT
SFESAACPGWFLCTAMEADQPVSLTNMPDEGVMVTKFYFQEDE

Peptide assignments and their scores:
– Score: 4, Peptide: AELDLNMTR
– Score: 32, Peptide: SHLITLLLFLFHSETICR
– Score: 3, Peptide: MEICRGLR
– Score: 15, Peptide: LLHGDPGEEDK

Two components of the database search:
1. Spectra comparison (scoring function)
2. Protein sequence DB

Shared Peak Count (SPC)
The number of peaks in the theoretical spectrum that match peaks in the experimental spectrum.
Example from the slide figure: SPC = 7
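The shared peak count can be sketched in a few lines. This is an illustration only: two peaks are assumed to "match" when their m/z values differ by at most a tolerance, and the tolerance value and the function name `shared_peak_count` are assumptions, not taken from the slides.

```python
def shared_peak_count(theoretical_mz, experimental_mz, tolerance=0.5):
    """Count theoretical peaks that have an experimental peak within `tolerance`."""
    return sum(
        1
        for t in theoretical_mz
        if any(abs(t - e) <= tolerance for e in experimental_mz)
    )


theoretical = [100.0, 200.0, 300.0, 400.0]
experimental = [100.2, 250.0, 300.4, 399.0]
print(shared_peak_count(theoretical, experimental))
# 2  (the peaks at 100.0 and 300.0 are matched)
```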

Inner product (I)
The sum of the intensities of the peaks in the experimental spectrum that match peaks in the theoretical spectrum.
Example from the slide figure: I = 3.5
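A minimal sketch of the inner product score as defined here, assuming experimental peaks are (m/z, intensity) pairs with intensities scaled to [0, 1] and that peaks match within an m/z tolerance. The tolerance and the name `inner_product` are illustrative assumptions.

```python
def inner_product(experimental, theoretical_mz, tolerance=0.5):
    """Sum the intensities of experimental peaks that match a theoretical peak.

    `experimental` is a list of (m/z, intensity) pairs;
    `theoretical_mz` is a list of m/z values.
    """
    return sum(
        intensity
        for mz, intensity in experimental
        if any(abs(mz - t) <= tolerance for t in theoretical_mz)
    )


experimental = [(100.2, 0.5), (250.0, 1.0), (300.4, 0.25), (399.0, 0.75)]
theoretical = [100.0, 200.0, 300.0, 400.0]
print(inner_product(experimental, theoretical))
# 0.75  (0.5 + 0.25 from the two matched peaks)
```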

Hyperscore: $H = I \cdot N_b! \cdot N_y!$
– $I$ is the sum of the intensities of the matched peaks
– $N_b$ (resp. $N_y$) is the number of matched b (resp. y) peaks in the theoretical spectrum
– ! is the factorial function
Example from the slide figure: $H = 3.2 \cdot 3! \cdot 4! = 3.2 \cdot 6 \cdot 24 = 460.8$
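The hyperscore formula translates directly into code. The sketch below takes the three quantities as inputs (how the b and y matches are found is left out), and the function name `hyperscore` is hypothetical.

```python
from math import factorial


def hyperscore(matched_intensity_sum, n_b, n_y):
    """H = I * N_b! * N_y! for the given matched intensity sum and ion counts."""
    return matched_intensity_sum * factorial(n_b) * factorial(n_y)


# The slide's example: I = 3.2, three matched b ions, four matched y ions.
print(round(hyperscore(3.2, 3, 4), 1))
```

Because of the factorials, the score grows very quickly with the number of matched ions, so a few additional matched b or y peaks outweigh moderate differences in intensity.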

Xcorr
– q is the query spectrum
– t is the theoretical spectrum
The inner product of q and t is computed with t at several offsets: I(q,t) = 3.2, then I(q,t[-75]), I(q,t[-32]), I(q,t[0]), I(q,t[32]), and so on.
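A commonly cited form of SEQUEST's cross-correlation score is the unshifted inner product minus the mean of the inner products taken over a window of offsets. The sketch below assumes that form, with spectra stored as equal-width binned intensity vectors and t[τ] read as "t displaced by τ bins". Whether the zero offset is included in the mean varies between descriptions; it is included here. The function names are hypothetical.

```python
def shifted_inner_product(q, t, shift):
    """Inner product of q with t displaced by `shift` bins.

    Bins shifted outside the spectrum range are dropped.
    """
    total = 0.0
    for i, q_i in enumerate(q):
        j = i - shift
        if 0 <= j < len(t):
            total += q_i * t[j]
    return total


def xcorr(q, t, max_shift=75):
    """Unshifted inner product minus the mean over shifts in [-max_shift, max_shift]."""
    shifts = range(-max_shift, max_shift + 1)
    background = sum(shifted_inner_product(q, t, s) for s in shifts) / len(shifts)
    return shifted_inner_product(q, t, 0) - background


q = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
t = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
# With a window of +/-2 bins only the zero offset overlaps, so the
# background is 2/5 and the score is 2 - 0.4.
print(round(xcorr(q, t, max_shift=2), 3))
```

Subtracting the background penalizes spectra that would score well at any offset, which is what makes this score slower than a plain inner product.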

Protein Sequence Databases
– Completeness: a complete database → longer searching time
– Redundancy: sequence variations can be found → a redundant database can distort the statistics
– Quality of sequence annotation

Entrez Protein DB
– Most complete, redundant
Reference Sequence (RefSeq) and UniProt (Swiss-Prot and TrEMBL)
– Well annotated, non-redundant
International Protein Index (IPI)
– Represents a good balance between redundancy and completeness.
– Contains cross-references to Ensembl, UniProt and RefSeq.
Sequences from a single genome
– Difficult to obtain good statistics on small datasets.

Taxonomy
– Allows searches to be limited to entries from particular species or groups of species.
– Speeds up a search and ensures that the hit list contains only entries from the selected species.
For non-redundant databases, a single entry may represent identical sequences from multiple species. The accession string and title text from the FASTA entry, listed on the master results page, will usually describe just one of these entries. To see the equivalent entries, and to explore their taxonomy, follow the accession number link in the results list to the Protein View. If the hit is from a non-redundant database and represents multiple entries with identical sequences, the Protein View will include links to NCBI Entrez and the NCBI Taxonomy Browser for all equivalent entries.

Run time
Database search has to enumerate all peptides and compare them to all experimental spectra. This can be slow with large protein sequence databases, especially when a slow scoring function such as Xcorr is applied.

Speedup techniques
– Fast database indexing: fast implementation of sequence indexing in the database
– Parent mass check: PTMs can be lost
– SEQUEST's preliminary score
– Tag-based filtering (de novo hybrid): increases the specificity (or sensitivity)

Advanced database indexing
– Better implementation of the sequence indexing
– Better representation of protein sequences

Parent mass check
Before the spectra comparison, candidate peptides from the protein sequence DB are filtered by the parent mass of the experimental spectrum; peptides whose mass does not match are not scored.
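A parent mass check can be sketched as below. Several things are assumptions made for the illustration: monoisotopic residue masses, a simplified digestion rule (cleave after every K or R, with no missed cleavages and no proline exception), a mass tolerance of 1.0, and all function names.

```python
import re

# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def digest(protein):
    """Simplified tryptic digest: cleave after every K or R."""
    return [p for p in re.split(r"(?<=[KR])", protein) if p]


def candidates_by_parent_mass(peptides, parent_mass, tolerance=1.0):
    """Keep only peptides whose mass lies within `tolerance` of the parent mass."""
    return [p for p in peptides if abs(peptide_mass(p) - parent_mass) <= tolerance]


# Prefix of the second example protein shown in the slides.
peptides = digest("MEICRGLRSHLITLLLFLFHSETICRPSGRK")
print(peptides)
# ['MEICR', 'GLR', 'SHLITLLLFLFHSETICR', 'PSGR', 'K']

parent_mass = peptide_mass("SHLITLLLFLFHSETICR")
print(candidates_by_parent_mass(peptides, parent_mass))
# ['SHLITLLLFLFHSETICR']
```

Only the surviving candidates need the expensive spectrum comparison. The sketch also shows why PTMs can be lost with this filter: a modified peptide has a shifted mass, so its unmodified database sequence falls outside the tolerance and is never scored.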

Fast prescoring (used in SEQUEST)
The so-called $S_p$ score, in which R(q,t) is the maximum number of consecutive matched b-y ions.
Example from the slide figure (partially preserved): $S_p = 3.2 \cdot 7 \cdot (\ldots \cdot 4)/10$
SEQUEST selects the 500 top-scoring peptides by $S_p$ and rescores them using Xcorr.

Sequence tag based filtering
Extract short amino acid tags from the experimental spectra using a spectrum graph: the nodes are the peaks, and two peaks whose masses differ by the mass of an amino acid are linked by an edge.
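The spectrum graph idea can be sketched as follows. This is a simplified illustration under several assumptions: peaks are given as singly read ladder masses, residue masses are monoisotopic, I is folded into L (they have the same mass), edges are read in increasing-mass order only, and the tolerance, tag length and function names are all hypothetical.

```python
# Monoisotopic residue masses in daltons; I is omitted because it is
# indistinguishable from L by mass.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def build_spectrum_graph(peak_masses, tolerance=0.01):
    """Link peaks whose mass difference equals one amino acid residue mass.

    Returns the sorted masses and a dict {node: [(next_node, amino_acid), ...]}.
    """
    masses = sorted(peak_masses)
    edges = {i: [] for i in range(len(masses))}
    for i, low in enumerate(masses):
        for j in range(i + 1, len(masses)):
            diff = masses[j] - low
            for aa, aa_mass in RESIDUE_MASS.items():
                if abs(diff - aa_mass) <= tolerance:
                    edges[i].append((j, aa))
    return masses, edges


def extract_tags(peak_masses, tag_length=3, tolerance=0.01):
    """Enumerate (tag, prefix_mass) pairs along paths of `tag_length` edges."""
    masses, edges = build_spectrum_graph(peak_masses, tolerance)
    tags = set()

    def walk(node, start, letters):
        if len(letters) == tag_length:
            tags.add(("".join(letters), masses[start]))
            return
        for nxt, aa in edges[node]:
            walk(nxt, start, letters + [aa])

    for start in range(len(masses)):
        walk(start, start, [])
    return sorted(tags)


# A synthetic ladder spelling A, V, G from mass 0.0, plus one noise peak at 150.0.
peaks = [0.0, 71.03711, 150.0, 170.10552, 227.12698]
print(extract_tags(peaks))
# [('AVG', 0.0)]
```

The prefix mass recorded with each tag is the mass of the peak where the tag starts, which is what allows a tag to be anchored at the right position in a candidate peptide.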

Generates short peptide sequence tags from the spectrum, and uses these tags to filter the protein sequence database. Tags make database search much faster, analogous to the way that BLAST's filter speeds up sequence search.
Example tags from the slide figure (tag, prefix mass): AVG (0.0), WTD, PET

Tag-based filtering
Candidate peptides:
– MDHPEDESHSEK
– QDDEEALARLEEIK
– SIEAKLTLR
– QNNLNPERPDSAYLR
– LKQINEEQREGLR
– FVSEAVTAICEAK
– SSDIQAAVQICSLLHQR
– EFSASLTQGLLK
– SAEDLEADK
Peptides containing a tag (tag set off by spaces):
– QDD EEA LARLEEIK
– LKQIN EEQ REGLR
– EFSAS LTQ GLLK
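The filtering step on this slide amounts to a substring test. The sketch below uses the slide's peptides and the three tags it highlights (EEA, EEQ, LTQ); the function name `filter_by_tags` is hypothetical, and a real implementation would typically also check the tag's prefix mass.

```python
def filter_by_tags(peptides, tags):
    """Keep peptides that contain at least one of the sequence tags."""
    return [p for p in peptides if any(tag in p for tag in tags)]


peptides = [
    "MDHPEDESHSEK", "QDDEEALARLEEIK", "SIEAKLTLR", "QNNLNPERPDSAYLR",
    "LKQINEEQREGLR", "FVSEAVTAICEAK", "SSDIQAAVQICSLLHQR", "EFSASLTQGLLK",
    "SAEDLEADK",
]
tags = ["EEA", "EEQ", "LTQ"]

print(filter_by_tags(peptides, tags))
# ['QDDEEALARLEEIK', 'LKQINEEQREGLR', 'EFSASLTQGLLK']
```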

Summary
Experimental spectra are compared to a protein sequence database. The main components are:
– Scoring function
– Protein database
– Speedup techniques

Validation

Peptide assignments and their scores:
– Score: 4, Peptide: AELDLNMTR
– Score: 32, Peptide: SHLITLLLFLFHSETICR
– Score: 3, Peptide: MEICRGLR
– Score: 15, Peptide: LLHGDPGEEDK
– Score: 4, Peptide: MDHPEDESHSEK
– Score: 5, Peptide: SAEDLEADK
– Score: 3, Peptide: SIEAKLTLR
How can peptide assignments be approved or rejected automatically? Why is it necessary?

Why is it necessary to do it automatically?
– Human judgment is biased and can be unreliable.
– Millions of spectra are produced per day.
– Judging an assignment by visually inspecting the spectrum is very difficult.

Two computational approaches:
– Relative score
– Probability-based scoring

Relative score: SEQUEST's delta score (ΔCn)

Relative score ΔCn for the example assignments:
– ΔCn = (32 - 4)/32 = 0.875
– ΔCn = (4 - 4)/4 = 0
– ΔCn = (3 - 3)/3 = 0
– ΔCn = (15 - 4)/15 = 0.733
Keep the peptide assignments whose ΔCn exceeds a certain limit.
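The relative (delta) score used in these examples is the gap between the best and second-best score, divided by the best score. A minimal sketch, assuming each spectrum comes with the list of scores of its candidate peptides; the threshold value 0.1 and the function names are illustrative assumptions only.

```python
def delta_cn(scores):
    """(best - second best) / best for the candidate scores of one spectrum."""
    if len(scores) < 2:
        raise ValueError("need at least two candidate scores")
    ranked = sorted(scores, reverse=True)
    if ranked[0] <= 0:
        return 0.0
    return (ranked[0] - ranked[1]) / ranked[0]


def accept(scores, limit=0.1):
    """Keep the top assignment only if its delta score exceeds `limit`."""
    return delta_cn(scores) > limit


print(delta_cn([32, 4]))  # 0.875
print(delta_cn([4, 4]))   # 0.0
print(accept([15, 4]), accept([3, 3]))  # True False
```

A delta score near 0 means the runner-up fits the spectrum about as well as the top hit, so the top assignment cannot be distinguished from its competitors.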

Probability-based peptide assignment validation:
– Compute the statistical significance of the score. The statistical significance of a score s is the probability of observing a random score x that is higher than or equal to s, formally P(x ≥ s). This probability is called the p-value. Three approaches:
  1. Using analytical functions.
  2. Fitting a distribution to a sample of random scores.
  3. Non-parametric approach.
– Compute the probability that the peptide assignment with the corresponding score is correct.
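The non-parametric route can be illustrated with an empirical p-value: the fraction of random scores that are at least as high as the observed score. The sample of random scores below is made up, and the function name `empirical_p_value` is hypothetical.

```python
def empirical_p_value(score, random_scores):
    """Estimate P(x >= score) as the fraction of random scores >= score."""
    if not random_scores:
        raise ValueError("need a non-empty sample of random scores")
    hits = sum(1 for x in random_scores if x >= score)
    return hits / len(random_scores)


# Made-up sample of scores from random (incorrect) matches.
random_scores = [3, 4, 4, 5, 3, 15, 4, 5, 3, 4]

print(empirical_p_value(15, random_scores))  # 0.1
print(empirical_p_value(4, random_scores))   # 0.7
```

The resolution of such an estimate is limited by the sample size: with ten random scores the smallest non-zero p-value is 0.1, which is why a large sample of random scores is needed.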

Probability-based peptide assignment validation:
The probability-based approach measures, very loosely speaking, how far the score is from a random score.

Probability-based peptide assignment validation:
A random score is a score obtained by comparing a randomly selected experimental spectrum with a randomly selected theoretical spectrum. The random score has a probability density distribution, which depends on the scoring function. This distribution serves as the null hypothesis.

Probability based peptide assignment validation: The distribution depends on the scoring function. Random matches are caused by matches with noise.

Probability based peptide assignment validation: 1. Analytical function. The function depends on the scoring function, and its parameters are calculated from the spectra to be compared.
1. In the case of the SPC scoring function, the distribution of the random scores can be modeled with a hypergeometric distribution.
2. In the case of the inner product scoring function, the random scores can be modeled with a normal distribution.
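As an illustration of the analytical route for the SPC score, the sketch below computes a hypergeometric tail probability. It rests on assumptions that are not in the slides: the m/z axis is discretized into `n_bins` bins, each peak occupies one bin, and peaks fall into bins uniformly at random under the null hypothesis. The function and parameter names are hypothetical.

```python
from math import comb


def spc_p_value(n_bins, n_theoretical, n_experimental, shared):
    """P(X >= shared) when X, the shared peak count, is hypergeometric.

    n_bins:          number of m/z bins (assumed discretization)
    n_theoretical:   number of bins occupied by theoretical peaks
    n_experimental:  number of bins occupied by experimental peaks
    shared:          observed shared peak count (the SPC score)
    """
    upper = min(n_theoretical, n_experimental)
    total = comb(n_bins, n_experimental)
    tail = sum(
        comb(n_theoretical, i) * comb(n_bins - n_theoretical, n_experimental - i)
        for i in range(shared, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 2000 bins, 40 theoretical and 60 experimental peaks.
for k in (1, 5, 10):
    print(k, spc_p_value(2000, 40, 60, k))
```

The more peaks two spectra share, the smaller the tail probability, which is the p-value of the SPC score under this simplified null model.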

Probability based approach: 2. Fitting a distribution. Build a histogram of the scores obtained during the comparison. Fit a known distribution function to it, and use the fitted function to calculate the p-value of the top score.
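A minimal sketch of the fitting approach follows. It assumes, purely for illustration, that the random scores are well described by a normal distribution; the slides do not say which distribution to fit, and a different family may suit a given scoring function better. The function name is hypothetical.

```python
from math import erfc, sqrt
from statistics import mean, stdev


def normal_fit_p_value(random_scores, top_score):
    """Fit a normal distribution to the random scores and return P(x >= top_score)."""
    mu = mean(random_scores)
    sigma = stdev(random_scores)  # sample standard deviation; needs >= 2 scores
    z = (top_score - mu) / sigma
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * erfc(z / sqrt(2))


# Hypothetical sample of random scores and a top score well above them.
sample = [1, 2, 3, 4, 5]
print(normal_fit_p_value(sample, 10))
```

A top score equal to the sample mean gets a p-value of 0.5, and the p-value shrinks as the top score moves into the upper tail.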

Probability based approach: Decoy approach. Make a dummy dataset that is big enough to obtain solid statistics. A decoy dataset can be made by:
1. Random shuffling.
2. Markov-chain generated amino acid sequences.
3. More typically, simply reversing the sequences of the proteins in the database. This is sometimes called a reverse database.
No correct matches are expected from the decoy dataset, so the scores obtained on the decoy dataset give a good estimate of the random score distribution.

Spectra comparison: the experimental spectra are compared against the protein sequence DB, and a decoy protein sequence DB is derived from it. Example entry of the protein sequence DB:
>IPI:IPI |SWISS-PROT:P
MEICRGLRSHLITLLLFLFHSETICRPSGRKSSKMQAFRIWDVNQKTFYLRNNQLVAGYLQGPNVN
LEEKIDVVPIEPHALFLGIHGGKMCLSCVKSGDETRLQLEAVNITDLSENRKQDKRFAFIRSDSGPTT
SFESAACPGWFLCTAMEADQPVSLTNMPDEGVMVTKFYFQEDE

Spectra comparison: the experimental spectra are also compared against the decoy protein sequence DB, which yields the decoy scores. Example entry of the decoy protein sequence DB:
>Decoy_protein_sequence_1
EDEQFYFKTVMVGEDPMNTRLSVPQDAEMATCLFWGPCAASEFSTTPGSDSRIFAFRKDQKRNE
SLDTINVAELQLRTEDGSKVCSLCMKGGHIGLFLAHPEIPVVDIKEELNVNPGQLYGAVLQNNRLYF
TKQNVDWIRFAQMKSSKRGSPRCITESHFLFLLLTILHSRLGRCIEM
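Building a decoy database by reversal or shuffling could be sketched as follows. The function names, the `Decoy_` header prefix and the dictionary layout (header mapped to sequence) are assumptions made for this example, not a format prescribed by any search engine.

```python
import random


def reverse_decoy(sequence):
    """Decoy by reversing the amino acid sequence."""
    return sequence[::-1]


def shuffle_decoy(sequence, seed=0):
    """Decoy by randomly shuffling the residues (seeded for reproducibility)."""
    rng = random.Random(seed)
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def make_decoy_db(proteins):
    """Map each target entry to a reversed decoy entry with a prefixed header."""
    return {f"Decoy_{header}": reverse_decoy(seq) for header, seq in proteins.items()}


# Short fragment used only for illustration.
print(reverse_decoy("MEICRGLR"))
print(make_decoy_db({"protein_1": "MEICRGLR"}))
```

Both constructions preserve the amino acid composition and length of each protein, so the decoy database has the same size as the target database.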

Decoy approach, searching the experimental spectra against both the protein sequence DB and the decoy protein sequence DB: It can provide a more accurate model of the random score distribution, but it doubles the execution time. It is a frequently applied approach!

3. Non-parametric approach. Instead of fitting a probability density function to the histogram, calculate the fraction of the scores on the decoy dataset that are greater than or equal to the actual top score.
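The non-parametric estimate amounts to counting. A minimal sketch, with a hypothetical function name and made-up decoy scores:

```python
def empirical_p_value(decoy_scores, top_score):
    """Fraction of decoy scores that are greater than or equal to the top score."""
    hits = sum(1 for s in decoy_scores if s >= top_score)
    return hits / len(decoy_scores)


# Hypothetical decoy scores; two of the ten are >= 9.
decoy_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(empirical_p_value(decoy_scores, 9))
```

The resolution of this estimate is limited by the number of decoy scores: with 10 scores the smallest non-zero p-value is 0.1, which is why the decoy dataset must be big enough.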

False Positive Rate (FPR): the probability of labelling a random score significant (area B in the figure). An FPR of 0.01 means that 1% of the random scores are labelled significant. E-value: the E-value of a query hit with score s is the expected number of database elements with a random score greater than or equal to s in a database of n entries. For instance, an E-value of 0.01 means that the score s is expected to occur by chance only once in 100 independent similarity searches over the database. If the E-value is 10, then ten random hits with a score greater than or equal to s are expected within a single similarity search.
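The relation between the p-value and the E-value can be written out directly. The sketch assumes the n database comparisons are independent, so that the expected number of random hits is n times the p-value. Both function names are hypothetical.

```python
def e_value(p_value, database_size):
    """Expected number of random hits with score >= s in one search of the database."""
    return p_value * database_size


def searches_per_chance_hit(e):
    """Number of independent searches in which one chance hit is expected."""
    return 1.0 / e


# Hypothetical numbers: a database of 10,000 entries.
print(e_value(1e-6, 10_000))   # small E-value: a chance hit is rare
print(e_value(1e-3, 10_000))   # large E-value: many chance hits per search
print(searches_per_chance_hit(0.01))
```

Unlike the p-value, the E-value grows with the database size, so the same score is less convincing when it is found in a larger database.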

False Discovery Rate (FDR): the ratio of random scores among the scores labelled significant, formally FDR = A / (A + B). An FDR of 0.01 means that 1% of the scores labelled significant were actually observed by chance. FDR is often used to control the proportion of false positives. The threshold T can be set to keep the FDR under a certain level. Typical levels are 0.01 or 0.05, i.e. experimenters set thresholds that allow 1% or 5% false positives. The lower the FDR, the more true (non-random) similarity hits are lost. The decoy dataset is used to calculate the FDR.
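One common way to use the decoy dataset here is to count decoy hits above the threshold as a stand-in for the number of false target hits. The sketch below follows that idea. It assumes the target and decoy databases have the same size and are searched separately, and all names and scores are hypothetical.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR as (# decoy scores >= T) / (# target scores >= T)."""
    n_target = sum(s >= threshold for s in target_scores)
    n_decoy = sum(s >= threshold for s in decoy_scores)
    return n_decoy / n_target if n_target else 0.0


def lowest_threshold_for_fdr(target_scores, decoy_scores, max_fdr=0.01):
    """Return the lowest observed target score usable as threshold T, or None."""
    for t in sorted(set(target_scores)):
        if estimate_fdr(target_scores, decoy_scores, t) <= max_fdr:
            return t
    return None


# Hypothetical top scores from the target and the decoy search.
targets = [10, 9, 8, 7, 3, 2]
decoys = [4, 3, 2, 1, 1, 1]
print(estimate_fdr(targets, decoys, 3))
print(lowest_threshold_for_fdr(targets, decoys, max_fdr=0.25))
```

Raising the threshold lowers the estimated FDR but also discards target hits, which is the trade-off the slide describes.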

Summary:
1. Peptide assignment has to be validated.
2. Relative scoring or probability based scoring can be applied.
3. False positives (false assignments) can be kept under a certain level.

Protein Inference

Take the peptides that passed the validation. This section is about inferring the proteins that could have produced these peptides. The task is not trivial. Example of validated assignments from the experimental spectra:
Score: 32 Peptide: SHLITLLLFLFHSETICR
Score: 15 Peptide: LLHGDPGEEDK

Peptides: MDHPEDESHSEK, QDDEEALARLEEIK, SIETLR, QNNLNPERPDSAYLR, LKQINEEQREGLR, FVSEAVTAICEAK, SSDIQAAVQICSLLHQR, EFSASLTQGLLK, SAEDLEADK
Proteins: [figure: the peptides mapped to candidate proteins]

By Occam’s razor, Protein A should be preferred. Proteins A, B and C can be homologous proteins.
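The Occam's razor idea can be illustrated with a greedy parsimony sketch: repeatedly pick the protein that explains the most still-unexplained peptides. This is a simple set-cover heuristic chosen for illustration, not the method of any particular tool, and the protein-to-peptide mapping below is hypothetical.

```python
def parsimonious_proteins(protein_to_peptides, observed_peptides):
    """Greedily choose a small set of proteins that explains the observed peptides."""
    uncovered = set(observed_peptides)
    chosen = []
    while uncovered:
        # Iterate in sorted order so that ties are broken deterministically.
        best = max(
            sorted(protein_to_peptides),
            key=lambda p: len(uncovered & protein_to_peptides[p]),
        )
        gained = uncovered & protein_to_peptides[best]
        if not gained:
            break  # remaining peptides match no protein in the mapping
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical mapping: Protein A alone explains all three peptides.
mapping = {
    "Protein A": {"MDHPEDESHSEK", "SIETLR", "SAEDLEADK"},
    "Protein B": {"MDHPEDESHSEK", "SIETLR"},
    "Protein C": {"SAEDLEADK"},
}
print(parsimonious_proteins(mapping, ["MDHPEDESHSEK", "SIETLR", "SAEDLEADK"]))
```

Here Protein A is selected and Proteins B and C are not needed, although the data cannot rule out that they were also present in the sample.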

Many models have been developed to cope with this problem: statistical, graph theory based and spectral network based. A well-known method is ProteinProphet.

Summary
Workflow: Input data, Peptide identification, Validation, Protein inference, Quantitation, Interpretation.
Topics covered: data formats, database searching, statistical methods for validation, protein assembling.

Database Searching
Advantages:
– Simple and straightforward.
– Has a limited search space.
– Completeness.
– Statistical analysis can be carried out.
Disadvantages:
– Has a limited search space: it is limited to the database.
– Enumerating all candidates is too slow, particularly when modifications and non-tryptic peptides must be considered. (A modern instrument produces a million spectra per day.)