Informatics for proteomic inventories
Biomedical Informatics, Vanderbilt University



Overview
– Explaining the whys and hows of proteomics
– Matching peptides from protein sequence databases to MS/MS spectra
– Filtering peptide-spectrum matches (PSMs) to an acceptable false discovery rate (FDR)
– Inferring proteins parsimoniously and scalably

Methods capture only part of the story
– Genomics and epigenetics describe the state of the “catalog.”
– Transcriptomics describes current “purchase orders.”
– Proteomics measures the current inventory of cell capabilities.
– Metabolomics examines cell state most directly.
Image credits: J_Alves (glycine tRNA), J_Alves (glucose and cholesterol), ElaineMeng (H-ras, PDB 121P)

What does proteomics include?
– Protein Inventories
– Tissue Imaging
– 1D and 2D Gel Electrophoresis
– Protein Quantitation
– Post-Translational Modifications
Image credits: Gerald_G (scales), Gsagri04 (gel), AB SCIEX (tissue image)

Discovery Proteomics
Workflow stages shown: Protein Mixture; Peptide Mixture; Peptide Fractionation; Liquid Chromatography; Electrospray Ionization; High-Resolution Mass Spectrometry; Isolate Ions of Peptide; Collide Ions to Dissociate; Collect Fragments in Tandem MS.
Two types of measurements are made for each peptide: the intact m/z (mass/charge) and a list of fragment m/z values.

Collision-induced dissociation (CID)
– “Tickle” energizes the peptide, causing varied conformations and proton movement.
– A mobile proton associates with a carbonyl adjoining a peptide bond, drawing electrons.
– Electrons of the prior carbonyl attack, forming a ringed intermediate that quickly dissociates.
Wysocki et al, Anal. Chem. (2000) 35:
Paizs and Suhai, Rapid Comm. Mass Spectrom. (2002) 16:

Broken peptide bonds yield fragments
Example peptide TSIIGTIGPK: breaking one peptide bond yields the N-terminal b4 ion (TSII) and the complementary C-terminal y6 ion (GTIGPK).
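The b4/y6 example can be sketched numerically. This is a minimal illustration, assuming standard monoisotopic residue masses and singly protonated fragments; the function names are hypothetical and the mass table covers only the residues of TSIIGTIGPK.

```python
# Monoisotopic residue masses (Da), limited to the residues in TSIIGTIGPK.
RESIDUE_MASS = {
    "G": 57.02146, "S": 87.03203, "P": 97.05276,
    "T": 101.04768, "I": 113.08406, "K": 128.09496,
}
WATER = 18.01056   # added to the C-terminal (y) fragment and to the intact peptide
PROTON = 1.00728   # mass of each charge-carrying proton


def b_ion_mz(peptide, n, charge=1):
    """m/z of the b ion holding the first n residues (n >= 1)."""
    neutral = sum(RESIDUE_MASS[aa] for aa in peptide[:n])
    return (neutral + charge * PROTON) / charge


def y_ion_mz(peptide, n, charge=1):
    """m/z of the y ion holding the last n residues (n >= 1)."""
    neutral = sum(RESIDUE_MASS[aa] for aa in peptide[-n:]) + WATER
    return (neutral + charge * PROTON) / charge


def peptide_mass(peptide):
    """Neutral monoisotopic mass of the intact peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    pep = "TSIIGTIGPK"
    print(f"b4 m/z = {b_ion_mz(pep, 4):.4f}")
    print(f"y6 m/z = {y_ion_mz(pep, 6):.4f}")
```

Because b4 and y6 come from the same broken bond, their singly charged m/z values sum to the neutral peptide mass plus two protons, which is a handy consistency check when annotating spectra.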

Annotated spectrum of peptide HFISELEK, +2 charge state
Labeled fragments: HF-, -LEK, -SELEK, -ISELEK, -FISELEK, and neutral loss of water from the peptide.

Same spectrum compared to FHEIKELS instead of HFISELEK
– The fragment FH- has the same mass as HF-.
– The fragment -EIKELS has the same mass as -ISELEK.
– Neutral loss of water from the peptide is also labeled.
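The ambiguity comes down to fragments with the same residue composition having the same mass, whatever the residue order. A minimal check, assuming standard monoisotopic residue masses (the helper name is hypothetical):

```python
# Monoisotopic residue masses (Da) for the residues used in this example.
# Leucine and isoleucine are isobaric, so they share one value.
RESIDUE_MASS = {
    "S": 87.03203, "L": 113.08406, "I": 113.08406, "K": 128.09496,
    "E": 129.04259, "H": 137.05891, "F": 147.06841,
}


def residue_sum(fragment):
    """Sum of residue masses; residue order does not affect the result."""
    return sum(RESIDUE_MASS[aa] for aa in fragment)


if __name__ == "__main__":
    for left, right in [("HF", "FH"), ("ISELEK", "EIKELS"), ("HFISELEK", "FHEIKELS")]:
        delta = abs(residue_sum(left) - residue_sum(right))
        print(f"{left} vs {right}: mass difference {delta:.6f} Da")
```

Matching the precursor mass and a few fragment masses is therefore not enough to single out one sequence; the remaining fragments have to discriminate between candidates.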

Disassembly and reassembly
After AI Nesvizhskii, Mol Cell Proteomics (2005) 4:

Database search overview
Eng et al (1994) J. Amer. Soc. Mass Spectrom. 5:
Yates et al (1995) Anal. Chem. 67:

Emulating proteases in silico
N Edwards and R Lippert. Lecture Notes In Computer Science (2002) 2452:
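One way to emulate a protease in silico is sketched below. The slide does not name an enzyme or rule, so this assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P); the function name, parameters, and test sequence are hypothetical.

```python
def tryptic_peptides(protein, missed_cleavages=0, min_length=1):
    """Return peptides from an assumed trypsin rule: cut after K/R, not before P."""
    # Record cut positions as indices where a new peptide starts.
    sites = [0]
    for i in range(len(protein) - 1):
        if protein[i] in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))

    peptides = []
    for start in range(len(sites) - 1):
        # Allow up to `missed_cleavages` skipped cut sites per peptide.
        for extra in range(missed_cleavages + 1):
            end = start + 1 + extra
            if end >= len(sites):
                break
            peptide = protein[sites[start]:sites[end]]
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


if __name__ == "__main__":
    # Hypothetical sequence; note that the R followed by P is not cleaved.
    print(tryptic_peptides("MKTAYRPLEKHFISELEK"))
    print(tryptic_peptides("MKTAYRPLEKHFISELEK", missed_cleavages=1))
```

Allowing missed cleavages enlarges the candidate list, so search engines usually expose it as a user parameter.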

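Dynamic PTMs grow search space
Because each peptide may carry multiple PTMs, adding dynamic PTMs to a search multiplies the candidates exponentially: every modifiable site doubles the number of variants. Here, three sites lead to eight PTM variants (2^3 = 8).
Example protein: CASA1_BOVIN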
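The doubling per site can be seen by enumerating every modified/unmodified combination. This is only a sketch: the peptide ASPTYK is a made-up example (not taken from CASA1_BOVIN), and marking S, T, and Y as modifiable is an assumption standing in for a dynamic phosphorylation search.

```python
from itertools import product


def ptm_variants(peptide, modifiable="STY", mark="*"):
    """List every combination of modified/unmodified states over the modifiable sites."""
    sites = [i for i, aa in enumerate(peptide) if aa in modifiable]
    variants = []
    for states in product([False, True], repeat=len(sites)):
        residues = list(peptide)
        for site, modified in zip(sites, states):
            if modified:
                residues[site] = peptide[site] + mark  # tag the modified residue
        variants.append("".join(residues))
    return variants


if __name__ == "__main__":
    for variant in ptm_variants("ASPTYK"):  # hypothetical peptide with three sites
        print(variant)
```

With n modifiable sites the list has 2^n entries, which is why search engines usually cap the number of dynamic modifications per peptide.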

Peptide mass filter
– Sequences outside the mass tolerance are not compared.
– Many sequences may share a common mass.
– Sequences of one mass may score differently.
– Sequences of different mass may score the same.
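A mass filter can be sketched as a binary search over candidates sorted by mass. The function names, the parts-per-million tolerance, and the three-entry lookup table are illustrative assumptions; the masses are monoisotopic values computed from standard residue masses.

```python
from bisect import bisect_left, bisect_right


def build_index(peptide_masses):
    """Sort peptides by mass and return parallel lists (peptides, masses)."""
    pairs = sorted(peptide_masses.items(), key=lambda item: item[1])
    peptides = [peptide for peptide, _ in pairs]
    masses = [mass for _, mass in pairs]
    return peptides, masses


def candidates(peptides, masses, precursor_mass, tol_ppm=10.0):
    """Return peptides whose mass lies within tol_ppm of the precursor mass."""
    delta = precursor_mass * tol_ppm / 1e6
    lo = bisect_left(masses, precursor_mass - delta)
    hi = bisect_right(masses, precursor_mass + delta)
    return peptides[lo:hi]


# Illustrative neutral monoisotopic masses (Da).
EXAMPLE_MASSES = {
    "TSIIGTIGPK": 985.58077,
    "HFISELEK": 1001.51817,
    "FHEIKELS": 1001.51817,
}

if __name__ == "__main__":
    peptides, masses = build_index(EXAMPLE_MASSES)
    print(candidates(peptides, masses, 1001.5180))
```

Both HFISELEK and FHEIKELS pass the filter for the same precursor, so the filter only narrows the field; fragment scoring has to pick between them.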

Fragment masses and charge segregation
[Figure: diagram with labels AA, HOH, and H+]

Sequest cross correlation
– Normalize the observed spectrum.
– Generate a model spectrum for each candidate.
– Convert the observed and model spectra to the frequency domain by FFT.
– Cross-correlate, reporting the ratio between the zero-offset alignment and nearby alignments.
J Eng et al. J. Proteome Res. (2008) 7:
J Eng et al. J Amer. Soc. Mass. Spectrom. (1994) 5:

X!Tandem scoring
– Predict more accurate fragment intensities.
– Count matched b ions and matched y ions.
– Compute the dot product of intensities.
– Generate hyperscore = (dot product) × (number of matched b ions)! × (number of matched y ions)!
– Build a histogram of scores per spectrum.
– Report an expectation value.
Craig and Beavis. Rapid Comm. Mass Spectrom. (2003) 17:
Fenyö and Beavis. Anal. Chem. (2003) 75:
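The hyperscore step can be sketched as follows, assuming the commonly published form: the dot product of matched intensities multiplied by the factorials of the matched b and y ion counts. This is not X!Tandem's actual code; the function names, the fixed fragment tolerance, and the use of unit predicted intensities are simplifying assumptions.

```python
from math import factorial


def matched_intensity(observed, mz, tol):
    """Highest observed intensity within tol of a predicted m/z, or 0.0 if none."""
    best = 0.0
    for peak_mz, intensity in observed:
        if abs(peak_mz - mz) <= tol:
            best = max(best, intensity)
    return best


def hyperscore(observed, b_mzs, y_mzs, tol=0.5):
    """Dot product of matched intensities times N_b! times N_y!."""
    b_hits = [matched_intensity(observed, mz, tol) for mz in b_mzs]
    y_hits = [matched_intensity(observed, mz, tol) for mz in y_mzs]
    n_b = sum(1 for hit in b_hits if hit > 0)
    n_y = sum(1 for hit in y_hits if hit > 0)
    # Predicted intensities are assumed to be 1, so the dot product is a plain sum.
    dot = sum(b_hits) + sum(y_hits)
    return dot * factorial(n_b) * factorial(n_y)


if __name__ == "__main__":
    spectrum = [(100.0, 2.0), (200.0, 3.0), (300.0, 5.0)]  # toy (m/z, intensity) peaks
    print(hyperscore(spectrum, b_mzs=[100.1, 250.0], y_mzs=[200.0, 300.2]))
```

The factorial terms reward candidates that match many ions in both series, which spreads the scores of good and random matches further apart before the per-spectrum histogram is built.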

Random match probabilities
– Imagine the spectrum as a jar of 100 black and 900 white marbles (peaks and voids).
– Sample 20 marbles for a predicted peak list, drawing 15 black and 5 white.
– Compute the probability of a random match by the hypergeometric distribution: $P(X = 15) = \binom{100}{15}\binom{900}{5} / \binom{1000}{20}$
T Fridman. J. Bioinfo. Computat. Bio. (2005) 3:
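The marble example can be evaluated directly with the hypergeometric formula. A minimal sketch using only the standard library (the function names are hypothetical; `math.comb` requires Python 3.8 or later):

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """Probability of exactly k successes when drawing without replacement."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))


def hypergeom_tail(k, population, successes, draws):
    """Probability of k or more successes, i.e. a match at least this good."""
    upper = min(successes, draws)
    return sum(hypergeom_pmf(i, population, successes, draws)
               for i in range(k, upper + 1))


if __name__ == "__main__":
    # 1000 marbles, 100 black (peaks); draw 20 predicted positions, 15 land on peaks.
    print(f"P(exactly 15) = {hypergeom_pmf(15, 1000, 100, 20):.3e}")
    print(f"P(15 or more) = {hypergeom_tail(15, 1000, 100, 20):.3e}")
```

With only about two matches expected by chance (20 draws × 10% peaks), fifteen matches is an extremely unlikely random outcome, which is what makes the score discriminating.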

Disassembly and reassembly
After AI Nesvizhskii, Mol Cell Proteomics (2005) 4:

The “longest list” problem
Perceived value of early proteomics experiments was linked only to sensitivity. Systems to evaluate specificity lagged behind, and false positive rates were left unchecked.
Two developments were needed:
– Community consensus on reporting standards
– New tools for evaluating identification error rates
Carr et al. Mol. Cell. Proteomics (2004) 3:
Taylor et al. Nature Biotech. (2007) 25:

Strategy I: Target/decoy estimates FDR
The sequence database has equal numbers of target and decoy sequences, so false IDs distribute evenly between target and decoy sequences.
Apply a threshold, and:
– False estimate = 2 × [decoy hit count].
– False Discovery Rate (FDR) = false estimate divided by the number of passing IDs.
Elias and Gygi. Nature Methods (2007) 4:
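The two formulas translate directly into code. This is a minimal sketch, assuming each PSM is a (score, is_decoy) pair where higher scores are better; the function name and the toy scores are hypothetical.

```python
def target_decoy_fdr(psms, threshold):
    """Estimate FDR as 2 x decoy hits divided by all IDs passing the threshold."""
    passing = [(score, is_decoy) for score, is_decoy in psms if score >= threshold]
    if not passing:
        return 0.0
    decoy_hits = sum(1 for _, is_decoy in passing if is_decoy)
    false_estimate = 2 * decoy_hits  # each decoy hit implies one hidden false target hit
    return false_estimate / len(passing)


if __name__ == "__main__":
    # Toy PSMs: (score, is_decoy)
    psms = [(5.0, False), (4.5, False), (4.0, True), (3.5, False),
            (3.0, False), (2.0, True), (1.0, False)]
    for threshold in (4.2, 3.0):
        print(threshold, target_decoy_fdr(psms, threshold))
```

In practice the threshold is tightened until the estimate drops below the desired FDR; the doubling only holds when targets and decoys are equally numerous, as the slide assumes.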

Decoys model false distribution
Elias Nat. Methods (2007) 4:

Strategy II: PeptideProphet
– Estimates a correctness probability for individual identifications
– Combines multiple subscores from each Sequest identification through discriminant function analysis (DFA)
– Fits a mixture model to observed matches with expectation maximization
A Keller. Anal. Chem. (2002) 74:

Discriminant Function Analysis combines sub-scores from Sequest

Mixture model analysis separates true and false distributions
Expectation maximization adjusts two curves to fit the observed data. Here, negatives are fit to a gamma distribution and positives to a normal distribution.

Disassembly and reassembly
After AI Nesvizhskii, Mol Cell Proteomics (2005) 4:

Why are peptides shared among proteins?
“Orthologs are direct evolutionary counterparts derived from a common ancestor through vertical descent; whenever we speak of ‘the same gene in different species,’ we actually mean orthologs. In contrast, paralogs are genes within the same genome that have evolved by duplication.”
Koonin. Genome Biology (2001) 2: comment

Protein isoforms
– A single gene may give rise to many transcripts that overlap for one or more exons.
– When isoforms are listed as separate proteins in the FASTA, a peptide may match a shared or distinctive part of a protein sequence.
– VEGF has eight exons; exon 6, exon 7, both, or neither may be incorporated.

Parsimony
noun: “economy of explanation in conformity with Occam's razor” – Merriam Webster OnLine
“Plurality ought never be posed without necessity.” – William of Occam

IDPicker
1. Assemble maximal protein list.
2. Combine proteins that point to the same peptides, and combine peptides that point to the same proteins.
3. Find “set cover” by greedy algorithm to pick minimal protein list to explain peptides.
B Zhang et al. J. Proteome Res. (2007) 6:
Z Ma et al. J. Proteome Res. (2010) 8:
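The grouping and greedy set-cover steps can be sketched as below. This is a simplified illustration rather than IDPicker's implementation: it groups only indistinguishable proteins (it does not collapse peptides), and the function name and toy protein-to-peptide map are hypothetical.

```python
def parsimonious_proteins(protein_to_peptides):
    """Greedy set cover over protein groups that share identical peptide sets."""
    # Group proteins that point to exactly the same peptides.
    groups = {}
    for protein, peptides in protein_to_peptides.items():
        groups.setdefault(frozenset(peptides), []).append(protein)

    unexplained = set().union(*groups.keys())
    chosen = []
    while unexplained:
        # Pick the group that explains the most still-unexplained peptides.
        best = max(groups, key=lambda peptides: len(peptides & unexplained))
        chosen.append(sorted(groups[best]))
        unexplained -= best
    return chosen


if __name__ == "__main__":
    evidence = {
        "A": {"p1", "p2", "p3"},
        "B": {"p1", "p2", "p3"},  # indistinguishable from A
        "C": {"p3", "p4"},
        "D": {"p2"},              # fully explained by A/B
    }
    print(parsimonious_proteins(evidence))
```

Greedy set cover is not guaranteed to find the smallest possible list, but it is fast and its choices are easy to inspect, which suits large protein-peptide maps.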

Two proteins or seven?
– Sample mixes mouse and human proteins.
– Isoforms, paralogs, and orthologs complicate the protein-peptide map.
– Untangling relationships is non-trivial.
Data from Broad Institute, CPTAC

Greedy algorithm
Data from Broad Institute, CPTAC

ProteinProphet
1. Combine peptide identification probabilities into protein identification probabilities.
2. Distribute probability for shared peptides across multiple proteins.
3. Compute protein probability by subtracting probability that all observed peptides are false from 1.
AI Nesvizhskii. Anal. Chem. (2003) 75:
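Step 3 amounts to one minus the product of the peptide error probabilities. The sketch below assumes peptide identifications are independent and treats the per-peptide weights for shared peptides as given inputs; ProteinProphet itself estimates those weights iteratively, which is not shown. The function name is hypothetical.

```python
def protein_probability(peptide_probs, weights=None):
    """1 minus the probability that every supporting peptide identification is false."""
    if weights is None:
        weights = [1.0] * len(peptide_probs)  # unshared peptides count fully
    p_all_false = 1.0
    for prob, weight in zip(peptide_probs, weights):
        # A shared peptide contributes only the fraction assigned to this protein.
        p_all_false *= 1.0 - weight * prob
    return 1.0 - p_all_false


if __name__ == "__main__":
    print(protein_probability([0.9, 0.8]))              # two unshared peptides
    print(protein_probability([0.9, 0.8], [1.0, 0.5]))  # second peptide half-shared
```

Two moderately confident peptides combine into a more confident protein, while down-weighting a shared peptide lowers the protein probability accordingly.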

Number of Sibling Peptides and Degenerate Peptides
– Number of sibling peptides (NSP) places more confidence in peptides for proteins with abundant supporting evidence.
– Degenerate peptides match multiple potential proteins, each associated with a weight.
– Expectation maximization determines weights that minimize the protein count and maximize protein probability.

Parsimony reduces protein lists
[Figure labels: Maximal list, Grouping indiscernibles, Grouping + parsimony; databases: SwissProt HUMAN, International Protein Index, SwissProt Multispecies]
Zhang et al. J. Proteome Res. (2007) 6:

Protein FDR is not PSM FDR
[Table columns: Minimum PSMs/Prot | Confident PSMs | Distinct Peptides | Distinct Protein Groups | Empirical Protein FDR]
– PSM FDR fixed at 3%
– Two distinct peptides required per protein
– True PSMs group together on true proteins.
– False PSMs spread across the database.
Data from Broad Institute, CPTAC

Takeaway messages
– Tandem mass spectrometry produces lists of fragment m/z values and precursor masses.
– Database search narrows the set of all possible peptides to plausible candidates.
– Controlling peptide and protein FDR is essential for credible, publishable inventories.
– Parsimony and scalable filtering are necessary to field modern data sets.