Improving the Sensitivity of Peptide Identification for Genome Annotation Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown.


Improving the Sensitivity of Peptide Identification for Genome Annotation Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center

Why Tandem Mass Spectrometry? MS/MS spectra provide evidence for the amino-acid sequence of functional proteins. Key concepts: spectrum acquisition is unbiased; direct observation of amino-acid sequence; sensitive to small sequence variations.

Mass Spectrometry for Proteomics. Measure mass of many (bio)molecules simultaneously: high bandwidth. Mass is an intrinsic property of all (bio)molecules: no prior knowledge required.

Mass Spectrometer. Diagram: Sample, Ionizer, Mass Analyzer, Detector. Ionizer: MALDI, Electro-Spray Ionization (ESI). Mass analyzer: Time-Of-Flight (TOF), Quadrupole, Ion-Trap. Detector: Electron Multiplier (EM).

Mass Spectrum

Mass is fundamental

Mass Spectrometry for Proteomics. Measure mass of many molecules simultaneously ...but not too many: abundance bias. Mass is an intrinsic property of all (bio)molecules ...but need a reference to compare to.

Mass Spectrometry for Proteomics. Mass spectrometry has been around since the turn of the 20th century... ...why is MS-based proteomics so new? Ionization methods: MALDI, Electrospray. Protein chemistry & automation: chromatography, gels, computers. Protein sequence databases: a reference for comparison.

Sample Preparation for MS/MS Enzymatic Digest and Fractionation

Single Stage MS

Tandem Mass Spectrometry (MS/MS): precursor selection

Tandem Mass Spectrometry (MS/MS): precursor selection + collision-induced dissociation (CID)

Peptide Fragmentation. Peptide: S-G-F-L-E-E-D-E-L-K
MW     ion   fragments                ion   MW
88     b1    S         | GFLEEDELK    y9    1080
145    b2    SG        | FLEEDELK     y8    1022
292    b3    SGF       | LEEDELK      y7    875
405    b4    SGFL      | EEDELK       y6    762
534    b5    SGFLE     | EDELK        y5    633
663    b6    SGFLEE    | DELK         y4    504
778    b7    SGFLEED   | ELK          y3    389
907    b8    SGFLEEDE  | LK           y2    260
1020   b9    SGFLEEDEL | K            y1    147
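The b- and y-ion masses for the example peptide can be recomputed from residue masses. The sketch below is an illustration, not taken from the slides: it assumes singly charged fragments and standard monoisotopic residue masses, and the function name `fragment_ions` is made up for this example. Under those assumptions every computed value falls within 1 Da of the nominal masses on the slide.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
RESIDUE_MASS = {
    "G": 57.02146, "S": 87.03203, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
PROTON = 1.007276   # mass of the charge-carrying proton
WATER = 18.010565   # y ions retain the C-terminal OH plus an extra H


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i] is b(i+1) (N-terminal prefix); y_ions[i] is y(i+1)
    (C-terminal suffix).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    prefix = 0.0
    for m in masses[:-1]:            # prefixes of length 1 .. n-1
        prefix += m
        b_ions.append(prefix + PROTON)
    suffix = 0.0
    for m in reversed(masses[1:]):   # suffixes of length 1 .. n-1
        suffix += m
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("SGFLEEDELK")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} {b:9.3f}   y{i} {y:9.3f}")
```

Each b ion and its complementary y ion together account for the whole peptide, which is why the two ladders run in opposite directions along the sequence.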

Unannotated Splice Isoform. Human Jurkat leukemia cell-line; lipid-raft extraction protocol, targeting T cells (von Haller, et al. MCP 2003). LIME1 gene: LCK interacting transmembrane adaptor 1. LCK gene: leukocyte-specific protein tyrosine kinase; proto-oncogene; chromosomal aberration involving LCK in leukemias. Multiple significant peptide identifications.

Unannotated Splice Isoform

Translation start-site correction. Halobacterium sp. NRC-1: extreme halophilic Archaeon; insoluble membrane and soluble cytoplasmic proteins (Goo, et al. MCP 2003). GdhA1 gene: glutamate dehydrogenase A1. Multiple significant peptide identifications. Observed start is consistent with Glimmer 3.0 prediction(s).

Halobacterium sp. NRC-1, ORF: GdhA1. K-score E-value vs PepArML @ 10% FDR. Many peptides are inconsistent with the annotated translation start site of NP_279651.

Translation start-site correction

Phyloproteomics. Tandem mass-spectra of proteins (top-down); high-accuracy instrument (Orbitrap, UMD Core). Proteins from unsequenced bacteria matching identical proteins in related organisms. Demonstration using Y. rohdei.

Protein Fragmentation Spectrum. Match to Y. pestis 50S RP L32: A-V-Q-Q-N-K-P-T-R-S-K-R-G-M-R-R-S-H-D-A-L-T-T-A-T-L-S-V-D-K-T-S-G-E-T-H-L-R-H-H-I-T-A-D-G-F-Y-R-G-R-K-V-I-G

Phyloproteomics

Phyloproteomics. Figure: two panels, "Protein Sequence" and "16S-rRNA Sequence", each labeled phylogeny.fr – "One-Click".

Shared "Biomarker" Proteins

Phyloproteomics. Recent extension to highly homologous proteins in related organisms: merely require N- and/or C-terminus in common, which broadens applicability considerably. Phyloproteomic trees for E. herbicola and Enterobacter cloacae, neither sequenced. New paradigm for phylogenetic analysis?

Lost peptide identifications. Missing from the sequence database; search engine strengths, weaknesses, quirks; poor score or statistical significance; thorough search takes too long.

Searching under the street-light… Tandem mass spectrometry doesn’t discriminate against novel peptides... ...but protein sequence databases do! Searching traditional protein sequence databases biases the results in favor of well-understood and/or computationally predicted proteins and protein isoforms!

Peptide Sequence Databases. All amino-acid 30-mers, no redundancy; from ESTs, proteins, mRNAs. 30-40 fold reduction in size and search time. Formatted as a FASTA sequence database, one entry per gene/cluster.
Organism     Size (AA)   Size (Entries)
Human        248Mb       74,976
Mouse        171Mb       55,887
Rat          76Mb        42,372
Zebrafish    94Mb        40,490
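To make "all amino-acid 30-mers, no redundancy" concrete, the sketch below collects the distinct k-mers from a set of overlapping sequences. It only illustrates the counting idea. It is not the compression method behind the databases above, and the function name `distinct_kmers` and the toy sequences are assumptions made for this example.

```python
def distinct_kmers(sequences, k=30):
    """Return the set of distinct length-k substrings over all sequences.

    Sequences shorter than k contribute nothing.
    """
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return seen


# Two hypothetical isoforms that differ at a single residue (S vs A).
isoforms = [
    "MKTAYIAKQRQISFVKSHFSRQ",
    "MKTAYIAKQRQISFVKAHFSRQ",
]
k = 5  # small k so the toy example is easy to follow; the slides use 30
total_windows = sum(max(len(s) - k + 1, 0) for s in isoforms)
unique = distinct_kmers(isoforms, k)
print(f"{total_windows} windows, {len(unique)} distinct {k}-mers")
```

Highly redundant inputs such as ESTs repeat the same windows many times, so keeping each k-mer once is where a large size reduction can come from.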

We can observe evidence for… known coding SNPs; unannotated coding mutations; alternate splicing isoforms; alternate/incorrect translation start-sites; microexons; alternate/incorrect translation frames …though it must be treated thoughtfully.

PeptideMapper Web Service I’m Feeling Lucky

PeptideMapper Web Service. Suffix-tree index on peptide sequence database: fast peptide to gene/cluster mapping; “compression” makes this feasible. Peptide alignment with cluster evidence: amino-acid or nucleotide; exact & near-exact. Genomic-loci mapping via UCSC “known-gene” transcripts and predetermined, embedded genomic coordinates.
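Peptide-to-cluster lookup through a suffix index can be sketched as follows. This is an illustration only: it uses a sorted list of suffixes searched with binary search as a stand-in for a suffix tree, it handles exact matches only, and the cluster identifiers and function names are hypothetical rather than part of PeptideMapper.

```python
from bisect import bisect_left


def build_suffix_index(entries):
    """Build a sorted list of (suffix, cluster_id, offset) tuples.

    entries maps a cluster id to its amino-acid sequence. Storing full
    suffixes is wasteful and only suitable for a toy example.
    """
    suffixes = []
    for cluster_id, seq in entries.items():
        for offset in range(len(seq)):
            suffixes.append((seq[offset:], cluster_id, offset))
    suffixes.sort()
    return suffixes


def find_peptide(index, peptide):
    """Return sorted (cluster_id, offset) pairs of exact occurrences."""
    # All suffixes that start with the peptide are contiguous in sorted order.
    pos = bisect_left(index, (peptide,))
    hits = []
    while pos < len(index) and index[pos][0].startswith(peptide):
        hits.append(index[pos][1:])
        pos += 1
    return sorted(hits)


# Hypothetical cluster entries containing the peptide from the earlier slide.
entries = {
    "cluster1": "MSGFLEEDELKAA",
    "cluster2": "TTSGFLEEDELKR",
}
index = build_suffix_index(entries)
print(find_peptide(index, "SGFLEEDELK"))
```

The lookup cost depends on the peptide length and the number of hits rather than on scanning every entry, which is the property that makes an index attractive for interactive mapping.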

Comparison of search engine results. No single score is comprehensive; search engines disagree; many spectra lack confident peptide assignment. Figure: Venn diagram of X! Tandem, SEQUEST, and Mascot identifications, with region labels 38%, 14%, 28%, 3%, 2%, 1% (Searle et al. JPR 7(1), 2008). Notes: no single engine gives the best results. Q: after improvement, what percentage of spectra is identified, and how large is the improvement? 25-30%.
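A breakdown like the one in the Venn diagram can be tabulated from per-engine sets of confidently identified spectra. The sketch below uses made-up spectrum identifiers and a hypothetical helper `agreement_breakdown`; the numbers it produces are toy values, not the percentages reported by Searle et al.

```python
from collections import Counter


def agreement_breakdown(identified, all_spectra):
    """Map each exact combination of engines to its fraction of spectra.

    identified maps an engine name to the set of spectrum ids it
    confidently identified. The empty combination counts spectra that
    no engine identified.
    """
    counts = Counter()
    for spectrum in all_spectra:
        engines = frozenset(
            name for name, ids in identified.items() if spectrum in ids
        )
        counts[engines] += 1
    total = len(all_spectra)
    return {engines: n / total for engines, n in counts.items()}


# Toy data: ten spectra and three engines with partly overlapping results.
all_spectra = list(range(10))
identified = {
    "Mascot": {0, 1, 2, 3},
    "SEQUEST": {2, 3, 4},
    "X!Tandem": {3, 4, 5, 6},
}
breakdown = agreement_breakdown(identified, all_spectra)
for engines, fraction in sorted(breakdown.items(), key=lambda kv: -kv[1]):
    label = " + ".join(sorted(engines)) or "none"
    print(f"{label:30s} {fraction:.0%}")
```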

Combining search engine results – harder than it looks! Consensus boosts confidence, but... how to assess statistical significance? Gain specificity, but lose sensitivity! Incorrect identifications are correlated too! How to handle weak identifications? Consensus vs disagreement vs abstention; threshold at some significance? We apply unsupervised machine-learning.... Lots of related work unified in a single framework.
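One common way to put a number on statistical significance is target-decoy false discovery rate (FDR) estimation. The slides mention automatic decoy searches and a 10% FDR cutoff but do not say how the FDR is computed, so the sketch below is an assumption: it estimates FDR as decoys divided by targets above a score threshold, treats higher scores as better, and ignores tied scores. The function name and the toy scores are invented for this example.

```python
def score_threshold_at_fdr(psms, max_fdr=0.10):
    """Return the lowest target score whose estimated FDR is <= max_fdr.

    psms is an iterable of (score, is_decoy) pairs, one per
    peptide-spectrum match; higher scores are assumed to be better.
    Returns None if no threshold satisfies the FDR limit.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
            continue
        targets += 1
        # Estimated FDR among everything scoring at least this well.
        if decoys / targets <= max_fdr:
            best = score
    return best


# Toy matches: ten high-scoring targets, then decoys mixed in.
psms = [(float(s), False) for s in range(10, 0, -1)]
psms += [(0.5, True), (0.4, False), (0.3, True), (0.2, False)]
print(score_threshold_at_fdr(psms, max_fdr=0.10))
```

Because the estimate is computed on whatever score is supplied, the same routine can be applied to a single engine's score or to a combined score, which makes results comparable at a fixed FDR.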

Supervised Learning

Unsupervised Learning

Peptide Atlas A8_IP LTQ Dataset

Running many search engines. Search engine configuration can be difficult: correct spectral format; search parameter files and command-line; pre-processed sequence databases; tracking spectrum identifiers; extracting peptide identifications, especially modifications and protein identifiers.

Peptide Identification Meta-Search. Simple unified search interface for: Mascot, X!Tandem, K-Score, OMSSA, MyriMatch, S-Score, InsPecT, KM-Score. Automatic decoy searches. Automatic spectrum file "chunking". Automatic scheduling: Serial, Multi-Processor, Cluster, Grid.
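Spectrum file "chunking" can be pictured as splitting one large spectrum file into batches that are searched independently. The slides do not name a file format or chunk size, so the sketch below assumes Mascot Generic Format (MGF) text, where each spectrum is delimited by `BEGIN IONS` and `END IONS`. The function name `chunk_mgf` and the toy spectra are hypothetical.

```python
def chunk_mgf(lines, spectra_per_chunk):
    """Yield lists of MGF lines holding at most spectra_per_chunk spectra.

    A spectrum is counted when its END IONS line is seen, so spectra
    are never split across chunks.
    """
    chunk, count = [], 0
    for line in lines:
        chunk.append(line)
        if line.strip() == "END IONS":
            count += 1
            if count == spectra_per_chunk:
                yield chunk
                chunk, count = [], 0
    if count:  # final, partially filled chunk
        yield chunk


# Toy input: five minimal spectra with made-up titles and peaks.
mgf = []
for i in range(5):
    mgf += ["BEGIN IONS", f"TITLE=scan{i}", "PEPMASS=500.0",
            "100.0 1.0", "END IONS"]

chunks = list(chunk_mgf(mgf, spectra_per_chunk=2))
print([c.count("END IONS") for c in chunks])
```

Each chunk is an independent unit of work, so chunks can be handed to serial, multi-processor, cluster, or grid resources without the search engines needing to know about one another.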

PepArML Meta-Search Engine. Diagram labels: single, simple search request; secure communication; Edwards Lab Scheduler & 48+ CPUs; heterogeneous compute resources: NSF TeraGrid 1000+ CPUs, UMIACS 250+ CPUs; engine sets: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core) and X!Tandem, KScore, OMSSA, MyriMatch; scales easily to 250+ simultaneous searches.

PepArML Meta-Search Engine. Diagram labels: single, simple search request; secure communication; Edwards Lab Scheduler & 80+ CPUs; heterogeneous compute resources: NSF TeraGrid 1000+ CPUs; engine sets: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core) and X!Tandem, KScore, OMSSA, MyriMatch; scales easily to 250+ simultaneous searches.

PepArML Meta-Search Engine. Diagram labels: simple search request; secure communication; Edwards Lab Scheduler & 48+ CPUs; heterogeneous compute resources: NSF TeraGrid 1000+ CPUs, UMIACS 250+ CPUs.

Peptide Identification Grid-Enabled Meta-Search. Access to high-performance computing resources for the proteomics community: NSF TeraGrid Community Portal; University/Institute HPC clusters; individual lab compute resources. Contribute cycles to the community and get access to others’ cycles in return. Centralized scheduler: compute capacity can still be exclusive, or prioritized. Compute client plays well with HPC grid schedulers.

Conclusions. Improve the scope and sensitivity of peptide identification for genome annotation, using: exhaustive peptide sequence databases; machine-learning for combining; meta-search tools to maximize consensus; grid-computing for thorough search. http://edwardslab.bmcb.georgetown.edu

Acknowledgements. Dr. Catherine Fenselau & students, University of Maryland Biochemistry. Dr. Yan Wang, University of Maryland Proteomics Core. Dr. Art Delcher, University of Maryland CBCB. Dr. Chau-Wen Tseng & Dr. Xue Wu, University of Maryland Computer Science. Funding: NIH/NCI.