Improving the Sensitivity of Peptide Identification for Genome Annotation Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown.

Presentation transcript:

Improving the Sensitivity of Peptide Identification for Genome Annotation Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center

2 Lost peptide identifications Missing from the sequence database Search engine strengths, weaknesses, quirks Poor score or statistical significance Thorough search takes too long

3 Lost peptide identifications
  - Missing from the sequence database
    - Build exhaustive peptide sequence databases
    - Search prokaryotic genomes
    - Build evidence for unannotated proteins and protein isoforms
  - Search engine strengths, weaknesses, quirks
    - Use multiple search engines and combine results
  - Poor score or statistical significance
    - Use search-engine consensus to boost confidence
    - Use machine-learning to distinguish true from false
  - Thorough search takes too long
    - Harness the power of heterogeneous computational grids

4 Searching under the street-light…
  - Tandem mass spectrometry doesn’t discriminate against novel peptides... but protein sequence databases do!
  - Searching traditional protein sequence databases biases the results in favor of well-understood and/or computationally predicted proteins and protein isoforms!

5 Unannotated Splice Isoform Human Jurkat leukemia cell-line Lipid-raft extraction protocol, targeting T cells von Haller, et al. MCP LIME1 gene: LCK interacting transmembrane adaptor 1 LCK gene: Leukocyte-specific protein tyrosine kinase Proto-oncogene Chromosomal aberration involving LCK in leukemias. Multiple significant peptide identifications

6 Unannotated Splice Isoform

7

8

9 Translation start-site correction Halobacterium sp. NRC-1 Extreme halophilic Archaeon, insoluble membrane and soluble cytoplasmic proteins Goo, et al. MCP GdhA1 gene: Glutamate dehydrogenase A1 Multiple significant peptide identifications Observed start is consistent with Glimmer 3.0 prediction(s)

10 Halobacterium sp. NRC-1 ORF: GdhA1 Peptide identifications filtered at 10% FDR Consistent/not consistent with NP_279651

11 Halobacterium sp. NRC-1 ORF: GdhA1 Peptide identifications filtered at 20% FDR Consistent/not consistent with NP_279651
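
The 10% and 20% FDR filters on the two slides above can be illustrated with a short sketch. It assumes target-decoy counting (the talk mentions decoy searches, but not the exact estimator), and the function name and the `n_decoy_dbs` parameter are hypothetical.

```python
from typing import List, Tuple

def filter_at_fdr(psms: List[Tuple[float, bool]], max_fdr: float,
                  n_decoy_dbs: int = 1) -> List[Tuple[float, bool]]:
    """Keep target identifications above the lowest score cutoff
    whose estimated FDR is still <= max_fdr.

    psms: (score, is_decoy) pairs, higher score = better (assumption).
    n_decoy_dbs: number of decoy databases searched; decoy counts are
    divided by it so that one target search is compared fairly.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best_cut = 0
    for i, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR: expected false targets / accepted targets.
        est_fdr = (decoys / n_decoy_dbs) / targets if targets else 1.0
        if est_fdr <= max_fdr:
            best_cut = i
    return [p for p in ranked[:best_cut] if not p[1]]

example = [(10, False), (9, False), (8, True), (7, False), (6, False), (5, True)]
accepted = filter_at_fdr(example, max_fdr=0.25)
```

Loosening the threshold admits lower-scoring identifications, which is why the 20% view shows more peptides than the 10% view.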

12 Translation start-site correction

13 We can observe evidence for… Known coding SNPs Unannotated coding mutations Alternate splicing isoforms Alternate/Incorrect translation start-sites Microexons Alternate/Incorrect translation frames …though it must be treated thoughtfully.

14 Peptide Sequence Databases
  All amino-acid sequences of at most 30 amino-acids from:
  - IPI and all IPI constituent protein sequences
    - IPI, HInvDB, VEGA, UniProt, EMBL, RefSeq, GenBank
  - SwissProt variants, conflicts, splices, and annotated signal peptide truncations
  - GenBank and RefSeq mRNA sequence
    - 3 frame translation
  - GenBank EST and HTC sequences
    - 6 frame translation and found in at least 2 sequences
  Grouped by Gene/UniGene cluster and compressed.
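
A minimal sketch of the enumeration described on this slide: six-frame translation of nucleotide sequences, all peptides of at most 30 amino-acids, and the "found in at least 2 sequences" filter for ESTs. It uses the standard genetic code; the function names are illustrative and the actual database build presumably differs in detail.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate one frame; unknown codons (e.g. containing N) become X."""
    return "".join(CODON.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frame(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    rev = dna.translate(COMPLEMENT)[::-1]
    return [translate(s[off:]) for s in (dna, rev) for off in range(3)]

def peptides(protein, max_len=30):
    """All substrings of length 1..max_len that avoid stops and unknowns."""
    out = set()
    for orf in protein.replace("X", "*").split("*"):
        for i in range(len(orf)):
            for j in range(i + 1, min(i + max_len, len(orf)) + 1):
                out.add(orf[i:j])
    return out

def supported_peptides(ests, min_support=2, max_len=30):
    """Peptides seen in the six-frame translation of >= min_support ESTs."""
    counts = Counter()
    for est in ests:
        seen = set()
        for frame in six_frame(est):
            seen |= peptides(frame, max_len)
        counts.update(seen)  # each EST counts at most once per peptide
    return {p for p, n in counts.items() if n >= min_support}
```

Requiring support from two independent sequences is a simple guard against sequencing errors in single-pass EST reads.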

15 Peptide Sequence Databases
  - Formatted as a FASTA sequence database
  - Easy integration with search engines.
  - One entry per gene/cluster.
  - Automated rebuild every few months.

  Organism     Size (AA)   Size (Entries)
  Human        248Mb       74,976
  Mouse        171Mb       55,887
  Rat           76Mb       42,372
  Zebra-fish    94Mb       40,490

16 Peptide evidence, in context
  - Statistically significant identified peptides can be misleading…
    - Isobaric amino-acid/PTM substitutions
    - Unsubstantiated peptide termini
    - Few b-ions or y-ions suggest “random” mass match
    - Single amino-acids on upstream or downstream exons
    - Peptides in 5’ UTR with no upstream Met
  - Need tools to quickly check the corroborating (genomic, transcript, SNP) evidence

17 PeptideMapper Web Service Counts: by gene and evidence EST, mRNA, Protein Sequences: accessions by gene UniProt variants nucleotide sequence & link to BLAT alignment Genomic Loci: one-click projection to the UCSC genome browser

18 PeptideMapper Web Service I’m Feeling Lucky

19 PeptideMapper Web Service I’m Feeling Lucky

20 PeptideMapper Web Service I’m Feeling Lucky

21 PeptideMapper Web Service
  - Suffix-tree index on peptide sequence database
    - Fast peptide to gene/cluster mapping
    - “Compression” makes this feasible
  - Peptide alignment with cluster evidence
    - Amino-acid or nucleotide; exact & near-exact
  - Genomic-loci mapping via
    - UCSC “known-gene” transcripts, and
    - Predetermined, embedded genomic coordinates
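
To illustrate fast peptide to gene/cluster mapping, here is a small sketch that swaps the slide's suffix tree for a naively built suffix array, which answers the same substring queries and is easier to show in a few lines. The class name, method names, and example sequences are hypothetical; this is not the PeptideMapper implementation.

```python
import bisect

class PeptideIndex:
    """Toy suffix-array index over one sequence per gene/cluster."""

    def __init__(self, entries):
        # entries: dict of gene/cluster name -> amino-acid sequence
        self.text = ""
        self.starts = []   # start offset of each entry in self.text
        self.genes = []
        for gene, seq in entries.items():
            self.starts.append(len(self.text))
            self.genes.append(gene)
            self.text += seq + "$"   # '$' separates entries
        # Naive construction: fine for a demo, too slow for real databases.
        self.sa = sorted(range(len(self.text)), key=lambda i: self.text[i:])

    def genes_for(self, peptide):
        """Return the set of genes whose sequence contains the peptide."""
        k = len(peptide)
        lo, hi = 0, len(self.sa)
        while lo < hi:   # first suffix whose k-prefix is >= peptide
            mid = (lo + hi) // 2
            if self.text[self.sa[mid]:self.sa[mid] + k] < peptide:
                lo = mid + 1
            else:
                hi = mid
        hits = set()
        while lo < len(self.sa) and self.text.startswith(peptide, self.sa[lo]):
            pos = self.sa[lo]
            hits.add(self.genes[bisect.bisect_right(self.starts, pos) - 1])
            lo += 1
        return hits

index = PeptideIndex({"geneA": "MKTAYIAK", "geneB": "AYIAKQR"})
shared = index.genes_for("AYIAK")
```

Each query costs a binary search plus one step per occurrence, independent of how many genes are indexed.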

22 Comparison of search engine results
  - No single score is comprehensive
  - Search engines disagree
  - Many spectra lack confident peptide assignment
  Searle et al. JPR 7(1)
  [Figure: overlap of identifications among X! Tandem, SEQUEST, and Mascot; surviving percentage labels: 14%, 28%, 14%, 3%, 2%, 1%]

23 Combining search engine results – harder than it looks!
  - Consensus boosts confidence, but...
    - How to assess statistical significance?
    - Gain specificity, but lose sensitivity!
    - Incorrect identifications are correlated too!
  - How to handle weak identifications?
    - Consensus vs disagreement vs abstention
    - Threshold at some significance?
  - We apply unsupervised machine-learning...
    - Lots of related work unified in a single framework.
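
The consensus, disagreement, and abstention cases can be made concrete with a simple vote tally per spectrum. This is only a sketch of the bookkeeping, not the machine-learning combiner the talk describes; the function name and the input layout are assumptions.

```python
from collections import Counter

def tally_votes(results):
    """Summarize agreement between search engines for each spectrum.

    results: dict of engine name -> dict of spectrum id -> peptide.
    A spectrum id missing from an engine's dict (or mapped to None)
    counts as an abstention by that engine.
    """
    spectra = set().union(*(r.keys() for r in results.values()))
    summary = {}
    for sid in sorted(spectra):
        calls = [r.get(sid) for r in results.values()]
        votes = Counter(p for p in calls if p is not None)
        n_votes = sum(votes.values())
        if votes:
            peptide, agree = votes.most_common(1)[0]
        else:
            peptide, agree = None, 0
        summary[sid] = {
            "peptide": peptide,             # most frequently reported peptide
            "agree": agree,                 # engines reporting that peptide
            "disagree": n_votes - agree,    # engines reporting something else
            "abstain": len(calls) - n_votes,
        }
    return summary

demo = tally_votes({
    "Mascot":   {"s1": "PEPTIDER", "s2": "AAAK"},
    "X!Tandem": {"s1": "PEPTIDER", "s2": "CCCK"},
    "OMSSA":    {"s1": "PEPTIDER"},
})
```

A raw tally like this cannot assign statistical significance, and correlated errors can produce false agreement, which is the motivation the slide gives for a learned combiner.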

24 Supervised Learning

25 Unsupervised Learning

26 PepArML Combining Results Q-TOF LTQ MALDI Edwards, et al., Clin. Prot. 5(1), 2009

27 Unsupervised Learning H C-TMO U-TMO U*-TMO Edwards, et al., Clin. Prot. 5(1), 2009

28 Peptide Atlas A8_IP LTQ Dataset

29 PepArML in the trenches… MALDI spectra of proteolytic peptides in serum Top-down CID spectra after decharging Halobacterium six-frame search PepArML found 389 non-RefSeq peptides Mascot: 173, OMSSA: 168, K-Score: 292 Peptides for GdhA1: PepArML 9(2), K-Score: 6(1) Semi-tryptic searches work particularly well. S17 Spectra at 10% FDR

30 Searching for Consensus
  - Search engine quirks can destroy consensus
    - Initial methionine loss as tryptic peptide
    - Charge state enumeration or guessing
    - X!Tandem's refinement mode
    - Pyro-Gln, Pyro-Glu modifications
    - Difficulty tracking spectrum identifiers
    - Precursor mass tolerance (Da vs ppm)
    - Decoy searches must be identical!
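
The Da vs ppm item is a units issue: a ppm tolerance scales with m/z, while a Da tolerance is absolute, so engines configured in different units accept different candidates. A small sketch of the conversion follows; the function names are illustrative.

```python
def ppm_to_da(mz, ppm):
    """Absolute tolerance (Da) equivalent to a ppm tolerance at a given m/z."""
    return mz * ppm / 1e6

def da_to_ppm(mz, da):
    """Relative tolerance (ppm) equivalent to a Da tolerance at a given m/z."""
    return da / mz * 1e6

def within_tolerance(observed, theoretical, tol, unit="ppm"):
    """True if the observed value matches the theoretical one within tol."""
    delta = abs(observed - theoretical)
    if unit == "ppm":
        return delta <= ppm_to_da(theoretical, tol)
    if unit == "Da":
        return delta <= tol
    raise ValueError("unit must be 'ppm' or 'Da'")

# 10 ppm at m/z 1000 is 0.01 Da, but at m/z 500 it is only 0.005 Da.
narrow = ppm_to_da(500.0, 10)
wide = ppm_to_da(1000.0, 10)
```

For consensus, every engine should therefore be given the same window expressed in whichever unit it accepts, converted at a representative m/z if necessary.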

31 Configuring for Consensus Search engine configuration can be difficult: Correct spectral format Search parameter files and command-line Pre-processed sequence databases. Tracking spectrum identifiers Extracting peptide identifications, especially modifications and protein identifiers

32 Peptide Identification Meta-Search
  - Parameters: Instrument, Precursor Tolerance, Fragment Tolerance, Max. Charge
  - Sequence Database: Target and # of Decoys
  - Modification: Fixed/Variable, Amino-Acids, Position, Delta
  - Proteolytic Agent: Motif
  - Peptide Candidates: Termini Specificity, Precursor Tolerance, Missed cleavages, Charge State Handling, # 13C Peaks
  - Search Engines: Mascot, X!Tandem, K-Score, OMSSA, MyriMatch

33 Peptide Identification Meta-Search
  - Simple unified search interface for: Mascot, X!Tandem, K-Score, OMSSA, MyriMatch
  - Automatic decoy searches
  - Automatic spectrum file "chunking"
  - Automatic scheduling
    - Serial, Multi-Processor, Cluster, Grid
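
Spectrum file "chunking" can be sketched as splitting a spectrum file into fixed-size pieces that are scheduled as independent search jobs. The example assumes MGF input, where each spectrum ends with an `END IONS` line; the talk does not state which formats are chunked, and the function name is hypothetical.

```python
def chunk_mgf(lines, spectra_per_chunk):
    """Yield lists of lines, each holding up to spectra_per_chunk spectra."""
    chunk, count = [], 0
    for line in lines:
        chunk.append(line)
        if line.strip() == "END IONS":
            count += 1
            if count == spectra_per_chunk:
                yield chunk
                chunk, count = [], 0
    if count:   # final, partially filled chunk
        yield chunk

# Five tiny spectra split into chunks of two.
demo_lines = []
for i in range(5):
    demo_lines += ["BEGIN IONS", f"TITLE=s{i}", "100.0 1.0", "END IONS"]
chunks = list(chunk_mgf(demo_lines, 2))
```

Smaller chunks give the scheduler more freedom to balance load across machines, at the cost of more per-job overhead.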

34 Peptide Identification Grid-Enabled Meta-Search
  - Compute resources: NSF TeraGrid CPUs; UMIACS 250+ CPUs; Edwards Lab Scheduler & 48+ CPUs
  - Secure communication
  - Heterogeneous compute resources
  - Single, simple search request
  - Scales easily to 250+ simultaneous searches
  [Figure labels, search engines per resource: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core); X!Tandem, KScore, OMSSA; X!Tandem, KScore, OMSSA]

35 Peptide Identification Grid-Enabled Meta-Search NSF TeraGrid CPUs UMIACS 250+ CPUs Edwards Lab Scheduler & 48+ CPUs Secure communication Heterogeneous compute resources Simple search request

36 Peptide Atlas A8_IP LTQ Dataset
  - Tryptic search of Human ESTs using PepSeqDB
  - spectra searched ~ 26 times: Target + 2 decoys, 5 engines, 1+ vs 2+/3+ charge
  - 8685 search jobs
  - 25.7 days of CPU time
  - TeraGrid TKO jobs < 2 hours
  - Using 143 different machines
  - Total elapsed time < 26 hours
  - Bottleneck: Mascot license.
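
The figures on this slide imply an average job length and a lower bound on the effective parallel speedup. The short calculation below uses only the numbers quoted above; the function name is illustrative.

```python
def grid_summary(cpu_days, jobs, elapsed_hours):
    """Derive average job length and effective speedup from run totals."""
    cpu_hours = cpu_days * 24
    return {
        "cpu_hours": cpu_hours,
        "avg_job_minutes": cpu_hours * 60 / jobs,
        # Elapsed time is an upper bound, so this is a lower bound.
        "min_speedup": cpu_hours / elapsed_hours,
    }

summary = grid_summary(cpu_days=25.7, jobs=8685, elapsed_hours=26)
# avg_job_minutes is about 4.3; min_speedup is about 23.7
```

With 143 machines contributing, an effective speedup of roughly 24 is consistent with the slide's note that a single Mascot license, rather than raw CPU capacity, was the bottleneck.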

37 Peptide Identification Grid-Enabled Meta-Search Access to high-performance computing resources for the proteomics community NSF TeraGrid Community Portal University/Institute HPC clusters Individual lab compute resources Contribute cycles to the community and get access to others’ cycles in return. Centralized scheduler Compute capacity can still be exclusive, or prioritized. Compute client plays well with HPC grid schedulers.

38 Conclusions Improve the scope and sensitivity of peptide identification for genome annotation, using Exhaustive peptide sequence databases Machine-learning for combining Meta-search tools to maximize consensus Grid-computing for thorough search

39 Acknowledgements Dr. Catherine Fenselau University of Maryland Biochemistry Dr. Rado Goldman Georgetown University Medical Center Dr. Chau-Wen Tseng & Dr. Xue Wu University of Maryland Computer Science Funding: NIH/NCI