Improving Genome Annotation using Proteomics Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park.

2 Mass Spectrometry for Proteomics: Measure mass of many (bio)molecules simultaneously (high bandwidth). Mass is an intrinsic property of all (bio)molecules (no prior knowledge required).

3 Mass Spectrometer: Sample, Ionizer, Mass Analyzer, Detector. Ionizer: MALDI, Electrospray Ionization (ESI). Mass Analyzer: Time-of-Flight (TOF), Quadrupole, Ion Trap. Detector: Electron Multiplier (EM).

4 High Bandwidth

5 Mass is fundamental!

6 Mass Spectrometry for Proteomics: Measure mass of many molecules simultaneously... but not too many (abundance bias). Mass is an intrinsic property of all (bio)molecules... but need a reference to compare to.

7 Mass Spectrometry for Proteomics: Mass spectrometry has been around since the turn of the 20th century... why is MS-based proteomics so new? Ionization methods: MALDI, electrospray. Protein chemistry & automation: chromatography, gels, computers. Protein / genome sequences: a reference for comparison.

8 Sample Preparation for Peptide Identification Enzymatic Digest and Fractionation

9 Single Stage MS (figure: MS spectrum, m/z axis)

10 Tandem Mass Spectrometry (MS/MS): Precursor selection (figure: MS spectrum, m/z axis)

11 Tandem Mass Spectrometry (MS/MS): Precursor selection + collision induced dissociation (CID) (figure: MS/MS spectrum, m/z axis)

12 Peptide Identification: For each (likely) peptide sequence: 1. Compute fragment masses; 2. Compare with spectrum; 3. Retain those that match well. Peptide sequences from (any) sequence database: Swiss-Prot, IPI, NCBI’s nr, ESTs, genomes, ... Automated, high-throughput peptide identification in complex mixtures.
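The three steps on this slide can be sketched as follows. This is a minimal illustration, not the scoring used by any real search engine: it assumes singly charged b- and y-ions, monoisotopic residue masses, and a plain shared-peak count as the score. The function names (`fragment_mz`, `count_matches`, `rank_candidates`) and the 0.5 Da tolerance are hypothetical choices made for this example.

```python
# Monoisotopic amino-acid residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # b1 .. b(n-1), read from the N-terminus
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1), read from the C-terminus
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def count_matches(peptide, peaks, tol=0.5):
    """Step 2: count theoretical fragments with an observed peak within tol Da."""
    b_ions, y_ions = fragment_mz(peptide)
    return sum(any(abs(f - p) <= tol for p in peaks) for f in b_ions + y_ions)


def rank_candidates(candidates, peaks, tol=0.5, min_matches=1):
    """Step 3: keep candidates that match well, best first."""
    scored = [(count_matches(c, peaks, tol), c) for c in candidates]
    return sorted((s for s in scored if s[0] >= min_matches), reverse=True)


# Usage: simulate a spectrum from one peptide, then rank two candidates.
b, y = fragment_mz("MARNR")
spectrum = b + y
print(rank_candidates(["MVMAR", "MARNR"], spectrum))
```

A shared-peak count ignores peak intensity and gives no significance estimate, which is why it serves only as a stand-in here.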

13 Peptide Identification... can provide direct experimental evidence for the amino-acid sequence of functional proteins. Evidence for: functional protein isoforms; translation start and frame; proteins with short open-reading-frames.

14 How could this help? Evidence for SNPs and alternative splicing stops with transcription. No genomic or transcript evidence for translation start-site. Conservation doesn’t stop at coding bases! Statistical gene-finders struggle with micro-exons, translation start-sites, and short ORFs.

15 What can be observed? Known coding SNPs; novel coding mutations; alternative splicing isoforms; microexons (non-canonical splice-sites); alternative translation start-sites (codons); alternative translation frames; “dark” open-reading-frames.

16 Splice Isoform: Human Jurkat leukemia cell-line. Lipid-raft extraction protocol, targeting T cells (von Haller, et al. MCP). LIME1 gene: LCK interacting transmembrane adaptor 1. LCK gene: leukocyte-specific protein tyrosine kinase; proto-oncogene; chromosomal aberration involving LCK in leukemias. Multiple significant peptide identifications.

17 Splice Isoform

18 Novel Splice Isoform

19 Translation Start-Site: Human erythroleukemia K562 cell-line. Depth of coverage study (Resing et al. Anal. Chem). THOC2 gene: part of the heteromultimeric THO/TREX complex. Initially believed to be a “novel” ORF. RefSeq mRNA in Jun 2007, no RefSeq protein. TrEMBL entry Feb 2005, no SwissProt entry. GenBank mRNA in May 2002 (complete CDS). Plenty of EST support. ~100,000 bases upstream of other isoforms.

20 Translation Start-Site

21 Translation Start-Site

22 Translation Start-Site

23 Translation Start-Site

24 Easily distinguish minor sequence variations: Two B. anthracis Sterne α/β SASP annotations: RefSeq/Gb: MVMARN... (7441 Da); CMR: MARN... (7211 Da). Intact proteins differ by 230 Da (7441 Da vs 7211 Da). N-terminal tryptic peptides: MVMAR (606.3 Da), MVMARNR (876.4 Da) vs MARNR (646.3 Da). Very different MS/MS spectra.
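The peptide masses quoted on this slide are consistent with neutral monoisotopic masses, that is, the sum of the residue masses plus one water. The short check below assumes standard monoisotopic residue values; the helper name `peptide_mass` is illustrative.

```python
# Monoisotopic amino-acid residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(seq):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONO[aa] for aa in seq) + WATER


for pep in ("MVMAR", "MVMARNR", "MARNR"):
    print(pep, round(peptide_mass(pep), 1))
# MVMAR 606.3
# MVMARNR 876.4
# MARNR 646.3

# The two extra N-terminal residues (Met + Val) account for the shift
# between the two annotated intact proteins.
print(round(MONO["M"] + MONO["V"], 1))  # 230.1
```

The 7441 Da and 7211 Da intact-protein values are taken from the slide and are not recomputed here, since the full sequences are not given.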

25 Bacterial Gene-Finding: …TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA… (figure label: Stop codon). Find all the open-reading-frames... (courtesy of Art Delcher)

26 Bacterial Gene-Finding: …TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA… (forward strand; figure label: Stop codon) …ATCTTTTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT… (reverse strand; figure labels: Shifted Stop, Stop codon). Find all the open-reading-frames... but they overlap: which ones are correct? (courtesy of Art Delcher)
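"Find all the open-reading-frames" on both strands can be sketched as a six-frame scan. This is a simplified illustration, not Glimmer's method: it assumes ATG as the only start codon, reports the first in-frame ATG before each stop, and does no scoring at all. The function names and the `min_codons` parameter are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_strand(dna, min_codons=1):
    """Yield (frame, start, end) for each ATG...stop ORF on one strand.

    Coordinates are 0-based on the strand as given; end is exclusive
    and includes the stop codon.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first in-frame ATG
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    yield frame, start, i + 3
                start = None                   # look for the next ORF


def all_orfs(dna, min_codons=1):
    """Scan all six reading frames: three per strand."""
    found = [("+", f, s, e) for f, s, e in orfs_in_strand(dna, min_codons)]
    rc = reverse_complement(dna)
    found += [("-", f, s, e) for f, s, e in orfs_in_strand(rc, min_codons)]
    return found


# Usage with the fragment shown on the slide.
fragment = "TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA"
for orf in all_orfs(fragment):
    print(orf)
```

On a real genome the scan returns many overlapping candidates, which is the problem the slide raises: something beyond enumeration is needed to decide which ones are genes.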

27 Coding-Sequence “Score” (courtesy of Art Delcher)

28 Glimmer3 Performance: Glimmer3 trained on & compared to RefSeq genes with annotated function. Correct STOP: 99.6%. Correct START: 84.3%. “Not all the genomes necessarily have carefully/accurately annotated start sites, so the results for number of correct starts may be suspect.”
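Stop and start accuracy of this kind can be tallied by matching predicted genes to reference genes on their stop coordinate. The sketch below is an assumption about how such figures might be computed, not a description of the Glimmer3 evaluation: genes are hypothetical `(strand, start, stop)` tuples, and both rates are taken over all reference genes.

```python
def start_stop_accuracy(predicted, reference):
    """Return (fraction of correct stops, fraction of correct starts).

    A reference gene has a correct stop if some prediction shares its
    strand and stop coordinate, and a correct start if that same
    prediction also shares its start coordinate.
    """
    pred_start_by_stop = {(strand, stop): start
                          for strand, start, stop in predicted}
    stop_hits = start_hits = 0
    for strand, start, stop in reference:
        key = (strand, stop)
        if key in pred_start_by_stop:
            stop_hits += 1
            if pred_start_by_stop[key] == start:
                start_hits += 1
    n = len(reference)
    return stop_hits / n, start_hits / n


# Usage with made-up coordinates (illustrative only).
reference = [("+", 100, 400), ("+", 500, 900), ("-", 1000, 1300), ("+", 1500, 1800)]
predicted = [("+", 100, 400), ("+", 530, 900), ("-", 1000, 1300), ("+", 2000, 2300)]
print(start_stop_accuracy(predicted, reference))  # (0.75, 0.5)
```

As the quotation on the slide warns, a start counted as wrong may reflect an error in the reference annotation rather than in the prediction.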

29 N-terminal peptides: (Protein) N-terminal peptides establish the start-site of known & unexpected ORFs. Use: directly to annotate genomes; to evaluate and improve algorithms; to map cross-species.

30 N-terminal peptide workflows: Typical proteomics workflows sample peptides from the proteome “randomly”. Caulobacter crescentus (70%): 3733 proteins (RefSeq genome annot.); 66K tryptic peptides (600 Da to 3000 Da); 2085 N-terminal tryptic peptides (3%).
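Counts like those on this slide come from an in-silico tryptic digest of the annotated proteins followed by a mass filter. The sketch below shows one way to make such a count, assuming cleavage after K or R except before P, no missed cleavages, and monoisotopic masses. The function names are illustrative, and the example input is not the Caulobacter proteome.

```python
# Monoisotopic amino-acid residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(seq):
    """Neutral monoisotopic mass of a peptide."""
    return sum(MONO[aa] for aa in seq) + WATER


def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])   # C-terminal peptide
    return peptides


def count_nterm(proteins, lo=600.0, hi=3000.0):
    """Return (N-terminal peptides, all peptides) within the mass window."""
    nterm = total = 0
    for protein in proteins:
        for i, pep in enumerate(tryptic_peptides(protein)):
            if lo <= peptide_mass(pep) <= hi:
                total += 1
                nterm += (i == 0)
    return nterm, total


# Usage on two short made-up sequences.
print(tryptic_peptides("MVMARNRPK"))   # ['MVMAR', 'NRPK']
print(count_nterm(["MVMARNR", "MARNRGGSLLEEK"]))
```

Because each protein contributes at most one N-terminal peptide but many internal ones, the N-terminal share of a random sample is small, which motivates the enrichment workflow on the next slide.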

31 N-terminal peptide workflow: Protect protein N-terminus. Digest to peptides. Chemically modify free peptide N-termini. Use the chemical modification to capture unwanted peptides. (Nat Biotech, Vol. 21, pp , 2003.)

32 Increasing N-terminal peptide coverage: Multiple (digest) enzymes: trypsin-R: 60% (80%); acid + lys-C + trypsin: 85% (94%). Repeated LC-MS/MS. Precursor exclusion / inclusion lists. MALDI / ESI. Protein separation and/or orthogonal fractionation. (Anal Chem, Vol. 76, pp , 2004.)

33 Proteomics Informatics: Search spectra against: the entire bacterial genome; all Met-initiated peptides; or statistically likely Met-initiated peptides. Easily consider initial Met loss PTM, too. Off-the-shelf MS/MS search engines (Mascot / X!Tandem / OMSSA).
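The "all Met initiated peptides" option can be sketched as an enumeration over every ATG in the sequence. This is a simplified illustration under stated assumptions: forward strand only, ATG as the only start codon, the standard genetic code, and each peptide ending at the first K or R (or at a stop codon). Each candidate is emitted with and without its initial Met to cover Met loss. The function name `met_initiated_peptides` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def met_initiated_peptides(dna):
    """Candidate N-terminal tryptic peptides starting at every ATG.

    Translation runs from each ATG to the first K/R (inclusive) or to a
    stop codon. Candidates that run off the end of the sequence are dropped.
    """
    peptides = set()
    for i in range(len(dna) - 2):
        if dna[i:i + 3] != "ATG":
            continue
        residues, complete = [], False
        for j in range(i, len(dna) - 2, 3):
            aa = CODON.get(dna[j:j + 3], "X")   # X for ambiguous codons
            if aa == "*":
                complete = True
                break
            residues.append(aa)
            if aa in "KR":
                complete = True
                break
        if complete:
            seq = "".join(residues)
            peptides.add(seq)
            if len(seq) > 1:
                peptides.add(seq[1:])           # initial Met removed
    return peptides


# Usage: a made-up fragment encoding M V M A R N, with two possible starts.
print(sorted(met_initiated_peptides("ATGGTTATGGCTCGTAAT")))
# ['AR', 'MAR', 'MVMAR', 'VMAR']
```

The resulting set can be written out as a small sequence database and handed to an off-the-shelf search engine; a fuller version would also scan the reverse strand and allow missed cleavages.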

34 Other Practical Issues: Suitable for commonly available instrumentation; only the sample prep. is (somewhat) novel. Need living organism: stage of life-cycle? Bang for buck? N-terminal peptides / $$$$

35 Other Research Projects: Alternative splicing and coding SNPs in clinical cancer samples. MS/MS spectral matching using HMMs. Combining MS/MS search engine results using machine learning. Microorganism identification using MS. Gapped/spaced seeds for inexact sequence alignment. Applications of SBH-graphs and Eulerian paths.

36 Hidden Markov Models for Spectral Matching: Capture statistical variation and consensus in peak intensity. Capture semantics of peaks. Extrapolate model to other peptides. Good specificity with superior sensitivity for peptide detection. Assign 1000’s of additional spectra (w/ p-value < )

37 Peptide DLATVYVDVLK

38 Peptide DLATVYVDVLK

39 Acknowledgements: Catherine Fenselau, Steve Swatkoski (UMCP Biochemistry); Chau-Wen Tseng, Xue Wu (UMCP Computer Science); Cheng Lee (Calibrant Biosystems); PeptideAtlas, HUPO PPP, X!Tandem. Funding: NIH/NCI, USDA/ARS.