Nathan Edwards Center for Bioinformatics and Computational Biology

Presentation transcript:

Proteomic Characterization of Alternative Splicing and Coding Polymorphism Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

Synopsis MS/MS spectra provide evidence for the amino-acid sequence of functional proteins. Key concepts: Spectrum acquisition is unbiased Direct observation of amino-acid sequence Sensitive to minor sequence variation Observed peptides represent folded proteins

Synopsis MS/MS spectra provide evidence for the amino-acid sequence of functional proteins. Applications: Cancer biomarkers Genome annotation

Mass Spectrometry for Proteomics Measure mass of many (bio)molecules simultaneously High bandwidth Mass is an intrinsic property of all (bio)molecules No prior knowledge required

Mass Spectrometer Sample → Ionizer: MALDI, Electro-Spray Ionization (ESI) → Mass Analyzer: Time-Of-Flight (TOF), Quadrupole, Ion-Trap → Detector: Electron Multiplier (EM)

High Bandwidth [Figure: mass spectrum, % Intensity vs. m/z]

Mass is fundamental!

Mass Spectrometry for Proteomics Measure mass of many molecules simultaneously ...but not too many, abundance bias Mass is an intrinsic property of all (bio)molecules ...but need a reference to compare to

Mass Spectrometry for Proteomics Mass spectrometry has been around since the turn of the 20th century... ...why is MS-based proteomics so new? Ionization methods: MALDI, Electrospray Protein chemistry & automation: Chromatography, Gels, Computers Protein sequence databases: A reference for comparison

Sample Preparation for Peptide Identification Enzymatic Digest and Fractionation

Single Stage MS [Figure: MS spectrum; axis: m/z]

Tandem Mass Spectrometry (MS/MS) Precursor selection [Figure: spectrum with selected precursor; axis: m/z]

Tandem Mass Spectrometry (MS/MS) Precursor selection + collision induced dissociation (CID) [Figure: precursor spectrum and resulting MS/MS spectrum; axes: m/z]

Peptide Identification For each (likely) peptide sequence 1. Compute fragment masses 2. Compare with spectrum 3. Retain those that match well Peptide sequences from protein sequence databases Swiss-Prot, IPI, NCBI’s nr, ... Automated, high-throughput peptide identification in complex mixtures
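The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring used by any particular search engine: it assumes singly charged b- and y-ions only, monoisotopic residue masses, and a simple count of matched fragments within a fixed tolerance. The function names and the 0.5 Da tolerance are hypothetical choices made for the example.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)

def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    prefix = 0.0
    for m in masses[:-1]:            # N-terminal fragments
        prefix += m
        b_ions.append(prefix + PROTON)
    suffix = 0.0
    for m in reversed(masses[1:]):   # C-terminal fragments
        suffix += m
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions

def count_matches(peptide, peaks, tol=0.5):
    """Step 2: number of predicted fragments with an observed peak within tol."""
    b_ions, y_ions = fragment_mz(peptide)
    return sum(any(abs(f - p) <= tol for p in peaks) for f in b_ions + y_ions)

def rank_candidates(candidates, peaks, tol=0.5):
    """Step 3: candidates sorted from best to worst match count."""
    return sorted(((count_matches(p, peaks, tol), p) for p in candidates),
                  reverse=True)
```

Real search engines also weigh peak intensities, charge states, and the chance of random matches; the fragment count here only shows the shape of the computation.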

Why don’t we see more novel peptides? Tandem mass spectrometry doesn’t discriminate against novel peptides... ...but protein sequence databases do! Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

What goes missing? Known coding SNPs Novel coding mutations Alternative splicing isoforms Alternative translation start-sites Microexons Alternative translation frames

Why should we care? Alternative splicing is the norm! Only 20-25K human genes Each gene makes many proteins Proteins have clinical implications Biomarker discovery Evidence for SNPs and alternative splicing stops with transcription Genomic assays, ESTs, mRNA sequence. Little hard evidence for translation start site

Novel Splice Isoform Human Jurkat leukemia cell-line Lipid-raft extraction protocol, targeting T cells von Haller, et al. MCP 2003. LIME1 gene: LCK interacting transmembrane adaptor 1 LCK gene: Leukocyte-specific protein tyrosine kinase Proto-oncogene Chromosomal aberration involving LCK in leukemias. Multiple significant peptide identifications

Novel Splice Isoform


Novel Mutation HUPO Plasma Proteome Project Pooled samples from 10 male & 10 female healthy Chinese subjects Plasma/EDTA sample protocol Li, et al. Proteomics 2005. (Lab 29) TTR gene: Transthyretin (pre-albumin) Defects in TTR are a cause of amyloidosis. Familial amyloidotic polyneuropathy late-onset, dominant inheritance

Novel Mutation Ala2→Pro associated with familial amyloid polyneuropathy


Translation Start-Site Human erythroleukemia K562 cell-line Depth of coverage study Resing et al. Anal. Chem. 2004. THOC2 gene: Part of the heteromultimeric THO/TREX complex. Initially believed to be a “novel” ORF RefSeq mRNA in Jun 2007, no RefSeq protein TrEMBL entry Feb 2005, no SwissProt entry Genbank mRNA in May 2002 (complete CDS) Plenty of EST support ~ 100,000 bases upstream of other isoforms

Translation Start-Site


Expressed Sequence Tags (ESTs) Cheap, fast, coding Single sequencing reads of mRNA Sequence from 5’ or 3’ end No “assembly” http://www.ncbi.nlm.nih.gov/About/primer/est.html

Searching ESTs Proposed long ago: Yates, Eng, and McCormack; Anal Chem, ’95. Now: Protein sequences are sufficient for protein identification Computationally expensive/infeasible Difficult to interpret Goal: Make EST searching feasible for routine searching to discover novel peptides.

Searching Expressed Sequence Tags (ESTs) Pros No introns! Primary splicing evidence for annotation pipelines Evidence for dbSNP Often derived from clinical cancer samples Cons No frame Large (8Gb) “Untrusted” by annotation pipelines Highly redundant Nucleotide error rate ~ 1%

Compressed EST Peptide Sequence Database For all ESTs mapped to a UniGene gene: Six-frame translation Eliminate ORFs < 30 amino-acids Eliminate amino-acid 30-mers observed once Compress to C2 FASTA database Complete, Correct for amino-acid 30-mers Gene-centric peptide sequence database: Size: < 3% of naïve enumeration, 20774 FASTA entries Running time: ~ 1% of naïve enumeration search E-values: ~ 2% of naïve enumeration search results
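The first three steps of this pipeline (six-frame translation, dropping short ORFs, dropping 30-mers observed once) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the standard genetic code, treats an ORF as any stop-to-stop stretch, and counts a 30-mer as "observed" once per EST. The function names are hypothetical, and the final compression step is not shown.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def revcomp(dna):
    """Reverse complement of a DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frame(dna):
    """Yield the six translations (3 offsets x 2 strands) of a DNA string."""
    for strand in (dna, revcomp(dna)):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            yield "".join(CODON.get(c, "X") for c in codons)  # 'X' = ambiguous

def orfs(dna, min_len=30):
    """Stop-to-stop reading frames with at least min_len residues."""
    for frame in six_frame(dna):
        for orf in frame.split("*"):
            if len(orf) >= min_len:
                yield orf

def repeated_kmers(est_sequences, k=30, min_len=30):
    """Amino-acid k-mers seen in at least two ESTs."""
    counts = Counter()
    for dna in est_sequences:
        seen = set()                      # count each k-mer once per EST
        for orf in orfs(dna, min_len):
            seen.update(orf[i:i + k] for i in range(len(orf) - k + 1))
        counts.update(seen)
    return {kmer for kmer, n in counts.items() if n >= 2}
```

For a quick check on toy data, call `repeated_kmers(ests, k=3, min_len=3)` with small values of `k` and `min_len`; the defaults follow the 30-residue thresholds on the slide.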


SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

Compressed SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

Sequence Databases & CSBH-graphs Original sequences correspond to paths ACDEFGI, ACDEFACG, DEFGEFGI

Sequence Databases & CSBH-graphs All k-mers represented by an edge have the same count [Figure: CSBH-graph with edge counts 1, 2, 2, 1, 2]
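A small sketch of the graph construction for the three example sequences follows. The slides do not state k; the code assumes k = 4, which reproduces five compressed edges with counts 2, 2, 2, 1, 1. Nodes are (k-1)-mers, edges are distinct k-mers, and unbranched runs of edges are merged into one labeled edge. The function name is hypothetical, and this simple merge does not by itself guarantee uniform counts on every input (it only reports them).

```python
from collections import Counter, defaultdict

def compressed_edges(sequences, k):
    """Map each compressed-edge label to the sorted distinct counts of its k-mers."""
    counts = Counter(s[i:i + k] for s in sequences
                     for i in range(len(s) - k + 1))
    succ, pred = defaultdict(list), defaultdict(list)
    for kmer in counts:
        succ[kmer[:-1]].append(kmer)   # edges leaving the prefix node
        pred[kmer[1:]].append(kmer)    # edges entering the suffix node

    def simple(node):
        # A node with exactly one edge in and one edge out can be merged away.
        return len(succ[node]) == 1 and len(pred[node]) == 1

    edges = {}
    for kmer in counts:
        if simple(kmer[:-1]):
            continue                   # interior of some compressed edge
        label = kmer
        while simple(label[-(k - 1):]):
            label += succ[label[-(k - 1):]][0][-1]
        edges[label] = sorted({counts[label[i:i + k]]
                               for i in range(len(label) - k + 1)})
    return edges

example = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
for label, edge_counts in sorted(compressed_edges(example, 4).items()):
    print(label, edge_counts)
```

Cycles made up entirely of mergeable nodes have no starting point in this traversal and would be skipped; the example contains none.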

CSBH-graphs Quickly determine those that occur twice [Figure: CSBH-graph with edge counts 2, 2, 1, 2]

Correct, Complete (C2) Enumeration Set of paths that use each edge at least once ACDEFGEFGI, DEFACG
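The two properties can be checked directly on the example. The sketch below assumes k = 4 (the slides do not state k) and reads "complete" as "every k-mer of the original sequences appears in the enumeration" and "correct" as "the enumeration contains no k-mer absent from the originals". The helper names are hypothetical.

```python
def kmers(sequences, k):
    """Set of all k-mers occurring in a collection of sequences."""
    return {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}

def check_c2(original, enumeration, k):
    """Return (complete, correct) for an enumeration of the original k-mers."""
    orig, enum = kmers(original, k), kmers(enumeration, k)
    return orig <= enum, enum <= orig

original = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
enumeration = ["ACDEFGEFGI", "DEFACG"]

print(check_c2(original, enumeration, 4))          # both properties hold
print(sum(map(len, original)), "->", sum(map(len, enumeration)))
```

Naively concatenating sequences would fail the correctness check: `check_c2(original, ["ACDEFGIDEFACG"], 4)` reports spurious 4-mers such as FGID that occur in none of the originals.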

Compressed EST Database Gene centric compressed EST peptide sequence database 20,774 sequence entries ~8Gb vs 223 Mb ~35 fold compression 22 hours becomes 15 minutes E-values improve by similar factor! Makes routine EST searching feasible Search ESTs instead of IPI?
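As a quick arithmetic check of the figures on this slide, the ratios can be computed directly. The only assumption is that "8Gb" is read as 8000 Mb.

```python
# Size reduction: ~8 Gb of ESTs vs. a 223 Mb compressed database.
est_size_mb = 8 * 1000            # assumption: 8 Gb taken as 8000 Mb
compressed_mb = 223
size_ratio = est_size_mb / compressed_mb

# Search-time reduction: 22 hours vs. 15 minutes.
time_ratio = (22 * 60) / 15

print(f"size: {size_ratio:.1f}-fold, {100 / size_ratio:.1f}% of original")
print(f"time: {time_ratio:.0f}-fold, {100 / time_ratio:.1f}% of original")
```

The results are consistent with the "< 3%" size and "~ 1%" running-time figures quoted for the compressed database.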

Significant False Positives E-values are not enough! Random guessers are easy to beat. Post-translational modifications vs. amino-acid substitution methylation (on I/L, Q, R, C, H, K, S, T, N): +14 D → E, G → A, V → I/L, N → Q, S → T: +14 Peptide extension z=+2 → z=+3 Nonsense AA masses sum to precursor Need to ensure: fragment ions define novel sequence sequence evidence is strong other plausible explanations can be eliminated
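The +14 ambiguity can be verified from monoisotopic residue masses: each listed substitution adds one CH2 group, exactly the mass shift of a methylation. The sketch below is a minimal check using standard residue masses; the variable and function names are hypothetical.

```python
# Monoisotopic residue masses (Da) for the residues involved.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "T": 101.04768,
    "V": 99.06841, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "Q": 128.05858, "D": 115.02694, "E": 129.04259,
}
METHYLATION = 14.01565  # mass added by one CH2 group (Da)

SUBSTITUTIONS = [("D", "E"), ("G", "A"), ("V", "I"),
                 ("V", "L"), ("N", "Q"), ("S", "T")]

def substitution_shift(old, new):
    """Mass change (Da) when residue `old` is replaced by residue `new`."""
    return RESIDUE_MASS[new] - RESIDUE_MASS[old]

for old, new in SUBSTITUTIONS:
    shift = substitution_shift(old, new)
    print(f"{old} -> {new}: {shift:+.5f} Da "
          f"(differs from methylation by {abs(shift - METHYLATION):.5f})")
```

Because the shifts are identical, precursor mass alone cannot separate a methylated known peptide from a novel substitution; only fragment ions that localize the change can. The same table shows that I and L have identical masses, so peptides differing only by I/L cannot be distinguished by mass at all.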

Significant False Positives Peptide; E-value; supporting evidence:
DFLAGGLAAAISK; 2.2x10^-8; 2 ESTs
DFLAGGIAAAISK; 2.2x10^-8; IPI (2), RefSeq, mRNA, ~1400 ESTs
DFLAGGVAAAISK; 3.7x10^-8; IPI, RefSeq, mRNA, ~700 ESTs
DFLAGGVAAAISKMAVVPI; 3.5x10^-5; Genscan exon
AISFAKDFLAGGIAAAISK; 3.3x10^-4


Back to the lab... Current LC/MS/MS workflows identify a few peptides per protein ...not sufficient for protein isoforms Need to raise the sequence coverage to (say) 80% ...protein separation prior to LC/MS/MS analysis Potential for database of splice sites of (functional) proteins!

Spectral Matching for Peptide Identification Detection vs. identification Increased sensitivity & specificity No novel peptides NIST GC/MS Spectral Library Identifies small molecules, 100,000’s of (consensus) spectra Bundled/Sold with many instruments “Dot-product” spectral comparison Current project: Peptide MS/MS
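A "dot-product" comparison can be sketched as a cosine similarity between two binned spectra. This is a generic illustration, not the NIST scoring function (which applies its own peak weighting): the 1 Da bin width, the absence of intensity weighting, and the function names are all assumptions made for the example.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned

def dot_product_score(peaks_a, peaks_b, bin_width=1.0):
    """Normalized dot product (cosine) of two spectra; 1.0 means identical shape."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[i] * b[i] for i in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

library = [(147.1, 30.0), (260.2, 100.0), (359.3, 55.0)]
query = [(147.1, 28.0), (260.2, 90.0), (401.2, 10.0)]
print(round(dot_product_score(library, query), 3))
```

Because the score depends on peak intensities as well as positions, it can only be computed against spectra that have already been observed, which is why spectral matching detects known peptides but finds no novel ones.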

NIST MS Search: Peptides

Peptide DLATVYVDVLK

Protein Families


Peptide DLATVYVDVLK

Hidden Markov Models for Spectral Matching Capture statistical variation and consensus in peak intensity Capture semantics of peaks Extrapolate model to other peptides Good specificity with superior sensitivity for peptide detection Assign 1000’s of additional spectra (p-value < 10^-5)

Hidden Markov Model States: Delete, Insert, Ion (m/z, int) pair emitted by ion & insert states

Spectral Matching of Peptide Variants DFLAGGIAAAISK DFLAGGVAAAISK

Spectral Matching of Peptide Variants AVMDDFAAFVEK AVM*DDFAAFVEK

HMM model extrapolation

Conclusions Proteomics can inform genome annotation Eukaryotic and prokaryotic Functional vs silencing variants Peptides identify more than just proteins Untapped source of disease biomarkers Compressed peptide sequence databases make routine EST searching feasible Novel spectral matching technique using HMMs looks very promising

Acknowledgements Catherine Fenselau, Steve Swatkoski UMCP Biochemistry Chau-Wen Tseng, Xue Wu UMCP Computer Science Cheng Lee Calibrant Biosystems PeptideAtlas, HUPO PPP, X!Tandem Funding: NIH/NCI, USDA/ARS