Download presentation
Presentation is loading. Please wait.
Published byLindsay Moody Modified over 9 years ago
1
Proteomic Characterization of Alternative Splicing and Coding Polymorphism Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park
2
2 Mass Spectrometry for Proteomics Measure mass of many (bio)molecules simultaneously High bandwidth Mass is an intrinsic property of all (bio)molecules No prior knowledge required
3
3 Mass Spectrometry for Proteomics Measure mass of many molecules simultaneously...but not too many, abundance bias Mass is an intrinsic property of all (bio)molecules...but need a reference to compare to
4
4 High Bandwidth
5
5 Mass is fundamental!
6
6 Mass Spectrometry for Proteomics Mass spectrometry has been around since the turn of the century......why is MS based Proteomics so new? Ionization methods MALDI, Electrospray Protein chemistry & automation Chromatography, Gels, Computers Protein sequence databases A reference for comparison
7
7 Sample Preparation for Peptide Identification Enzymatic Digest and Fractionation
8
8 Single Stage MS MS m/z
9
9 Tandem Mass Spectrometry (MS/MS) Precursor selection m/z
10
10 Tandem Mass Spectrometry (MS/MS) Precursor selection + collision induced dissociation (CID) MS/MS m/z
11
11 Peptide Identification For each (likely) peptide sequence 1. Compute fragment masses 2. Compare with spectrum 3. Retain those that match well Peptide sequences from protein sequence databases Swiss-Prot, IPI, NCBI’s nr,... Automated, high-throughput peptide identification in complex mixtures
12
12 Why don’t we see more novel peptides? Tandem mass spectrometry doesn’t discriminate against novel peptides......but protein sequence databases do! Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!
13
13 What goes missing? Known coding SNPs Novel coding mutations Alternative splicing isoforms Alternative translation start-sites Microexons Alternative translation frames
14
14 Why should we care? Alternative splicing is the norm! Only 20-25K human genes Each gene makes many proteins Proteins have clinical implications Biomarker discovery Evidence for SNPs and alternative splicing stops with transcription Genomic assays, ESTs, mRNA sequence. Little hard evidence for translation start site
15
15 Novel Splice Isoform Human Jurkat leukemia cell-line Lipid-raft extraction protocol, targeting T cells von Haller, et al. MCP 2003. LIME1 gene: LCK interacting transmembrane adaptor 1 LCK gene: Leukocyte-specific protein tyrosine kinase Proto-oncogene Chromosomal aberration involving LCK in leukemias. Multiple significant peptide identifications
16
16 Novel Splice Isoform
17
17 Novel Splice Isoform
18
18 Novel Frame
19
19 Novel Frame
20
20 Novel Mutation HUPO Plasma Proteome Project Pooled samples from 10 male & 10 female healthy Chinese subjects Plasma/EDTA sample protocol Li, et al. Proteomics 2005. (Lab 29) TTR gene Transthyretin (pre-albumin) Defects in TTR are a cause of amyloidosis. Familial amyloidotic polyneuropathy late-onset, dominant inheritance
21
21 Novel Mutation Ala2→Pro associated with familial amyloid polyneuropathy
22
22 Novel Mutation
23
23 Searching ESTs Proposed long ago: Yates, Eng, and McCormack; Anal Chem, ’95. Now: Protein sequences are sufficient for protein identification Computationally expensive/infeasible Difficult to interpret Make EST searching feasible for routine searching to discover novel peptides.
24
24 Searching Expressed Sequence Tags (ESTs) Pros No introns! Primary splicing evidence for annotation pipelines Evidence for dbSNP Often derived from clinical cancer samples Cons No frame Large (8Gb) “Untrusted” by annotation pipelines Highly redundant Nucleotide error rate ~ 1%
25
25 Compressed EST Peptide Sequence Database For all ESTs mapped to a UniGene gene: Six-frame translation Eliminate ORFs < 30 amino-acids Eliminate amino-acid 30-mers observed once Compress to C 2 FASTA database Complete, Correct for amino-acid 30-mers Gene-centric peptide sequence database: Size: < 3% of naïve enumeration, 20774 FASTA entries Running time: ~ 1% of naïve enumeration search E-values: ~ 2% of naïve enumeration search results
26
26 Compressed EST Peptide Sequence Database For all ESTs mapped to a UniGene gene: Six-frame translation Eliminate ORFs < 30 amino-acids Eliminate amino-acid 30-mers observed once Compress to C 2 FASTA database Complete, Correct for amino-acid 30-mers Gene-centric peptide sequence database: Size: < 3% of naïve enumeration, 20774 FASTA entries Running time: ~ 1% of naïve enumeration search E-values: ~ 2% of naïve enumeration search results
27
27 SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI
28
28 Compressed SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI
29
29 Sequence Databases & CSBH-graphs Original sequences correspond to paths ACDEFGI, ACDEFACG, DEFGEFGI
30
30 Sequence Databases & CSBH-graphs All k-mers represented by an edge have the same count 2 2 1 2 1
31
31 cSBH-graphs Quickly determine those that occur twice 2 2 1 2
32
32 Correct, Complete, Compact (C 3 ) Enumeration Set of paths that use each edge exactly once ACDEFGEFGI, DEFACG
33
33 Correct, Complete (C 2 ) Enumeration Set of paths that use each edge at least once ACDEFGEFGI, DEFACG
34
34 Patching the CSBH-graph Use artificial edges to fix unbalanced nodes
35
35 Compressed EST Database Gene centric compressed EST peptide sequence database 20,774 sequence entries ~8Gb vs 223 Mb ~35 fold compression 22 hours becomes 15 minutes E-values improve by similar factor! Makes routine EST searching feasible Search ESTs instead of IPI?
36
36 “Novel Peptide” Computational Infrastructure Binaries (C++) cSBH-graph construction Condor grid-enabled Eulerian path k-mer enumeration Suitable for large graphs Data-model for peptide identification Spectra (>5 million) Peptide identifications Mascot, SEQUEST, X!Tandem, NIST Genomic context of peptides
37
37 “Novel Peptide” Computational Infrastructure Condor grid-enabled MS/MS search Mascot, X!Tandem, (Inspect, OMSSA) TurboGears python web-stack SQLObject Object-Relational-Manager MVC web-application framework Suitable for AJAX & web-services too Integration with UCSC genome browser caBIG compatible web-services Java applet for viewing spectra
38
38 Peptide Identification Navigator
39
39 Peptide Identification Navigator
40
40 Spectrum Viewer
41
41 Spectrum Viewer
42
42 Back to the lab... Current LC/MS/MS workflows identify a few peptides per protein...not sufficient for protein isoforms Need to raise the sequence coverage to (say) 80%...protein separation prior to LC/MS/MS analysis Potential for database of splice sites of (functional) proteins!
43
43 Microorganism Identification by MALDI Mass Spectrometry Direct observation of microorganism biomarkers in the field. Peaks represent masses of abundant proteins. Statistical models assess identification significance. B.anthracis spores MALDI Mass Spectrometry
44
44 Key Principles Protein mass from protein sequence No introns, few PTMs Specificity of single mass is very weak Statistical significance from many peaks Not all proteins are equally likely to be observed Ribosomal proteins, SASPs
45
45 Rapid Microorganism Identification Database (www.RMIDb.org) Protein Sequences 8.1M (2.9M) Species ~ 18K Genbank, Microbial, Virus, Plasmid RefSeq CMR, Swiss-Prot TrEMBL
46
46 Rapid Microorganism Identification Database (www.RMIDb.org)
47
47 Informatics Issues Need good species / strain annotation B.anthracis vs B.thuringiensis Need correct protein sequence B.anthracis Sterne α/β SASP RefSeq/Gb: MVMARN... (7442 Da) CMR: MARN... (7211 Da) Need chemistry based protein classification
48
48 Conclusions Proteomics can inform genome annotation Eukaryotic and prokaryotic Functional vs silencing variants Peptides identify more than just proteins Untapped source of disease biomarkers Compressed peptide sequence databases make routine EST searching feasible
49
49 Future Research Directions Identification of protein isoforms: Optimize proteomics workflow for isoform detection Identify splice variants in cancer cell-lines (MCF-7) and clinical brain tumor samples Aggressive peptide sequence enumeration dbPep for genomic annotation Open, flexible informatics infrastructure for peptide identification
50
50 Future Research Directions Proteomics for Microorganism Identification Specificity of tandem mass spectra Revamp RMIDb prototype Incorporate spectral matching Primer design k-mer sets as FASTA sequence databases Uniqueness oracle for exact and inexact match Integration with Primer3 Tiling, multiplexing, pooling, & tag arrays
51
51 Acknowledgements Chau-Wen Tseng, Xue Wu UMCP Computer Science Catherine Fenselau, Steve Swatkoski UMCP Biochemistry Calibrant Biosystems PeptideAtlas, HUPO PPP, X!Tandem Funding: National Cancer Institute
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.