Published by William Butler. Modified over 9 years ago.
1
Novel Peptide Identification using ESTs and Genomic Sequence
Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park
2
Novel Peptides
- Absent from traditional protein sequence databases
  - IPI, SwissProt, TrEMBL, NCBI’s nr, MSDB
- Due to
  - Deliberate “redundancy” elimination
  - “Dark-side” genes
  - Bias towards high-quality, high-confidence full-length protein sequence
3
What is missing?
- Known coding SNPs
- Novel coding mutations
- Alternative splicing isoforms
- Alternative translation start-sites
- Microexons
- Alternative translation frames
4
Why should we care?
- Alternative splicing is the norm!
  - Only 20-25K human genes
  - Each gene makes many proteins
- Proteins have clinical implications
  - Biomarker discovery
- Evidence for SNPs and alternative splicing stops with transcription
  - Genomic assays, ESTs, mRNA sequence
  - No hard evidence for translation start site
5
Novel Protein
HEQASNVLSDISEFR
- Evidence:
  - log10(E-value) = -9.6
  - 100s of ESTs
  - Full-length mRNA sequence
- Details:
  - Peptide Atlas A8_IP (Resing et al.);
6
Novel Protein
9
Novel Splice Isoform
LQGSATAAEAQVGHQTAR
- Evidence:
  - log10(E-value) = -6.8
  - 10s of ESTs
  - Full-length mRNA sequence
- Details:
  - Peptide Atlas raftflow (von Haller, et al.); LIME1 gene
10-12
Novel Splice Isoform
13
Novel Frame
TAGSPLCLPTPGAAPGSAGSCSHR
- Evidence:
  - log10(E-value) = -3.9
  - 10s of ESTs
  - Full-length mRNA sequence
- Details:
  - Peptide Atlas raftflow (von Haller, et al.); LIME1 gene, downstream from LQGSA...
14-16
Novel Frame
17
“Novel” Microexon
LQTASDESYKDPTNIQLSK
- Evidence:
  - log10(E-value) = -6.4
  - 10s of ESTs / mRNA sequences
  - SwissProt variant, absent from IPI
- Details:
  - Peptide Atlas raftflow (von Haller, et al.); SPTAN1 gene
18-21
“Novel” Microexon
22
Novel Mutation
KADDTWEPFASGK
- Evidence:
  - log10(E-value) = -7.6
  - 2 ESTs from same clone library
  - Ala2 deletion
- Details:
  - HUPO PPP 29_b1-EDTA_1 (Qian/He; Omenn et al.); TTR gene
- Known mutation:
  - Ala2-to-Pro, associated with familial amyloidotic polyneuropathy
23-26
Novel Mutation
27
Known Coding SNP
DTEEEDFHVDQ[V|A]TTVK
- Evidence:
  - log10(E-value) = -9.5 / -9.4
  - Known dbSNP (coding): Val12-to-Ala
  - Wildtype also observed
- Details:
  - HUPO PPP 40 (Wang; Omenn et al.); SERPINA1 gene
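The bracket notation on this slide packs the wildtype and variant peptides into one string. A minimal sketch of how such a pattern could be expanded into the concrete peptides that were searched for; the function name `expand_variants` is hypothetical and not part of the talk:

```python
import itertools
import re


def expand_variants(pattern: str) -> list[str]:
    """Expand 'ABC[V|A]DEF' style notation into every concrete peptide."""
    # Split into literal runs and bracketed alternatives, keeping the brackets.
    parts = re.split(r"(\[[^\]]*\])", pattern)
    choices = []
    for part in parts:
        if part.startswith("["):
            # '[V|A]' -> ['V', 'A']
            choices.append(part[1:-1].split("|"))
        else:
            choices.append([part])
    return ["".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    for peptide in expand_variants("DTEEEDFHVDQ[V|A]TTVK"):
        print(peptide)
```

The first alternative in each bracket is listed first, so the wildtype residue comes out ahead of the variant when the pattern is written wildtype-first, as it is on the slide.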
28
Wildtype
29-30
Known Coding SNP
31
Known Coding SNP
LQHL[E|V]NELTHDIITK
- Evidence:
  - log10(E-value) = -6.7 / -10.9
  - 4 ESTs, same clone library
  - Known dbSNP (coding): Glu5-to-Val
  - Wildtype also observed
- Details:
  - HUPO PPP 28_b2-CIT (Pounds/Adkins/Rodland/Anderson; Omenn et al.); SERPINA1 gene
32
IPI Common Variant Elimination
YYGGGYGSTQATFMVFQALAQYQK
- Evidence:
  - log10(E-value) = -5.9
  - 100s of ESTs, mRNA sequence
  - IPI has (rare) variant (insertion of AS@10)
  - Differ in 5’ splice site
- Details:
  - HUPO PPP 29 (Qian/He; Omenn et al.); C3 gene
33
Why don’t we see more novel peptides?
- Tandem mass spectrometry doesn’t discriminate against novel peptides...
- ...but protein sequence databases do!
- Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!
34
Why don’t we see more novel peptides?
- Traditional protein sequence databases
  - High-quality, full-length proteins only
  - Many interesting peptides are omitted
  - Exclusive – peptide identifications are lost
- ESTs, genomic & mRNA sequence
  - Used as evidence for full-length protein sequences
  - Inclusive – may need to filter results
35
Significant False Positives
- E-values are not enough!
  - Random guessers are easy to beat.
- Post-translational modifications vs. amino-acid substitution
  - methylation (on I/L, Q, R, C, H, K, S, T, N): +14 Da
  - D → E, G → A, V → I/L, N → Q, S → T: +14 Da
- Peptide extension
  - z=+2 → z=+3
  - Nonsense AA masses sum to precursor
- Need to ensure:
  - fragment ions define novel sequence
  - sequence evidence is strong
  - other plausible explanations can be eliminated
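The +14 coincidence on this slide can be checked from standard monoisotopic residue masses: each listed substitution adds exactly one CH2, the same mass shift as a methylation. A minimal sketch; the table holds commonly tabulated residue masses rounded to five decimals, and the names `RESIDUE_MASS` and `substitution_shift` are illustrative, not from the talk:

```python
# Monoisotopic residue masses in daltons (standard tabulated values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "T": 101.04768,
    "V": 99.06841, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "Q": 128.05858, "D": 115.02694, "E": 129.04259,
}

# Mass added by one methyl group replacing a hydrogen (net CH2).
METHYLATION_SHIFT = 14.01565

# The substitutions listed on the slide.
SUBSTITUTIONS = [("D", "E"), ("G", "A"), ("V", "I"), ("V", "L"), ("N", "Q"), ("S", "T")]


def substitution_shift(old: str, new: str) -> float:
    """Mass change when residue `old` is replaced by residue `new`."""
    return RESIDUE_MASS[new] - RESIDUE_MASS[old]


if __name__ == "__main__":
    for old, new in SUBSTITUTIONS:
        shift = substitution_shift(old, new)
        same = abs(shift - METHYLATION_SHIFT) < 1e-4
        print(f"{old} -> {new}: {shift:+.5f} Da, matches methylation: {same}")
```

Because the shifts agree to within rounding, precursor mass alone cannot separate a methylated known peptide from a substituted novel one; only fragment ions that localize the change can.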
36
Significant False Positives

Peptide               E-value       Sequence evidence
DFLAGGLAAAISK         2.2 x 10^-8   2 ESTs
DFLAGGIAAAISK         2.2 x 10^-8   IPI (2), RefSeq, mRNA, ~1400 ESTs
DFLAGGVAAAISK         3.7 x 10^-8   IPI, RefSeq, mRNA, ~700 ESTs
DFLAGGVAAAISKMAVVPI   3.5 x 10^-5   Genscan exon
AISFAKDFLAGGIAAAISK   3.3 x 10^-4   Genscan exon
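The first two candidates differ only by Leu versus Ile, which have the same mass, so they score identically and only sequence evidence can separate them. A minimal sketch of grouping candidates that are indistinguishable in this way by mapping I to L before comparing; the data are the rows from the slide, while `il_key` and `group_indistinguishable` are hypothetical helper names:

```python
from collections import defaultdict

# (peptide, E-value, sequence evidence) as listed on the slide.
CANDIDATES = [
    ("DFLAGGLAAAISK", 2.2e-8, "2 ESTs"),
    ("DFLAGGIAAAISK", 2.2e-8, "IPI (2), RefSeq, mRNA, ~1400 ESTs"),
    ("DFLAGGVAAAISK", 3.7e-8, "IPI, RefSeq, mRNA, ~700 ESTs"),
    ("DFLAGGVAAAISKMAVVPI", 3.5e-5, "Genscan exon"),
    ("AISFAKDFLAGGIAAAISK", 3.3e-4, "Genscan exon"),
]


def il_key(peptide: str) -> str:
    """Canonical form in which isobaric Leu and Ile are treated as one residue."""
    return peptide.replace("I", "L")


def group_indistinguishable(candidates):
    """Group candidates whose sequences differ only by I/L swaps."""
    groups = defaultdict(list)
    for peptide, evalue, evidence in candidates:
        groups[il_key(peptide)].append((peptide, evalue, evidence))
    return dict(groups)


if __name__ == "__main__":
    for members in group_indistinguishable(CANDIDATES).values():
        if len(members) > 1:
            print("Cannot be separated by mass:")
            for peptide, evalue, evidence in members:
                print(f"  {peptide}  E={evalue:g}  evidence: {evidence}")
```

Within such a group the spectrum gives no preference, so the better-supported sequence is the more plausible explanation and the weakly supported one is not good evidence for a novel peptide.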
37
Significant False Positives
38
How do we know they are novel? How do we know they are real?
- Good spectra
- Good E-value
- Good ion ladders
- Good sequence evidence
- Lack of other explanations...
39
Peptide Sequence Evidence
- C³ Compression:
  - Amino-acid 30-mers
  - Complete, Correct(, Compact)
  - Present at least twice (ESTs only)
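The "present at least twice" filter can be sketched as a k-mer support count. This is only an illustration under stated assumptions: support is counted as the number of distinct input sequences containing the k-mer, the inputs are already amino-acid translations, and `supported_kmers` is a hypothetical name rather than the tool described in the talk.

```python
from collections import defaultdict


def supported_kmers(sequences, k=30, min_support=2):
    """Return the k-mers that occur in at least `min_support` distinct sequences."""
    seen_in = defaultdict(set)
    for index, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            # Record which sequence each k-mer came from, not how often it repeats.
            seen_in[seq[start:start + k]].add(index)
    return {kmer for kmer, ids in seen_in.items() if len(ids) >= min_support}


if __name__ == "__main__":
    # Toy amino-acid sequences with a small k so the effect is visible.
    toy = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
    print(sorted(supported_kmers(toy, k=4)))
```

Counting distinct sequences rather than raw occurrences keeps a single error-prone EST from supporting its own k-mers through internal repeats.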
40
SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
41
Compressed-SBH-graph
[Figure labels: ACDEFGI; 2, 2, 1, 2, 1]
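The figures for these two slides did not survive, so the following is only a sketch of the general idea, not a reconstruction of them. It assumes an SBH-graph in the de Bruijn sense: every k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, with a count of how often it occurs, and compression merges chains of edges that pass through nodes with exactly one incoming and one outgoing edge. The choice k=3 and the names `sbh_graph` and `compress` are assumptions for illustration, and this compression keeps only edge labels, not counts.

```python
from collections import Counter, defaultdict


def sbh_graph(sequences, k):
    """Count k-mer edges; each k-mer links its (k-1)-prefix to its (k-1)-suffix."""
    edges = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            edges[seq[start:start + k]] += 1
    return edges


def compress(edges):
    """Merge non-branching chains of edges into single sequence-labelled edges.

    Cycles made up entirely of non-branching nodes are not reported by this sketch.
    """
    out_edges = defaultdict(list)
    in_degree = Counter()
    for kmer in edges:
        out_edges[kmer[:-1]].append(kmer)
        in_degree[kmer[1:]] += 1

    def is_simple(node):
        return in_degree[node] == 1 and len(out_edges[node]) == 1

    labels = []
    for node in list(out_edges):
        if is_simple(node):
            continue  # chains start only at branching or terminal nodes
        for kmer in out_edges[node]:
            label, nxt = kmer, kmer[1:]
            while is_simple(nxt):
                following = out_edges[nxt][0]
                label += following[-1]
                nxt = following[1:]
            labels.append(label)
    return sorted(labels)


if __name__ == "__main__":
    graph = sbh_graph(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=3)
    print(dict(graph))
    print(compress(graph))
```

Every input sequence is still spelled by a walk through the compressed edges, which is the sense in which the representation stays complete while storing shared stretches only once.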
42
Peptide Sequence Databases
- MS/MS search engine input only
  - Protein context is lost
  - Inclusive, rather than exclusive
- Download from http://www.umiacs.umd.edu/~nedwards
- Exact string search for gene/protein context
  - Recover peptide sequence evidence
- Relational database to reassemble...
  - ...with respect to genes & genome
- Grid Computing + Web Services + Viewer
  - Work in progress
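Recovering context by exact string search can be sketched as a six-frame translation of a nucleotide sequence followed by a plain substring search for the peptide. This is an illustrative sketch only: it uses the standard genetic code, marks codons with ambiguous bases as X, and the function names are hypothetical rather than the tools behind the download link.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str):
    """Yield (strand, offset, protein) for all six reading frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    for strand, seq in (("+", dna), ("-", reverse)):
        for offset in range(3):
            yield strand, offset, translate(seq[offset:])


def find_peptide(peptide: str, dna: str):
    """Return (strand, frame offset, amino-acid position) for every exact match."""
    hits = []
    for strand, offset, protein in six_frames(dna):
        position = protein.find(peptide)
        while position != -1:
            hits.append((strand, offset, position))
            position = protein.find(peptide, position + 1)
    return hits


if __name__ == "__main__":
    # Hypothetical toy sequence: 'MAK' is encoded starting at the third base.
    print(find_peptide("MAK", "CCATGGCTAAATAG"))
```

Each hit gives the strand, frame, and position needed to map an identified peptide back onto an EST or genomic sequence, which is the context the compressed peptide database itself no longer carries.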
43-44
Peptide Identification Navigator
45
Conclusions
- Peptides identify more than proteins
- Search EST sequences (at least)
- Compressed peptide sequence databases make this feasible