Novel Peptide Identification using ESTs and Genomic Sequence Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

Presentation transcript:

Novel Peptide Identification using ESTs and Genomic Sequence Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

2 Novel Peptides
  Absent from traditional protein sequence databases
    IPI, SwissProt, TrEMBL, NCBI’s nr, MSDB
  Due to
    Deliberate “redundancy” elimination
    “Dark-side” genes
    Bias towards high-quality, high-confidence full-length protein sequence

3 What is missing?
  Known coding SNPs
  Novel coding mutations
  Alternative splicing isoforms
  Alternative translation start-sites
  Microexons
  Alternative translation frames

4 Why should we care?
  Alternative splicing is the norm!
    Only 20-25K human genes
    Each gene makes many proteins
  Proteins have clinical implications
    Biomarker discovery
  Evidence for SNPs and alternative splicing stops with transcription
    Genomic assays, ESTs, mRNA sequence
    No hard evidence for translation start site

5 Novel Protein
  HEQASNVLSDISEFR
  Evidence:
    log10(E-value) = ’s of ESTs
    Full length mRNA sequence
  Details:
    Peptide Atlas A8_IP (Resing et al.);

6 Novel Protein

9 Novel Splice Isoform
  LQGSATAAEAQVGHQTAR
  Evidence:
    log10(E-value) = ’s of ESTs
    Full length mRNA sequence
  Details:
    Peptide Atlas raftflow (von Haller, et al.); LIME1 gene

10 Novel Splice Isoform

11 Novel Splice Isoform

12 Novel Splice Isoform

13 Novel Frame
  TAGSPLCLPTPGAAPGSAGSCSHR
  Evidence:
    log10(E-value) = ’s of ESTs
    Full length mRNA sequence
  Details:
    Peptide Atlas raftflow (von Haller, et al.); LIME1 gene, downstream from LQGSA...

14 Novel Frame

15 Novel Frame

16 Novel Frame
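A peptide in a "novel frame" can only be matched if the nucleotide evidence (EST, mRNA or genomic sequence) is translated in every reading frame, not just the annotated one. The following is a minimal sketch of six-frame translation using the standard genetic code. The function names and the toy input are illustrative assumptions, not the pipeline used in the talk.

```python
# Standard genetic code, codons enumerated in TCAG order (NCBI table 1).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, frame=0):
    """Translate one forward frame (0, 1 or 2); unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame(dna):
    """Translate all three forward and three reverse-complement frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for f in range(3):
        frames[f"+{f + 1}"] = translate(dna, f)
        frames[f"-{f + 1}"] = translate(rc, f)
    return frames


if __name__ == "__main__":
    # Toy sequence, not taken from the talk.
    for name, protein in six_frame("ATGGCCTAA").items():
        print(name, protein)
```

Stop codons are reported as `*`, so a downstream step would still need to split each frame into candidate peptides.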

17 “Novel” Microexon
  LQTASDESYKDPTNIQLSK
  Evidence:
    log10(E-value) = ’s of ESTs / mRNA sequences
    SwissProt variant, absent from IPI
  Details:
    Peptide Atlas raftflow (von Haller, et al.); SPTAN1 gene

18 “Novel” Microexon

19 “Novel” Microexon

20 “Novel” Microexon

21 “Novel” Microexon

22 Novel Mutation
  KADDTWEPFASGK
  Evidence:
    log10(E-value) = ESTs from same clone library
    Ala2 Deletion
  Details:
    HUPO PPP 29_b1-EDTA_1 (Qian/He; Omenn et al.); TTR gene
  Known Mutation:
    Ala2-to-Pro associated with familial amyloidotic polyneuropathy.

23 Novel Mutation

24 Novel Mutation

25 Novel Mutation

26 Novel Mutation

27 Known Coding SNP
  DTEEEDFHVDQ[V|A]TTVK
  Evidence:
    log10(E-value) = -9.5 / -9.4
    Known dbSNP (coding): Val12-to-Ala
    Wildtype also observed
  Details:
    HUPO PPP 40 (Wang; Omenn et al.); SERPINA1 gene

28 Wildtype

29 Known Coding SNP

30 Known Coding SNP

31 Known Coding SNP
  LQHL[E|V]NELTHDIITK
  Evidence:
    log10(E-value) = -6.7/ ESTs, same clone library
    Known dbSNP (coding): Glu5-to-Val
    Wildtype also observed
  Details:
    HUPO PPP 28_b2-CIT (Pounds/Adkins/Rodland/Anderson; Omenn et al.); SERPINA1 gene
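The coding-SNP slides write both alleles in one string, for example DTEEEDFHVDQ[V|A]TTVK. The sketch below shows one way to expand that bracket notation into the individual peptide sequences and to report the 1-based peptide position of each variant site. It assumes every bracket holds single-residue alternatives separated by `|`; the function names are hypothetical.

```python
import re
from itertools import product

# Matches one bracketed variant site such as "[V|A]".
SITE = re.compile(r"\[([A-Z|]+)\]")


def expand_variants(pattern):
    """Return every peptide encoded by the bracket notation, wildtype first."""
    parts = SITE.split(pattern)  # literal, alternatives, literal, ...
    choices = [p.split("|") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in product(*choices)]


def variant_positions(pattern):
    """Return the 1-based peptide positions of the variant sites."""
    positions = []
    pos = 0
    for i, part in enumerate(SITE.split(pattern)):
        if i % 2:            # a bracketed site occupies one residue
            pos += 1
            positions.append(pos)
        else:                # literal residues
            pos += len(part)
    return positions


if __name__ == "__main__":
    for p in ("DTEEEDFHVDQ[V|A]TTVK", "LQHL[E|V]NELTHDIITK"):
        print(p, expand_variants(p), variant_positions(p))
```

For the two peptides on the slides this gives variant positions 12 and 5, consistent with the Val12-to-Ala and Glu5-to-Val labels.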

32 IPI Common Variant Elimination
  YYGGGYGSTQATFMVFQALAQYQK
  Evidence:
    log10(E-value) = ’s ESTs, mRNA sequence
    IPI has (rare) variant (Insertion of Differ in 5’ splice site.
  Details:
    HUPO PPP 29 (Qian/He; Omenn et al.); C3 gene

33 Why don’t we see more novel peptides?
  Tandem mass spectrometry doesn’t discriminate against novel peptides...
  ...but protein sequence databases do!
  Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

34 Why don’t we see more novel peptides?
  Traditional protein sequence databases
    High-quality, full-length proteins only
    Many interesting peptides are omitted
    Exclusive – peptide identifications are lost.
  ESTs, genomic & mRNA sequence
    Used as evidence for full-length protein sequences
    Inclusive – may need to filter results

35 Significant False Positives
  E-values are not enough!
    Random guessers are easy to beat.
  Post-translational modifications vs. amino-acid substitution
    methylation (on I/L, Q, R, C, H, K, S, T, N): +14
    D → E, G → A, V → I/L, N → Q, S → T: +14
  Peptide extension
    z=+2 → z=+3
  Nonsense AA masses sum to precursor
  Need to ensure:
    fragment ions define novel sequence
    sequence evidence is strong
    other plausible explanations can be eliminated
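The "+14" coincidence on this slide can be checked directly from monoisotopic residue masses. The sketch below lists every single amino-acid substitution whose mass shift matches a methylation (an added CH2, about 14.0157 Da) within a tolerance. The mass table uses standard monoisotopic residue masses rounded to five decimals; the 0.001 Da tolerance and the function name are assumptions chosen for illustration.

```python
from itertools import permutations

# Monoisotopic residue masses (Da), rounded to five decimals.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# Methylation adds CH2: one carbon (12.0) and two hydrogens.
METHYLATION = 12.0 + 2 * 1.00782503


def substitutions_matching(delta, tol=0.001):
    """Return (from, to) residue pairs whose mass shift equals delta +/- tol."""
    hits = []
    for a, b in permutations(RESIDUE_MASS, 2):
        if abs(RESIDUE_MASS[b] - RESIDUE_MASS[a] - delta) <= tol:
            hits.append((a, b))
    return sorted(hits)


if __name__ == "__main__":
    for a, b in substitutions_matching(METHYLATION):
        print(f"{a} -> {b}")
```

With this table the matching pairs are D → E, G → A, N → Q, S → T and V → I/L, the same substitutions the slide lists, so a precursor mass shift of +14 alone cannot separate a methylated known peptide from a substituted novel one.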

36 Significant False Positives
  Peptide               E-value      Sequence evidence
  DFLAGGLAAAISK         2.2x         ESTs
  DFLAGGIAAAISK         2.2x10^-8    IPI (2), RefSeq, mRNA, ~1400 ESTs
  DFLAGGVAAAISK         3.7x10^-8    IPI, RefSeq, mRNA, ~700 ESTs
  DFLAGGVAAAISKMAVVPI   3.5x10^-5    Genscan exon
  AISFAKDFLAGGIAAAISK   3.3x10^-4    Genscan exon
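The first three peptides on this slide differ at a single residue (L, I or V). As a rough illustration of why such candidates compete for the same spectrum, the sketch below computes neutral monoisotopic peptide masses. It covers only the residues needed for these three peptides, and the function name is an assumption for this example.

```python
# Monoisotopic residue masses (Da) for the residues used below.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "I": 113.08406, "D": 115.02694, "K": 128.09496,
    "F": 147.06841,
}
WATER = 18.010565  # one H2O is added for the free N- and C-termini


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    for pep in ("DFLAGGLAAAISK", "DFLAGGIAAAISK", "DFLAGGVAAAISK"):
        print(pep, round(peptide_mass(pep), 4))
```

Leucine and isoleucine are isobaric, so the L and I variants have identical masses and cannot be told apart by mass alone. The V variant is lighter by about 14.0157 Da, the same shift as a methylation, so sequence evidence is needed to choose between the explanations.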

37 Significant False Positives

38 How do we know they are novel? How do we know they are real?
  Good spectra
  Good E-value
  Good ion ladders
  Good sequence evidence
  Lack of other explanations...

39 Peptide Sequence Evidence
  C3 Compression:
    Amino-acid 30-mers
    Complete, Correct(, Compact)
    Present at least twice (ESTs only)
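The "present at least twice" filter can be sketched as a k-mer support count. The example below keeps only amino-acid k-mers that occur in at least two distinct input sequences. Counting each k-mer once per sequence is an assumption about what "twice" means here, and the function name and toy inputs are illustrative only.

```python
from collections import Counter


def supported_kmers(sequences, k=30, min_support=2):
    """Return the k-mers found in at least `min_support` distinct sequences."""
    support = Counter()
    for seq in sequences:
        # A set, so repeats inside one sequence count only once.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return {kmer for kmer, n in support.items() if n >= min_support}


if __name__ == "__main__":
    # Toy amino-acid strings with k=4 to keep the output readable;
    # the slide uses 30-mers.
    toy = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
    print(sorted(supported_kmers(toy, k=4)))
```

Requiring repeated observation is a cheap guard against sequencing errors in single ESTs, at the cost of dropping sequence that was genuinely observed only once.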

40 SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

41 Compressed-SBH-graph ACDEFGI
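The figures for these two slides are not in the transcript, so the following is only a sketch of the general idea, assuming the SBH-graph is a de Bruijn-style graph whose edges are k-mers and whose nodes are their (k-1)-mer prefixes and suffixes. Compression then merges every unbranched chain of edges into a single labelled edge. The choice k=4 and both function names are assumptions made for this example.

```python
from collections import defaultdict


def sbh_edges(sequences, k):
    """Each distinct k-mer is one edge: prefix (k-1)-mer -> suffix (k-1)-mer."""
    return {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}


def compress(edges):
    """Merge unbranched chains of edges into single sequence-labelled edges."""
    out = defaultdict(list)
    indeg = defaultdict(int)
    for e in sorted(edges):
        out[e[:-1]].append(e[1:])
        indeg[e[1:]] += 1
    nodes = sorted(set(out) | set(indeg))

    def is_simple(v):
        # Exactly one way in and one way out: safe to merge through.
        return indeg.get(v, 0) == 1 and len(out.get(v, [])) == 1

    paths = []
    for v in nodes:
        if is_simple(v):
            continue
        for w in out.get(v, []):
            label = v + w[-1]
            while is_simple(w):
                w = out[w][0]
                label += w[-1]
            paths.append(label)
    # Note: isolated cycles made only of simple nodes are not emitted here.
    return paths


if __name__ == "__main__":
    seqs = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
    edges = sbh_edges(seqs, k=4)
    print(len(edges), "edges before compression")
    print(compress(edges))
```

On the three sequences from the slide with k=4, ten 4-mer edges collapse to five labelled edges, and every original 4-mer still appears in exactly one label.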

42 Peptide Sequence Databases
  MS/MS search engine input only
    Protein context is lost
    Inclusive, rather than exclusive
    Download from
  Exact string search for gene/protein context
    Recover peptide sequence evidence
  Relational database to reassemble...
    ...with respect to genes & genome
  Grid Computing + Web Services + Viewer
    Work in progress
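Because a peptide sequence database discards protein context, an identified peptide has to be mapped back to its source sequences afterwards. A minimal sketch of the exact string search step is shown below. The function name, the returned tuple layout and the toy sequences are all assumptions for illustration, and a real implementation would likely use an index rather than a linear scan.

```python
def find_context(peptide, proteins, flank=1):
    """Locate every exact occurrence of `peptide` in a dict of sequences.

    Returns (name, 1-based start, left flank, right flank) tuples;
    a flank of '-' marks a sequence terminus.
    """
    hits = []
    for name, seq in proteins.items():
        start = seq.find(peptide)
        while start != -1:
            end = start + len(peptide)
            left = seq[max(0, start - flank):start] or "-"
            right = seq[end:end + flank] or "-"
            hits.append((name, start + 1, left, right))
            start = seq.find(peptide, start + 1)  # allow overlapping hits
    return hits


if __name__ == "__main__":
    # Made-up sequences, not real database entries.
    toy = {"toy1": "MKADDTWEPFASGKR", "toy2": "GGGADDTWEPFASGK"}
    print(find_context("ADDTWEPFASGK", toy))
```

The flanking residues are useful for checking that a hit is consistent with the expected enzymatic cleavage, for example a K or R immediately before a tryptic peptide.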

43 Peptide Identification Navigator

44 Peptide Identification Navigator

45 Conclusions
  Peptides identify more than proteins
  Search EST sequences (at least)
  Compressed peptide sequence databases make this feasible