Annotating genomes using proteomics data
Andy Jones
Department of Preclinical Veterinary Science

Overview
- Genome annotation
  - Current informatics methods
  - Experimental data
  - How good are we at annotating genomes?
- Proteome data for genome annotation
  - Study on Toxoplasma
  - Challenges
  - Proposed solutions

Summary: 780 “completed” genomes; 734 “draft” assemblies; 842 “in progress”
Total: 2356 (1996 prokaryote, 360 eukaryote)
Genome sequencing is just a starting point for understanding genes and proteins

Annotating eukaryotic genomes
Genome annotation:
- Find start codons / transcriptional initiation
- Recognise splice acceptor and donor sequences
- Find the stop codon
- Predict alternative splicing...
[Figure: gene structure diagram. Labels: start codon, exon 1, exon 2, exon 3, exon 4, stop codon, genomic DNA, mRNA]

Computational gene prediction
De novo prediction (single genome)
- Trained with “typical” gene structures: learns exon-intron signals and translation initiation and termination signals, e.g. with Markov models
- Many different predictions are scored against a training set of known genes
Multiple genome
- Compare confirmed gene sequences from other species
- Coding regions are more highly conserved, so conservation indicates gene position
- Pattern searching: higher mutation rate of bases separated by multiples of three (mutations in the 3rd position of codons are often silent)
Experimental data also contribute to many genome projects
New methods weigh evidence from a variety of sources
- Attempting to reproduce how a human annotator would work
Brent, Nat Rev Genet Jan;9(1):62-73
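The pattern-searching idea (silent changes concentrate at the third codon position) can be illustrated with a minimal sketch. It assumes two already aligned, gap-free sequences of equal length; the function names and the toy sequences are hypothetical and are not taken from any gene finder named in the slides.

```python
def mismatches_by_codon_position(seq_a, seq_b, frame=0):
    """Count mismatches falling on codon positions 1, 2 and 3.

    Assumes seq_a and seq_b are aligned without gaps; `frame` is the
    offset (0, 1 or 2) at which the first codon starts.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = [0, 0, 0]
    for i in range(frame, len(seq_a)):
        if seq_a[i] != seq_b[i]:
            counts[(i - frame) % 3] += 1
    return counts


def best_frame(seq_a, seq_b):
    """Return the frame in which position 3 takes the largest share of mismatches."""
    def third_position_share(frame):
        counts = mismatches_by_codon_position(seq_a, seq_b, frame)
        total = sum(counts)
        return counts[2] / total if total else 0.0
    return max(range(3), key=third_position_share)


# Toy orthologous coding fragments that differ only at synonymous third positions.
species_a = "ATGGCTAAAGGTCTGGAA"
species_b = "ATGGCCAAGGGACTTGAG"
print(mismatches_by_codon_position(species_a, species_b))
print(best_frame(species_a, species_b))
```

A real comparative gene finder would work on gapped alignments and model substitution rates statistically; the sketch only shows why a period-three mismatch pattern hints at a coding region and its reading frame.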

Experimental corroboration of models
Expressed Sequence Tags (ESTs)
- Simple to obtain large volumes of data: sequence randomly from cDNA libraries
- Problems:
  - Data sets can contain unprocessed transcripts (do not always confirm splicing)
  - Rarely cover the 5’ end of the gene
  - Generally “low-quality” sequences
High-throughput sequencing
- “Next-generation” sequencers capable of directly sequencing mRNA
- Likely to become more widely used in the future
Proteome data (peptide sequence data)

How good are gene models?
- Plasmodium falciparum (causative agent of malaria): genome sequenced in 2002; gene models have undergone considerable curation
- Recent article: cDNA study of P. falciparum
  - Suggests ~25% of gene models in P. falciparum are incorrect (85 genes out of 356 sampled)
  - Majority of errors are in splice junctions (intron-exon boundaries)
- What does this mean for other genomes...?
  - Likely that a high percentage of gene sequences are incorrect!
BMC Genomics Jul 27;8:255.

Proteome data for genome annotation
Motivation for using proteome data in genome annotation:
- Can confirm that transcripts are protein-coding
- Large volumes of proteome data are often collected for other purposes
- Certain types of proteome data can confirm the start codon of genes (difficult by other methods)
- Even where considerable EST / cDNA sequencing has been performed, proteins can be detected with no corresponding EST evidence

Proteogenomic study of Toxoplasma gondii
Proteome study of Toxoplasma gondii using three complementary techniques
- Parasite of clinical significance, related to Plasmodium
Study aims:
- Identify as many components of the proteome as possible
- Relate peptide sequence data back to the genome to confirm genes
- Relate protein expression data to transcriptional data (EST / microarray)

[Figure: proteomics workflow. Labels:
- 1D gel electrophoresis: cut bands, trypsin digestion
- 2D gel electrophoresis: cut gel spot, trypsin digestion
- Liquid chromatography: fractions
- Peptides
- Mass spectrometry
- Sequence database search (compare with theoretical spectra predicted for each peptide in the DB)]
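The trypsin digestion step and the "theoretical" side of the database search can be sketched in silico. This is a minimal sketch that assumes the commonly used cleavage rule (cut after K or R unless the next residue is P) and standard monoisotopic residue masses; the function names and the toy protein are hypothetical, and real search engines go further by predicting fragment-ion spectra for each peptide.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # mass of H2O added to the residue sum of a free peptide


def cleavage_sites(protein):
    """Positions where trypsin cuts: after K or R, unless followed by P."""
    sites = [0]
    for i, residue in enumerate(protein[:-1]):
        if residue in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))
    return sites


def tryptic_digest(protein, missed_cleavages=0, min_length=6):
    """Return tryptic peptides, allowing up to `missed_cleavages` skipped sites."""
    sites = cleavage_sites(protein)
    peptides = []
    for i in range(len(sites) - 1):
        last = min(i + 1 + missed_cleavages, len(sites) - 1)
        for j in range(i + 1, last + 1):
            peptide = protein[sites[i]:sites[j]]
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


toy_protein = "MKWVTFISLLRPAAKDEGR"  # hypothetical sequence, not from ToxoDB
for peptide in tryptic_digest(toy_protein, min_length=1):
    print(peptide, round(peptide_mass(peptide), 4))
```

Note that the R followed by P in the toy protein is not cleaved, which is why the middle peptide spans it.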

Database search strategy
- DNA sequence database: ToxoDB 60 MB genome sequence
- Amino acid sequence databases:
  - “Official” gene models
  - Alternative gene models predicted by gene finders
  - ORFs predicted in a 6-frame translation
- Concatenate databases
- Search all spectra
- Identify peptides and proteins
- Align peptide sequences back to the corresponding genomic region
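The "ORFs predicted in a 6-frame translation" database can be sketched as follows. This is a minimal sketch using the standard genetic code; it assumes ORFs are taken as stop-to-stop stretches (a common choice for proteogenomic searches, but not stated in the slides), and the function names and minimum-length parameter are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(dna, min_length=10):
    """Yield (strand, frame, protein) for stop-to-stop ORFs in all six frames."""
    dna = dna.upper()
    orfs = []
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(sequence[frame:])
            for segment in protein.split("*"):
                if len(segment) >= min_length:
                    orfs.append((strand, frame, segment))
    return orfs


toy_dna = "ATGGCTAAATAG"  # hypothetical fragment
for strand, frame, protein in six_frame_orfs(toy_dna, min_length=3):
    print(strand, frame, protein)
```

Each ORF would then be written out as a database entry alongside the official and alternative gene models; keeping the strand and frame with each entry is what makes it possible to align an identified peptide back to its genomic region afterwards.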

Five-exon gene; incomplete agreement between different gene models
- Peptide evidence for all 5 exons and for 2 out of 4 introns
- Note: peptides can only provide positive evidence; no peptides matched to the 5’ and 3’ termini of the gene model

- There appears to be an additional exon at the 5’ end
- None of the GLEAN, TwinScan or TigrScan algorithms appears to have made the correct prediction

ORF / part of TgGlimmerHMM sequence:
VVGGFSSNFLSFFSVIITSVKMSDAEDVTFETADAGASHTYPMQAGAIKKNGFVMLKGNPCKVVDYSTSKTGKHGHAKAHIVGLDIFTGKKYEDVCPTSHNMEVPNVKRSEFQLIDLSDDGFCTLLLENGETKDDLMLPKDSEGNLDEVATQVKNLFTDGKSVLVTVLQACGKEKIIASKEL
50.m5694 sequence:
MVEGVYSSFEAMIFSLPHACRTVTRTDLPSVKRFLTCVATSSKFPSESLGSIKSSFVSPFSRSSVQKPSSDKSINWNSDLFTFGTSML
- All peptides matched to gene models on the opposite strand

Study outcomes
- Protein evidence for approximately 1/3 of predicted genes (2250 proteins)
- Around 2500 splicing events confirmed
  - Peptides aligned across intron-exon boundaries
- Around 400 protein IDs appear to match alternative gene models
- Genome database (ToxoDB) hosts peptide sequences aligned against gene models
- Can we use informatics to improve this strategy...?
Xia et al. (2008) Genome Biology, 9(7), pp. R11

Challenges of proteogenomics
Main informatics challenge:
- A protein can usually only be identified if the gene sequence has been correctly predicted from the genome
- In effect, would like to use MS data directly for gene discovery
- But... searching a six-frame genome translation is problematic
All peptide and protein identifications are probabilistic
- False positive rate is proportional to search database size
On average only ~10-20% of spectra identify a peptide
- Need methods that can exploit the rest of the meaningful spectra
When gene models change, protein identifications are out of date
- No dynamic interaction between proteome and genome data
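Because identifications are probabilistic, a search against a large database such as a six-frame translation needs an explicit error estimate. The slides do not say how this was done; the sketch below assumes the widely used target-decoy approach, where spectra are also searched against reversed or shuffled "decoy" sequences and decoy hits stand in for false positives. The function names and scores are hypothetical.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoy hits / target hits among matches scoring >= threshold.

    `psms` is a list of (score, is_decoy) peptide-spectrum matches,
    where a higher score means a better match.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def q_values(psms):
    """Return (score, is_decoy, q) sorted by descending score.

    The q-value of a match is the lowest estimated FDR at which it
    would still be accepted.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # Sweep from the worst score upwards so that q-values never decrease
    # as the score gets worse.
    running_min = float("inf")
    qs = []
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qs.append(running_min)
    qs.reverse()
    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, qs)]


toy_psms = [(90, False), (85, False), (80, True),
            (75, False), (70, False), (60, True)]
print(fdr_at_threshold(toy_psms, 70))
for row in q_values(toy_psms):
    print(row)
```

With a larger search database, more random sequences score well by chance, so the score threshold needed to hold a fixed FDR rises and fewer true peptides pass it.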

Automated re-annotation pipeline
Planned improvements to the informatics workflow:
1. Re-querying pipeline: each time gene models change, all mass spectra are automatically re-queried
2. Integrate peptide evidence directly into gene finding software
3. Maximise the number of informative mass spectra
4. Attempt to optimise algorithms for de novo sequencing of peptides
5. N-terminal proteomics: could be used to confirm the gene initiation point

[Figure: three-stage re-annotation pipeline (Stage 1, Stage 2, Stage 3). Labels: spectra; multiple database search engines; official gene set; confirmed official model; alternative gene models; promote alternative model; modified de novo algorithms; genome sequence; novel ORF, splice junction; proteomic evidence; gene finder]
- Spectra are searched in series
- Peptide evidence confirming an official gene, an alternative model or a new ORF flows directly back to the modified gene finder
- The gene finder produces a new set of predictions
- Iteratively improve the number of spectra identified
- In each iteration, fewer spectra flow on to stages 2 and 3
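The "searched in series" idea can be sketched as a cascade in which only unidentified spectra flow on to the next stage. This is a minimal sketch: the stage names loosely follow the figure labels, while the `staged_search` function, the lookup-table "search engines" and the spectrum IDs are hypothetical stand-ins for real database searches.

```python
def staged_search(spectra, stages):
    """Run spectra through an ordered list of (stage_name, search_function).

    Each search function takes a spectrum and returns a peptide or None.
    A spectrum identified at one stage is not passed to later stages.
    Returns (identified, unmatched), where identified maps each spectrum
    to (stage_name, peptide).
    """
    remaining = list(spectra)
    identified = {}
    for stage_name, search in stages:
        still_unmatched = []
        for spectrum in remaining:
            peptide = search(spectrum)
            if peptide is None:
                still_unmatched.append(spectrum)
            else:
                identified[spectrum] = (stage_name, peptide)
        remaining = still_unmatched
    return identified, remaining


# Toy stand-ins: dictionaries keyed by spectrum ID play the role of searches.
official_hits = {"s1": "PEPTIDEK", "s2": "SAMPLER"}
alternative_hits = {"s3": "NOVELK"}
stages = [
    ("official gene set", official_hits.get),
    ("alternative gene models", alternative_hits.get),
    ("genome sequence / de novo", lambda spectrum: None),
]
identified, unmatched = staged_search(["s1", "s2", "s3", "s4"], stages)
print(identified)
print(unmatched)
```

In the iterative scheme described above, the evidence in `identified` would be fed back to the gene finder, the databases rebuilt from its new predictions, and the cascade run again.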

Combining evidence in gene finders
- Dynamically checking proposed gene models against peptide evidence
- Combining evidence from different gene finding algorithms
- In this case, no single algorithm appears to have the correct model

Query spectra using different search engines
Jones et al. Improving sensitivity in proteome studies by analysis of false discovery rates for multiple search engines. PROTEOMICS, in press (2008)
- Each search engine produces a different, non-standard score for the quality of a match
- Developed a search-engine-independent score, based on analysis of false discovery rate
- Identifications made by more search engines are scored more highly
- Can generate 35% more peptide identifications than the best single search engine
[Figure: peptides identified by Omssa, X!Tandem and Mascot are passed to a rescoring algorithm (FDR) to give a combined list of peptide identifications]
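The bookkeeping behind a combined list can be sketched as below. This is not the rescoring algorithm from the cited paper; it only assumes that each engine's peptide list has already been filtered at the same FDR, then counts how many engines support each peptide and how much the union gains over the best single engine. The function names and peptide sets are hypothetical.

```python
from collections import Counter


def engine_support(ids_by_engine):
    """Count how many search engines identified each peptide."""
    support = Counter()
    for peptides in ids_by_engine.values():
        support.update(set(peptides))
    return support


def gain_over_best_engine(ids_by_engine):
    """Percentage of extra peptides in the union relative to the best single engine."""
    union = set().union(*ids_by_engine.values())
    best = max(len(set(peptides)) for peptides in ids_by_engine.values())
    return 100.0 * (len(union) - best) / best


# Toy peptide identifications, assumed already filtered at a common FDR.
toy_ids = {
    "Mascot": {"AAK", "BBR", "CCK", "DDR"},
    "Omssa": {"BBR", "CCK", "EEK"},
    "X!Tandem": {"CCK", "DDR", "FFR"},
}
support = engine_support(toy_ids)
print(support.most_common())
print(gain_over_best_engine(toy_ids))
```

A peptide found by all three engines (here `CCK`) would be ranked above one found by a single engine, matching the principle that agreement between engines raises confidence.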

Conclusions
- Proteome data can confirm that gene models are correct
  - Currently the data are under-exploited
- Searching mass spec data directly against the genome for gene discovery remains challenging
- Build a re-querying pipeline
  - Iteratively improve gene models
  - Improve capabilities for using multiple search engines
  - Integrate peptide evidence directly into gene finders

Acknowledgments
- Data from the Wastling lab: Dong Xia, Sanya Sanderson, Jonathan Wastling
- ToxoDB at UPenn: David Roos, Brian Brunk