Fgenesh++ pipelines for automatic annotation of eukaryotic genomes. Victor Solovyev, Peter Kosarev. Royal Holloway College, University of London; Softberry Inc.

Steps of the FGENESH++ annotation pipeline
1. Map the RefSeq set of mRNAs with the EST_MAP program; sequences with mapped genes are excluded from further gene prediction.
2. Map NR proteins with the Prot_map program.
3. Run Fgenesh+ gene prediction on sequences that have a significant hit to protein sequences; sequences with predicted genes are excluded from further gene prediction.
4. Run FGENESH ab initio gene prediction in regions free from predictions made at the earlier stages.
5. Run FGENESH gene prediction in large introns of known and predicted genes.
A simple variant of the pipeline was used. For human, a lot of additional information can be used, for example ESTs.
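The staging above amounts to evidence-first prediction followed by ab initio prediction in whatever sequence is left. The following is a minimal sketch of that bookkeeping only, not Softberry's implementation: the function names `merge` and `free_regions`, the half-open 0-based interval convention, and the example coordinates are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), 0-based (assumed convention)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def free_regions(seq_len: int, occupied: List[Interval]) -> List[Interval]:
    """Return the parts of [0, seq_len) not covered by earlier-stage genes."""
    free: List[Interval] = []
    pos = 0
    for start, end in merge(occupied):
        if start > pos:
            free.append((pos, start))
        pos = max(pos, end)
    if pos < seq_len:
        free.append((pos, seq_len))
    return free


# Hypothetical gene spans from the mRNA-mapping and protein-supported stages.
mrna_genes = [(100, 200)]
protein_genes = [(150, 300), (500, 600)]
ab_initio_targets = free_regions(1000, mrna_genes + protein_genes)
print(ab_initio_targets)
```

Under these assumptions the ab initio stage would be run only on the returned intervals; the large-intron stage could reuse the same helper on the introns of each accepted gene.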

Components of the Fgenesh++ automatic pipeline
FGENESH: ab initio gene prediction. Runs on whole chromosomes (~300 MB). Fast: the human genome (3 GB of sequence) is processed in about 4 hours.
EST_MAP: a program for fast mapping of a set of mRNAs/ESTs to a chromosome sequence. EST_MAP takes splice site weight matrices into account for accurate mapping, and maps small exon sequences more accurately than BLAT.
FGENESH+: a derivative of FGENESH that uses information on homologous proteins to improve gene prediction, if a homolog can be found.
PROT_MAP: maps a database of protein sequences to a genome, taking splice sites into account.

Example of Prot_map: mapping of a protein sequence to a genome
First sequence Chr19 [cut: ]
[DD] Sequence: 1( 1), S: , L:1739 IPI:IPI |SWISS-PROT:Q8TEK3-1
Summ of block lengths: 1468, Alignment bounds:
On first sequence: start , end , length
On second sequence: start 263, end 1739, length 1477
Blocks of alignment: 19
1 E: [ca GT] P: L: 23, G: , W: 1160, S:
E: [AG GT] P: L: 35, G: , W: 1810, S:
E: [AG GT] P: L: 14, G: , W: 720, S:
E: [AG GT] P: L: 37, G: , W: 1880, S:
E: [AG GT] P: L: 78, G: , W: 3930, S:
E: [AG GT] P: L: 37, G: , W: 2000, S:
E: [AG GT] P: L: 30, G: , W: 1510, S:
E: [AG GT] P: L: 34, G: , W: 1690, S:
E: [AG GT] P: L: 46, G: , W: 2240, S:
E: [AG GT] P: L: 42, G: , W: 2110, S:
E: [AG GT] P: L: 161, G: , W: 8290, S:
E: [AG GT] P: L: 45, G: , W: 2340, S:
E: [AG GT] P: L: 49, G: , W: 2360, S:
E: [AG GT] P: L: 38, G: , W: 1900, S:
E: [AG GT] P: L: 194, G: , W: 9740, S:
E: [AG GC] P: L: 68, G: , W: 3530, S:
E: [AG GT] P: L: 21, G: , W: 1010, S:
…………………………………………………………………

Prot_map example of alignment
gatcacagaggctgg(..)agtgtctgtgtttca?[GGRIVSSKPFAPLNFRINSRNLSg
(..)evdhqlkerfanmke GGRIVSSKPFAPLNFRINSRNLS
]gtaagaaactctcat(..)ctgtggctcctgcag[acIGTIMRVVELSPLKGSVSWTGK
(..) dIGTIMRVVELSPLKGSVSWTGK
PVSYYLHTIDRTI]gtgagtatctcgctg(..)ctttcttctttttag[LENYFSSLKNP
PVSYYLHTIDRTI (..) LENYFSSLKNP
KLR]gtaagtttgtgtgtt(..)ctgctctccttccag[EEQEAARRRQQRESKSNAATP
KLR (..) EEQEAARRRQQRESKSNAATP
TKGPEGKVAGPADAPM]gtaaggccccagcct(..)ccttgtgtcctccag[DSGAEEEK
TKGPEGKVAGPADAPM (..) DSGAEEEK

Using one processor, Prot_map aligns the human protein set to chromosome 19 (~59 MB) in 90 min (best hit for each protein) or 148 min (all significant hits for each protein).

Predicted genes in different classes (% of protein-coding bases)

Predictions        44 sequences   31 sequences   13 sequences
mRNA supported     35.14%         34.34%         36.72%
prot. supported    51.84%         51.35%         52.82%
ab initio          13.29%         14.41%         11.07%

mRNA-supported genes might have alternative splice forms that overlap.

Predicted gene numbers

                   44 seq       31 seq      13 seq
mRNA supported     177 (313)    118 (209)   59 (104)
prot. supported
ab initio
Total              678 (814)    468 (559)   210 (255)
Havana             435 (1061)   297 (716)   138 (345)

CDS prediction accuracy on nucleotide level
44 sequences   31 sequences   13 sequences
All genes, nucleotide level, CDS, 1-base shift fixed:
sn+ =   sn+ =   sn+ =
sp+ =   sp+ =   sp+ =
sn- =   sn- =   sn- =
sp- =   sp- =   sp- =
sn =   sn =   sn =
sp =   sp =   sp =
All genes, nucleotide level, CDS, WITHOUT fix:
sn =   sn =   sn =
sp =   sp =   sp =
There was a bug in the initial posting: exons of mRNA-supported genes on the negative strand were shifted by 1 bp.
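The sn and sp figures on these slides are nucleotide-level sensitivity and specificity. A minimal sketch of how such values can be computed is given here, assuming the definitions conventional in gene-finder evaluation (sn = TP / annotated coding bases, sp = TP / predicted coding bases); the slides do not state the formulas, and the function names and half-open 0-based intervals are illustrative choices, not part of the evaluation software.

```python
from typing import Iterable, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end), 0-based (assumed convention)


def covered(intervals: Iterable[Interval]) -> Set[int]:
    """Set of base positions covered by a list of CDS intervals."""
    bases: Set[int] = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases


def nucleotide_sn_sp(annotated: Iterable[Interval],
                     predicted: Iterable[Interval]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the nucleotide level.

    sn = TP / annotated coding bases, sp = TP / predicted coding bases
    (assumed conventional definitions).
    """
    annot = covered(annotated)
    pred = covered(predicted)
    true_pos = len(annot & pred)
    sn = true_pos / len(annot) if annot else 0.0
    sp = true_pos / len(pred) if pred else 0.0
    return sn, sp


# Toy example: the prediction covers half of the annotated CDS and nothing else.
print(nucleotide_sn_sp([(0, 100)], [(0, 50)]))
```

The set-based approach is simple rather than fast; for whole chromosomes a merged-interval sweep would be the more practical choice. Strand-specific values such as sn+ and sn- would be obtained by running the same function on the intervals of each strand separately.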

Prediction accuracy on nucleotide level
44 sequences   31 sequences   13 sequences
CDS:
sn =   sn =   sn =
sp =   sp =   sp =
Coding + noncoding EXONS:
sn =   sn =   sn =
sp =   sp =   sp =
HAVANA annotations contain many more untranslated and partially translated exons than our predictions do. We have such exons only for mRNA-mapped genes (~35% of cases). Such exons need to be added to the annotations in future, using ESTs and provisional mRNAs.

Nucleotide specificity depending on prediction class
44 sequences   31 sequences   13 sequences
CDS:
sn =   sn =   sn =
sp =   sp =   sp =
mRNA supported genes vs. "44regions_coding.gff" (35%):
sp =   sp =   sp =
protein supported genes vs. "44regions_coding.gff" (53%):
sp =   sp =   sp =
ab initio genes vs. "44regions_coding.gff" (13% of all CDS):
sp =   sp =   sp =
Some of the ab initio predictions may be NEW genes (?); also, ~10% of them overlap with predicted pseudogenes.

Accuracy of exact CDS prediction
44 sequences   31 sequences   13 sequences
CDS OVERLAP
sn =   sn =   sn =
sp =   sp =   sp =
CDS 1EDGE
sn =   sn =   sn =
sp =   sp =   sp =
CDS EXACT
sn =   sn =   sn =
sp =   sp =   sp = 66.95
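The slide does not define the OVERLAP, 1EDGE and EXACT categories. The sketch below shows one plausible reading, labeled here as an assumption: a predicted CDS exon counts as OVERLAP if it shares at least one base with an annotated exon, as 1EDGE if it additionally shares one boundary, and as EXACT if both boundaries coincide. The function name `match_class` is hypothetical, and it returns only the strictest class that applies.

```python
from typing import Tuple

Interval = Tuple[int, int]  # half-open [start, end), 0-based (assumed convention)


def match_class(predicted: Interval, annotated: Interval) -> str:
    """Strictest match class between one predicted and one annotated CDS exon.

    Assumed definitions (not given on the slide):
      exact   - both boundaries identical
      1edge   - exons overlap and exactly one boundary is identical
      overlap - exons share at least one base, no boundary identical
      none    - no shared bases
    """
    p_start, p_end = predicted
    a_start, a_end = annotated
    if p_end <= a_start or a_end <= p_start:
        return "none"
    if (p_start, p_end) == (a_start, a_end):
        return "exact"
    if p_start == a_start or p_end == a_end:
        return "1edge"
    return "overlap"


print(match_class((10, 20), (10, 25)))
```

If the categories are cumulative, as the slide layout suggests, an exact match would also be counted under 1EDGE and OVERLAP when the per-class sn and sp values are tallied.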

Canonical and non-canonical splice sites
GT-AG: 99.24%
GC-AG: 0.69%
AT-AC: 0.05%
other sites: 0.02%
Source: SpliceDB (Burset, Seledtsov, Solovyev, NAR 1999, 2000)
Gene prediction is usually done with only the standard splice sites.
What we have not done: Fgenesh/Fgenesh+ have an option to account for the GC donor site. At least for Prot_map + Fgenesh+ predictions, we need to include GC splice sites.
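The splice-site classes in this table are defined by the first two and last two bases of each intron. A minimal sketch of how a set of intron sequences could be tallied into those classes follows; the function names and the toy introns are illustrative assumptions and are unrelated to how SpliceDB itself was built.

```python
from collections import Counter
from typing import Dict, Iterable

KNOWN_CLASSES = {"GT-AG", "GC-AG", "AT-AC"}


def splice_class(intron: str) -> str:
    """Classify an intron (given 5'->3' on the coding strand) by its
    donor and acceptor dinucleotides."""
    intron = intron.upper()
    if len(intron) < 4:
        raise ValueError("intron too short to have both splice-site dinucleotides")
    pair = f"{intron[:2]}-{intron[-2:]}"
    return pair if pair in KNOWN_CLASSES else "other"


def splice_class_fractions(introns: Iterable[str]) -> Dict[str, float]:
    """Fraction of introns falling into each splice-site class."""
    counts = Counter(splice_class(seq) for seq in introns)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Toy introns, made up for illustration.
print(splice_class_fractions(["GTAAAG", "GTCCAG", "GCTTAG", "ATGGAC"]))
```

A predictor restricted to standard sites would accept only the GT-AG class; allowing the GC donor, as the slide proposes, corresponds to also accepting GC-AG.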

How we can improve the power of the Fgenesh++ annotation pipeline:
Use ESTs and provisional mRNA
    Fgenesh_c predicts genes using a genomic sequence and an EST sequence
    Add EST-based noncoding exons/parts of exons
Use synteny
    We have a pipeline that generates syntenic regions between genomes based on the coding-exon annotation produced by Fgenesh++
    Fgenesh2 predicts genes using 2 syntenic genomic sequences
Mark or remove pseudogenes from the predictions (especially check ab initio predictions)
Include promoter prediction in Fgenesh (developed)
Then include prediction of non-coding exons
Time and testing are needed to determine to what extent these approaches improve the pipeline.

To ENCODE:
Keep and improve the annotations of the 44 ENCODE regions so that they can serve as a test bed for adding new blocks to annotation pipelines.
It would be good to have GTF annotations of the 44 regions with sequences extended to include complete genes at both ends.
The check of downloaded predictions should flag UNUSUAL CDS that lack GT/AG ends or an ATG-GT or AG-STOP structure, to avoid bugs in data posted for evaluation.
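The proposed sanity check can be sketched as a small validator that flags CDS structures lacking an ATG start, a stop codon, or GT/AG intron ends. This is a minimal illustration under stated assumptions: plus-strand genes only, half-open 0-based CDS segments, complete genes, and a hypothetical function name `unusual_cds_flags`. A real check would also need to handle the minus strand, partial genes, and deliberately allowed GC donors.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), 0-based (assumed convention)
STOP_CODONS = {"TAA", "TAG", "TGA"}


def unusual_cds_flags(genome: str, cds: List[Interval]) -> List[str]:
    """Return warnings for a plus-strand CDS given as ordered exon segments."""
    genome = genome.upper()
    flags: List[str] = []

    first_start = cds[0][0]
    last_end = cds[-1][1]
    if genome[first_start:first_start + 3] != "ATG":
        flags.append("no ATG start")
    if genome[last_end - 3:last_end] not in STOP_CODONS:
        flags.append("no stop codon")

    # Each intron lies between the end of one segment and the start of the next.
    for (_, donor), (acceptor, _) in zip(cds, cds[1:]):
        if genome[donor:donor + 2] != "GT":
            flags.append(f"non-GT donor at {donor}")
        if genome[acceptor - 2:acceptor] != "AG":
            flags.append(f"non-AG acceptor at {acceptor}")
    return flags


# Toy sequence: flank, exon (ATGAAA), intron (GT....AG), exon ending in TAA, flank.
toy_genome = "CC" + "ATGAAA" + "GTCCCCAG" + "CCCTAA" + "GG"
print(unusual_cds_flags(toy_genome, [(2, 8), (16, 22)]))
```

An empty list means the structure looks standard. Run before posting, a check of this kind would likely have exposed the 1 bp shift on the negative strand mentioned earlier, because shifted exons would no longer sit next to GT/AG dinucleotides.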