Fgenesh++ pipelines for automatic annotation of eukaryotic genomes
Victor Solovyev, Peter Kosarev
Royal Holloway College, University of London; Softberry Inc.
Steps of the FGENESH++ annotation pipeline
1. Map the RefSeq set of mRNAs with the EST_MAP program. Sequences with mapped genes are excluded from the further gene prediction process.
2. Map NR proteins with the Prot_map program.
3. Run Fgenesh+ gene prediction on sequences that have a significant hit to a protein sequence. Sequences with predicted genes are excluded from the further gene prediction process.
4. Run FGENESH ab initio gene prediction in regions free from predictions made at the earlier stages.
5. Run FGENESH gene prediction in large introns of known and predicted genes.
A simple variant of the pipeline was used. For human, a lot of additional information can be used, ESTs for example.
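The exclusion logic in the steps above (later stages run only where earlier stages made no prediction) can be sketched as an interval complement. This is only an illustration: the function name `free_regions`, the 1-based inclusive coordinates, and the example spans are assumptions, not part of the Fgenesh++ software.

```python
def free_regions(seq_length, predicted):
    """Return 1-based inclusive intervals not covered by any predicted gene."""
    free = []
    cursor = 1  # first position not yet known to be covered
    for start, end in sorted(predicted):
        if start > cursor:
            free.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= seq_length:
        free.append((cursor, seq_length))
    return free


# Hypothetical gene spans from the evidence-based stages on a 1000 bp sequence.
evidence_based = [(100, 250), (200, 400), (700, 800)]
print(free_regions(1000, evidence_based))
# [(1, 99), (401, 699), (801, 1000)]
```

The returned intervals are the regions an ab initio run would still need to cover in this simplified picture.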
Components of the Fgenesh++ automatic pipeline
FGENESH: ab initio gene prediction. Runs on whole chromosomes (~300 MB). Fast: the human genome (3 GB of sequence) is processed in ~4 hours.
EST_MAP: a program for fast mapping of a set of mRNAs/ESTs to a chromosome sequence. EST_MAP takes splice site weight matrices into account for accurate mapping. It maps small exon sequences more accurately than BLAT.
FGENESH+: this derivative of FGENESH uses information on homologous proteins to improve gene prediction, if a homolog can be found.
PROT_MAP: maps a database of protein sequences to a genome, taking splice sites into account.
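To illustrate what scoring with a splice site weight matrix means, here is a minimal log-odds sketch for a donor site. The matrix values, the 4-position window (last exon base plus the first three intron bases), and the uniform background are hypothetical and are not the matrices used by EST_MAP or PROT_MAP.

```python
import math

# Hypothetical base frequencies per window position; NOT taken from EST_MAP.
DONOR_FREQS = [
    {"A": 0.10, "C": 0.05, "G": 0.80, "T": 0.05},  # last exon base
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},  # intron +1
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},  # intron +2
    {"A": 0.60, "C": 0.05, "G": 0.30, "T": 0.05},  # intron +3
]
BACKGROUND = 0.25  # assumed uniform background frequency


def weight_matrix_score(window, freqs=DONOR_FREQS):
    """Sum of log2 odds (site frequency / background) over the window.

    Assumes the window contains only A, C, G, T.
    """
    window = window.upper()
    if len(window) != len(freqs):
        raise ValueError("window length must match the matrix length")
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(freqs, window))


print(weight_matrix_score("GGTA") > 0)  # True: resembles the toy donor model
print(weight_matrix_score("CCAT") < 0)  # True: unlike the toy donor model
```

A mapper can use such scores to prefer exon boundaries that fall on plausible splice sites when several placements align equally well.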
Example of Prot_map: mapping of a protein sequence to genome
First sequence Chr19 [cut: ]
[DD] Sequence: 1( 1), S: , L:1739 IPI:IPI |SWISS-PROT:Q8TEK3-1
Summ of block lengths: 1468, Alignment bounds:
On first sequence: start , end , length
On second sequence: start 263, end 1739, length 1477
Blocks of alignment: 19
1 E: [ca GT] P: L: 23, G: , W: 1160, S:
E: [AG GT] P: L: 35, G: , W: 1810, S:
E: [AG GT] P: L: 14, G: , W: 720, S:
E: [AG GT] P: L: 37, G: , W: 1880, S:
E: [AG GT] P: L: 78, G: , W: 3930, S:
E: [AG GT] P: L: 37, G: , W: 2000, S:
E: [AG GT] P: L: 30, G: , W: 1510, S:
E: [AG GT] P: L: 34, G: , W: 1690, S:
E: [AG GT] P: L: 46, G: , W: 2240, S:
E: [AG GT] P: L: 42, G: , W: 2110, S:
E: [AG GT] P: L: 161, G: , W: 8290, S:
E: [AG GT] P: L: 45, G: , W: 2340, S:
E: [AG GT] P: L: 49, G: , W: 2360, S:
E: [AG GT] P: L: 38, G: , W: 1900, S:
E: [AG GT] P: L: 194, G: , W: 9740, S:
E: [AG GC] P: L: 68, G: , W: 3530, S:
E: [AG GT] P: L: 21, G: , W: 1010, S:
…
Prot_map: example of alignment
gatcacagaggctgg(..)agtgtctgtgtttca?[GGRIVSSKPFAPLNFRINSRNLSg
               (..)evdhqlkerfanmke  GGRIVSSKPFAPLNFRINSRNLS
]gtaagaaactctcat(..)ctgtggctcctgcag[acIGTIMRVVELSPLKGSVSWTGK
                (..)                 dIGTIMRVVELSPLKGSVSWTGK
PVSYYLHTIDRTI]gtgagtatctcgctg(..)ctttcttctttttag[LENYFSSLKNP
PVSYYLHTIDRTI                (..)                LENYFSSLKNP
KLR]gtaagtttgtgtgtt(..)ctgctctccttccag[EEQEAARRRQQRESKSNAATP
KLR                (..)                EEQEAARRRQQRESKSNAATP
TKGPEGKVAGPADAPM]gtaaggccccagcct(..)ccttgtgtcctccag[DSGAEEEK
TKGPEGKVAGPADAPM                (..)                DSGAEEEK
Running on 1 processor, Prot_map aligns the human protein set to chromosome 19 (~59 MB) in 90 min (best hit for each protein) or 148 min (all significant hits for each protein).
Predicted genes in different classes (% of protein-coding bases)
Predictions        44 sequences   31 sequences   13 sequences
mRNA supported     35.14%         34.34%         36.72%
prot. supported    51.84%         51.35%         52.82%
ab initio          13.29%         14.41%         11.07%
mRNA-supported genes may have overlapping alternative splice forms.
Predicted gene numbers
                   44 seq        31 seq       13 seq
mRNA supported     177 (313)     118 (209)    59 (104)
prot. supported
ab initio
Total              678 (814)     468 (559)    210 (255)
Havana             435 (1061)    297 (716)    138 (345)
CDS prediction accuracy on nucleotide level
                   44 sequences   31 sequences   13 sequences
All genes, nucleotide level, CDS, 1-base shift fixed:
                   sn+ =          sn+ =          sn+ =
                   sp+ =          sp+ =          sp+ =
                   sn- =          sn- =          sn- =
                   sp- =          sp- =          sp- =
                   sn =           sn =           sn =
                   sp =           sp =           sp =
All genes, nucleotide level, CDS, WITHOUT fix:
                   sn =           sn =           sn =
                   sp =           sp =           sp =
The initial posting had a bug: exons of mRNA-supported genes on the negative strand were shifted by 1 bp.
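As a reminder of how nucleotide-level accuracy is computed, the sketch below assumes the usual gene-prediction convention: sn is the fraction of annotated coding bases that are predicted, and sp is the fraction of predicted coding bases that are annotated. The function names and coordinates are illustrative only. The example also shows how a 1 bp shift of otherwise perfect exons lowers both measures.

```python
def covered_bases(intervals):
    """Expand 1-based inclusive intervals into a set of positions."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def nucleotide_sn_sp(annotated, predicted):
    """Return (sn, sp) at the nucleotide level for two sets of intervals."""
    ann = covered_bases(annotated)
    pred = covered_bases(predicted)
    true_pos = len(ann & pred)
    sn = true_pos / len(ann) if ann else 0.0
    sp = true_pos / len(pred) if pred else 0.0
    return sn, sp


# Hypothetical annotated exons, and the same exons shifted by 1 bp.
annotated = [(101, 200), (301, 400)]
shifted = [(start + 1, end + 1) for start, end in annotated]
print(nucleotide_sn_sp(annotated, annotated))
print(nucleotide_sn_sp(annotated, shifted))
```

Expanding intervals into sets is fine for small test regions; for whole chromosomes an interval sweep would be the more economical choice.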
Prediction accuracy on nucleotide level
                   44 sequences   31 sequences   13 sequences
CDS:
                   sn =           sn =           sn =
                   sp =           sp =           sp =
Coding + noncoding EXONS:
                   sn =           sn =           sn =
                   sp =           sp =           sp =
HAVANA annotations contain many more untranslated and partially translated exons than our predictions do.
We have such exons only for mRNA-mapped genes (~35% of cases).
In future, such exons need to be added to the annotations using ESTs and provisional mRNAs.
Nucleotide specificity depending on prediction class
                   44 sequences   31 sequences   13 sequences
CDS:
                   sn =           sn =           sn =
                   sp =           sp =           sp =
mRNA-supported genes vs. "44regions_coding.gff" (35%):
                   sp =           sp =           sp =
Protein-supported genes vs. "44regions_coding.gff" (53%):
                   sp =           sp =           sp =
Ab initio genes vs. "44regions_coding.gff" (13% of all CDS):
                   sp =           sp =           sp =
Some of the ab initio predictions may be NEW genes (?); also, ~10% of them overlap predicted pseudogenes.
Accuracy of exact CDS prediction
                   44 sequences   31 sequences   13 sequences
CDS OVERLAP
                   sn =           sn =           sn =
                   sp =           sp =           sp =
CDS 1EDGE
                   sn =           sn =           sn =
                   sp =           sp =           sp =
CDS EXACT
                   sn =           sn =           sn =
                   sp =           sp =           sp = 66.95
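The three exon-level categories can be illustrated with a small classifier. The definitions used here are assumptions made for the sketch: OVERLAP means the predicted exon shares at least one base with an annotated exon, 1EDGE means it also shares one boundary, and EXACT means both boundaries match. The evaluation behind the slide may define them differently.

```python
RANK = {"NONE": 0, "OVERLAP": 1, "1EDGE": 2, "EXACT": 3}


def classify_exon(pred, annotated):
    """Return the strictest match class of one predicted exon.

    Exons are (start, end) tuples with 1-based inclusive coordinates.
    """
    best = "NONE"
    p_start, p_end = pred
    for a_start, a_end in annotated:
        if p_end < a_start or a_end < p_start:
            continue  # no shared base with this annotated exon
        shared_edges = int(p_start == a_start) + int(p_end == a_end)
        label = ("OVERLAP", "1EDGE", "EXACT")[shared_edges]
        if RANK[label] > RANK[best]:
            best = label
    return best


# Hypothetical annotated exons and predictions.
annotated = [(100, 200), (500, 650)]
for pred in [(100, 200), (100, 210), (90, 210), (300, 400)]:
    print(pred, classify_exon(pred, annotated))
```

Under these definitions every EXACT exon also counts toward 1EDGE and OVERLAP, so the reported values should not decrease from EXACT to OVERLAP.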
Canonical and non-canonical splice sites
GT-AG: 99.24%
GC-AG: 0.69%
AT-AC: 0.05%
other sites: 0.02%
Source: SpliceDB (Burset, Seledtsov, Solovyev, NAR 1999, 2000)
Gene prediction is usually done with only standard splice sites.
What we have not done: Fgenesh/Fgenesh+ have an option to account for the GC donor site.
At least for Prot_map + Fgenesh+ predictions, we need to include GC splice sites.
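A tally like the one above can be produced by classifying each intron by its terminal dinucleotides. The sketch below uses made-up intron sequences, and the function name `splice_class` is hypothetical.

```python
from collections import Counter

KNOWN_CLASSES = {"GT-AG", "GC-AG", "AT-AC"}


def splice_class(intron):
    """Classify an intron by its first two and last two bases."""
    intron = intron.upper()
    if len(intron) < 4:
        raise ValueError("intron too short to have both splice sites")
    pair = f"{intron[:2]}-{intron[-2:]}"
    return pair if pair in KNOWN_CLASSES else "other"


# Hypothetical intron sequences, one per class.
introns = ["GTAAGTTTTCAG", "GCAAGTTTTTAG", "ATATCCTTTTAC", "GGAAGTTTTTAG"]
counts = Counter(splice_class(seq) for seq in introns)
for name, n in counts.items():
    print(f"{name}: {100 * n / len(introns):.2f}%")
```

Run over predicted gene models, the same tally shows at a glance whether GC-AG introns are being predicted at all.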
How we can improve the power of the Fgenesh++ annotation pipeline:
Use ESTs and provisional mRNAs
  Fgenesh_c predicts genes using a genomic sequence and an EST sequence
  Add EST-based noncoding exons/parts of exons
Use synteny
  We have a pipeline that generates syntenic regions between genomes based on the coding exon annotation produced by Fgenesh++
  Fgenesh2 predicts genes using 2 syntenic genomic sequences
Mark or remove pseudogenes from the predictions (especially check ab initio predictions)
Include promoter prediction in Fgenesh (developed)
Then include prediction of non-coding exons
Time and testing are needed to determine to what extent these approaches improve the results
To ENCODE:
Keep and improve the annotations of the 44 ENCODE regions so that they can serve as a test bed for adding new blocks to annotation pipelines.
It would be good to have GTF annotations of the 44 regions with sequences extended to include complete genes at both ends.
When checking downloaded predictions, flag UNUSUAL CDS (those without GT/AG ends or without an ATG-GT or AG-STOP structure) to avoid bugs in data posted for evaluation.
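A check for unusual CDS structures could look like the sketch below. It makes several simplifying assumptions that are not from the slides: the gene is on the forward strand, exons are 1-based inclusive coordinates, the CDS is complete, and the stop codon is counted as part of the last CDS exon. The function name `unusual_cds` and the toy sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def unusual_cds(seq, exons):
    """Return warnings for a forward-strand CDS given as 1-based inclusive exons."""
    seq = seq.upper()
    exons = sorted(exons)
    warnings = []
    first_start, last_end = exons[0][0], exons[-1][1]
    if seq[first_start - 1:first_start + 2] != "ATG":
        warnings.append("CDS does not start with ATG")
    if seq[last_end - 3:last_end] not in STOP_CODONS:
        warnings.append("CDS does not end with a stop codon")
    for (_, end), (start, _) in zip(exons, exons[1:]):
        donor = seq[end:end + 2]               # first two intron bases
        acceptor = seq[start - 3:start - 1]    # last two intron bases
        if donor != "GT":
            warnings.append(f"intron after {end} starts with {donor}, not GT")
        if acceptor != "AG":
            warnings.append(f"intron before {start} ends with {acceptor}, not AG")
    return warnings


# Toy gene: exon 1-6, intron 7-17, exon 18-23.
good = "ATGGCC" + "GTAAGTTTCAG" + "GGCTAA"
gc_donor = "ATGGCC" + "GCAAGTTTCAG" + "GGCTAA"
print(unusual_cds(good, [(1, 6), (18, 23)]))
print(unusual_cds(gc_donor, [(1, 6), (18, 23)]))
```

A flagged GC donor is not necessarily an error, since GC-AG introns do occur, so a check like this is better used to signal records for review than to reject them outright.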