Evaluating genes and transcripts in Ensembl Sep 2006
Outline Ensembl gene set Pseudogenes Ensembl EST genes Ab initio predictions Manual curation (Vega, CCDS) Gene models from other groups
Overview other groups’ models evidence Ensembl predictions manual curation
Annotation process
Automated analysis
  Repeat masking: RepeatMasker (Smit), tandem, inverted
  Gene prediction: Genscan (Burge), FGENESH (Solovyev)…
  Database searches:
    initial protein and DNA matches using BLAST
    refined protein matches using GeneWise
    refined EST matches using EST2GENOME, spangle
    Pfam annotation using GeneWise
[Flowchart labels: Human Proteins, Other Proteins, Human cDNAs, Human ESTs; GeneWise, Exonerate; GeneWise genes, Aligned cDNAs, Aligned ESTs; GeneWise genes with UTRs; ClusterMerge; Supported ab initio (optional); Genebuilder; Preliminary gene set; cDNA genes; Gene Combiner; Core Ensembl genes; Pseudogenes; Final set + pseudogenes; Ensembl EST genes]

The gene build process involves many stages. We first align species-specific proteins, then other proteins, and use them to build GeneWise models. The GeneWise models are compared with aligned cDNAs in order to give the transcripts UTRs. This set is then collapsed into a non-redundant set of transcripts. At this point ab initio predictions can be added to the gene set if they have homology evidence supporting them. The preliminary gene set is then compared with genes predicted from the cDNA alignments, and these cDNA-based genes can be used to add alternative transcripts and extra novel genes. This produces the core Ensembl gene set, which is then analysed to find potential pseudogenes so that they can be labelled differently from the core Ensembl genes.
Gene Builds Are Protein-based Simple DNA - DNA alignments do NOT lead to translatable genes. Essential to align at the protein level allowing for frameshifts and splice sites GeneWise* Protein - Genome alignments Splice site model Penalises stop codons Models frameshifts *E. Birney et al. Genome Research 14:988-995 (2004)
The Ensembl Gene Build Align species-specific proteins Align similar proteins from closely related species Use mRNA information to add UTRs Build transcripts using mRNA evidence Build additional transcripts using ab initio predictors and homology evidence Combine annotations to make genes with alternative transcripts
The trouble with BLAST
[Figure labels: Real gene (AG / GT splice sites); Ideal BLAST; Reality]
BLAST is good for finding possible exon positions in large genomic sequences.
BLAST ‘replacements’
Exonerate* (Guy Slater)
  Fast gapped DNA-DNA matcher
  10,000 x faster than BLAST
Pmatch (Richard Durbin)
  Fast exact protein-DNA matcher
  >10,000 x faster than BLAST
*BMC Bioinformatics 6: 31 (2005)
Exonerate
Adding UTRs
[Figure labels: protein - GeneWise (phases, no UTRs); cDNA - exonerate (UTRs, no phases); Combined prediction; GeneWise prediction]

We now use the information in cDNA-based transcripts to add UTRs to the protein-based GeneWise predictions obtained from the Targeted and Similarity stages. These are used to generate a consensus transcript structure that consists of a 5' UTR, a coding region and a 3' UTR.

The protein data sets contain both full-length sequences and some fragmented sequences that overlap with them. To avoid adding UTRs to a fragment instead of a full-length sequence (which could lead to an incorrect prediction having a short ORF with an extremely long UTR), we first sort the GeneWise transcript predictions by length, considering both genomic extent and total exon length. We then pair each of the GeneWise predictions with a cDNA prediction, allowing each cDNA to be matched with a single, long GeneWise prediction. An individual GeneWise prediction may at this stage be paired with more than one cDNA.

We compare the 5' GeneWise exon to each of the exons in a cDNA transcript and call a match if:
i. the end of the 5' GeneWise exon exactly coincides with the end of one of the cDNA exons; and either
ii. the cDNA exon starts upstream of the GeneWise exon; or
iii. the cDNA exon starts downstream of the GeneWise exon, and the matching cDNA exon is not the first in the prediction, i.e. there are potential spliced UTR exons.
A similar procedure compares the 3' terminal GeneWise exon with the cDNA, considering exon start coordinates. We do not require a cDNA prediction to extend both 5' and 3' UTRs. Single-exon GeneWise predictions match any cDNA which entirely encloses them within one of its exons.

Various examples of matching cDNA and GeneWise structures are shown in Figure 3. The best matching cDNA is chosen for each GeneWise prediction based on exon overlap and, if necessary, the extent of shared genomic overlap between the two. The cDNA and GeneWise transcripts are now combined, giving preference to the GeneWise-predicted ORF/translation coordinates: internal exons are taken from the GeneWise prediction. The exceptions to this rule are cases where the cDNA exon coordinates did not precisely match the 5'/3' terminal GeneWise exons. This can occur because GeneWise sometimes fails to align a very short terminal coding region to the right exon. The translation must then be recalculated to take into account the corrected splice sites, and this is achieved using genomewise (Birney et al. 2003).

The combined predictions are stored in a clean database. Additionally, if no matching cDNA is found for a GeneWise prediction, we re-store the unmodified GeneWise prediction. This is important in preventing loss of supporting evidence, as at this stage we store only one transcript per predicted gene. The final GeneBuilder combines all transcripts belonging to the same gene and will transfer supporting evidence from any partial transcripts that may have been subsumed by full-length transcripts.

Rules for adding UTRs to GeneWise predictions:
-(top left) Simplest case: Ends of exons coincide, thus the 1st red exon is extended to include the UTR and the translation start is maintained. Starts of last exons coincide, thus UTR exons are added and the translation stop is maintained. The coordinates of the GeneWise-derived middle exon are used in preference to exonerate’s exon.
-(middle) cDNA prediction rejected: Neither the ends of first exons nor the starts of last exons coincide, so the GeneWise-predicted structure is unmodified.
-(bottom) cDNA prediction with short exons: The ends of the first GeneWise exon and second exonerate exon coincide, as do the starts of the last exons. Even though GeneWise’s exon is shorter than exonerate’s, the matching exon is not the first exon of the cDNA prediction and is thus retained. However, the exonerate exon is shorter than GeneWise’s and there are no additional exons, so it is rejected.
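The 5' matching rules (i to iii) can be sketched roughly as follows. This is only an illustration in Python, not the pipeline's own code: the `Exon` class, the function name and the forward-strand coordinates are assumptions made for the example. The notes do not say what happens when the two exons start at the same position; the sketch treats that case like rule iii (a match only if upstream cDNA exons exist).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Exon:
    # Hypothetical minimal exon: forward-strand coordinates, start <= end.
    start: int
    end: int


def five_prime_match(genewise_first: Exon, cdna_exons: List[Exon]) -> Optional[int]:
    """Return the index of the cDNA exon matching the 5' GeneWise exon, or None.

    cdna_exons must be ordered 5' to 3'.
    """
    for i, exon in enumerate(cdna_exons):
        # Rule i: the exon ends must coincide exactly.
        if exon.end != genewise_first.end:
            continue
        # Rule ii: the cDNA exon starts upstream, so it extends the 5' end.
        if exon.start < genewise_first.start:
            return i
        # Rule iii: the cDNA exon starts downstream (assumption: or at the
        # same position) but is not the first cDNA exon, so spliced UTR
        # exons may exist upstream.
        if i > 0:
            return i
    return None
```

For a GeneWise first exon at 100-200, a cDNA whose first exon is 50-200 matches by rule ii, while a cDNA whose first exon is 120-200 is rejected because no upstream exons exist.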
Gene Builder
Combines results after GeneWise and eventually ab initio predictions.
Clusters transcripts into genes by genomic exon overlap.
Groups transcripts which share exons
Rejects non-translating transcripts
Removes duplicate exons
Attaches supporting evidence
Writes genes to database

When we combine the transcripts from the GeneWise and UTR addition stages, they go through many steps. First we cluster the transcripts on the basis of genomic overlap and prune redundant transcripts. The clusters are then sorted by length, using both total exon length and translation length. Longer translations are given priority over shorter translations, and translations with UTRs are given priority over translations without. Shorter transcripts are then subsumed into longer transcripts, provided all pairs of adjacent exons in the shorter transcript are shared by the longer transcript. If this happens, the evidence is transferred and the shorter transcript is thrown away. Pairs of exons are considered so that alternative transcripts aren’t lost. If a cluster contains only single-exon genes, the longest exon is kept and the evidence transferred.

The non-redundant transcripts are then clustered into genes using exon overlap. Two transcripts belong in the same gene provided that at least one exon overlaps. Transcripts which lie entirely within introns of other transcripts are clustered separately. If a cluster contains a particularly large number of transcripts, we select the best transcripts for that cluster, 10 by default. We take long translations and transcripts with UTRs over shorter translations. This happens relatively rarely though. Finally we remove duplicate exons, transfer all evidence and store the genes. The genes are stored as a unique set of exons which are then linked to transcripts. We store all the supporting evidence from each transcript where the exon appears.
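Two of the steps above, subsuming shorter transcripts by shared exon pairs and clustering transcripts into genes by exon overlap, can be sketched as below. This is a hedged illustration only: transcripts are modelled as lists of `(start, end)` tuples on one strand, "shared" exon pairs are taken to mean identical coordinates, and clustering is transitive (single linkage). All of these are assumptions for the example, and the function names are hypothetical.

```python
from typing import List, Set, Tuple

Exon = Tuple[int, int]    # (start, end), inclusive, one strand assumed
Transcript = List[Exon]   # exons ordered along the genome


def exon_pairs(t: Transcript) -> Set[Tuple[Exon, Exon]]:
    """All pairs of adjacent exons in a transcript."""
    return set(zip(t, t[1:]))


def is_subsumed(short: Transcript, long: Transcript) -> bool:
    """True if every adjacent exon pair of `short` also occurs in `long`."""
    pairs = exon_pairs(short)
    return bool(pairs) and pairs <= exon_pairs(long)


def share_exon(t1: Transcript, t2: Transcript) -> bool:
    """True if at least one exon of t1 overlaps an exon of t2."""
    return any(a[0] <= b[1] and b[0] <= a[1] for a in t1 for b in t2)


def cluster_into_genes(transcripts: List[Transcript]) -> List[List[int]]:
    """Group transcript indices into genes using a small union-find."""
    parent = list(range(len(transcripts)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(transcripts)):
        for j in range(i + 1, len(transcripts)):
            if share_exon(transcripts[i], transcripts[j]):
                parent[find(i)] = find(j)

    genes = {}
    for i in range(len(transcripts)):
        genes.setdefault(find(i), []).append(i)
    return sorted(genes.values())
```

Comparing exon pairs rather than single exons is what keeps an exon-skipping variant from being absorbed: a transcript joining exons 1 and 3 directly has a pair that the three-exon transcript lacks. A transcript lying entirely inside an intron shares no exon and so ends up in its own cluster.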
Evidence Tracks in ContigView Expanded tracks Compressed tracks
Pseudogenes and ncRNA
Pseudogenes: ‘False’ Genes
Processed: reverse transcription and re-integration (mRNA, AAAAAA, pseudogene)
Unprocessed: produced by gene duplication and rearrangement
Spliced Elsewhere
BLASTs single exon genes against a database of the multi-exon genes.
Span of real gene > 3x span of retro gene
Finds an additional ~ 600 pseudogenes in human
False positives can occur where gene predictions join together neighbouring genes in a cluster
Questions remain over the wisdom of calling all of these genes pseudogenes, as some may be functional.

Single exon transcripts with frameshifts
Single exon transcripts with a spliced gene model elsewhere
Transcripts whose introns contain more than 80% repeat sequences
A gene is labeled as a pseudogene if all transcripts in that gene are labeled as pseudo-transcripts
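The labelling logic on this slide can be sketched as a pair of predicates. This assumes that the three listed criteria are what marks a pseudo-transcript, and that the per-transcript facts (frameshift, spliced copy elsewhere, repeat fraction of introns) have already been computed upstream. The `TranscriptInfo` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TranscriptInfo:
    # Hypothetical summary of one transcript; fields are assumptions.
    n_exons: int
    has_frameshift: bool
    spliced_copy_elsewhere: bool     # e.g. found by the Spliced Elsewhere BLAST
    intron_repeat_fraction: float    # fraction of intron bases in repeats, 0..1


def is_pseudo_transcript(t: TranscriptInfo, repeat_threshold: float = 0.8) -> bool:
    """Apply the three criteria listed on the slide."""
    if t.n_exons == 1:
        return t.has_frameshift or t.spliced_copy_elsewhere
    # Multi-exon transcript: introns made up of more than 80% repeats.
    return t.intron_repeat_fraction > repeat_threshold


def is_pseudogene(transcripts: List[TranscriptInfo]) -> bool:
    """A gene is a pseudogene only if all its transcripts are pseudo-transcripts."""
    return bool(transcripts) and all(is_pseudo_transcript(t) for t in transcripts)
```

Requiring all transcripts to be pseudo-transcripts means that one well-supported transcript is enough to keep a gene in the core set.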
ncRNAs Functional RNAs Families share conserved secondary structure Low sequence identity Ribosome Spliceosome tRNAs miRNA
Rfam
Hand-made alignments
Use Infernal to make Covariance Models
Scan models over subset of EMBL to build family alignments
miRNA Highly conserved across species Precursor stem loop sequence ~ 70nt Mature miRNA ~ 21nt BLAST genomic v miRBase precursors RNAfold used to test for stem loop Mature sequence identified (only 2 nt changes tolerated)
Structures
Structures identified by Infernal / RNAfold are stored as transcript attributes

    ::::::::::::::::<<-<<<<<-<<<________________>>>>>>>>-->>,,,,
  1 AuCUUUGCGCAGGGGCaaUaucguAgccAGUGAGGcUuuaCCGAggcgcgauUAuuGCUA 60
    A+CUUUGCGCAG GGCA:UAU :UAGCCA+UGAGG+UU++CCGAGGCG: AUUA:UGCUA
181 AGCUUUGCGCAGUGGCAGUAUCAUAGCCAAUGAGGUUUAUCCGAGGCGCAAUUAUUGCUA 240

    <<<<_.________.__>>>>,,,,,<<<.<<<<<<<<<<____......__>>>>>>>>
 61 gUugA.AAACUAUU.CCcaAccgCCCgcc.aagacgacauguua......uauugucggc 111
    :UU A AAA UA      AA:+G   G:C ::: ::A:::+UUA      U :::U::+:
241 AUUAAuAAAUUAAAuAAUAAAAGGG-GACuCUU-UUAGUGCUUAuaaaggUUUACUAACC 298

    >>->>>,,,,,,,,,,,,<<<<____>>>>
112 uuuggcAAUUUUUGGAAGcccuccAaaggg 141
    :: G:CAA UU   +AAG  ::C+AA::
299 ACAGACAACUU---AAAGGUAACAAACCUA 325

Displayable on website as markup on transcript sequence
Human Build Statistics
NCBI 36 assembly, released November 2005
‘known’ genes: 21,571
‘novel’ genes: 2,142
Coding transcripts: 49,043
Non-coding transcripts: 4,145
Ensembl exons: 278,632
Human input sequences: 260,031 proteins, redundant set

+---------------------+----------+
| miRNA               |      606 |
| miRNA_pseudogene    |       22 |
| misc_RNA            |     1060 |
| misc_RNA_pseudogene |        7 |
| Mt_rRNA             |        1 |
| Mt_tRNA             |       22 |
| Mt_tRNA_pseudogene  |      603 |
| protein_coding      |    23713 |
| pseudogene          |      731 |
| rRNA                |      334 |
| rRNA_pseudogene     |      393 |
| scRNA               |        1 |
| scRNA_pseudogene    |      902 |
| snoRNA              |      609 |
| snoRNA_pseudogene   |      564 |
| snRNA               |     1387 |
| snRNA_pseudogene    |      632 |
| tRNA_pseudogene     |      131 |
+---------------------+----------+

In parallel with the protein alignments, we align all full-length cDNAs from an organism to its genomic sequence. Our sequence sources vary: for the human genome we use cDNA sequences from EMBL (Stoesser et al. 1997) and RefSeq (Pruitt et al. 2000), while for the mouse genome we additionally use the FANTOM2 data set (Okazaki et al. 2002). For the most recent human build (NCBI33) we aligned 86,918 cDNAs. From a starting point of 48,176 human proteins, 42,589 proteins were placed at one location in the genome, 3,173 at 2 locations and 781 at 3 locations. 492 proteins (1%) could not be located at all, due to missing genomic sequence or insufficient coverage of the placed protein. The final gene build step resulted in 23,299 genes containing 32,035 transcripts. Of these, 270 (0.84%) were built solely from cDNAs, 6,219 (19%) were built from human proteins with no UTR attachment and 2,983 (9%) were built from non-human proteins with no UTR attachment. Of the combined transcripts with UTRs, 21,889 (68%) were built from human protein and cDNA and 674 (2%) from non-human protein and cDNA. Of the transcripts built, 962 were tagged as pseudogenes.
Classification of Transcripts Ensembl Transcripts or Proteins are mapped to UniProt/Swiss-Prot, NCBI RefSeq and UniProt/TrEMBL entries Known genes map to species-specific protein records (targeted build) Novel genes do not map to species-specific protein records (similarity build)
Names and Descriptions Names are inferred from mapped proteins Official gene symbol is assigned if available HGNC (HUGO) symbol for human genes Species-specific nomenclature committees Otherwise Swiss-Prot > RefSeq > TrEMBL ID Novel transcripts have only Ensembl identifiers Genes named after ‘best-named’ transcript Gene description is inferred from mapped database entries, the source is always given
Supporting evidence ExonView
Configuring the Gene Build
Data availability
  Targeted build most useful in human and mouse.
  Similarity build more important in other species.
Structural issues
  Zebrafish
    Many similar genes near each other
    Genome from different haplotypes
  Mosquito
    Many single-exon genes
    Genes within genes
Configuration files provide flexibility
Low Coverage Genomes Low coverage genomes (~2x) come in lots of scaffolds: “classic” genebuild will result in many partial and fragmented genes Whole Genome Alignment (WGA) to an annotated reference genome: this method reduces fragmentation by piecing together scaffolds into “gene-scaffolds” that contain complete gene(s)
Gene building summary Initial location of possible genes using GENSCAN peptides and BLAST. ReBLASTing of all high scoring proteins with BLAST to find regions GENSCAN has missed Realignment of proteins using GeneWise mRNA/EST genes built using GENSCAN exons.
Evaluating Genes and Transcripts
Ensembl gene set
Pseudogenes
Ensembl EST genes
Ab initio predictions
Manual curation (Vega, CCDS)
Gene models from other groups

Now I will talk about how we produce the EST genes. These genes aren’t combined into our core genes because we don’t consider the EST data source to be of high enough quality for most genomes.
[Flowchart labels: Human Proteins, Other Proteins, Human cDNAs, Human ESTs; Genewise, Exonerate; Genewise genes, Aligned cDNAs, Aligned ESTs; Genewise genes with UTRs; ClusterMerge; Supported ab initio (optional); Genebuilder; Preliminary gene set; cDNA genes; Gene Combiner; Core Ensembl genes; Pseudogenes; Final set + pseudogenes; Ensembl EST genes]

The EST genebuild process is run alongside the standard genebuild. It uses the same set of analyses as we use to align and predict genes from the cDNAs. Initially the ESTs are aligned to the genome using Guy Slater’s exonerate. Then the aligned ESTs are merged into transcripts using Eduardo Eyras’ ClusterMerge algorithm.

Expressed sequence tags (ESTs) are notorious for their variable quality. They are single-read sequences and thus prone to sequencing error. Additionally, the libraries from which they are derived can often be contaminated with genomic sequence, which cannot be detected by an automatic annotation system. Finally, they are generally around 400bp long, and thus a single EST is unlikely to cover an entire gene. For these reasons we have less confidence in genes built from ESTs, so we build them separately from the main gene build.
EST Analysis
Map ESTs with Exonerate (determine coverage, % identity and location in genome)
Filter on % identity and depth (6 million ESTs from dbEST – we map about 1/3)

Exonerate is run on chunks of ESTs, generally about 300 at once, against the whole genome. The analysis runs the chunk against each chromosome, then takes the results and first filters them on coverage and identity. It then finds the best-in-genome hit for each EST, plus those which lie within 2% identity of it: we take the best in genome + 2%. These EST alignments are then stored against the genome as gene structures; while they probably don’t translate, this makes later processing easier.

When filtering, we only take ESTs which match with 97% identity to the genomic sequence and 90% coverage of the EST sequence. This is because ESTs are single-read sequences and thus prone to sequencing error. Additionally, the libraries from which they are derived can often be contaminated with genomic sequence, which cannot be detected by an automatic annotation system.
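The filter described in these notes can be sketched as follows. This is a minimal illustration, not the pipeline code: the `Hit` class and function name are hypothetical, the thresholds are the ones quoted above (97% identity, 90% coverage), and "within 2% identity" is assumed to mean within 2 percentage points of the best surviving hit for that EST.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hit:
    # Hypothetical record of one EST-to-genome alignment.
    est_id: str
    chrom: str
    identity: float   # percent identity to the genomic sequence
    coverage: float   # percent of the EST sequence covered


def filter_est_hits(hits: List[Hit],
                    min_identity: float = 97.0,
                    min_coverage: float = 90.0,
                    window: float = 2.0) -> List[Hit]:
    """Keep hits passing the thresholds, then best-in-genome plus a 2% window."""
    passed = [h for h in hits
              if h.identity >= min_identity and h.coverage >= min_coverage]

    # Best identity per EST across all chromosomes.
    best: Dict[str, float] = {}
    for h in passed:
        best[h.est_id] = max(best.get(h.est_id, 0.0), h.identity)

    return [h for h in passed if h.identity >= best[h.est_id] - window]
```

With the 97% floor already applied, the 2% window only removes hits when the best hit for an EST is above 99% identity.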
Alternative Splicing Forms
ESTs
Merge ESTs according to consecutive exon overlap and set splice ends
Assign translation
Alternative transcripts with translation and UTRs

ClusterMerge groups and merges the exons from the alignments to find a non-redundant set of exons. Then we use ORF-finding code to assign each alternative transcript a translation and UTRs. These are stored in our database and presented as the EST Genes.
EST Genes and ESTs
[Figure labels: Ensembl transcript; EST transcripts; Human ESTs]

The EST genes are displayed on the website alongside the core Ensembl genes. These are the transcripts colored purple. There were nearly 25,000 genes in the last human EST build and more than 40,000 transcripts.

Latest Human Build: NCBI 36 assembly, released Nov 2005
EST Genes: 28,639
EST Transcripts: 58,916
Evaluating Genes and Transcripts
Ensembl gene set
Pseudogenes
Ensembl EST genes
Ab initio predictions
Manual curation (Vega, CCDS)
Gene models from other groups

On the Ensembl website we also display predictions from ab initio programs like Genscan and Genefinder.

Ab initio Genscan predictions
Ab initio Predictions GENSCAN transcript Chris Burge and Samuel Karlin (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94.
Evaluating Genes and Transcripts Ensembl gene set Pseudogenes Ensembl EST genes Ab initio predictions Manual curation (Vega, CCDS) Gene models from other groups Information about manually curated sequences is also available through ensembl
Manual Curation
Manual annotation of finished clones
Vega Genome Browser http://vega.sanger.ac.uk/
Currently only chromosomes:
  1, 6, 9, 10, 13, 20, 22, X and Y (Sanger Institute)
  7 (Washington University)
  14 (Genoscope)
  18 (Broad Institute)
  16, 19 (DOE Joint Genome Institute)
Other groups will also contribute to Vega

The Vega website, which displays manually curated data from human and other species, is based on Ensembl technology, and the Ensembl website also displays the annotation of finished chromosomes when it is available.
Manual Curation
Manually-curated gene sets in Ensembl
  WormBase (data import): Caenorhabditis elegans
  FlyBase (data import): Drosophila melanogaster
  Génoscope (data import): Tetraodon nigroviridis
  IMCB, Singapore (data import): Takifugu rubripes
  SGD (data import): Saccharomyces cerevisiae
Vega includes some manually-curated finished clones from Danio rerio, Mus musculus and Canis familiaris

We import manually curated data for C. elegans and Drosophila from the respective communities, and Vega also displays manually curated clones from mouse and zebrafish.
Manually-curated Vega Genes
[Figure label: Vega manual curation]

As already mentioned, the human manually curated genes from Vega are displayed on the Ensembl website. These genes are blue, and you can jump between the two websites from these genes.
Manually-curated Vega Genes
Vega Genes ContigView Vega transcripts
Ensembl / Havana Merge ~12,000 full-length protein-coding transcripts annotated by the Sanger Havana team (part of Vega) were added to the human Ensembl gene set in v38 Transcripts: Ensembl: red / black Havana: blue Ensembl/Havana: gold Genes:
Ensembl / Havana Merge
[Figure labels: Merged Ensembl / Havana gene; Ensembl transcript; Havana transcripts]
Ensembl / Havana Merge

Transcripts:
+---------------------------+--------+-------+
| logic_name                | status | count |
+---------------------------+--------+-------+
| ensembl                   | KNOWN  | 31402 |
| ensembl                   | NOVEL  |  4709 |
| ensembl_havana_transcript | KNOWN  |   174 |
| havana                    | KNOWN  | 11153 |
| havana                    | NOVEL  |   780 |
+---------------------------+--------+-------+

Genes:
+---------------------+--------+-------+
| logic_name          | status | count |
+---------------------+--------+-------+
| ensembl             | KNOWN  | 15044 |
| ensembl             | NOVEL  |  2145 |
| ensembl_havana_gene | KNOWN  |  6407 |
| ensembl_havana_gene | NOVEL  |    18 |
| havana              | KNOWN  |    71 |
| havana              | NOVEL  |    25 |
+---------------------+--------+-------+
CCDS (consensus CDS) Collaboration between NCBI, UCSC, Ensembl and Havana to produce a set of stable, reliable, complete (ATG->stop) CDS structures for human Long term aim is to get to a single gene set for human The genebuild pipeline has been modified to retain these ‘blessed’ CDSs (stored in a database for incorporation in the build)
Comparison of CDSs to NCBI Exact matching CDS on the genome with: Complete CDS (ATG->stop) No frameshifts No phase problems No internal stop codons NCBI Hinxton
CCDS release (March 2005) Conservative first set, so the following have been removed: All CDSs which match XMs CDSs with large cDNA v genomic discrepancies CDSs with non consensus splice sites Set contains: 14795 different CDSs 16085 transcripts (in Ensembl) 13031 genes The genebuild pipeline has been modified to retain these ‘blessed’ CDSs (stored in a database for incorporation in the build)
ENCODE regions
44 regions representing 1% of the genome
14 manually / semi-manually picked regions
30 randomly picked 0.5Mb regions at varying gene density and non-exonic conservation (three bands for each).
Example id: ENr123
GENCODE Evaluation of prediction accuracy of automatic annotation methods in the Encode regions Based on comparison to manual annotation generated by Havana team Some experimental confirmation of annotation Divided into categories of prediction based on data and methods used: Ab initio Comparative data only Protein, EST and cDNA based Any available data
GENCODE regions For 13 of the regions annotation released prior to competition for training: 2 manual picks 11 random (2 level 1, 4 level 2 and 5 level 3) For the remaining 31 regions 12 manual picks 19 random (8 level 1, 6 level 2 and 5 level 3)
EGASP
How complete is the Vega-ENCODE annotation when compared to other existing gene data sets?
How well are the programs able to reproduce the Vega-ENCODE annotation?
How reliable are the predictions outside of the Vega-ENCODE annotation?
Is there anything outside the annotation and the predictions?
M.G. Reese & R. Guigó Genome Biology 7:S1 (2006)
EGASP 05

            Nucleotide            Exon
            Sn     Sp     CC      Sn     Sp     SnSp
Acembly     0.96   0.58   0.74    0.84   0.38   0.613
ECgene      0.96   0.46   0.66    0.75   0.30   0.528
Ensembl     0.91   0.92   0.92    0.77   0.82   0.800
Exogean     0.84   0.94   0.89    0.71   0.74   0.728
AceView     0.91   0.79   0.84    0.74   0.49   0.624
Pairagon    0.87   0.93   0.90    0.67   0.78   0.732
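For readers unfamiliar with the column headings, the sketch below shows how nucleotide-level Sn and Sp are commonly computed in gene-prediction benchmarks: Sn is the fraction of annotated coding bases that were predicted, and Sp is the fraction of predicted coding bases that are annotated. This follows the usual convention in the field and is an assumption here, since the slide does not define the measures; the function names are hypothetical. The SnSp column appears to be the mean of Sn and Sp, which is also an inference from the numbers.

```python
from typing import Iterable, Set, Tuple


def positions(exons: Iterable[Tuple[int, int]]) -> Set[int]:
    """Expand inclusive (start, end) exons into a set of base positions."""
    return {p for start, end in exons for p in range(start, end + 1)}


def nucleotide_sn_sp(annotated: Set[int], predicted: Set[int]) -> Tuple[float, float]:
    """Return (Sn, Sp) at the nucleotide level.

    Sn = TP / (TP + FN): share of annotated bases that were predicted.
    Sp = TP / (TP + FP): share of predicted bases that are annotated.
    """
    tp = len(annotated & predicted)
    sn = tp / len(annotated) if annotated else 0.0
    sp = tp / len(predicted) if predicted else 0.0
    return sn, sp


# Usage: a prediction covering only the first half of an annotated exon.
sn, sp = nucleotide_sn_sp(positions([(1, 100)]), positions([(1, 50)]))
mean_sn_sp = (sn + sp) / 2
```

A predictor that calls very little but is always right scores high Sp and low Sn, which is why the table reports both.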
Ensembl benchmarking Genes Exonhunter Exogean Augustus AceView Pairagon Ensembl
Evaluating Genes and Transcripts Ensembl gene set Pseudogenes Ensembl EST genes Ab initio predictions Manual curation (Vega) Gene models from other groups
Other Gene Models

Through the use of DAS, the Ensembl website displays gene models from other groups, such as Geneid, NCBI or Twinscan transcripts. These link to the peptide sequences of the transcripts and also give information about the group and program which created them.
Q & A
Traces
Initial Placement
[Flowchart labels: Targeted Build; Similarity Build; Species-specific Proteins (UniProt and RefSeq); Closely Related Proteins (UniProt); pmatch*; Genome Sequence; Initial match in genome; GENSCAN; SNAP; Ab initio Transcripts; BLAST vs repeat-masked slice, not just GENSCAN exons (coverage threshold); BLAST; Raw Exons; Known genes; Novel genes]

To use GeneWise effectively, we first align the proteins to the genome. Species-specific proteins are aligned using pmatch, a program written by Richard Durbin for fast exact matching. Other proteins are aligned using swall. These alignments are then used to pick which proteins to feed to the next stage. Those proteins which hit the genome are then realigned to the sequence using BLAST, and that alignment is used to build a miniseq (explained in the next slide). The miniseq and the protein sequence are passed to GeneWise, where a transcript prediction is made.

In the last slide I mentioned the miniseq. The miniseq is a construct we use to make GeneWise run in a reasonable time frame. When producing the miniseq we take the sequence of the hits from the re-BLAST plus 1kb of padding, and use this to produce a sequence to pass to GeneWise with the protein sequence. Once GeneWise returns a gene structure, we map it back into genomic coordinates to produce our spliced alignment.

*R. Durbin, unpublished
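The miniseq idea can be sketched as below: pad each BLAST hit, merge overlapping padded regions, concatenate them into a short sequence, and keep the block list so that positions can be mapped back to the genome. This is an illustration under stated assumptions, not the pipeline implementation: coordinates are 0-based half-open, the padding is applied on both sides of every hit, and the function names are hypothetical.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # 0-based, half-open genomic interval


def build_miniseq(genome: str, hits: List[Block],
                  padding: int = 1000) -> Tuple[str, List[Block]]:
    """Return the concatenated miniseq and the genomic blocks it was built from."""
    padded = sorted((max(0, s - padding), min(len(genome), e + padding))
                    for s, e in hits)
    merged: List[Block] = []
    for s, e in padded:
        if merged and s <= merged[-1][1]:
            # Overlaps or touches the previous block: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return "".join(genome[s:e] for s, e in merged), merged


def mini_to_genomic(pos: int, blocks: List[Block]) -> int:
    """Map a 0-based miniseq position back to its genomic position."""
    offset = 0
    for s, e in blocks:
        length = e - s
        if pos < offset + length:
            return s + (pos - offset)
        offset += length
    raise ValueError("position lies outside the miniseq")
```

Because long introns between hits are dropped, the sequence handed to the aligner is much shorter than the genomic span, which is what keeps the run time reasonable; the block list is what allows the resulting structure to be reported in genomic coordinates.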
Genes from GENSCANs
Highly conservative
Requires strong supporting evidence
Make exons from GENSCANs
  Discard short features
  Discard low scoring features
  Discard unsupported GENSCAN exons
Make exon pairs if
  strands match
  coordinates sane
  neighbouring features abut
Link exon pairs recursively to build transcripts
[Figure labels: GENSCAN; Protein, vertRNA etc.; Transcript 1; Transcript 2]
Vega-ENCODE annotation