Evaluating genes and transcripts in Ensembl March 2007.


1 Evaluating genes and transcripts in Ensembl March 2007

2 2 of 37 Outline: Ensembl gene set; Pseudogenes; Ensembl EST genes; Ab initio predictions; Manual curation (Vega, CCDS); Gene models from other groups

3 3 of 37 Overview [diagram labels: evidence; Ensembl predictions; manual curation; other groups’ models]

4 4 of 37 Introduction. What is available? I) Sequence assemblies from genome sequencing efforts. II) Proteins/mRNAs submitted to databases such as UniProt and RefSeq. How does Ensembl use this information to define distinct genes on the genome?

5 5 of 37 [Genebuild pipeline flow diagram. Labels: Mouse and Rat Proteins; Other proteins; Genewise; Genewise genes; Targeted; Similarity (Homology); Mouse and Rat cDNAs; Exonerate; Aligned cDNAs; Genewise genes with UTRs; Supported GENSCANs; Genebuilder; Preliminary gene set; cDNA genes; ClusterMerge; Gene Combiner; Core Ensembl genes; Pseudogenes; Selenocysteine genes; Final gene set + pseudogenes; ESTs; Aligned ESTs; Ensembl EST genes; Homologous genes; Human, Mouse, Dog, Opossum & old genebuild]

6 6 of 37 Gene Builds Are Protein-based. Simple DNA-DNA alignments do NOT lead to translatable genes. It is essential to align at the protein level, allowing for frameshifts and splice sites. GeneWise*: protein-genome alignments; splice site model; penalises stop codons; models frameshifts. *E. Birney et al. Genome Research 14:988-995 (2004)

7 7 of 37 Ensembl Gene Build: align species-specific proteins; align similar proteins from closely related species; use mRNA information to add UTRs; build transcripts using mRNA evidence; build additional transcripts using ab initio predictors and homology evidence; combine annotations to make genes with alternative transcripts.

8 8 of 37 Trouble with BLAST [diagram labels: AG and GT splice sites; Real gene; Ideal BLAST; Reality]. BLAST is good for finding possible exon positions in large genomic sequences.

9 9 of 37 BLAST ‘replacements’. Exonerate* (Guy Slater): fast gapped DNA-DNA matcher, 10,000 x faster than BLAST. Pmatch (Richard Durbin): fast exact protein-DNA matcher, >10,000 x faster than BLAST. *BMC Bioinformatics 6: 31 (2005)

10 10 of 37 Gene Builder. Combines results after GeneWise and, eventually, ab initio predictions. Clusters transcripts into genes by genomic exon overlap: groups transcripts that share exons; rejects non-translating transcripts; removes duplicate exons; attaches supporting evidence; writes genes to database.
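The clustering step on this slide can be pictured with a small sketch. This is not Ensembl's code: it only illustrates grouping transcripts into genes whenever any of their exons overlap on the genome, using a union-find structure. The names `Exon`, `exons_overlap` and `cluster_transcripts` are hypothetical, and all transcripts are assumed to lie on the same strand of the same sequence.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: an exon is (start, end), inclusive coordinates.
Exon = Tuple[int, int]


def exons_overlap(a: Exon, b: Exon) -> bool:
    """True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def cluster_transcripts(transcripts: Dict[str, List[Exon]]) -> List[List[str]]:
    """Group transcript names into genes by transitive exon overlap."""
    names = sorted(transcripts)
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Compare every pair of transcripts; merge clusters when any exons overlap.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if any(exons_overlap(x, y)
                   for x in transcripts[a] for y in transcripts[b]):
                parent[find(a)] = find(b)

    genes: Dict[str, List[str]] = {}
    for name in names:
        genes.setdefault(find(name), []).append(name)
    return sorted(genes.values())


if __name__ == "__main__":
    toy = {
        "t1": [(100, 200), (300, 400)],
        "t2": [(350, 450), (600, 700)],
        "t3": [(650, 800)],
        "t4": [(1000, 1100)],
    }
    for gene in cluster_transcripts(toy):
        print(gene)
```

In the toy input, t1 and t3 share no exon directly but both overlap t2, so all three fall into one gene. The pairwise comparison is quadratic in the number of transcripts; a real pipeline would presumably sort by coordinate first.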

11 11 of 37 Evidence Tracks in ContigView [screenshots: compressed tracks; expanded tracks]

12 12 of 37 Pseudogenes: ‘False’ Genes. Unprocessed: produced by gene duplication and rearrangement. Processed: produced by reverse transcription of an mRNA and re-integration into the genome. [Diagram labels: mRNA with poly-A tail (AAAAAA); pseudogene (AAAAAA)]

13 13 of 37 Functional RNAs. Families share conserved secondary structure; low sequence identity. [Diagram labels: Ribosome; Spliceosome; tRNAs; miRNA; ncRNAs]

14 14 of 37 Rfam. Hand-made alignments. Use Infernal to make covariance models. Scan models over a subset of EMBL to build family alignments.

15 15 of 37 miRNA. Highly conserved across species. Precursor stem loop sequence ~70 nt; mature miRNA ~21 nt. BLAST genomic sequence against miRBase precursors. RNAfold used to test for stem loop. Mature sequence identified (only 2 nt changes tolerated).
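The last step on this slide, accepting a mature sequence only if it differs by at most 2 nt, can be sketched as a sliding-window mismatch count. This is an illustration only: it assumes "changes" means substitutions (no insertions or deletions), the function names are hypothetical, and the sequences are toy data rather than miRBase entries.

```python
from typing import List, Tuple


def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_mature(hit: str, mature: str, max_changes: int = 2) -> List[Tuple[int, int]]:
    """Return (offset, mismatch count) for every window of `hit` that matches
    `mature` with at most `max_changes` substitutions."""
    found = []
    for start in range(len(hit) - len(mature) + 1):
        window = hit[start:start + len(mature)]
        n = mismatches(window, mature)
        if n <= max_changes:
            found.append((start, n))
    return found


if __name__ == "__main__":
    mature = "TGAGGTAGTAGGTTGTATAGTT"   # toy 22 nt mature sequence
    variant = list(mature)
    variant[5] = "C"                    # first substitution
    variant[15] = "A"                   # second substitution
    genomic_hit = "CC" + "".join(variant) + "GG"
    print(find_mature(genomic_hit, mature))
    print(find_mature(genomic_hit, mature, max_changes=1))
```

With the default tolerance the variant window is reported; tightening `max_changes` to 1 rejects it.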

16 16 of 37 Human Build Statistics
+---------------------+----------+
| miRNA               |      581 |
| miRNA_pseudogene    |       22 |
| misc_RNA            |     1060 |
| misc_RNA_pseudogene |        7 |
| Mt_rRNA             |        2 |
| Mt_tRNA             |       22 |
| Mt_tRNA_pseudogene  |      603 |
| protein_coding      |    22726 |
| pseudogene          |     1069 |
| rRNA                |      332 |
| rRNA_pseudogene     |      393 |
| scRNA               |        1 |
| scRNA_pseudogene    |      902 |
| snoRNA              |      609 |
| snoRNA_pseudogene   |      564 |
| snRNA               |     1387 |
| snRNA_pseudogene    |      632 |
| tRNA_pseudogene     |      131 |
+---------------------+----------+
NCBI 36 assembly, released November 2005
‘known’ genes: 21,661
‘novel’ genes: 1,064
Coding transcripts: 43,466
Non-coding transcripts: 1,071
Ensembl exons: 270,239
Human input sequences: 260,031 proteins, redundant set

17 17 of 37 Classification of Transcripts. Ensembl transcripts or proteins are mapped to UniProt/Swiss-Prot, NCBI RefSeq and UniProt/TrEMBL entries. Known genes map to species-specific protein records (targeted build). Novel genes are derived from mouse protein records (similarity build).

18 18 of 37 Names and Descriptions. Names are inferred from mapped proteins. The official gene symbol is assigned if available: the RGD symbol for rat genes, or that of the species-specific nomenclature committee (e.g. HGNC); otherwise the Swiss-Prot > RefSeq > TrEMBL ID. Novel transcripts have only Ensembl identifiers. Genes are named after the ‘best-named’ transcript. The gene description is inferred from mapped database entries; the source is always given.
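The naming priority on this slide can be sketched as a simple ordered lookup. This is a guess at the logic, not Ensembl's implementation: the dictionary layout, the key `official_symbol`, the function names and the identifiers in the example are all hypothetical placeholders.

```python
from typing import Dict, List, Optional, Tuple

# Priority order from the slide: official symbol first, then
# Swiss-Prot > RefSeq > TrEMBL. The key names are assumptions.
SOURCE_PRIORITY: List[str] = ["official_symbol", "Swiss-Prot", "RefSeq", "TrEMBL"]


def best_name(mapped: Dict[str, str]) -> Optional[Tuple[int, str]]:
    """Return (rank, name) for the highest-priority mapped source, or None
    when nothing is mapped (a novel transcript keeps only its Ensembl ID)."""
    for rank, source in enumerate(SOURCE_PRIORITY):
        if mapped.get(source):
            return rank, mapped[source]
    return None


def gene_name(transcripts: Dict[str, Dict[str, str]]) -> Optional[str]:
    """Name a gene after its 'best-named' transcript: the one whose name
    comes from the highest-priority source."""
    candidates = [best_name(m) for m in transcripts.values()]
    candidates = [c for c in candidates if c is not None]
    if not candidates:
        return None
    return min(candidates)[1]


if __name__ == "__main__":
    gene = {
        "transcript_a": {"TrEMBL": "trembl_placeholder"},
        "transcript_b": {"RefSeq": "refseq_placeholder"},
        "transcript_c": {},
    }
    print(gene_name(gene))
```

Here transcript_b wins because RefSeq outranks TrEMBL; a gene whose transcripts have no mappings at all returns `None`.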

19 19 of 37 Supporting evidence [ExonView screenshot; labels: cDNA; peptide; UTR; coding/UTR]

20 20 of 37 Configuring the Gene Build. Data availability: targeted build most useful in human and mouse; similarity build more important in other species. Structural issues: zebrafish (many similar genes near each other; genome from different haplotypes); mosquito (many single-exon genes; genes within genes). Configuration files provide flexibility.

21 21 of 37 Gene Building Summary. Initial location of possible genes using GENSCAN peptides and BLAST. ReBLASTing of all high-scoring proteins to find regions GENSCAN has missed. Realignment of proteins using GeneWise. mRNA/EST genes built using GENSCAN exons.

22 22 of 37 Evaluating Genes and Transcripts: Ensembl gene set; Pseudogenes; Ensembl EST genes; Ab initio predictions; Manual curation (Vega, CCDS); Gene models from other groups

23 23 of 37 [Genebuild pipeline flow diagram, repeated. Labels: ESTs; Exonerate; Aligned ESTs; ClusterMerge; Ensembl EST genes; Mouse and Rat Proteins; Other proteins; Genewise; Genewise genes; Mouse and Rat cDNAs; Aligned cDNAs; Genewise genes with UTRs; Supported GENSCANs; Genebuilder; Preliminary gene set; cDNA genes; Gene Combiner; Core Ensembl genes; Pseudogenes; Selenocysteine genes; Final gene set + pseudogenes; Homologous genes; Human, Mouse, Dog, Opossum & old rat genes]

24 24 of 37 EST Analysis. Map ESTs with Exonerate (determine coverage, % identity and location in genome). Filter on % identity and depth (966,141 ESTs from dbEST, 618,606 mapped).

25 25 of 37 Merge ESTs according to consecutive exon overlap and set splice ends. Assign translation. Alternative transcripts with translation and UTRs. [Diagram labels: ESTs; Alternative Splicing Forms]

26 26 of 37 EST Genes and ESTs. Latest human build: NCBI 36 assembly, released Nov 2005. EST genes: 28,639. EST transcripts: 58,916. [Track labels: Human ESTs; EST transcripts; Ensembl transcript]

27 27 of 37 Evaluating Genes and Transcripts: Ensembl gene set; Pseudogenes; Ensembl EST genes; Ab initio predictions; Manual curation (Vega, CCDS); Gene models from other groups

28 28 of 37 Ab initio Predictions GENSCAN transcript

29 29 of 37 Evaluating Genes and Transcripts: Ensembl gene set; Pseudogenes; Ensembl EST genes; Ab initio predictions; Manual curation (Vega, CCDS); Gene models from other groups

30 30 of 37 Imported Sets. Manually-curated gene sets in Ensembl: WormBase (data import): Caenorhabditis elegans; FlyBase (data import): Drosophila melanogaster; SGD (data import): Saccharomyces cerevisiae; Génoscope (data import): Tetraodon nigroviridis; IMCB, Singapore (data import): Takifugu rubripes. Vega includes some manually-curated finished clones from Danio rerio, Mus musculus and Canis familiaris.

31 31 of 62 Manual curation. People are the best at: resolving conflicting heterogeneous information; recognising “out of the ordinary” biology. For high investment genomes (human and mouse), an automated pipeline with human intervention is the endgame.

32 32 of 62 Automatic vs Manual. Automatic annotation: quick whole genome analysis (~ weeks); consistent annotation; can use unfinished sequence or shotgun assembly; no polyA sites or signals, no pseudogenes; predicts ~80% of loci. Manual annotation: extremely slow (~3 months for Chr 6); needs finished sequence; flexible, can deal with inconsistencies; most rules have exceptions; consults publications as well as databases.

33 33 of 62 Manual curation. Manual annotation at the Vega Genome Browser (http://vega.sanger.ac.uk). Currently only chromosomes: 1, 6, 9, 10, 13, 20, 21, 22, X and Y (Sanger Institute); 3, 12 (Baylor College of Medicine); 7 (Washington University); 8, 15, 17, 18 (The Broad Institute); 14 (Genoscope); 16, 19 (DOE, Joint Genome Institute)

34 34 of 37 Comparison of CDSs to NCBI. Exact matching CDS on the genome with: complete CDS (ATG->stop); no frameshifts; no phase problems; no internal stop codons. [Diagram labels: NCBI; Hinxton]
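The quality criteria on this slide can be expressed as a few checks on a spliced CDS sequence. The sketch below is illustrative only and is not the CCDS pipeline: it assumes the standard genetic code (so a selenocysteine TGA would be flagged), treats "no frameshifts" as "length is a multiple of 3", and does not model exon phases. The name `cds_problems` is hypothetical.

```python
from typing import List

# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(cds: str) -> List[str]:
    """Return the list of criteria a CDS fails; an empty list means it passes."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3")
    # Split into whole codons, ignoring any trailing partial codon.
    whole = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, whole, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("no ATG start")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no terminal stop")
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        problems.append("internal stop codon")
    return problems


if __name__ == "__main__":
    for seq in ("ATGGCCTAA", "ATGTAAGCCTAA", "ATGGCCTA", "GCCGCCTAA"):
        print(seq, cds_problems(seq))
```

A CDS annotated identically by both groups and returning an empty list here would satisfy the sequence-level criteria listed on the slide, under the assumptions above.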

35 35 of 37 CCDS release (March 2007). Conservative first set, so the following have been removed: all CDSs which match XMs; CDSs with large cDNA vs genomic discrepancies; CDSs with non-consensus splice sites. Set contains 18,290 different CDSs and 16,008 genes. The genebuild pipeline has been modified to retain these ‘blessed’ CDSs (stored in a database for incorporation in the build).

36 36 of 37 Evaluating Genes and Transcripts: Ensembl gene set; Pseudogenes; Ensembl EST genes; Ab initio predictions; Manual curation (Vega); Gene models from other groups

37 37 of 37 Other Gene Models

38 38 of 37 Q & A

39 39 of 37 ENCODE regions. 44 regions representing 1% of the genome: 14 manually / semi-manually picked regions; 30 randomly picked 0.5 Mb regions at varying gene density and non-exonic conservation (three bands for each). Example id: ENr123

40 40 of 37 GENCODE. Evaluation of the prediction accuracy of automatic annotation methods in the ENCODE regions. Based on comparison to manual annotation generated by the Havana team. Some experimental confirmation of annotation. Divided into categories of prediction based on data and methods used: ab initio; comparative data only; protein, EST and cDNA based; any available data.

41 41 of 37 EGASP 05
            Nucleotide            Exon
            Sn     Sp     CC      Sn     Sp     SnSp
Acembly     0.96   0.58   0.74    0.84   0.38   0.613
ECgene      0.96   0.46   0.66    0.75   0.30   0.528
Ensembl     0.91   0.92   0.92    0.77   0.82   0.800
Exogean     0.84   0.94   0.89    0.71   0.74   0.728
AceView     0.91   0.79   0.84    0.74   0.49   0.624
Pairagon    0.87   0.93   0.90    0.67   0.78   0.732
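For readers unfamiliar with the column headings, the sketch below shows how nucleotide-level Sn, Sp and CC are conventionally computed in gene-prediction benchmarks: Sn = TP/(TP+FN), Sp = TP/(TP+FP) (what other fields call precision), and CC is the correlation coefficient over the four counts. It is an illustration under assumptions, not the EGASP evaluation code: the function names are hypothetical, coordinates are taken as inclusive and within the region, and the intervals are toy data.

```python
import math
from typing import Iterable, Set, Tuple


def to_positions(intervals: Iterable[Tuple[int, int]]) -> Set[int]:
    """Expand inclusive (start, end) intervals into a set of base positions."""
    positions: Set[int] = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(annotated: Set[int], predicted: Set[int],
                        region_length: int) -> Tuple[float, float, float]:
    """Return (Sn, Sp, CC) at the nucleotide level for one region."""
    tp = len(annotated & predicted)        # coding in both
    fp = len(predicted - annotated)        # predicted only
    fn = len(annotated - predicted)        # annotated only
    tn = region_length - tp - fp - fn      # non-coding in both
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denom == 0:
        raise ValueError("Sn, Sp or CC undefined for this input")
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    cc = (tp * tn - fn * fp) / denom
    return sn, sp, cc


if __name__ == "__main__":
    annotated = to_positions([(10, 29)])   # 20 annotated coding bases
    predicted = to_positions([(20, 39)])   # 20 predicted coding bases
    sn, sp, cc = nucleotide_accuracy(annotated, predicted, region_length=100)
    print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.3f}")
```

The exon-level columns in the table are computed on whole exons rather than bases, which this sketch does not cover.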

42 42 of 37 Adding UTRs [diagram labels: cDNA - exonerate (UTRs, no phases); protein - GeneWise (phases, no UTRs); GeneWise prediction; Combined prediction]

43 43 of 37 EGASP. 1. How complete is the Vega-ENCODE annotation when compared to other existing gene data sets? 2. How well are the programs able to reproduce the Vega-ENCODE annotation? 3. How reliable are the predictions outside of the Vega-ENCODE annotation? 4. Is there anything outside the annotation and the predictions? M.G. Reese & R. Guigó, Genome Biology 7:S1 (2006)

44 44 of 37 GENCODE regions. For 13 of the regions, annotation was released prior to the competition for training: 2 manual picks and 11 random (2 level 1, 4 level 2 and 5 level 3). For the remaining 31 regions: 12 manual picks and 19 random (8 level 1, 6 level 2 and 5 level 3).

45 45 of 37 Low Coverage Genomes. Low coverage genomes (~2x) come in lots of scaffolds: the “classic” genebuild will result in many partial and fragmented genes. Whole Genome Alignment (WGA) to an annotated reference genome: this method reduces fragmentation by piecing together scaffolds into “gene-scaffolds” that contain complete gene(s).

46 46 of 37 Projection [diagram labels: Human gene; Human; Chimp]

47 47 of 37 Projection [diagram labels: Human; Chimp; runs of N (NNNN...) in the sequence]

