The Havana-Gencode annotation GENCODE CONSORTIUM.



Loci annotated in the 44 ENCODE regions

Experimental validations of the manual annotations
The annotations produced by the Havana team at Sanger are being verified experimentally through RT-PCRs and RACEs (University of Geneva)
- 5' RACEs to obtain full-length mRNA(s)
- RT-PCRs to check 360 junctions
- Bidirectional RACEs to obtain full-length mRNAs
- Experimental validation of the single-exon annotations
Diagram labels: Initial annotation; Experimental validations; Updated annotation; New set of confirmed genes

Experimental validations of the manual annotations
5’ RACEs to extend Known and Novel protein genes
- / 426 loci provided positive RACEs for at least one primer (50%)
- About 10% of the successful RACEs extend the loci in 5’ (and some provide new exon junctions)
(some RACE products are still being analysed)

Experimental validations of the manual annotations
RT-PCRs: VEGA Novel_transcript and Putative
- The Novel_transcript loci have a higher success rate than the Putative loci (in accordance with their definition)
- When more than one junction was submitted for the same transcript, all the junctions gave the same result in 2/3 of the cases (mostly all junctions negative)

Experimental validations of the manual annotations
RT-PCRs on non-canonical splice sites
43 non-canonical splice sites (neither GT-AG nor GC-AG) were detected in the 13 training ENCODE regions
32 could be tested by RT-PCR (others: exons too short for primer picking):
- 1 was confirmed: it is actually a canonical U12 intron (AT-AC)
- 6 provided canonical junctions (already existing in other annotated splice forms)
- 25 were negative
=> None of the non-canonical splice sites could be validated experimentally
(83 other splice sites are being checked in the 31 other regions)
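The splice-site classes named on this slide (GT-AG and GC-AG as canonical, AT-AC as the U12-type case) can be illustrated with a minimal Python sketch. The function name, the returned labels, and the assumption that the intron sequence is given 5' to 3' on the transcribed strand are illustrative choices, not part of the Havana-Gencode pipeline.

```python
# Hypothetical helper: classify an intron by its terminal dinucleotides.
# Assumes the intron sequence is given 5'->3' on the transcribed strand.
CANONICAL = {
    ("GT", "AG"): "canonical GT-AG",
    ("GC", "AG"): "canonical GC-AG",
}
U12_AT_AC = ("AT", "AC")


def classify_intron(intron_seq: str) -> str:
    """Return a label for the donor/acceptor dinucleotide pair."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to read both dinucleotides")
    pair = (seq[:2], seq[-2:])  # (donor, acceptor)
    if pair in CANONICAL:
        return CANONICAL[pair]
    if pair == U12_AT_AC:
        return "U12-type AT-AC"
    return "non-canonical"
```

Introns labelled "non-canonical" by such a check are the ones that would be candidates for RT-PCR testing.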

Gene predictions outside of Havana-Gencode annotations
In 13 ENCODE regions, 1255 predicted introns (by one or more of the 9 methods) are not annotated in VEGA:
- (30%) extend VEGA objects (1)
- (42%) are in introns of VEGA objects (2)
- 11 (1%) link exons from distinct VEGA objects (3)
- (27%) are completely outside of VEGA annotations (4)
The 9 methods: 6 computational gene prediction programs (geneid, genscan, SGP, twinscan, fgenesh, exonify); 3 EST-based methods (acembly, Ecgene, Ensembl EST)
[Figure: Havana-Gencode annotation compared with predictions, illustrating cases (1) to (4)]
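The four categories above can be sketched as a simple classifier. The slide does not define the categories precisely, so the rules below are assumptions: an intron "extends" an object if at least one of its flanking positions falls in an exon of a single annotated object, "links" objects if its two flanks fall in exons of different objects, is "in an intron" if it touches no exon but lies within an object's span, and is "outside" otherwise. All names and the half-open coordinate convention are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def loci_with_exon_at(pos: int, loci: Dict[str, List[Interval]]) -> Set[str]:
    """Names of annotated objects that have an exon covering position pos."""
    return {
        name
        for name, exons in loci.items()
        if any(start <= pos < end for start, end in exons)
    }


def classify_predicted_intron(intron: Interval,
                              loci: Dict[str, List[Interval]]) -> str:
    """Assign an unannotated predicted intron to one of four assumed cases."""
    start, end = intron
    left = loci_with_exon_at(start - 1, loci)  # last base of upstream exon
    right = loci_with_exon_at(end, loci)       # first base of downstream exon
    if left and right and not (left & right):
        return "links_distinct_objects"        # case (3)
    if left or right:
        return "extends_object"                # case (1)
    for exons in loci.values():
        span_start = min(s for s, _ in exons)
        span_end = max(e for _, e in exons)
        if span_start <= start and end <= span_end:
            return "in_intron_of_object"       # case (2)
    return "outside_annotation"                # case (4)
```

For example, with `loci = {"A": [(100, 200), (300, 400)], "B": [(1000, 1100)]}`, the intron `(400, 1000)` joins the last exon of A to the exon of B and falls in case (3).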

Gene predictions outside of Havana-Gencode annotations
RT-PCRs on exon junctions
1255 predicted introns tested:
=> 16 RT-PCRs confirmed the predicted junction, 9 provided another junction (excluding pseudogenes)
=> Only 3 are intergenic (new loci?) --> being extended by RACE
*1: RT-PCR successful; 2: RT-PCR provided a product with a wrong exon junction

Gene predictions outside of Havana-Gencode annotations: last 31 regions
- About 3500 introns predicted by standard programs from UCSC tracks are outside of the Havana-Gencode annotation (about 900 intergenic). Very few of those are likely to be true positives (=> need to prioritize)
- Additionally, the EGASP predictions add about 7000 other new introns (about 1000 intergenic)

Description of the annotations: gene density

Description of the annotations: alternative splicing
Average: 4.2 transcripts per locus, 6.7 exons per transcript

Description of the annotations: coding loci
424 coding loci in 44 ENCODE regions
On average, 44.6% of the transcripts are annotated as coding

Description of the annotations: lengths of exons, introns, CDS, UTRs…

Comparison between Havana-Gencode annotation and other sets: ENSEMBL, REFSEQ, MGC, CCDS

Gene level
=> Most of the genes from the other sets are contained in the Havana-Gencode annotation (less so for ENSEMBL)

Transcript level
=> Very few full transcripts are exactly identical
The coding part of the transcripts is better conserved

Relaxed criterion: allows transcripts from the other sets to be included in Havana-Gencode transcripts
[Figure: a Havana-Gencode transcript compared with transcripts from other sets, either supporting or not supporting the annotated transcript]
=> Few transcripts are exactly identical, but most of the transcripts from other sets are included in transcripts from Havana-Gencode, especially MGC genes (transcripts not as extended as the annotation)
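One way to read the relaxed criterion is as an inclusion test between exon structures. The sketch below is an assumption about what "included" means, since the slide only shows it as a figure: a transcript from another set counts as included when its introns form a contiguous run of the annotated transcript's introns and its two ends do not extend beyond the flanking annotated exons. Function names and the half-open coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def introns(exons: List[Interval]) -> List[Interval]:
    """Introns implied by a coordinate-sorted list of exons."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


def is_included(other: List[Interval], annotated: List[Interval]) -> bool:
    """True if `other` is contained in `annotated` under the assumed rule."""
    if not other:
        return False
    other, annotated = sorted(other), sorted(annotated)
    other_introns, ann_introns = introns(other), introns(annotated)
    if not other_introns:
        # Single-exon transcript: it must lie inside one annotated exon.
        start, end = other[0]
        return any(s <= start and end <= e for s, e in annotated)
    n = len(other_introns)
    for k in range(len(ann_introns) - n + 1):
        if ann_introns[k:k + n] == other_introns:
            # Introns k..k+n-1 lie between annotated exons k..k+n.
            first_exon, last_exon = annotated[k], annotated[k + n]
            return other[0][0] >= first_exon[0] and other[-1][1] <= last_exon[1]
    return False
```

Under this reading, a shorter transcript with the same junctions supports the annotated transcript, while one that skips an exon or extends past a terminal exon does not.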

Transcript level: relaxed criterion

Exon/intron level
=> More introns than exons are shared between sets: this could be explained by the fact that most differences are in UTRs (last exons)
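A small sketch shows why introns can agree more often than exons when two transcripts differ only at their ends. The coordinates are invented for illustration, and the helper names are hypothetical.

```python
from typing import List, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def intron_set(exons: List[Interval]) -> Set[Interval]:
    """Set of introns implied by a transcript's exons."""
    exons = sorted(exons)
    return {(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)}


def shared_counts(t1: List[Interval], t2: List[Interval]) -> Tuple[int, int]:
    """Return (number of identical exons, number of identical introns)."""
    shared_exons = len(set(t1) & set(t2))
    shared_introns = len(intron_set(t1) & intron_set(t2))
    return shared_exons, shared_introns


# Two invented transcripts with the same junctions but different UTR ends.
annotated = [(100, 200), (300, 400), (500, 650)]
other_set = [(120, 200), (300, 400), (500, 600)]
exons_in_common, introns_in_common = shared_counts(annotated, other_set)
```

Here both introns are identical, but only the internal exon matches exactly, because the first and last exons differ at their outer boundaries.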

Conclusions
Nucleotide level
- The Havana-Gencode annotation is richer than the other data sets
- REFSEQ, MGC and CCDS are almost completely contained in Havana-Gencode, especially CCDS (smaller set)
- ENSEMBL contains more “false positives” (bigger set)
- Transcripts from the other sets are less extended than transcripts from Havana-Gencode annotations, especially MGC (very few transcripts are completely identical)
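Containment at the nucleotide level can be measured as the fraction of exonic bases in one set that are also exonic in the reference annotation. The following is a minimal sketch under that assumption; the slides do not state the exact measure used, and the function names and half-open coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals into a sorted, disjoint list."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_bases(query: List[Interval], reference: List[Interval]) -> int:
    """Number of query bases that are also covered by the reference."""
    q, r = merge(query), merge(reference)
    i = j = total = 0
    while i < len(q) and j < len(r):
        lo, hi = max(q[i][0], r[j][0]), min(q[i][1], r[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if q[i][1] < r[j][1]:
            i += 1
        else:
            j += 1
    return total


def fraction_contained(query: List[Interval], reference: List[Interval]) -> float:
    """Fraction of query bases contained in the reference (0.0 if empty)."""
    size = sum(e - s for s, e in merge(query))
    return covered_bases(query, reference) / size if size else 0.0
```

A value close to 1.0 for a set such as CCDS against Havana-Gencode would correspond to "almost completely contained".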

Exon pair level (exon-intron-exon)