The Consensus CoDing Sequence (CCDS) Database


The Consensus CoDing Sequence (CCDS) Database Kim D. Pruitt Mouse Genome Annotation Summit Meeting March 12-13, 2008

Why is the CCDS project needed?
The Problem: Annotation of the genome sequence is essential, but beware of different interpretations!
The availability of the human and mouse genome sequences has had a significant impact on disease and health research.
Most scientists rely on annotation information when designing, interpreting, and evaluating research results.
Inconsistencies in annotation results among the main public resources hamper use of this important data.
Researchers may not realize that a different annotation result is available elsewhere, possibly leading to erroneous or incomplete interpretations.

CCDS - A collaborative project
Initiated by the main public annotation/browser groups to address concerns by the scientific community about inconsistencies in the human and mouse genome annotation.
Built by consensus among the collaborating members, which include:
  European Bioinformatics Institute (EBI)
  National Center for Biotechnology Information (NCBI)
  University of California, Santa Cruz (UCSC)
  Wellcome Trust Sanger Institute (WTSI)

What is the CCDS project?
Project Goals:
  Identify a core set of protein-coding genes that are consistently annotated and of high quality
  Support convergence toward a standard set of gene annotations
Scope: Human and mouse protein-coding regions
Update frequency: Variable; depends on frequency of genome annotation updates

Process flow – calculating updates
[Flow diagram; labels recovered from the slide]
Annotation inputs: NCBI (computational), RefSeq (manual), Ensembl (computational), Havana (manual), Ensembl merged annotation
Steps: Compare CDS (Annotation + Sequence), then QA
Comparison results: Identical, Similar, Novel
Outcomes: Existing CCDS (Retain or Lost), New match (New CCDS ID), Out of scope
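The comparison step in this flow can be sketched in a few lines. This is only an illustration, not the pipeline used by the collaboration: the `CDS` tuple layout, the function name `classify_cds`, and the placeholder IDs are assumptions made for the example. It treats two annotations as a match only when the CDS coordinates are identical, then sorts the results into the "Retain", "Lost", and "New match" outcomes named on the slide.

```python
from typing import Dict, List, Tuple

# Assumed representation: (chromosome, strand, tuple of (start, end) CDS exon intervals).
CDS = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def classify_cds(ncbi: List[CDS], ensembl: List[CDS],
                 existing: Dict[CDS, str]) -> Dict[str, List[CDS]]:
    """Classify CDS annotations by exact coordinate agreement between two sources.

    `existing` maps a previously accepted CDS to its (hypothetical) CCDS ID.
    """
    identical = set(ncbi) & set(ensembl)  # same chromosome, strand, and exon coordinates
    outcome: Dict[str, List[CDS]] = {"retain": [], "lost": [], "new_match": []}
    for cds in sorted(identical):
        # An identical pair either confirms an existing CCDS or is a new candidate.
        outcome["retain" if cds in existing else "new_match"].append(cds)
    for cds in sorted(existing):
        # An existing CCDS that no longer matches in both sources is lost.
        if cds not in identical:
            outcome["lost"].append(cds)
    return outcome


if __name__ == "__main__":
    a = ("chr5", "+", ((100, 200), (300, 400)))
    b = ("chr5", "+", ((100, 200), (300, 450)))
    print(classify_cds([a, b], [a], {b: "CCDS_EXAMPLE"}))
```

In the real process, new matches would still have to pass the QA tests before receiving a CCDS ID; the sketch stops at the coordinate comparison.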

Assessing Quality
CCDS status is conservatively applied:
  Annotated CDS coordinates are identical
  Annotation is of high quality and passes QA tests or curator review
  Existing CCDS proteins can be flagged for review by the collaborating members
  Updates and removals are by consensus agreement
Quality assessment tests include:
  Consensus splice sites ("GY..AG" or "AT..AC")
  Valid start and stop codons with no internal stops
  NMD (nonsense-mediated decay)
  Low complexity
  Repeat-containing
  Insufficient protein homology
  Genome conservation
  Putative pseudogene
QA test results are reviewed by curators
Over-rides are set to retain supported CDSs
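Two of the sequence-level tests listed here (start/stop codon validity and consensus splice sites) are simple enough to sketch. The function names below are hypothetical, and the sketch assumes a spliced, coding-strand CDS that starts with ATG and uses the standard stop codons; the actual QA pipeline is not described on the slide and may handle exceptions differently.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def check_cds_sequence(cds: str) -> list:
    """Return the list of failed checks for a spliced CDS (empty list means it passes)."""
    cds = cds.upper()
    failures = []
    if len(cds) % 3 != 0:
        failures.append("length not a multiple of 3")
    # Split into complete codons only; a trailing partial codon is ignored.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[0] != "ATG":
        failures.append("missing ATG start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        failures.append("missing stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        failures.append("internal stop codon")
    return failures


def is_consensus_intron(intron: str) -> bool:
    """True if the intron ends match "GY..AG" (GT or GC donor) or "AT..AC"."""
    intron = intron.upper()
    donor, acceptor = intron[:2], intron[-2:]
    return (donor in {"GT", "GC"} and acceptor == "AG") or \
           (donor == "AT" and acceptor == "AC")


if __name__ == "__main__":
    print(check_cds_sequence("ATGAAATAA"))
    print(is_consensus_intron("GTAAGTCCCAG"))
```

A CDS that fails one of these checks would go to curator review rather than being rejected outright, in line with the over-ride mechanism mentioned on the slide.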

CCDS Counts

Date     Build    CCDS IDs   GeneIDs
Mar-05   Hs35.1   14,795     13,142
Feb-07   Hs36.2   18,290     16,008
Oct-06   Mm36.1   13,374     13,014
Nov-07   Mm37.1   17,707     16,893

Step                       Source    Genes   Proteins
Annotation                 NCBI      24765   26851
Annotation                 Ensembl   27209   39941
Matching CDS                         18185   19048
QA & curation rejections             1331    1350
Accepted rejections                  1292    1341
Final CCDS ID                        16893   17707

Curation – how are updates curated and coordinated?
Any member of the collaboration can flag a CCDS for review:
  Update the CDS definition (alter N-terminus extent, internal splice site)
  Withdraw the CCDS ID (insufficiently supported, or non-protein coding)
NCBI provides a collaboration web site to coordinate this review
All collaborators must agree with a change to finalize a decision
Withdrawal of a CCDS may happen between genome annotation updates
An update to a CCDS is indicated by:
  Status change: a status of ‘pending update’ is reported when there is collaborative agreement that a change is needed
  Version change: the CCDS version number is incremented once the change is reflected in public annotation. This only occurs after a genome annotation update and CCDS analysis has taken place.
CCDS curation is fully integrated with RefSeq curation
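The status and version rules above amount to a small state machine, sketched here under stated assumptions. The class `CcdsRecord`, its method names, the default status string "public", and the ID "CCDS0000" are all invented for illustration; only the ‘pending update’ status, the all-collaborators rule, and the version increment after a genome annotation update come from the slide.

```python
from dataclasses import dataclass, field

# The four collaborating groups named in the presentation.
COLLABORATORS = frozenset({"EBI", "NCBI", "UCSC", "WTSI"})


@dataclass
class CcdsRecord:
    ccds_id: str                      # e.g. a placeholder such as "CCDS0000"
    version: int = 1
    status: str = "public"            # assumed name for the normal state
    approvals: set = field(default_factory=set)

    @property
    def accession(self) -> str:
        """ID plus version, in the ID.version form used for CCDS identifiers."""
        return f"{self.ccds_id}.{self.version}"

    def approve_update(self, member: str) -> None:
        """Record one collaborator's agreement; all must agree before the status changes."""
        if member not in COLLABORATORS:
            raise ValueError(f"unknown collaborator: {member}")
        self.approvals.add(member)
        if self.approvals == COLLABORATORS:
            self.status = "pending update"

    def apply_annotation_update(self) -> None:
        """Increment the version only once the agreed change reaches public annotation."""
        if self.status == "pending update":
            self.version += 1
            self.status = "public"
            self.approvals.clear()


if __name__ == "__main__":
    record = CcdsRecord("CCDS0000")
    for member in sorted(COLLABORATORS):
        record.approve_update(member)
    record.apply_annotation_update()
    print(record.accession)
```

Separating agreement (status change) from publication (version change) mirrors the slide's point that a version number only moves after a genome annotation update.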

CCDS update & curation stats

Curation-based changes:
Mouse: ~5200 curated CCDS genes

name    action     status    count
human   update     pending   366
human   update     agreed    557
human   withdraw   pending   189
human   withdraw   agreed    519
mouse   update     pending   185
mouse   update     agreed    57
mouse   withdraw   pending   16
mouse   withdraw   agreed    8

Subtotals shown on the slide (human update, human withdraw, mouse update, mouse withdraw): 923, 709, 242, 24

Annotation pipeline-based changes:

name    build   status                               count
human   35.1    Withdrawn, inconsistent annotation   133
human   36.2    Withdrawn, inconsistent annotation   29
mouse   36.1    Withdrawn, inconsistent annotation   29
mouse   37.1    Withdrawn, inconsistent annotation   4

Curation considerations
  Alignments
  Track low quality sequences (‘kill list’)
  Protein conservation
  Publications
  Personal communications
  QA measures

Access – How do I know if an annotation has a CCDS ID?
Genome browser displays: NCBI, UCSC
Gene reports: Ensembl, Vega
Other: RefSeq annotation (NCBI), CCDS web site, FTP
http://www.ncbi.nlm.nih.gov/CCDS/

NCBI Map Viewer (chr.5) Link to CCDS Browser

UCSC Browser chr5:30270000-30650000

UCSC Browser – Tyms gene CCDS Browser

Access of CCDS data at NCBI
CCDS Database & Browser interface:
  Project Description
  Query support
  Reports attributes of the CCDS: Location data, Sequence members, Status
FTP reports

CCDS Browser History Find all CCDSs for the Gene Entrez Gene View CCDS Details

CCDS Browser Mouse-over highlights codon Click to highlight codon and corresponding amino acid

Biology is complex – some CCDS curation examples
  1 vs. 2 vs. ‘n’ genes
  Translation start site

1 vs. 2 vs. ‘n’ genes
Curation Considerations:
  Nomenclature
  History (scientific use, publications, etc.)
  Different (but similar) products vs. distinct products
  Shared promoters

carnitine palmitoyltransferase 1b, choline kinase beta

1 vs. 2 vs. ‘n’ genes
Current RefSeq representation of the region:
  two protein-coding loci: Chkb (CCDS27750.1) and Cpt1b (CCDS27749.1)
  one non-coding locus for the non-coding transcript product (a read-through transcript): Chkb-cpt1b (PMID:12761301)

Translation start site
Curation Considerations:
  Publication reports (CDS begins at ‘n’)
  Other cDNA sequencing reveals the ORF can be extended further upstream
Evaluate:
  Genome conservation
  Literature reports for the protein
  Putative Kozak signals
  Presence of in-frame upstream stop codon
  INSDC submissions from an experimental lab source that do have the longer ORF extent annotated
Consult with an expert
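One of the evaluation points, the presence of an in-frame upstream stop codon, can be checked mechanically. The sketch below is an illustration only: the function name `upstream_orf_scan` and the 0-based coordinate convention are assumptions, and it looks only at the transcript sequence, so it says nothing about conservation, Kozak context, or literature support.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def upstream_orf_scan(mrna: str, start: int):
    """Walk upstream of the annotated start codon, one in-frame codon at a time.

    `start` is the 0-based index of the first base of the annotated start codon.
    Returns (has_upstream_stop, farthest_atg): whether an in-frame stop codon
    closes the frame upstream, and the index of the most upstream in-frame ATG
    found before that stop (equal to `start` if there is none).
    """
    mrna = mrna.upper()
    has_upstream_stop = False
    farthest_atg = start
    pos = start - 3
    while pos >= 0:
        codon = mrna[pos:pos + 3]
        if codon in STOP_CODONS:
            has_upstream_stop = True
            break
        if codon == "ATG":
            farthest_atg = pos  # candidate for an N-terminal extension
        pos -= 3
    return has_upstream_stop, farthest_atg


if __name__ == "__main__":
    # Annotated start at index 9; an in-frame ATG at 3 and a stop at 0 lie upstream.
    print(upstream_orf_scan("TAGATGCCCATGAAATAA", 9))
```

If no in-frame stop is found before the 5' end of the transcript, the open reading frame may continue beyond the available sequence, which is one reason additional cDNA evidence is considered.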

Internal CCDS browser (restricted access) Jmjd2d jumonji domain containing 2D (chr 19)

Update is agreed on by all parties Resulting in a 258 aa N-terminal extension

Examples – no CCDS ID EBI+WTSI and NCBI transcript annotation may differ even though the gene includes annotations with CCDS IDs

Examples – no CCDS ID
Reasons:
  not found by one group
  different CDS length
  different splice sites
  different internal exon
  curation removal
[Figure panels compare EBI/WTSI and NCBI annotation tracks.]

Acknowledgements
Donna Maglott, Josh Cherry, Keith Oxenride, Craig Wallin, Andrei Shkeda
RefSeq Curators
NCBI Genome Annotation Group
NCBI Map Viewer Group
Collaborators at Ensembl, UCSC, Vega: Jen Ashurst & Vega curator group, Rachel Harte, Mark Diekhans, Steve Searle

Ensembl – Tyms gene

Vega browser Tyms gene (chromosome 5 30388989-30404404)