Advisory Board Meeting, Caltech 2004 Sequence curation in WormBase Sanger Institute, Hinxton & GSC, St Louis.

Genome sequence
- Length 100,277,975 bp
- 13,894 bp increase
- All chromosomes contiguous
- 0 gaps (no change)
- 8 N’s (-84 since WS97)
- Split into 17 superlinks
- 3268 genome sequences
- Regularly submitted to EMBL/GenBank/DDBJ

(re)annotation of a genome
[Images: "Painting by numbers"; "Painting the Forth Rail Bridge"]

(re)annotating a genome
- We adopted the ‘paint by numbers’ approach involving automated appraisal of all gene models on a regular basis.
- Generation of lists of genes/features to be checked by human annotators.
[Diagram labels: Appraise; Curate]

What have we done?
- Repeat sequences
- Database identifiers and connections
- Gene prediction
  - Update of progress
  - Alternate isoforms in WormBase
  - The ‘gene model’
  - Tracking history of predictions

Repeats
- Use RepeatMasker rather than hmmfs
- Updated library using RECON (Bao & Eddy) and REPBASE
- Some work to remove erroneous repeats which are multi-gene families (Bao/Chen/Durbin)
- This process is incomplete
- WS121 contains 1690 overlaps between CDS sequences and RepeatMasker motifs. Approximately one third of these are matches to low-complexity sequences.
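A count of CDS/repeat overlaps like the one quoted for WS121 can be sketched as a simple interval comparison. This is only an illustration, not the WormBase pipeline: the function name, the 1-based inclusive coordinates and the example intervals are all assumptions.

```python
def count_overlaps(cds_intervals, repeat_intervals):
    """Count (CDS, repeat) pairs whose coordinates overlap.

    Intervals are hypothetical (start, end) tuples, 1-based and inclusive,
    all on the same sequence.
    """
    repeats = sorted(repeat_intervals)
    count = 0
    for cds_start, cds_end in cds_intervals:
        for rep_start, rep_end in repeats:
            if rep_start > cds_end:
                break  # repeats are sorted by start, so none further can overlap
            if rep_end >= cds_start:
                count += 1
    return count


# Made-up coordinates, purely for illustration.
cds = [(100, 200), (500, 600)]
repeats = [(150, 160), (190, 210), (300, 400), (600, 650)]
print(count_overlaps(cds, repeats))
```

Each overlap found this way would still need manual appraisal, since some hits are low-complexity matches rather than true repeats.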

Connections to other databases
- WormBase maintains nucleotide and peptide records for the CDS structures which are propagated to the public sequence databases.
- Regular (i.e. within the time-frame of a WormBase release) submissions to GenBank/EMBL
  - Maintain proteinID connections to CDS features
  - Maintain gi number connections to CDS features
- Public protein databases (UNIPROT = SwissProt, TrEMBL) inherit peptide sequences from the nucleotide entries.
  - Maintain UNIPROT connections to wormpep entries

wormpep: a C.elegans protein dataset
- Snapshot of our ‘best guess’ CDS predictions.
- Somewhat quirky: there is an entry for each CDS, but accessions are tied to the peptide sequence (i.e. multiple entries can have the same accession).
- The ‘blessed’ view of the C.elegans proteome which WormBase releases to the world.
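The idea that accessions follow the peptide sequence rather than the CDS can be sketched as below. The "ACC" accession format, the CDS names and the peptides are hypothetical placeholders, not real wormpep identifiers or data.

```python
def assign_accessions(entries):
    """Give every CDS an accession; identical peptides share one accession.

    `entries` is a list of (cds_name, peptide) pairs. The "ACC" prefix is a
    made-up format used only for this illustration.
    """
    accession_for_peptide = {}
    accession_for_cds = {}
    for cds_name, peptide in entries:
        if peptide not in accession_for_peptide:
            next_number = len(accession_for_peptide) + 1
            accession_for_peptide[peptide] = f"ACC{next_number:05d}"
        accession_for_cds[cds_name] = accession_for_peptide[peptide]
    return accession_for_cds


# Two hypothetical CDS entries encode the same peptide, so they share an accession.
entries = [("X1.1", "MSLST"), ("X2.4", "MKKLA"), ("X9.2", "MSLST")]
print(assign_accessions(entries))
```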

wormpep: raw statistics
- Gene number increased by 346 (1.8%) (0.6%)
- Isoform number increased by 471 (24.9%) (97.2%)

Isoforms in WormBase
- Nomenclature of alternate isoforms in WormBase is the standard name (clone.number) with a suffix [a-z]
- Isoforms are only created when there is direct transcript evidence for the difference
- Add an Isoform tag to encapsulate the evidence for the isoform
  - Aids quick identification of alternate isoforms
  - Standardize the mark-up within WormBase.
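The clone.number[a-z] naming scheme lends itself to a small parser. The sketch below is an assumption about how such names could be split; the regular expression and function name are illustrative and do not come from WormBase code.

```python
import re

# clone name, a dot, a gene number, then an optional single-letter isoform suffix
CDS_NAME = re.compile(r"^(?P<clone>[A-Za-z0-9_]+)\.(?P<number>\d+)(?P<isoform>[a-z])?$")


def parse_cds_name(name):
    """Split a CDS name into (clone, number, isoform); isoform is None if absent."""
    match = CDS_NAME.match(name)
    if match is None:
        raise ValueError(f"not a clone.number[a-z] name: {name!r}")
    return match.group("clone"), int(match.group("number")), match.group("isoform")


print(parse_cds_name("K10H10.3a"))
print(parse_cds_name("AH6.1"))
```

Grouping parsed names by (clone, number) would then collect all isoforms of one locus.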

Confirming genes
[Figure labels: Predicted; Partially confirmed; Confirmed]

[Figure annotations: ad-hoc update of Matching_cDNA tags; Orfeome tags used in confirmation; automated assignment of Matching-cDNA tags; Orfeome; last gap closed]

Gene model validation
- 1,101 more confirmed genes (29.3% increase)
- 3,179 more partially confirmed genes (35.8% increase)

wormpep cityscape:
[Figure labels: wormpep ‘live’; ABM 2002; ABM 2003; ABM 2001; ABM 2004; elegans-briggsae comparison]

wormpep cityscape:
- 3245 of 3378 entries extant (96.1%)
- 140 CDS modifications/additions per release
[Figure labels: CDS change; C.elegans-C.briggsae comparison]

Toward a better model of gene predictions
- Over the past year WormBase has extended how we model gene predictions.
- This is part of the new ‘Gene model’
- Incorporates additional sequence features to the exon/intron structures previously modelled.

A simple ‘Gene model’
- Historically, WormBase has had a simple concept of a gene.
  - initiation methionine
  - coding exons
  - termination codon
- This has meant that we ‘lose’ a lot of data pertaining to gene structures and control regions
[Figure labels: ATG; EXON n; STOP]

A better ‘Gene model’
- New CDS class for the ATG -> STOP coding sequence
- New Coding_transcript objects to represent full-length structure
- SL1 & SL2 feature objects for the 5’ end
- polyA_signal_sequence & polyA_site feature objects for the 3’ end
[Figure labels: TSL acceptor; 5’-UTR; ATG; EXON n; STOP; 3’-UTR; polyA-signal and site]
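One way to picture how these classes relate is a small data-structure sketch. This is not the ACEDB schema: the Python class layout, field names, forward-strand assumption and all coordinates are hypothetical; only the object and feature names (CDS, Coding_transcript, SL1, polyA_site) come from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive, forward strand assumed


@dataclass
class Feature:
    method: str  # e.g. "SL1", "SL2", "polyA_signal_sequence", "polyA_site"
    start: int
    end: int


@dataclass
class CDS:
    name: str
    exons: List[Exon]  # coding exons only, ATG -> STOP


@dataclass
class CodingTranscript:
    name: str
    cds: CDS
    start: int  # transcript 5' end, at or upstream of the ATG
    end: int    # transcript 3' end, at or downstream of the STOP
    features: List[Feature] = field(default_factory=list)

    def utr_lengths(self) -> Tuple[int, int]:
        """Return (5' UTR, 3' UTR) lengths in genomic bases, ignoring UTR introns."""
        cds_start = self.cds.exons[0][0]
        cds_end = self.cds.exons[-1][1]
        return cds_start - self.start, self.end - cds_end


# Hypothetical coordinates, for illustration only.
cds = CDS("example_cds", [(100, 200), (300, 400)])
transcript = CodingTranscript(
    "example_transcript", cds, start=80, end=450,
    features=[Feature("SL1", 79, 80), Feature("polyA_site", 450, 451)],
)
print(transcript.utr_lengths())
```

Keeping the CDS as its own object means the coding structure can stay unchanged while transcript ends and features are revised as new evidence arrives.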

Toward a better ‘Gene Model’
[Figure labels: Trans-splice leader acceptor site; 5’-UTR; 3’-UTR; polyA_signal and polyA site]

Non-coding transcripts

Standard CDS prediction
>K10H10.3a dhs-8: Alcohol/other dehydrogenases, short chain type
MSLSTTNTVSPEDDINRCEETIRKGMTMGRSIKGSGGYILISSDPLFGLL
FLQLSKTKMSQANRVRLFHSRTHAFEVLKGIDVSGKTFAITGTTSGIGIN
TAEVLALAGAHVVLMNRNLHESENQKKRILEKKPSAKVDIIFCDLSDLKT
VRKAGEDYLAKNWPIHGLILNAGVFRPAAAKTKDGFESHYGVNVVAHFTL
LRILLPVVRRSAPSRVVFLSSTLSSKHGFKKSMGISEKMSILQGEDSSAS
TLQMYGASKMADMLIAFKLHRDEYKNGISTYSVHPGSGVRTDIFRNSLLG
KFIGFVTTPFTKNASQGAATTVYCATHPEVEKISGKYWESCWDNDKIDKK
TARDEELQEALWKKLEQIDDRINGSIDTF

Non-productive transcript (?NMD target)
>K10H10.3b dhs-8:
MSLSTTNTVSPEDDINRCEETIRKGMTMGRSIKGSGSKRH*RLRKNICNH
RNNIWNWNKHSRSSGLSRSTCRFDEQEPARVGKSEEENFGEEAECESRYY
FL*PQ*LEDSTQSGRGLFG*KLANPRTNPECRSIPPSSCKNQRWIRIPLW
CQCSCSFYTSSHPSPGCSSLRSIQSSLPLLNFEFQTRFQKIYGDF*KDEY
SPRRRFVGVHTSDVRSFKDGRYVDCIQIAQR*V*KWN*HIFRAPWKWSQN
*YFQKLPTWKIHRIRHHTIHKER*SRSSNYSILCYSPRS*KNLWKILGVL
LG*R*N**EDS*R*GVTGSVVEEIGAN**SNQWIN*YLLX

103 miRNA genes
707 tRNA genes
76 snRNA …
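The contrast between the two K10H10.3 translations can be checked mechanically by looking for internal stop symbols. The sketch below is a hypothetical helper, not a WormBase tool; it uses only the first 50 residues of each translation shown on the slide.

```python
def first_premature_stop(translation):
    """Return the 0-based index of the first internal '*', or None if there is none.

    Trailing '*' characters are treated as the normal terminator and ignored.
    """
    body = translation.rstrip("*")
    index = body.find("*")
    return None if index == -1 else index


# First 50 residues of each translation, copied from the slide.
k10h10_3a = "MSLSTTNTVSPEDDINRCEETIRKGMTMGRSIKGSGGYILISSDPLFGLL"
k10h10_3b = "MSLSTTNTVSPEDDINRCEETIRKGMTMGRSIKGSGSKRH*RLRKNICNH"

for name, peptide in [("K10H10.3a", k10h10_3a), ("K10H10.3b", k10h10_3b)]:
    stop = first_premature_stop(peptide)
    status = "no internal stop" if stop is None else f"internal stop at index {stop}"
    print(name, status)
```

An internal stop on its own does not prove a transcript is an NMD target, which is presumably why the slide marks that label with a question mark.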

Non-coding transcripts
rpl-3 locus
Contains unproductively spliced mRNA (from Mitrovich et al. (2000))

Tracking gene prediction changes
- A mechanism for leaving better documentation about how, when & why gene predictions have been modified.
- Each incarnation of a gene prediction persists as a CDS object in the database.
- These can be shown in ACEDB to aid curators and as a track on the website for all users

Curation histories
1. Identify a problem gene prediction (AH6.1)
2. Make a history object for the current prediction (AH6.1:wp100)
3. Make the new prediction (AH6.1)
4. Leave a remark relating to the modification
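The curation-history workflow can be sketched with a plain dictionary standing in for the database. Everything here is an assumption made for illustration: the record layout, the function name, the exon coordinates, and the reading of the ":wp100" suffix as the wormpep release in which the old prediction was last current.

```python
import copy


def modify_prediction(db, name, wormpep_release, new_exons, remark):
    """Archive the current prediction as a history object, then replace it.

    `db` maps CDS names to {"exons": [...], "remarks": [...]} records; this
    layout is hypothetical and only mimics the steps described on the slide.
    """
    current = db[name]
    history_name = f"{name}:wp{wormpep_release}"
    db[history_name] = copy.deepcopy(current)         # keep the old structure
    db[name] = {
        "exons": list(new_exons),                     # the new prediction
        "remarks": current["remarks"] + [remark],     # document the change
    }
    return history_name


# Hypothetical coordinates and remark.
db = {"AH6.1": {"exons": [(1, 100)], "remarks": []}}
history = modify_prediction(db, "AH6.1", 100, [(1, 80), (120, 200)],
                            "exon 1 truncated based on transcript evidence")
print(history, sorted(db))
```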

Wormpep histories
- How many history predictions have we made?
  - 5,255 history objects in wormpep121
- How many should we make (based on wormpep8)?
  - 15,747 potential history objects since wormpep8
- How are we going to resurrect the missing ones?

Resurrecting historical predictions
- Problem: Making a CDS from a known peptide sequence
- Retrieve old predictions:
  - From archived WormBase releases
  - From archived GenBank/EMBL entries
- Generate predictions again:
  - By script using a tool such as Genewise
  - By hand using TBlastN similarity data
- There are caveats to this process: some predictions cannot be modelled in the current sequence consensus because of corrections made to it (e.g. deleted bases).

Gene family analysis
- Construct gene families using blast or Pfam
- Make multiple-sequence alignments (clustal)
- Appraise manually
- Make gene prediction changes as necessary
- What do you get?
  - Better gene predictions
  - Better curation (Brief_identification/Gene names CGC)

Multiple-gene family analysis

Modification of a gene prediction based on multiple-sequence alignment
Protein insertion highlighted in pink
The prediction needs a truncated exon 3 (note this is supported by WABA briggsae-elegans comparison).

Modification of a gene prediction based on multiple-sequence alignment
Protein insertion highlighted in pink
The prediction needs an in-frame intron to be added (note this is supported by WABA briggsae-elegans comparison).

Plans for
- Gene prediction
  - Use of C.briggsae similarity data
  - Use of blastx (DNA v Protein) data from other model species
  - C.elegans Pfam/family analysis
- Sequence features
  - More trans-splice leaders (TEC-RED)
  - Catch-up of sequence features through Caltech literature searches
- Functional annotation
  - Celera PANTHER annotation system
  - COGs analysis
  - C.elegans Pfam/family analysis

Plans for
- More nematode genome sequence is becoming available
- Short term
  - Brugia malayi
- Medium term
  - several Caenorhabditis species due to be sequenced

Hasta luego (see you later)

Generation of Coding_transcripts
- The ‘full-length’ transcript objects in WormBase are made using the transcript data (BLAT mappings) and the existing exon/intron structures.
- UTR regions are inferred from transcript data and added to the CDS regions to form the longer transcript prediction.
- This is separate from the unspliced UTR predictions within WormBase (themselves a replacement for the Worm Transcriptome Project (WTP) spans).
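How UTRs might be added to a CDS from transcript alignments can be sketched as follows. This is a deliberately simplified illustration and not the WormBase build script: it assumes forward-strand, 1-based inclusive coordinates, represents each BLAT mapping as a list of aligned blocks, and only extends the first and last coding exons (spliced UTR exons are ignored). The function name and all coordinates are hypothetical.

```python
def build_coding_transcript(cds_exons, alignments):
    """Extend the outer CDS exons using overlapping transcript alignment blocks.

    cds_exons:  list of (start, end) coding exons, sorted along the sequence.
    alignments: list of alignments, each a sorted list of (start, end) blocks.
    Returns the exon list of the longer transcript prediction.
    """
    first_start, first_end = cds_exons[0]
    last_start, last_end = cds_exons[-1]
    tx_start, tx_end = first_start, last_end

    for blocks in alignments:
        head_start, head_end = blocks[0]
        tail_start, tail_end = blocks[-1]
        # 5' UTR: the alignment's first block must overlap the first coding exon.
        if head_start <= first_end and head_end >= first_start:
            tx_start = min(tx_start, head_start)
        # 3' UTR: the alignment's last block must overlap the last coding exon.
        if tail_start <= last_end and tail_end >= last_start:
            tx_end = max(tx_end, tail_end)

    exons = [list(exon) for exon in cds_exons]
    exons[0][0] = tx_start
    exons[-1][1] = tx_end
    return [tuple(exon) for exon in exons]


# Hypothetical CDS and alignments.
cds_exons = [(100, 200), (300, 400)]
alignments = [
    [(80, 200), (300, 350)],   # supports a 5' UTR starting at 80
    [(320, 430)],              # supports a 3' UTR ending at 430
    [(1000, 1100)],            # unrelated alignment, ignored
]
print(build_coding_transcript(cds_exons, alignments))
```

Requiring an overlap with the outer coding exon is a conservative choice: it stops a nearby but unrelated transcript from stretching the prediction.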

Transposons
- There are still many reverse transcriptases in the wormpep dataset.
- We plan to remove them from wormpep by changing the tag markup in WormBase so that they are not included in the wormpep files.
- Overhaul of the transposon nomenclature and inclusion in WormBase

wormpep modifications