Published by John Anderson. Modified over 8 years ago.
1
Advisory Board Meeting, Caltech 2004
Sequence curation in WormBase
Sanger Institute, Hinxton & GSC, St Louis
2
Genome sequence
- Length 100,277,975 bp
- 13,894 bp increase
- All chromosomes contiguous
- 0 gaps (no change)
- 8 N’s (-84 since WS97)
- Split into 17 superlinks
- 3268 genome sequences
- Regularly submitted to EMBL/GenBank/DDBJ
3
(re)annotation of a genome
Image captions: Painting by numbers; Painting the Forth Rail Bridge
4
(re)annotating a genome
- We adopted the ‘paint by numbers’ approach, involving automated appraisal of all gene models on a regular basis.
- Generation of lists of genes/features to be checked by human annotators.
Diagram labels: Appraise, Curate
5
2003-2004: What have we done?
- Repeat sequences
- Database identifiers and connections
- Gene prediction
  - Update of progress
  - Alternate isoforms in WormBase
  - The ‘gene model’
  - Tracking history of predictions
6
Repeats
- Use RepeatMasker rather than hmmfs
- Updated library using RECON (Bao & Eddy) and REPBASE
- Some work to remove erroneous repeats which are multi-gene families (Bao/Chen/Durbin)
- This process is incomplete
- WS121 contains 1690 overlaps between CDS sequences and RepeatMasker motifs. Approximately one third of these are matches to low-complexity sequences.
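The overlap count on this slide can be illustrated with a minimal Python sketch. It is not the WormBase pipeline: the function name `find_overlaps`, the coordinate convention (1-based, inclusive) and the example intervals are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive; an assumed convention


def find_overlaps(cds_exons: List[Interval],
                  repeats: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """Return every (exon, repeat) pair that shares at least one base."""
    return [
        (exon, rep)
        for exon in cds_exons
        for rep in repeats
        if exon[0] <= rep[1] and rep[0] <= exon[1]
    ]


# Hypothetical coordinates, not taken from WS121.
example_exons = [(100, 200), (300, 400)]
example_repeats = [(150, 160), (250, 299), (400, 450)]
example_pairs = find_overlaps(example_exons, example_repeats)
```

The all-against-all comparison is quadratic; for whole-genome data a sorted sweep or an interval tree would be the more usual choice.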
7
Connections to other databases
- WormBase maintains nucleotide and peptide records for the CDS structures, which are propagated to the public sequence databases.
- Regular (i.e. within the time-frame of a WormBase release) submissions to GenBank/EMBL
  - Maintain proteinID connections to CDS features
  - Maintain gi number connections to CDS features
- Public protein databases (UNIPROT = SwissProt, TrEMBL) inherit peptide sequences from the nucleotide entries.
  - Maintain UNIPROT connections to wormpep entries
8
wormpep: a C.elegans protein dataset
- Snapshot of our ‘best guess’ CDS predictions.
- Somewhat quirky: there is an entry for each CDS, but accessions are tied to the peptide sequence (i.e. multiple entries can have the same accession).
- The ‘blessed’ view of the C.elegans proteome which WormBase releases to the world.
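The accession behaviour described above (one entry per CDS, one accession per distinct peptide) can be sketched as follows. The function name, the `CE` prefix with five digits, and the example CDS names and peptides are illustrative assumptions, not the actual wormpep build code.

```python
from typing import Dict


def assign_accessions(cds_to_peptide: Dict[str, str], prefix: str = "CE") -> Dict[str, str]:
    """Give each CDS entry an accession keyed on its peptide sequence.

    Two CDS entries with identical peptides receive the same accession.
    """
    accession_by_peptide: Dict[str, str] = {}
    accession_by_cds: Dict[str, str] = {}
    for cds_name, peptide in cds_to_peptide.items():
        if peptide not in accession_by_peptide:
            # Number accessions in order of first appearance (assumed scheme).
            accession_by_peptide[peptide] = f"{prefix}{len(accession_by_peptide) + 1:05d}"
        accession_by_cds[cds_name] = accession_by_peptide[peptide]
    return accession_by_cds


# Hypothetical CDS names and peptides.
example_accessions = assign_accessions({"X1.1a": "MKT", "X1.1b": "MKV", "Y2.3": "MKT"})
```

Here `X1.1a` and `Y2.3` end up sharing an accession because their peptides are identical, while `X1.1b` gets its own.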
9
wormpep: raw statistics
- Gene number increased by 346 (1.8%); 2002-2003: 112 (0.6%)
- Isoform number increased by 471 (24.9%); 2002-2003: 934 (97.2%)
10
Isoforms in WormBase
- Nomenclature of alternate isoforms in WormBase is the standard name (clone.number) with a suffix [a-z]
- Isoforms are only created when there is direct transcript evidence for the difference
- Add an Isoform tag to encapsulate the evidence for the isoform
  - Aids quick identification of alternate isoforms
  - Standardizes the mark-up within WormBase
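A small sketch of the naming rule above (clone.number plus an optional [a-z] suffix). The regular expression is an assumption about which characters a clone name may contain; the slide does not specify this.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: clone name, a dot, a gene number, then an optional isoform letter.
CDS_NAME = re.compile(r"^(?P<clone>[A-Za-z0-9_]+)\.(?P<number>\d+)(?P<isoform>[a-z])?$")


def parse_cds_name(name: str) -> Optional[Tuple[str, int, Optional[str]]]:
    """Split a CDS name into (clone, number, isoform letter or None)."""
    match = CDS_NAME.match(name)
    if match is None:
        return None
    return match.group("clone"), int(match.group("number")), match.group("isoform")
```

For example, `parse_cds_name("K10H10.3a")` yields the clone `K10H10`, gene number `3` and isoform `a`, while a name without a suffix such as `AH6.1` yields `None` for the isoform.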
11
Confirming genes
Figure categories: Predicted, Partially confirmed, Confirmed
12
Figure annotations: ad-hoc update of Matching_cDNA tags; Orfeome tags used in confirmation; automated assignment of Matching_cDNA tags; Orfeome; last gap closed
13
Gene model validation
- 1,101 more confirmed genes (29.3% increase)
- 3,179 more partially confirmed genes (35.8% increase)
14
wormpep cityscape: wormpep ‘live’
Figure labels: ABM 2001, ABM 2002, ABM 2003, ABM 2004, elegans-briggsae comparison
15
wormpep cityscape: 2003-2004
- 3245 of 3378 entries extant (96.1%)
- 140 CDS modifications/additions per release
Figure labels: CDS change, C.elegans-C.briggsae comparison
16
Toward a better model of gene predictions
- Over the past year WormBase has extended how we model gene predictions.
- This is part of the new ‘Gene model’.
- It adds further sequence features to the exon/intron structures previously modelled.
17
A simple ‘Gene model’
- Historically, WormBase has had a simple concept of a gene.
  - initiation methionine
  - coding exons
  - termination codon
- This has meant that we ‘lose’ a lot of data pertaining to gene structures and control regions
Diagram labels: ATG, STOP, EXON n
18
A better ‘Gene model’
- New CDS class for the ATG -> STOP coding sequence
- New Coding_transcript objects to represent full-length structure
- SL1 & SL2 feature objects for the 5’ end
- polyA_signal_sequence & polyA_site feature objects for the 3’ end
Diagram labels: ATG, STOP, EXON n, 5’-UTR, 3’-UTR, polyA signal and site, TSL acceptor
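The relationship between these classes can be sketched as a small data structure. This is only an illustration in Python, not the ACEDB schema: the class and field names (`CodingTranscript`, `five_prime_utr`, and so on), the forward-strand assumption and the coordinates are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end), 1-based inclusive, forward strand assumed


@dataclass
class Feature:
    method: str  # e.g. "SL1", "SL2", "polyA_signal_sequence", "polyA_site"
    span: Span


@dataclass
class CDS:
    name: str
    exons: List[Span]  # ATG -> STOP coding exons, sorted by position


@dataclass
class CodingTranscript:
    name: str
    cds: CDS
    five_prime_utr: Optional[Span] = None
    three_prime_utr: Optional[Span] = None
    features: List[Feature] = field(default_factory=list)

    def span(self) -> Span:
        """Full-length extent: the CDS plus whichever UTRs are known."""
        start = self.five_prime_utr[0] if self.five_prime_utr else self.cds.exons[0][0]
        end = self.three_prime_utr[1] if self.three_prime_utr else self.cds.exons[-1][1]
        return (start, end)


# Hypothetical gene with a trans-splice acceptor and a polyA site.
example_cds = CDS("X1.1", [(100, 200), (300, 400)])
example_transcript = CodingTranscript(
    "X1.1",
    example_cds,
    five_prime_utr=(80, 99),
    three_prime_utr=(401, 450),
    features=[Feature("SL1", (80, 80)), Feature("polyA_site", (450, 450))],
)
```

Keeping the CDS as its own object means the coding structure can stay unchanged while UTRs and end features are added as evidence arrives.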
19
Toward a better ‘Gene Model’
Figure labels: trans-splice leader acceptor site, 5’-UTR, 3’-UTR, polyA_signal and polyA site
20
Non-coding transcripts

Standard CDS prediction
>K10H10.3a dhs-8: Alcohol/other dehydrogenases, short chain type
MSLSTTNTVSPEDDINRCEETIRKGMTMGRSIKGSGGYILISSDPLFGLL
FLQLSKTKMSQANRVRLFHSRTHAFEVLKGIDVSGKTFAITGTTSGIGIN
TAEVLALAGAHVVLMNRNLHESENQKKRILEKKPSAKVDIIFCDLSDLKT
VRKAGEDYLAKNWPIHGLILNAGVFRPAAAKTKDGFESHYGVNVVAHFTL
LRILLPVVRRSAPSRVVFLSSTLSSKHGFKKSMGISEKMSILQGEDSSAS
TLQMYGASKMADMLIAFKLHRDEYKNGISTYSVHPGSGVRTDIFRNSLLG
KFIGFVTTPFTKNASQGAATTVYCATHPEVEKISGKYWESCWDNDKIDKK
TARDEELQEALWKKLEQIDDRINGSIDTF

Non-productive transcript (?NMD target)
>K10H10.3b dhs-8:
MSLSTTNTVSPEDDINRCEETIRKGMTMGRSIKGSGSKRH*RLRKNICNH
RNNIWNWNKHSRSSGLSRSTCRFDEQEPARVGKSEEENFGEEAECESRYY
FL*PQ*LEDSTQSGRGLFG*KLANPRTNPECRSIPPSSCKNQRWIRIPLW
CQCSCSFYTSSHPSPGCSSLRSIQSSLPLLNFEFQTRFQKIYGDF*KDEY
SPRRRFVGVHTSDVRSFKDGRYVDCIQIAQR*V*KWN*HIFRAPWKWSQN
*YFQKLPTWKIHRIRHHTIHKER*SRSSNYSILCYSPRS*KNLWKILGVL
LG*R*N**EDS*R*GVTGSVVEEIGAN**SNQWIN*YLLX

103 miRNA genes
707 tRNA genes
76 snRNA …
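One simple way to tell the two translations above apart is to look for stop codons (`*`) before the end of the peptide. The sketch below assumes translations written with `*` for stops, as on the slide; the function name is hypothetical, and the test strings are the first 50 residues of each sequence shown.

```python
from typing import List


def internal_stop_positions(peptide: str) -> List[int]:
    """Return 1-based positions of stop codons that are not at the very end."""
    body = peptide.rstrip("*")  # a terminal stop is expected and ignored
    return [i + 1 for i, residue in enumerate(body) if residue == "*"]


# First 50 residues of each translation from the slide.
productive_start = "MSLSTTNTVSPEDDINRCEETIRKGMTMGRSIKGSGGYILISSDPLFGLL"
non_productive_start = "MSLSTTNTVSPEDDINRCEETIRKGMTMGRSIKGSGSKRH*RLRKNICNH"
```

A transcript whose translation contains internal stops is a candidate non-productive transcript; whether it is actually an NMD target needs further evidence, as the question mark on the slide indicates.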
21
Non-coding transcripts
- rpl-3 locus
- Contains unproductively spliced mRNA (from Mitrovich et al. (2000))
22
Tracking gene prediction changes
- A mechanism for leaving better documentation about how, when & why gene predictions have been modified.
- Each incarnation of a gene prediction persists as a CDS object in the database.
- These can be shown in ACEDB to aid curators, and as a track on the website for all users.
23
Curation histories
- Identify a problem gene prediction (AH6.1)
24
Curation histories
- Identify a problem gene prediction
- Make a history object for the current prediction (AH6.1:wp100)
25
Curation histories
- Identify a problem gene prediction
- Make a history object for the current prediction (AH6.1:wp100)
- Make the new prediction (AH6.1)
26
Curation histories
- Identify a problem gene prediction
- Make a history object for the current prediction (AH6.1:wp100)
- Make the new prediction (AH6.1)
- Leave a remark relating to the modification
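The four steps can be sketched against a toy in-memory store. Everything here is an assumption for illustration: the dictionary layout, the function name `revise_prediction`, the reading of `wp100` as a wormpep release number, and the example coordinates. It is not how ACEDB stores these objects.

```python
from typing import Dict, List, Tuple

Exons = List[Tuple[int, int]]


def revise_prediction(db: Dict[str, dict], name: str, new_exons: Exons,
                      wormpep_release: int, remark: str) -> str:
    """Archive the current prediction as a history object, then replace it."""
    current = db[name]
    history_name = f"{name}:wp{wormpep_release}"  # e.g. "AH6.1:wp100"
    # Step 2: keep the old structure under the history name.
    db[history_name] = {
        "exons": list(current["exons"]),
        "remarks": list(current["remarks"]),
        "history": True,
    }
    # Steps 3 and 4: install the new structure and record why it changed.
    db[name] = {
        "exons": list(new_exons),
        "remarks": current["remarks"] + [remark],
        "history": False,
    }
    return history_name


# Hypothetical coordinates for the AH6.1 example.
example_db = {"AH6.1": {"exons": [(100, 200), (300, 420)], "remarks": [], "history": False}}
example_history = revise_prediction(
    example_db, "AH6.1", [(100, 200), (300, 400)], 100, "Truncated exon 2 based on cDNA evidence"
)
```

After the call, the store holds both `AH6.1` (the new structure plus the remark) and `AH6.1:wp100` (the superseded structure).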
27
Wormpep histories
- How many history predictions have we made? 5,255 history objects in wormpep121
- How many should we make (based on wormpep8)? 15,747 potential history objects since wormpep8
- How are we going to resurrect the missing ones?
28
Resurrecting historical predictions
- Problem: making a CDS from a known peptide sequence
- Retrieve old predictions:
  - From archived WormBase releases
  - From archived GenBank/EMBL entries
- Generate predictions again:
  - By script using a tool such as Genewise
  - By hand using TBlastN similarity data
- Caveat: some predictions cannot be modelled in the current sequence consensus because of sequence corrections (e.g. deleted bases).
29
Gene family analysis
- Construct gene families using blast or Pfam
- Make multiple-sequence alignments (clustal)
- Appraise manually
- Make gene prediction changes as necessary
- What do you get?
  - Better gene predictions
  - Better curation (Brief_identification/Gene names CGC)
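The first step, grouping genes into families from pairwise similarity hits, can be sketched with single-linkage clustering over a union-find structure. This is a swapped-in illustration, not the method used for the slide: the function name, the gene names and the assumption that hits arrive as pre-filtered (gene, gene) pairs are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def cluster_families(genes: Iterable[str],
                     hits: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group genes so that any two linked by a chain of hits share a family.

    Every gene named in `hits` must also appear in `genes`.
    """
    gene_list = list(genes)
    parent: Dict[str, str] = {g: g for g in gene_list}

    def find(g: str) -> str:
        # Follow parent links to the family representative, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    families: Dict[str, List[str]] = {}
    for g in gene_list:
        families.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in families.values())


# Hypothetical genes and significant blast hits.
example_families = cluster_families(
    ["geneA", "geneB", "geneC", "geneD", "geneE", "geneF"],
    [("geneA", "geneB"), ("geneB", "geneC"), ("geneD", "geneE")],
)
```

Single linkage tends to merge families through multi-domain proteins, which is one reason the manual appraisal step on the slide matters.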
30
Multiple-gene family analysis
32
Modification of a gene prediction based on multiple-sequence alignment
- Protein insertion highlighted in pink
- The prediction needs a truncated exon 3 (note this is supported by the WABA briggsae-elegans comparison).
34
Modification of a gene prediction based on multiple-sequence alignment
- Protein insertion highlighted in pink
- The prediction needs an in-frame intron to be added (note this is supported by the WABA briggsae-elegans comparison).
35
Plans for 2004-2005
- Gene prediction
  - Use of C.briggsae similarity data
  - Use of blastx (DNA v Protein) data from other model species
  - C.elegans Pfam/family analysis
- Sequence features
  - More trans-splice leaders (TEC-RED)
  - Catch-up of sequence features through Caltech literature searches
- Functional annotation
  - Celera PANTHER annotation system
  - COGs analysis
  - C.elegans Pfam/family analysis
36
Plans for 2004-2005
- More nematode genome sequence is becoming available
- Short term
  - Brugia malayi
- Medium term
  - Several Caenorhabditis species due to be sequenced
37
hasta luego
39
Generation of Coding_transcripts
- The ‘full-length’ transcript objects in WormBase are made using the transcript data (BLAT mappings) and the existing exon/intron structures.
- UTR regions are inferred from transcript data and added to the CDS regions to form the longer transcript prediction.
- This is separate from the unspliced UTR predictions within WormBase (themselves a replacement for the Worm Transcriptome Project (WTP) spans).
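The idea of adding UTRs to a CDS from transcript mappings can be sketched as below. This is a deliberately simplified illustration, not the WormBase build script: it assumes forward-strand coordinates, treats each mapping as a single aligned block, and only extends the first and last coding exons (it does not add separate UTR exons). The function name and coordinates are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), 1-based inclusive, forward strand assumed


def extend_with_utrs(cds_exons: List[Span], mapped_blocks: List[Span]) -> List[Span]:
    """Extend the terminal CDS exons outward to the limits of overlapping blocks."""
    first_start, first_end = cds_exons[0]
    last_start, last_end = cds_exons[-1]
    new_start, new_end = first_start, last_end

    for block_start, block_end in mapped_blocks:
        # A block overlapping the first exon may reveal 5' UTR sequence.
        if block_start <= first_end and first_start <= block_end:
            new_start = min(new_start, block_start)
        # A block overlapping the last exon may reveal 3' UTR sequence.
        if block_start <= last_end and last_start <= block_end:
            new_end = max(new_end, block_end)

    if len(cds_exons) == 1:
        return [(new_start, new_end)]
    return [(new_start, first_end)] + list(cds_exons[1:-1]) + [(last_start, new_end)]


# Hypothetical CDS and transcript blocks; the last block does not touch the gene.
example_transcript_exons = extend_with_utrs(
    [(100, 200), (300, 400)],
    [(80, 200), (300, 450), (500, 600)],
)
```

Blocks that do not overlap a terminal coding exon are ignored here, which keeps unrelated neighbouring transcripts from stretching the prediction.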
40
Transposons
- There are still many reverse transcriptases in the wormpep dataset.
- We plan to remove them from wormpep by changing their tag mark-up in WormBase so that they are excluded from the wormpep files.
- Overhaul of the transposon nomenclature and inclusion in WormBase
41
wormpep modifications