05/04/2005 Informatics Meeting C. elegans – “Back To The Future”. Paul Davis (aka Huey)
Overview
- C. elegans gene prediction
  - Past
    - Overview of genome project
    - 1st pass annotation
  - Present
    - Script-based list generation
    - Gene refinement (transcript based)
    - Small peptides
    - C. briggsae comparison
    - Large external gene family analysis
  - Future
    - Un-annotated overlap between gene predictors
    - Gene family curation
    - Multiple species comparison
- Summary
Past
- Genome project
  - C. elegans: 1st multicellular organism genome published
  - 97 Mb of sequence made up of
    - 2527 cosmids
    - 257 YACs
    - 113 fosmids
    - 44 PCR products
  - 5 gaps closed by
  - Annotated to find 19,099 protein-coding genes.
- 1st pass annotation: Genefinder (Phil Green, WASHU).
- Curators appraised gene predictions on a clone-by-clone basis as each clone was finished.
Genome View
[Figure: genome view with gene models labelled Predicted, Partially Confirmed and Confirmed. Colour corresponds to strand, not confidence.]
Stats for WS141
- Currently 22,436 gene predictions.
- 11,169 “un-touched”
  - + good 1st pass annotation.
  - + re-annotated >50%.
  - 2,576 Confirmed status.
    - Unlikely to change.
  - 5,624 Partially Confirmed.
    - Potentially modified.
  - 2,969 Predicted.
    - Potentially removed or altered.
Present: (re)annotation of a genome
[Images: "Painting by numbers" and "Painting the Forth Rail Bridge"]
(Re)annotating a genome
- We adopted a ‘paint by numbers’ approach involving automated appraisal of all gene models on a regular basis.
- Generation of lists of genes/features to be checked by human annotators.
[Diagram labels: Appraise; Curate; Process and report; Release and synchronise]
Script-Based Targeted Annotation
- Create a number of curation lists
  - Confirmed introns not in gene models
  - ESTs/mRNAs in introns
  - Overlapping gene predictions
  - Predictions overlapping known repeats
  - Short genes <150 bp
  - Short introns <40 bp
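Three of the curation lists above (short genes, short introns, overlapping predictions) can be illustrated with a minimal Python sketch. This is not the WormBase build code: the `GeneModel` class and `build_curation_lists` function are hypothetical names, "gene length" is assumed to mean summed exon length, and overlaps are assumed to matter only between predictions on the same strand. Only the 150 bp and 40 bp thresholds come from the slide.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

MIN_CODING_BP = 150   # "Short genes <150 bp" threshold from the slide
MIN_INTRON_BP = 40    # "Short introns <40 bp" threshold from the slide


@dataclass
class GeneModel:
    """Hypothetical minimal gene model; not a WormBase/AceDB class."""
    name: str
    strand: str
    exons: List[Tuple[int, int]]  # sorted, 1-based inclusive (start, end)

    def coding_length(self) -> int:
        # Assumption: gene length is the summed exon length.
        return sum(end - start + 1 for start, end in self.exons)

    def intron_lengths(self) -> List[int]:
        # An intron is the gap between consecutive exons.
        return [nxt[0] - prev[1] - 1
                for prev, nxt in zip(self.exons, self.exons[1:])]

    def span(self) -> Tuple[int, int]:
        return self.exons[0][0], self.exons[-1][1]


def build_curation_lists(models: List[GeneModel]) -> Dict[str, list]:
    """Return names of models that a human curator should look at."""
    lists: Dict[str, list] = {
        "short_genes": [], "short_introns": [], "overlapping": []}
    for m in models:
        if m.coding_length() < MIN_CODING_BP:
            lists["short_genes"].append(m.name)
        if any(n < MIN_INTRON_BP for n in m.intron_lengths()):
            lists["short_introns"].append(m.name)
    # Sort by span so the inner scan can stop at the first model
    # that starts beyond the end of the current one.
    ordered = sorted(models, key=lambda m: m.span())
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.span()[0] > a.span()[1]:
                break
            if a.strand == b.strand:  # assumption: same-strand only
                lists["overlapping"].append((a.name, b.name))
    return lists
```

Each list only nominates candidates; as the small-peptide example later in the talk shows, a flagged model is not necessarily wrong.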
Transcript-Based Refinements
- Automatic import of transcript data during our build cycle.
  - C. elegans mRNAs/cDNAs.
  - C. elegans ESTs.
  - Nematode ESTs.
- Processed and aligned to genome.
- This produces data for our curation lists.
Gene Refinement: Fmap View
- EST data points to a 5’ extension and a 3’ extension.
- Identified due to confirmed introns not in a gene model.
[Figure labels: 5’; 3’; Transcript Data; Refined Prediction; Old Prediction; Confirmed Intron]
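The "confirmed introns not in a gene model" check amounts to a set difference between transcript-supported introns and the introns implied by current gene models. The sketch below assumes introns and exons are given as 1-based inclusive coordinate pairs on a single strand; the function names and the example coordinates are illustrative only and are not taken from WormBase data.

```python
from typing import Dict, Iterable, List, Set, Tuple

Interval = Tuple[int, int]  # 1-based inclusive (start, end)


def introns_of(exons: List[Interval]) -> Set[Interval]:
    """Introns implied by a sorted exon list of one gene model."""
    return {(prev[1] + 1, nxt[0] - 1)
            for prev, nxt in zip(exons, exons[1:])}


def unexplained_introns(confirmed: Iterable[Interval],
                        gene_models: Dict[str, List[Interval]]
                        ) -> List[Interval]:
    """Confirmed introns that no current gene model contains."""
    annotated: Set[Interval] = set()
    for exons in gene_models.values():
        annotated |= introns_of(exons)
    return sorted(set(confirmed) - annotated)


# Made-up example: one two-exon prediction, three transcript introns.
old_prediction = {"old_prediction": [(500, 600), (700, 800)]}
transcript_introns = [(601, 699), (401, 499), (801, 899)]
candidates = unexplained_introns(transcript_introns, old_prediction)
```

In this made-up case the two leftover introns flank the old prediction, which is the pattern that would suggest a 5’ and a 3’ extension to a curator.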
Not all <150bp Predictions are Bad?
- Small peptides can be real.
  - H12D21.1 is a 34 aa peptide that appeared on a curation list.
  - Investigated.
  - Prediction had peptide similarity to 2 other C. elegans proteins.
  - Multiple sequence alignment proved interesting.
H12D Homols: Fmap View & M.S.A.
[Figure labels: SignalP cleavage site; Gene Prediction; Protein Homology Blocks]
New Family Members
- Used tBlastn to identify other regions in the genome.
- Annotated these ORFs to give 9 additional family members.
- These have been called nspa-1 to nspa-12.
  - Nematode Specific Peptide family A
[Figure labels: Pseudogene; Expanded Family]
C. briggsae Comparison
- C. elegans vs C. briggsae
  - C. briggsae hybrid gene set analysis (Avril Coghlan).
  - Detailed in PLoS Biol: “The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics.”
- WormBase has worked to incorporate the ~1300 new genes reported.
Coding Gene Predictions Over Time
[Chart: release ticks from WS21 to WS124. Labels: Release Number; Predictions Including Isoforms; Coding Genes; briggsae hybrid gene set. Annotation: increase in CDS due to 1st round of new genes identified by comparison with briggsae.]
Large Family Analysis
- Worm community members.
- Multiple sequence alignments of some large families.
  - 7 TM receptor families
    - 1700 family members
  - Subfamilies have been worked on by multiple worm community members.
    - Hugh Robertson (University of Illinois)
    - Jim Thomas (University of Washington, Seattle)
    - Jack Chen (CSH Laboratories)
Future
- Identify new avenues for gene refinement and identification.
- Looking at predictor overlaps
  - (Genefinder/Twinscan overlaps) vs (WormBase gene set)
- In-house protein family analysis
- Multiple species comparisons
Predictor Overlaps
[Figure labels: Genefinder Prediction; Twinscan Prediction; New CDS Prediction; Strong Splicing; Good briggsae DNA::DNA Alignment]
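The predictor-overlap idea, regions where Genefinder and Twinscan agree but the curated gene set has nothing, can be sketched as an interval comparison. The code below is an assumption-laden illustration: predictions are reduced to single-strand (start, end) spans, agreement is taken to mean any coordinate overlap, and `unannotated_predictor_overlaps` is a hypothetical name. Supporting evidence such as splice-site strength or briggsae alignment is not modelled.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based inclusive (start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def unannotated_predictor_overlaps(genefinder: List[Interval],
                                   twinscan: List[Interval],
                                   curated: List[Interval]
                                   ) -> List[Interval]:
    """Regions shared by both predictors that touch no curated gene."""
    candidates = []
    for gf in genefinder:
        for ts in twinscan:
            if not overlaps(gf, ts):
                continue
            shared = (max(gf[0], ts[0]), min(gf[1], ts[1]))
            if not any(overlaps(shared, gene) for gene in curated):
                candidates.append(shared)
    return sorted(candidates)


# Made-up coordinates for illustration.
regions = unannotated_predictor_overlaps(
    genefinder=[(100, 400), (1000, 1500)],
    twinscan=[(150, 450), (2000, 2500)],
    curated=[(900, 1600)],
)
```

The nested loop is quadratic, which is fine for a sketch; a sorted sweep or interval tree would be the usual choice at genome scale.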
Gene Family Analysis
- Protein alignments of multiple family members can refine gene predictions.
  - ClustalW
  - blast
  - Main problems identified
    - Incorrect splicing
    - Truncations
    - Invalid extensions
Example of a Small Family Analysis
- Problematic alignment
  - F56H6.9 appears to have 18 aa of extra sequence.
  - E03H4.4 seems to be lacking sequence.
Fmap View of F56H6.9
Example of Problem
- Problematic alignment
- Alignment following annotation.
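One way to spot the kind of problem shown in this example (one family member with extra sequence, another lacking sequence) is to scan an existing multiple alignment for long runs of columns where a single member disagrees with all the others. The sketch below assumes the alignment has already been produced (for instance by ClustalW) and is supplied as equal-length gapped strings. The function names, the `min_run` threshold and the toy sequences are all made up for illustration; they are not the real F56H6.9 or E03H4.4 sequences.

```python
from typing import Dict, List, Tuple


def longest_run(flags: List[bool]) -> int:
    """Length of the longest consecutive run of True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def flag_outliers(alignment: Dict[str, str],
                  min_run: int = 5) -> Dict[str, List[Tuple[str, int]]]:
    """Report members with sequence that only they have or only they lack."""
    length = len(next(iter(alignment.values())))
    if any(len(row) != length for row in alignment.values()):
        raise ValueError("all aligned rows must have the same length")
    report: Dict[str, List[Tuple[str, int]]] = {}
    for name, row in alignment.items():
        others = [r for n, r in alignment.items() if n != name]
        # Columns where this member alone has a residue.
        extra = [row[i] != "-" and all(o[i] == "-" for o in others)
                 for i in range(length)]
        # Columns where this member alone has a gap.
        missing = [row[i] == "-" and all(o[i] != "-" for o in others)
                   for i in range(length)]
        issues = []
        if longest_run(extra) >= min_run:
            issues.append(("extra", longest_run(extra)))
        if longest_run(missing) >= min_run:
            issues.append(("missing", longest_run(missing)))
        if issues:
            report[name] = issues
    return report


# Toy alignment with invented names and sequences.
toy_alignment = {
    "member_a": "MKLVSSTPAQNEGHTRPLQWERTY",
    "member_b": "MKLV--------GHTRPLQWERTY",
    "member_c": "MKLV--------GHT------RTY",
}
toy_report = flag_outliers(toy_alignment)
```

A flagged "extra" run hints at an invalid extension or a mis-spliced exon, and a "missing" run at a truncation, but each case still needs a curator to inspect the gene model against the evidence.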
Multiple Species Comparison
- More nematode genomes are on their way
  - C. remanei
    - shotgun in progress
    - Blast server available
  - PB2801
    - shotgun in progress
  - C. japonica
    - shotgun in progress
elegans/briggsae/remanei Alignment for nspa-like peptides.
Summary
- Gene (re)annotation has been running for more than 7 years.
  - New genes are still being discovered.
- Primarily transcript driven.
- More work on protein families.
- New strategies for gene prediction and refinement.
  - Using multiple gene predictors
  - Multi-species comparison
Acknowledgements
- Genome Sequencing Center, St. Louis
  - Sequencing and finishing teams etc.
  - WormBase team: Tamberlyn Bieri, Darin Blasiar, Phil Ozersky, John Spieth
- Wellcome Trust Sanger Institute
  - Sequencing and finishing teams etc.
  - WormBase team: Richard Durbin, Anthony Rogers, Dan Lawson, Mary Ann Tuli
  - AceDB: Ed Griffiths, Roy Storey