05/04/2005 Informatics Meeting C. elegans – “Back To The Future”. Paul Davis (aka Huey)


1 05/04/2005 Informatics Meeting C. elegans – “Back To The Future”. Paul Davis (aka Huey)

2 Overview
- C. elegans Gene Prediction
  - Past.
    - Overview of genome project.
    - 1st pass annotation
  - Present.
    - Script based list generation.
    - Gene Refinement (Transcript Based).
    - Small peptides.
    - C. briggsae comparison.
    - Large external gene family analysis.
  - Future.
    - Un-annotated overlap between gene predictors
    - Gene Family curation.
    - Multiple species comparison.
- Summary.

3 Past
- Genome Project
  - C. elegans 1st multicellular organism genome published 1998.
  - 97 Mb of sequence made up of
    - 2527 cosmids,
    - 257 YACs,
    - 113 fosmids,
    - 44 PCR products.
  - 5 gaps closed by 2002.
  - Annotated to find 19,099 protein coding genes.
- 1st pass annotation: Genefinder (Phil Green, WASHU).
- Curators appraised gene predictions on a clone by clone basis as they were finished.

4 Genome View
[Figure legend: Predicted, Partially Confirmed, Confirmed. Colour corresponds to strand, not confidence.]

5 Stats for WS141
- Currently 22,436 gene predictions.
- 11,169 “un-touched”
  - + good 1st pass annotation.
  - + re-annotated >50%.
  - 2,576 Confirmed status.
    - Unlikely to change.
  - 5,624 Partially Confirmed.
    - Potentially modified.
  - 2,969 Predicted.
    - Potentially removed or altered.
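The counts on this slide can be cross-checked with simple arithmetic. The sketch below assumes that the three status classes are a breakdown of the 11,169 "un-touched" predictions, and that "re-annotated >50%" refers to the remainder of the 22,436 total. Both are interpretations; the slide does not state them explicitly.

```python
# Counts as given on the slide (WS141).
total_predictions = 22_436
untouched = 11_169
status_counts = {
    "Confirmed": 2_576,
    "Partially Confirmed": 5_624,
    "Predicted": 2_969,
}

# Assumption: the three status classes partition the un-touched set.
status_total = sum(status_counts.values())

# Assumption: everything that is not un-touched has been re-annotated.
reannotated = total_predictions - untouched
reannotated_fraction = reannotated / total_predictions

print(status_total == untouched)
print(reannotated, f"{reannotated_fraction:.1%}")
```

Under these assumptions the status counts sum exactly to the un-touched total, and the re-annotated share comes out just above one half, which is consistent with the ">50%" on the slide.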

6 Present: (re)annotation of a genome
[Images: Painting by numbers; Painting the Forth Rail Bridge]

7 (re)annotating a genome
- We adopted a ‘paint by numbers’ approach involving automated appraisal of all gene models on a regular basis.
- Generation of lists of genes/features to be checked by human annotators.
[Diagram labels: Appraise; Curate; Process and report; Release and synchronise]

8 Script Based Targeted Annotation
- Create a number of curation lists
  - Confirmed introns not in gene models
  - ESTs/mRNAs in introns.
  - Overlapping gene predictions.
  - Predictions overlapping known repeats.
  - Short genes <150 bp
  - Short introns <40 bp
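Three of the curation lists above (short genes, short introns, overlapping predictions) can be sketched as simple filters over gene models. This is an illustrative sketch only, not the WormBase scripts. The `GeneModel` class, the function names, and the example coordinates are hypothetical. It also assumes that "short gene" means summed exon length and that overlaps are only reported for predictions on the same strand.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GeneModel:
    """Hypothetical minimal gene model: exons are 1-based inclusive, sorted by start."""
    name: str
    strand: str
    exons: List[Tuple[int, int]]

    def coding_length(self) -> int:
        return sum(end - start + 1 for start, end in self.exons)

    def introns(self) -> List[Tuple[int, int]]:
        # An intron is the gap between two consecutive exons.
        return [(a_end + 1, b_start - 1)
                for (_, a_end), (b_start, _) in zip(self.exons, self.exons[1:])]

    def span(self) -> Tuple[int, int]:
        return self.exons[0][0], self.exons[-1][1]


def short_genes(models, min_bp=150):
    """Names of models whose summed exon length is below min_bp."""
    return [m.name for m in models if m.coding_length() < min_bp]


def short_introns(models, min_bp=40):
    """(name, start, end) for every intron shorter than min_bp."""
    return [(m.name, s, e) for m in models
            for s, e in m.introns() if e - s + 1 < min_bp]


def overlapping_predictions(models):
    """Pairs of same-strand models whose genomic spans overlap."""
    ordered = sorted(models, key=lambda m: m.span())
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.span()[0] > a.span()[1]:
                break  # sorted by start, so nothing later can overlap a
            if a.strand == b.strand:
                pairs.append((a.name, b.name))
    return pairs


# Hypothetical example data.
models = [
    GeneModel("geneA", "+", [(100, 180), (210, 260)]),
    GeneModel("geneB", "+", [(250, 600)]),
    GeneModel("geneC", "-", [(1000, 1300), (1400, 1700)]),
]
short_gene_list = short_genes(models)
short_intron_list = short_introns(models)
overlap_list = overlapping_predictions(models)
```

Here `geneA` lands on all three lists: its exons sum to 132 bp, its single intron is 29 bp long, and its span overlaps `geneB` on the same strand. As slide 11 notes, appearing on a list is a prompt for human review, not a verdict.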

9 Transcript Based Refinements
- Automatic import of transcript data during our build cycle.
  - C. elegans mRNAs/cDNAs.
  - C. elegans ESTs.
  - Nematode ESTs.
- Processed and aligned to genome.
- This produces data for our curation lists

10 Gene Refinement: Fmap View
- EST data points to 5’ extension and 3’ extension.
- Identified due to confirmed introns not in a gene model
[Figure labels: 5’, 3’, Transcript Data, Refined Prediction, Old prediction, Confirmed intron]
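The "confirmed introns not in a gene model" check can be sketched as a set difference, followed by a rough classification of where each unmatched intron falls relative to the current prediction. This is a minimal sketch under stated assumptions: the function names and coordinates are hypothetical, introns are compared by exact coordinates, and all features are taken to lie on the gene's strand.

```python
from typing import List, Set, Tuple

Intron = Tuple[int, int]  # 1-based inclusive (start, end)


def unmatched_introns(confirmed: List[Intron], model: Set[Intron]) -> List[Intron]:
    """Transcript-confirmed introns that the gene model does not contain."""
    return sorted(i for i in confirmed if i not in model)


def classify(intron: Intron, gene_start: int, gene_end: int, strand: str) -> str:
    """Say whether an unmatched intron suggests a 5' or 3' extension."""
    start, end = intron
    if end < gene_start:
        # Lower coordinates are upstream on '+', downstream on '-'.
        return "5' extension" if strand == "+" else "3' extension"
    if start > gene_end:
        return "3' extension" if strand == "+" else "5' extension"
    return "internal change"


# Hypothetical '+' strand gene spanning 1000..2000 with two modelled introns.
model_introns = {(1200, 1300), (1500, 1600)}
confirmed_introns = [(800, 900), (1200, 1300), (1500, 1600), (2100, 2200)]

missing = unmatched_introns(confirmed_introns, model_introns)
suggestions = [(i, classify(i, 1000, 2000, "+")) for i in missing]
```

In this example the two confirmed introns outside the modelled span are reported as a 5’ and a 3’ extension candidate, which mirrors the situation shown on the slide.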

11 Not all <150bp Predictions are Bad?
- Small peptides can be real.
  - H12D21.1 is a 34 aa peptide that appeared on curation list.
  - Investigated.
  - Prediction had peptide similarity to 2 other elegans proteins.
  - Multi sequence alignment proved interesting.

12 H12D21.1 + Homols: Fmap View & M.S.A.
[Figure labels: SignalP cleavage site, Gene Prediction, Protein Homology Blocks]

13 New Family Members
- Used tBlastn to identify other regions in genome.
- Annotated these ORFs to give
- 9 additional family members
- These have been called nspa-1 to nspa-12
  - Nematode Specific Peptide family A
[Figure labels: Pseudogene, Expanded Family]

14 C. briggsae Comparison
- C. elegans vs C. briggsae
  - C. briggsae hybrid gene set analysis (Avril Coghlan).
  - Detailed in PLoS Biol 2003 1:166-192 “The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics.”
- WormBase has worked to incorporate the ~1300 new genes reported.

15 Coding Gene Predictions Over Time.
Increase in CDS due to 1st round of new genes identified by comparison with briggsae.
[Chart: x-axis Release Number (WS21 to WS124); y-axis Predictions (17,500 to 22,500); series: Including Isoforms, Coding Genes; annotation: briggsae hybrid gene set]

16 Large family analysis
- Worm Community Members.
- Multi Sequence Alignments of some large Families.
  - 7 TM receptor families
  - 1700 family members
  - Sub families have been worked on by multiple worm community members.
    - Hugh Robertson (University of Illinois)
    - Jim Thomas (University of Washington Seattle)
    - Jack Chen (CSH Laboratories)

17 Future
- Identify new avenues for gene refinement and identification.
- Looking at predictor overlaps
  - (Genefinder/Twinscan overlaps) vs (WormBase Gene set)
- In house protein family analysis
- Multiple species comparisons

18 Predictor Overlaps.
[Figure labels: Genefinder Prediction, Twinscan Prediction, New CDS Prediction, Strong Splicing, Good briggsae DNA::DNA Alignment]
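The comparison "(Genefinder/Twinscan overlaps) vs (WormBase Gene set)" can be sketched as an interval problem: keep regions where both predictors agree and no curated gene already exists. This is an illustrative sketch only. The function names and coordinates are hypothetical, all intervals are assumed to be on one strand of one sequence, and the slide's further evidence (splice strength, briggsae DNA alignment) is not modelled.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based inclusive (start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def unannotated_consensus(genefinder: List[Interval],
                          twinscan: List[Interval],
                          curated: List[Interval]) -> List[Interval]:
    """Regions predicted by both tools that no curated gene overlaps."""
    candidates = []
    for g in genefinder:
        for t in twinscan:
            if overlaps(g, t):
                shared = (max(g[0], t[0]), min(g[1], t[1]))
                if not any(overlaps(shared, c) for c in curated):
                    candidates.append(shared)
    return sorted(candidates)


# Hypothetical example data.
genefinder_preds = [(100, 500), (2000, 2600)]
twinscan_preds = [(150, 480), (2100, 2700), (5000, 5200)]
curated_genes = [(90, 520)]

new_cds_candidates = unannotated_consensus(
    genefinder_preds, twinscan_preds, curated_genes
)
```

Only the second region survives: the first consensus is already covered by a curated gene, and the third Twinscan prediction has no Genefinder support. The nested loop is quadratic, which is fine for a sketch; a sorted sweep would be a natural replacement at genome scale.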

19 Gene Family Analysis
- Protein alignments of multiple family members can refine gene predictions.
  - ClustalW
  - blast
  - Main problems identified
    - Incorrect splicing
    - Truncations
    - Invalid extensions

20 Example of a Small Family Analysis.
- Problematic alignment
  - F56H6.9 appears to have 18 aa extra sequence.
  - E03H4.4 seems to be lacking sequence.

21 Fmap View of F56H6.9

22 Example of Problem.
- Problematic alignment
- Alignment following annotation.

23 Multiple Species Comparison.
- More nematode genomes are on their way
  - C. remanei
    - shotgun in progress
    - Blast server available http://genome.wustl.edu/projects/cremanei/
  - PB2801
    - shotgun in progress
  - C. japonica
    - shotgun in progress

24 elegans/briggsae/remanei Alignment for nspa-like peptides.

25 Summary
- Gene (Re)annotation >7 years.
  - New genes are still being discovered.
- Primarily Transcript driven.
- More work on protein families
- New strategies for gene prediction and refinement.
  - Using multiple gene predictors
  - Multi species comparison

26 Acknowledgements
- Genome Sequencing Center St. Louis
  - Sequencing and finishing teams etc.
  - WormBase team: Tamberlyn Bieri, Darin Blasiar, Phil Ozersky, John Spieth
- Wellcome Trust Sanger Institute
  - Sequencing and finishing teams etc.
  - WormBase team: Richard Durbin, Anthony Rogers, Dan Lawson, Mary Ann Tuli
  - AceDB: Ed Griffiths, Roy Storey

