05/04/2005 Informatics Meeting C. elegans – “Back To The Future”. Paul Davis (aka Huey)

Overview
- C. elegans Gene Prediction
  - Past
    - Overview of genome project.
    - 1st pass annotation
  - Present
    - Script based list generation.
    - Gene Refinement (Transcript Based).
    - Small peptides.
    - C. briggsae comparison.
    - Large external gene family analysis.
  - Future
    - Un-annotated overlap between gene predictors
    - Gene Family curation.
    - Multiple species comparison.
- Summary

Past
- Genome Project
  - C. elegans 1st multicellular organism genome published
  - 97 Mb of sequence made up of
    - 2527 cosmids,
    - 257 YACs,
    - 113 fosmids,
    - 44 PCR products.
  - 5 gaps closed by
  - Annotated to find 19,099 protein coding genes.
- 1st pass annotation: Genefinder (Phil Green, WASHU).
- Curators appraised gene predictions on a clone by clone basis as they were finished.

Genome View
[Figure labels: Predicted, Partially Confirmed, Confirmed. Colour corresponds to strand, not confidence.]

Stats for WS141
- Currently 22,436 gene predictions.
- 11,169 “un-touched”
  - + good 1st pass annotation.
  - + re-annotated >50%.
  - 2,576 Confirmed status.
    - Unlikely to change.
  - 5,624 Partially Confirmed.
    - Potentially modified.
  - 2,969 Predicted.
    - Potentially removed or altered.
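The WS141 figures fit one reading of the slide: the three status counts add up to the “un-touched” total, and the remainder is just over half of all predictions. A quick arithmetic check under that assumption (the variable names are illustrative and not from the talk):

```python
# Counts quoted on the WS141 slide.
total_predictions = 22_436
untouched = 11_169
by_status = {
    "Confirmed": 2_576,
    "Partially Confirmed": 5_624,
    "Predicted": 2_969,
}

# Assumption: the three status counts are a breakdown of the "un-touched" set.
status_sum = sum(by_status.values())

# Under the same assumption, every other prediction has been re-annotated
# since first-pass annotation.
reannotated = total_predictions - untouched
reannotated_fraction = reannotated / total_predictions

print(status_sum == untouched)
print(f"{reannotated} re-annotated ({reannotated_fraction:.1%})")
```

If the reading is right, this is where the “re-annotated >50%” bullet comes from.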

Present
(re)annotation of a genome
Painting by numbers. Painting the Forth Rail Bridge.

(re)annotating a genome
- We adopted a ‘paint by numbers’ approach involving automated appraisal of all gene models on a regular basis.
- Generation of lists of genes/features to be checked by human annotators.
[Diagram labels: Appraise, Curate, Process and report, Release and synchronise]

Script Based Targeted Annotation
- Create a number of curation lists
  - Confirmed introns not in gene models
  - ESTs/mRNAs in introns.
  - Overlapping gene predictions.
  - Predictions overlapping known repeats.
  - Short genes <150bp
  - Short introns <40bp
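Three of the lists above can be sketched in a few lines. This is a minimal illustration, not the WormBase scripts: the `GeneModel` class, the function names and the toy coordinates are all hypothetical, gene length is taken to mean summed exon length, and overlaps are only reported between models on the same strand.

```python
from dataclasses import dataclass

SHORT_GENE_BP = 150    # threshold from the slide: "Short genes <150bp"
SHORT_INTRON_BP = 40   # threshold from the slide: "Short introns <40bp"


@dataclass
class GeneModel:
    """Hypothetical gene model; exons are sorted (start, end) pairs, 1-based inclusive."""
    name: str
    strand: str
    exons: list


def coding_length(model):
    # Assumption: "gene length" means the summed length of the exons.
    return sum(end - start + 1 for start, end in model.exons)


def intron_lengths(model):
    # An intron is the gap between the end of one exon and the start of the next.
    return [nxt_start - prev_end - 1
            for (_, prev_end), (nxt_start, _) in zip(model.exons, model.exons[1:])]


def span(model):
    return model.exons[0][0], model.exons[-1][1]


def build_curation_lists(models):
    lists = {"short_genes": [], "short_introns": [], "overlapping": []}
    for m in models:
        if coding_length(m) < SHORT_GENE_BP:
            lists["short_genes"].append(m.name)
        if any(length < SHORT_INTRON_BP for length in intron_lengths(m)):
            lists["short_introns"].append(m.name)
    # Sort by start so the inner loop can stop at the first non-overlapping model.
    ordered = sorted(models, key=span)
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if span(b)[0] > span(a)[1]:
                break
            if a.strand == b.strand:
                lists["overlapping"].append((a.name, b.name))
    return lists


# Toy data, invented for illustration only.
toy_models = [
    GeneModel("toy.1", "+", [(100, 160), (200, 240)]),
    GeneModel("toy.2", "+", [(230, 500)]),
    GeneModel("toy.3", "-", [(1000, 1300), (1400, 1700)]),
]
curation_lists = build_curation_lists(toy_models)
print(curation_lists)
```

A real pipeline would also need to treat isoforms of the same gene as legitimate overlaps; that is left out here.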

Transcript Based Refinements
- Automatic import of transcript data during our build cycle.
  - C. elegans mRNAs/cDNAs.
  - C. elegans ESTs.
  - Nematode ESTs.
- Processed and aligned to genome.
- This produces data for our curation lists

Gene Refinement Fmap View
- EST data points to 5’ extension and 3’ extension.
- Identified due to confirmed introns not in a gene model
[Figure labels: 5’, 3’, Transcript Data, Refined Prediction, Old prediction, Confirmed intron]
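The “confirmed introns not in a gene model” check amounts to a set difference between introns supported by transcript alignments and introns implied by the current models. A minimal sketch, assuming introns are given as (strand, start, end) tuples; the function names and coordinates are hypothetical:

```python
def model_introns(exons):
    """Introns implied by a sorted list of (start, end) exons, 1-based inclusive."""
    return {(prev_end + 1, nxt_start - 1)
            for (_, prev_end), (nxt_start, _) in zip(exons, exons[1:])}


def unexplained_introns(confirmed, gene_models):
    """Confirmed introns that do not match any intron in the gene models.

    confirmed   -- iterable of (strand, start, end) from transcript alignments
    gene_models -- iterable of (strand, exons) pairs
    """
    annotated = set()
    for strand, exons in gene_models:
        annotated |= {(strand, start, end) for start, end in model_introns(exons)}
    return sorted(set(confirmed) - annotated)


# Toy data: the model has one intron (201-299); the transcripts confirm it and
# also support an intron immediately upstream, hinting at a 5' extension.
toy_gene_models = [("+", [(100, 200), (300, 400)])]
toy_confirmed = [("+", 201, 299), ("+", 60, 99)]
to_check = unexplained_introns(toy_confirmed, toy_gene_models)
print(to_check)
```

Anything left in `to_check` would go on a curation list for a human annotator to look at, as in the extension case shown on the slide.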

Not all <150bp Predictions are Bad?
- Small peptides can be real.
  - H12D21.1 is a 34 aa peptide that appeared on curation list.
  - Investigated.
  - Prediction had peptide similarity to 2 other elegans proteins.
  - Multi sequence alignment proved interesting.

H12D Homols Fmap View & M.S.A.
[Figure labels: SignalP cleavage site, Gene Prediction, Protein Homology Blocks]

New Family Members
- Used tBlastn to identify other regions in genome.
- Annotated these ORFs to give
- 9 additional family members
- These have been called nspa-1 to 12
  - Nematode Specific Peptide family A
[Figure labels: Pseudogene, Expanded Family]

C. briggsae Comparison
- C. elegans vs C. briggsae
  - C. briggsae hybrid gene set analysis (Avril Coghlan).
  - Detailed in PLoS Biol: “The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics.”
- WormBase has worked to incorporate the ~1300 new genes reported.

Coding Gene Predictions Over Time.
[Figure: number of predictions against release number, WS21 to WS124. Series: Predictions Including Isoforms, Coding Genes. Annotations: briggsae hybrid gene set; increase in CDS due to 1st round of new genes identified by comparison with briggsae.]

Large family analysis
- Worm Community Members.
- Multi Sequence Alignments of some large families.
  - 7 TM receptor families
  - 1700 family members
  - Sub families have been worked on by multiple worm community members.
    - Hugh Robertson (University of Illinois)
    - Jim Thomas (University of Washington Seattle)
    - Jack Chen (CSH Laboratories)

Future
- Identify new avenues for gene refinement and identification.
- Looking at predictor overlaps
  - (Genefinder/Twinscan overlaps) vs (WormBase Gene set)
- In house protein family analysis
- Multiple species comparisons

Predictor Overlaps.
[Figure labels: Genefinder Prediction, Twinscan Prediction, New CDS Prediction, Strong Splicing, Good briggsae DNA::DNA Alignment]
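The comparison of (Genefinder/Twinscan overlaps) against the WormBase gene set can be pictured as an interval problem: find regions where both predictors place a gene but no curated gene exists. The sketch below is only an illustration of that idea. It assumes predictions are reduced to (start, end) spans on one strand of one sequence, and the function names and coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def unannotated_agreements(predictor_a, predictor_b, curated):
    """Regions predicted by both predictors that overlap no curated gene."""
    hits = []
    for a in predictor_a:
        for b in predictor_b:
            if not overlaps(a, b):
                continue
            # Region supported by both predictors.
            shared = (max(a[0], b[0]), min(a[1], b[1]))
            if not any(overlaps(shared, c) for c in curated):
                hits.append(shared)
    return sorted(hits)


# Toy spans, invented for illustration only.
toy_genefinder = [(100, 500), (1000, 1500)]
toy_twinscan = [(150, 450), (1100, 1400), (2000, 2500)]
toy_curated = [(1000, 1600)]
candidates = unannotated_agreements(toy_genefinder, toy_twinscan, toy_curated)
print(candidates)
```

The nested loops are quadratic, which is fine for a sketch; a genome-scale version would sort the intervals or use an interval index. Candidates found this way would still need supporting evidence, such as the splicing and briggsae alignment shown on the slide, before becoming a new CDS.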

Gene Family Analysis
- Protein alignments of multiple family members can refine gene predictions.
  - ClustalW
  - blast
  - Main problems identified
    - Incorrect splicing
    - Truncations
    - Invalid extensions

Example of a Small Family Analysis.
- Problematic alignment
  - F56H6.9 appears to have 18aa extra sequence.
  - E03H4.4 seems to be lacking sequence.

Fmap View of F56H6.9

Example of Problem.
- Problematic alignment
- Alignment following annotation.

Multiple Species Comparison.
- More nematode genomes are on their way
  - C. remanei
    - shotgun in progress
    - Blast server available
  - PB2801
    - shotgun in progress
  - C. japonica
    - shotgun in progress

elegans/briggsae/remanei Alignment for nspa-like peptides.

Summary
- Gene (Re)annotation >7 years.
  - New genes are still being discovered.
- Primarily transcript driven.
- More work on protein families
- New strategies for gene prediction and refinement.
  - Using multiple gene predictors
  - Multi species comparison

Acknowledgements
- Genome Sequencing Center St. Louis
  - Sequencing and finishing teams etc.
  - WormBase team: Tamberlyn Bieri, Darin Blasiar, Phil Ozersky, John Spieth
- Wellcome Trust Sanger Institute
  - Sequencing and finishing teams etc.
  - WormBase team: Richard Durbin, Anthony Rogers, Dan Lawson, Mary Ann Tuli
  - AceDB: Ed Griffiths, Roy Storey