Sequence Curation. Paul Davis, Sanger Institute.



Overview
Sequence curation within the WormBase consortium.
Import of sequence data.
Prediction stats.
Work metrics and infrastructure.
New collaborations.
Submission of data to public data repositories.
Sequence curation and modENCODE.
SAB 2008

Sequence Curation
Curation from multiple sources:
–Transcript data: NDB (EMBL).
–Anomalies Database.
–1st pass paper curation at Caltech (talks this afternoon).
–Direct user submissions, pre- and post-publication.

Transcript Data Retrieval & Processing
Retrieval of transcript data for C. elegans and all tier II species.
Transcript data is feature rich; two feature-oriented classes are covered here.
Sequences are processed to identify feature data, with a two-fold application:
–Cleanup: masking features that cause problems for genomic placement. This improves the quality of coding transcripts (which has been a problem in the past).
–Routine identification of novel features: trans-splice leader sequences (SL1/SL2) and polyA features.

Feature Data for Improvement & Enrichment
Type    WS170    WS190
PolyA: PolyA_site, PolyA_signal
Trans-splice leader: TSL, SL, SL
Unknown3250
Blat_discrepancies
Low_complexity15237
Misc3755
Total

Annotated Features
Binding sites and new Feature type initiative in re-start phase.
Automated & paper curation.
Features annotated from:
–Feature generation from non-redundant feature data.
–1st pass paper curation.
Figure: number of annotated features by Feature type.

Example Cleanup with Collaborative Feedback (pre-publication)
RACE Sequence Tag (RST) reads from the RACE project, submitted following the IWM (International Worm Meeting, UCLA).
–Assumption: 5’ reads have TSL sequences; 3’ reads have polyA sequence, based on the experimental methodology.
5’ reads:
–82% have canonical SL1/SL2 sequences.
–Additional analysis revealed that 18% have SL-like sequences.
–Experimental confirmation of a mixed sequencing reaction (SL1 + SL2).
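
The 5’ read cleanup can be illustrated with a minimal Python sketch. It is not the WormBase code: the function names, the `min_len` threshold, and the rule of matching only the 3’ end of the leader are assumptions made for illustration, and the SL1/SL2 strings are the commonly cited leader sequences, which should be checked before use.

```python
# Commonly cited C. elegans spliced-leader sequences (assumption: verify before use).
LEADERS = {
    "SL1": "GGTTTAATTACCCAAGTTTGAG",
    "SL2": "GGTTTTAACCCAGTTACTCAAG",
}


def find_leader(read, min_len=8):
    """Return (leader_name, n_bases) for the longest leader suffix at the
    5' end of the read, or None if no suffix of at least min_len matches.

    Hypothetical rule: reads often carry only the 3' part of the leader,
    so suffixes of each leader are tried, longest first.
    """
    read = read.upper()
    best = None
    for name, leader in LEADERS.items():
        for n in range(len(leader), min_len - 1, -1):
            if read.startswith(leader[-n:]):
                if best is None or n > best[1]:
                    best = (name, n)
                break  # longest suffix for this leader found
    return best


def mask_leader(read, min_len=8):
    """Replace a detected leader with N so it cannot disturb genomic placement."""
    hit = find_leader(read, min_len)
    if hit is None:
        return read
    _, n = hit
    return "N" * n + read[n:]
```

Exact matching like this would only find canonical leaders; the SL-like sequences mentioned on the slide would need a mismatch-tolerant comparison.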

Continued
3’ reads:
–0% identified using the standard code base.
–New code looks for polyA runs >10 nt.
–It evaluates the sequence after the polyA run and scores it.
–72% polyA tail identification and masking.
Remainder mis-primed to genomic polyA.
New code implemented.
Feature data was used to identify 472 new unique features.
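
As a rough sketch of the 3’ read approach (find an A-run longer than 10 nt, then judge what follows it), the Python below may help. The scoring rule is hypothetical: the slide does not say how the post-polyA sequence is scored, so here a run is accepted as a tail only if at most `max_trailing` bases follow it.

```python
import re

# Runs of A longer than 10 nt, as stated on the slide.
POLYA_RUN = re.compile(r"A{11,}")


def find_polya_tail(read, max_trailing=5):
    """Return (start, end) of a putative polyA tail, or None.

    Hypothetical scoring rule: take the last qualifying A-run and accept it
    only if no more than max_trailing bases follow it in the read.
    """
    runs = list(POLYA_RUN.finditer(read.upper()))
    if not runs:
        return None
    last = runs[-1]
    trailing = len(read) - last.end()
    if trailing > max_trailing:
        return None  # A-run is internal, so not treated as a tail
    return last.start(), len(read)


def mask_polya(read, max_trailing=5):
    """Mask the putative tail (and any short trailing sequence) with N."""
    span = find_polya_tail(read, max_trailing)
    if span is None:
        return read
    start, end = span
    return read[:start] + "N" * (end - start)
```

A check on the read alone cannot separate a true tail from mis-priming on a genomic polyA stretch; that would need a comparison against the genome at the aligned position.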

Current WormBase Gene Status
Coding genes only.
Only utilises transcript data evidence.
Exploring option to upgrade.
–Predicted: no available transcript evidence.
–Partially confirmed: some but not all bp are covered by transcript evidence.
–Confirmed: every base has supporting transcript data.
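
The three status definitions can be expressed as a small coverage check. This is a simplified Python sketch, not the WormBase implementation: the function name and the 1-based inclusive coordinate tuples are assumptions, and it ignores details such as whether transcript introns agree with the gene model.

```python
def confirmation_status(cds_exons, transcript_alignments):
    """Classify a coding gene by transcript coverage of its coding bases.

    Both arguments are lists of (start, end) tuples, 1-based and inclusive
    (an assumed convention for this sketch).
    """
    coding = set()
    for start, end in cds_exons:
        coding.update(range(start, end + 1))

    covered = set()
    for start, end in transcript_alignments:
        covered.update(range(start, end + 1))

    supported = coding & covered
    if not supported:
        return "Predicted"            # no transcript evidence at all
    if supported == coding:
        return "Confirmed"            # every coding base is supported
    return "Partially confirmed"      # some but not all bases supported
```

For example, a two-exon CDS at (1, 100) and (201, 300) with a single alignment at (1, 150) would come out as partially confirmed, since the second exon has no support.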

Curation Stats 07/08
WS170 (19th Jan 07) to WS190 (current live site)
Data Type    WS170    WS190    % change
CDS
CDS changes: ~1800
Isoform
WB Status:
–Confirmed (35.5%)
–Partially Confirmed (46%)
–Predicted (18.5%)
Pseudogenes (~30% ↑ CDS)
RNA Genes
Total number of genes*
* Genes with a known sequence and structure

Curation Tool and Anomalies Database
Gary introduced the development of the tools.
The curation tool is essential for day-to-day curation.
Utilised by both sequence curation sites for:
–Tracking.
–Prioritisation.

C. elegans Curation Time Scale
Expect to take between 5 and 12 months to finish C. elegans.
Estimate based on ~1500 anomalies per month.
–Assumes no new anomaly data is added, which there will be.
Figure: number of anomalies flagged as seen.
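
The arithmetic behind the estimate can be sketched in Python. The backlog sizes used in the usage note (7,500 and 18,000 anomalies) are not on the slide; they are simply what a 5 to 12 month range implies at ~1500 anomalies per month. The function name and the `new_per_month` parameter are hypothetical.

```python
import math


def months_to_clear(remaining, per_month=1500, new_per_month=0):
    """Months needed to work through a backlog of anomalies.

    remaining:     anomalies still to be reviewed (assumed figure)
    per_month:     anomalies reviewed per month (~1500 on the slide)
    new_per_month: newly flagged anomalies per month (hypothetical)
    """
    net = per_month - new_per_month
    if net <= 0:
        return math.inf  # the backlog never shrinks
    return remaining / net
```

With no new data, `months_to_clear(7500)` gives 5.0 and `months_to_clear(18000)` gives 12.0. Adding even 500 new anomalies per month stretches the upper figure to 18.0, which is the caveat the slide raises.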

Infrastructure for Distributed Curation
Sequence curation is based at 2 centres.
–Anomalies tool for consistent prioritisation.
–Request Tracker (RT) system for curation ticket generation.
Utilised by Caltech 1st pass curation for flagging:
–Gene model curation discrepancies/new data.
–Feature annotation.
–Etc.
Curator-to-curator interaction, as projects are split between curators.
–e.g. C. elegans is split into 12 regions for curation.

Submission of Data to NDB
–Submission of sequence updates for C. elegans back to the NDBs.
–Synchronised to the build cycle.
–HSF (Hinxton Sequence Forum): collaboration at the Wellcome Trust Genome Campus.
–Weekly meetings.
An HSF presentation brought about a change in how we represent ncRNAs in our submissions: they now include ncRNA_class and description.
Figure: GenBank.

modENCODE Data
Integration and collaboration with the UTRome project.
Annotated UTRs alongside WormBase coding transcripts.
Binding site data will also be annotated.
–Requires model changes to accommodate available data.
Link out for detailed experimental results.

Summary
C. elegans manual annotation is necessary as new data identifies gene refinements.
Tools are in place to allow for distributed curation.
Collaborating with external groups to refine data and achieve better representation.
Always looking to integrate new data.