Advisory Board Meeting, CSHL 2005 Developments at Sanger Anthony Rogers Wellcome Trust Sanger Institute.

Overview
- The build procedure
- Stats for the year
- Team changes
- Model changes: "new gene model", Variation
- Future plans: InterPro, improved mapping of data to genes, move off wormsrv2, new nematodes, new data types

The WormBase Consortium
Wellcome Trust Sanger Institute, Cold Spring Harbor Laboratory, Washington University in St. Louis, California Institute of Technology
- RNAi
- Microarray
- Anatomy / Cell
- Homology groups (KOGs)
- SAGE data
- Gene Ontology
- Papers / References
- Person / Author
- Detailed Functional Annotation
- Gene prediction annotation
- SNPs
- PCR_products / Oligos
- 3D structures
- Yeast 2 Hybrid interactions
Other diagram labels: Website and tools; Gene prediction annotation; Genetic Data (Alleles, Gene name info incl. unique IDs, Strains); Data Integration and analysis

Build Overview
Diagram labels: CalTech, Sanger, CSHL, WashU, WormBase, EMBL, mysql, WORMPEP, DNA
- Align all cDNAs and build transcripts
- Map experimental data, e.g. RNAi, oligos, alleles
- Sanger compute farm: blastx, blastp, RepeatMask, PFAM, tmhmm etc.
- Load homology data
- Export GFF, agp, DNA files
- Build release files
- To FTP site and CSHL dev site

Release cycle
- From WS124 (March 2004) to WS150 (October 2005): 26 releases.
- All but 2 of these were on schedule. The late ones were due to Sanger-wide systems problems associated with moving to a new building.
- After WS134 we changed (with SAB approval) to a three-weekly cycle.
If releases were on time, why change?
- Increases in data meant a gradual increase in build time.
- Lots of releases were "just in time".
- Time pressure meant that fixes weren't being made properly.
- Reduced staff meant that less development was being done.

Gene stats
More polyA / TSL etc. and fixing BLAT errors

Experimental Data Stats I
New data class

Experimental Data Stats II
Incorporation of genome-wide experiments

Other classes of interest
InParanoid

Staff Changes
Mary Ann Tuli, Gary Williams
- Great improvement in documentation of procedures.
- Gene structure curation
- Allele curation
- Genetic map functions in acedb
- Sequence feature annotation (polyA, TSL)
- Fresh view of methods for doing things.
Keith Bradnam, Chao-Kung Chen, Dan Lawson, Michael Han

“Where is the new Gene model Keith!?!”

The problem
- Worm genes first existed as Locus objects, e.g. dpy-1
- Then genes existed as Sequence objects, e.g. F31D4.3
- Some genes exist as both Locus and Sequence objects
- Gene names change… a lot!

The Plan
Diagram labels: Locus and Sequence, brought together under Gene (WBGene)
- Main CGC name: ptp-3
- Sequence name: C09D8.1
- Other names: ptp-1, ypp-1, YPP/1
- CDS: C09D8.1a, C09D8.1b (ptp-3a, ptp-3b)

Linking to a gene
Diagram labels:
- Linked objects: Paper [cgc4265], Antibody, Allele, RNAi result
- Gene (WBGene): C09D8.1, ptp-3, ptp-1, ypp-1, YPP/1
- C09D8.1a, C09D8.1b, C09D8.1c; ptp-3a, ptp-3b; abc-1

Progress!
- The (no longer new) Gene model is in place.
- All Genes now have Gene_ids.
- Gene history tracking info is stored (merges, splits etc.).
- The next part of the plan was to have a central database serving IDs.
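The idea of a stable Gene_id that survives renames and merges can be illustrated with a small in-memory sketch. This is not the WormBase ID server: the class and method names (Gene, GeneRegistry, create, rename, merge, find) are hypothetical, and the rename and merge events in the usage lines are invented for illustration, using names that appear on the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Gene:
    gene_id: str                          # stable identifier, never reused
    cgc_name: Optional[str] = None        # main CGC name, e.g. "ptp-3"
    sequence_name: Optional[str] = None   # sequence name, e.g. "C09D8.1"
    other_names: List[str] = field(default_factory=list)
    live: bool = True
    history: List[str] = field(default_factory=list)


class GeneRegistry:
    """Hypothetical stand-in for a central database serving IDs."""

    def __init__(self) -> None:
        self._genes: Dict[str, Gene] = {}
        self._next = 1

    def create(self, cgc_name: Optional[str] = None,
               sequence_name: Optional[str] = None) -> Gene:
        # Assumed ID layout: "WBGene" followed by an 8-digit counter.
        gene = Gene(f"WBGene{self._next:08d}", cgc_name, sequence_name)
        self._next += 1
        gene.history.append("created")
        self._genes[gene.gene_id] = gene
        return gene

    def rename(self, gene_id: str, new_cgc_name: str) -> None:
        gene = self._genes[gene_id]
        if gene.cgc_name:
            gene.other_names.append(gene.cgc_name)  # old name stays searchable
        gene.history.append(f"renamed {gene.cgc_name} -> {new_cgc_name}")
        gene.cgc_name = new_cgc_name

    def merge(self, keep_id: str, dead_id: str) -> None:
        keep, dead = self._genes[keep_id], self._genes[dead_id]
        keep.other_names += [n for n in (dead.cgc_name, dead.sequence_name) if n]
        keep.other_names += dead.other_names
        dead.live = False                            # ID is retired, not deleted
        keep.history.append(f"merged_from {dead_id}")
        dead.history.append(f"merged_into {keep_id}")

    def find(self, name: str) -> Optional[Gene]:
        # Look a live gene up by any of its current or former names.
        for gene in self._genes.values():
            names = {gene.cgc_name, gene.sequence_name, *gene.other_names}
            if gene.live and name in names:
                return gene
        return None


registry = GeneRegistry()
gene = registry.create(cgc_name="ptp-1", sequence_name="C09D8.1")
registry.rename(gene.gene_id, "ptp-3")   # illustrative rename only
print(registry.find("ptp-1").gene_id, gene.history)
```

Because papers, alleles and RNAi results would link to the gene_id rather than to a name, a rename or merge only touches the registry entry and its history.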

Working version
- Sanger "single sign-on"
- User-specific operations
- Operation selection
- Not just WBGene_ids: Variation, RNAi, Person

Variation Model
Diagram labels: Locus, SNPs, Classical Genes, Gene Clusters, Allele, Deletions, Transposon_insertions, Variation
- Lots of shared data structures (Tags), e.g. mapping data, names, connections to CDSs
- Greater code efficiency and manageability for both build and web
- Easier to search

Imminent arrivals and the Future
- InterPro
- Refined mapping
- Moving build machine
- New nematodes
- New data types

InterPro
- Useful data used in many other resources, so a good "point of reference" for non-worm specialists.
- We previously got ours from UniProt or ad hoc from St Louis.
- Many databases are covered by InterPro: Prosite, Prints, Pfam, SMART, PIRSF, etc.
- The usual way of searching for database hits is to use interproscan, but this is incompatible with the Sanger farm.
- Instead, we run each database search individually, using the existing architecture from BLAST etc., and store the results.
- We merge hits with the same InterPro ID.
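A minimal sketch of the merging step, assuming that hits from different member databases carrying the same InterPro ID on the same protein are combined when their coordinates overlap. The slides do not state the actual merging rule, so the overlap criterion, the Hit record and the merge_hits function are all assumptions, and the IDs in the usage lines are placeholders.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Hit(NamedTuple):
    protein: str       # protein identifier (placeholder values below)
    database: str      # member database, e.g. "Pfam" or "SMART"
    interpro_id: str   # InterPro entry the member-database signature maps to
    start: int         # 1-based start on the protein
    end: int           # inclusive end on the protein


def merge_hits(hits: List[Hit]) -> Dict[Tuple[str, str], List[Tuple[int, int]]]:
    """Group hits by (protein, InterPro ID) and merge overlapping spans."""
    grouped: Dict[Tuple[str, str], List[Tuple[int, int]]] = defaultdict(list)
    for hit in hits:
        grouped[(hit.protein, hit.interpro_id)].append((hit.start, hit.end))

    merged: Dict[Tuple[str, str], List[Tuple[int, int]]] = {}
    for key, spans in grouped.items():
        spans.sort()
        out = [spans[0]]
        for start, end in spans[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:                     # overlaps previous span
                out[-1] = (last_start, max(last_end, end))
            else:                                     # separate domain copy
                out.append((start, end))
        merged[key] = out
    return merged


example = [
    Hit("protein_1", "Pfam", "IPR_A", 10, 50),
    Hit("protein_1", "SMART", "IPR_A", 40, 80),
    Hit("protein_1", "Prints", "IPR_A", 100, 120),
    Hit("protein_1", "Pfam", "IPR_B", 5, 20),
]
print(merge_hits(example))
```

A different rule (for example, merging only spans that overlap by some minimum fraction) would change the span counts, which may be one reason results can differ from iprscan.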

Merging hits from databases
Diagram label: Protein
Results similar but not identical to iprscan

InterPro hits per protein
15 proteins with >100 domains (max. 186)

Improved Mapping of Variations to Genes
We can describe much more accurately how a mutation affects a gene:
- donor and acceptor splice sites
- introns / exons
- motifs like polyAs and TSLs
... and for coding changes give the amino acid differences.
(Figure label: Variations)

sra-9: ttc → tta, F → L
- Currently the only connection is to the Gene.
- In future we will specify that the SNP is in coding sequence and that it causes a specified amino acid change.
- Described by Tags in the database, so searchable.
- Predicted snp_AH6[1]
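The amino acid change on this slide can be reproduced with the standard genetic code. The sketch below is only an illustration of the lookup, not the build code; the function name describe_codon_change and the "silent / missense / nonsense" labels are assumptions, and a change away from a stop codon is not treated specially.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def describe_codon_change(ref_codon: str, alt_codon: str):
    """Return (amino acid change, kind of change) for a single codon."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        kind = "silent"
    elif alt_aa == "*":
        kind = "nonsense"      # change introduces a stop codon
    else:
        kind = "missense"
    return f"{ref_aa} to {alt_aa}", kind


print(describe_codon_change("ttc", "tta"))
```

Storing the result as tags on the variation, rather than as free text, is what would make queries such as "all SNPs causing an amino acid change" possible.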

Implementation
- GFF data: exons, introns, transcripts, SNPs, alleles etc.
- One table per chromosome (I, II, III, IV, V, X), so all can be loaded together
- All chromosomes can be run in parallel
- cbi1 = 3 x 2cpu
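A rough sketch of the per-chromosome layout: each chromosome's features are held separately, so mapping variations onto them is independent work that can run side by side. The thread pool here is only a stand-in for the cluster, the function names map_chromosome and map_all are hypothetical, and the coordinates in the usage lines are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Feature = Tuple[str, int, int]   # (feature type, start, end), 1-based inclusive as in GFF
Variation = Tuple[str, int]      # (variation name, position)


def map_chromosome(features: List[Feature],
                   variations: List[Variation]) -> Dict[str, List[str]]:
    """For one chromosome, list the feature types each variation falls in."""
    result: Dict[str, List[str]] = {}
    for name, pos in variations:
        overlapping = {ftype for ftype, start, end in features if start <= pos <= end}
        result[name] = sorted(overlapping)
    return result


def map_all(features_by_chrom: Dict[str, List[Feature]],
            variations_by_chrom: Dict[str, List[Variation]],
            workers: int = 6) -> Dict[str, Dict[str, List[str]]]:
    """Run map_chromosome for every chromosome concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            chrom: pool.submit(map_chromosome, feats, variations_by_chrom.get(chrom, []))
            for chrom, feats in features_by_chrom.items()
        }
        return {chrom: future.result() for chrom, future in futures.items()}


features = {
    "I": [("transcript", 100, 400), ("exon", 100, 200), ("intron", 201, 300)],
    "X": [("exon", 10, 20)],
}
variations = {"I": [("snp_1", 150), ("snp_2", 250)], "X": [("allele_1", 15)]}
print(map_all(features, variations))
```

The linear scan over features is fine for a sketch; a real table per chromosome would normally be indexed on coordinates so each lookup does not touch every row.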

Death of wormsrv2
5 years ago:
- Sanger network = bad
- Bought shiny fast new computer
- It has since become too slow, and its isolation is a pain
Now:
- Sanger network = good!
- Move to use the informatics cluster: fast and parallel
- Means modification of the majority of the code base

New nematodes
New nematode genomes: C. briggsae is a forerunner...
- semi-curated gene set
- brigpep2
- protein annotation (PFAM, tmhmm, signalp)
- ortholog assignment (InParanoid, Erik Sonnhammer)
- blastp
- blastx
- waba (Jim Kent's genome alignment tool)
We intend to do all of this for each of the new genomes. Mostly done for C. remanei.

New Data Types
Any new data type has an impact on the build:
- new model development
- scripts to integrate and check the data
E.g. mass spec data: we have been in contact with Gennifer Merrihew.