Challenges in Biocuration
Philippe Lamesch, PhD
Carnegie Institution of Washington, Stanford, CA

Introduction
Matt Duffin: The Library
PubMed contains 18,792,257 entries
50,000 papers indexed per month
In Feb 2009:
– 67,406,898 interactive PubMed searches were done
– 92,216,786 entries were viewed

> 900 complete genomes to date from

Model Organism Databases (MODs)

Databases in NAR 2009
Nucleotide Sequence Databases
  International Nucleotide Sequence Database Collaboration 3
  Coding and non-coding DNA 42
  Gene structure, introns and exons, splice sites 25
  Transcriptional regulator sites and transcription factors 63
RNA sequence databases 72
Protein sequence databases
  General Sequence Databases 15
  Protein properties 16
  Protein localization and targeting 23
  Protein sequence motifs and active sites 25
  Protein domain databases; protein classification 38
  Databases of individual protein families 73
Structure Databases
  Small molecules 18
  Carbohydrates 9
  Nucleic acid structure 15
  Protein structure 84
Immunological databases 27
Plant Databases
Genomics Databases (non-vertebrate) 2
  Genome annotation terms, ontologies and nomenclature 2
  Taxonomy and identification 11
  General genomics databases 12
  Viral genome databases 28
  Prokaryotic genome databases 68
  Unicellular eukaryotes genome databases 19
  Fungal genome databases 31
  Invertebrate genome databases 54
Metabolic and Signaling Pathways
  Enzymes and enzyme nomenclature 13
  Metabolic pathways 23
  Protein-protein interactions 77
  Signalling pathways 6
Human and other Vertebrate Genomes
  Model organisms, comparative genomics 68
  Human genome databases, maps and viewers 16
  Human ORFs 28
Human Genes and Diseases
  General human genetics databases 15
  General polymorphism databases 32
  Cancer gene databases 25
  Gene-, system- or disease-specific databases 56
Microarray Data and other Gene Expression Databases 67
Proteomics Resources 20
Other Molecular Biology Databases
  Drugs and drug design 22
  Molecular probes and primers 10
Organelle databases
  General 8
  Mitochondrial genes and proteins 16

Responsibilities of a MOD curator
– Gene function curation
– Gene structure annotation
– Integration of new data types into database
– Implementation of new tools on the website
– Improve website
– Community support
– Grant writing
– Community outreach

Responsibilities of a MOD curator Gene function Gene structure curation

Functional genome annotation

What is functional annotation?
It is defined as the process of collecting information about a gene's biological identity:
– molecular function (transcription factor)
– biological roles (trichome development)
– subcellular localization (nucleus)
– mutant phenotype
– expression domain
– interaction with other genes and gene products

A long way to go: Human functional genome annotation
Number of human genes in UniProt: 20,331
Timeline of manual functional GO annotation of human genes
57% of all human genes manually annotated

A long way to go: Arabidopsis functional genome annotation
Number of Arabidopsis genes in TAIR9: 33,518
26% of Arabidopsis genes manually annotated
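The two coverage figures can be turned into approximate gene counts with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the gene totals and percentages are the ones quoted on the slides, the percentages are rounded, and the variable and function names are made up for illustration.

```python
# Gene totals and manual-annotation percentages as quoted on the slides.
GENOMES = {
    "human": {"genes": 20331, "percent_annotated": 57},
    "arabidopsis": {"genes": 33518, "percent_annotated": 26},
}


def coverage(genes, percent_annotated):
    """Return (annotated, unannotated) gene counts, rounded to whole genes."""
    annotated = round(genes * percent_annotated / 100)
    return annotated, genes - annotated


for name, stats in GENOMES.items():
    annotated, remaining = coverage(stats["genes"], stats["percent_annotated"])
    print(f"{name}: ~{annotated} annotated, ~{remaining} still to do")
```

Because the input percentages are rounded, the results are estimates: roughly 11,600 human genes and 8,700 Arabidopsis genes with manual annotations, which leaves close to 25,000 Arabidopsis genes without one.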

Functional annotation: step-by-step

Prioritizing journals
[Chart: number of articles per year]
200 papers/month for 2.5 curators
A high-priority journal list was established

Too much data, not enough curators

Prioritizing journals
Journal based: high-priority list
CELL, CURRENT BIOLOGY, DEVELOPMENT, GENES AND DEVELOPMENT, NATURE, NATURE CELL BIOLOGY, NATURE GENETICS, NUCLEIC ACIDS RESEARCH, PLoS biology, PNAS, SCIENCE, THE EMBO JOURNAL, THE PLANT CELL, THE PLANT JOURNAL, TRENDS IN PLANT SCIENCE
Gene based:
– prioritize papers with unannotated genes
– prioritize papers with novel genes
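The journal-based and gene-based strategies on this slide can be combined into a simple triage score. The sketch below is illustrative only and is not TAIR's actual triage procedure: the `Paper` record, the score weights, and the gene names (`GENE_A` and so on) are hypothetical; only the journal names come from the slide.

```python
from dataclasses import dataclass, field

# High-priority journal list from the slide, upper-cased for matching.
HIGH_PRIORITY_JOURNALS = {
    "CELL", "CURRENT BIOLOGY", "DEVELOPMENT", "GENES AND DEVELOPMENT",
    "NATURE", "NATURE CELL BIOLOGY", "NATURE GENETICS",
    "NUCLEIC ACIDS RESEARCH", "PLOS BIOLOGY", "PNAS", "SCIENCE",
    "THE EMBO JOURNAL", "THE PLANT CELL", "THE PLANT JOURNAL",
    "TRENDS IN PLANT SCIENCE",
}


@dataclass
class Paper:
    """Hypothetical record for a paper waiting in the curation queue."""
    title: str
    journal: str
    genes: set = field(default_factory=set)


def priority(paper, annotated_genes, known_genes):
    """Return a score; higher means curate sooner. Weights are illustrative."""
    score = 0
    # Journal based: the paper appeared in a high-priority journal.
    if paper.journal.upper() in HIGH_PRIORITY_JOURNALS:
        score += 1
    # Gene based: a known gene that still lacks a manual annotation.
    if any(g in known_genes and g not in annotated_genes for g in paper.genes):
        score += 2
    # Gene based: a gene that is not in the database yet (novel gene).
    if any(g not in known_genes for g in paper.genes):
        score += 3
    return score


def triage(papers, annotated_genes, known_genes):
    """Sort the queue so the highest-scoring papers come first."""
    return sorted(
        papers,
        key=lambda p: priority(p, annotated_genes, known_genes),
        reverse=True,
    )


known = {"GENE_A", "GENE_B"}
annotated = {"GENE_A"}
queue = [
    Paper("Follow-up on a well-studied gene", "Science", {"GENE_A"}),
    Paper("A novel gene", "Some Other Journal", {"GENE_C"}),
    Paper("Unannotated and novel genes", "The Plant Cell", {"GENE_B", "GENE_C"}),
]
ranked = triage(queue, annotated, known)
```

With these made-up weights, a paper on novel genes in a lower-profile journal still outranks a high-profile paper about a gene that is already annotated, which reflects the gene-based criteria on the slide.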

Functional annotation: step-by-step

Identifying the gene/organism of interest can be hard
Nomenclature standards and collaborative efforts strive to give orthologous genes the same symbol. Example: BRCA1 exists in > 12 species
The same symbol is used for different genes within a species. Example: PAP1 in A. thaliana
– Purple Acid Phosphatase I
– Phosphatidic Acid Phosphatase I
– Production of anthocyanin pigment I
– Phytochrome Associated Protein I
Gene duplicates share a root symbol. Example: wnt8 in zebrafish: wnt8a and wnt8b

Identifying the gene/organism of interest can be hard
Solution:
– Submit the sequence identifier and other useful information clarifying which gene is discussed in the publication
– Authors need to be aware of the nomenclature process
– Publishers and reviewers need to be more stringent in checking that gene names in the paper are approved and that the necessary sequence identifiers are provided

Identifying relevant data
Goal: identify every novel experimental result, add it to the appropriate section in the database, and connect it to already existing data
– Distinguish experimentally supported from speculative assertions
– Gather experimental results, not censor them!

Example of a gene page at TAIR Computational description

Example of a gene page at TAIR Summary

Example of a gene page at TAIR GO annotations

Example of a gene page at TAIR

What is a Gene Ontology annotation?
An annotation is a statement that a gene product …
– …has a particular molecular function
– …is involved in a particular biological process
– …is located in a certain cellular component
– …as determined by a particular method
– …as described in a particular reference
Annotations have four key components:
Adapted from Harold J Drabkin, The Jackson Laboratory

Smith et al. (2006) determined by an enzyme assay that ABC2 has protein kinase activity.
– Reference: Smith et al. (2006)
– Method: enzyme assay
– Term: protein kinase activity
– Gene product: ABC2
Adapted from Harold J Drabkin, The Jackson Laboratory
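The four components named on this slide map naturally onto a small record type. The following is a minimal sketch, assuming a hypothetical `GOAnnotation` class with made-up field names; it is not the schema of any real annotation file format, and the `sentence` helper only reads well for molecular function terms like the one in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOAnnotation:
    """One annotation: a gene product, an ontology term, a method, a reference."""
    gene_product: str
    aspect: str  # "molecular function", "biological process" or "cellular component"
    term: str
    method: str
    reference: str

    def sentence(self) -> str:
        """Render the annotation as prose (phrasing suits molecular functions)."""
        return (
            f"{self.reference} determined by {self.method} "
            f"that {self.gene_product} has {self.term}."
        )


# The example from the slide, split into its four components.
example = GOAnnotation(
    gene_product="ABC2",
    aspect="molecular function",
    term="protein kinase activity",
    method="an enzyme assay",
    reference="Smith et al. (2006)",
)
print(example.sentence())
```

Keeping the method and the reference as separate fields is what lets a database user later filter for experimentally supported statements or trace a claim back to its paper.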

Same name, different concept: Cell

Different name, same concept
glucose biosynthesis, glucose synthesis, glucose formation, glucose anabolism, gluconeogenesis
noncarbohydrate precursors (pyruvate, amino acids and glycerol) → glucose

The solution: Controlled vocabularies
A standardized, restricted set of defined terms designed to reduce ambiguity in describing a concept.
e.g. glucose biosynthesis, glucose synthesis, glucose formation, glucose anabolism, gluconeogenesis = Gluconeogenesis
Applicable to many organisms, thus allowing cross-species comparisons.
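In code, a controlled vocabulary behaves like a lookup from every accepted free-text variant to one preferred term. The sketch below follows the slide's simplified example, in which all five phrases are treated as the same concept; the `SYNONYMS` table and `normalize` function are hypothetical names, and a real ontology carries identifiers, definitions and relationships that are omitted here.

```python
# Free-text variants from the slide, each mapped to the single preferred term.
SYNONYMS = {
    "glucose biosynthesis": "gluconeogenesis",
    "glucose synthesis": "gluconeogenesis",
    "glucose formation": "gluconeogenesis",
    "glucose anabolism": "gluconeogenesis",
    "gluconeogenesis": "gluconeogenesis",
}


def normalize(phrase):
    """Return the preferred term for a phrase, or None if it is not in the vocabulary."""
    # Ignore case and collapse repeated whitespace before the lookup.
    key = " ".join(phrase.lower().split())
    return SYNONYMS.get(key)


print(normalize("Glucose  Synthesis"))
print(normalize("glucose anabolism"))
print(normalize("photosynthesis"))
```

Returning `None` for unknown phrases, rather than passing them through, keeps the vocabulary restricted: a phrase that is not recognised has to be reviewed instead of silently becoming a new term.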

Gene structure annotation
– Arabidopsis genome sequenced almost 9 years ago
– High quality sequence with few gaps
– TIGR did initial genome annotation
– TAIR took over responsibility in 2005
Current stats:
– 27,379 protein coding genes
– 4,827 pseudogenes or transposable elements
– 1,312 ncRNAs

Gene structure annotation in Arabidopsis
NEW: 282 genes; 1056 exons. UPDATED: 1254 models; 1144 exons
NEW: 1291 genes; 683 exons. UPDATED: 3811 models; 4007 exons
NEW: 681 genes; 828 exons. UPDATED: 10,792 models; 14,050 exons
TAIR6

Gene structure annotation in Arabidopsis Novel genes

Gene structure annotation in Worm > 600 C. elegans gene models added since 2004 > 6000 gene model structure updates from

Gene structure annotation in Human
The Consensus CDS (CCDS) project is a collaborative effort to identify a core set of human and mouse protein coding regions that are consistently annotated and of high quality.
Collaborators:
– European Bioinformatics Institute (EBI)
– National Center for Biotechnology Information (NCBI)
– Wellcome Trust Sanger Institute (WTSI)
– University of California, Santa Cruz (UCSC)

Gene structure annotation in Human 1.18 splice-variants/gene identified by the CCDS project

Gene structure annotation in Human

Gene structure annotation of model organisms: Remaining challenges
– Updating exon-intron structures of existing gene models
– Identifying all splice variants of known loci
– Annotating specific gene types: small genes, pseudogenes, transposable element genes, RNA-coding genes, antisense genes, genes within the UTR of other genes, …

How do MOD curators annotate genomes? Experimental & Computational Evidence Automatic pipeline Manual annotation Genome annotation

Automated pipeline at TAIR
– Transcripts from NCBI are clustered with PASA (Program to Assemble Spliced Alignments)
– The resulting gene model is compared with the previous gene model
– Based on a set of rules, a decision is made
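The comparison-and-decision step can be pictured as a small rule function. The slide does not list the actual rules, so the ones below are invented purely for illustration: gene models are represented as lists of `(start, end)` exon coordinates, and a candidate that keeps every splice junction of the previous model is accepted automatically while anything else goes to a curator. None of this is PASA's or TAIR's real logic.

```python
def splice_junctions(exons):
    """Return the (donor, acceptor) coordinate pairs between consecutive exons."""
    ordered = sorted(exons)
    return [(left[1], right[0]) for left, right in zip(ordered, ordered[1:])]


def decide(previous, candidate):
    """Apply hypothetical rules to choose what happens to a candidate gene model."""
    # Rule 1 (assumed): identical exon coordinates mean nothing changed.
    if sorted(previous) == sorted(candidate):
        return "unchanged"
    # Rule 2 (assumed): same splice junctions, so only the outer ends moved
    # (for example, a UTR was extended); treat this as safe to auto-update.
    if splice_junctions(previous) == splice_junctions(candidate):
        return "auto-update"
    # Rule 3 (assumed): any change to the intron structure needs a curator.
    return "manual review"


previous_model = [(100, 200), (300, 400)]
extended_ends = [(50, 200), (300, 450)]
extra_intron = [(100, 200), (300, 350), (380, 400)]

print(decide(previous_model, previous_model))
print(decide(previous_model, extended_ends))
print(decide(previous_model, extra_intron))
```

Routing structural changes to manual review mirrors the split between the automatic pipeline and manual annotation shown elsewhere in the talk, although where that line is really drawn depends on rules the slide does not spell out.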

How do MOD curators annotate genomes? Experimental & Computational Evidence Automatic pipeline Manual annotation Genome annotation

Manual annotation at different MODs
Genome editing tool + Evidence set + Set of annotation rules
– Genome editing tool: Apollo (Arabidopsis, Fly), Aceview (Worm), Zmap/Otterlace (Human), Artemis (Pathogen Project), …
– Evidence set: nucleotide sequence, short peptides, protein similarity, alternative predictions, …
– Set of annotation rules: exon size, intron size, number of UTRs, coding/non-coding ratio, splice junctions, …

Manual annotation at TAIR: Apollo
[Screenshot of Apollo showing 2 gene isoforms with evidence tracks: ESTs, cDNAs, radish sequence alignments, Eugene prediction, dicot sequence alignments, monocot sequence alignments, Aceview gene predictions, short MS peptide]

Recent genome annotation projects at TAIR
– Comparing TAIR models to those of 4 alternative prediction tools
– Integrating newly published large-scale datasets into the annotation:
  – Short MS peptide sequences (Baerenfaller et al, Castellana et al)
  – Short single-exon genes (Hanada et al)
  – Transposable elements (Quesneville et al)
– Development of a Gene Confidence Ranking
– Improve pseudogene annotation

Gene confidence ranking

Other responsibilities of a gene structure curator
– Analyse large datasets submitted by the community
– Represent data in a useful manner
– Update the genome assembly based on newly found indels/contaminations
– Generate downloadable datasets for users
– Implement new tools
– Do community outreach at conferences and in schools

Too much data, not enough curators
– More papers are published than curators can read
– Many databases have 2 or 3 curators to analyze tens of thousands of genes
– For many newly sequenced genomes no database exists to annotate genes

Involving the community in scientific curation
– Have publishers become more involved (Plant Physiology)
– Direct data submission from user to database
– Designate experts for specific genes/families
– Wikis
– Get community and students involved in annotating genomes
– New tools such as Biolit and the Microsoft plugin to mark up the original publication
– Other new ways of disseminating data (see Scivee)

Involving the community in genome annotation: Direct submission by the community Submit data in standardized format using MOD submission forms Requires a lot of work from community

Involving the community in genome annotation: Partnership between journals and databases TAIR: collaboration with Plant Physiology

Involving the community in genome annotation: Direct editing of the database by registered experts

Involving the community in genome annotation: Wikis

Wikis

Involving the community in genome annotation: Gene structure annotation in the classroom

Involving the community in genome annotation: New tools: Microsoft Ontology Add-in

Involving the community in genome annotation: New tools: Biolit

Involving the community in genome annotation: New tools: Scivee and Pubcasts

The Constituents are Changing

Acknowledgments
PIs: Eva Huala, Sue Rhee
Curators: David Swarbreck, Donghui Li, Tanya Berardini, Kate Dreher, Peifen Zhang
TAIR Tech Team: Vanessa Kirkuo, Chris Wilks, Tom Meyer, Cindy Lee, Raymond Chetty, Bob Muller
All my colleagues from other MODs

Establish semantic consistency
Start to provide semantic enrichment of the literature in a consistent way. Given that authors are the experts on their own work, they should annotate their own data. Work with Microsoft to create a plugin that enforces semantic consistency during the authoring process, a bit like a spellchecker: every word typed is checked against the ontology, and the tool either suggests replacing a common term with the systematic name or tags the term with the systematic name while still using the common term.

A curator is not a reviewer While we do not control the quality of the data, thorough annotation and user-friendly database tools are the keys to making the database useful.

Availability of web servers
