AgBase: bioinformatics enabling knowledge generation from agricultural omics data Fiona McCarthy.

Summary
- ‘omics’ technologies: the ‘data deluge’
- organising data: bioinformatics and biocuration
- data sharing and analysis: bio-ontologies
- from data to knowledge
- making sense of agricultural data

Databases and Biological Data
- The number of databases has increased
  - Sequence repositories: NCBI, EMBL, DDBJ
  - Model Organism Databases (MODs)
  - Specialist biological databases or ‘knowledge databases’ (e.g. InterPro, interaction databases, gene expression data)
- Need to connect information in different databases
- Databases are increasing in size and complexity

[Figure: two growth charts. The first spans 1970 to 2005 (y-axis: No. × 10^6, 2 to 18); the second spans 2000 to 2009 (y-axis: No., 5,000 to 25,000).]

Generating Biological Data
- Amount of biological data is increasing exponentially
- Completed and ongoing genome sequencing projects
- High throughput “omics” technologies
  - New sequencing technologies
  - Existing microarrays
  - Proteomics

Biocomputing technologies enable ‘omics’ technologies to move from large databases and consortia into individual laboratories.
Managing this data: acquire, store, access, analyze, visualize, share.

NIH WORKING DEFINITION OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

Bioinformatics
- Managing data: different file formats; linking between different databases
- Adding value: multiple levels of information from one ‘omics’ data set; re-analysis; linking data sets
- Organizing: annotating data; biocuration (annotation)

Annotation
ANNOTATE: to denote or demarcate.
Genome annotation is the process of attaching biological information to genomic sequences. It consists of two main steps:
- identifying functional elements in the genome: “structural annotation”
- attaching biological information to these elements: “functional annotation”

Community Annotation
- Researchers are the domain experts, but relatively few contribute to annotation
  - time
  - 'reward' & 'employer/funding agency recognition'
  - training: easy-to-use tools, clear instructions
- Required submission
- Community annotation
  - Groups with special interest do focused annotation or ontology development
  - As part of a meeting/conference or distributed (e.g. wikis)
- Students!

Biocuration
- Biocurators are biologists who are trained to annotate biological data (using database structures, bio-ontologies, etc.).
- Databases use biocuration to enhance the value of biological data: “knowledge databases”
- But how to ensure data consistency between databases?

What Are Ontologies?
“An ontology is a controlled vocabulary of well defined terms with specified relationships between those terms, capable of interpretation by both humans and computers.”
- Bio-ontologies are used to capture biological information in a way that can be read by both humans and computers
  - annotate data in a consistent way
  - allows data sharing across databases
  - allows computational analysis of high-throughput “omics” datasets
- Objects in an ontology (e.g. genes, cell types, tissue types, stages of development) are well defined.
- The ontology shows how the objects relate to each other.

Ontologies
- Each term has: relationships between terms; a digital identifier (computers); a description (humans)
- Gene Ontology version 1.1348 (27/07/2010): 32,091 terms, 99.3% defined
  - 19,169 biological process
  - 2,745 cellular component
  - 8,736 molecular function
  - 1,441 obsolete terms (not included in figures above)
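
The three parts of a term named on this slide (relationships, identifier, description) can be sketched as a small record type. This is only an illustration, not the format GO itself uses: the class, its field names and the `summarize` helper are assumptions, the definitions are paraphrased placeholders, each term is given a single simplified parent, and `GO:9999999` is a made-up identifier standing in for an obsolete term.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    term_id: str                              # digital identifier, for computers
    name: str                                 # short label, for humans
    definition: str = ""                      # free-text description, for humans
    is_a: list = field(default_factory=list)  # identifiers of parent terms
    obsolete: bool = False


# Simplified slice of the cellular component ontology. The first three
# identifiers and names are real GO terms; intermediate parents are omitted.
TERMS = [
    Term("GO:0005575", "cellular_component", "A location relative to cellular structures."),
    Term("GO:0043226", "organelle", "An organized structure with a distinct function.",
         ["GO:0005575"]),
    Term("GO:0005634", "nucleus", "", ["GO:0043226"]),   # definition left blank on purpose
    Term("GO:9999999", "hypothetical retired term", obsolete=True),  # made-up ID
]


def summarize(terms):
    """Count current (non-obsolete) terms and the percentage that have a definition."""
    current = [t for t in terms if not t.obsolete]
    defined = sum(1 for t in current if t.definition)
    return {
        "terms": len(current),
        "obsolete": len(terms) - len(current),
        "percent_defined": round(100 * defined / len(current), 1),
    }


print(summarize(TERMS))
```

A summary of this kind is how figures such as "99.3% defined" could be derived from a term list, with obsolete terms counted separately.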

Relationships: the True Path Rule
Why are relationships between terms important?
- TRUE PATH RULE: all attributes of children must hold for all parents
- so if a protein is annotated to a term, the annotation must also be true for all the parent terms
- this enables us to move up the ontology structure from a granular term to a broader term
- Premise of many GO analysis tools
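
The True Path Rule can be illustrated by walking parent links upward and adding every ancestor to a gene product's annotation set. This is a minimal sketch under stated assumptions: the `PARENTS` map is a simplified slice of GO (real identifiers, intermediate terms omitted, one relationship type only), and `protein_X` and the function names are hypothetical.

```python
# Child term -> list of parent terms (simplified; real GO has more parents).
PARENTS = {
    "GO:0005634": ["GO:0043226"],   # nucleus -> organelle
    "GO:0043226": ["GO:0005575"],   # organelle -> cellular_component
    "GO:0005575": [],               # root of the cellular component ontology
}


def ancestors(term_id, parents):
    """Return every term reachable from term_id by following parent links."""
    seen = set()
    stack = [term_id]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents):
    """Expand each gene product's direct annotations to include all ancestor terms."""
    expanded = {}
    for gene_product, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[gene_product] = full
    return expanded


direct = {"protein_X": {"GO:0005634"}}   # hypothetical protein annotated to nucleus
print(propagate(direct, PARENTS))
```

After propagation, the granular annotation to nucleus also counts toward organelle and cellular_component, which is what lets analysis tools summarize data at broader terms.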

Genomic Annotation
- Structural Annotation:
  - Open reading frames (ORFs) predicted during genome assembly
  - predicted ORFs require experimental confirmation
- Functional Annotation:
  - annotation of gene products = Gene Ontology (GO) annotation
  - initially, predicted ORFs have no functional literature and GO annotation relies on computational methods (rapid)
  - functional literature exists for many genes/proteins prior to genome sequencing
  - Gene Ontology annotation does not rely on a completed genome sequence

Genomic Annotation
- Structural annotation, including Sequence Ontology
- Nomenclature (species’ genome nomenclature committees)
- Functional annotation using Gene Ontology
- Other annotations using other bio-ontologies, e.g. Anatomy Ontology

Gene Ontology, Plant Ontology, Sequence Ontology, Trait Ontology, Expression/Tissue Ontologies, Infectious Disease Ontology, Cell Ontology (http://obo.sourceforge.net/)

Bio-ontology requirements
- bio-ontologies (Open Biomedical Ontologies)
- computational pipelines (‘breadth’) for computational annotations
  - useful for gene products without published information
- manual biocuration (‘depth’)
  - requires trained biocurators
  - community annotation efforts
  - each species has its own body of literature
- biocuration co-ordination
  - MODs? Consortium? Community?
  - biocuration prioritization
  - co-ordination with existing DBs, annotation, nomenclature initiatives
  - data updates

Gene Ontology (GO)
- de facto method for functional annotation
- Assigns functions based upon Biological Process, Molecular Function, Cellular Component
- Widely used for functional genomics (high throughput)
- Many tools available for gene expression analysis using GO
http://www.geneontology.org

Plant Ontology (PO)
- describes plant structures and growth and developmental stages
- Currently used for Arabidopsis, maize, rice; more being added (soybean, tomato, cotton, etc.)
- Plant Structure: describes morphological and anatomical structures representing organ, tissue and cell types
- Growth and developmental stages: describes (i) whole plant growth stages and (ii) plant structure developmental stages
http://www.plantontology.org/

Use GO for…
- Determining which classes of gene products are over-represented or under-represented.
- Grouping gene products.
- Relating a protein’s location to its function.
- Focusing on particular biological pathways and functions (hypothesis-testing).
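
Over-representation of a GO term is commonly assessed with a one-sided hypergeometric test; the slides do not say which statistic any particular tool uses, so the following is a sketch of that common approach rather than a description of a specific tool. The function name, argument names and the example counts are all hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def enrichment_p_value(population, population_hits, sample, sample_hits):
    """One-sided hypergeometric P(X >= sample_hits).

    population      -- gene products in the background set
    population_hits -- background gene products annotated to the GO term
    sample          -- gene products in the list of interest
    sample_hits     -- gene products in that list annotated to the GO term
    """
    upper = min(sample, population_hits)
    favourable = sum(
        comb(population_hits, i) * comb(population - population_hits, sample - i)
        for i in range(sample_hits, upper + 1)
    )
    return favourable / comb(population, sample)


# Hypothetical experiment: 1,000 gene products on an array, 50 annotated to the
# term, 20 differentially expressed, 5 of those annotated to the term.
p = enrichment_p_value(1000, 50, 20, 5)
print(f"p = {p:.4g}")
```

Under-representation would use the opposite tail, P(X <= sample_hits). Because annotations propagate to parent terms, the counts for a broad term should include gene products annotated to any of its descendants.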

Functional Understanding
- Ontologies: GO Cellular Component, GO Biological Process, GO Molecular Function
- Pathways & Networks: BRENDA, Pathway Studio 5.0, Ingenuity Pathway Analyses, Cytoscape, Interactome Databases

http://www.agbase.msstate.edu/

- Provides structural annotation for agriculturally important genomes
- Provides functional annotation (GO)
- Provides tools for functional modeling
- Provides bioinformatics & modeling support for research community

Avian Gene Nomenclature

- GO & PO: literature annotation for rice; computational annotation for rice, maize, sorghum, Brachypodium
- Literature annotation for Agrobacterium tumefaciens, Dickeya dadantii, Magnaporthe grisea, Oomycetes; computational annotation for Pseudomonas syringae pv tomato, Phytophthora spp and the nematode Meloidogyne hapla
- Literature annotation for chicken, cow, maize, cotton; computational annotation for agricultural species & pathogens
- Literature annotation for human; computational annotation for UniProtKB entries (237,201 taxa)

Comparing AgBase & EBI-GOA Annotations
[Chart: gene products annotated (y-axis, 0 to 14,000) by project (AgBase and EBI-GOA, for chick and cow), split into computational, manual - sequence, and manual - literature.]
Complementary to EBI-GOA: GenBank proteins not represented in UniProt & EST sequences on arrays

Contribution to GO Literature Biocuration
[Charts for chicken and cow. Contributors listed: AgBase, EBI GOA, EBI-IntAct, Roslin, HGNC, UCL-Heart project, MGI, Reactome. Labelled values: chicken 97.82% and < 0.50%; cow 88.78% and < 1.50%.]

AgBase Quality Checks & Releases
[Flow diagram: AgBase biocurators, AgBase biocuration interface, then release to three destinations, with a ‘sanity’ check or ‘sanity’ check & GOC QC between stages.]
- AgBase database: GO analysis tools, microarray developers
- EBI GOA Project: UniProt db, QuickGO browser, GO analysis tools, microarray developers
- GO Consortium database: public databases, AmiGO browser, GO analysis tools, microarray developers
‘sanity’ check: checks to ensure all appropriate information is captured, no obsolete GO:IDs are used, etc.
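
A ‘sanity’ check of the kind described here (all required information captured, no obsolete GO:IDs) might look like the sketch below. It is not AgBase's actual implementation: the field names, the required-field list, the example records and the obsolete identifier `GO:9999999` are all hypothetical, chosen only to show the shape of such a check.

```python
import re

GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")

# Hypothetical field names; a real pipeline would follow its own schema.
REQUIRED_FIELDS = ("gene_product", "go_id", "evidence_code", "reference")


def sanity_check(annotation, obsolete_ids):
    """Return a list of problems found in one annotation record (empty if none)."""
    problems = []
    for name in REQUIRED_FIELDS:
        if not annotation.get(name):
            problems.append(f"missing {name}")
    go_id = annotation.get("go_id") or ""
    if go_id and not GO_ID_PATTERN.match(go_id):
        problems.append(f"malformed GO ID: {go_id}")
    elif go_id in obsolete_ids:
        problems.append(f"obsolete GO ID: {go_id}")
    return problems


OBSOLETE = {"GO:9999999"}   # made-up identifier standing in for an obsolete term

records = [
    {"gene_product": "protein_X", "go_id": "GO:0005634",
     "evidence_code": "IDA", "reference": "PMID:0000000"},   # placeholder reference
    {"gene_product": "protein_Y", "go_id": "GO:9999999",
     "evidence_code": "IDA", "reference": "PMID:0000000"},
    {"gene_product": "protein_Z", "go_id": "GO:0005634",
     "evidence_code": "IDA", "reference": ""},
]

for record in records:
    print(record["gene_product"], sanity_check(record, OBSOLETE))
```

Running the same check at each release stage keeps annotations with obsolete or incomplete entries from reaching downstream databases and tools.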

Quality improvement Microarray annotations

IITA Crops
- cowpea: “reduced representation” sequencing underway
- soybean: preliminary assembly
- banana: sequencing in progress
- yam: genome sequencing for Dioscorea alata; EST development (IITA & VSU)
- cassava: genome sequencing in progress
- maize: genome sequencing completed; other subspecies being sequenced

Cowpea
- 54,123 genome sequences
- 187,483 ESTs
- Annotated via homology to Arabidopsis & other plants
- GO annotation via homology: availability?
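
GO annotation via homology can be sketched as: take each query sequence's best sufficiently similar hit in an annotated species and copy that hit's GO terms across. The slides do not describe the actual pipeline, so everything below is an assumption for illustration: the thresholds, the sequence and protein names, and the GO assignments are made up. The `IEA` evidence code is the standard GO code for annotations inferred electronically without curator review.

```python
def transfer_go(hits, subject_go, max_evalue=1e-10, min_identity=40.0):
    """Copy GO terms from each query's best qualifying hit.

    hits       -- iterable of (query_id, subject_id, percent_identity, evalue)
    subject_go -- dict mapping annotated subject IDs to sets of GO IDs
    Thresholds are illustrative defaults, not recommended values.
    """
    best = {}
    for query, subject, identity, evalue in hits:
        if evalue > max_evalue or identity < min_identity:
            continue                      # hit too weak to trust
        if subject not in subject_go:
            continue                      # nothing to transfer from this hit
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {
        query: {
            "source": subject,
            "go_ids": sorted(subject_go[subject]),
            "evidence": "IEA",            # inferred from electronic annotation
        }
        for query, (subject, _) in best.items()
    }


# Hypothetical similarity-search results and annotations.
hits = [
    ("cowpea_seq_1", "plant_prot_A", 85.0, 1e-50),
    ("cowpea_seq_1", "plant_prot_B", 60.0, 1e-20),
    ("cowpea_seq_2", "plant_prot_B", 30.0, 1e-3),    # fails both thresholds
]
subject_go = {
    "plant_prot_A": {"GO:0005634"},
    "plant_prot_B": {"GO:0005737"},
}

print(transfer_go(hits, subject_go))
```

Recording the source protein alongside each transferred term keeps the annotation traceable, which matters when the source annotation is later revised.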

Soybean
- NCBI: 1,459,639 ESTs, 34,946 proteins, 2,882 genes
- UniProt: 12,837 proteins (EBI GOA automatic GO annotation)
- UniGene assemblies available
- multiple microarrays available

Banana
- 7,102 genome sequences
- 14,864 ESTs
- 1,399 NCBI proteins; 680 UniProt
- Musa acuminata (sweet banana): 3,898 GO annotations to 491 proteins
- Musa acuminata AAA Group (Cavendish banana): 579 annotations to 96 proteins

Plantain
- Musa ABB Group (taxon:214693): cooking banana or plantain
- 11,070 ESTs, 112 proteins
- 173 GO annotations to 53 proteins
- functional genomics based on banana?

Yams
- Dioscorea (taxon:4672) & subspecies:
  - Dioscorea rotundata, white yam (taxon:55577)
  - Dioscorea alata, water yam (taxon:55571)
  - Dioscorea cayenensis, yellow yam (taxon:29710)
- NCBI: 31 ESTs, 623 proteins
- Genome sequencing for Dioscorea alata; EST development (IITA & VSU)
- 183 GO annotations to 25 proteins

Cassava
- ESTs: 80,631
- NCBI proteins: 568; UniProt: 253
- 2,251 GO annotations assigned to 218 proteins
- 2 Euphorbia esula (leafy spurge)/cassava arrays

Maize
- Zea mays (taxon:4577)
- Genome sequencing completed by Washington University; other subspecies being sequenced
- Active GO annotation project: 131,925 GO annotations to 20,288 proteins

AgBase Collaborative Model
How can we help you?
- Can make GO annotations public via the GO Consortium
- Have computational pipelines to do rapid, first pass GO annotation (including transcript/EST sequences)
- Provide bioinformatics support for collaborators
- Developing new tools
- Training/support for modeling data

Dr Teresia Buza, Dr Susan Bridges, Cathy Grisham, Divya Pedinti, Lakshmi Pillai, Philippe Chouvarine, Seval Ozkan, Hui Wang