COG and GO tutorial
The Clusters of Orthologous Groups (COGs) Database
The protein database of Clusters of Orthologous Groups (COGs) is an attempt to phylogenetically classify the complete complement of proteins (both predicted and characterized) encoded by complete genomes. Each COG is a group of three or more proteins that are inferred to be orthologs, i.e., they are direct evolutionary counterparts.
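The "three or more proteins" in the definition reflects how the COG papers listed under Useful information build clusters: from triangles of consistent best hits between proteins of three different genomes. The following is only a rough sketch of that idea. The protein names, genome labels, and best-hit table are invented, and the sketch simplifies by requiring strictly mutual best hits and by not merging triangles that share a side, so it should not be read as the published procedure.

```python
from itertools import combinations

# Hypothetical best-hit table: best_hit[(protein, target_genome)] -> best match.
# All names and hits are invented for illustration; this is not real BLAST output.
best_hit = {
    ("A1", "B"): "B1", ("A1", "C"): "C1",
    ("B1", "A"): "A1", ("B1", "C"): "C1",
    ("C1", "A"): "A1", ("C1", "B"): "B1",
    ("A2", "B"): "B1", ("A2", "C"): "C2",
    ("C2", "A"): "A2", ("C2", "B"): "B1",
}
genome_of = {"A1": "A", "A2": "A", "B1": "B", "C1": "C", "C2": "C"}


def is_mutual(p, q):
    """True if p and q are each other's best hit in the other's genome."""
    return (best_hit.get((p, genome_of[q])) == q
            and best_hit.get((q, genome_of[p])) == p)


def find_triangles(proteins):
    """Return protein triples from three different genomes whose best hits are all mutual."""
    triangles = []
    for trio in combinations(sorted(proteins), 3):
        # Skip triples that do not span three distinct genomes.
        if len({genome_of[p] for p in trio}) != 3:
            continue
        if all(is_mutual(p, q) for p, q in combinations(trio, 2)):
            triangles.append(trio)
    return triangles


triangles = find_triangles(genome_of)
print(triangles)
```

In this toy table, A2 and C2 point at B1 as their best hit, but B1 does not point back at them, so they do not close a triangle.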
An example
Ontologies for Molecular Biology
“Ontologies provide controlled, consistent vocabularies to describe concepts and relationships, thereby enabling knowledge sharing” (Gruber 1993)
Gene Ontologies (GO): ontologies for molecular biology domains, developed and supported by the Gene Ontology Consortium for annotating genes and gene products in all organisms.
Using accession IDs to identify common objects is not enough to achieve integration.
An ontology provides machine-interpretable definitions of the basic concepts in a domain and the relations among them.
It defines a common vocabulary for researchers who need to share information in a domain.
Biological terms such as ‘ectoderm determination’ need to be defined so that concepts can be communicated. These concept definitions need to be valid for all relevant organisms (plants, animals, and microbial systems) if we are to exploit the information we have to achieve a greater understanding of cellular functions.
term: ectoderm specification
GO id: NEW
definition: The processes involved in the specification of cell identity in the ectoderm. Once specification has taken place, a cell will be committed to differentiate down a specific pathway if left in its normal environment.
definition_reference: GO:curators
Gene Ontology Objectives
GO represents concepts used to classify specific parts of our biological knowledge:
Biological Process
Molecular Function
Cellular Component
GO develops a common language applicable to any organism.
GO terms can be used to annotate gene products from any species, allowing comparison of information across species.
GO is the designation of a project as well as the product of the project. Starting with the cellular level, we are not distinguishing cell types, organs, etc. Gene Ontology is a collaboration between the fly (FlyBase), mouse (MGD), and yeast (SGD) genome databases. All three groups had started independent projects to produce controlled vocabularies for the biology of their organisms. You will all be familiar with hierarchical systems for classifying enzymes (EC) or functions (YPD, SwissPROT, MIPS, …). We have divided our project into the creation of three ontologies. These are not necessarily hierarchical; rather, they can be a network of associations, that is, a directed acyclic graph (DAG).
Process: cell cycle, nutrient transport, behavior
Function: alcohol dehydrogenase
Cellular Location: organelle, protein complex, subcellular compartment
What GO is NOT:
Not a way to unify biological databases
Not a dictated standard
Not a database of gene products, protein domains, or motifs
Does not define evolutionary relationships
The 3 Gene Ontologies
Molecular Function = elemental activity/task: the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity.
Biological Process = biological goal or objective: broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions.
Cellular Component = location or complex: subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme.
Terms, Definitions, IDs
term: MAPKKK cascade (mating sensu Saccharomyces)
goid: GO:0007244
definition: OBSOLETE. MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces.
definition_reference: PMID:9561267
comment: This term was made obsolete because it is a gene product specific term. To update annotations, use the biological process term 'signal transduction during conjugation with cellular fusion ; GO:0000750'.
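A term record like the one on this slide is a set of "key: value" fields, so it can be read into a dictionary and checked for the OBSOLETE marker. The sketch below assumes one field per line, which is a layout inferred from the slide rather than a statement about the actual GO file format. The function name parse_term is made up for illustration.

```python
def parse_term(record):
    """Parse 'key: value' lines into a dict, splitting each line at its first ': '."""
    term = {}
    for line in record.strip().splitlines():
        key, _, value = line.partition(": ")
        term[key.strip()] = value.strip()
    return term


# Fields copied from the slide; the one-field-per-line layout is an assumption.
record = """
term: MAPKKK cascade (mating sensu Saccharomyces)
goid: GO:0007244
definition: OBSOLETE. MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces.
definition_reference: PMID:9561267
"""

term = parse_term(record)
# Obsolete terms on this slide are flagged at the start of the definition.
is_obsolete = term["definition"].startswith("OBSOLETE")
print(term["goid"], is_obsolete)
```

Splitting only at the first ": " keeps identifiers such as GO:0007244 and PMID:9561267 intact, since their colons are not followed by a space.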
Directed Acyclic Graph EBI GOA
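In a directed acyclic graph, unlike a strict hierarchy, a term may have more than one parent, and an annotation to a term implies all of its ancestors. A minimal sketch of that structure follows. The parent links below are a hypothetical mini-ontology chosen for illustration and are not copied from the GO files.

```python
# Hypothetical mini-ontology: child term -> list of parent terms.
# The links are illustrative only, not taken from the actual GO files.
parents = {
    "biological_process": [],
    "cellular process": ["biological_process"],
    "developmental process": ["biological_process"],
    "cell differentiation": ["cellular process", "developmental process"],
    "ectoderm specification": ["cell differentiation"],
}


def ancestors(term, graph):
    """Collect every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(graph[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


result = ancestors("ectoderm specification", parents)
print(sorted(result))
```

Here "cell differentiation" has two parents, which a tree could not express. The seen set ensures that "biological_process", reachable along both paths, is collected only once.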
Evidence Codes for GO Annotations http://www.geneontology.org/doc/GO.evidence.html
IEA  Inferred from Electronic Annotation
ISS  Inferred from Sequence Similarity
IEP  Inferred from Expression Pattern
IMP  Inferred from Mutant Phenotype
IGI  Inferred from Genetic Interaction
IPI  Inferred from Physical Interaction
IDA  Inferred from Direct Assay
RCA  Inferred from Reviewed Computational Analysis
TAS  Traceable Author Statement
NAS  Non-traceable Author Statement
IC   Inferred by Curator
ND   No biological Data available
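Evidence codes are often used to filter annotations, for example to set aside purely electronic ones (IEA). The sketch below shows one way this might look. The code table is taken from the list above, but the gene product names and the annotation tuples are invented, and the tuple layout is an assumption rather than the real annotation file format.

```python
# Evidence codes as listed on the slide.
EVIDENCE_CODES = {
    "IEA": "Inferred from Electronic Annotation",
    "ISS": "Inferred from Sequence Similarity",
    "IEP": "Inferred from Expression Pattern",
    "IMP": "Inferred from Mutant Phenotype",
    "IGI": "Inferred from Genetic Interaction",
    "IPI": "Inferred from Physical Interaction",
    "IDA": "Inferred from Direct Assay",
    "RCA": "Inferred from Reviewed Computational Analysis",
    "TAS": "Traceable Author Statement",
    "NAS": "Non-traceable Author Statement",
    "IC": "Inferred by Curator",
    "ND": "No biological Data available",
}

# Hypothetical annotations: (gene product, GO id, evidence code).
# The gene product names are invented for illustration.
annotations = [
    ("geneA", "GO:0000750", "IDA"),
    ("geneB", "GO:0000750", "IEA"),
    ("geneC", "GO:0000750", "TAS"),
]


def without_electronic(rows):
    """Keep annotations whose evidence code is known and is not IEA."""
    kept = []
    for product, go_id, code in rows:
        if code not in EVIDENCE_CODES:
            raise ValueError(f"unknown evidence code: {code}")
        if code != "IEA":
            kept.append((product, go_id, code))
    return kept


curated = without_electronic(annotations)
for product, go_id, code in curated:
    print(product, go_id, code, "-", EVIDENCE_CODES[code])
```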
Useful information and links
COG: http://www.ncbi.nih.gov/COG
  Science 1997 Oct 24;278(5338):631-7
  BMC Bioinformatics 2003 Sep 11;4(1):41
GO: http://www.geneontology.org/
AmiGO: http://www.godatabase.org/cgi-bin/amigo/go.cgi
GOst: http://www.godatabase.org/cgi-bin/gost/gost.cgi
GOA: http://www.ebi.ac.uk/GOA/
Homework
1. Please explain how to annotate a given DNA sequence by using COG. (The public tools are not ready to do this; please explain how you would do it.)
2. Please look up the function of the gene(s) in the given DNA sequence by using Gene Ontology.