Prokaryotic Annotation at TIGR Michelle Gwinn Giglio June, 2005.



Prokaryotic Annotation at TIGR
- we work in a high-throughput environment
- our team of 7 annotators finish genes per month
- there is a constant backlog of genomes waiting for manual annotation
- most of our genomes have few, or no, experimentally characterized proteins
- we rely heavily on sequence similarity methods to determine the functions of proteins in our genomes
- nearly all of our projects receive complete manual annotation prior to publication (only a very few have been released with automatic annotation)

GO Annotation at TIGR
- our manual annotation process is the same whether we add GO terms to our proteins or not
- using GO to categorize our proteins allows us to capture information that we have discovered in the manual annotation process that would otherwise be lost
- GO offers a system for the unambiguous communication of annotation information in a format amenable to computer searching and easy exchange

Some History
- TIGR always recognized the importance of grouping genes according to the functions and processes in which they were involved
- with the first prokaryotic genome published, H. influenzae, we adapted Monica Riley’s E. coli role categories
- we have continued to modify that role scheme and still assign TIGR roles today
- in 1998 we recognized that it would be really useful to have a set of role categories that could be used by all species, and we started a project in that direction
- also in 1998, I met Michael Ashburner and Suzi Lewis and learned of their efforts with GO; we decided to stop our project and wait to use GO
- During , TIGR’s genome V. cholerae was annotated to GO
- in 2002 TIGR joined the GO consortium
- currently, TIGR has 11 prokaryotic genomes deposited with the GO repository (and many more with manual GO annotation, waiting internally at TIGR for publication)

Adding GO Annotation to our system…
- required us all to learn the GO system, its rules, data formats, etc.
- required significant changes to our tools and databases for the visualization and storage of GO data
- took time; however, there are vastly more resources available today than there were 5 years ago when we were making the shift, when GO was still quite young

The Goal of the Annotation Process
- determine the function of the protein if possible
- assign annotation to the protein: common name, gene symbol, EC number, TIGR role, GO terms, comments as needed
- store evidence for the annotation (something we always did)
- annotation should only be as specific as the evidence supports; err on the side of undercalling rather than overcalling

How do we determine the functions of the proteins?
- the best thing is to do an experiment on the protein, which is not really possible for us to do
- shared sequence implies shared function
  – we are well aware of cases where one amino acid change results in a change of function
  – all of our functional assignments must be considered putative until experimentally confirmed
- collect and evaluate information from many sequence-based search and prediction tools
  – BER (BLAST-extend-repraze)
  – HMM (Hidden Markov Model)
  – TMHMM (Transmembrane HMM)
  – SignalP (Signal peptide)
  – PROSITE
  – InterPro
  – Paralogous families
  – Genome Properties
- the ISS annotations in TIGR data are sequence evaluations performed by us, not taken from authors in the literature
- we use our annotation tool Manatee to view information and make annotations; you will see screen shots in the following slides

The Manatee Gene Curation Page

BER searches
- TIGR’s pairwise alignment tool
- initial BLAST to collect proteins with any similarity to the search protein
- modified Smith-Waterman alignment generated between the search protein and each BLAST result
- result is a file containing one pairwise alignment for each match protein from the BLAST
- view alignments in our Manatee annotation tool
- we do the 2-step process because BLAST is fast and Smith-Waterman is slow, so it saves CPU time to only do the Smith-Waterman alignments on things that have any hope of matching
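The two-step idea (a fast filter followed by a slow, sensitive alignment on the survivors only) can be sketched as below. This is not BER itself: a shared k-mer test stands in for BLAST, a plain Smith-Waterman score with a linear gap penalty stands in for TIGR's modified alignment, and all function names, parameters and sequences are made up for illustration.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def two_step_search(query, database, k=3, min_shared=1):
    """Cheap filter first; run the expensive alignment only on survivors."""
    query_kmers = kmers(query, k)
    hits = {}
    for name, seq in database.items():
        # Step 1 (stand-in for BLAST): keep anything sharing a k-mer.
        if len(query_kmers & kmers(seq, k)) >= min_shared:
            # Step 2: full dynamic-programming alignment score.
            hits[name] = smith_waterman_score(query, seq)
    return hits


# Toy database: p1 contains the query, p2 shares no 3-mer with it.
database = {"p1": "MKTAYIAKQR", "p2": "GGGGGGGGGG"}
hits = two_step_search("MKTAYIAK", database)
print(hits)  # {'p1': 16}
```

Only p1 reaches the alignment step, which is where the CPU saving comes from when the database is large and most entries are unrelated.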

BER in pictures (diagram): genome.pep is searched by BLAST against niaa (non-identical amino acid); significant hits are put into a mini-db for each protein (mini-db for protein #1, mini-db for protein #2, mini-db for protein #3 ... mini-db for protein #3000); a modified Smith-Waterman alignment against each mini-db produces a file of pairwise alignments.

BER alignment from Manatee

Are all matches with equal alignment quality of equal value to annotation? NO!
- we want to see matches of our genome proteins to proteins from other species which have been experimentally characterized in that other species
- only such “characterized matches” can be used as evidence for functional annotation
- to help in our annotation process we have created a database storing accessions of proteins known to be experimentally characterized (it does not contain all such proteins, but we add to it constantly)
- our tools highlight experimentally characterized proteins to help annotators see them

BER skim from Manatee

HMMs
- Hidden Markov Model
- statistical model of the patterns of amino acids in a multiple alignment of proteins (called the “seed”) which share sequence and functional similarity
- at TIGR, each HMM is assigned to a category (called “isology type”) which describes the type of relationship the proteins in the model have to each other
  – equivalog
  – superfamily
  – subfamily
  – domain
- one can search proteins against HMMs; they receive a score indicating how well they match the model
- by comparing this score to the cutoff scores assigned to each model, one can determine whether or not the search protein is a member of the group defined by the HMM
  – “trusted cutoff”: proteins scoring above this score are considered members of the group defined by the HMM
  – “noise cutoff”: proteins scoring below this score are considered NOT to be members of the group defined by the HMM
  – for proteins scoring between trusted and noise, the HMM evidence is not sufficient to determine whether the protein is a member of the functional group or not
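The cutoff logic above amounts to a three-way decision, sketched here. The function name, the return labels and the example scores are illustrative, and treating a score exactly equal to the trusted cutoff as a member is an assumption (the slide only says "above").

```python
def classify_hmm_hit(score, trusted_cutoff, noise_cutoff):
    """Place a protein's HMM score relative to the model's two cutoffs."""
    if score >= trusted_cutoff:   # assumption: equality counts as trusted
        return "member"
    if score < noise_cutoff:
        return "not a member"
    # Between noise and trusted: the HMM alone cannot decide.
    return "undetermined"


# Hypothetical model with trusted cutoff 300 and noise cutoff 150.
print(classify_hmm_hit(350.0, 300.0, 150.0))  # member
print(classify_hmm_hit(200.0, 300.0, 150.0))  # undetermined
print(classify_hmm_hit(90.0, 300.0, 150.0))   # not a member
```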

Annotation is attached to HMMs
- TIGR00433
  – isology: equivalog
  – name: biotin synthase
  – EC:
  – gene symbol: bioB
  – TIGR role: 77 (Biotin biosynthesis)
  – GO terms: GO: (biotin synthase activity), GO: (biotin biosynthesis)
- PF04055
  – isology: domain
  – name: radical SAM domain protein
  – EC: not applicable
  – gene symbol: not applicable
  – TIGR role: 703 (enzymes of unknown specificity)
  – GO terms: GO: (catalytic activity), GO: (metabolism)

HMM section from Manatee

Things to ask yourself when using HMMs
- Does my protein score above the trusted cutoff?
- What isology type is the HMM?
- What annotation on the HMM can I use for my protein?

Genome Properties
- used to get “the big picture” of an organism, specifically to record and/or predict the presence/absence of:
  – metabolic pathways (biotin biosynthesis)
  – cellular structures (outer membrane)
  – traits (anaerobic vs. aerobic, optimal growth temperature)
- a particular property has a given “state” in each organism, for example:
  – YES: the property is definitely present
  – NO: the property is definitely not present
  – Some evidence: the property may be present and more investigation is required to make a determination
- the state of some properties can be determined computationally
  – metabolic pathway: the property is defined by several reaction steps which are modeled by HMMs; HMM matches to steps in the pathway indicate that the organism has the property
- the states of other properties must be entered manually (growth temp, anaerobic/aerobic, etc.)
- data for a particular genome is viewable in Manatee
  – links from the HMM section on the Gene Curation Page
  – links from the gene list for a role category
  – the entire list of properties and states can be viewed
- searchable across genomes on the Comprehensive Microbial Resource (CMR) site
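A minimal sketch of how a pathway property's state could be derived from HMM matches. The decision rule is an assumption, not TIGR's actual one: every step matched gives "YES", some steps matched gives "some evidence", and no matches gives "no evidence found" (the slide's "NO" state is deliberately not assigned, since missing HMM hits alone do not prove absence). Step names and HMM accessions are hypothetical.

```python
def property_state(step_hmms, genome_hmm_hits):
    """Derive a property state from per-step HMM evidence.

    step_hmms: dict mapping each pathway step to the set of HMM
               accessions that can satisfy it.
    genome_hmm_hits: set of HMM accessions matched in the genome.
    """
    matched = [step for step, hmms in step_hmms.items()
               if hmms & genome_hmm_hits]
    if len(matched) == len(step_hmms):
        return "YES"
    if matched:
        return "some evidence"
    return "no evidence found"


# Hypothetical two-step pathway; step 2 can be met by either of two HMMs.
pathway = {"step 1": {"HMM_A"}, "step 2": {"HMM_B", "HMM_C"}}
print(property_state(pathway, {"HMM_A", "HMM_C"}))  # YES
print(property_state(pathway, {"HMM_A"}))           # some evidence
print(property_state(pathway, set()))               # no evidence found
```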

Genome Property Report page from Manatee

Goals
- assign annotation to each protein
  – name, gene symbol, EC number, TIGR role, GO terms
- confirm coordinates of gene
- avoid transitive annotation

AutoAnnotate
- computationally gives preliminary annotation to each protein
- adds GO terms with IEA
  – from HMM match
  – from BER match
- AutoAnnotate is designed for a system in which all annotations are manually reviewed
- if automatic annotation were the endpoint for our projects, we would have to change AutoAnnotate to be more strict and conservative in its decisions

Knowledge about function reflected in specificity of protein names
- high confidence
  – “adenylosuccinate lyase”, purB,
- general function, lacks specificity
  – “carbohydrate kinase, FGGY family”
  – no gene symbol, partial EC number
- family designation
  – “Cbby family protein”
- homolog designation
  – “recA homolog”
- hypotheticals
  – “hypothetical protein”
  – “conserved hypothetical protein”
- “putative recA”
  – used sparingly

Sample GO trees: knowledge about function reflected in specificity of GO terms

Function
  catalytic activity
    kinase activity
      carbohydrate kinase activity
        ribokinase activity
        glucokinase activity
        fructokinase activity

Process
  metabolism
    carbohydrate metabolism
      monosaccharide metabolism
        hexose metabolism
          glucose metabolism
          fructose metabolism
        pentose metabolism
          ribose metabolism

Available evidence for 3 genes
- #1
  – HMM for ribokinase
  – match to an experimentally characterized ribokinase
- #2
  – HMM for kinase
  – match to experimentally characterized glucokinase and fructokinase
- #3
  – HMM for kinase
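One plausible way to read the three-gene example is "pick the most specific term that all of the evidence agrees on". The sketch below encodes the function tree from this slide as a child-to-parent map and applies that rule. It is an illustration only: the rule is an assumption about how an annotator would reason, the helper names are made up, and the real GO is a graph in which terms can have several parents, not a single-parent tree.

```python
# Child -> parent links for the function tree on this slide.
PARENT = {
    "kinase activity": "catalytic activity",
    "carbohydrate kinase activity": "kinase activity",
    "ribokinase activity": "carbohydrate kinase activity",
    "glucokinase activity": "carbohydrate kinase activity",
    "fructokinase activity": "carbohydrate kinase activity",
}


def path_to_root(term):
    """Return [term, parent, grandparent, ...] up to the root."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def most_specific_supported(evidence_terms):
    """Deepest term consistent with every piece of evidence."""
    if not evidence_terms:
        return None
    paths = {t: path_to_root(t) for t in evidence_terms}
    # General evidence that is an ancestor of more specific evidence
    # adds no constraint, so drop it.
    leaves = [t for t in paths
              if not any(t in paths[other][1:] for other in paths)]
    # Keep only the ancestors shared by all remaining terms.
    common = paths[leaves[0]]
    for t in leaves[1:]:
        common = [a for a in common if a in paths[t]]
    return common[0] if common else None


gene1 = most_specific_supported(["ribokinase activity", "ribokinase activity"])
gene2 = most_specific_supported(
    ["kinase activity", "glucokinase activity", "fructokinase activity"])
gene3 = most_specific_supported(["kinase activity"])
print(gene1)  # ribokinase activity
print(gene2)  # carbohydrate kinase activity
print(gene3)  # kinase activity
```

Gene #2 lands on the shared parent because its two characterized matches disagree at the most specific level, which is the undercalling behaviour the annotation goals ask for.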

translation disruptions (diagram)
- categories: authentic frameshift, authentic point mutation, degenerate, truncation, deletion, insertion, interruption, fusion, fragment
- outcomes shown: get GO terms; no GO terms; TIGR role “disrupted reading frame”

Assigning GO terms
- once we have found out all that we can about a protein, we assign GO terms to describe the protein
- things that facilitate finding a term
  – fast/easy ontology search tools
  – tools that make term suggestions
  – tools that format the evidence for you
  – tools that reduce copy/paste/typing as much as possible

Tools that suggest terms
- Mapping files
  – ec2go
  – tigrfams2go
  – interpro2go
- Manatee suggestions
  – matches to V. cholerae, B. anthracis
  – Genome Properties
  – HMMs
- Automated assignments
  – from HMMs and good pairwise matches
  – viewed as suggestions, not final annotation
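A mapping file such as ec2go can be turned into a lookup table for term suggestions. The sketch below assumes the usual layout of GO's external-to-GO mapping files (comment lines starting with "!", and data lines of the form `external id > GO:term name ; GO:id`); the function name is made up, and the sample line is written from memory for illustration and should be checked against a current ec2go file.

```python
import re


def parse_external2go(lines):
    """Map each external identifier to the list of GO ids on its lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):   # skip blanks and comments
            continue
        external_id, _, go_part = line.partition(" > ")
        # GO ids are "GO:" plus seven digits; term names do not match this.
        go_ids = re.findall(r"GO:\d{7}", go_part)
        if go_ids:
            mapping.setdefault(external_id.strip(), []).extend(go_ids)
    return mapping


sample = [
    "! illustrative excerpt, not a real file",
    "EC:2.7.1.15 > GO:ribokinase activity ; GO:0004747",
]
ec2go = parse_external2go(sample)
print(ec2go)  # {'EC:2.7.1.15': ['GO:0004747']}
```

An annotation tool could then look up a protein's EC number in this table and offer the mapped GO ids as suggestions.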

Our Manatee Tool
- prevents assignment of GO terms that are non-existent or obsolete
- knows the correct format for the evidence fields
  – allows addition of terms and evidence with one click
  – uses correct abbreviations
- rarely a need to copy and paste
- in many cases the term you need is suggested on the page somewhere already

Clicking on the various GO suggestions around the Manatee Gene Curation page puts the correct info into the correct fields in the correct format without the need to copy and paste.

Searching for terms in Manatee
- Searches of ontologies
  – go_id search (returns tree, term info)
  – GO term keyword search (searches synonyms too)
- Searches of annotations
  – protein name keyword search
  – go_id search (returns lists of proteins assigned that term)
  – correlations (input a go_id and receive a list of terms assigned in conjunction with the input term and the percent of occurrence of each correlation)
- EC number search (input EC #, return go_id)
- GO BLAST page (searches all proteins annotated to GO)
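The correlations search described above can be sketched as a co-occurrence count over existing annotations. This is an illustration of the idea, not Manatee's implementation; the function name, the data layout and the placeholder ids ("GO:A", "GO:B", "GO:C") are hypothetical.

```python
from collections import Counter


def correlated_terms(annotations, go_id):
    """Percent of proteins carrying go_id that also carry each other term.

    annotations: dict mapping protein name -> set of assigned GO ids.
    """
    with_term = [terms for terms in annotations.values() if go_id in terms]
    counts = Counter(t for terms in with_term for t in terms if t != go_id)
    return {t: 100.0 * n / len(with_term) for t, n in counts.items()}


annotations = {
    "p1": {"GO:A", "GO:B"},
    "p2": {"GO:A", "GO:B", "GO:C"},
    "p3": {"GO:A"},
    "p4": {"GO:A", "GO:B"},
    "p5": {"GO:C"},
}
result = correlated_terms(annotations, "GO:A")
print(result)  # GO:B at 75.0 percent, GO:C at 25.0 percent
```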

GO search tools in Manatee

Keeping up with GO content
- TIGR downloads the newest version of the ontologies nightly into our db for use by our tools
- periodically we check our annotations for the presence of obsolete or secondary ids and we send updates
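The periodic check for obsolete or secondary ids can be sketched as a pass over the stored annotations against the latest ontology download. The function name, the data structures and every id below are hypothetical placeholders; how TIGR's database actually stores this information is not described in the slides.

```python
def audit_annotations(annotations, obsolete_ids, secondary_to_primary):
    """List annotations whose GO id needs attention.

    annotations: iterable of (protein, go_id) pairs.
    obsolete_ids: set of ids marked obsolete in the current ontology.
    secondary_to_primary: dict mapping a secondary id to its primary id.
    Returns (protein, go_id, problem, suggested_replacement) tuples.
    """
    findings = []
    for protein, go_id in annotations:
        if go_id in obsolete_ids:
            # Obsolete terms need a curator to choose a replacement.
            findings.append((protein, go_id, "obsolete", None))
        elif go_id in secondary_to_primary:
            # Secondary ids can be swapped for their primary id.
            findings.append(
                (protein, go_id, "secondary", secondary_to_primary[go_id]))
    return findings


stored = [("gene_1", "GO:9999991"), ("gene_2", "GO:9999992"),
          ("gene_3", "GO:9999993")]
findings = audit_annotations(
    stored,
    obsolete_ids={"GO:9999991"},
    secondary_to_primary={"GO:9999992": "GO:9999994"},
)
for row in findings:
    print(row)
```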

Changing GO content
- TIGR has been contributing requests for ontology content changes continuously since we joined GO (close to 200 submissions)
- the SourceForge submission system works very well
- most requests are handled within a few days; some more complicated things may take a few weeks, and the rare really complicated thing may take a few months (again, that’s very rare, see PAMGO example)
- initially there were some aspects of the ontologies that were incorrect for prokaryotes (e.g. ATP synthase); these have been fixed as they were discovered

PAMGO effort

Future directions
- develop prok GO slim
  – use it where we now use TIGR roles
  – cease use of TIGR roles
- add more functionality to CMR GO tools
  – more refined searches
  – search across all TIGR GO data, not just prokaryotic
- use accumulated prok GO data more effectively to predict annotations for new proteins

More about Manatee
- TIGR’s main manual annotation tool
- web based
- displays all known information about a protein
- interface for entry of annotation information into the database
- open source, freely available on SourceForge for downloading and local use (manatee.sourceforge.net)
- TIGR offers a hands-on 3-day annotation course, 4 times per year, which details our annotation process, the use of Manatee, installation of Manatee, and the use of the CMR
- taught by Michelle Gwinn Giglio, Tanja Davidson, and Todd Creasy
- Next class June 28-30, Aug

The Manatee Gene Curation Page

Annotation Engine
- clients send us a DNA sequence
- we run our entire pipeline up to the point where manual curation starts
- we return a MySQL database and associated files with all of the data so the client can do manual annotation of the genome
- the client can install Manatee locally and run it using the MySQL database
- the data is kept completely confidential if that is the desire of the client
- this service allows researchers access to TIGR’s infrastructure and tools, saving them the time and expense (which they might not have) of creating infrastructure of their own

It’s all a team effort
- Owen White
- Prokaryotic annotation team
  – Bill Nelson, Bob Dodson, Scott Durkin, Sean Daugherty, Ramana Madupu, Lauren Brinkac, Steven Sullivan, Sagar Kothari
- Todd Creasy, Tanja Davidson
- Eukaryotic annotation team
- All of our tool developers
- GO group (last but definitely not least)