Microbial Genome Annotation

Nikos Kyrpides, DOE Joint Genome Institute

Two main goals of genome analysis:
- Evolutionary analysis: how does an organism compare to the rest?
- Metabolic reconstruction: what can an organism do, and how?

Overview of Annotation Steps
DNA sequence:
>Contig1
ataacaacacattagcggc
asacacacaacaggatatt
aggagagagagaaagttac

Gene finding:
- Identify genes (proteins, RNAs)
- Identify regulatory elements
- Identify repeat elements

Function prediction:
- BLAST
- Clusters (BBH, COGs, TIGRFam)
- Motifs (HMM, Pfam, InterPro)

Automatic / manual:
- Gene QC
- Gene context (fusions, operons, regulons)
- Missing genes

1. Finding the genes in microbial genomes
1. Introduction
2. Tools out there
3. Basic principles behind tools
4. Known problems of the tools: why you may need manual curation

Finding the genes in microbial genomes
Sequence features in prokaryotic genomes:
- stable RNA-coding genes (rRNAs, tRNAs, RNA component of RNaseP, tmRNA)
- protein-coding genes (CDSs)
- transcriptional features (mRNAs, operons, promoters, terminators, protein-binding sites, DNA bends)
- translational features (RBS, regulatory antisense RNAs, mRNA secondary structures, translational recoding and programmed frameshifts, inteins)
- pseudogenes (tRNA and protein-coding genes)
- …
Fairly big project in terms of number of institutes and people involved

Tools out there: finding protein-coding genes (not ORFs!)
- Reading frames: translations of the nucleotide sequence with an offset of 0, 1 and 2 nucleotides (three possible translations in each direction)
- Open reading frame (ORF): reading frame between a start and a stop codon
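The two definitions above can be made concrete with a minimal Python sketch. It assumes the standard genetic code and, by default, ATG as the only start codon (prokaryotes also use GTG and TTG, which can be passed via `starts`). The function names are illustrative and do not come from any of the tools discussed here. Note that it only enumerates ORFs, which are candidates and not yet genes.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; unknown codons become 'X', stops become '*'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Translations at offsets 0, 1 and 2 on both strands."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = translate(s[offset:])
    return frames


def find_orfs(seq, starts=("ATG",), stops=("TAA", "TAG", "TGA")):
    """List (strand, offset, nucleotides) for each start-to-stop ORF."""
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            start = None  # position of the first start codon still open
            for i in range(offset, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in starts:
                    start = i
                elif start is not None and codon in stops:
                    orfs.append((strand, offset, s[start:i + 3]))
                    start = None
    return orfs
```

For example, `find_orfs("CCATGAAATAGGG")` reports a single ORF, `ATGAAATAG`, on the forward strand at offset 2. Deciding whether such an ORF is a real CDS is the job of the gene finders listed in the following slides.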

Finding features in microbial genomes
Well-annotated bacterial genome in Artemis genome viewer:

1. Introduction
2. Tools out there
3. Basic principles behind tools
4. Known problems of the tools: why you may need manual curation

Tools out there: servers for microbial genome annotation - I
- IMG-ER (IMG-ER submission page). Output: stable RNA-encoding genes, CDSs, functional annotations; output in GenBank format
- RAST. Output: rRNAs and tRNAs, CDSs, functional annotations; output in several formats
- JCVI Annotation Service. Output: CDSs, stable RNAs? functional annotations; format?

Tools out there: servers for microbial genome annotation - II
- AMIGENE. Output: CDSs; output in gff format
- RefSeq. Output: CDSs; output in tbl format
- EasyGene. Output: CDSs; size restriction <1 Mb

Tools out there: genome browsers for manual annotation of microbial genomes
- Artemis: Windows and Linux versions; works with files in many formats, annotated by any pipeline
- Manatee: Linux versions only; genome needs to be annotated by the JCVI Annotation Service
- Argo: Windows and Linux; works with files in many formats
Major difference: viewer vs editor?

Tools out there: tools for finding stable (“non-coding”) RNAs - I
Large structural RNAs (23S and 16S rRNAs):
- RNAmmer
Small structural RNAs (5S rRNA, tRNAs, tmRNA, RNaseP RNA component):
- Rfam database, INFERNAL search tool. Web service: sequence search is limited to 2 kb
- ARAGORN. Web service: sequence search is limited to 15 kb, finds tRNAs and tmRNAs only
- tRNAScan-SE. Web service: sequence search is limited to 5 Mb, finds tRNAs only

Tools out there: tools for finding “non-coding” RNAs - II
Short regulatory RNAs:
- Rfam database, INFERNAL search tool. Web service: sequence search is limited to 2 kb; provides a list of pre-calculated RNAs for publicly available genomes
Other (less popular) tools:
- Pipeline for discovering cis-regulatory ncRNA motifs
- RNAz

Tools out there: most popular CDS-finding tools
- CRITICA
- Glimmer family (Glimmer2, Glimmer3, RBS finder)
- GeneMark family (GeneMark-hmm, GeneMarkS)
- EasyGene
- AMIGENE
- PRODIGAL (default JGI gene finder)
- Combinations and variations of the above, e.g. RAST (Glimmer2 + pre- and post-processing)

Basic principles: finding CDSs using evidence-based vs ab initio algorithms
Two major approaches to prediction of protein-coding genes:
- "evidence-based" (ORFs with translations homologous to known proteins are CDSs)
  Advantages: finds "unusual" genes (e.g. horizontally transferred); relatively low rate of false positive predictions
  Limitations: cannot find "unique" genes; low sensitivity on short genes; prone to propagation of false positive results of ab initio annotation tools
- ab initio (ORFs with nucleotide composition similar to CDSs are also CDSs)
  Advantages: finds "unique" genes; high sensitivity
  Limitations: often misses "unusual" genes; high rate of false positives
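To illustrate the ab initio idea of "composition similar to CDSs", here is a deliberately simplified Python sketch. It scores an ORF by its average per-codon log-odds against a uniform background, using codon frequencies learned from a training set of known CDSs. This is an assumption-laden toy, not the algorithm of any tool named in this talk; real gene finders use considerably richer statistical models. The function names and the pseudocount are illustrative.

```python
import math
from collections import Counter
from itertools import product

# All 64 codons over the unambiguous DNA alphabet.
ALL_CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
ALL_CODONS_SET = set(ALL_CODONS)


def codons(seq):
    """Split a sequence into consecutive in-frame codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def train_codon_model(known_cdss, pseudocount=1.0):
    """Estimate codon frequencies from known CDSs (with a pseudocount)."""
    counts = Counter()
    for cds in known_cdss:
        counts.update(c for c in codons(cds) if c in ALL_CODONS_SET)
    total = sum(counts.values()) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def coding_score(orf, model):
    """Average log-odds per codon versus a uniform 1/64 background.

    Positive values mean the ORF resembles the training CDSs in codon usage.
    """
    scored = [c for c in codons(orf) if c in model]
    if not scored:
        return 0.0
    return sum(math.log(model[c] * len(model)) for c in scored) / len(scored)
```

The sketch also shows why this approach misses "unusual" genes: a horizontally transferred gene whose codon usage differs from the training set scores low even if it is real.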

1. Introduction
2. Tools out there
3. Basic principles behind tools
4. Known problems of the tools: why you may need manual curation

Known problems: CDSs
- Short CDSs: many are missed, others are overpredicted
  - short ribosomal proteins (30-40 aa long) are often missed
  - short proteins in the promoter region are often overpredicted
- N-terminal sequences are often inaccurate (many features of the sequence around the start codon are not accounted for)
  - Glimmer2.0 calls genes longer than they should be
  - GeneMark and Glimmer3.0 mostly call genes shorter
- Pseudogenes and sequencing errors (artificial frameshifts)
  - all tools look for ORFs (which need valid start and stop codons)
  - "unique" genes are often predicted on the opposite strand of a pseudogene or a gene with a sequencing error
- Proteins with unusual translational features (recoding, programmed frameshifts)
  - these genes are often mistaken for pseudogenes
  - see pseudogenes

Known problems: CDSs Lack of Standards

Finding unique genes
- B. mallei: obligate parasite of horses
- B. pseudomallei: causes human disease in tropical areas (melioidosis)

Phylogenetic profiler finds 548 unique genes in B. mallei
However, 497 of them do in fact exist in B. pseudomallei, where they have not been called as real genes. The difference in gene models reveals an 89.2% error rate in unique genes.

GenePRIMP: Gene Prediction Improvement Pipeline
GenePRIMP is a pipeline that consists of a series of computational units that identify erroneous gene calls and missed genes and correct a subset of the identified defective features.
Applications:
- Identify gene prediction anomalies
- Benchmark the quality of gene prediction algorithms
- Benchmark the quality of combination / coverage of sequencing platforms
- Improve the sequence quality
Pati A. et al., Nature Methods, June 2010

GenePRIMP steps

Find missing genes
Intergenic regions identify missed ORFs …

… and wrong ORFs
or2654 is unique and hides a real CDS, which is acyl carrier protein

Everything looks perfect in this area of Nitrobacter winogradskyi, but …

… hides a real ORF

Guinness Book of protein-coding genes
- The longest human gene is 2,220,223 nucleotides long. It has 79 exons, with a total of only 11,058 nucleotides, which specify the sequence of 3,685 amino acids. It codes for the protein dystrophin, part of a protein complex located in the cell membrane that transfers the force generated by the actin-myosin structure inside the muscle fiber to the entire fiber.
- The smallest human gene is 252 nucleotides long. It specifies a polypeptide of 67 amino acids and codes for insulin-like growth factor II.
- The longest bacterial gene is 110,418 nucleotides long, specifying a sequence of 36,805 amino acids. Its function is unknown; it is most likely a surface protein.
- The smallest bacterial gene is 54 nucleotides long. It specifies a polypeptide of 17 amino acids and codes for a regulatory protein in cyanobacteria.
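The coding-length figures quoted above (11,058, 110,418 and 54 nucleotides) follow a simple rule: three nucleotides per amino acid, plus one stop codon that is counted in the nucleotide length but not translated. A small sketch, assuming that convention (the function name is illustrative):

```python
def protein_length(cds_nucleotides):
    """Amino acids encoded by a CDS whose length includes the stop codon."""
    if cds_nucleotides % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    return cds_nucleotides // 3 - 1  # subtract the untranslated stop codon


for nt in (11_058, 110_418, 54):
    print(nt, "nt ->", protein_length(nt), "aa")
```

The same check is a cheap sanity test in gene QC: a called CDS whose length is not a multiple of three points to a frameshift, a partial gene, or a coordinate error.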

False positives
Columns: genome name; CDSs with no hits, < 100 aa; % with tBLASTn hit; % of tBLASTn hits with frameshifts/stop codons
Prochlorococcus AS9601: 18, 88.9, 68.8
Prochlorococcus MIT 9211: 62, 40.3, 80
Prochlorococcus MIT 9215: 24, 58.3, 64.2
Prochlorococcus MIT 9301: 12, 75, 66.7
Prochlorococcus MIT 9303: 501, 83, 61.8
Prochlorococcus MIT 9313: 35, 8.6
Prochlorococcus MIT 9515: 32, 81.3, 50
Prochlorococcus NATL1A: 209, 95.2, 48.2
Prochlorococcus CCMP1375: 34, 82.4
Synechococcus PCC 7942: 39
Synechococcus CC9311: 313, 11.5, 83.3
Synechococcus CC9605: 38.6
Synechococcus CC9902: 21, 57.1, 100
Synechococcus JA-2-3Ba: 176, 26.7, 85.1
Synechococcus JA-3-3Ab: 142, 35.2, 92
Synechococcus PCC 7002: 93, 17.2, 56.3
Synechococcus RCC307: 184, 10.3, 68.4
Synechococcus WH 7803: 18.8
Synechococcus WH 8102: 38.4, 46.7

2. Finding the functions in microbial genomes
1. Introduction
2. Tools out there
3. Known problems

What is function?
Example: cobalamin biosynthetic enzyme, cobalt-precorrin-4 methyltransferase (CbiF)
- molecular/enzymatic (methyltransferase)
  - reaction (methylation)
  - substrate (cobalt-precorrin-4)
  - ligand (S-adenosyl-L-methionine)
- metabolic (cobalamin biosynthesis)
- physiological (maintenance of healthy nerve and red blood cells, through B12)

Functional characterization

Computational approaches to Functional characterization

Sequence Homology
Two sequences are homologous if there existed a molecule in the past that is ancestral to both of the sequences.
Types of homology:
- Orthology: bifurcation in molecular tree reflects speciation
- Paralogy: bifurcation in molecular tree reflects gene duplication

Homology & analogy
The term homology is confounded & abused in the literature!
- sequences are homologous if they're related by divergence from a common ancestor
- analogy relates to the acquisition of common features from unrelated ancestors via convergent evolution
  - e.g., β-barrels occur in soluble serine proteases & integral membrane porins; chymotrypsin & subtilisin share groups of catalytic residues, with near identical spatial geometries, but no other similarities
Homology is not a measure of similarity & is not quantifiable
- it is an absolute statement that sequences have a divergent rather than a convergent relationship
- the phrases "the level of homology is high" or "the sequences show 50% homology", or any like them, are strictly meaningless!
Homology: bat's wing and human's hand
Analogy: bat's wing and butterfly's wing

Function prediction: function transfer by homology
Homology implies a common evolutionary origin, not retention of similarity in any of their properties.
Homology ≠ similarity of function.
Punta & Ofran. PLOS Comp Biol. 2008

Dos and Don’ts Type Don’t Do Homology Same function
Probability for same function Orthology Paralogy Sequence similarity High sequence similarity Same sequence

Application areas of analysis tools
The scale indicates % identity between aligned sequences.
- Alignment of 2 random sequences can produce ~20% identity
- Less than 20% identity does not constitute a significant alignment
- Around this threshold is the Twilight Zone, where alignments may appear plausible to the eye but can't be proved by conventional methods
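As a reminder of what the % identity scale measures, here is a minimal Python sketch that computes it from two already-aligned sequences. It assumes one common convention, in which columns containing a gap in either sequence are excluded from the denominator; other conventions exist and give different numbers. The function name is illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over gap-free columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue.
    columns = [
        (x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap
    ]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


print(percent_identity("ACDEFGHIKL", "ACDEYGHIKV"))  # 8 of 10 columns match
```

The value is a measure of similarity only. As the preceding slides stress, it says nothing quantitative about homology, and values near 20% cannot be distinguished from random alignments on this basis alone.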

Finding the functions in microbial genomes
1. Introduction
2. Tools out there
3. Known problems

Function prediction
- Similarity searches (BLAST)
- Domain identification (Pfam)
- Small sequence identification (PROSITE)

What if nothing is similar?
- Subcellular localization
- Gene context
- Structure
- Prediction of binding residues (DISIS, bindN)
[Figure labels: Cytoplasm, S ~ S, Periplasm]

Annotation should make sense
[Figure: a model pathway (Substrates A, B, C, D; Enzymes 1, 2, 3) compared with the genome annotation, in which Enzyme 1 and Enzyme 3 are present and Enzyme 2 is marked "?"]

Annotation should make sense

Databases
Databases used for the analysis of biological molecules.
Databases contain information organized in a way that allows users/researchers to retrieve and exploit it.
Why bother?
- Store information.
- Organize data.
- Predict features (genes, functions ...).
- Predict the functional role of a feature (annotation).
- Understand relationships (metabolic reconstruction).

Primary nucleotide databases
EMBL/GenBank/DDBJ
Archive containing all sequences from:
- genome projects
- sequencing centers
- individual scientists
- patent offices
The sequences are exchanged between the three centers on a daily basis.
Database is doubling every 10 months.
Sequences from >140,000 different species.
1,400 new species added every month.
Database name: nt / nr
Columns: Year; Base pairs; Sequences
,575,745,176; 40,604,319
,037,734,462; 52,016,762
,019,290,705; 64,893,747
,874,179,730; 80,388,382
,116,431,942; 98,868,465

Primary protein sequence databases
Contain coding sequences derived from the translation of nucleotide sequences.
- GenBank: valid translations (CDS) from nt GenBank entries.
- UniProtKB/TrEMBL (1996): automatic CDS translations from EMBL. TrEMBL Release 40.3 (26-May-2009) contains 7,916,844 entries.

RefSeq
- Curated transcripts and proteins: reviewed by NCBI staff.
- Model transcripts and proteins: generated by computer algorithms.
- Assembled genomic regions (contigs).
- Chromosome records.

Classification databases
Groups (families/clusters) of proteins based on…
- Overall sequence similarity.
- Local sequence similarity.
- Presence / absence of specific features.
- Structural similarity.
- ...
These groups contain proteins with similar properties.
- Specific function, enzymatic activity.
- Broad function.
- Evolutionary relationship.
- …

Overall sequence similarity

Clusters of orthologous groups (COGs)
- COGs were delineated by comparing protein sequences encoded in 43 complete genomes representing 30 major phylogenetic lineages.
- Each cluster has representatives of at least 3 lineages.
- A function (specific or broad) has been assigned to each COG.

How it works
- Reciprocal best hit (bidirectional best hit)
- BLAST best hit (unidirectional best hit)
[Figure: best hits between members of COG1 and COG2]
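The distinction between a unidirectional and a bidirectional best hit can be sketched in a few lines of Python. The input format here is an assumption made for illustration: each search result is a `(query, subject, score)` tuple, for example parsed from tabular BLAST output, with a higher score meaning a better hit. This is not the actual COG construction procedure, only the pairwise BBH test it builds on.

```python
def best_hits(hits):
    """Map each query to its single highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores for proteins of genome A searched against genome B
# and vice versa.
a_vs_b = [("a1", "b1", 200), ("a1", "b2", 50), ("a2", "b1", 150)]
b_vs_a = [("b1", "a1", 190), ("b2", "a2", 40)]
print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In this toy data `a2` has `b1` as its best hit, but `b1` prefers `a1`, so `a2` to `b1` is only a unidirectional best hit and is not reported.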

Profiles & Pfam
A method for classifying proteins into groups exploits region similarities, which contain valuable information (domains/profiles).
These domains/profiles can be used to detect distant relationships, where only a few residues are conserved.

Regions similarity

Pfam
HMMs of protein alignments, either local (for domains) or global (cover whole protein)

PROSITE http://au.expasy.org/prosite/
R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)
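A PROSITE pattern such as the one above reads almost like a regular expression: `-` separates positions, `x` is any residue, `[DT]` is a set of allowed residues, and `(3)` is a repeat count. The Python sketch below converts the common subset of this syntax into a regex. It is a simplified illustration: it handles `x`, single residues, `[...]`, `{...}` exclusions and `(n)` or `(n,m)` repeats, but not the `<` and `>` anchors. The function name is made up for this example.

```python
import re

# One pattern element: a core (x, residue, [set] or {excluded set}),
# optionally followed by a repeat such as (3) or (2,4).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)")
print(regex)
print(bool(re.search(regex, "MKRYADWGLSTPLIVK")))
```

Because such a pattern is short and exact, a match is weak evidence on its own, which fits the later caveat that it is up to the user to decide whether an annotation is correct.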

KEGG orthology

Composite pattern databases
To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource: InterPro.
- Release 32.0 (Apr 11) contains entries
- Central annotation resource, with pointers to its satellite dbs

* It is up to the user to decide if the annotation is correct *

KEGG
Contains information about biochemical pathways and protein interactions.

Summary
- We have main archives (GenBank), curated databases (RefSeq, SwissProt), and protein classification databases (COG, Pfam).
- This is only the tip of the iceberg of databases.
- They help predict the function, or the network of functions.
- Systems are required that integrate the information from several databases, visualize it, and allow handling of data in an intuitive way.

Functional annotation in IMG
Automated protein product assignment pipeline
Functional context in IMG:
- KEGG Pathways, Modules, KEGG Orthology
- MetaCyc Pathways
- IMG Pathways
No longer maintained:
- TIGR Role Categories
- TIGR Genome Properties
- COG Functional Categories

Lack of Standards
