Functional Annotation Background and Strategy

Adrian Lawsin, Dhruviben Patel, Gayathri Suresh Kumar, Jeffrey V Porubsky, Krutika Satish Gaonkar, Lori Gladney, Xingyu Yang

Our agenda for today:
- Objectives
- Introduction and background
- Functional annotation methods
- Gene Ontology identification tool
- Pipeline
- Our strategy

Objectives
- Annotate the genomes of the 25 Neisseria meningitidis strains provided by the gene prediction group, using the .GFF files that contain the predicted locations of protein-encoding genes.
- Provide the annotated .GFF files to the comparative genomics group, which may use them for strain comparisons and discovery of unique features.

Functional Annotation
What is functional annotation and where does it fit in the process of computational genomics?
1. Assembly: piece together the puzzle (reads) into one or more contiguous sequences (contigs).
2. Gene Prediction: predict the location of protein-coding genes and non-coding sequence.
3. Functional Annotation: assign features such as gene names, protein functions and structures.
4. Comparative Genomics: compare features in different strains or look for unique features. The options are endless!

What is the biological importance of determining function?
Proteins:
- are molecules that are involved in cellular functions
- have specific biological functions, which may include providing structural support to the cell, movement, transport of nutrients, release of toxins and many more
- vary in structure and function
- are constructed from 20 amino acids and have distinct three-dimensional shapes

Central Dogma of Biology
DNA sequence, arranged in sets of genes (genotype) -> [transcription] -> RNA -> [translation] -> proteins responsible for observable characteristics (phenotype)

What is functional annotation?
Functional annotation aims to assign a biological meaning to genomic sequence, particularly the protein-coding sequence:
- Biochemical function (molecules and their interactions)
- Biological function (in terms of a biological process)
- Gene regulation and interactions (how genes work together)
- Gene expression (to produce a phenotype)

Hierarchy of Functional Annotation
DNA -> Genes -> Operons -> Proteins -> Function
Functional features: pathways, domains, motifs, transmembrane regions, signal peptides
Operons are sets of genes that are regulated together and code for proteins that interact to carry out particular functions in a biological pathway.

Operons
- The regulatory gene codes for the repressor protein.
- The promoter site is the attachment site for RNA polymerase.
- The operator site is the attachment site for the repressor protein.
- The repressor protein is a DNA- or RNA-binding protein that inhibits the expression of one or more genes by binding to the operator site.
- The structural genes code for the proteins.

Snapshot of Protein Functions

Functional Annotation Methods
- Ab initio (intrinsic) methods are based on the intrinsic characteristics of the genes and proteins (i.e. sequence composition). Examples: LipoP, SignalP, TMHMM.
- Extrinsic methods are homology-based: gene sequences are compared to databases of known genes using alignment algorithms such as BLAST.
  - Most widely used annotation strategy
  - Databases may include experimentally and computationally determined genes/proteins
  - Examples: Blast2GO, InterProScan, RPS-BLAST

Functional Annotation Methods
- Intrinsic (ab initio) methods: SignalP, LipoP, TMHMM
- Extrinsic (homology-based) methods: BLAST, InterProScan
- Automatic annotation pipelines: RAST, Prokka, PredictProtein
- Gene Ontology identification tool: Blast2GO

Intrinsic characteristics
A transmembrane protein is a type of protein that spans from one side of a membrane through to the other side of the membrane. They function as channels that may conform to a particular shape to allow or deny the transport of specific substances in and out of the cell. Signal peptides direct the protein where it should go, which could be in or out of the cell. Examples where these structures are important are secretion (bacteria can secrete toxins out of the cell).

TMHMM (a hidden Markov model for predicting transmembrane helices in proteins)
Combines existing knowledge about transmembrane helices into a single model. A transmembrane protein spans the membrane from one side to the other.

Important characteristics for making the annotation:
- Helix core
- Cap (on both sides of the helix)
- Short loop
- Globular region

Helices

Transition Probability: Knowledge Driven
- Hidden states: i (inside), o (outside), M (membrane helix)
- Observed symbols: amino acids (AA)
- Transition probabilities: knowledge driven
- Standard inference technique
Usage: cat xxx.faa | tmhmm
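To make the hidden-state idea concrete, here is a minimal Viterbi sketch over the three hidden states named above (i, o, M) with amino acids as observations. Every probability and the hydrophobic residue set are invented for illustration; they are not TMHMM's parameters. The real model also uses a richer architecture (helix core, caps, loops) that enforces alternating inside/outside topology, which this toy version does not.

```python
import math

# Toy HMM: i = inside loop, M = membrane helix, o = outside loop.
# All numbers below are illustrative assumptions, not TMHMM parameters.
STATES = ("i", "M", "o")
HYDROPHOBIC = set("AILMFVWC")

START = {"i": 0.5, "M": 0.0, "o": 0.5}
TRANS = {
    "i": {"i": 0.90, "M": 0.10, "o": 0.00},
    "M": {"i": 0.05, "M": 0.90, "o": 0.05},
    "o": {"i": 0.00, "M": 0.10, "o": 0.90},
}


def emission(state, residue):
    """Probability of observing a hydrophobic or polar residue in a state."""
    hydrophobic = residue in HYDROPHOBIC
    if state == "M":
        return 0.9 if hydrophobic else 0.1
    return 0.2 if hydrophobic else 0.8


def log(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(sequence):
    """Return the most probable state path (one state letter per residue)."""
    if not sequence:
        return ""
    scores = [{s: log(START[s]) + log(emission(s, sequence[0])) for s in STATES}]
    back = []
    for residue in sequence[1:]:
        prev = scores[-1]
        column, pointers = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + log(TRANS[p][s]))
            column[s] = prev[best] + log(TRANS[best][s]) + log(emission(s, residue))
            pointers[s] = best
        scores.append(column)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

Calling `viterbi("KKKK" + "L" * 20 + "KKKK")` labels the leucine stretch as M and the lysine flanks as loop states. Which flank is called i and which o is arbitrary in this symmetric toy model.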

SignalP: Signal Peptide Predictor (latest: 4.1)

Signal peptide definition:
- Broad sense: any signal embedded in the amino acid sequence of a protein.
- Narrow sense: an N-terminal signal that makes the protein traverse the ER membrane (eukaryotes) or the plasma membrane (prokaryotes).
SignalP uses the narrow sense.
If a protein has a signal peptide, it only means that the protein enters the secretory pathway (not all such proteins are secreted):
- Eukaryotes: the protein can be retained in the ER, Golgi apparatus or plasma membrane, or be directed to the lysosome/vacuole.
- Gram-positive bacteria: the protein can have transmembrane helices or be attached to the cell wall.
- Gram-negative bacteria: the protein can have transmembrane helices, be retained in the periplasm, or be inserted in the outer membrane as a β-barrel transmembrane protein.

With v4.1, SignalP is based on a neural network (NN) method, not on a hidden Markov model (HMM) as in previous versions. A neural network is a machine-learning model for statistical pattern recognition: its "neurons" are connected computing units, and a learning algorithm sets the weights of the connections from training data so that the network approximates a non-linear function.

Machine Learning techniques have been used to predict subcellular locations for almost 20 years
- Improvements in the data sets and in the annotation of proteins (cleavage sites) significantly improve performance.
- SignalP includes the notion that cleavage site position and amino acid composition of the signal peptide are correlated.
- SignalP predicts the presence of signal peptidase I cleavage sites, not the signal peptidase II cleavage sites found in lipoproteins (see LipoP).

Example Output # SignalP-4.1 euk predictions
>sp_Q9BS26_ERP44_HUMAN Endoplasmic reticulum resident protein 44 OS_Homo sapiens GN_ERP44 PE_1 SV_1

- C-score (raw cleavage site score): the output from the CS networks, which are trained to distinguish signal peptide cleavage sites from everything else. Note the position numbering of the cleavage site: the C-score is trained to be high at the position immediately after the cleavage site (the first residue in the mature protein).
- S-score (signal peptide score): the output from the SP networks, which are trained to distinguish positions within signal peptides from positions in the mature part of the proteins and from proteins without signal peptides.
- Y-score (combined cleavage site score): a combination (geometric average) of the C-score and the slope of the S-score, resulting in a better cleavage site prediction than the raw C-score alone. Multiple high-peaking C-scores can be found in one sequence, but only one is the true cleavage site; the Y-score distinguishes between C-score peaks by choosing the one where the slope of the S-score is steep.
- mean S: the average S-score of the possible signal peptide (from position 1 to the position immediately before the maximal Y-score).
- D-score (discrimination score): a weighted average of the mean S and the max. Y scores. This is the score that is used to discriminate signal peptides from non-signal peptides.
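The way the scores combine can be sketched in a few lines. This is only an illustration of the definitions above: the window used to estimate the slope of the S-score and the weight in the D-score are assumptions chosen for the example, not the values SignalP uses, and the input numbers in the usage note are made up.

```python
import math


def y_scores(c, s, window=3):
    """Y[i] = sqrt(C[i] * slope of S at i).

    The slope is estimated as (mean S over `window` positions before i)
    minus (mean S over `window` positions starting at i). The window size
    is an assumption made for this sketch.
    """
    ys = []
    for i in range(len(c)):
        before = s[max(0, i - window):i]
        after = s[i:i + window]
        if not before or not after:
            ys.append(0.0)
            continue
        slope = sum(before) / len(before) - sum(after) / len(after)
        ys.append(math.sqrt(c[i] * max(slope, 0.0)))
    return ys


def d_score(c, s, window=3, weight=0.5):
    """Return (index of max Y, D-score).

    The index is 0-based and points at the first residue of the mature
    protein. D is a weighted average of mean S (over the putative signal
    peptide) and max Y; `weight` is an illustrative assumption.
    """
    ys = y_scores(c, s, window)
    cleavage = max(range(len(ys)), key=ys.__getitem__)
    mean_s = sum(s[:cleavage]) / cleavage if cleavage else 0.0
    return cleavage, weight * mean_s + (1 - weight) * ys[cleavage]
```

With a made-up S-score profile that drops from 0.9 to 0.1 at index 5, and C-score peaks at indices 2 and 5, the peak at index 2 gets a Y-score of about zero because S is flat there. Only the peak at index 5, where S falls steeply, survives.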

LipoP
Produces predictions of lipoproteins and discriminates between lipoprotein signal peptides, other signal peptides and N-terminal membrane helices.

LipoP Example
- SpI: signal peptide (signal peptidase I)
- SpII: lipoprotein signal peptide (signal peptidase II)
- TMH: N-terminal transmembrane helix
- CYT: cytoplasmic (in practice, everything else)
- CleavI: cleavage sites for signal peptidase I
- CleavII: cleavage sites for signal peptidase II

Extrinsic characteristics
These characteristics of a sequence comparison suggest that genes/proteins are related by shared ancestry, although they may or may not have a similar function:
- High sequence identity between the query and the subject
- High coverage of the query by the subject
- Low E-value: the E-value is the number of alignments with an equal or better score expected by chance in a database of that size, so a low value means the match is unlikely to be a chance hit
*Criteria (thresholds) must be established to aid in gene naming.

Advantages and Disadvantages
- Homology does not equate to the same function; however, orthology may indicate a probability of the same function. Orthologs are genes in different species that have diverged by vertical descent from a common ancestor.
- Databases may be contaminated with bad sequences, incorrect gene predictions and incorrect annotations. Judge wisely!

BLAST “The BLAST algorithm was written balancing speed and increased sensitivity for distant sequence relationships. Instead of relying on global alignments (commonly seen in multiple sequence alignment programs) BLAST emphasizes regions of local alignment to detect relationships among sequences which share only isolated regions of similarity (Altschul et al., 1990). Therefore, BLAST is more than a tool to view sequences aligned with each other or to calculate percent homology, but a program to locate regions of sequence similarity with a view to comparing structure and function. “ NCBI

BLAST options
- blastp compares an amino acid query sequence against a protein sequence database.
- blastn compares a nucleotide query sequence against a nucleotide sequence database.
- blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
- tblastn compares a protein query sequence against a nucleotide sequence database translated in all reading frames.
- tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

BLAST: a fast heuristic that approximates the Smith-Waterman local alignment algorithm

BLAST Steps
1. Remove low-complexity regions and sequence repeats from the query sequence. A "low-complexity region" is a region of a sequence composed of few kinds of elements. Such regions can produce high scores that prevent the program from finding the actually significant sequences in the database, so they should be filtered out. The regions are marked with X (protein sequences) or N (nucleic acid sequences) and then ignored by BLAST. The SEG program filters low-complexity regions in protein sequences and DUST does so in DNA sequences; XNU masks tandem repeats in protein sequences.
2. Make a k-letter word list of the query sequence. Taking k=3 as an example, list the words of length 3 in the query protein sequence sequentially (k is usually 11 for a DNA sequence) until the last letter of the query sequence is included.
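Step 2 amounts to sliding a window of length k along the query. A minimal sketch (the function name and the example query are illustrative):

```python
def word_list(query, k=3):
    """List every overlapping k-letter word of the query, left to right."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]
```

For the query `PQGEFG` and k=3 this gives `PQG`, `QGE`, `GEF`, `EFG`. A query shorter than k yields an empty list.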

BLAST Steps 3. List the possible matching words.
BLAST only cares about the high-scoring words. The scores are created by comparing the word in the list in step 2 with all the 3-letter words. By using the scoring matrix (substitution matrix) to score the comparison of each residue pair, there are 20^3 possible match scores for a 3-letter word. For example, the score obtained by comparing PQG with PEG and PQA is 15 and 12, respectively. For DNA words, a match is scored as +5 and a mismatch as -4, or as +2 and -3. After that, a neighborhood word score threshold T is used to reduce the number of possible matching words. The words whose scores are greater than the threshold T will remain in the possible matching words list, while those with lower scores will be discarded. For example, PEG is kept, but PQA is abandoned when T is 13.
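The PQG example can be reproduced with a small scoring sketch. Only the handful of BLOSUM62 entries needed for this example are included (a full implementation would load the whole matrix), and the function names are illustrative.

```python
# Excerpt of BLOSUM62, keyed by alphabetically sorted residue pairs.
# Only the entries needed for the PQG example are listed here.
BLOSUM62_EXCERPT = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("E", "Q"): 2,
    ("G", "G"): 6,
    ("A", "G"): 0,
}


def pair_score(a, b):
    """Substitution score for one residue pair (order-independent)."""
    return BLOSUM62_EXCERPT[tuple(sorted((a, b)))]


def word_score(query_word, candidate):
    """Sum of residue-pair scores for two words of equal length."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Keep the candidate words scoring above the threshold T."""
    return [w for w in candidates if word_score(query_word, w) > threshold]
```

Here `word_score("PQG", "PEG")` is 7 + 2 + 6 = 15 and `word_score("PQG", "PQA")` is 7 + 5 + 0 = 12, so with T = 13 only PEG is kept.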

BLAST Steps
4. Organize the remaining high-scoring words into an efficient search tree. This allows the program to rapidly compare the high-scoring words to the database sequences.
5. Repeat steps 3 and 4 for each k-letter word in the query sequence.
6. Scan the database sequences for exact matches with the remaining high-scoring words. BLAST scans the database sequences for each remaining high-scoring word, such as PEG, at each position. If an exact match is found, it is used to seed a possible ungapped alignment between the query and database sequences.

BLAST Steps
7. Extend the exact matches to high-scoring segment pairs (HSPs). The original version of BLAST extends the alignment between the query and the database sequence to the left and to the right of the position where the exact match occurred. The extension continues until the accumulated total score of the HSP begins to decrease. A simplified example is presented in Fig. 2.
Fig. 2 The process to extend the exact match.
Fig. 3 The positions of the exact matches.
To save time, a newer version of BLAST, called BLAST2 or gapped BLAST, has been developed. BLAST2 adopts a lower neighborhood word score threshold to maintain the same level of sensitivity for detecting sequence similarity, so the list of possible matching words in step 3 becomes longer. Next, exact-match regions that lie within distance A of each other on the same diagonal in Fig. 3 are joined into a longer region. Finally, the new regions are extended by the same method as in the original version of BLAST, and the scores of the extended HSPs are computed with a substitution matrix as before.
8. List all of the HSPs in the database whose score is high enough to be considered, that is, the HSPs whose scores are greater than the empirically determined cutoff score S. By examining the distribution of alignment scores obtained by comparing random sequences, a cutoff score S can be chosen that is large enough to guarantee the significance of the remaining HSPs.
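The extension in step 7 can be sketched for the rightward direction (leftward extension is symmetric). This sketch uses the +5/-4 DNA scores mentioned in step 3 and a drop-off rule: it stops once the running score has fallen more than `x_drop` below the best score seen. The drop-off value and the function name are assumptions made for the example, not BLAST's actual parameters.

```python
def extend_right(query, subject, q_start, s_start, word_len,
                 match=5, mismatch=-4, x_drop=10):
    """Extend an exact seed match to the right without gaps.

    Returns (best score, length of the best-scoring segment). The seed
    starts at q_start in the query and s_start in the subject and is
    assumed to match exactly over word_len letters.
    """
    score = best = word_len * match
    length = best_len = word_len
    while q_start + length < len(query) and s_start + length < len(subject):
        same = query[q_start + length] == subject[s_start + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score has dropped too far below the best seen; stop
    return best, best_len
```

For the made-up pair `ACGTACGTTTTT` and `ACGTACGAAAAA` with a 4-letter seed at position 0, the segment grows to 7 matching letters (score 35) and the extension stops after three consecutive mismatches.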

BLAST Steps

InterProScan: software that scans protein or nucleotide sequences against InterPro's signatures (predictive models provided by several different member databases)

InterProScan
- Provides homology-based annotations and GO terms; many of its signatures are hidden Markov models (HMMs)
- Provides functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites
- Relies on a large number of sources for its annotations: Gene3D, SUPERFAMILY, PIRSF, TIGRFAMs, PANTHER, Pfam, SMART, PRINTS, HAMAP, PROSITE, ProDom
- Can be run standalone or from the web server
- Input: nucleotide or protein sequences (raw, FASTA or EMBL format)
- Output: raw, HTML or GFF3 format

HMMs
- SMART: annotation of genetically mobile domains
- TIGRFAMs: collection of protein families identifying functionally related proteins based on sequence homology
- SUPERFAMILY: database of structural and functional annotation for all proteins and genomes
- PIRSF: superfamilies based on evolutionary relationships
- Pfam: large collection of multiple sequence alignments and hidden Markov models covering many common protein domains
- Gene3D: database describing protein families and domain architectures in complete genomes
- PANTHER (Protein ANalysis THrough Evolutionary Relationships): classification system designed to classify proteins (and their genes) in order to facilitate high-throughput analysis

- PRINTS: protein fingerprints, where a fingerprint is a group of conserved motifs used to characterise a protein family or domain
- PROSITE: biologically significant sites, patterns and profiles that help identify the known family to which a new sequence belongs
- HAMAP: identification of protein families
- ProDom: protein domain database for the identification of homologous domains

Automatic Annotation Pipelines

RAST: Rapid Annotation using Subsystem Technology (www.nmpdr.org)
- The subsystem technology relies on expert assertion of the functional roles that make up each subsystem
- The service identifies protein-encoding, rRNA and tRNA genes
- Assigns functions to genes and predicts which subsystems are represented in the genome
- Uses this information to reconstruct the metabolic network
FIGfams: yet another set of protein families. Two proteins are placed in the same family:
- if both are present in the manually curated subsystem spreadsheet and have 70% similarity, or
- if they come from closely related genomes and their adjacent genes are seen to correspond

Annotation
- tRNA: tRNAscan-SE
- rRNA: a search tool written by Niels Larsen
- Protein-encoding genes: GLIMMER2
- Targeted search using closely related genomes
- Process the remaining genes against the FIGfams and the non-redundant protein database
- Construct a metabolic reconstruction

Prokka
Prokka is a software tool for the rapid annotation of prokaryotic genomes. It builds on:
- BioPerl
- NCBI BLAST+: similarity searching against protein sequence libraries
- Aragorn: finds transfer RNA (tRNA) features. The program employs heuristic algorithms to predict tRNA secondary structure, based on homology with recognized tRNA consensus sequences and the ability to form a base-paired cloverleaf
- Infernal ("INFERence of RNA ALignment"): searches DNA sequence databases for RNA structure and sequence similarity
- Prodigal (prokaryotic gene recognition and translation initiation site identification): finds protein-coding features (CDS)
- HMMER3: similarity searching against protein family profiles

Annotation
Organism details:
  --genus [X]      Genus name (default 'Genus')
  --species [X]    Species name (default 'species')
  --strain [X]     Strain name (default 'strain')
  --plasmid [X]    Plasmid name or identifier (default '')
Annotations:
  --kingdom [X]    Annotation mode: Archaea|Bacteria|Viruses (default 'Bacteria')
  --gcode [N]      Genetic code / Translation table (set if --kingdom is set) (default '0')
  --gram [X]       Gram: -/neg +/pos (default '')
  --usegenus       Use genus-specific BLAST databases (needs --genus) (default OFF)
Computation:
  --fast           Fast mode - skip CDS /product searching (default OFF)
  --cpus [N]       Number of CPUs to use [0=all] (default '8')
  --evalue [n.n]   Similarity e-value cut-off (default '1e-06')
  --rfam           Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
  --norrna         Don't run rRNA search (default OFF)
  --notrna         Don't run tRNA search (default OFF)
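For a batch of 25 strains it helps to build the Prokka command line programmatically. The sketch below only assembles the argument list and does not run anything. The file names are hypothetical, and `--outdir` and `--prefix` are not among the options listed above: they are assumed here from Prokka's documented usage, so check `prokka --help` for your installed version.

```python
def prokka_command(contigs, outdir, prefix, genus="Neisseria",
                   species="meningitidis", strain=None, cpus=8,
                   evalue="1e-06"):
    """Build a Prokka argument list for one assembly (nothing is executed)."""
    cmd = [
        "prokka",
        "--outdir", outdir,    # assumed flag, not shown on the slide
        "--prefix", prefix,    # assumed flag, not shown on the slide
        "--kingdom", "Bacteria",
        "--genus", genus,
        "--species", species,
        "--cpus", str(cpus),
        "--evalue", evalue,
    ]
    if strain:
        cmd += ["--strain", strain]
    cmd.append(contigs)  # input contigs FASTA goes last
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.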

Output
- .gff: the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
- .fna: nucleotide FASTA file of the input contig sequences.
- .faa: protein FASTA file of the translated CDS sequences.
- .ffn: nucleotide FASTA file of all the annotated sequences, not just CDS.
- .sqn: an ASN1 format "Sequin" file for submission to GenBank. It needs to be edited to set the correct taxonomy, authors, related publication etc.
- .err: unacceptable annotations (the NCBI discrepancy report).
- .txt: statistics relating to the annotated features found.
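Since the .gff file is the hand-off format between groups, a small parser for one GFF3 feature line is a useful building block. This is a minimal sketch of the standard nine-column layout; the example line in the usage note is invented, not real output.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one tab-separated GFF3 feature line into a dict.

    Comment and directive lines (starting with '#') should be skipped by
    the caller before calling this function.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for item in record["attributes"].split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    record["attributes"] = attributes
    return record
```

For a made-up line such as `contig_1<TAB>Prodigal<TAB>CDS<TAB>100<TAB>400<TAB>.<TAB>+<TAB>0<TAB>ID=gene_0001;product=hypothetical protein`, the function returns the coordinates as integers and the attributes as a dictionary.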

PredictProtein 20 Methods-In-One

Began by just predicting secondary (2°) structure and looking up protein families
Now calculates:
- Multiple sequence alignment (PSI-BLAST against the SWISS-PROT, TrEMBL and PDB databases)
- ProSite sequence motifs
- Domain assignment (CHOP)
- Low-complexity regions (SEG)
- Nuclear localisation signals (PredictNLS)
- Secondary structure (PHDsec or PROFsec)
- Solvent accessibility (PHDacc or PROFacc)
- Globular regions (GLOBE)
- Transmembrane helices (PHDhtm)
- Coiled-coil regions (COILS)
- Structural switch regions (ASP)
- B-value (PROFBVAL)
- Disordered regions (NORSp)
- Inter-residue contacts (PROFcon)
- Protein-protein and protein-DNA binding sites (PROFISIS)
- Sub-cellular localization (LOCTREE)
- Beta barrels (PROFTMB)
- Cysteine predictions and disulphide bridges (DISULFIND)

Example Output

Gene Ontology Identification tool:
BLAST2GO

Gene Ontology
A major bioinformatics initiative that unifies the representation of gene and gene product attributes across all species.
The project aims to:
- Maintain and develop its controlled vocabulary of gene and gene product attributes
- Annotate genes and gene products, and assimilate and disseminate annotation data
- Provide tools for easy access to all aspects of the data provided by the project

Gene Ontology consortium and The ontologies
- The GO Consortium is the set of model organism and protein databases and biological research communities actively involved in the development and application of the Gene Ontology.
- It began in 1998 as a collaboration between three model organism databases: FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD).
- The consortium has grown since then and now includes many databases, among them several of the world's major repositories for plant, animal and microbial genomes.

contd…
The Gene Ontology project provides an ontology of defined terms representing gene product properties. The ontology covers three domains:
- Cellular component (the parts of a cell or its extracellular environment)
- Molecular function (the elemental activities of a gene product at the molecular level, such as binding or catalysis)
- Biological process (operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms)
For example, the gene product cytochrome c can be described by the molecular function term oxidoreductase activity, the biological process terms oxidative phosphorylation and induction of cell death, and the cellular component terms mitochondrial matrix and mitochondrial inner membrane.

GO terms
Each GO term within the ontology has:
- a term name, which may be a word or string of words
- a unique alphanumeric identifier
- a definition with cited sources
- a namespace indicating the domain to which it belongs
Terms may also have synonyms, which are classed as being exactly equivalent to the term name, broader, narrower, or related; references to equivalent concepts in other databases; and comments on term meaning or usage.
Example of GO term
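The fields of a GO term map naturally onto a small record type. The class below is an illustrative sketch, not part of any GO library. The example uses GO:0016491 (oxidoreductase activity, the molecular function mentioned for cytochrome c) and leaves the definition empty rather than paraphrasing the official text.

```python
import re
from dataclasses import dataclass

NAMESPACES = {"biological_process", "molecular_function", "cellular_component"}


@dataclass(frozen=True)
class GOTerm:
    """Minimal GO term record (illustrative, not an official schema)."""
    go_id: str              # unique identifier, e.g. "GO:0016491"
    name: str               # term name
    namespace: str          # one of the three GO domains
    definition: str = ""    # definition text with cited sources
    synonyms: tuple = ()    # exact, broader, narrower or related synonyms

    def __post_init__(self):
        if not re.fullmatch(r"GO:\d{7}", self.go_id):
            raise ValueError("GO identifiers look like GO:0000000")
        if self.namespace not in NAMESPACES:
            raise ValueError("unknown namespace: %s" % self.namespace)


term = GOTerm("GO:0016491", "oxidoreductase activity", "molecular_function")
```

Validating the identifier and namespace at construction time catches malformed records early, before they reach downstream comparisons.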

Annotation and GO
- Genome annotation is the practice of capturing data about a gene product; GO annotations use terms from the GO ontology to do so.
- The members of the GO Consortium submit their annotations for integration and dissemination on the GO website, where they can be downloaded directly or viewed online using AmiGO.
- In addition to the gene product identifier and the relevant GO term, GO annotations have the following data:
  - The reference used to make the annotation
  - An evidence code denoting the type of evidence upon which the annotation is based
  - The date and the creator of the annotation
Example annotation:

Problems faced during GO analysis
When performing GO-based analysis in poorly characterized organisms, the existing GO tools show the following limitations:
- They are not designed for high-throughput sequence annotation
- They are limited in their mining and visualization capabilities
- They accept only gene or probe identifiers as input data, which restricts them to annotated sequences already deposited in public databases

Solution: Blast2GO (B2G)
A universal GO annotation, visualization and statistics framework that brings advanced functional analysis to the genomics research of non-model species.
B2G has been designed to (1) allow automatic and high-throughput sequence annotation and (2) integrate functionality for annotation-based data mining.

BLAST2GO: OVERVIEW
- BLAST: B2G uses BLAST to find homologs of the FASTA-formatted input sequences
- OBTAINING GO TERMS: the program extracts the GO terms for each obtained hit by mapping to existing annotation associations
- ANNOTATION: an annotation rule finally assigns GO terms to the query sequence
- VISUALIZATION: annotation and functional analysis can be visualized as a graph that reconstructs the GO relationships and color-highlights the most relevant areas

Conesa A et al. Bioinformatics 2005;21:3674-3676

(1) Blasting: a group of selected sequences is blasted against either the NCBI or custom databases
(2) Mapping: GO terms are mapped on the BLAST results using annotation files provided by the GO Consortium, which are downloaded on a monthly basis at the Blast2GO server
(3) Annotation: sequences are annotated using an annotation rule that takes parameters provided by the user
(4) Statistical analysis: optionally, analysis of GO term distribution differences between groups of sequences can be performed
(5) Visualization: annotation and statistics results can be visualized on the GO DAG
NOTE: Different charts are available at each of these steps to evaluate the progress of the analysis, and data can be saved and exported in different formats.

Quick steps: At any point of the progress of the BLAST search, four different charts can be generated for a global visualization of the results

Blast2Go performs four different mappings
1. BLAST result accessions are used to retrieve gene names or symbols, making use of two mapping files provided by NCBI (gene_info, gene2accession). The identified gene names are then searched in the species-specific entries of the gene-product table of the GO database.
2. BLAST result GI identifiers are used to retrieve UniProt IDs, making use of a mapping file from PIR (Non-redundant Reference Protein Database) that includes PSD, UniProt, Swiss-Prot, TrEMBL, RefSeq, GenPept and PDB.
3. Accessions are searched directly in the dbxref table of the GO database.
4. BLAST result accessions are searched directly in the gene-product table of the GO database.

InterproScan from Blast2Go
- InterPro is an integrated database of predictive protein “signatures” used for the classification and automatic annotation of proteins and genomes.
- InterPro classifies sequences at superfamily, family and subfamily levels, predicting the occurrence of functional domains, repeats and important sites.
- InterProScan queries sequences against the InterPro database and returns in-depth annotation results, including GO terms.

Annex (Annotation Augmentation) Tool
Annex is a set of relationships between the three GO categories, consisting of over 6000 manually reviewed relations between molecular function terms involved in biological processes and molecular function terms acting in cellular components As it is performed to augment and complete previous functional annotations, Annex is run after BLAST and InterProScan results are annotated, usually achieving between 10% and 15% extra annotations and confirming around 30% of GO terms.

KEGG Maps
Blast2GO also lets you retrieve KEGG maps for the annotated genes.

Annotation Results The following results will be obtained:
- Successfully annotated sequences
- Mapped sequences without assigned annotation
- Assigned Gene Ontology terms
- Assigned enzyme codes
- Sequences with enzyme codes assigned
- Average graph level of GO terms (the level is the minimal distance from a term to the root)
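The "graph level" statistic can be computed with a breadth-first search from each term up to the root. The sketch below works on a toy child-to-parents map with hypothetical term names; a real analysis would load the parent relationships from the GO files.

```python
from collections import deque


def go_level(term, parents):
    """Minimal number of edges from a term up to a root (a term without parents)."""
    seen = {term}
    queue = deque([(term, 0)])
    while queue:
        node, distance = queue.popleft()
        node_parents = parents.get(node, [])
        if not node_parents:
            return distance  # BFS reaches the nearest root first
        for parent in node_parents:
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, distance + 1))
    return None  # only reached if the graph contains a cycle


def average_level(terms, parents):
    """Average graph level over a collection of terms."""
    levels = [go_level(t, parents) for t in terms]
    return sum(levels) / len(levels)


# Toy DAG (hypothetical names): C has two parents, one of them the root.
TOY_PARENTS = {"root": [], "A": ["root"], "B": ["A"], "C": ["B", "root"]}
```

In the toy graph, C can reach the root directly, so its level is 1 even though another path through B and A has length 3.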

Summary
- By joining annotation to functional analysis, B2G provides a powerful data mining tool ideally suited to support genomic research in non-model species.
- Its species-independent character and its different data input options make it a valuable mining resource for potentially any organism.
- B2G combines high-throughput analysis, statistical evaluation and biology-framed visualization with a high degree of user interaction.

To learn more about Blast2Go:
Read "Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research” Visit Blast2Go webpage Read if interested to know more about Evidence code ontology Also check out this tutorial if you plan to use Blast2Go (It is excellent in terms of explaining each step of Blast2Go)

VFDB: Virulence Factors Database
Neisseria meningitidis
- Gram-negative diplococci
- Human pathogenic species
- Colonizes the nasopharynx after the inhalation of infected respiratory droplets
- Causes epidemic meningitis
- Remarkable ability to change surface structures (capsule antigens)
The virulence genes in VFDB may be downloaded and blasted locally or run on the server. Cataloging the virulence genes will provide a basis for the comparative group.

Databases: Pangenome of Neisseria meningitidis vs. Public
The pangenome includes the set of genes that are present in all strains (core genome), genes that are present in two or more strains (dispensable genome) and genes that are unique to a strain.
- BLAST reference database: discover high-confidence matches specific to Neisseria meningitidis
- A well-curated public database (UniProt/RefSeq) will be used to discover new genes or possibly horizontally transferred genes that may not be found in the pangenome
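The three pangenome compartments follow directly from counting in how many strains each gene occurs. A minimal sketch, with hypothetical strain and gene names, and with "dispensable" taken to mean present in at least two strains but not in all of them:

```python
from collections import Counter


def partition_pangenome(strain_genes):
    """Split genes into (core, dispensable, unique) sets.

    strain_genes maps a strain name to the collection of genes it carries.
    """
    n_strains = len(strain_genes)
    counts = Counter(g for genes in strain_genes.values() for g in set(genes))
    core = {g for g, c in counts.items() if c == n_strains}
    # With a single strain every gene is core, so nothing is called unique.
    unique = {g for g, c in counts.items() if c == 1 and n_strains > 1}
    dispensable = set(counts) - core - unique
    return core, dispensable, unique
```

For example, with strains carrying {a, b, c}, {a, b} and {a, d}, gene a is core, b is dispensable, and c and d are unique.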

Gene Naming
- “Absolute function”: E-value < 1e-6, identity >= 85% and coverage >= 50%
- “High confidence similarity”: E-value < 1e-6, identity >= 30% and < 85%, and coverage >= 50%
- “Low confidence similarity”: E-value > 1e-6 and <= 0.001 (a hit, but not enough for high confidence similarity)
- “Conserved hypothetical”: the hit satisfies the above conditions and its description contains the term “hypothetical”
- “Hypothetical”: the gene is not present in the databases
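These naming rules can be written as one small function. The slide leaves some cases open (for example, a hit with E-value below 1e-6 but identity under 30% or coverage under 50%). This sketch makes two explicit assumptions: any hit with E-value <= 0.001 that misses the two higher tiers is called low confidence, and a hit with E-value above 0.001 counts as no hit. The dictionary keys are hypothetical names chosen for the example.

```python
def name_gene(hit=None):
    """Assign a naming category to the best BLAST hit of a predicted gene.

    `hit` is None (no hit) or a dict with the keys 'evalue', 'identity'
    (percent), 'coverage' (percent) and 'description'.
    """
    # Assumption: hits weaker than E = 0.001 are treated like "no hit".
    if hit is None or hit["evalue"] > 1e-3:
        return "hypothetical"
    if "hypothetical" in hit["description"].lower():
        return "conserved hypothetical"
    if hit["evalue"] < 1e-6 and hit["coverage"] >= 50:
        if hit["identity"] >= 85:
            return "absolute function"
        if hit["identity"] >= 30:
            return "high confidence similarity"
    # Assumption: everything else with E <= 0.001 is low confidence.
    return "low confidence similarity"
```

Keeping the thresholds in one place makes it easy to adjust them after inspecting real hits from the 25 strains.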

Questions?
