1
Functional Annotation Group
Preliminary Results Functional Annotation Group
2
Annotation Overview
1) Annotate
1.1) Protein coding genes: Functional, Clusters, Pathways
1.2) Non coding genes
3
Topics of interest
Protein coding: Structural, Regulatory, Trans-membrane, Signal Peptide, Receptors, Virulence, Toxins, Signal Transduction
Non-coding: rRNA, tRNA, tmRNA, CRISPR, Riboswitches, nesseria_FSE, Betaproteobacteria_toxic_sRNA, etc.
4
Protein coding genes
6
Structure
A protein domain is a conserved part of a given protein sequence and structure that can evolve, function, and exist independently of the rest of the protein chain. Each domain forms a compact three-dimensional structure and can often be independently stable and folded.
A sequence motif is distinguished from a structural motif, which is formed by the three-dimensional arrangement of amino acids that may not be adjacent in the sequence.
7
Batch CD-Search tool
→ Searches for conserved domains and conserved features
→ Uses RPS-BLAST to compare a query protein sequence against conserved domain models
→ Web-based analytical tool
→ Databases used by CD-Search: CDD, SMART, Pfam, TIGRFAM, COG, PRK
9
Outputs
Sample      CD-Search: # of domains found
M22471      9436
M07149      9341
(sample name missing)  9253
M22813      6603
M23413      9487
10
InterProScan
InterProScan, like Blast2GO, provides annotations based on homology and GO terms, but uses an HMM discovery algorithm and relies on a larger number of sources for its annotations: Gene3D, Superfamily, PIRSF, TIGRFAM, Panther, Pfam, SMART, PRINTS, HAMAP, ProSite, ProDom.
InterProScan v46.0. Complete documentation is available on downloading, installing and using the latest version of InterProScan 5 (recommended version).
Input: any FASTA or EMBL sequence (nucleotide or protein)
For protein sequences:
./interproscan.sh -i <query seq> -f <format> -b <outputfile>
For nucleotide sequences:
./interproscan.sh -t n -i <query seq> -f <format> -b <outputfile>
11
Output Format
TSV: a simple tab-delimited file format
XML
GFF3: the GFF 3.0 format
HTML: an HTML representation of the protein matches

TSV format
ABC_Contig24_G80 a0cbcb32f883581d3c844a38f680353b Pfam PF03466 LysR substrate binding domain E-42 T
Columns:
Protein Accession (ABC_Contig24_G80)
Sequence MD5 digest (a0cbcb32f883581d3c844a38f680353b)
Sequence Length (299)
Analysis (e.g. Pfam / PRINTS / Gene3D)
Signature Accession (e.g. PF03466)
Signature Description (LysR substrate binding domain)
Start location
Stop location
Score: the e-value of the match reported by the member database method (e.g. 3.1E-42)
Status: the status of the match (T: true)
Date: the date of the run
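To make the column layout above concrete, here is a minimal Python sketch that reads InterProScan TSV lines into records. The field labels and the function name `read_interproscan_tsv` are illustrative choices, not part of InterProScan, and only the 11 columns listed above are assumed.

```python
import csv
from typing import Dict, Iterable, Iterator

# Our own labels for the 11 TSV columns described on the slide.
FIELDS = ["protein", "md5", "length", "analysis", "signature",
          "description", "start", "stop", "score", "status", "date"]


def read_interproscan_tsv(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per match line; any extra trailing columns are ignored."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < len(FIELDS):
            continue  # skip blank or truncated lines
        yield dict(zip(FIELDS, row))
```

Passing an open file handle as `lines` works, since the function only needs an iterable of text lines.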
12
Output
Sample   Total number of proteins   InterProScan: # of proteins annotated
M07149   2118                       851
M09293   1915                       822
M22718   1851                       766
M22813   1163                       485
M23413   1927                       829
13
InterProScan Output
Columns: Sample, Total, Coils, Gene3D, Hamap, Pfam, PIRSF, PRINTS, ProDom, ProSitePatterns, ProSiteProfiles, SMART, SUPERFAMILY, TIGRFAM
M07149  2118  25  603  237  805  129  143  20  235  201  169  114  395
M09293  1915  15  586  241  781  115  142  21  239  202  149  386
M22718  1851  17  548  226  730  118  121  19  220  167  145  92  369
M22813  1163  13  353  140  455  82  84  10  103  86  59  232
M23413  1927  22  580  252  782  132  144  23  195  139  113  409
Rows M09293, M22813 and M23413 list only 12 of the 13 values, so their later columns cannot be aligned with certainty.
14
InterProScan summary
A label of 8 means the proteins were annotated by 8 databases.
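A summary of this kind can be derived from the TSV output by counting, for each protein, how many distinct member databases reported a match. The sketch below is one possible way to do that; the function names are hypothetical, and it assumes column 1 holds the protein accession and column 4 the analysis name, as in the TSV description.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set


def databases_per_protein(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each protein accession (column 1) to the set of analyses (column 4)."""
    hits: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 4:
            hits[cols[0]].add(cols[3])
    return hits


def support_histogram(hits: Dict[str, Set[str]]) -> Counter:
    """Count how many proteins are annotated by exactly k databases."""
    return Counter(len(dbs) for dbs in hits.values())
```

In the resulting histogram, the entry for key 8 is the number of proteins annotated by 8 databases.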
15
Signal transduction
Signal transduction occurs when an extracellular signaling molecule activates a cell surface receptor. In turn, this receptor alters intracellular molecules, creating a response. There are two stages in this process:
1. A signaling molecule activates a specific receptor protein on the cell membrane.
2. A second messenger transmits the signal into the cell, eliciting a physiological response.
In either step, the signal can be amplified. Thus, one signaling molecule can cause many responses. Signal transduction functions much like a switch.
SignalP, Signal-3L
16
Signal peptide
A signal peptide (sometimes referred to as signal sequence, leader sequence or leader peptide) is a short (5-30 amino acids long) peptide present at the N-terminus of the majority of newly synthesized proteins that are destined for the secretory pathway. These proteins include those that reside inside certain organelles (the endoplasmic reticulum, Golgi or endosomes), are secreted from the cell, or are inserted into most cellular membranes.
The core of the signal peptide contains a long stretch of hydrophobic amino acids that tends to form a single alpha-helix. At the end of the signal peptide there is typically a stretch of amino acids that is recognized and cleaved by signal peptidase.
17
SignalP
18
SignalP output http://www.ncbi.nlm.nih.gov/pubmed/17880924
19
Membrane Proteins
Integral membrane proteins are permanently attached to the membrane. They can be classified according to their relationship with the bilayer:
Integral polytopic proteins, also known as "transmembrane proteins," are integral membrane proteins that span the membrane at least once. They have one of two tertiary structures:
Helix bundle proteins, which are present in all types of biological membranes;
Beta barrel proteins, which are found only in outer membranes of Gram-negative bacteria, lipid-rich cell walls of a few Gram-positive bacteria, and outer membranes of mitochondria and chloroplasts.
Integral monotopic proteins are integral membrane proteins that are attached to only one side of the membrane and do not span the whole way across.
Tools: TMHMM, TOPCONS (consensus results from different TM prediction software)
20
Philius
21
Philius output http://www.ncbi.nlm.nih.gov/pubmed/17880924
22
Phobius
23
Phobius output http://www.ncbi.nlm.nih.gov/pubmed/17880924
25
Output
Sample   Total number of proteins   TMHMM   TOPCONS
M07149   2118                       204     441
M09293   1915                       116     382
M22718   1851                       183     379
M22813   1163                       177     254
M23413   1927                       187     402
TOPCONS is a consensus from five different topology prediction algorithms: SCAMPI (single sequence mode), SCAMPI (multiple sequence mode), PRODIV-TMHMM, PRO-TMHMM and OCTOPUS.
26
Regulatory proteins
Regulatory proteins (RPs) such as transcription factors (TFs) and two-component system (TCS) proteins control how prokaryotic cells respond to changes in their external and/or internal state. Identification and annotation of TFs and TCSs is non-trivial, and between-genome comparisons are often confounded by different standards in annotation.
27
P2RP (Predicted Prokaryotic Regulatory Proteins)
→ A web-based framework for the identification and analysis of regulatory proteins in prokaryotic genomes.
29
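Output for P2RP
Columns: Sample, Two-Component Systems, Transcription Factors, DNA-binding Proteins, Total Regulatory proteins
M22471  4  24  32
M07149  7  20  5
M09293  6  23  34
M22813  12  3  19
M23413  30
Each row is missing at least one of its four values, so the surviving numbers cannot be assigned to columns with certainty.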
30
Toxins and Virulence factors
N. meningitidis (Nm) can colonize human mucosal tissues and enter cells to cause septicemia and/or meningitis (intracellular pathogen). It has an enormous capability to vary its surface structures (the host may be more susceptible to a new antigen).
Figure from the VFDB website
1. Capsule/LOS: evade the immune system and phagocytosis by being ‘invisible’
2. Pilus: colonize the cells by attaching to cell surface receptors
3. Outer membrane proteins: gain entry into the cell
4. LOS (lipid A): acts as a toxin and induces the inflammatory response (meningitis)
31
VFDB: Virulence Factors Database
Most complete and up-to-date database of microbial VFs (compared to MvirDB)
Scalable: sequences may be downloaded and blasted on a local system
Major virulence factors in Neisseria in the VFDB:
Adherence: LOS, Type IV pili
Antiphagocytosis: Capsule
Efflux pump: FarAB, MtrCDE
IgA1 protease: IgA1 protease
Immune modulator: NspA, fHbp
Invasion: Op, Opc, Porin
Iron uptake: FbpABC, HmbR, HpuAB, Lbp, Tbp
Stress protein: KatA, MntABC, MsrAB, RecN
32
VFDB: Virulence Factors Database
Usage (substitute the appropriate BLAST+ program for "blast", e.g. blastp for protein queries):
blast -query <input> -subject <database> -outfmt 6 | sort -nr -k 12,12
The sort orders hits by column 12 (bit score), highest first.
Example tabular ‘6’ output:
ABC_Contig144_G1 VFG
ABC_Contig162_G1 VFG
ABC_Contig151_G45 VFG
ABC_Contig186_G4 VFG
Hits to virulence genes were observed, such as lactoferrin binding proteins and the iron (III) ABC transporter (iron scavenging), omp outer membrane protein, capsule genes, pilus genes and LOS genes.
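As a companion to the sort step, the following Python sketch keeps only the highest-scoring subject for each query. It assumes the default 12-column `-outfmt 6` layout (query ID in column 1, subject ID in column 2, bit score in column 12); the function name `best_hits` is made up for illustration.

```python
from typing import Dict, Iterable, Tuple


def best_hits(lines: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Return {query: (subject, bitscore)} for the top-scoring hit of each query."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in lines:
        if line.startswith("#"):
            continue  # comment lines appear with -outfmt 7, not 6
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # skip blank or malformed lines
        query, subject, bitscore = cols[0], cols[1], float(cols[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best
```

Unlike the sorted listing, the result does not depend on the order of the input lines.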
33
CARD: Comprehensive Antibiotic Resistance Database
Bacteria are becoming increasingly resistant to antimicrobials due to their overuse and inappropriate use. Genes associated with mechanisms of antibiotic resistance are curated in CARD.
Other databases are available (e.g. ARDB, RAC); however, CARD is currently maintained while ARDB is not. RAC is not scalable, as its sequences are not downloadable.
Hits observed:
Mtr ABC transporter loci and mac loci in Neisseria gonorrhoeae (azithromycin and erythromycin resistance)
rpoB (rifampin resistance)
gyrA (fluoroquinolone resistance)
34
Cataloging the virulence genes (toxins, secretion systems etc) and antimicrobial resistance genes will provide a basis for the comparative group to understand the pathogenic aspects of NmW and why it may be such a ‘fit’ pathogen.
35
Output
Sample   # Hits to VFDB   # Hits to CARD
M22718   70               13
M09293   97               12
(row incomplete; only the value 47 survives)
M22813   69               4
M22471   209              9
M22471 had the highest number of contigs of these strains. Perhaps the higher number of hits is an artifact of the sequence quality?
36
Cutoffs
For CARD blast results:
Coverage cutoff is 90%.
E-value cutoff is 10e-30/1e-50 (stringent or soft criteria).
Alignment length cutoff is 100.
For VFDB blast results:
Coverage cutoff is 92.5%/82.5% (stringent or soft criteria).
E-values do not show a good cutoff point. 1e-50 might be OK, but may be biased.
Alignment length cutoff is 100/550 (stringent or soft criteria).
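Cutoffs like these can be applied to tabular BLAST output with a small filter. The sketch below is only one way to do it: the slides do not say how coverage was computed, so it is assumed here to be the aligned query span divided by the query length, with query lengths supplied separately (the default `-outfmt 6` columns do not include them). The function name `filter_hits` and its parameters are hypothetical; the defaults mirror the CARD values above.

```python
from typing import Dict, Iterable, List


def filter_hits(lines: Iterable[str],
                query_lengths: Dict[str, int],
                min_coverage: float = 90.0,
                max_evalue: float = 10e-30,
                min_length: int = 100) -> List[List[str]]:
    """Keep -outfmt 6 rows that pass coverage, e-value and length cutoffs."""
    kept: List[List[str]] = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # skip blank or malformed lines
        qlen = query_lengths.get(cols[0])
        if not qlen:
            continue  # coverage cannot be computed without a query length
        aln_len = int(cols[3])
        qstart, qend = int(cols[6]), int(cols[7])
        evalue = float(cols[10])
        # Assumed definition: percent of the query covered by the alignment.
        coverage = 100.0 * (abs(qend - qstart) + 1) / qlen
        if coverage >= min_coverage and evalue <= max_evalue and aln_len >= min_length:
            kept.append(cols)
    return kept
```

Running it twice with different parameter sets gives the stringent and soft result lists side by side.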
37
It’s all about you, Neisseria meningitidis
38
BLAST: Neisseria Pangenome/Panproteome Database
The pangenome includes the set of genes that are present in all strains (core genome), genes that are present in two or more but not all strains (dispensable genome), and genes that are unique to a single strain.
BLAST reference database: UniProt
Goal: discover high-confidence matches specific to Neisseria meningitidis and matches to other closely related Neisseria species.
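The three pangenome categories can be expressed directly as set operations. The sketch below assumes that genes have already been clustered so that the same identifier denotes the same gene family in every strain; that clustering step, the function name and the strain names in any example input are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Set


def partition_pangenome(strain_genes: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split gene families into core, dispensable and unique sets."""
    n_strains = len(strain_genes)
    # Number of strains in which each gene family occurs.
    counts = Counter(g for genes in strain_genes.values() for g in genes)
    core = {g for g, c in counts.items() if c == n_strains}
    unique = {g for g, c in counts.items() if c == 1} - core
    dispensable = set(counts) - core - unique
    return {"core": core, "dispensable": dispensable, "unique": unique}
```

With a single input strain every gene counts as core, which is why `unique` excludes anything already in `core`.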
39
BLAST: Neisseria Pangenome/Panproteome Database
Sample   # Hits to UniProt
M07149   26929
M09293   25017
M22718   24306
M22813   16187
M23413   25270
40
Cutoffs
Databases: UniProt/Swiss-Prot
BLAST type: blastp
Criteria for absolute function:
E-value < 10e-50
Alignment length > 100
Coverage > 60%
41
Orthologs
Orthologs can be defined as "genes that have diverged after a speciation event... [that] tend to have similar function" (Fulton et al. 2006). Thus, orthologs are genes whose encoded proteins fulfill similar roles in different species.
43
But proteins don’t function alone..
44
Operon
Operon: a family of co-regulated proteins.
These protein sets are highly conserved during evolutionary selection. The genes in a set are adjacent to each other and in the same orientation. They are not separated by promoters or terminators, as they are expressed together to form an overall functional system.
Criteria for judging whether two adjacent genes belong to the same operon:
The distance between them in nucleotides
Whether the genes are conserved near each other in other genomes, based on MicrobesOnline Ortholog Groups
The correlation of their expression patterns, if gene expression data is available
Whether they both belong to a narrow GO category
Whether they share a COG functional category
* MEGA maybe
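The first two properties (adjacency in the same orientation and a short intergenic distance) are simple enough to sketch in code. The example below groups genes on one contig into candidate operons using only those two properties; it ignores conservation, expression, GO and COG evidence, and the 50 bp gap threshold is an arbitrary assumption rather than a value from the slides.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based start coordinate on the contig
    end: int     # 1-based end coordinate (inclusive)
    strand: str  # "+" or "-"


def candidate_operons(genes: List[Gene], max_gap: int = 50) -> List[List[str]]:
    """Group same-strand neighbours separated by at most max_gap bases."""
    groups: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if groups:
            prev = groups[-1][-1]
            gap = gene.start - prev.end - 1  # negative if the genes overlap
            if gene.strand == prev.strand and gap <= max_gap:
                groups[-1].append(gene)
                continue
        groups.append([gene])
    # A single gene is not reported as an operon candidate.
    return [[g.name for g in grp] for grp in groups if len(grp) > 1]
```

Candidates produced this way would still need to be checked against the remaining criteria before being called operons.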
45
OperonDB
For each conserved gene pair, OperonDB estimates the probability that the genes belong to the same operon. The algorithm takes into account several alternative explanations, such as functionally unrelated genes that were adjacent in a common ancestor, adjacency by chance alone, or horizontal transfer of the gene pair.
Algorithm:
Identification of conserved pairs
Identification of orthologs using BLAST
Finding conserved gene clusters with the Homologyteams software
The greater the evolutionary distance between the genomes in which a gene pair is conserved, the higher the chance that the pair belongs to the same operon.
46
Steps to find conserved gene clusters: Operon
Create operonDB and obtain outputs:
createoperondb.pl <list_of_ptt_files> <directory_of_all_blast_searches> <homologyteam_program>
47
Capsule biosynthesis operon (DOOR2)
48
Pathways
Protein interactions and the pathways involved in the cell are important for a holistic view of the genome. They allow us to visualize biosynthetic, signalling and metabolic pathways. Pathways will be used to check the extent to which the genes of a particular biological system have been predicted from the assemblies, and to give researchers interested in a particular protein a complete view of where it fits in the big picture.
Tools: Blast2GO, KEGG, etc.
49
Software
1. Blast2GO: Functional annotation in 3 steps: BLAST to find homologous sequences, mapping to retrieve GO terms, and annotation to select reliable functions.
2. KAAS: Provides functional annotation of genes by BLAST comparisons against the manually curated KEGG GENES database. The result contains KO (KEGG Orthology) assignments by the BBH (bi-directional best hit) method and automatically generated KEGG pathways.
3. KOBAS 2.0: Annotates queries to either KEGG genes or KEGG Orthology (KO) terms. The results contain KEGG PATHWAY, PID, BioCarta (from PID), Reactome, BioCyc, and PANTHER pathway databases and OMIM, KEGG DISEASE, FunDO, GAD, and NHGRI GWAS Catalog disease databases. It also integrates Gene Ontology. KOBAS 2.0 can parse and annotate the BLAST result (in tabular format) of a set of nucleotide / protein sequences.
Output format:
#Query Gene ID|Gene name|Hyperlink
ABC_Contig7_G nmc:NMC1397|| bin/www_bget?nmc:NMC1397
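The bi-directional best hit idea mentioned for KAAS can be illustrated in a few lines. This is a generic sketch of the concept, not KAAS's actual implementation: it assumes the best hit of every gene has already been determined in both search directions, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def bidirectional_best_hits(a_to_b: Dict[str, str],
                            b_to_a: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)
```

A gene whose best hit points back to a different gene is dropped, which is what makes the criterion stricter than a one-way best hit.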
50
KOBAS 2.0
51
Non coding genes
52
tRNAscan-SE Aragorn Rfam RNAmmer misc RNA Rfam
53
Non coding RNA
A non-coding RNA (ncRNA) is a functional RNA molecule that is not translated into a protein.
A transfer RNA (tRNA) serves as the physical link between the nucleotide sequence of nucleic acids (DNA and RNA) and the amino acid sequence of proteins.
A transfer-messenger RNA (tmRNA) is a bacterial RNA molecule with dual tRNA-like and messenger RNA-like properties.
A ribosomal RNA (rRNA) is the RNA component of the ribosome, and is essential for protein synthesis.
Infernal: homologous
54
miscRNA
A riboswitch is a regulatory segment of a messenger RNA molecule that binds a small molecule, resulting in a change in production of the proteins encoded by the mRNA.
CRISPR RNAs (crRNAs), which contain the direct repeat element, are transcribed from the CRISPR locus. The crRNAs are then incorporated into effector complexes, where the crRNA guides the complex to the invading nucleic acid and the Cas proteins degrade this nucleic acid.
Examples: TPP, SAM, nesseria_FSE, Betaproteobacteria_toxic_sRNA, etc.
55
Annotation
Rfam-Infernal:
rfam_scan.pl -blastdb Rfamdb Rfam.cm seqfinal.fa
rRNA: RNAmmer
tRNA/tmRNA: tRNAscan-SE, Aragorn (with -m option)
aragorn -m -gc11 -o tmRNAfile assembly_file
Other non-coding RNA annotation:
cmsearch --tblout table_file <Bacteria_Covariance_matrix> <assembly_file>
Parse the file to get the Rfam ID and annotate using Rfam.fasta
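The parsing step for the cmsearch table could look like the sketch below. It assumes the default whitespace-delimited `--tblout` layout of Infernal 1.1, in which comment lines start with `#`, the third field is the model (query) name and the fourth is its accession; the function name is made up for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_family_hits(lines: Iterable[str]) -> Counter:
    """Count cmsearch hits per (model name, model accession) pair."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        cols = line.split()
        if len(cols) < 17:
            continue  # not a complete hit line
        family: Tuple[str, str] = (cols[2], cols[3])
        counts[family] += 1
    return counts
```

The accession in each key (for example an RF identifier) is what would then be looked up for annotation.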
56
Output for noncoding RNA
Sample   rRNA   tRNA+tmRNA   others
M22471   15     63+1         137
M07149   8      55+1         101
M22813   12     56+1         268
M22718   6      70+1         214
M09293   9      66+1         98
57
CRISPR
CRISPRs (clustered regularly interspaced short palindromic repeats) are DNA loci that contain multiple, short, direct repetitions of base sequences. Each repetition contains a series of base pairs followed by the same or a similar series in reverse and then by 30 or so base pairs known as "spacer DNA". They are found in the genomes of approximately 40% of sequenced bacteria.
The spacers are short segments of DNA from a virus and serve as a 'memory' of past exposures. The CRISPR-Cas system functions as a prokaryotic immune system, in that it confers resistance to exogenous genetic elements such as plasmids and phages and provides a form of acquired immunity.
58
CRISPRs Database
59
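Output
Columns: Sample, Confirmed_CRISPRS, Questionable_CRISPRS
M07149  6
M22471  2  13
M09293  10
M22813  3  8
M22718  9
Rows with a single value have lost one column, so that value cannot be assigned to a column with certainty.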
60
Insertion sequences
ISFinder: At present, individual sequences can only be downloaded one by one for comparison. An online BLAST facility is available, and future versions will provide direct online access to additional analytical tools.
Output: shows the hits to the IS database for each contig
ISsaga: too slow; need to obtain login details
62
Result: Merged annotation and naming