Bioinformatic Analysis of Protein Families
Daniil G. Naumoff
Laboratory of Bioinformatics, State Institute for Genetics and Selection of Industrial Microorganisms (Gos NII Genetika), Moscow, Russia
The International Nucleotide Sequence Database Collaboration (INSDC)
- GenBank at NCBI
- EMBL Nucleotide Sequence Database
- DNA Data Bank of Japan (DDBJ)
Corresponding protein databases: GenPept, UniProtKB/TrEMBL, and DDBJ
Curated protein database: Swiss-Prot
Three-dimensional (3D) structures of proteins: PDB (database), SCOP (classification)
Search of homologues
BLOSUM-62 matrix
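As a reminder of what the entries of such a matrix mean, here is a minimal sketch of the log-odds calculation behind a BLOSUM-style score, in half-bit units. The function name and the frequencies are illustrative assumptions, not values taken from the slide.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, same_residue):
    """Return a BLOSUM-style score in half-bit units.

    pair_freq    -- observed frequency of the (unordered) residue pair
    freq_a/b     -- background frequencies of the two residues
    same_residue -- True for an identical pair such as A/A
    """
    # Expected frequency if the two residues paired by chance alone;
    # an unordered pair of different residues can arise in two ways.
    expected = freq_a * freq_b if same_residue else 2 * freq_a * freq_b
    return round(2 * math.log2(pair_freq / expected))


# Illustrative (assumed) frequencies for an identical pair
print(log_odds_score(0.0215, 0.074, 0.074, same_residue=True))
```

A positive score means the pair is observed more often than chance predicts; a negative score means it is observed less often.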
Overprediction is annotation of sequences at a greater level of functional specificity than available evidence supports.
A Protein Family Analysis
- Select a protein
- Determine the domain structure of the selected protein
- Select a domain to be analyzed
- Has the protein domain family been annotated in a database?
- Update the family list or search for homologous domains
- Check each "atypical" sequence (it will probably need to be edited or removed)
- Preliminary division into subfamilies
- Multiple sequence alignment (consensus?)
- Phylogenetic analysis
- Phylogenetic tree visualization
- Subfamily structure
- Interfamily relationship (superfamilies, clans, etc.)
- 2D and 3D analysis (prediction)
ADDA (Automatic Domain Decomposition Algorithm): 33,879 domain families (79,965 if redundant sequences were used), according to Heger A, Holm L. Exhaustive enumeration of protein domain families. J Mol Biol. 2003, 328(3).
Let’s use this protein as a query sequence for BLAST
BLAST results (Descriptions). Significance threshold: E-value < 0.01 or < 0.001
BLAST results (Graphic overview): Domain I, Domain II, Domain III
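The E-value cut-off can also be applied programmatically. The sketch below assumes the hits were saved in the standard 12-column tabular BLAST format (subject ID in column 2, E-value in column 11); the function name and the sample identifiers are hypothetical.

```python
def filter_hits(tabular_text, max_evalue=0.001):
    """Return (subject_id, evalue) pairs with E-value below max_evalue."""
    hits = []
    for line in tabular_text.splitlines():
        # Skip blank lines and comment lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        subject, evalue = fields[1], float(fields[10])
        if evalue < max_evalue:
            hits.append((subject, evalue))
    return hits


# Hypothetical sample rows in 12-column tabular format
sample = "\n".join([
    "query\thitA\t62.1\t400\t150\t2\t1\t400\t5\t404\t1e-50\t310",
    "query\thitB\t28.4\t210\t140\t6\t30\t240\t12\t215\t0.005\t41.2",
    "query\thitC\t25.0\t80\t55\t3\t100\t180\t9\t85\t2.1\t30.0",
])
strict = filter_hits(sample, max_evalue=0.001)   # keeps hitA only
relaxed = filter_hits(sample, max_evalue=0.01)   # keeps hitA and hitB
```

Hits that pass only the relaxed threshold are the natural candidates for the manual check of "atypical" sequences.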
Domain structure of proteins of the GH27 family, according to Naumoff D.G. Phylogenetic analysis of α-galactosidases of the GH27 family. Molecular Biology (Engl Transl), 2004, 38(3).
Domain architectures (N- to C-terminus):
- GH27N GH27C
- GH27N GH27C CBM13
- GH27N GH27C CBM6
- GH27N GH27C CBM6 CBM13
- GH27N CBM13 GH27C
- NEW1 GH27N CBM13 GH27C
- NEW1 GH27N GH27C
- NEW2 NEW1 GH27N GH27C
- GH27N GH27C NEW3 NEW2
- GH27N GH27C NEW3
- GH27N GH27C Dockerin
- GH27N GH27C CBM1 CE1
Legend:
- GH27N: N-terminal domain of GH27 family
- GH27C: C-terminal domain of GH27 family
- CE1: CE1 domain of carbohydrate esterases
- CBM1, CBM6, CBM13: carbohydrate-binding modules
- Dockerin: dockerin I domain
- NEW1, NEW2, NEW3: uncharacterized domains (one of them labeled NPCBM)
Universal Protein Domain Databases
Database; Address; Date; Number of families
- ADDA
- InterPro
- PUMA
- Pfam; http://pfam.janelia.org/; October 2009
- KOG
- COG
- SCOP; http://scop.mrc-lmb.cam.ac.uk/scop/; June 2009; 3902
- HOMSTRAD; http://www-cryst.bioc.cam.ac.uk/homstrad/; Jan 2010
Unassigned entries: December; Jan 2010, http://; 11082
Databases of individual protein families
Sequence-Based Classification of the Carbohydrate-Active Enzymes at the CAZy server
- Glycoside Hydrolases (including transglycosidases) => 118 GH families (14 clans)
- Glycosyltransferases => 92 GT families
- Polysaccharide Lyases => 21 PL families
- Carbohydrate Esterases => 16 CE families
- Carbohydrate-Binding Modules => 59 CBM families
Family GH72 of Glycoside Hydrolases
Multiple Sequence Alignment:
– Automatic (ClustalW or ClustalX)
  - >50% sequence identity
  - only one domain
  - no protein fragments
– Manual (BioEdit) (take into account the BLAST pairwise sequence alignment!)
  - <30% sequence identity
  - long insertions / deletions
  - facultative N-terminal part
Local dissimilarities of very similar sequences:
– Local frameshift
– Exon-intron structure
– Stop codon
BioEdit
Phylip
– Maximum Parsimony (ProtPars)
– Distance program (Neighbor-Joining)
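The identity thresholds above can be checked on a pairwise alignment before choosing a strategy. This is a minimal sketch, assuming identity is counted over columns without gaps (other conventions exist); the function names and the "intermediate" label for the 30-50% range, which the slide does not cover, are assumptions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of two aligned sequences, ignoring gap columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # columns with a gap are not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def suggest_alignment_mode(identity):
    """Map percent identity to the strategy named on the slide."""
    if identity > 50:
        return "automatic"
    if identity < 30:
        return "manual"
    return "intermediate"  # assumption: range not covered by the slide


identity = percent_identity("ACDE-G", "ACDFHG")  # 4 matches in 5 columns
mode = suggest_alignment_mode(identity)
```

Identity alone is not sufficient: a multidomain protein or a fragment still calls for manual editing even above 50%.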
An infile for the Phylip package programs
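A minimal sketch of how such an infile can be written: the first line gives the number of sequences and the alignment length, and each following line starts with a name padded to exactly 10 characters. The function name and the sequences are hypothetical; only the labels follow the naming used in the GH97 trees.

```python
def phylip_infile(alignment):
    """Format a dict {name: aligned_sequence} as PHYLIP infile text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    short_names = [name[:10] for name in alignment]
    if len(set(short_names)) != len(short_names):
        raise ValueError("names are not unique after truncation to 10 chars")
    n_chars = lengths.pop()
    lines = [f"{len(alignment):>5} {n_chars:>5}"]
    for name, seq in alignment.items():
        # Names occupy exactly 10 columns, padded with blanks
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


# Hypothetical aligned fragments
text = phylip_infile({
    "97A1_BACTH": "MKT-LV",
    "97B1_BACTH": "MRTALV",
    "seqA": "MKTALI",
})
print(text)
```

Each sequence is written on a single line, so the same file can be read in either interleaved or sequential mode.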
Maximum Parsimony (protpars.exe) from the Phylip package
Phylogenetic tree visualization: TreeView program
- Slanted cladogram
- Radial
- Rectangular cladogram
- Phylogram
Subfamily criteria (for glycosidases)
1. Pairwise sequence similarity (>30% identity)
2. Order of sequence appearance during BLAST search (members of the same subfamily always appear at the top of BLAST results)
3. Monophyletic status
[Figure: The maximum parsimony phylogenetic tree of family GH97. Subfamilies 97a, 97b, 97c, 97d, and 97e are marked; annotation: α-glucosidase activity [EC ]]
[Figure: The neighbor-joining phylogenetic tree of family GH97 [EC ]]
Sequences in the tree, by subfamily:
- Subfamily 97e: 97E1_RHOBA, 97E1_BACTH
- Subfamily 97c: 97C1_LEIXY, 97C1_PRERU, 97C2_BACTH, 97C1_MICDE, 97C2_MICDE, 97C1_BACTH, 97C2_PRERU, 97C3_PRERU
- Subfamily 97d: 97D1_CAUCR, 97D1_XANCA, 97D1_XANAX
- Subfamily 97b: 97B1_MICDE, 97B1_BACTH, 97B4_BACTH, 97B1_PRERU, 97B2_PRERU, 97B1_BACFR, 97B3_BACTH, 97B2_BACFR, 97B2_BACTH
- Subfamily 97a: 97A1_HALMA, 97A1_PRERU, 97A1_PREIN, 97A1_TANFO, 97A1_BACTH, 97A1_BACFR, 97A1_UNBAC, 97A2_BACTH, 97A1_SALRU, 97A2_BACFR, 97A3_BACTH, 97A1_AZOVI, 97A8_ENSEQ, 97A5_ENSEQ, 97A4_ENSEQ, 97A3_ENSEQ, 97A7_ENSEQ, 97A6_ENSEQ, 97A1_ERYLI, 97A1_NOVAR, 97A1_XANCA, 97A1_XANAX, 97A1_MICDE, 97A1_SHEON, 97A2_ENSEQ, 97A1_ENSEQ
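The monophyly criterion can be tested mechanically once a tree is available. The sketch below assumes a rooted tree written as nested tuples with leaf names as strings; the topology shown is hypothetical and only reuses labels in the style of the GH97 trees.

```python
def leaves(tree):
    """Return the set of leaf names under a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    result = frozenset()
    for child in tree:
        result |= leaves(child)
    return result


def is_monophyletic(tree, members):
    """True if some clade of the rooted tree contains exactly `members`."""
    members = frozenset(members)
    if leaves(tree) == members:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, members) for child in tree)


# Hypothetical rooted topology
tree = ((("97A1_BACTH", "97A2_BACTH"), "97A1_PRERU"),
        ("97B1_BACTH", "97B1_PRERU"))

group_b = is_monophyletic(tree, {"97B1_BACTH", "97B1_PRERU"})
mixed = is_monophyletic(tree, {"97A1_BACTH", "97B1_BACTH"})
```

For an unrooted tree the same question is whether the candidate subfamily forms one side of a single branch, which this rooted sketch does not cover.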
The neighbor-joining phylogenetic tree of the α-galactosidase superfamily
Clans of Glycoside Hydrolases
Clan; Families (GH); Optical configuration; Tertiary structure
GH-A; 1, 2, 5, 10, 17, 26, 30, 35, 39, 42, 50, 51, 53, 59, 72, 79, 86, 113; retention (equatorial orientation); (β/α)8-barrel
GH-B; 7, 16; retention (equatorial orientation); β-jelly roll
GH-C; 11, 12; retention (equatorial orientation); β-jelly roll
GH-D; 27, 31, 36; retention (axial orientation); (β/α)8-barrel
GH-E; 33, 34, 83, 93; retention (equatorial orientation); 6-fold β-propeller
GH-F; 43, 62; inversion (equatorial orientation); 5-fold β-propeller
GH-G; 37, 63; inversion (axial orientation); (α/α)6
GH-H; 13, 70, 77; retention (axial orientation); (β/α)8-barrel
GH-I; 24, 46, 80; inversion (equatorial orientation); α+β
GH-J; 32, 68; retention (β-furanoside); 5-fold β-propeller
GH-K; 18, 20, 85; retention (equatorial orientation); (β/α)8-barrel
GH-L; 15, 65; inversion (axial orientation); (α/α)6
GH-M; 8, 48; inversion (equatorial orientation); (α/α)6
GH-N; 28, 49; inversion (axial orientation); (β)3-solenoid
Rigden DJ. Iterative database searches demonstrate that glycoside hydrolase families 27, 31, 36, and 66 share a common evolutionary origin with family 13. FEBS Lett. 2002, 523(1-3):17-22. Clans: GH-D, GH-H
Nagano N, Porter CT, Thornton JM. The (β/α)8 glycosidases: sequence and structure analyses suggest distant evolutionary relationships. Protein Eng. 2001, 14(11). Clans: GH-H, GH-A, GH-K?
Screenshot of PSI Protein Classifier. Naumoff D.G., Carreras M. PSI Protein Classifier: a new program automating PSI-BLAST search results. Molecular Biology (Engl Transl), 43(4).
A hierarchical classification of the (β/α)8-type glycosyl hydrolases
A hierarchical structure of the β-fructosidase (furanosidase) superfamily
- Furanosidase superfamily: clan GH-J, clan GH-F, GHLP
- Clan GH-J: GH32, GH68
- Clan GH-F: GH43, GH62
- GH32 subfamilies: GH32a, GH32b, GH32c, GH32d
- GH68 subfamilies: GH68a, GH68b
- GH43 subfamilies: GH43a, GH43b, GH43c, GH43d, GH43e, GH43f, GH43g
The Secondary Structure Prediction
– 3D-PSSM
– GOR IV
– nnpredict
– PredictProtein
– Hydrophobic cluster analysis (HCA)
The Tertiary Structure Prediction
– The SWISS-MODEL modeling server
Phylogenetic Analysis of a Protein Family
– The first stage of a study:
  - Prediction of the 3D structure and domain structure of the protein
  - Prediction of the active center and of residues for site-directed mutagenesis
  - Prediction of the enzymatic activities
– The only part of a study (bioinformatics)
– The final stage of a study (interpretation of the experimental results)
Comparing the phylogenetic trees of each domain of a given protein reveals the protein's evolutionary history, viz. the roles of gene duplication, loss, fusion, and horizontal transfer.