Bioinformatic Analysis of Protein Families Daniil G. Naumoff Laboratory of Bioinformatics State Institute for Genetics and Selection of Industrial Microorganisms.

Presentation transcript:

Bioinformatic Analysis of Protein Families
Daniil G. Naumoff
Laboratory of Bioinformatics, State Institute for Genetics and Selection of Industrial Microorganisms (Gos NII Genetika), Moscow, Russia

The International Nucleotide Sequence Database Collaboration (INSDC)
- GenBank at NCBI
- EMBL Nucleotide Sequence Database
- DNA Data Bank of Japan (DDBJ)
Corresponding protein databases: GenPept, UniProtKB/TrEMBL, and DDBJ
Curated protein database: Swiss-Prot
Three-dimensional (3D) structures of proteins
- PDB (database)
- SCOP (classification)

Search for homologues

BLOSUM-62 matrix

Overprediction is annotation of sequences at a greater level of functional specificity than available evidence supports.

A Protein Family Analysis
- Select a protein
- Determine the domain structure of the selected protein
- Select a domain to be analyzed
- Has the protein domain family been annotated in a database?
- Update the family list or search for homologous domains
- Check each "atypical" sequence (it will probably be edited or removed)
- Preliminary division into subfamilies
- Multiple sequence alignment (consensus?)
- Phylogenetic analysis
- Phylogenetic tree visualization
- Subfamily structure
- Interfamily relationships (superfamilies, clans, etc.)
- 2D and 3D analysis (prediction)

ADDA: Automatic Domain Decomposition Algorithm
33,879 domain families (79,965 if redundant sequences were used), according to Heger A, Holm L. Exhaustive enumeration of protein domain families. J Mol Biol. 2003, 328(3):

Let’s use this protein as a query sequence for BLAST

BLAST results (Descriptions) E-value < 0.01 or 0.001
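
The E-value cutoff on this slide can be applied programmatically. The following is a minimal sketch, assuming the search results were saved in BLAST+ tabular format (-outfmt 6), where the second column is the subject identifier and the eleventh column is the E-value. The function name `filter_hits` and the example rows are hypothetical and serve only as illustration.

```python
# Sketch: keep BLAST hits below an E-value cutoff.
# Assumes BLAST+ tabular output (-outfmt 6): column 2 = subject id, column 11 = E-value.

def filter_hits(lines, max_evalue=0.001):
    """Return (subject_id, evalue) pairs whose E-value is below the cutoff."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id = fields[1]
        evalue = float(fields[10])
        if evalue < max_evalue:
            hits.append((subject_id, evalue))
    return hits


# Hypothetical rows: identifiers and scores are made up for illustration.
example = [
    "query1\thitA\t45.2\t300\t150\t5\t1\t300\t10\t305\t1e-60\t210",
    "query1\thitB\t28.0\t120\t80\t6\t40\t160\t5\t120\t0.004\t38.5",
    "query1\thitC\t25.0\t60\t45\t0\t1\t60\t1\t60\t2.3\t25.0",
]

print(filter_hits(example, max_evalue=0.01))
print(filter_hits(example, max_evalue=0.001))
```

With the looser cutoff (0.01) the second hit is retained; with the stricter one (0.001) only the first hit remains. Hits near the threshold are the "atypical" sequences that deserve a manual check.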

BLAST results (Graphic overview): Domain I, Domain II, Domain III

Domain structure of proteins of the GH27 family, according to Naumoff D.G. Phylogenetic analysis of α-galactosidases of the GH27 family. Molecular Biology (Engl Transl), 2004, 38(3):
Domain architectures shown:
- GH27N, GH27C
- GH27N, GH27C, CBM13
- GH27N, GH27C, CBM6
- GH27N, GH27C, CBM6, CBM13
- GH27N, CBM13, GH27C
- NEW1, GH27N, CBM13, GH27C
- NEW1, GH27N, GH27C
- NEW2, NEW1, GH27N, GH27C
- GH27N, GH27C, NEW3, NEW2
- GH27N, GH27C, NEW3
- GH27N, GH27C, Dockerin
- GH27N, GH27C, CBM1, CE1
Legend:
- GH27N: N-terminal domain of GH27 family
- GH27C: C-terminal domain of GH27 family
- CE1: CE1 domain of carbohydrate esterases
- CBM1: Carbohydrate-binding module CBM1
- CBM6: Carbohydrate-binding module CBM6
- CBM13: Carbohydrate-binding module CBM13
- Dockerin: Dockerin I domain
- NEW1, NEW2, NEW3: Uncharacterized domains (one of them labeled NPCBM)

Universal Protein Domain Databases

Database | Address | Date | Number of families
ADDA | | December |
InterPro | | |
PUMA | | |
Pfam | http://pfam.janelia.org/ | October 2009 |
KOG | | |
COG | | |
SCOP | http://scop.mrc-lmb.cam.ac.uk/scop/ | June 2009 | 3902
| http:// | Jan 2010 |
HOMSTRAD | http://www-cryst.bioc.cam.ac.uk/homstrad/ | Jan 2010 |

ADDA 11082

Databases of individual protein families (

Sequence-Based Classification of the Carbohydrate-Active Enzymes at the CAZy server
- Glycoside Hydrolases (including transglycosidases) => 118 GH families (14 clans)
- Glycosyltransferases => 92 GT families
- Polysaccharide Lyases => 21 PL families
- Carbohydrate Esterases => 16 CE families
- Carbohydrate-Binding Modules => 59 CBM families

Family GH72 of Glycoside Hydrolases (

Multiple Sequence Alignment:
- Automatic (ClustalW or ClustalX)
  - >50% sequence identity
  - only one domain
  - no protein fragments
- Manual (BioEdit) (take the BLAST pairwise sequence alignment into account!)
  - <30% sequence identity
  - long insertions / deletions
  - facultative N-terminal part
Local dissimilarities of very similar sequences:
- Local frameshift
- Exon-intron structure
- Stop codon
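
The identity thresholds above can be checked on a pairwise alignment before choosing between automatic and manual alignment. This is a minimal sketch, assuming identity is counted over columns where neither sequence has a gap (other definitions of the denominator exist). The function names, the "borderline" label for the 30-50% range, and the example sequences are assumptions made for illustration and are not part of the slide.

```python
# Sketch: percent identity of two aligned sequences and a rough strategy hint.
# Assumption: gap columns ("-") are excluded from the comparison.

def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def suggest_strategy(identity):
    """Map identity to the slide's thresholds; 30-50% is left as a judgment call."""
    if identity > 50:
        return "automatic"
    if identity < 30:
        return "manual"
    return "borderline"


# Hypothetical aligned fragments.
identity = percent_identity("ACDEFGHIKL", "ACDEFGHIKV")
print(identity, suggest_strategy(identity))
```

Identity alone is not sufficient: the slide also requires a single domain and no protein fragments for the automatic route, which this sketch does not test.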

BioEdit (

Phylip package
- Maximum Parsimony (ProtPars)
- Distance program (Neighbor-Joining)

An infile for the Phylip package programs
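
An input file of this kind can be generated from an alignment with a few lines of code. The sketch below assumes the classic PHYLIP layout: a first line with the number of sequences and the number of alignment columns, followed by one line per sequence with the name padded to 10 characters. The function name `to_phylip` and the short sequences are hypothetical; only the labels 97A1_BACTH and 97B1_PRERU are taken from the trees in this presentation.

```python
# Sketch: format an alignment as PHYLIP text (names padded to 10 characters).
# The sequences below are made up; real input would be the edited alignment.

def to_phylip(alignment):
    """Format a {name: aligned_sequence} mapping as PHYLIP text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    lines = [f" {len(alignment)} {n_sites}"]
    for name, seq in alignment.items():
        if len(name) > 10:
            raise ValueError(f"name longer than 10 characters: {name}")
        lines.append(name.ljust(10) + seq)
    return "\n".join(lines) + "\n"


example_alignment = {
    "97A1_BACTH": "MKT-LV",
    "97B1_PRERU": "MRTALV",
    "seqC": "M-TALV",
}

text = to_phylip(example_alignment)
print(text)
# To use it with the Phylip programs, save the text under the name "infile":
# with open("infile", "w") as handle:
#     handle.write(text)
```

The 10-character limit explains the compact labels used in the trees, such as 97A1_BACTH.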

Maximum Parsimony (protpars.exe) from the Phylip package

Phylogenetic tree visualization: TreeView program
- Slanted cladogram
- Radial
- Rectangular cladogram
- Phylogram

Subfamily criteria (for glycosidases)
1. Pairwise sequence similarity (>30% identity)
2. Order of sequence appearance during a BLAST search (members of the same subfamily always appear at the top of the BLAST results)
3. Monophyletic status
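
The third criterion can be tested mechanically once a tree is available. The following is a minimal sketch, assuming a rooted tree written as nested tuples with leaf names as strings. On an unrooted tree the answer depends on where the root is placed. The toy topology is hypothetical and does not reproduce the GH97 trees shown in this presentation.

```python
# Sketch: check whether a set of sequences forms a clade in a rooted tree.
# The tree is a nested tuple; leaves are strings. The topology is made up.

def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some clade of the rooted tree contains exactly the group."""
    group = set(group)

    def search(node):
        here = leaves(node)
        if here == group:
            return True
        if isinstance(node, str) or not group <= here:
            return False
        return any(search(child) for child in node)

    return search(tree)


toy_tree = (
    (("97A1_BACTH", "97A1_BACFR"), "97A1_PRERU"),
    (("97B1_BACTH", "97B1_PRERU"), "97E1_RHOBA"),
)

print(is_monophyletic(toy_tree, {"97A1_BACTH", "97A1_BACFR", "97A1_PRERU"}))
print(is_monophyletic(toy_tree, {"97A1_BACTH", "97B1_BACTH"}))
```

In the toy tree the three 97A sequences form a clade, whereas a group mixing 97A and 97B sequences does not.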

The maximum parsimony phylogenetic tree of family GH97
[Figure: tree with leaves grouped into subfamilies 97a, 97b, 97c, 97d, and 97e; α-glucosidase activity [EC ] is marked on the tree.]

The neighbor-joining phylogenetic tree of family GH97
[Figure: tree with leaves grouped into subfamilies 97a, 97b, 97c, 97d, and 97e; [EC ] is marked on the tree.]
Leaf labels, in the order shown:
- 97E: 97E1_RHOBA, 97E1_BACTH
- 97C: 97C1_LEIXY, 97C1_PRERU, 97C2_BACTH, 97C1_MICDE, 97C2_MICDE, 97C1_BACTH, 97C2_PRERU, 97C3_PRERU
- 97D: 97D1_CAUCR, 97D1_XANCA, 97D1_XANAX
- 97B: 97B1_MICDE, 97B1_BACTH, 97B4_BACTH, 97B1_PRERU, 97B2_PRERU, 97B1_BACFR, 97B3_BACTH, 97B2_BACFR, 97B2_BACTH
- 97A: 97A1_HALMA, 97A1_PRERU, 97A1_PREIN, 97A1_TANFO, 97A1_BACTH, 97A1_BACFR, 97A1_UNBAC, 97A2_BACTH, 97A1_SALRU, 97A2_BACFR, 97A3_BACTH, 97A1_AZOVI, 97A8_ENSEQ, 97A5_ENSEQ, 97A4_ENSEQ, 97A3_ENSEQ, 97A7_ENSEQ, 97A6_ENSEQ, 97A1_ERYLI, 97A1_NOVAR, 97A1_XANCA, 97A1_XANAX, 97A1_MICDE, 97A1_SHEON, 97A2_ENSEQ, 97A1_ENSEQ

The neighbor-joining phylogenetic tree of the α-galactosidase superfamily

Clans of Glycoside Hydrolases

Clan | Families (GH) | Optical Configuration | Tertiary Structure
GH-A | 1, 2, 5, 10, 17, 26, 30, 35, 39, 42, 50, 51, 53, 59, 72, 79, 86, 113 | retention (equatorial orientation) | (β/α)8-barrel
GH-B | 7, 16 | retention (equatorial orientation) | β-jelly roll
GH-C | 11, 12 | retention (equatorial orientation) | β-jelly roll
GH-D | 27, 31, 36 | retention (axial orientation) | (β/α)8-barrel
GH-E | 33, 34, 83, 93 | retention (equatorial orientation) | 6-fold β-propeller
GH-F | 43, 62 | inversion (equatorial orientation) | 5-fold β-propeller
GH-G | 37, 63 | inversion (axial orientation) | (α/α)6
GH-H | 13, 70, 77 | retention (axial orientation) | (β/α)8-barrel
GH-I | 24, 46, 80 | inversion (equatorial orientation) | α+β
GH-J | 32, 68 | retention (β-furanoside) | 5-fold β-propeller
GH-K | 18, 20, 85 | retention (equatorial orientation) | (β/α)8-barrel
GH-L | 15, 65 | inversion (axial orientation) | (α/α)6
GH-M | 8, 48 | inversion (equatorial orientation) | (α/α)6
GH-N | 28, 49 | inversion (axial orientation) | (β)3-solenoid

Clans GH-D and GH-H:
Rigden DJ. Iterative database searches demonstrate that glycoside hydrolase families 27, 31, 36, and 66 share a common evolutionary origin with family 13. FEBS Lett. 2002, 523(1-3):17-22.

Clans GH-H, GH-A, GH-K?
Nagano N, Porter CT, Thornton JM. The (β/α)8 glycosidases: sequence and structure analyses suggest distant evolutionary relationships. Protein Eng. 2001, 14(11):

Screenshot of PSI Protein Classifier
D.G. Naumoff and M. Carreras. PSI Protein Classifier: a new program automating PSI-BLAST search results. Molecular Biology (Engl Transl). V.43. N.4. P

A hierarchical classification of the (β/α)8-type glycosyl hydrolases

A hierarchical structure of the β-fructosidase (furanosidase) superfamily
furanosidase superfamily
- clan GH-J
  - GH32: GH32a, GH32b, GH32c, GH32d
  - GH68: GH68a, GH68b
- clan GH-F
  - GH43: GH43a, GH43b, GH43c, GH43d, GH43e, GH43f, GH43g
  - GH62
- GHLP

Secondary Structure Prediction
- 3D-PSSM
- GOR IV
- nnpredict
- PredictProtein
- Hydrophobic cluster analysis (HCA)
Tertiary Structure Prediction
- The SWISS-MODEL modeling server

Phylogenetic Analysis of a Protein Family
- The first stage of a work
  - Prediction of the 3D structure and domain structure of the protein
  - Prediction of the active center and of residues for site-directed mutagenesis
  - Prediction of the enzymatic activities
- The only part of a work (bioinformatics)
- The final stage of a work (interpretation of the experimental results)
Comparing the phylogenetic trees of each domain of a given protein makes it possible to reveal the protein's evolutionary history, viz. the role of gene duplication, loss, fusion, and horizontal transfer.