Protein Sectors: Evolutionary Units of Three-Dimensional Structure. Najeeb Halabi, Olivier Rivoire, Stanislas Leibler, and Rama Ranganathan. Cell 138, 774-786


Protein Sectors: Evolutionary Units of Three-Dimensional Structure
Najeeb Halabi, Olivier Rivoire, Stanislas Leibler, and Rama Ranganathan
Cell 138, 774-786, August 21, 2009
Journal Club, Yizhou Yin, Sep 23, 2009

Sequence Conservation
“…sequence conservation – the degree to which the frequency of amino acids at a given position deviates from random expectation in a well sampled multiple sequence alignment of the protein family...”
-Diagram: sequence, structure, property/function, evolution
-Sequence conservation indicates evolutionary relationship and structural/functional importance

Hypothesis
-However, in the three-dimensional structure of a protein, the many interactions between amino acid residues are also fundamental “structural elements”.
-Amino acid distributions at individual positions should not be taken as independent of one another.
-Investigating correlations between sequence positions in a protein family leads to a decomposition of the protein into groups of coevolving amino acids: “sectors”.
Hypothesis: the sectors are features of protein structures and reflect the evolutionary histories of their conserved biological properties.

S1A Family
-Serine protease classification:
  Clan: SA, SB, …
  Family: S1, S2, …
  Sub-family: S1A (trypsin, chymotrypsin, tryptase, kallikrein, granzyme, …)
  Member: …, rat trypsin (3TGI)
-Broad distribution and functions: prokaryotes, invertebrates, vertebrates; digestion, blood clotting, inflammation, …
-Binding site: specificity
-Catalytic triad: active site

Method Outline
-Identification of sectors: Statistical Coupling Analysis
-Statistical independence: correlation entropy
-Physical connectivity
-Distinct biochemical properties: alanine mutagenesis; catalytic power and thermal stability assays
-Independent divergence: sequence similarity analysis

From Sequence to Sectors
-Multiple sequence alignment of 1470 members of the S1A family (single domain)
-Sequences from the NCBI nonredundant database through iterative PSI-BLAST
-Alignment: Cn3D, ClustalX, standard manual adjustment methods
Position Conservation
-D_i(a): divergence (or relative entropy)
-f_i(a): observed frequency of amino acid a at position i
-q(a): background frequency of a in all proteins
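The formula for position conservation did not survive in the transcript. A minimal sketch follows, assuming the divergence is the binary relative entropy between the observed frequency of amino acid a at a position and its background frequency. The function name, the clipping constant, and the toy column are illustrative assumptions, not taken from the slides.

```python
import math

def position_conservation(column, aa, q_aa, eps=1e-9):
    """Relative entropy of amino acid `aa` in one alignment column.

    Assumed form (binary Kullback-Leibler divergence):
        D = f ln(f/q) + (1 - f) ln((1 - f)/(1 - q))
    where f is the observed frequency of `aa` in the column and
    q is its background frequency.
    """
    f = sum(1 for c in column if c == aa) / len(column)
    # Clip f so that fully conserved or absent residues do not give log(0).
    f = min(max(f, eps), 1 - eps)
    return f * math.log(f / q_aa) + (1 - f) * math.log((1 - f) / (1 - q_aa))

# Hypothetical column of eight sequences; a background of 0.05 is a placeholder.
example = position_conservation("AAAAAAGG", "A", 0.05)
```

The value is zero when the observed frequency equals the background frequency and grows as the position becomes more conserved than expected.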

Statistical Coupling Analysis (SCA)
-SCA matrix: conservation-weighted covariance matrix
-C_ij^ab: frequency-based correlation between positions i and j
-~C_ij^ab: a measure of the significance of observed correlations as judged by the conservation of the amino acids under consideration
-After binary approximation:

Binary approximation
-D_i(a_i): the conservation of a_i, which is the most prevalent amino acid at that position
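The two slides above can be sketched together in code. This is an illustration under stated assumptions, not the authors' implementation: each position is reduced to a 0/1 indicator for its most prevalent amino acid, the raw correlation is taken as f_ij - f_i f_j, and the conservation weight is assumed to be the derivative of the binary relative entropy with respect to frequency. Taking the absolute value of the weighted matrix, the function name, and the uniform background are also assumptions; gap characters are not handled.

```python
import numpy as np

def sca_matrix(alignment, background, eps=1e-6):
    """Conservation-weighted covariance matrix under the binary approximation.

    alignment:  list of equal-length sequences (no gap handling in this sketch)
    background: dict mapping amino acid -> background frequency q(a)
    """
    seqs = np.array([list(s) for s in alignment])
    n_seq, n_pos = seqs.shape
    X = np.zeros((n_seq, n_pos))
    phi = np.zeros(n_pos)
    for i in range(n_pos):
        residues, counts = np.unique(seqs[:, i], return_counts=True)
        top = residues[np.argmax(counts)]        # most prevalent amino acid a_i
        X[:, i] = seqs[:, i] == top              # binary approximation
        f = np.clip(X[:, i].mean(), eps, 1 - eps)
        q = background[top]
        # Assumed weight: derivative of the binary relative entropy w.r.t. f.
        phi[i] = np.log(f * (1 - q) / ((1 - f) * q))
    f = X.mean(axis=0)
    C = X.T @ X / n_seq - np.outer(f, f)         # f_ij - f_i * f_j
    return np.abs(np.outer(phi, phi) * C)

# Toy alignment: positions 0 and 1 covary, position 2 varies independently.
uniform = {a: 0.05 for a in "ACDEFGHIKLMNPQRSTVWY"}
toy = ["ACA", "ACG", "GTA", "GTG"]
M = sca_matrix(toy, uniform)
```

In the toy alignment the entry for the covarying pair is positive while the entry pairing either of them with the independent position is zero.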

Spectral cleaning to separate functional correlation from statistical and historical noise
-Principal Component Analysis: spectral decomposition of the ~C_ij matrix to partially sort out the different contributions to the correlations
-223 eigenvalues
-Lowest 218: statistical noise. Randomized alignments retaining the same size and amino acid propensities at sites show eigenvalues of similar magnitude.
-First mode makes the dominant contribution to ~C_ij: historical noise. The first eigenvalue is well approximated by a first-order approximation, which indicates that the first eigenvector simply reports the net contribution of each position to the total correlation.
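The comparison against randomized alignments can be sketched as follows. For brevity this example uses the plain covariance of a binary alignment rather than the conservation-weighted matrix, and it randomizes by shuffling each column independently, which keeps the alignment size and the per-site frequencies unchanged. The function names, the number of trials, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def eigen_spectrum(X):
    """Eigenvalues, largest first, of the covariance of a binary alignment X
    (rows are sequences, columns are positions)."""
    C = np.cov(X, rowvar=False, bias=True)
    return np.linalg.eigvalsh(C)[::-1]

def randomized_top_eigenvalue(X, rng, n_trials=20):
    """Largest eigenvalue seen over alignments with independently shuffled columns."""
    tops = []
    for _ in range(n_trials):
        shuffled = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(X.shape[1])]
        )
        tops.append(eigen_spectrum(shuffled)[0])
    return max(tops)

def modes_above_noise(X, rng, n_trials=20):
    """Number of eigenvalues exceeding the randomized-alignment level."""
    threshold = randomized_top_eigenvalue(X, rng, n_trials)
    return int(np.sum(eigen_spectrum(X) > threshold))

# Synthetic data: three perfectly coupled positions plus three independent ones.
rng = np.random.default_rng(0)
base = rng.integers(0, 2, size=300)
X = np.column_stack(
    [base, base, base] + [rng.integers(0, 2, size=300) for _ in range(3)]
).astype(float)
n_signal = modes_above_noise(X, rng)
```

Eigenvalues that do not rise above the shuffled level are treated as statistical noise. The first mode would still need separate treatment, since the slides attribute it to historical noise.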

Sector Identification using modes 2 to 5
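The slide does not state the rule used to pick sector positions from modes 2 to 5. One simple possibility is sketched below, purely as an assumption: rank positions by the absolute value of their weight on each eigenvector and keep the top k. The default k = 5 echoes the "top five positions" mentioned on the Statistical Independence slide. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def top_positions_per_mode(C, modes=(2, 3, 4, 5), k=5):
    """For each requested mode (1-based, largest eigenvalue is mode 1), return
    the k positions with the largest absolute weight on that eigenvector."""
    evals, evecs = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]        # indices sorted largest first
    result = {}
    for m in modes:
        v = evecs[:, order[m - 1]]
        ranked = np.argsort(-np.abs(v))[:k]
        result[m] = [int(i) for i in ranked]
    return result

# Toy diagonal matrix: each eigenvector is concentrated on a single position.
C_toy = np.diag([10.0, 5.0, 4.0, 3.0, 2.0, 1.0])
picked = top_positions_per_mode(C_toy, k=1)
```

Absolute values are used because the sign of an eigenvector is arbitrary.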

Overview of Sectors

Statistical Independence
-Compute the correlation entropy to quantitatively measure the independence of sectors
-Minimum discriminatory information method
-S is a small set of positions; specifically, the top five positions contributing to each sector

Structure Connectivity
No sector corresponds to:
-Known primary/secondary/subdomain-architecture subdivision
-Distinction in degree of solvent exposure
-Difference in proximity to the active site (not for green sector)

Although the analysis uses no information about tertiary structure and only ~10% of total sequence positions contribute strongly to each sector, each sector shows obvious intra-sector physical connectivity and only a few inter-sector contacts.
-Red: focused on the S1 pocket (catalytic specificity)
-Blue: more distributed property
-Green: focused around the catalytic triad (catalytic activity)

Biochemical Independence
-Additive effects from combination of mutations between two groups (magenta: observed | white: predicted)
-Mutations of red and blue sectors showed very different effects, focused either on catalytic power or on thermal stability
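The observed-versus-predicted comparison rests on simple arithmetic: if two groups of mutations act independently, the effect of combining them should equal the sum of their separate effects. A minimal sketch follows, assuming effects are expressed as changes relative to wild type on an additive scale (for example a change in melting temperature or in the logarithm of catalytic power). The function names and the numbers are hypothetical, not data from the paper.

```python
def predicted_additive_effect(effect_a, effect_b):
    """Predicted effect of combining two mutation groups if they are independent."""
    return effect_a + effect_b

def non_additivity(effect_a, effect_b, effect_ab):
    """Deviation of the observed combined effect from the additive prediction.
    A value near zero is consistent with independent contributions."""
    return effect_ab - predicted_additive_effect(effect_a, effect_b)

# Hypothetical values on an arbitrary additive scale.
red_only = -2.0     # effect of mutations in one group
blue_only = -0.5    # effect of mutations in the other group
combined = -2.5     # observed effect of both groups together
deviation = non_additivity(red_only, blue_only, combined)
```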

Independent Sequence Divergence
-Sequence similarity analysis of each sector effectively classifies family members only by the property related to that sector, whereas the same analysis on all positions fails to classify them (442 members with functional annotation)
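Restricting a similarity analysis to one sector amounts to comparing sequences only at that sector's alignment columns. The sketch below uses fractional identity as the similarity measure, which is an assumption, since the slide does not name the measure. The function name, the toy alignment, and the chosen positions are illustrative.

```python
import numpy as np

def sector_identity_matrix(alignment, positions):
    """Pairwise fractional identity between sequences, computed only over the
    given alignment positions (for example the positions of one sector)."""
    sub = np.array([[seq[p] for p in positions] for seq in alignment])
    # Broadcast to compare every pair of sequences at each chosen position.
    return (sub[:, None, :] == sub[None, :, :]).mean(axis=2)

# Toy alignment; positions 0 and 3 stand in for a hypothetical sector.
toy_alignment = ["ACDE", "ACDF", "GCDE"]
S_sector = sector_identity_matrix(toy_alignment, [0, 3])
S_all = sector_identity_matrix(toy_alignment, [0, 1, 2, 3])
```

The resulting matrix can be passed to any clustering or ordination method. Comparing the sector matrix with the all-position matrix shows how the restriction changes which sequences appear similar.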

Evidence of “Sector” Theory in Other Protein Families
-PDZ, PAS, SH2, SH3
-Different regulatory mechanisms

Discussion
-Novel Structural Organization
-Implication for Physical Properties of Proteins
-Alternative View to Calculate Residue Covariance
-Technical Challenges
-Protein Modularization
-Adaptive Advantage