Identifying functional residues of proteins from sequence info Using MSA (multiple sequence alignment) - search for remote homologs using HMMs or profiles.


1 Identifying functional residues of proteins from sequence info
Using MSA (multiple sequence alignment)
- search for remote homologs using HMMs or profiles
Remote homologs with no known structure
- given a large, diverse superfamily
- a protein may evolve a different function or subtype
- different substrate specificity or activity
- proteins with similar fold but different function
Past methods used phylogenetic trees
- map the unknown protein to one of the branches of the tree produced
- but it may have diverged too long ago to be clearly identified
- co-evolution of multiple features
- possible convergent evolution of molecular function at the aa level

2 Other methodologies
Analysis/prediction of subtype from sequence alignments
- characterization of aa residues, looking for significant substitutions
- gathering sequences into subgroups, comparing each subgroup
Principal component analysis (Casari et al., 1995)
- looks for functional residues conserved in protein families
Evolutionary Trace (Lichtarge et al.)
Phylogenetic Inference (Sjolander et al.)

3 Goal: identify regions conferring sub-family specificity
- Secondary goal: predict subtypes of orphan sequences
Input to algorithm:
- multiple sequence alignment (MSA) of sequences in a protein family
- classification of the sequences in that MSA into subfamilies
For the given subtypes (or subfamilies) provided:
- get the MSA subalignment for each subfamily
- build an HMM profile for each sub-family MSA
- Rationale: generate pseudocounts and account for statistical bias
For each subalignment profile:
- the profile value for amino acid x at position i in subfamily j is the probability of finding amino acid x at position i in subfamily j
- over all amino acids at a given position, the profile values sum to 1
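The per-subfamily profile step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses plain column frequencies with a uniform additive pseudocount instead of the HMM profiles the slide mentions, and the names `subfamily_profile`, `build_profiles`, and the toy alignment are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def subfamily_profile(alignment, pseudocount=1.0):
    """Return one {amino acid: probability} dict per alignment column.

    Assumption: a uniform additive pseudocount stands in for the
    HMM-derived pseudocounts described on the slide.
    """
    length = len(alignment[0])
    profile = []
    for i in range(length):
        # Count only the 20 standard residues; gaps and other symbols are skipped.
        counts = Counter(seq[i] for seq in alignment if seq[i] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile

def build_profiles(msa, labels, pseudocount=1.0):
    """Split the MSA by subfamily label and build one profile per subfamily."""
    groups = {}
    for seq, label in zip(msa, labels):
        groups.setdefault(label, []).append(seq)
    return {label: subfamily_profile(seqs, pseudocount) for label, seqs in groups.items()}

# Toy alignment: two subfamilies of two sequences each (made-up data).
msa = ["ACD", "ACE", "GCD", "GCE"]
labels = ["s1", "s1", "s2", "s2"]
profiles = build_profiles(msa, labels)
```

Each column of each profile sums to 1, as the slide requires, and the pseudocount keeps every probability strictly positive, which matters when the profiles are later compared with a logarithm-based measure.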

4 Relative Entropy
- measure of “distance” between two probability distributions
- relative entropy produces a value >= 0 (value of 0 for two identical distributions)
- computed for each position i in a subfamily s: for each position, an RE value for subfamily s vs s-bar (all other subfamilies)
Cumulative Relative Entropy
- given a set of relative entropies for each subfamily for each position
- produce a CRE for a given position i in the MSA across all subfamilies
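The formulas for this slide did not survive in the transcript, so the following is a hedged sketch rather than the authors' exact definition. It assumes the usual Kullback-Leibler form for relative entropy and assumes that the cumulative value at a position is the sum of the per-subfamily values; the function names and toy distributions are hypothetical.

```python
import math

def relative_entropy(p, q):
    """Kullback-Leibler divergence between two distributions at one position.

    p: probabilities for subfamily s; q: probabilities for s-bar
    (all other subfamilies). q must be non-zero wherever p is non-zero,
    which pseudocounts normally guarantee.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cumulative_relative_entropy(re_by_subfamily):
    """Assumed definition: sum the per-subfamily RE values at each position."""
    n_positions = len(next(iter(re_by_subfamily.values())))
    return [sum(re[i] for re in re_by_subfamily.values()) for i in range(n_positions)]

# Toy two-letter alphabet, two positions, two subfamilies (made-up numbers).
s1 = [[0.9, 0.1], [0.5, 0.5]]
s2 = [[0.1, 0.9], [0.5, 0.5]]
re_by_subfamily = {
    "s1": [relative_entropy(p, q) for p, q in zip(s1, s2)],
    "s2": [relative_entropy(p, q) for p, q in zip(s2, s1)],
}
cre = cumulative_relative_entropy(re_by_subfamily)
```

In this toy case the first position, where the two subfamilies prefer different letters, gets a large cumulative value, while the second position, where the distributions are identical, gets exactly 0.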

5 Given this set of cumulative relative entropy measures (one for each position in the MSA), take the Z-score.
- standard statistical measure: the number of standard deviations above/below the mean
- tells you which residue positions vary strongly in aa distribution between families
- empirically, Z > 3 correlates with a functional residue
For position i, which amino acid is dominant in a given subfamily?
- find the probability of observing aa x at the position in subfamily s vs not-s
- take the aa with probability >= 0.5
- we now have a small set of aa residues which differ strongly between subfamilies of a protein family
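The Z-score filter and the dominant-residue rule from this slide can be sketched as follows. The slide does not say whether a population or sample standard deviation is used; this sketch assumes the population form (`statistics.pstdev`). The function names and the toy CRE values are hypothetical.

```python
import statistics

def z_scores(values):
    """Number of standard deviations each value lies above/below the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def dominant_residue(column_profile, threshold=0.5):
    """Return the amino acid with probability >= threshold, or None if there is none."""
    aa, prob = max(column_profile.items(), key=lambda item: item[1])
    return aa if prob >= threshold else None

# Toy CRE values for a 20-column alignment: one column stands out (made-up data).
cre = [0.1] * 19 + [5.0]
z = z_scores(cre)
candidates = [i for i, score in enumerate(z) if score > 3]
```

With a population standard deviation, a Z-score above 3 is only reachable when the alignment has more than ten columns, so the threshold is meaningful only for reasonably long alignments.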

6 Subfamily data
What exactly constitutes a family or subfamily?
- not always clear
- automated tree generation could not separate data into clear subfamilies
- use of PFAM alignments and SWISSPROT data
Subfamilies are not clearly defined in databases
- divided proteins from the PFAM database into subfamilies based on SWISSPROT data
- keyword search limited to the enzymatic activity string in SWISSPROT
- put into groups, then checked for obvious mistakes
- also eliminated divisions “easily discernable by sequence comparison”
- 62 groupings from 42 alignments remained
- randomly picked one grouping per alignment (1:1) to produce 42 groups over 42 alignments

7 Subfamilies
Four very large families to test their results on
- nucleotidyl cyclases
- eukaryotic protein kinases
- lactate/malate dehydrogenases
- trypsin-like serine proteases
Nucleotidyl cyclases
- membrane-attached or cytosolic; cyclize (GTP -> cGMP) or (ATP -> cAMP)
- found residues 1018 and 938, which correlate with previous results
- also identified residues which have not been tested experimentally
Protein kinases
- phosphorylate serine/threonine or tyrosine residues
- compared to experimental results, some ser/thr vs tyr kinase differences were not detected
- inconsistency (no conservation) within the subfamily
- residues which were common to both ser/thr and tyr kinases

8 Subfamilies (cont)
Lactate/Malate Dehydrogenases
- common to a very wide variety of organisms; highly divergent
- results mostly as expected, but a few residues identified outside of the active site
Serine Proteases
- cut the protein backbone, with differing specificity as to where (what aa precedes the cut)
- specificity pocket determines where the protease can bind
- identified 2 out of 3 of the experimentally determined pocket residues
- (the third had a low Z-score because of tolerance in one protein family)
- also identified a few residues outside of the active site

9 Prediction of Protein Subfamily
Sequence Similarity
- straight % similarity with other sequences (ignoring gaps)
BLAST
- database search, assign to the nearest subfamily with the best alignment
HMM method
- align the sequence to the HMMs of all subfamilies and assign it to the subfamily with the best alignment
- will attempt to do iterative optimization of match…
Profile method
- take the original HMM and the probability profile
Sub-profile method
- only use residues in the above formula that have a positive Z-score
- to reduce noise, restrict to values that have above-average positive relative entropy
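The simplest of these predictors, straight sequence similarity ignoring gaps, can be sketched as below. This is an illustration only: it scores exact identity instead of a substitution-matrix similarity, assumes the query is already aligned to the MSA, and uses hypothetical names (`percent_identity`, `assign_subfamily`) and made-up sequences.

```python
def percent_identity(a, b, gap="-"):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def assign_subfamily(query, msa, labels):
    """Assign the query to the subfamily of its most similar aligned sequence."""
    best_seq, best_label = max(
        zip(msa, labels), key=lambda item: percent_identity(query, item[0])
    )
    return best_label

# Toy aligned sequences with known subfamily labels (made-up data).
msa = ["ACDE", "ACDF", "GHKE", "GHKF"]
labels = ["s1", "s1", "s2", "s2"]
prediction = assign_subfamily("GH-F", msa, labels)
```

The sub-profile idea on the slide would restrict such a comparison to the high-scoring columns only; that variant is not shown here because the scoring formula it refers to is not part of the transcript.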

10 Casari et al. (1995): A method to predict functional residues in proteins
Input: a multiple-sequence alignment
- each sequence is converted to a vector of size (20 * l), where l is the length of the alignment
Generation of an N x (20*l) matrix
- one sequence produces a vector of dimension 20*l
- N sequences produce N vectors of dimension 20*l
Use Principal Component Analysis
- get the covariance matrix, which tells you how factors are correlated to one another
- eliminate covariance by finding eigenvectors/eigenvalues of the covariance matrix
- the largest eigenvalues and corresponding eigenvectors give you the principal components
- i.e. the largest factors determining the distribution of your dataset
- they take the three largest (the largest of which represents the consensus sequence)
- project their 20*l dimensional data onto those 3 dimensions
- this can be used to predict a protein subfamily for a given protein
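The encoding and projection described on this slide can be sketched with NumPy. This is a minimal illustration, not the procedure of Casari et al.: mean-centring the data before computing the covariance matrix is an assumption of this sketch and may differ from the original paper, and the helper names and toy alignment are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Encode an aligned sequence of length l as a vector of size 20 * l."""
    vec = np.zeros(len(seq) * 20)
    for i, aa in enumerate(seq):
        if aa in AMINO_ACIDS:  # gaps and unknown symbols stay all-zero
            vec[i * 20 + AMINO_ACIDS.index(aa)] = 1.0
    return vec

def pca_project(msa, n_components=3):
    """Project N sequences onto the top principal components."""
    X = np.array([one_hot(s) for s in msa])    # N x (20 * l) matrix
    Xc = X - X.mean(axis=0)                    # assumption: centre each column
    cov = np.cov(Xc, rowvar=False)             # (20*l) x (20*l) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order], eigvals[order]

# Toy alignment with two obvious subfamilies (made-up data).
msa = ["ACDE", "ACDF", "GHKE", "GHKF"]
coords, variances = pca_project(msa)
```

In this toy case the first component separates the `ACD…` sequences from the `GHK…` sequences, which is the kind of clustering the slide suggests using for subfamily prediction.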

11 General Weirdness
Construction of a “comparison matrix”
- take matrix x (matrix transpose), i.e. the matrix multiplied by its transpose
- solve for eigenvectors and eigenvalues as before
Columns of f represent amino acid values and positions
- becomes possible to examine individual amino acid residues and positions
- plotted on a graph, shows residue correlation to type of protein subfamily
- does this actually work?

