Identifying functional residues of proteins from sequence info: using MSA (multiple sequence alignment) to search for remote homologs with HMMs or profiles.

Presentation transcript:

Identifying functional residues of proteins from sequence info

Using MSA (multiple sequence alignment)
- search for remote homologs using HMMs or profiles

Remote homologs with no known structure
- given a large, diverse superfamily
- a protein may evolve a different function or subtype
- different substrate specificity or activity
- proteins with similar fold but different function

Past methods used phylogenetic trees
- map the unknown protein to one of the branches of the tree produced
- but the sequences may have diverged too long ago to be clearly identified
- co-evolution of multiple features
- possible convergent evolution of molecular function at the aa level

Other methodologies

Analysis/prediction of subtype from sequence alignments
- characterization of aa residues, looking for significant substitutions
- gathering sequences into subgroups, comparing each subgroup

Principal component analysis (Casari et al., 1995)
- looks for functional residues conserved in protein families

Evolutionary Trace (Lichtarge et al.)

Phylogenetic Inference (Sjolander et al.)

Goal: identify regions conferring subfamily specificity
- secondary goal: predict subtypes of orphan sequences

Input to algorithm:
- multiple sequence alignment (MSA) of sequences in a protein family
- classification of the sequences in that MSA into subfamilies

For the given subtypes (or subfamilies):
- get the MSA subalignment for each subfamily
- build an HMM profile for each subfamily subalignment
- rationale: generate pseudocounts and account for statistical bias

For each subalignment profile:
- the profile value for amino acid x at position i in subfamily j is the probability of finding x at position i in subfamily j
- summed over all amino acids at a given position, the profile values equal 1
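The per-subfamily profile step can be sketched in a few lines. This is only an illustration under stated assumptions: the slides build HMM profiles, whereas the sketch below swaps in plain column frequencies with add-one pseudocounts. The function names `column_profile` and `subfamily_profiles` and the `pseudocount` parameter are hypothetical and do not come from the slides.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(column, pseudocount=1.0):
    """Return {aa: probability} for one alignment column.

    Gap characters (anything outside the 20 amino acids) are ignored.
    The pseudocount is a simple stand-in for what an HMM profile would provide.
    """
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

def subfamily_profiles(alignment, labels, pseudocount=1.0):
    """Split an MSA by subfamily label and build one profile per subfamily.

    alignment: list of equal-length aligned sequences (strings).
    labels:    subfamily label for each sequence, in the same order.
    Returns {label: [profile for position 0, profile for position 1, ...]}.
    """
    length = len(alignment[0])
    groups = {}
    for seq, label in zip(alignment, labels):
        groups.setdefault(label, []).append(seq)
    return {
        label: [column_profile([s[i] for s in seqs], pseudocount)
                for i in range(length)]
        for label, seqs in groups.items()
    }
```

Each position's profile sums to 1 by construction, which matches the property stated on the slide.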

Relative Entropy
- a measure of “distance” between two probability distributions
- always produces a value >= 0 (0 for two identical distributions)
- computed for each position i in a subfamily s
- gives, for each position, an RE value for subfamily s vs s-bar (all other subfamilies)

Cumulative Relative Entropy
- given the set of relative entropies for each subfamily at each position
- combine them to produce a CRE for a given position i in the MSA across all subfamilies
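The formulas for this slide did not survive in the transcript, so the following is a sketch under assumptions rather than the authors' exact definition. It uses the standard relative entropy (Kullback-Leibler divergence) and assumes the cumulative value at a position is the sum of the per-subfamily values. The names `relative_entropy`, `cumulative_relative_entropy`, `inside` and `outside` are hypothetical.

```python
import math

def relative_entropy(p, q):
    """Standard relative entropy: sum over x of p[x] * log(p[x] / q[x]).

    Assumes q[x] > 0 wherever p[x] > 0 (pseudocounts guarantee this).
    """
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

def cumulative_relative_entropy(inside, outside):
    """Combine per-subfamily relative entropies into one value per position.

    inside[s][i]:  amino acid distribution at position i within subfamily s.
    outside[s][i]: distribution at position i over all other subfamilies (s-bar).
    Summing over subfamilies is an assumption of this sketch.
    """
    n_positions = len(next(iter(inside.values())))
    return [
        sum(relative_entropy(inside[s][i], outside[s][i]) for s in inside)
        for i in range(n_positions)
    ]
```

A position where every subfamily has the same distribution as the rest of the family gets a value of 0; positions where subfamilies prefer different residues get larger values.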

Given this set of cumulative relative entropy measures (one for each position in the MSA), take the Z-score
- a standard statistical measure: the number of standard deviations above or below the mean
- tells you which residue positions vary strongly in aa distribution between subfamilies
- empirically, Z > 3 correlates with functional residues

For position i, which amino acid is dominant in a given subfamily?
- find the probability of observing aa x at position i in subfamily s vs not-s
- take the aa with probability >=

We now have a small set of aa residues which differ strongly between subfamilies of a protein family.
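The Z-score step is simple enough to show directly. The sketch below assumes the population standard deviation (the slides do not say which estimator is used) and uses the Z > 3 cutoff quoted above as a default. The function names are hypothetical.

```python
import statistics

def z_scores(values):
    """Number of standard deviations each value lies above or below the mean.

    Uses the population standard deviation; this choice is an assumption.
    """
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]

def high_scoring_positions(cre, threshold=3.0):
    """Indices of alignment positions whose CRE Z-score exceeds the threshold."""
    return [i for i, z in enumerate(z_scores(cre)) if z > threshold]
```

For example, `high_scoring_positions(cre_values)` on a list of per-position CRE values returns the candidate specificity-determining positions.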

Subfamily data

What exactly constitutes a family or subfamily?
- not always clear
- automated tree generation could not separate the data into clear subfamilies
- use of PFAM alignments and SWISSPROT data

Subfamilies are not clearly defined in databases
- divided proteins from the PFAM database into subfamilies based on SWISSPROT data
- keyword search limited to the enzymatic activity string in SWISSPROT
- put into groups, then checked for obvious mistakes
- also eliminated divisions “easily discernable by sequence comparison”
- 62 groupings from 42 alignments remained
- randomly picked one grouping per alignment (1:1) to produce 42 groups over 42 alignments

Subfamilies

Four very large families used to test the results
- nucleotidyl cyclases
- eukaryotic protein kinases
- lactate/malate dehydrogenases
- trypsin-like serine proteases

Nucleotidyl cyclases
- membrane-attached or cytosolic; cyclize GTP -> cGMP or ATP -> cAMP
- found residues 1018 and 938, which correlate with previous results
- also identified residues which have not been tested experimentally

Protein kinases
- phosphorylate serine/threonine or tyrosine residues
- compared to experimental results, some ser/thr vs tyr kinase differences were not detected
- inconsistency (no conservation) within the subfamily
- residues which were common to both ser/thr and tyr kinases

Subfamilies (cont)

Lactate/Malate Dehydrogenases
- common to a very wide variety of organisms; highly divergent
- results mostly as expected, but a few residues were identified outside of the active site

Serine Proteases
- cut the protein backbone, with differing specificity as to where (what aa precedes the cut)
- specificity pocket determines where the protease can bind
- identified 2 out of 3 of the experimentally determined pocket residues
- (the third had a low Z-score because of tolerance in one protein family)
- also identified a few residues outside of the active site

Prediction of Protein Subfamily

Sequence Similarity
- straight % similarity with other sequences (ignoring gaps)

BLAST
- database search; assign to the nearest subfamily with the best alignment

HMM method
- align the query sequence to the HMMs of all subfamilies and assign it to the best alignment
- will attempt to do iterative optimization of match…

Profile method
- take the original HMM and its probability profile

Sub-profile method
- only use residues in the profile formula that have a positive Z-score
- to reduce noise, restrict to values that have above-average positive relative entropy
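The scoring formula for the profile and sub-profile methods is missing from the transcript. As a rough sketch only, the code below assumes the score is a sum of log-probabilities of the query's residues under each subfamily profile, and that the sub-profile variant restricts the sum to a chosen subset of positions (for instance those with a positive Z-score). The names `profile_score`, `assign_subfamily` and `positions` are hypothetical.

```python
import math

def profile_score(sequence, profile, positions=None):
    """Sum of log-probabilities of the aligned query under one subfamily profile.

    sequence:  aligned query string, same length as the profile.
    profile:   list of {aa: probability} dicts, one per alignment position.
    positions: optional subset of position indices to score (sub-profile variant).
    Characters missing from a position's profile (e.g. gaps) are skipped.
    """
    indices = range(len(sequence)) if positions is None else positions
    return sum(
        math.log(profile[i][sequence[i]])
        for i in indices
        if sequence[i] in profile[i]
    )

def assign_subfamily(sequence, profiles, positions=None):
    """Return the label of the subfamily whose profile scores the query highest."""
    return max(
        profiles,
        key=lambda label: profile_score(sequence, profiles[label], positions),
    )
```

Restricting `positions` to the high-scoring columns is what reduces noise in this sketch: columns where all subfamilies look alike contribute the same amount to every score and so carry no discriminating signal.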

Casari et al. (1995): A method to predict functional residues in proteins

Input: a multiple sequence alignment
- each sequence is converted to a vector of size 20*l, where l is the length of the alignment

Generation of an N x (20*l) matrix
- one sequence produces a vector of dimension 20*l
- N sequences produce N vectors of dimension 20*l

Use Principal Component Analysis
- get the covariance matrix, which tells you how factors are correlated with one another
- eliminate covariance by finding the eigenvectors/eigenvalues of the covariance matrix
- the largest eigenvalues and corresponding eigenvectors give you the principal components, i.e. the largest factors determining the distribution of your dataset
- they take the three largest (the largest of which represents the consensus sequence)
- project the 20*l-dimensional data onto those 3 dimensions
- this can be used to predict a protein subfamily for a given protein
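The encoding and projection steps above can be sketched with NumPy. This is an illustration under assumptions, not the published procedure: it mean-centers the data (the usual PCA convention; the slides do not say whether the original does), and it obtains the principal axes from a singular value decomposition of the centered data rather than an explicit eigendecomposition of the covariance matrix, which yields the same axes. The function names are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_encode(alignment):
    """Turn N aligned sequences of length l into an N x (20*l) 0/1 matrix.

    Gap characters leave their 20-entry block as all zeros.
    """
    n, length = len(alignment), len(alignment[0])
    X = np.zeros((n, 20 * length))
    for row, seq in enumerate(alignment):
        for i, aa in enumerate(seq):
            k = AMINO_ACIDS.find(aa)
            if k >= 0:
                X[row, 20 * i + k] = 1.0
    return X

def project_onto_components(X, n_components=3):
    """Project each sequence vector onto the top principal components.

    The rows of vt are the eigenvectors of the covariance matrix of X,
    ordered by decreasing eigenvalue.
    """
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

Plotting the rows of `project_onto_components(one_hot_encode(msa))` in three dimensions is one way to see whether sequences cluster by subfamily.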

General Weirdness

Construction of a “comparison matrix”
- take matrix x (matrix transpose), i.e. the product of the matrix with its transpose
- solve for eigenvectors and eigenvalues as before

Columns of f represent amino acid values and positions
- becomes possible to examine individual amino acid residues and positions
- plotted on a graph, shows residue correlation to type of protein subfamily
- does this actually work?