Remote Homology Detection: Beyond Hidden Markov Models Lenore Cowen CS Department Tufts University.


What are Homologous Proteins? Proteins that preserve related structure (and often function) because they have evolved from a common ancestor. Examples: Human Insulin (PDB ID: 1mso) and Pig Insulin (PDB ID: 1m5a).

Why is homology important? Common ancestor, similar structure, similar function.

Computational Approaches to Detecting Homology: Sequence-based methods work best when homology is not too distant. These proteins, aligned by BLAST, have probably evolved from a common ancestor:
S1 LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 60
   L+P +K+ V A WGKV  +  E G EAL R+ + +P T+ +F  F DLS      G+ +V
S2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQV 55
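As a rough illustration of how similar the two BLAST-aligned sequences on this slide are, the sketch below counts the alignment columns in which both rows carry the same residue. The helper name `alignment_identity` is ours and is not part of BLAST; the two rows are copied from the slide.

```python
def alignment_identity(row_a, row_b):
    """Return (identical columns, total columns) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # A column counts as identical only if both rows hold the same residue
    # (gap characters never count as a match).
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return identical, len(row_a)


# Aligned rows as shown on the slide ("-" marks a gap).
S1 = "LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV"
S2 = "LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQV"

identical, columns = alignment_identity(S1, S2)
print(f"{identical}/{columns} identical columns ({identical / columns:.0%})")
```

For these two rows the count is 24 identical columns out of 60, which is the regime where sequence-based methods still work well.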

A Greater Challenge: Detect Remote Homologs

Known proteins are organized into hierarchical structural classification schemes

SCOP (Structural Classification of Proteins)

Can we recognize all folds that form a beta-propeller, etc.? If they are evolutionarily close enough, the answer is YES: use BLAST to recognize homology (similar sequences have similar folds). Example: …GVFIIIMGSHGK… / …GVD-LMG-HGR…

Approaches to Structural Motif Recognition: Statistical template/profile methods (Altschul et al. 1990); Hidden Markov Models (Eddy, 1998); Threading methods (Jones et al. 1992); Combinations of two or more of the above.

Profile Hidden Markov Models

HMM is trained from Sequence Alignment of Known Structures

Usually from a structure alignment. The geometric criteria pose a BICRITERIA OPTIMIZATION PROBLEM: place everything in the core, and residue distances are bad; place a single residue in the core, and all distances are great!

HMM is trained from Sequence Alignment of Known Structures

But: an HMM cannot capture pairwise long-range beta-sheet interactions!

The Right-handed Parallel Beta-Helix: Pectate Lyase C (Yoder et al. 1993). In previous work, we showed that hydrogen-bonded residues in beta-sheets had strong statistical preferences that could recognize this fold. (No sequence repeat.)

The Right-handed Parallel Beta-Helix: Pectate Lyase C (Yoder et al. 1993). Residues in stacking beta strands show statistical correlations. But these cannot be captured by an HMM, because the residues are far apart, and a variable distance apart, in sequence.

3D Pairwise Correlations: Stacking residues in adjacent beta-strands exhibit strong correlations. [Figure labels: B3, T2, B2, B1]
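One common way to quantify such correlations is a log-odds score that compares how often a residue pair is observed in stacked positions with how often it would occur if the two positions were independent. The sketch below is a generic illustration under that assumption; the function name `pair_log_odds` and the toy pair list are hypothetical, and the statistics actually used in BetaWrap may be defined differently.

```python
import math
from collections import Counter


def pair_log_odds(stacked_pairs):
    """Log-odds of each observed (upper, lower) residue pair versus independence."""
    total = len(stacked_pairs)
    pair_counts = Counter(stacked_pairs)
    upper_counts = Counter(a for a, _ in stacked_pairs)
    lower_counts = Counter(b for _, b in stacked_pairs)
    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total
        # Expected frequency if the two stacked positions were independent.
        expected = (upper_counts[a] / total) * (lower_counts[b] / total)
        scores[(a, b)] = math.log(observed / expected)
    return scores


# Hypothetical stacked residue pairs, for illustration only.
pairs = [("V", "I"), ("V", "I"), ("L", "A"), ("L", "A")]
scores = pair_log_odds(pairs)
print(scores[("V", "I")])  # positive: the pair occurs more often than chance
```

A positive score marks a pair that stacks more often than independence would predict; a real table would be estimated from solved structures and would need pseudocounts for rare pairs.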

BetaWrap Program [Bradley, Cowen, Menke, King, Berger, PNAS, 2001, 98:26, 14, ,824 ; Cowen, Bradley, Menke, King, Berger (2002), J Comp Biol, 9, ]. Uses a dynamic program to try to “thread” a sequence onto a template with pairwise residue correlations. Throws away ALL sequence information. Better than HMM methods.

BetaWrap Program [Bradley, Cowen, Menke, King, Berger, PNAS, 2001, 98:26, 14, ,824 ; Cowen, Bradley, Menke, King, Berger (2002), J Comp Biol, 9, ]. Performance: on the 2001 PDB, no false positives and no false negatives. Recognizes beta helices in the PDB across SCOP families in cross-validation. Recognizes many new potential beta helices when run on larger sequence databases. Runs in linear time (~5 min. on SWISS-PROT).

Structural Motifs Using Random Fields SMURF

Structural Motifs Using Random Fields: Can we get the benefit of pairwise correlations without having to throw away all sequence info?

Let’s look at what this would mean for propeller folds

The template is learned from solved structures in the PDB

Computing a Score: Sequences are scored by computing their best “threading” or “parse” against the template, as a sum of HMM(score) + pairwise(score). This is no longer polynomial time (multi-dimensional dynamic programming). It is tractable on propellers because paired beta-strands don’t interleave too much.
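To make the "HMM(score) + pairwise(score)" sum concrete, here is a deliberately tiny sketch that threads a sequence onto a template of just two paired beta-strands by brute force. It is not the SMURF algorithm: the score tables, the names `hmm_score`, `pairwise_score` and `best_threading`, and the parameters `k` and `min_loop` are all hypothetical, and a per-residue lookup stands in for a real profile HMM.

```python
# Hypothetical toy parameters, for illustration only.
EMISSION = {"V": 1.0, "I": 1.0, "L": 0.5}   # per-residue strand score (HMM stand-in)
DEFAULT_EMISSION = -0.5
PAIR = {frozenset("VI"): 1.5, frozenset("V"): 1.0, frozenset("I"): 1.0}
DEFAULT_PAIR = -0.25


def hmm_score(seq, starts, k):
    """Sum of per-residue scores over all strand positions of a threading."""
    return sum(EMISSION.get(seq[s + i], DEFAULT_EMISSION)
               for s in starts for i in range(k))


def pairwise_score(seq, s1, s2, k):
    """Sum of pair scores for residue i of strand 1 against residue i of strand 2."""
    return sum(PAIR.get(frozenset((seq[s1 + i], seq[s2 + i])), DEFAULT_PAIR)
               for i in range(k))


def best_threading(seq, k=3, min_loop=2):
    """Try every placement of two length-k strands; return (score, start1, start2)."""
    best = None
    for s1 in range(len(seq) - 2 * k - min_loop + 1):
        for s2 in range(s1 + k + min_loop, len(seq) - k + 1):
            total = hmm_score(seq, (s1, s2), k) + pairwise_score(seq, s1, s2, k)
            if best is None or total > best[0]:
                best = (total, s1, s2)
    return best


print(best_threading("GGVIVGGGGIVIGG"))
```

With only two strands, exhaustive search is cheap. The cost grows quickly as more paired strands are placed jointly, which is why the amount of interleaving between strand pairs matters for tractability.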

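Results on Propellers [Table: columns TNeg, then Hmmer and Smurf for 6-bladed propellers, then Hmmer and Smurf for 7-bladed propellers; the first TNeg value is 97%, the remaining values are not recoverable.]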

So: what new sequences fold into propellers? We predict a double propeller motif in the N-terminal region of a hybrid 2-component sensor protein.

What are these proteins? First found in a benign bacterium in the human gut. May be involved in adapting to changes in diet and in efficiently processing different sugars. Found in other bacterial species, where they help sense and adapt to environmental changes. Big stretch (I am not a biologist): could they help in studying the human obesity epidemic?

Popular Domains: HisKA histidine kinase domain; GGDEF adenylyl cyclase signalling domain; SpoIIE sporulation domain; Gaf domain; PAS domain; HATPase domain.

Species distribution

What’s Next for SMURF? Long-range dependencies; deeply interleaved β-strand pairs.

Conclusions: Combining an HMM score with a pairwise score can help recognize beta-structures. Computing this score exactly with a random field is highly computationally intensive. We will begin to look at when it is feasible and when we should use heuristics. Are there other ways to incorporate pairwise dependencies into HMMs?

An HMM is only as good as its training data, or is it? Idea: we augment the training set, using the simplest model of evolution!

Copyright restrictions may apply. Kumar, A. et al. Bioinformatics : ; doi: /bioinformatics/btp265

The PAM 250 Matrix

Evolution Model (BLOSUM62)
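A minimal sketch of this kind of training-set augmentation: each training sequence is copied M times with a fixed fraction of positions substituted. In practice the replacement weights would be derived from a substitution matrix such as PAM 250 or BLOSUM62; here the optional `weights` table is a hypothetical stand-in, and the names `mutate` and `augment` are ours rather than the authors'.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(seq, rate, rng, weights=None):
    """Return a copy of seq with round(rate * len(seq)) positions substituted.

    weights maps a residue to {replacement: weight}. A residue without an
    entry gets a replacement drawn uniformly from the other amino acids.
    """
    residues = list(seq)
    n_mut = round(rate * len(residues))
    for i in rng.sample(range(len(residues)), n_mut):
        old = residues[i]
        table = (weights or {}).get(old)
        if table:
            choices, w = zip(*table.items())
            residues[i] = rng.choices(choices, weights=w)[0]
        else:
            residues[i] = rng.choice([a for a in AMINO_ACIDS if a != old])
    return "".join(residues)


def augment(seqs, m, rate, seed=0, weights=None):
    """Return the original sequences followed by m mutants of each one."""
    rng = random.Random(seed)
    mutants = [mutate(s, rate, rng, weights) for s in seqs for _ in range(m)]
    return list(seqs) + mutants


augmented = augment(["ACDEFGHIKL"], m=3, rate=0.2)
print(len(augmented))  # 1 original + 3 mutants
```

The augmented set would then be used to train the profile HMM in place of the original alignment.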


β-Strand Mutation Model
Augment the MSA with sequences such that the frequency of hydrogen-bonded amino acid pairs resembles that of known proteins.
For each sequence, add M mutated (new) sequences with a p% mutation rate, where p is proportional to the total length of the β-strands.
Adding mutations:
– Select residue position ‘i’ at random.
– If ‘i’ pairs with ‘j’, then ‘i’ is mutated based on either the buried or the exposed residue probability table.
– Repeat the process to obtain a ‘p%’ mutation rate.
– Set the value of p = 10, 20, …, 100.
For 100 sequences, we get an augmented set of 100 + M*100 sequences.
Test various values of M at 20% to determine stability.
Pick the M that gives stable results: 150 sequences for each training sequence.
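The steps above can be sketched as follows. This is an illustration under several assumptions rather than the authors' implementation: the probability tables `BURIED` and `EXPOSED` hold made-up values, the mutation rate is applied to the number of paired positions, all sequences are assumed to share one pairing map, and paired positions are sampled directly instead of drawing random positions and skipping unpaired ones.

```python
import random

# Hypothetical residue probability tables, for illustration only.
BURIED = {"V": 0.4, "I": 0.3, "L": 0.2, "F": 0.1}
EXPOSED = {"T": 0.4, "S": 0.3, "K": 0.2, "E": 0.1}


def mutate_paired(seq, partner, buried, rate, rng):
    """Mutate a fraction `rate` of the hydrogen-bonded (paired) positions.

    partner: dict mapping each paired position i to its partner j.
    buried:  set of paired positions treated as buried; others are exposed.
    """
    residues = list(seq)
    paired_positions = sorted(partner)
    n_mut = round(rate * len(paired_positions))
    for i in rng.sample(paired_positions, n_mut):
        table = BURIED if i in buried else EXPOSED
        choices, weights = zip(*table.items())
        # The draw may return the residue already present at position i.
        residues[i] = rng.choices(choices, weights=weights)[0]
    return "".join(residues)


def augment_msa(seqs, partner, buried, m, rate, seed=0):
    """Return the original sequences plus m mutated copies of each."""
    rng = random.Random(seed)
    out = list(seqs)
    for s in seqs:
        out.extend(mutate_paired(s, partner, buried, rate, rng) for _ in range(m))
    return out


seq = "GVTVGGGITIG"
partner = {1: 7, 2: 8, 3: 9, 7: 1, 8: 2, 9: 3}   # toy strand pairing
buried = {1, 3, 7, 9}
augmented = augment_msa([seq], partner, buried, m=4, rate=0.5)
print(len(augmented))  # 1 + 4*1 sequences
```

For N input sequences the result has N + M*N entries, matching the "100 + M*100" count on the slide. Unpaired positions are never touched, so loop regions keep their original residues.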

Dataset
Pick sequences that are less than 95% identical from the ASTRAL database.
Use families in the “all beta proteins” class that have at least 10 sequences, and whose remaining families in the superfamily have at least one sequence (this resulted in 41 families).
Training set:
– Sequences from a family
Test set:
– Positive sequences: sequences in the superfamily that don’t belong to the family in the training set
– Negative sequences: sequences from other folds
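Given model scores for the positive and negative test sequences defined above, the AUC reported on the following slides can be computed as the fraction of (positive, negative) pairs in which the positive sequence scores higher, with ties counted as one half. The function name `auc` and the example scores are illustrative assumptions, not values from the study.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0      # positive correctly ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical HMM scores for test sequences.
print(auc([3.0, 2.0], [1.0, 2.5]))  # 3 of 4 pairs ranked correctly
```

An AUC of 1.0 means every remote homolog outscores every sequence from another fold, and 0.5 corresponds to random ranking. The quadratic loop is fine for small test sets; a rank-based formula is preferable for large ones.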

HMM Stability – Simple Mutation Model

HMM Stability – β-Strand Mutation Model

AUC Improvement (Simple Mutation Model)

AUC Improvement (β-Strand Mutation Model)

AUC Improvement (Combined)

Distribution of Families (Simple Mutation Model)

Distribution of Families (β-Strand Mutation Model)

Discussion We have seen two different ways of combining beta-sheet pairwise statistics with HMM models, and in both cases showed improved detection of remote homologs.

Acknowledgements: Matt Menke; Bonnie Berger; Bonnie Berger group (SMURF); Anoop Kumar; Tufts BCB group.

Acknowledgements: National Institutes of Health. Thank you!