Remote Homology Detection: Beyond Hidden Markov Models. Lenore Cowen, CS Department, Tufts University.

What are Homologous Proteins? Proteins that preserve related structure (and often function) because they have evolved from a common ancestor. Examples: human insulin (PDB ID: 1mso) and pig insulin (PDB ID: 1m5a).

Why is homology important? Common ancestor, similar structure, similar function.

Computational Approaches to Detecting Homology: Sequence-based methods work best when the homology is not too distant. These proteins, aligned by BLAST, have probably evolved from a common ancestor:
S1  LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV  60
    L+P +K+ V A WGKV  +  E G EAL R+ + +P T+ +F  F DLS      G+ +V
S2  LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQV  55

A Greater Challenge: Detect Remote Homologs

Known proteins are organized into hierarchical structural classification schemes

SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)

Can we recognize all folds that form a beta-propeller, etc.? If they are evolutionarily close enough, the answer is YES: use BLAST to recognize homology (similar sequences have similar folds).
…GVFIIIMGSHGK…
…GVD-LMG-HGR…

Approaches to Structural Motif Recognition: statistical template/profile methods (Altschul et al. 1990); Hidden Markov Models (Eddy, 1998); threading methods (Jones et al. 1992); combinations of two or more of the above.

Profile Hidden Markov Models

HMM is trained from Sequence Alignment of Known Structures

Usually from a structure alignment, built using geometric criteria. This is a BICRITERIA OPTIMIZATION PROBLEM: place everything in the core, and residue distances are bad; place a single residue in the core, and all distances are great!

HMM is trained from Sequence Alignment of Known Structures

But: an HMM cannot capture pairwise long-range beta-sheet interactions!

The Right-handed Parallel Beta-Helix: In previous work, we showed that hydrogen-bonded residues in beta-sheets had strong statistical preferences that could be used to recognize this fold (no sequence repeat). Example: Pectate Lyase C (Yoder et al. 1993).

The Right-handed Parallel Beta-Helix: Residues in stacking beta-strands show statistical correlations. But these cannot be captured by an HMM, because the correlated residues are far apart, and a variable distance apart, in sequence. Example: Pectate Lyase C (Yoder et al. 1993).

3D Pairwise Correlations: Stacking residues in adjacent beta-strands exhibit strong correlations. [Figure labels: B1, B2, T2, B3]

BetaWrap Program [Bradley, Cowen, Menke, King, Berger, PNAS, 2001, 98:26, 14,819-14,824; Cowen, Bradley, Menke, King, Berger (2002), J Comp Biol, 9, 261-276]: Uses a dynamic program to try to “thread” a sequence onto a template with pairwise residue correlations. Throws away ALL sequence information. Better than HMM methods.

BetaWrap Program [Bradley, Cowen, Menke, King, Berger, PNAS, 2001, 98:26, 14,819-14,824; Cowen, Bradley, Menke, King, Berger (2002), J Comp Biol, 9, 261-276]: Performance: On the 2001 PDB, no false positives and no false negatives. Recognizes beta helices in the PDB across SCOP families in cross-validation. Recognizes many new potential beta helices when run on larger sequence databases. Runs in linear time (~5 min. on SWISS-PROT).

SMURF: Structural Motifs Using Random Fields

Structural Motifs Using Random Fields: Can we get the benefit of pairwise correlations without having to throw away all sequence info?

Let’s look at what this would mean for propeller folds

The template is learned from solved structures in the PDB

Computing a Score: Sequences are scored by computing their best “threading” or “parse” against the template, as a sum of HMM(score) + pairwise(score). This is no longer polynomial time (multi-dimensional dynamic programming), but it is tractable on propellers because paired beta-strands don’t interleave too much.
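To make the "HMM(score) + pairwise(score)" sum concrete, here is a minimal Python sketch. It is not the SMURF implementation: the function names, the per-position emission tables, and the pair table are hypothetical, and the search over parses is a brute-force enumeration (exponential, toy inputs only) standing in for the multi-dimensional dynamic program mentioned on the slide.

```python
from itertools import combinations


def threading_score(seq, placement, emit, paired_positions, pair_score):
    """Score one parse: placement[t] is the sequence index assigned to template position t.

    emit[t] maps residue -> per-position (HMM-like) score; pair_score maps
    (residue, residue) -> score for two template positions that are paired
    in a beta-sheet. Both tables are hypothetical illustrations.
    """
    hmm_part = sum(emit[t].get(seq[i], 0.0) for t, i in enumerate(placement))
    pair_part = sum(
        pair_score.get((seq[placement[t]], seq[placement[u]]), 0.0)
        for t, u in paired_positions
    )
    return hmm_part + pair_part


def best_threading(seq, emit, paired_positions, pair_score):
    """Try every order-preserving placement and keep the best-scoring one.

    This exhaustive search is only for illustration; it is exponential in
    the template length.
    """
    best = None
    for placement in combinations(range(len(seq)), len(emit)):
        score = threading_score(seq, placement, emit, paired_positions, pair_score)
        if best is None or score > best[0]:
            best = (score, placement)
    return best
```

The pairwise term is what couples template positions that may be far apart in sequence, which is why a plain left-to-right HMM recursion no longer suffices.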

Results on Propellers
TNeg   6-bladed          7-bladed
       Hmmer   Smurf     Hmmer   Smurf
97%    52      80        87
96%    56      80        87
95%    64      80        87      93
94%    68      84        90      93
93%    68      84        90      93
92%    68      88        90      97
91%    68      92        90      97
90%    68      92        93      100

So: what new sequences fold into propellers? We predict a double propeller motif in the N-terminal region of a hybrid 2-component sensor protein.

What are these proteins? First found in a benign bacterium in the human gut. May be involved in adapting to changes in diet and efficiently processing different sugars. Found in other bacterial species, where they help sense and adapt to environmental changes. Big stretch (I am not a biologist): could they help to study the human obesity epidemic?

Popular Domains: HisKA histidine kinase domain; GGDEF adenylyl cyclase signalling domain; SpoIIE sporulation domain; GAF domain; PAS domain; HATPase domain.

Species distribution

What’s Next for SMURF? Long-range dependencies; deeply interleaved β-strand pairs.

Conclusions: Combining an HMM score with a pairwise score can help recognize beta-structures. Computing this score exactly with a random field is highly computationally intensive. We will begin to look at when it is feasible and when we should use heuristics. Are there other ways to incorporate pairwise dependencies into HMMs?

An HMM is only as good as its training data, or is it? Idea: we augment the training set, using the simplest model of evolution!

Kumar, A. et al. Bioinformatics 2009 25:1602-1608; doi:10.1093/bioinformatics/btp265

The PAM 250 Matrix

Evolution Model (BLOSUM62)

β-Strand Mutation Model
Augment the MSA with sequences such that the frequency of hydrogen-bonded amino acid pairs resembles that of known proteins.
For each sequence, add M mutated (new) sequences with a p% mutation rate, where p is proportional to the total length of the β-strands.
Adding mutations:
– Select a residue position ‘i’ at random
– If ‘i’ pairs with ‘j’, then ‘i’ is mutated based on either the buried or the exposed residue probability table
– The process is repeated to obtain a ‘p%’ mutation rate
– Set the value of p = 10, 20, …, 100
For 100 sequences we get an augmented set of 100 + M*100 sequences.
Test various values of M ranging from 10 to 1000 at p = 20% to determine stability.
Pick the M that gives stable results: 150 sequences for each training sequence.
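The mutation steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the probability tables are assumed to be keyed by the partner residue (`table[partner_residue][new_residue] = probability`), and positions are drawn without replacement so that exactly the requested fraction of paired positions is mutated.

```python
import random


def mutate_beta_strands(seq, partner, buried, buried_table, exposed_table, rate, rng):
    """Return a copy of seq with a fraction `rate` of paired positions mutated.

    partner maps a paired position i to its hydrogen-bonded partner j.
    buried is the set of positions treated as buried (others are exposed).
    Table layout is an assumption: table[residue_at_j] -> {new_residue: prob}.
    """
    residues = list(seq)
    paired = sorted(partner)
    n_mut = round(rate * len(paired))
    for i in rng.sample(paired, n_mut):  # distinct positions (a simplification)
        j = partner[i]
        table = buried_table if i in buried else exposed_table
        dist = table[residues[j]]
        choices = list(dist)
        weights = [dist[c] for c in choices]
        residues[i] = rng.choices(choices, weights=weights)[0]
    return "".join(residues)


def augment(records, m, buried_table, exposed_table, rate, rng):
    """Keep every original sequence and add m mutated copies of each.

    records is a list of (seq, partner, buried) tuples, so N sequences
    become N + m * N sequences.
    """
    out = [seq for seq, _, _ in records]
    for seq, partner, buried in records:
        for _ in range(m):
            out.append(
                mutate_beta_strands(
                    seq, partner, buried, buried_table, exposed_table, rate, rng
                )
            )
    return out
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which matters when comparing different values of M for stability.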

Dataset
Pick sequences that are less than 95% identical from the ASTRAL database.
Use families in the “all beta proteins” class that have at least 10 sequences, and where the remaining families in the superfamily have at least one sequence (this resulted in 41 families).
Training set:
– Sequences from a family
Test set:
– Positive sequences: sequences in the superfamily that don’t belong to the family in the training set
– Negative sequences: sequences from other folds
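The AUC reported in this part of the talk is computed over positive and negative test sequences like the ones just described. As a reminder of what that number measures, here is a small Python sketch using the standard pairwise (Mann-Whitney) definition of AUC; the function name and the example scores are hypothetical, and this is not claimed to be the evaluation code used for the reported results.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic time, which is fine for a sketch.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical HMM scores for two positive and two negative test sequences.
example = auc([3.0, 2.0], [1.0, 2.5])
```

An AUC of 1.0 means every positive (same superfamily, different family) scores above every negative (other folds); 0.5 is what a random ranking would give.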

HMM Stability – Simple Mutation Model

HMM Stability – β-Strand Mutation Model

AUC Improvement (Simple Mutation Model)

AUC Improvement (β-Strand Mutation Model)

AUC Improvement (Combined)

Distribution of Families (Simple Mutation Model)

Distribution of Families (β-Strand Mutation Model)

Discussion: We have seen two different ways of combining beta-sheet pairwise statistics with HMMs, and in both cases showed improved detection of remote homologs.

Acknowledgements: Matt Menke, Bonnie Berger, Bonnie Berger group (SMURF), Anoop Kumar, Tufts BCB group.

Acknowledgements: National Institutes of Health. Thank you!
