Matt Menke, Tufts; Bonnie Berger, MIT; Lenore Cowen, Tufts

Presentation transcript:

Remote Homology Detection of Beta-Structural Motifs Using Random Fields. Matt Menke, Tufts; Bonnie Berger, MIT; Lenore Cowen, Tufts. ISMB 3Dsig 2010, July 10, 2010

Inferring structural similarity from homology is hard at the SCOP superfamily/fold level

Profile HMMs

The HMM is trained from a sequence alignment of known structures. But: it cannot capture pairwise long-range beta-sheet interactions!
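To make the limitation concrete, here is a minimal Python sketch of the emission part of a profile model. It is an illustration under simplifying assumptions, not the training procedure of any particular HMM package: the function names, the add-one pseudocount, and the toy alignment are all hypothetical, and insert/delete states are ignored. Each alignment column is scored on its own, so no term can reward two residues for appearing together in paired beta-strands.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_emissions(alignment, pseudocount=1.0):
    """Estimate per-column emission probabilities from aligned sequences.

    `alignment` is a list of equal-length strings; '-' marks a gap.
    Returns one dict (amino acid -> probability) per alignment column.
    The pseudocount is an assumption made here to avoid zero probabilities.
    """
    length = len(alignment[0])
    profiles = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
        profiles.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return profiles

def log_score(sequence, profiles):
    """Sum of per-position log-probabilities: every column is independent."""
    return sum(math.log(profiles[i][aa]) for i, aa in enumerate(sequence))

# Toy alignment (hypothetical data, for illustration only).
toy_alignment = ["VIV", "VLV", "AIV"]
toy_profiles = column_emissions(toy_alignment)
```

Because `log_score` is a plain sum over positions, swapping a residue in one strand never changes the contribution of its partner in another strand.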

HMMs cannot capture statistical preferences between residues that are close in space but far apart, at a variable distance, in sequence. Example: Pectate Lyase C (Yoder et al. 1993)

Look at Just Pairs or Generalize to Markov Random Fields
Only look at pairs (figure labels: B1, B2, B3, T2): [Bradley, Cowen, Menke, King, Berger, PNAS, 2001, 98:26, 14,819-14,824; Cowen, Bradley, Menke, King, Berger (2002), J Comp Biol, 9, 261-276]
Generalize to Markov random fields: Liu et al. 2009; Zhao et al. 2010; Menke et al. 2010 (this work)

Let’s look at what this would mean for propeller folds

Goal: capture HMM sequence information and pairwise information in beta-structural motifs at the same time! SCOP (http://scop.mrc-lmb.cam.ac.uk/scop)

SMURF: Structural Motifs Using Random Fields

Structural Motifs Using Random Fields Can we get the benefit of pairwise correlations without having to throw away all sequence info?

The template is learned from solved structures in the PDB

The template is learned from solved structures in the PDB: Aligned with Matt

Digression: the Matt structural alignment program. Menke, Berger, Cowen (PLoS Comput Biol 2008). Specifically designed to align more distant homologs. AFP (aligned fragment pair) chaining using dynamic programming with “translations and twists” (flexibility).


Two pairwise beta tables, one for exposed residues and one for buried residues, are learned from amphipathic beta sheets that are not propellers, taken from solved structures in the PDB. Each table is indexed in both dimensions by the 20 amino acids (A C D E F G H I K L M N P Q R S T V W Y). Full tables: http://bcb.cs.tufts.edu/propellers/si/

Computing a Score: Sequences are scored by computing their best “threading” or “parse” against the template, as the sum HMM(score) + pairwise(score). This is no longer polynomial time (it requires multi-dimensional dynamic programming), but it is tractable on propellers because paired beta-strands don’t interleave too much.
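The combined score can be illustrated with a small Python sketch. This is a toy under loud assumptions, not the SMURF algorithm: it handles a single antiparallel pair of strands, searches placements by brute force rather than multi-dimensional dynamic programming, and every name, default penalty, and table value below is hypothetical.

```python
def hmm_score(residues, emissions, missing=-2.0):
    """HMM-like term: sum of per-position scores (hypothetical log-odds)."""
    return sum(emissions[i].get(aa, missing) for i, aa in enumerate(residues))

def pairwise_score(strand_a, strand_b, pair_table):
    """Pairwise term: sum of table scores over residues paired across strands."""
    return sum(pair_table.get((a, b), 0.0) for a, b in zip(strand_a, strand_b))

def best_threading(sequence, strand_len, emissions_a, emissions_b,
                   pair_table, min_loop=1, max_loop=4):
    """Brute-force the best placement of two paired strands on `sequence`.

    Returns (score, start_a, start_b), or None if nothing fits.
    The loop between the strands may vary in length, which is what a
    plain position-by-position model cannot account for.
    """
    best = None
    for start_a in range(len(sequence) - 2 * strand_len - min_loop + 1):
        for loop in range(min_loop, max_loop + 1):
            start_b = start_a + strand_len + loop
            if start_b + strand_len > len(sequence):
                break
            a = sequence[start_a:start_a + strand_len]
            b = sequence[start_b:start_b + strand_len]
            # Antiparallel pairing: first residue of a faces last residue of b.
            score = (hmm_score(a, emissions_a) + hmm_score(b, emissions_b)
                     + pairwise_score(a, b[::-1], pair_table))
            if best is None or score > best[0]:
                best = (score, start_a, start_b)
    return best

# Hypothetical toy template: both strands prefer V-I-V.
toy_emissions = [{"V": 1.0}, {"I": 1.0}, {"V": 1.0}]
toy_pairs = {("V", "V"): 0.5, ("I", "I"): 0.5}
result = best_threading("GGVIVGGVIVGG", 3, toy_emissions, toy_emissions, toy_pairs)
```

On the toy input the best placement puts the strands at positions 2 and 7 with a two-residue loop. With many strands whose pairings interleave, the number of placements that must be tracked jointly grows quickly, which is the cost the slide refers to.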


Let’s look at what this would mean for propeller folds. Training set for HMM score: leave-superfamily-out cross validation. Training set for pairwise score: amphipathic beta-sheets from NON-propellers.
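Leave-superfamily-out cross validation can be sketched in a few lines of Python. The function name, the protein identifiers, and the superfamily labels below are hypothetical; the sketch only shows the splitting rule, namely that every protein from the held-out superfamily is removed from training before the template is built.

```python
from collections import defaultdict

def leave_superfamily_out(proteins):
    """Yield (held_out_superfamily, train_ids, test_ids) splits.

    `proteins` maps a protein id to its superfamily label.
    """
    by_superfamily = defaultdict(list)
    for pid, sf in proteins.items():
        by_superfamily[sf].append(pid)
    for sf in sorted(by_superfamily):
        test_ids = sorted(by_superfamily[sf])
        train_ids = sorted(p for p, s in proteins.items() if s != sf)
        yield sf, train_ids, test_ids

# Hypothetical ids and labels, for illustration only.
toy_proteins = {"p1": "sfA", "p2": "sfA", "p3": "sfB", "p4": "sfC"}
splits = list(leave_superfamily_out(toy_proteins))
```

Holding out whole superfamilies, rather than random proteins, keeps close relatives of a test protein out of the training set, so the test measures remote homology detection.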

Results on Propellers
[Table: Hmmer and Smurf results on 6-bladed and 7-bladed propellers at TNeg levels from 97% down to 90%. Values as extracted: 97% 52 80 87 96% 56 95% 64 93 94% 68 84 90 93% 92% 88 97 91% 92 90% 100]

Results on Propellers: Note that this is “6 (or 7)” bladed propeller versus non-propeller; distinguishing the number of blades in the propeller seems to be a much harder problem.
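One common way to produce numbers of this kind is to fix a true-negative rate, derive the score threshold from the negatives, and report the fraction of positives above it. The Python sketch below assumes that reading of the TNeg column; the function name, the tie-handling rule, and the scores are hypothetical and are not taken from the talk.

```python
import math

def tp_rate_at_tn_rate(pos_scores, neg_scores, tn_rate):
    """Fraction of positives above the threshold that rejects at least
    `tn_rate` of the negatives (scores at or below the threshold are rejected)."""
    ranked = sorted(neg_scores)
    must_reject = math.ceil(tn_rate * len(ranked))
    threshold = ranked[must_reject - 1] if must_reject > 0 else float("-inf")
    accepted = sum(1 for s in pos_scores if s > threshold)
    return accepted / len(pos_scores)

# Hypothetical scores, for illustration only.
toy_rate = tp_rate_at_tn_rate([1.5, 2.5, 3.5, 5.0], [1.0, 2.0, 3.0, 4.0], 0.5)
```

Comparing two methods at the same true-negative rate, as the table does for Hmmer and Smurf, keeps the false-positive burden equal while the sensitivity is compared.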

Different propeller closures: 1jof, 2trc

So: what new sequences fold into propellers? We predict a double propeller motif in the N-terminal region of a hybrid 2-component sensor protein.

What are these proteins? First found in a benign bacterium in the human gut. May be involved in adapting to changes in diet and efficiently processing different sugars. Found in other bacterial species, where they help sense and adapt to environmental changes. Big stretch (I am not a biologist): could they help to study the human obesity epidemic?

Popular Domains: HisKA histidine kinase domain; GGDEF adenylyl cyclase signalling domain; SpoIIE sporulation domain; GAF domain; PAS domain; HATPase domain

Species distribution

Distinguishing Number of Blades: The automatic SMURF consensus 7-bladed template only learns 6 blades. Sequence motifs are similar: the same Pfam motif occurs in propellers with different numbers of blades. The fix: throw out propellers with a “funky” 7th blade by hand and build a new template. Now 6-bladed propellers don’t like the 7-bladed template. The double propellers we found are probably 7-7 (but 7-6 is also plausible).

Predict propellers with Smurf! http://smurf.cs.tufts.edu accepts sequences in FASTA format and offers 6-, 7-, and 8-bladed templates, as well as all 9 double-propeller templates. http://bcb.cs.tufts.edu/propellers/si provides the pairwise tables and a long list of predicted propeller sequences.
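Since input is expected in FASTA format, a small Python reader can help check sequences before submitting them. This is a generic sketch of the FASTA convention (a header line starting with ">" followed by one or more sequence lines); it makes no assumption about the server's interface, and the function name and example record are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Hypothetical two-record input, for illustration only.
toy_fasta = ">seq1 example\nMKVIV\nAGGT\n\n>seq2\nKLMNP\n"
toy_records = parse_fasta(toy_fasta)
```

Sequence lines that are wrapped across several lines are joined into one string per record.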

What’s Next for SMURF? Long-range dependencies; deeply interleaved β-strand pairs.

Conclusions: Combining an HMM score with a pairwise score can help recognize beta-structures. Computing this score exactly with a random field is highly computationally intensive. We will begin to look at when it is feasible and when we should use heuristics. Also: add side-chain packing and other model refinements.

More Questions: When should we over-weight the HMM versus the pair portion of the score (the case of 8-bladed propellers)? Are there other ways to incorporate pairwise dependencies into HMMs?

An HMM is only as good as its training data, or is it? Idea: we augment the training set, using the simplest model of evolution! See Kumar and Cowen’s ISMB proceedings paper!
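The augmentation idea can be sketched in Python as follows. The slide does not say which evolutionary model is used, so this sketch assumes the crudest one available: independent point mutations at a fixed per-site rate with uniformly chosen replacement residues. The function names, the rate, and the sequences are hypothetical, and the cited paper may use a different substitution model.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(sequence, rate, rng):
    """Apply independent point mutations at the given per-site rate.
    Assumption: the replacement residue is drawn uniformly from the other 19."""
    out = []
    for aa in sequence:
        if rng.random() < rate:
            out.append(rng.choice([x for x in AMINO_ACIDS if x != aa]))
        else:
            out.append(aa)
    return "".join(out)

def augment(training_set, copies, rate, seed=0):
    """Return the original sequences plus `copies` mutated variants of each."""
    rng = random.Random(seed)
    augmented = list(training_set)
    for seq in training_set:
        for _ in range(copies):
            augmented.append(mutate(seq, rate, rng))
    return augmented

# Hypothetical sequences, for illustration only.
toy_augmented = augment(["VIVAG", "KLMNP"], copies=3, rate=0.2)
```

The mutated variants keep the original lengths, so they can be added to an existing alignment column by column before the model is retrained.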

Acknowledgements: National Institutes of Health. Thank you!