Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Biological Language Modeling Project Segmentation Conditional.

Presentation transcript:

Segmentation Conditional Random Fields (SCRFs): A New Approach for Protein Fold Recognition
Yan Liu (1), Jaime Carbonell (1), Peter Weigele (2), Vanathi Gopalakrishnan (3)
1. School of Computer Science, Carnegie Mellon University
2. Biology Department, Massachusetts Institute of Technology
3. Center for Biomedical Informatics, University of Pittsburgh

Structural motif recognition
Structural motif
– Regular arrangement of secondary structural elements that commonly appears in a variety of protein families
– Super-secondary structure, or protein fold
– Examples: β-α-β (2CMD), leucine-rich repeats (1A4Y)
Structural motif recognition
– Given a structural motif and a protein sequence, predict the presence of the motif and its exact location in the protein, based on sequence alone

Previous work on structural motif recognition
General approaches for structural motif recognition
– Sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997]
– Profile HMMs, e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998]
– Homology modeling or threading, e.g. Threader [Jones, 1998]
Methods carefully designed for specific structural motifs
– Examples: αα- and ββ-hairpins, β-turn and β-helix
Our goal is a general probabilistic framework that addresses all these problems for structural motif prediction
Major challenges
– Structural similarity without clear sequence similarity
– Long-range interactions, such as β-sheets
– Hard to generalize

Outline
– Introduction
– Conditional random fields
– Segmentation conditional random fields
– Case study on β-helix fold recognition
– Conclusion

Graphical models for protein structure prediction
– Probabilistic causal networks [Delcher et al, 1993]
– Markov random fields [White et al, 1994]
– Hidden Markov model [Bystroff et al, 2000]
– Bayesian segmentation model [Schmidler and Liu, 2000]
Protein structure prediction can be generalized as a learning problem for structured data
– Structured data: observations with internal or external structure
– Conditional graphical models are successful in various applications

Conditional random fields
Conditional random fields (CRFs) [Lafferty et al, 2001]
– A conditional undirected graphical model
– The conditional probability is defined as [equation not preserved]
– Flexible feature definition
– A convex objective function guarantees a globally optimal solution
– Efficient inference algorithms
– Kernel CRFs permit the use of implicit feature spaces via kernels [Lafferty et al, 2004]
[Figure: graphical structures of HMMs and CRFs]
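The formula for the conditional probability did not survive in this transcript. For orientation, a linear-chain CRF is commonly written as below. The symbols (weights \lambda_k, feature functions f_k, normalizer Z(x), sequence length N) are the usual textbook notation and are assumed here, not taken from the slide.

```latex
P(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Big( \sum_{i=1}^{N} \sum_{k=1}^{K} \lambda_k \, f_k(y_{i-1}, y_i, \mathbf{x}, i) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\!\Big( \sum_{i=1}^{N} \sum_{k=1}^{K} \lambda_k \, f_k(y'_{i-1}, y'_i, \mathbf{x}, i) \Big)
```

Because each f_k may depend on the whole observation x, features can be defined flexibly, and the log-likelihood of this form is concave in \lambda, which is what makes a globally optimal solution attainable.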

Graphical models for structural motif detection
Structural motif detection
– Structural components
    Secondary structural elements instead of individual residues
– Informative features
    Indicator for conserved regions
    Length of each component
    Propensities to form hydrogen bonds in β-sheets
Segmented Markov models

Segmentation conditional random fields (I)
Protein structural graph G = <V, E1, E2>
– V: nodes for the secondary structural elements of variable lengths
– E1: edges between adjacent nodes for peptide bonds
– E2: edges between distant nodes for hydrogen bonds or disulfide bonds
Example: β-α-β motif
Tradeoff between fidelity of the model and graph complexity
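As a rough illustration of the graph definition above, the sketch below stores the node set and the two edge sets separately. The class name, field names and the three-node encoding of the β-α-β motif are hypothetical choices made for this example, not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Edge = Tuple[str, str]


@dataclass
class ProteinStructureGraph:
    """Nodes are secondary structural elements; edges come in two kinds."""
    nodes: List[str]
    e1: List[Edge] = field(default_factory=list)  # adjacent nodes (peptide bonds)
    e2: List[Edge] = field(default_factory=list)  # distant nodes (hydrogen or disulfide bonds)

    def neighbors(self, node: str) -> List[str]:
        """All nodes joined to `node` by either edge type, sorted by name."""
        out = set()
        for a, b in self.e1 + self.e2:
            if a == node:
                out.add(b)
            elif b == node:
                out.add(a)
        return sorted(out)


# Hypothetical encoding of a beta-alpha-beta motif: two strands joined through a
# helix along the chain (E1), with the strands also paired in a sheet (E2).
beta_alpha_beta = ProteinStructureGraph(
    nodes=["beta1", "alpha", "beta2"],
    e1=[("beta1", "alpha"), ("alpha", "beta2")],
    e2=[("beta1", "beta2")],
)
```

Adding the single E2 edge turns the chain into a cycle, which is a small instance of the fidelity-versus-complexity tradeoff the slide mentions.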

Segmentation conditional random fields (II)
Segmentation conditional random fields (SCRFs)
– Given a protein structure graph G, we define a segmentation of the sequence W = (M, S), where S_i = [definition not preserved]
– The conditional probability of the segmentation W given the observation x is defined as [equation not preserved]
– If each subgraph of the resulting graph is a tree or a chain, we can simplify the model to be [equation not preserved]
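The equations on this slide were images and are missing from the transcript. The block below is a plausible reconstruction that simply carries the general CRF form over to segments. It assumes that each segment S_i records a start position p_i, an end position q_i and a state label y_i, that C_G is the set of cliques of G, and that \pi(i) is the parent (or predecessor) of segment i in a tree or chain. All of these symbols are assumptions, not the authors' notation.

```latex
% Assumed segment definition
S_i = (p_i, q_i, y_i), \qquad i = 1, \dots, M

% General form: one potential per clique c of the structure graph
P(W \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Big( \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k \, f_k(\mathbf{x}, S_c) \Big)

% Simplified form when every subgraph is a tree or a chain
P(W \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Big( \sum_{i=1}^{M} \sum_{k=1}^{K} \lambda_k \, f_k(\mathbf{x}, S_i, S_{\pi(i)}) \Big)
```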

Training and testing for SCRFs
Training phase: learn the model parameters
– Minimize the regularized log loss
– Seek the direction in which the empirical feature values agree with their expectations
– An iterative search algorithm has to be applied
Testing phase: search for the segmentation that maximizes P(W|x)
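The training bullets can be made concrete with a standard regularized log loss and its gradient. This is a sketch under assumptions: the Gaussian (L2) regularizer with variance \sigma^2, the training-example index j, and the shorthand f_k(x, W) for feature k summed over all cliques are not specified on the slide.

```latex
L(\lambda)
  = -\sum_{j} \log P\big(W^{(j)} \mid \mathbf{x}^{(j)}\big)
    + \frac{\lVert \lambda \rVert^{2}}{2\sigma^{2}}

\frac{\partial L}{\partial \lambda_k}
  = -\sum_{j} \Big( f_k\big(\mathbf{x}^{(j)}, W^{(j)}\big)
      - \mathbb{E}_{P(W \mid \mathbf{x}^{(j)})}\big[ f_k\big(\mathbf{x}^{(j)}, W\big) \big] \Big)
    + \frac{\lambda_k}{\sigma^{2}}
```

Setting the gradient to zero has no closed-form solution, and the unregularized part vanishes exactly when empirical feature values match their expectations, which is why an iterative search is needed.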

Inference algorithm
– Backward-forward algorithm*
– Viterbi algorithm*
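The slide gives only the algorithm names. The sketch below shows a segment-level Viterbi search for the simplest case, a chain with no long-range (E2) edges, which is the semi-Markov setting. It is not the authors' algorithm; the function name, the `score` callback signature and the toy scoring function are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Segment = Tuple[int, int, str]  # (start, end_exclusive, state)


def semi_markov_viterbi(
    n: int,
    states: List[str],
    max_len: Dict[str, int],
    score: Callable[[str, str, int, int], float],
) -> Tuple[float, List[Segment]]:
    """Return the best-scoring segmentation of positions 0..n-1.

    score(prev_state, state, start, end) is the log-potential of a segment
    [start, end) labelled `state` that follows a segment labelled `prev_state`
    (the empty string for the first segment).
    """
    neg = -math.inf
    # best[j][s]: best score of a segmentation of [0, j) whose last segment has state s
    best = [{s: neg for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for s in states:
            for d in range(1, min(max_len[s], j) + 1):
                i = j - d
                if i == 0:
                    candidates = [("", 0.0)]
                else:
                    candidates = [(p, best[i][p]) for p in states if best[i][p] > neg]
                for p, prev_score in candidates:
                    total = prev_score + score(p, s, i, j)
                    if total > best[j][s]:
                        best[j][s] = total
                        back[j][s] = (i, p)
    last = max(states, key=lambda s: best[n][s])
    top = best[n][last]
    if top == neg:
        return neg, []  # no segmentation satisfies the length limits
    segments: List[Segment] = []
    j, s = n, last
    while j > 0:
        i, p = back[j][s]
        segments.append((i, j, s))
        j, s = i, p
    segments.reverse()
    return top, segments


# Toy usage: reward residues that match the state letter, charge 0.5 per segment.
x = "aaabbb"


def toy_score(prev: str, state: str, start: int, end: int) -> float:
    match = sum(1.0 if ch == state.lower() else -1.0 for ch in x[start:end])
    return match - 0.5


best_score, segments = semi_markov_viterbi(len(x), ["A", "B"], {"A": 4, "B": 4}, toy_score)
```

Each end position considers every allowed segment length, so the cost grows with the sequence length times the maximum segment length times the square of the number of states. Handling E2 edges would require carrying extra state about earlier segments, which this sketch does not attempt.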

SCRFs for β-helix fold recognition (I)
Right-handed β-helix fold
– A regular structural fold with an elongated helix-like structure whose successive rungs are composed of three parallel β-strands (B1, B2, B3 strands)
– T2 turn: a unique two-residue turn
– Performs important functions such as bacterial infection of plants and binding of the O-antigen
Computational challenges
– Long insertions in the T1 and T3 turns
– Structural similarity with low sequence similarity
Previous work
– BetaWrap [Bradley et al 2001, Bradley et al. 2001, Cowen et al 2002]
– BetaWrapPro [McDonnell et al]
[Figure: Pectate Lyase C (Yoder et al. 1993)]

SCRFs for β-helix fold recognition (II)
Protein structure graph
– 5 states: B1, B23, T1, T3, I
– Length constraints
    B1, B23: fixed lengths of 3 and 9
    T1, T3: 1 – 80
– Long-range interactions between B23 segments
Prediction scores
– Log-ratio scores
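A minimal sketch of how the length constraints and a log-ratio score might be expressed. The bounds for B1, B23, T1 and T3 are from the slide. Everything else is an assumption: the slide gives no bound for state I (treated here as unbounded), and it does not define the log-ratio, which is taken here as the log-score of the best β-helix segmentation minus the log-score of labelling the whole sequence as non-helix.

```python
from typing import Dict, Optional, Tuple

# (min_len, max_len) per state. B1, B23, T1, T3 follow the slide; the entry for
# the non-helix state I is an assumption (None means no upper bound).
LENGTH_BOUNDS: Dict[str, Tuple[int, Optional[int]]] = {
    "B1": (3, 3),
    "B23": (9, 9),
    "T1": (1, 80),
    "T3": (1, 80),
    "I": (1, None),
}


def length_ok(state: str, start: int, end: int) -> bool:
    """True if the segment [start, end) has an allowed length for `state`."""
    lo, hi = LENGTH_BOUNDS[state]
    length = end - start
    return length >= lo and (hi is None or length <= hi)


def log_ratio_score(best_log_score: float, null_log_score: float) -> float:
    """Assumed form: log of (best segmentation score / all-non-helix score).

    Both inputs are unnormalized log-scores for the same sequence, so the
    normalizer Z(x) cancels in the ratio.
    """
    return best_log_score - null_log_score
```

Under this reading, a positive score means the best β-helix segmentation is preferred over the null labelling.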

Features
Node features
– Regular expression templates, HMM profiles
– Secondary structure prediction scores
– Segment length
Inter-node features
– β-strand side-chain alignment scores
– Preferences for parallel alignment scores [Steward & Thornton, 2002]
– Distance between adjacent B23 segments
Features are general and easy to extend
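To show what node and inter-node features could look like in code, here is a small sketch. The regular expression is a made-up placeholder (the actual template is not given on the slide), and the function names and segment encoding are hypothetical.

```python
import re
from typing import Tuple

Segment = Tuple[int, int, str]  # (start, end_exclusive, state)

# Placeholder pattern for illustration only; NOT the template used by the authors.
HYPOTHETICAL_B23_TEMPLATE = re.compile(r"[VIL].[VIL]")


def template_feature(sequence: str, seg: Segment) -> float:
    """Node feature: 1.0 if a B23 segment contains the placeholder pattern."""
    start, end, state = seg
    if state != "B23":
        return 0.0
    return 1.0 if HYPOTHETICAL_B23_TEMPLATE.search(sequence[start:end]) else 0.0


def length_feature(seg: Segment) -> float:
    """Node feature: number of residues in the segment."""
    start, end, _ = seg
    return float(end - start)


def b23_distance_feature(prev_b23: Segment, cur_b23: Segment) -> float:
    """Inter-node feature: residues between two adjacent B23 segments."""
    return float(cur_b23[0] - prev_b23[1])
```

A new feature is just another function of the sequence and one or two segments, which is one way to read the slide's remark that features are easy to extend.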

Experiments (I)
Cross-family validation for known β-helix proteins
– PDB select dataset: non-homologous proteins in PDB, with β-helix proteins removed
– SCRFs can score all known β-helices higher than non-β-helices

Experiments (II)
Predicted segmentation for known β-helices

Experiments (III)
Histograms for known β-helices against PDB-minus dataset
– 18 non-β-helix proteins have a score higher than 0
– 13 from β-class and 5 from α/β class
– Most confusing proteins: β-sandwiches and left-handed β-helix

Discovery of potential β-helices
Hypothesize on UniProt reference databases with less than 50% identity (UniRef50)
– 93 sequences were returned with scores above a cutoff of 5
– 48 proteins are homologous with proteins known to be β-helices
Full list can be accessed at
Verification on recently crystallized structures
– Successfully identified gp14 of Shigella bacteriophage as a β-helix protein with a score of 15.63

Conclusion
Segmentation conditional random fields (SCRFs) for protein structural motif detection
– Consider the structural characteristics in a general probabilistic framework
– Conditional graphical models that consider long-range interactions directly and conveniently
– A case study for β-helix fold recognition
Future work
– Computational complexity: O(N^2)
    Chain graph model: localized SCRFs model
– Generality of the model
    Leucine-rich repeats, ankyrin proteins and some virus-spike folds

Further exploration (I)
Chain graph model
– A combination of directed and undirected graphs
– Locally normalized version of segmentation CRFs
– Reduces the computational complexity to O(N)
Experiments on β-helix fold and leucine-rich repeats
– Achieves results close to those of SCRFs, with only slight differences
[Figure: 1A4Y, 1OGQ]

Further exploration (II)
Cross-family validation for known LRR proteins by chain graph model
– 41 LRR proteins with known structures
– 2 super-families and 11 families

SCRFs for general graph
For any graph G = <V, E1, E2>, the conditional probability of the segmentation W given the observation x is defined as [equation not preserved]
– If there are no E2 edges (long-range interactions): semi-Markov conditional random fields (Sarawagi & Cohen, 2004)
Efficient algorithms for inference
– If the state transition is deterministic and the resulting graph consists of trees or chains
– If the state transition is not deterministic or the graph is complex
Approximation methods have to be applied, such as variational methods or sampling

Acknowledgement
Jonathan MIT Bonnie MIT Robert E. Steward and Janet EMBL-EBI John CMU