Carnegie Mellon School of Computer Science 1 Conditional Graphical Models for Protein Structure Prediction Yan Liu Language Technologies Institute Carnegie.

Slides:



Advertisements
Similar presentations
PROTEOMICS 3D Structure Prediction. Contents Protein 3D structure. –Basics –PDB –Prediction approaches Protein classification.
Advertisements

John Lafferty, Andrew McCallum, Fernando Pereira
Three-Stage Prediction of Protein Beta-Sheets Using Neural Networks, Alignments, and Graph Algorithms Jianlin Cheng and Pierre Baldi Institute for Genomics.
Profiles for Sequences
Prediction to Protein Structure Fall 2005 CSC 487/687 Computing for Bioinformatics.
Structural bioinformatics
Structure Prediction. Tertiary protein structure: protein folding Three main approaches: [1] experimental determination (X-ray crystallography, NMR) [2]
Chapter 9 Structure Prediction. Motivation Given a protein, can you predict molecular structure Want to avoid repeated x-ray crystallography, but want.
Finding the Beta Helix Motif By Marcin Mejran. Papers Predicting The  -Helix Fold From Protein Sequence Data by Phil Bradley, Lenore Cowen, Matthew Menke,
January, 2009 Jaime Carbonell et al Carnegie Mellon University Data-Intensive Scalability in Machine Learning and Computational Proteomics.
Repetitive Beta Folds Form, Function, and Properties.
Structure Prediction. Tertiary protein structure: protein folding Three main approaches: [1] experimental determination (X-ray crystallography, NMR) [2]
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 14.10: Common Multiple.
Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Biological Language Modeling Project Segmentation Conditional.
Conditional Random Fields
Carnegie Mellon School of Computer Science 1 Protein Tertiary and Quaternary Fold Recognition: A ML Approach Jaime Carbonell Joint work with: Yan Liu(
Protein Quaternary Fold Recognition Using Conditional Graphical Models
. Protein Structure Prediction [Based on Structural Bioinformatics, section VII]
Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu.
Carnegie Mellon School of Computer Science Copyright © 2003, Carnegie Mellon. All Rights Reserved. Biological Language Modeling Project TXTpred: A New.
Protein Structures.
Systematic Analysis of Interactome: A New Trend in Bioinformatics KOCSEA Technical Symposium 2010 Young-Rae Cho, Ph.D. Assistant Professor Department of.
Template-based Prediction of Protein 8-state Secondary Structures June 12 th 2013 Ashraf Yaseen and Yaohang Li DEPARTMENT OF COMPUTER SCIENCE OLD DOMINION.
Protein Tertiary Structure Prediction
Extracting Places and Activities from GPS Traces Using Hierarchical Conditional Random Fields Yong-Joong Kim Dept. of Computer Science Yonsei.
Genomics and Personalized Care in Health Systems Lecture 9 RNA and Protein Structure Leming Zhou, PhD School of Health and Rehabilitation Sciences Department.
Graphical models for part of speech tagging
Predicting Secondary Structure of All-Helical Proteins Using Hidden Markov Support Vector Machines Blaise Gassend, Charles W. O'Donnell, William Thies,
RNA Secondary Structure Prediction Spring Objectives  Can we predict the structure of an RNA?  Can we predict the structure of a protein?
Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices Yan Liu Sep 29, 2003.
Neural Networks for Protein Structure Prediction Brown, JMB 1999 CS 466 Saurabh Sinha.
P ROTEIN SEONDARY & SUPER-SECONDARY STRUCTURE PREDICTION WITH HMM By En-Shiun Annie Lee CS 882 Protein Folding Instructed by Professor Ming Li.
Web Servers for Predicting Protein Secondary Structure (Regular and Irregular) Dr. G.P.S. Raghava, F.N.A. Sc. Bioinformatics Centre Institute of Microbial.
Protein Structure & Modeling Biology 224 Instructor: Tom Peavy Nov 18 & 23, 2009
Protein secondary structure Prediction Why 2 nd Structure prediction? The problem Seq: RPLQGLVLDTQLYGFPGAFDDWERFMRE Pred:CCCCCHHHHHCCCCEEEECCHHHHHHCC.
1 Protein Structure Prediction (Lecture for CS397-CXZ Algorithms in Bioinformatics) April 23, 2004 ChengXiang Zhai Department of Computer Science University.
Meng-Han Yang September 9, 2009 A sequence-based hybrid predictor for identifying conformationally ambivalent regions in proteins.
Study of Protein Prediction Related Problems Ph.D. candidate Le-Yi WEI 1.
Presented by Jian-Shiun Tzeng 5/7/2009 Conditional Random Fields: An Introduction Hanna M. Wallach University of Pennsylvania CIS Technical Report MS-CIS
A Generalization of Forward-backward Algorithm Ai Azuma Yuji Matsumoto Nara Institute of Science and Technology.
Protein Structure Prediction ● Why ? ● Type of protein structure predictions – Sec Str. Pred – Homology Modelling – Fold Recognition – Ab Initio ● Secondary.
Conditional Random Fields for ASR Jeremy Morris July 25, 2006.
Carnegie Mellon School of Computer Science 1 Protein Quaternary Fold Recognition Using Conditional Graphical Models Yan Liu IBM Research Jaime Carbonell.
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
A New Supervised Over-Sampling Algorithm with Application to Protein-Nucleotide Binding Residue Prediction Li Lihong (Anna Lee) Cumputer science 22th,Apr.
Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.
Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine 朱林娇 14S
Hidden Markov Model and Its Application in Bioinformatics Liqing Department of Computer Science.
Discriminative Training and Machine Learning Approaches Machine Learning Lab, Dept. of CSIE, NCKU Chih-Pin Liao.
Structural classification of Proteins SCOP Classification: consists of a database Family Evolutionarily related with a significant sequence identity Superfamily.
Network Partition –Finding modules of the network. Graph Clustering –Partition graphs according to the connectivity. –Nodes within a cluster is highly.
Machine Learning: A Brief Introduction Fu Chang Institute of Information Science Academia Sinica ext. 1819
Carnegie Mellon School of Computer Science 1 Conditional Graphical Models for Protein Structure Prediction Yan Liu Language Technologies Institute Carnegie.
Protein Tertiary Structure Prediction Structural Bioinformatics.
Proteins Structure Predictions Structural Bioinformatics.
Protein Structure Prediction: Threading and Rosetta BMI/CS 576 Colin Dewey Fall 2008.
Protein Structure Prediction. Protein Sequence Analysis Molecular properties (pH, mol. wt. isoelectric point, hydrophobicity) Secondary Structure Super-secondary.
Graphical Models for Segmenting and Labeling Sequence Data Manoj Kumar Chinnakotla NLP-AI Seminar.
4.2 - Algorithms Sébastien Lemieux Elitra Canada Ltd.
Dan Roth University of Illinois, Urbana-Champaign 7 Sequential Models Tutorial on Machine Learning in Natural.
Predicting Structural Features Chapter 12. Structural Features Phosphorylation sites Transmembrane helices Protein flexibility.
GUILLOU Frederic. Outline Introduction Motivations The basic recommendation system First phase : semantic similarities Second phase : communities Application.
Using the Fisher kernel method to detect remote protein homologies Tommi Jaakkola, Mark Diekhams, David Haussler ISMB’ 99 Talk by O, Jangmin (2001/01/16)
BIOINFORMATION A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation - - 王红刚 14S
Recovering Temporally Rewiring Networks: A Model-based Approach
Estimating Networks With Jumps
Protein Structures.
Protein structure prediction.
Conditional Graphical Models for Protein Structure Prediction
Protein structure prediction
Presentation transcript:

Carnegie Mellon School of Computer Science 1 Conditional Graphical Models for Protein Structure Prediction Yan Liu Language Technologies Institute Carnegie Mellon University Thesis Committee Jaime Carbonell (Chair) John Lafferty Eric P. Xing Vanathi Gopalakrishnan (Univ. of Pittsburgh) PhD Thesis Proposal

Carnegie Mellon School of Computer Science 2 Proteins in Our Life Nobelprize.org Predict protein structures from sequences

Carnegie Mellon School of Computer Science 3 Protein Structure Hierarchy Our focus: predicting the topology of the structures

Carnegie Mellon School of Computer Science 4 Previous work General approaches for structural motif recognition  Sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997]  Profile HMM,.e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998]  Homology modeling or threading, e.g. Threader [Jones, 1998]  Window-based methods, e.g. PSI_pred [Jones, 2001] Methods of careful design for specific structure motifs  Example: αα- and ββ- hairpins, β-turn and β-helix - Structural similarity without clear sequence similarity - Long-range interactions - Hard to generalize

Carnegie Mellon School of Computer Science 5 Missing pieces Informative features without clear interpretation No principled models to formulate the structured properties of proteins Discriminative models Graphical models Conditional graphical models for protein structure prediction

Carnegie Mellon School of Computer Science 6 Conditional Graphical Models (CGM) Structural prediction is reduced to segmentation and labeling problem Define segment semanticsstructural modules  W = (M, {Wi}), M: # of segments, Wi: configuration of segment i The conditional probability of a possible segmentation W given the observation x is defined as

Carnegie Mellon School of Computer Science 7 Advantages of Conditional Graphical Models Flexible feature definition and kernelization Different loss functions and regularizers Convex functions for globally optimal solutions Structured information, especially the long-range interactions

Carnegie Mellon School of Computer Science 8 Training and Testing Training phase : learn the model parameters  Minimizing regularized log loss  Seek the direction whose empirical values agree with the expectation  Iterative searching algorithms have to be applied Testing phase: search the segmentation that maximizes P(w| x )

Carnegie Mellon School of Computer Science 9 Thesis Work Conditional graphical models for protein structure prediction

Carnegie Mellon School of Computer Science 10 Model Roadmap Conditional random fields Kernel CRFs Segmentation CRFs Chain graph model Factorial segmentation CRFs Kernelized Segmentation Locally normalized Factorized

Carnegie Mellon School of Computer Science 11 Outline Conditional graphical models for protein structure prediction

Carnegie Mellon School of Computer Science 12 Task Definition and Evaluation Materials Given a protein sequence, predict its secondary structure assignments  Three class: helix (C), sheets (E) and coil (C) APAFSVSPASGACGP ECA CCEEEEECCCCCHH HCCC Protein Secondary Structure Prediction

Carnegie Mellon School of Computer Science 13 CGM on Secondary Structure Prediction [Liu, Carbonell, Klein-Seetharaman and Gopalakrishnan, Bioinformatics 2004] Structural component  Individual residues Segmentation definition  W = (n, Y), n: # of residues, Yi = H, E or C Specific models  Conditional random fields (CRFs)  Kernel conditional random fields (kCRFs) where Protein Secondary Structure Prediction

Carnegie Mellon School of Computer Science 14 Target Directions Protein Secondary Structure Prediction Input Features (PSI-BLAST profile) Prediction Combination (CRFs) Feature Exploration (KCRFs) Learning Algorithm (SVM) Beta-sheet Detection

Carnegie Mellon School of Computer Science 15 Prediction Combination Previous work  Window-based label combination Rule-based algorithm  Window-based score combination SVM, Neural networks Other graphical models for score combination  Maximum entropy Markov model (MEMM)  Higher-order MEMMs (HOMEMM)  Pseudo state duration MEMMs (PSMEMM) Protein Secondary Structure Prediction MEMM HOMEMM

Carnegie Mellon School of Computer Science 16 Experiment Results Protein Secondary Structure Prediction - Prediction Combination Graphical models are consistently better than the window-based approaches CRFs perform the best among the four graphical models

Carnegie Mellon School of Computer Science 17 Experiment Results Offset from the correct register for the correctly predicted beta-strand pairs (CB513) Protein Secondary Structure Prediction - Beta-sheet detection Considerable improvement in prediction performance Prediction accuracy for beta-sheets

Carnegie Mellon School of Computer Science 18 Experiment Results Prediction accuracy for KCRFs with vertex cliques (V) and edge cliques (E)  PSI-BLAST profile with RBF kernel Protein Secondary Structure Prediction - Feature exploration

Carnegie Mellon School of Computer Science 19 Summary Conditional graphical model for protein secondary structure prediction  Conditional random fields (CRFs)  Kernel conditional random fields (KCRFs) Protein Secondary Structure Prediction

Carnegie Mellon School of Computer Science 20 Outline Conditional graphical models for protein structure prediction

Carnegie Mellon School of Computer Science 21 Task Definition and Evaluation Materials..APAFSVSPASGACGPECA.. Contains the structural motif?..NNEEEEECCCCCHHHCCC.. Structural motif recognition Structural motif  Regular arrangement of secondary structural elements  Super-secondary structure, or protein fold Yes

Carnegie Mellon School of Computer Science 22 CGM for Structural Motif Recognition Structural component  Secondary structure elements Protein structural graph  Nodes for the states of secondary structural elements of unknown lengths  Edges for interactions between nodes in 3-D Example: β-α-β motif Tradeoff between fidelity of the model and graph complexity Structural Motif Recognition

Carnegie Mellon School of Computer Science 23 CGM for Structural Motif Recognition Segmentation definition  Yi: state of segment i Si: the length of segment i M: number of segments Segmentation conditional random fields (SCRFs)  For any graph, we have  For a simplified graph, we have Structural Motif Recognition

Carnegie Mellon School of Computer Science 24 Protein Folds with Structural Repeats Prevalent in proteins and important in functions Each repeats consists of structural motifs and insertions Challenge  Low sequence similarity in structural motifs  Long-range interactions Structural Motif Recognition

Carnegie Mellon School of Computer Science 25 CGM for Structural Motif Recognition Chain graph  A graph consisting of directed and undirected graphs  Given a variable set V that forms multiple subgraphs U Segmentation defintion  Two layer segmentation W = {M, {Ξ i }, T } M: # of envelops Ξ i : the segmentation of envelops Ti: the state of envelop i = repeat, non-repeat Chain graph model Structural Motif Recognition

Carnegie Mellon School of Computer Science 26 Experiments Right-handed β-helix fold  An elongated helix-like structures whose successive rungs composed of three parallel β - strands (B1, B2, B3 strands)  T2 turn: a unique two-residue turn  Low sequence identity Leucine-rich repeats (LLR)  Solenoid-like regular arrangement of beta-strands and an alpha-helix, connected by coils  Relatively high sequence identity (many Leucines) Structural Motif Recognition

Carnegie Mellon School of Computer Science 27 Experiment Results Cross-family validation for classifying β -helix proteins  SCRFs can score all known β -helices higher than non β -helices Structural Motif Recognition SCRFs model

Carnegie Mellon School of Computer Science 28 Experiment Results Predicted Segmentation for known Beta-helices Structural Motif Recognition SCRFs model

Carnegie Mellon School of Computer Science 29 Experiment Results Histograms of scores for known β- helices against PDB-minus dataset  18 non β-helix proteins have a score higher than 0  13 from β-class and 5 from α/β class  Most confusing proteins: β-sandwiches and left-handed β-helix 5 Structural Motif Recognition SCRFs model

Carnegie Mellon School of Computer Science 30 Experiment Results Verification on recently crystallized structures Successfully identify  1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase score  1PXZ: Jun A 1, The Major Allergen From Cedar Pollen score  GP14 of Shigella bacteriophage as a β-helix protein with scoring Structural Motif Recognition SCRFs model

Carnegie Mellon School of Computer Science 31 Experiment Results Cross-family validation for classifying β -helix proteins  Chain graph model can score all known β -helices higher than non β -helices Structural Motif Recognition Chain graph model

Carnegie Mellon School of Computer Science 32 Experiment Results Cross-family validation for classifying LLR proteins  Chain graph model can score all known LLR higher than non- LLR Structural Motif Recognition Chain graph model

Carnegie Mellon School of Computer Science 33 Experiment Results Predicted Segmentation for known Beta-helices and LLRs Structural Motif Recognition Chain graph model

Carnegie Mellon School of Computer Science 34 Further Experiments Virus proteins  Noncellular biological entity that can reproduce only within a host cell  Dynamical properties  Various kinds of viruses Adenovirus (common cold) Bacteriophage (which infects bacteria) DNA virus, RNA virus Structural Motif Recognition Gp 41: core protein of HIV virus (1AIK)

Carnegie Mellon School of Computer Science 35 Summary Segmented conditional graphical models for structural motif recognition  Segmentation conditional random fields  Chain graph model Successful applications  Right-handed beta-helices  Leucine-rich repeats Further verfication  Virus spike folds and others Structural Motif Recognition

Carnegie Mellon School of Computer Science 36 Outline Conditional graphical models for protein structure prediction

Carnegie Mellon School of Computer Science 37 Quaternary Structure Prediction Quaternary structures  Multiple chains associated together through noncovalent bonds or disulfide bonds Classes of quaternary structures  Based on the number of subunits  Based on the identity of the subunits Homo-oligomers* and hetero-oligomers Very few related research work..APAFSVSPASGACGPECA.. Contains the quaternary structures?..NNEEEEECCCCCHHHCCC.. Yes

Carnegie Mellon School of Computer Science 38 Proposed Work Structural component  Secondary structure elements  Super-secondary structure elements Protein structural graph  Nodes for the states of secondary structural or super-secondary structural elements of unknown lengths  Edges for interactions between nodes in 3-D  Protocol: nodes representing secondary structure elements must involve long-range interactions Quaternary structure prediction

Carnegie Mellon School of Computer Science 39 Proposed Work Segmentation definition  W = (M, {Wi}), M: # of segments Wi: configuration of segment i Factorial segmentation conditional random fields R: # of chains in quaternary structures Quaternary structure prediction

Carnegie Mellon School of Computer Science 40 Experiment Design Triple beta-spirals  Described by van Raaij et al. in Nature (1999)  Clear sequence repeats  Two proteins with crystallized structures and about 20 without structure annotation Tripe beta-helices  Described by van Raaij et al. in JMB (2001)  Structurally similar to beta-helix without clear sequence similarity  Two proteins with crystallized structures  Information from beta-helices can be used Both folds characterized by unusual stability to heat, protease, and detergent Quaternary structure prediction

Carnegie Mellon School of Computer Science 41 Summary Conditional graphical models for protein structure prediction

Carnegie Mellon School of Computer Science 42 Expected Contribution Computational contribution  Conditional graphical models for structured data prediction  “First effective models for the long-range interactions” Biological contribution  Improvement in protein structure prediction  Hypothesis on potential proteins with biological important folds  Aids for function analysis and drug design

Carnegie Mellon School of Computer Science 43 Timeline Jun, Aug, 2005  Data collection for triple beta-spirals and triple beta-helices  Preliminary investigation for virus protein folds Sep, Nov, 2005  Implementation and Testing the model on synthetic data and real data  Determining the specific virus proteins to work on Nov, Jan, 2006  Virus protein fold recognition  Analysis the properties for virus protein folds Feb, June, 2006:  Virus protein fold recognition  Investigation the possibility for protein function prediction or information extraction July, Aug, 2006  Writing up thesis

Carnegie Mellon School of Computer Science 44

Carnegie Mellon School of Computer Science 45 Features Node features  Regular expression template, HMM profiles  Secondary structure prediction scores  Segment length Inter-node features  β -strand Side-chain alignment scores  Preferences for parallel alignment scores  Distance between adjacent B23 segments Features are general and easy to extend Structural Motif Recognition

Carnegie Mellon School of Computer Science 46 Evaluation Measure Q3 (accuracy) Precision, Recall Segment Overlap quantity (SOV) Matthew’s Correlation coefficients P +P- T +Pu T -on

Carnegie Mellon School of Computer Science 47 Local Information PSI-blast profile  Position-specific scoring matrices (PSSM)  Linear transformation[Kim & Park, 2003] SVM classifier with RBF kernel Feature#1 (Si): Prediction score for each residue Ri

Carnegie Mellon School of Computer Science 48 Previous Work Huge literature over decades  Window-based methods  Hidden Markov models Major breakthroughs in recent years  Combine the predictions from neighboring residues or various methods (Cuff & Barton, 1999)  Explore evolutionary information from sequences (Jones, 1999)  Specific algorithm for beta-sheets or infer the paring of beta-strands (Meiler & Baker, 2003) Protein Secondary Structure Prediction