Carnegie Mellon School of Computer Science 1 Conditional Graphical Models for Protein Structure Prediction Yan Liu Language Technologies Institute Carnegie.

Slides:



Advertisements
Similar presentations
Secondary structure prediction from amino acid sequence.
Advertisements

PROTEOMICS 3D Structure Prediction. Contents Protein 3D structure. –Basics –PDB –Prediction approaches Protein classification.
John Lafferty, Andrew McCallum, Fernando Pereira
Three-Stage Prediction of Protein Beta-Sheets Using Neural Networks, Alignments, and Graph Algorithms Jianlin Cheng and Pierre Baldi Institute for Genomics.
Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Research supported in part by grants from the National.
Profiles for Sequences
Prediction to Protein Structure Fall 2005 CSC 487/687 Computing for Bioinformatics.
Structure Prediction. Tertiary protein structure: protein folding Three main approaches: [1] experimental determination (X-ray crystallography, NMR) [2]
Finding the Beta Helix Motif By Marcin Mejran. Papers Predicting The  -Helix Fold From Protein Sequence Data by Phil Bradley, Lenore Cowen, Matthew Menke,
Protein threading algorithms 1.GenTHREADER Jones, D. T. JMB(1999) 287, Protein Fold Recognition by Prediction-based Threading Rost, B., Schneider,
1 Learning to Detect Objects in Images via a Sparse, Part-Based Representation S. Agarwal, A. Awan and D. Roth IEEE Transactions on Pattern Analysis and.
January, 2009 Jaime Carbonell et al Carnegie Mellon University Data-Intensive Scalability in Machine Learning and Computational Proteomics.
Repetitive Beta Folds Form, Function, and Properties.
Structure Prediction. Tertiary protein structure: protein folding Three main approaches: [1] experimental determination (X-ray crystallography, NMR) [2]
Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Biological Language Modeling Project Segmentation Conditional.
Carnegie Mellon School of Computer Science 1 Protein Tertiary and Quaternary Fold Recognition: A ML Approach Jaime Carbonell Joint work with: Yan Liu(
Protein Quaternary Fold Recognition Using Conditional Graphical Models
. Protein Structure Prediction [Based on Structural Bioinformatics, section VII]
CISC667, F05, Lec20, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) Protein Structure Prediction Protein Secondary Structure.
Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu.
Carnegie Mellon School of Computer Science Copyright © 2003, Carnegie Mellon. All Rights Reserved. Biological Language Modeling Project TXTpred: A New.
Protein Structures.
(Foundation Block) Dr. Ahmed Mujamammi Dr. Sumbul Fatma
Systematic Analysis of Interactome: A New Trend in Bioinformatics KOCSEA Technical Symposium 2010 Young-Rae Cho, Ph.D. Assistant Professor Department of.
Template-based Prediction of Protein 8-state Secondary Structures June 12 th 2013 Ashraf Yaseen and Yaohang Li DEPARTMENT OF COMPUTER SCIENCE OLD DOMINION.
Protein Tertiary Structure Prediction
Extracting Places and Activities from GPS Traces Using Hierarchical Conditional Random Fields Yong-Joong Kim Dept. of Computer Science Yonsei.
Genomics and Personalized Care in Health Systems Lecture 9 RNA and Protein Structure Leming Zhou, PhD School of Health and Rehabilitation Sciences Department.
Graphical models for part of speech tagging
Prediction to Protein Structure Fall 2005 CSC 487/687 Computing for Bioinformatics.
Predicting Secondary Structure of All-Helical Proteins Using Hidden Markov Support Vector Machines Blaise Gassend, Charles W. O'Donnell, William Thies,
RNA Secondary Structure Prediction Spring Objectives  Can we predict the structure of an RNA?  Can we predict the structure of a protein?
Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices Yan Liu Sep 29, 2003.
Neural Networks for Protein Structure Prediction Brown, JMB 1999 CS 466 Saurabh Sinha.
Secondary structure prediction
P ROTEIN SEONDARY & SUPER-SECONDARY STRUCTURE PREDICTION WITH HMM By En-Shiun Annie Lee CS 882 Protein Folding Instructed by Professor Ming Li.
Tertiary structure combines regular secondary structures and loops (coil) Bovine carboxypeptidase A.
Web Servers for Predicting Protein Secondary Structure (Regular and Irregular) Dr. G.P.S. Raghava, F.N.A. Sc. Bioinformatics Centre Institute of Microbial.
Protein Structure & Modeling Biology 224 Instructor: Tom Peavy Nov 18 & 23, 2009
Protein secondary structure Prediction Why 2 nd Structure prediction? The problem Seq: RPLQGLVLDTQLYGFPGAFDDWERFMRE Pred:CCCCCHHHHHCCCCEEEECCHHHHHHCC.
1 Protein Structure Prediction (Lecture for CS397-CXZ Algorithms in Bioinformatics) April 23, 2004 ChengXiang Zhai Department of Computer Science University.
Meng-Han Yang September 9, 2009 A sequence-based hybrid predictor for identifying conformationally ambivalent regions in proteins.
Study of Protein Prediction Related Problems Ph.D. candidate Le-Yi WEI 1.
Presented by Jian-Shiun Tzeng 5/7/2009 Conditional Random Fields: An Introduction Hanna M. Wallach University of Pennsylvania CIS Technical Report MS-CIS
Protein Structure Prediction ● Why ? ● Type of protein structure predictions – Sec Str. Pred – Homology Modelling – Fold Recognition – Ab Initio ● Secondary.
Conditional Random Fields for ASR Jeremy Morris July 25, 2006.
Carnegie Mellon School of Computer Science 1 Protein Quaternary Fold Recognition Using Conditional Graphical Models Yan Liu IBM Research Jaime Carbonell.
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
Carnegie Mellon School of Computer Science 1 Conditional Graphical Models for Protein Structure Prediction Yan Liu Language Technologies Institute Carnegie.
Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields.
Motif Search and RNA Structure Prediction Lesson 9.
Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine 朱林娇 14S
Discriminative Training and Machine Learning Approaches Machine Learning Lab, Dept. of CSIE, NCKU Chih-Pin Liao.
Structural classification of Proteins SCOP Classification: consists of a database Family Evolutionarily related with a significant sequence identity Superfamily.
Proteins Structure Predictions Structural Bioinformatics.
Protein Structure Prediction: Threading and Rosetta BMI/CS 576 Colin Dewey Fall 2008.
Neural networks (2) Reminder Avoiding overfitting Deep neural network Brief summary of supervised learning methods.
Protein Structure Prediction. Protein Sequence Analysis Molecular properties (pH, mol. wt. isoelectric point, hydrophobicity) Secondary Structure Super-secondary.
Graphical Models for Segmenting and Labeling Sequence Data Manoj Kumar Chinnakotla NLP-AI Seminar.
Dan Roth University of Illinois, Urbana-Champaign 7 Sequential Models Tutorial on Machine Learning in Natural.
Predicting Structural Features Chapter 12. Structural Features Phosphorylation sites Transmembrane helices Protein flexibility.
Using the Fisher kernel method to detect remote protein homologies Tommi Jaakkola, Mark Diekhams, David Haussler ISMB’ 99 Talk by O, Jangmin (2001/01/16)
BIOINFORMATION A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation - - 王红刚 14S
Recovering Temporally Rewiring Networks: A Model-based Approach
חיזוי ואפיון אתרי קישור של חלבון לדנ"א מתוך הרצף
Support Vector Machine (SVM)
Estimating Networks With Jumps
Protein Structures.
Protein structure prediction.
Conditional Graphical Models for Protein Structure Prediction
Presentation transcript:

Carnegie Mellon School of Computer Science 1 Conditional Graphical Models for Protein Structure Prediction Yan Liu Language Technologies Institute Carnegie Mellon University Thesis Committee Jaime Carbonell (Chair) John Lafferty Eric P. Xing Vanathi Gopalakrishnan (Univ. of Pittsburgh) PhD Thesis Proposal

Carnegie Mellon School of Computer Science 2 Proteins in Our Life Nobelprize.org Predict protein structures from sequences

Carnegie Mellon School of Computer Science 3 Protein Structure Hierarchy Focus on predicting the topology of the structures, instead of the specific 3-D coordinates

Carnegie Mellon School of Computer Science 4 Machine Learning Aspect on Previous Work Ab Initio  Similar to generative model Fold recognition  Classification problem Homology modeling  Nearest neighbor

Carnegie Mellon School of Computer Science 5 Missing pieces Informative features without clear understanding No explicit models to consider the structured properties of proteins Discriminative models Graphical models Conditional graphical models for protein structure prediction

Carnegie Mellon School of Computer Science 6 Conditional Graphical Models Define segment semanticsstructural modules  W = (M, {Wi}), M: # of segments, Wi: configuration of segment i The conditional probability of a possible segmentation W given the observation x is defined as

Carnegie Mellon School of Computer Science 7 Conditional Graphical Models Flexible feature definition and kernelization Applicable to different loss functions and regularizers Convex functions for globally optimal solutions Capture the structured information directly, especially the long-range interactions

Carnegie Mellon School of Computer Science 8 Training and Testing Training phase : learn the model parameters  Minimizing regularized log loss  Seek the direction whose empirical values agrees with the expectation  Iterative searching algorithm have to be applied Testing phase: search the segmentation that maximize P(w|x)

Carnegie Mellon School of Computer Science 9 Proposed Work Conditional graphical models for protein structure prediction

Carnegie Mellon School of Computer Science 10 Model Roadmap Conditional random fields Kernel CRFs Segmentation CRFs Chain graph model Factorial segmentation CRFs Kernelized Segmented Locally normalized Factorized

Carnegie Mellon School of Computer Science 11 Outline Conditional graphical models for protein structure prediction

Carnegie Mellon School of Computer Science 12 Task Definition and Evaluation Materials Given a protein sequence, predict its secondary structure assignments  Three class: helix (C), sheets (E) and coil (C) Focus on globular proteins Experiment materials  Datasets RS126 and CB513  Evaluation measure Accuracy, Precision, Recall, Segment of overlap (SOV) APAFSVSPASGACGP ECA CCEEEEECCCCCHH HCCC Protein Secondary Structure Prediction

Carnegie Mellon School of Computer Science 13 Previous Work Huge literature over decades  Window-based methods  Hidden Markov models Major breakthroughs in recent years  Combine the predictions from neighboring residues or various methods (Cuff & Barton, 1999)  Explore evolutionary information from sequences (Jones, 1999)  Specific algorithm for beta-sheets or infer the paring of beta-strands (Meiler & Baker, 2003) Protein Secondary Structure Prediction

Carnegie Mellon School of Computer Science 14 Proposed work Structural component  Individual residues Protein Structural graph  A chain structure Segmentation definition  W = (n, Y), n: # of residues, Yi = H, E or C Specific models  Conditional random fields (CRFs)  Kernel conditional random fields (kCRFs) where Protein Secondary Structure Prediction

Carnegie Mellon School of Computer Science 15 Target Directions Protein Secondary Structure Prediction Input Features (PSI-BLAST profile) Prediction Combination (CRFs) Feature Exploration (KCRFs) Learning Algorithm (SVM) Beta-sheet Detection

Carnegie Mellon School of Computer Science 16 Prediction Combination Previous work  Window-based label combination Rule-based algorithm  Window-based score combination SVM, Neural networks Other graphical models for score combination  Maximum entropy Markov model (MEMM)  Higher-order MEMMs (HOMEMM)  Pseudo state duration MEMMs (PSMEMM) Protein Secondary Structure Prediction MEMM HOMEMM

Carnegie Mellon School of Computer Science 17 Experiment Results Protein Secondary Structure Prediction - Prediction Combination Graphical models are consistently better than the window-based approaches CRFs perform the best among the four graphical models

Carnegie Mellon School of Computer Science 18 Beta-sheet detection Given a protein sequence, predict where are the beta-strands and how they are aligned to form beta-sheets Protein Secondary Structure Prediction APAFSVSPASGACGP ECA NNEEEEENNNEEEEE NN

Carnegie Mellon School of Computer Science 19 Experiment Results Offset from the correct register for the correctly predicted beta-strand pairs (CB513) Protein Secondary Structure Prediction - Beta-sheet detection Considerable improvement in prediction performance Prediction accuracy for beta-sheets

Carnegie Mellon School of Computer Science 20 Experiment Results Prediction accuracy for KCRFs with vertex cliques (V) and edge cliques (E)  PSI-BLAST profile with RBF kernel Protein Secondary Structure Prediction - Feature exploration

Carnegie Mellon School of Computer Science 21 Summary Conditional graphical model with a chain structure for protein secondary structure prediction  Conditional random fields (CRFs)  Kernel conditional random fields (KCRFs) Protein Secondary Structure Prediction

Carnegie Mellon School of Computer Science 22 Outline Conditional graphical models for protein structure prediction

Carnegie Mellon School of Computer Science 23 Task Definition and Evaluation Materials..APAFSVSPASGACGPE CA.. Contains the structural motif?..NNEEEEECCCCCHHH CCC.. Structural motif recognition Structural motif  Regular arrangement of secondary structural elements  Super-secondary structure, or protein fold Experiment materials  Datasets PDB-minus dataset  Cross-family validation  Evaluation measure Rank of scores and segmentation results Yes

Carnegie Mellon School of Computer Science 24 Previous work General approaches for structural motif recognition  Sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997]  Profile HMM,.e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998]  Homology modeling or threading, e.g. Threader [Jones, 1998] Methods of careful design for specific structure motifs  Example: αα- and ββ- hairpins, β-turn and β-helix Our goal is to have a general probabilistic framework to address all these problems for structural motif prediction - Structural similarity without clear sequence similarity - Long-range interactions - Hard to generalize Structural motif recognition

Carnegie Mellon School of Computer Science 25 Proposed work (I) Structural component  Secondary structure elements Protein structural graph  Nodes for the states of secondary structural elements of unknown lengths  Edges for interactions between nodes in 3-D Example: β-α-β motif Tradeoff between fidelity of the model and graph complexity Structural Motif Recognition

Carnegie Mellon School of Computer Science 26 Proposed Work (II) Segmentation definition  W = (M, {Wi}), M: # of segments, Wi = {(Yi, Si)} Yi: State of segment I Si: the length of segment i Segmentation conditional random fields (SCRFs)  For any graph, we have  For a simplified graph, we have Structural Motif Recognition

Carnegie Mellon School of Computer Science 27 Protein Folds with Structural Repeats Prevalent in proteins and important in functions Each repeats consists of structural motifs and insertions Challenge  Low sequence similarity in structural motifs  Low structural conservation in insertions  Long-range interactions Structural Motif Recognition

Carnegie Mellon School of Computer Science 28 Proposed Work (III) Chain graph  A graph consisting of directed and undirected graphs  Given a variable set V that forms multiple subgraphs U Segmentation defintion  Two layer segmentation W = {M, {Ξ i }, T } M: # of envelops Ξ i : the segmentation of envelops Ti: the state of envelop i = repeat, non-repeat Chain graph model Structural Motif Recognition

Carnegie Mellon School of Computer Science 29 Experiments Right-handed β-helix fold  An elongated helix-like structures whose successive rungs composed of three parallel β - strands (B1, B2, B3 strands)  T2 turn: a unique two-residue turn  Low sequence identity Leucine-rich repeats (LLR)  Solenoid-like regular arrangement of beta-strands and an alpha-helix, connected by coils  Relatively high sequence identity (many Leucines) Structural Motif Recognition

Carnegie Mellon School of Computer Science 30 Experiment Results Cross-family validation for classifying β -helix proteins  SCRFs can score all known β -helices higher than non β -helices Structural Motif Recognition SCRFs model

Carnegie Mellon School of Computer Science 31 Experiment Results Predicted Segmentation for known Beta-helices Structural Motif Recognition SCRFs model

Carnegie Mellon School of Computer Science 32 Experiment Results Histograms of scores for known β- helices against PDB-minus dataset  18 non β-helix proteins have a score higher than 0  13 from β-class and 5 from α/β class  Most confusing proteins: β-sandwiches and left-handed β-helix 5 Structural Motif Recognition SCRFs model

Carnegie Mellon School of Computer Science 33 Experiment Results Verification on recently crystallized structures Successfully identify  1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase score  1PXZ: Jun A 1, The Major Allergen From Cedar Pollen score  GP14 of Shigella bacteriophage as a β-helix protein with scoring Structural Motif Recognition SCRFs model

Carnegie Mellon School of Computer Science 34 Experiment Results Cross-family validation for classifying β -helix proteins  Chain graph model can score all known β -helices higher than non β -helices Structural Motif Recognition Chain graph model

Carnegie Mellon School of Computer Science 35 Experiment Results Cross-family validation for classifying LLR proteins  Chain graph model can score all known LLR higher than non- LLR Structural Motif Recognition Chain graph model

Carnegie Mellon School of Computer Science 36 Experiment Results Predicted Segmentation for known Beta-helices and LLRs Structural Motif Recognition Chain graph model

Carnegie Mellon School of Computer Science 37 Summary Segmented conditional graphical models for structural motif recognition  Segmentation conditional random fields  Chain graph model Successful applications  Right-handed beta-helices  Leucine-rich repeats Structural Motif Recognition

Carnegie Mellon School of Computer Science 38 Outline Conditional graphical models for protein structure prediction

Carnegie Mellon School of Computer Science 39 Quaternary Structure Prediction Quaternary structures  Multiple chains associated together through noncovalent bonds or disulfide bonds Classes of quaternary structures  Based on the number of subunits Monomer, dimer, trimer, tetramer and etc  Based on the identity of the subunits Homo-oligomers* and hetero-oligomers Symmetrical structural arrangement of oligomeric proteins Very few related research work..APAFSVSPASGACGPECA.. Contains the quaternary structures?..NNEEEEECCCCCHHHCCC.. Yes

Carnegie Mellon School of Computer Science 40 Proposed Work Structural component  Secondary structure elements  Super-secondary structure elements Protein structural graph  Nodes for the states of secondary structural or super-secondary structural elements of unknown lengths secondary structure elements must involve long- range interactions  Edges for interactions between nodes in 3-D Quaternary structure prediction

Carnegie Mellon School of Computer Science 41 Proposed Work Segmentation definition  W = (M, {Wi}), M: # of segments, Wi = {(Yi, Si)} Yi: State of segment i Si: the length of segment i Factorial segmentation conditional random fields R: # of chains in quaternary structures Quaternary structure prediction

Carnegie Mellon School of Computer Science 42 Experiment Design Triple beta-spirals  Described by van Raaij et al. in Nature (1999)  Clear sequence repeats  Two proteins with crystallized structures and about 20 without structure annotation Tripe beta-helices  Described by van Raaij et al. in JMB (2001)  Structurally similar to beta-helix without clear sequence similarity  Two proteins with crystallized structures  Information from beta-helices can be used Both folds characterized by unusual stability to heat, protease, and detergent Quaternary structure prediction

Carnegie Mellon School of Computer Science 43 Experiment Design Virus proteins  Noncellular biological entity that can reproduce only within a host cell  Dynamical properties  Various kinds of viruses Adenovirus (common cold) Bacteriophage (which infects bacteria) DNA virus, RNA virus Structural motif and quaternary structure prediction Gp 41: core protein of HIV virus (1AIK)

Carnegie Mellon School of Computer Science 44 Summary Conditional graphical models for protein structure prediction

Carnegie Mellon School of Computer Science 45 Expected Contribution Computational contribution  Conditional graphical models for structured data prediction  “First effective models for the long-range interactions” Biological contribution  Improvement in protein structure prediction  Hypothesis on potential proteins with biological important folds  Aids for function analysis and drug design

Carnegie Mellon School of Computer Science 46 Timeline Jun, Aug, 2005  Data collection for triple beta-spirals and triple beta-helices  Preliminary investigation for virus protein folds Sep, Nov, 2005  Implementation and Testing the model on synthetic data and real data  Determining the specific virus proteins to work on Nov, Jan, 2006  Virus protein fold recognition  Analysis the properties for virus protein folds Feb, June, 2006:  Virus protein fold recognition  Investigation the possibility for protein function prediction or information extraction July, Aug, 2006  Writing up thesis

Carnegie Mellon School of Computer Science 47

Carnegie Mellon School of Computer Science 48 Features Node features  Regular expression template, HMM profiles  Secondary structure prediction scores  Segment length Inter-node features  β -strand Side-chain alignment scores  Preferences for parallel alignment scores  Distance between adjacent B23 segments Features are general and easy to extend Structural Motif Recognition

Carnegie Mellon School of Computer Science 49 Evaluation Measure Q3 (accuracy) Precision, Recall Segment Overlap quantity (SOV) Matthew’s Correlation coefficients P +P- T +Pu T -on

Carnegie Mellon School of Computer Science 50 Local Information PSI-blast profile  Position-specific scoring matrices (PSSM)  Linear transformation[Kim & Park, 2003] SVM classifier with RBF kernel Feature#1 (Si): Prediction score for each residue Ri