Conditional Graphical Models for Protein Structure Prediction

Presentation transcript:

PhD Thesis Proposal: Conditional Graphical Models for Protein Structure Prediction. Yan Liu, Language Technologies Institute, Carnegie Mellon University. Thesis Committee: Jaime Carbonell (Chair), John Lafferty, Eric P. Xing, Vanathi Gopalakrishnan (Univ. of Pittsburgh)

Biology on One Slide (Nobelprize.org). Goal: predict protein structures from sequences. Proteins are composed of amino acids and have a defined three-dimensional structure. This form dictates the function. Proteins are responsible for all the reactions and activities of the cell. The structure of the individual proteins is encoded in DNA in the cell nucleus. Cytoskeletal proteins control the shape and movement of the cell. Proteins are synthesized on ribosomes. Mitochondrial proteins are responsible for cell respiration and the synthesis of ATP that provides cellular energy. Enzymes in the cell catalyze chemical reactions. Storage vesicles contain, and release, hormones and neurotransmitters. They act on receptors and control ion channels. In this way cells can communicate with each other and order proteins in the cell to work in concert with the entire organism.

Protein Structure Hierarchy Our focus: predicting the topology of the structures from sequences APAFSVSPASGACGPECA

Previous work. General approaches: sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997]; profile HMMs, e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998]; homology modeling or threading, e.g. Threader [Jones, 1998]; window-based methods, e.g. PSIPRED [Jones, 2001]. Carefully designed methods for specific structures, e.g. αα- and ββ-hairpins, β-turn and β-helix [Efimov, 1991; Wilmot and Thornton, 1990; Bradley et al, 2001]. Fail to handle: structural similarity with low sequence similarity (under 25%); long-range interactions. Hard to generalize.

Missing pieces. No principled probabilistic models to formulate the structured properties of proteins. Informative features without clear mapping to structures: polar, hydrophobic, aromatic, etc. Graphical models. Discriminative models.

Thesis Statement Conditional graphical models are effective for protein structure prediction

Conditional Graphical Models (CGM) Discriminative model defined over undirected graphs Conditional random fields [Lafferty et al, 2001] Protein structural graph Nodes: the states of structural building blocks (of fixed or variable lengths) or observed sequences Edges: interactions between blocks (local or long-range) Long-range interactions Local interactions
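The protein structural graph described on this slide can be pictured as a small data structure. The following is only a sketch under assumptions: the class name `StructuralGraph`, its methods, and the rule that an edge is "long-range" when it joins non-adjacent blocks are illustrative choices, not definitions taken from the proposal.

```python
from dataclasses import dataclass, field


@dataclass
class StructuralGraph:
    """Undirected graph over structural building blocks (illustrative names)."""
    states: list = field(default_factory=list)  # node i holds the state label of block i
    edges: set = field(default_factory=set)     # undirected edges stored as (i, j) with i < j

    def add_node(self, state):
        """Append a building block and return its node index."""
        self.states.append(state)
        return len(self.states) - 1

    def add_edge(self, i, j):
        """Record an interaction between blocks i and j."""
        if i == j:
            raise ValueError("self-interactions are not modelled")
        if not (0 <= i < len(self.states) and 0 <= j < len(self.states)):
            raise IndexError("edge endpoint is not a node of the graph")
        self.edges.add((min(i, j), max(i, j)))

    def local_edges(self):
        """Edges between blocks that are adjacent along the sequence."""
        return sorted(e for e in self.edges if e[1] - e[0] == 1)

    def long_range_edges(self):
        """Edges between blocks that are not adjacent along the sequence."""
        return sorted(e for e in self.edges if e[1] - e[0] > 1)


def beta_alpha_beta():
    """Toy graph for a beta-alpha-beta motif: two strands joined by a helix."""
    g = StructuralGraph()
    b1 = g.add_node("beta-strand")
    a = g.add_node("alpha-helix")
    b2 = g.add_node("beta-strand")
    g.add_edge(b1, a)   # local: neighbours in sequence
    g.add_edge(a, b2)   # local: neighbours in sequence
    g.add_edge(b1, b2)  # long-range: the two strands pair in 3-D
    return g
```

In this toy graph the strand-to-strand edge is the only long-range interaction; it is the kind of dependency that a purely chain-structured model cannot represent directly.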

Conditional Graphical Models (CGM). Definition of segment semantics (structural building blocks): W = (M, {Wi}), where M is the number of segments (predetermined or to be inferred) and Wi is a set of labels uniquely determining the ith segment. The conditional probability of a possible segmentation W given the observed sequence x is defined over this graph (the defining equation is not reproduced in this transcript). Structural topology prediction is reduced to a segmentation and labeling problem.
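The defining equation for P(W | x) is missing from the transcript at this point. A form consistent with the definitions on this slide, assuming the usual CRF-style log-linear parameterization over the cliques of the structural graph, would be the following. The symbols C_G (the clique set of graph G), f_k (feature functions), λ_k (weights), K (number of features) and Z(x) (normalizer) are assumptions and do not appear on the slide.

```latex
P(W \mid x) \;=\; \frac{1}{Z(x)} \prod_{c \in C_G} \exp\!\Big( \sum_{k=1}^{K} \lambda_k \, f_k(x, W_c) \Big),
\qquad
Z(x) \;=\; \sum_{W'} \prod_{c \in C_G} \exp\!\Big( \sum_{k=1}^{K} \lambda_k \, f_k(x, W'_c) \Big)
```

Here W_c denotes the segments of W that belong to clique c, so a clique spanning two distant segments is how a long-range interaction would enter the model.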

Advantages of Conditional Graphical Models Flexible feature definition and kernelization Different loss functions and regularizers Convex functions for globally optimal solutions Structured information, especially the long-range interactions

Training and Testing. Training phase: learn the model parameters by minimizing a regularized loss function; seek the direction in which the empirical feature values agree with their expectations under the model; iterative search algorithms have to be applied. Testing phase: search for the segmentation that maximizes P(W|x).
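For the testing phase, the search for the best labeling has a well-known exact solution when the graph is a simple chain with one label per residue: Viterbi decoding. The sketch below assumes that special case. The function name `viterbi` and the score layout (`node_scores[t][y]` and `trans_scores[a][b]`, both log-potentials) are illustrative and not taken from the proposal, which also covers more general graphs.

```python
def viterbi(node_scores, trans_scores):
    """Return the highest-scoring label path for a linear-chain model.

    node_scores[t][y]  : log-potential of label y at position t
    trans_scores[a][b] : log-potential of moving from label a to label b
    """
    n = len(node_scores)
    k = len(node_scores[0])
    best = [list(node_scores[0])]  # best[t][y]: best score of a path ending in y at t
    back = []                      # back[t-1][y]: predecessor of y on that path
    for t in range(1, n):
        row, ptr = [], []
        for y in range(k):
            prev = max(range(k), key=lambda a: best[-1][a] + trans_scores[a][y])
            ptr.append(prev)
            row.append(best[-1][prev] + trans_scores[prev][y] + node_scores[t][y])
        best.append(row)
        back.append(ptr)
    # Trace back from the best final label.
    y = max(range(k), key=lambda a: best[-1][a])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    path.reverse()
    return path
```

Maximizing the unnormalized score is enough here because the normalizer Z(x) is the same for every candidate labeling of a given sequence.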

Thesis Work Conditional graphical models for protein structure prediction

Model Roadmap (diagram): conditional random fields; kernelized: kernel CRFs; segmentation: segmentation CRFs; locally normalized: chain graph model; factorized: factorial segmentation CRFs; all within generalized conditional graphical models.

Outline

Protein Secondary Structure Prediction: Task Definition and Evaluation Materials. Given a protein sequence, predict its secondary structure assignments. Three classes: helix (H), sheet (E) and coil (C). Example: sequence APAFSVSPASGACGPECA, labels CCEEEEECCCCCHHHCCC

Protein Secondary Structure Prediction: CGM on Secondary Structure Prediction [Liu et al, Bioinformatics 2004; Liu et al, BLC 2003; Lafferty et al, 2004]. Structural building blocks: individual residues. Segmentation definition: W = (n, Y), n: # of residues, Yi = H, E or C. Specific models: conditional random fields (CRFs) [Lafferty et al, 2001] and kernel conditional random fields (kCRFs) (the model equations are not reproduced in this transcript).

Protein Secondary Structure Prediction: Lessons from Previous Work. How to combine the predictions effectively? How to accurately identify beta-sheets? How to explore evolutionary information from sequences?

Protein Secondary Structure Prediction: Experiment Results on Prediction Combination. Previous work: window-based label combination [Rost and Sander, 1993]; window-based score combination [Jones, 1999]. Other graphical models for score combination: maximum entropy Markov model (MEMM) [McCallum et al, 2000]; higher-order MEMMs (H-MEMM); pseudo state duration MEMMs (PSMEMM). Graphical models are generally better than the window-based approaches. CRFs perform the best among the four graphical models.

Protein Secondary Structure Prediction: Experiment Results on Beta-sheet Detection. Offset from the correct register for the correctly predicted beta-strand pairs (CB513 dataset). Prediction accuracy for beta-sheets. Considerable improvement in prediction performance.

Protein Secondary Structure Prediction: Experiment Results on Feature Exploration. Prediction accuracy for KCRFs with vertex cliques (V) and edge cliques (E). PSI-BLAST profile with RBF kernel.

Protein Secondary Structure Prediction Summary Conditional graphical model for protein secondary structure prediction Conditional random fields (CRFs) Kernel conditional random fields (KCRFs)

Outline Conditional graphical models for protein structure prediction

Structural Motif Recognition: Task Definition and Evaluation Materials. Structural motif: regular arrangement of secondary structural elements; super-secondary structure, or protein fold. Example: ..APAFSVSPASGACGPECA.. Contains the structural motif? ..NNEEEEECCCCCHHHCCC.. Yes

CGM for Structural Motif Recognition Structural building blocks Secondary structure elements Protein structural graph Nodes for the states of secondary structural elements of unknown lengths Edges for interactions between nodes in 3-D Example: β-α-β motif

Structural Motif Recognition: CGM for Structural Motif Recognition [Liu, Carbonell, Weigele and Gopalakrishnan, RECOMB 2005]. Segmentation definition: Yi: state of segment i; Si: the length of segment i; M: number of segments. Segmentation conditional random fields (SCRFs): one form of the model holds for any graph, and a simpler form holds for a simplified graph (chains or trees) (the equations are not reproduced in this transcript).
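For the simplified chain case, finding the best segmentation (states Yi with lengths Si) can be done by a segment-level dynamic program, in the style of semi-Markov Viterbi decoding. The sketch below is one such program under stated assumptions: the function name `segment_viterbi`, the `seg_score(prev_state, state, start, end)` callback and the `max_len` cap are hypothetical, and long-range edges, which the SCRF model is designed to handle, are deliberately left out.

```python
def segment_viterbi(n, states, seg_score, max_len):
    """Best labelled segmentation of positions 0..n-1 on a chain.

    seg_score(prev_state, state, start, end) scores the segment covering
    positions start..end-1 with label `state`; prev_state is None for the
    first segment. Returns (total_score, [(state, start, end), ...]).
    """
    neg_inf = float("-inf")
    # best[e][s]: best score for positions 0..e-1 whose last segment has state s
    best = [{s: neg_inf for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for end in range(1, n + 1):
        for s in states:
            for length in range(1, min(max_len, end) + 1):
                start = end - length
                if start == 0:
                    candidates = [(seg_score(None, s, 0, end), None)]
                else:
                    candidates = [
                        (best[start][p] + seg_score(p, s, start, end), p)
                        for p in states
                        if best[start][p] > neg_inf
                    ]
                for score, p in candidates:
                    if score > best[end][s]:
                        best[end][s] = score
                        back[end][s] = (start, p)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[n][s])
    total = best[n][state]
    segments = []
    end = n
    while end > 0:
        start, prev = back[end][state]
        segments.append((state, start, end))
        end, state = start, prev
    segments.reverse()
    return total, segments
```

The number of segments M is not fixed in advance here; it falls out of the traceback. The running time grows with n, `max_len` and the square of the number of states.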

Structural Motif Recognition: Protein Folds with Structural Repeats. Prevalent in proteins and important in functions. Each rung consists of structural motifs and insertions. Challenges: low sequence similarity in structural motifs; long-range interactions.

Structural Motif Recognition: CGM for Structural Motif Recognition [Liu, Xing and Carbonell, Submitted 2005]. Chain graph: a graph consisting of directed and undirected graphs, given a variable set V that forms multiple subgraphs U. Segmentation definition: envelope: one rung with motifs and insertions. Two-layer segmentation W = {M, {Ξi}, T}; M: # of envelopes; Ξi: the segmentation of envelope i; Ti: the state of envelope i (repeat or non-repeat). Chain graph model; motif profile model; SCRFs.

Structural Motif Recognition: Experiments. Right-handed β-helix fold: an elongated helix-like structure whose successive rungs are composed of three parallel β-strands (B1, B2, B3 strands); T2 turn: a unique two-residue turn; low sequence identity (under 25%). Leucine-rich repeats (LRR): solenoid-like regular arrangement of beta-strands and an alpha-helix, connected by coils; relatively high sequence identity (many leucines).

Structural Motif Recognition: Experiment Results on SCRF for β-helix. Cross-family validation for classifying β-helix proteins. SCRFs can score all known β-helices higher than non-β-helices.

Structural Motif Recognition: Experiment Results on SCRF for β-helix. Predicted segmentation for known β-helices on cross-family validation.

Structural Motif Recognition: Experiment Results on SCRF for β-helix. Histograms of scores for known β-helices against the PDB-minus dataset on cross-family validation (axes: log-ratio score, # of sequences).

Structural Motif Recognition: Verification on Proteins with Recently Crystallized Structures. Successfully identifies proteins from different organisms. 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase, score 10.47. 1PXZ: Jun A 1, The Major Allergen From Cedar Pollen, score 32.35. GP14 of Shigella bacteriophage as a β-helix protein, score 15.63.

Structural Motif Recognition: Experiment Results on Chain Graph Model for β-helix. Cross-family validation for classifying β-helix proteins. The chain graph model can score all known β-helices higher than non-β-helices. The chain graph model reduces the real running time by a factor of about 50.

Structural Motif Recognition: Experiment Results on Chain Graph Model for LRR. Cross-family validation for classifying LRR proteins. The chain graph model can score all known LRR proteins higher than non-LRR proteins.

Structural Motif Recognition: Experiment Results on Chain Graph Model. Predicted segmentation for known β-helices and LRRs: (A) SCRFs; (B) chain graph model.

Structural Motif Recognition: Further Experiments. Virus protein folds. Virus: noncellular biological entity that can reproduce only within a host cell. Computational challenge: dynamical properties. Successes in virus-related proteins: adenovirus (a cause of the common cold); bacteriophage (which infects bacteria); DNA virus, RNA virus. Figure caption: These pictures show the structure of the Human Immunodeficiency Virus (HIV). The outer shell of the virus is known as the viral envelope. Embedded in the viral envelope is a complex protein known as env, which consists of an outer protruding cap glycoprotein (gp) 120, and a stem gp41. Within the viral envelope is an HIV protein called p17 (matrix), and within this is the viral core or capsid, which is made of another viral protein p24 (core antigen). The major elements contained within the viral core are two single strands of HIV RNA, a protein p7 (nucleocapsid), and three enzyme proteins, p51 (reverse transcriptase), p11 (protease) and p32 (integrase). Gp 41: core protein of HIV virus (1AIK)

Structural Motif Recognition Summary Segmented conditional graphical models for structural motif recognition Segmentation conditional random fields Chain graph model Successes in applications with limited positive examples Right-handed beta-helices Leucine-rich repeats Further verification Virus spike folds and others

Outline Conditional graphical models for protein structure prediction

Quaternary Structure Prediction Quaternary structures Multiple chains associated together through noncovalent bonds or disulfide bonds Classes of quaternary structures Based on the number of subunits Usually 2 - 10 Based on the identity of the subunits Homo-oligomers* and hetero-oligomers Very limited research work to date ..APAFSVSPASGACGPECA.. Contains the quaternary structures? ..NNEEEEECCCCCHHHCCC.. Yes

Quaternary structure prediction: Proposed Work [Van Raaij et al. in Nature (1999)]. Structural building blocks: secondary structure elements; super-secondary structure elements. Protein structural graph: nodes for the states of secondary structural or super-secondary structural elements of variable lengths; edges for interactions between nodes in 3-D (inter- and intra-chain). Protocol: nodes representing secondary structure elements must involve long-range interactions.

Quaternary structure prediction: Proposed Work. Segmentation definition: W = (M, {Wi}); M: # of segments; Wi: a set of labels determining the ith segment. Factorial segmentation conditional random fields: R: # of chains in quaternary structures; Wi(1), Wi(2) … Wi(R) have different interaction maps.

Quaternary structure prediction: Experiment Design. Triple beta-spirals: described by van Raaij et al. in Nature (1999); clear sequence repeats; two proteins with crystallized structures and about 20 without structure annotation. Triple beta-helices: described by van Raaij et al. in JMB (2001); structurally similar to beta-helix without clear sequence similarity; two proteins with crystallized structures; information from beta-helices can be used. Both folds are characterized by unusual stability to heat, protease, and detergent.

Summary of Thesis Work: conditional graphical models for protein structure prediction

Expected Contribution. Computational contribution: conditional graphical models for structured data prediction; “First effective models for the long-range interactions”. Biological contribution: improvement in protein structure prediction; hypotheses on potential proteins with biologically important folds; aids for function analysis and drug design.

Timeline. Jun, 2005 -- Aug, 2005: data collection for triple beta-spirals and triple beta-helices; preliminary investigation for virus protein folds. Sep, 2005 -- Nov, 2005: implementing and testing the model on synthetic data and real data; determining the specific virus proteins to work on. Nov, 2005 -- Jan, 2006: virus protein fold recognition; analysis of the properties of virus protein folds. Feb, 2006 -- June, 2006: investigating the possibility of protein function prediction or information extraction. July, 2006 -- Aug, 2006: writing up the thesis.

Structural Motif Recognition: Features. Node features: regular expression template, HMM profiles; secondary structure prediction scores; segment length. Inter-node features: β-strand side-chain alignment scores; preferences for parallel alignment scores; distance between adjacent B23 segments. Features are general and easy to extend.

Evaluation Measure. Q3 (accuracy); precision, recall; segment overlap quantity (SOV); Matthews correlation coefficient.
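Three of these measures are simple per-residue counts and can be sketched directly. The helper names below (`q3`, `counts`, `precision_recall`, `matthews`) are illustrative; labels are assumed to be strings over H, E and C of equal length. SOV is omitted because it needs segment-level bookkeeping beyond a short example.

```python
import math


def q3(true, pred):
    """Fraction of residues whose three-state label is predicted correctly."""
    if len(true) != len(pred):
        raise ValueError("label strings must have equal length")
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def counts(true, pred, label):
    """One-vs-rest confusion counts (tp, fp, fn, tn) for a single class."""
    tp = fp = fn = tn = 0
    for t, p in zip(true, pred):
        if p == label and t == label:
            tp += 1
        elif p == label:
            fp += 1
        elif t == label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(true, pred, label):
    """Per-class precision and recall; 0.0 when a denominator is empty."""
    tp, fp, fn, _ = counts(true, pred, label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def matthews(true, pred, label):
    """Per-class Matthews correlation coefficient; 0.0 when undefined."""
    tp, fp, fn, tn = counts(true, pred, label)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Q3 treats all three classes together, while precision, recall and the Matthews coefficient are computed one class at a time, which is why they are the more informative measures for a rare class such as beta-sheet residues.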

Local Information. PSI-BLAST profile: position-specific scoring matrices (PSSM); linear transformation [Kim & Park, 2003]; SVM classifier with RBF kernel. Feature #1 (Si): prediction score for each residue Ri
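The step from a PSSM to classifier inputs can be sketched as a scaling followed by a sliding window. This is an illustration under assumptions: the breakpoints at -5 and +5 with slope 0.1 are one piecewise-linear scaling of the kind the slide cites, the window half-width and the zero-padding at the termini are arbitrary choices, and the function names `scale_pssm` and `window_features` are hypothetical. The SVM itself is not shown.

```python
def scale_pssm(value):
    """Piecewise-linear map of a raw PSSM score into [0, 1] (assumed constants)."""
    if value <= -5:
        return 0.0
    if value >= 5:
        return 1.0
    return 0.5 + 0.1 * value


def window_features(pssm, center, half_width):
    """Concatenate scaled PSSM rows in a window centred on one residue.

    pssm is a list of rows, one per residue, each a list of raw scores.
    Positions that fall outside the sequence are padded with zeros.
    """
    width = len(pssm[0])
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(pssm):
            features.extend(scale_pssm(v) for v in pssm[pos])
        else:
            features.extend([0.0] * width)
    return features
```

Each residue then yields a fixed-length vector of (2 * half_width + 1) * width values, which is the shape an RBF-kernel classifier expects regardless of where in the sequence the residue sits.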

Protein Secondary Structure Prediction: Previous Work. Huge literature over decades: window-based methods; hidden Markov models. Major breakthroughs in recent years: combining the predictions from neighboring residues or various methods (Cuff & Barton, 1999); exploring evolutionary information from sequences (Jones, 1999); specific algorithms for beta-sheets or for inferring the pairing of beta-strands (Meiler & Baker, 2003).