Published by Roderick Preston. Modified over 9 years ago.
1
Carnegie Mellon School of Computer Science
Conditional Graphical Models for Protein Structure Prediction
PhD Thesis Proposal
Yan Liu, Language Technologies Institute, Carnegie Mellon University
Thesis Committee: Jaime Carbonell (Chair), John Lafferty, Eric P. Xing, Vanathi Gopalakrishnan (Univ. of Pittsburgh)
2
Proteins in Our Life
- Predict protein structures from sequences
Nobelprize.org
3
Protein Structure Hierarchy
- Our focus: predicting the topology of the structures
4
Previous Work
- General approaches for structural motif recognition
  - Sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997]
  - Profile HMMs, e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998]
  - Homology modeling or threading, e.g. Threader [Jones, 1998]
  - Window-based methods, e.g. PSIPRED [Jones, 2001]
- Carefully designed methods for specific structural motifs
  - Examples: αα- and ββ-hairpins, β-turn and β-helix
- Difficulties
  - Structural similarity without clear sequence similarity
  - Long-range interactions
  - Hard to generalize
5
Missing Pieces
- Informative features without clear interpretation
- No principled models to formulate the structured properties of proteins
- Discriminative models
- Graphical models
- Conditional graphical models for protein structure prediction
6
Conditional Graphical Models (CGM)
- Structural prediction is reduced to a segmentation and labeling problem
- Define segment semantics: structural modules
- W = (M, {Wi}), where M is the number of segments and Wi is the configuration of segment i
- The model defines the conditional probability P(W | x) of a possible segmentation W given the observation x
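The equation from this slide is not preserved in the transcript. As a hedged reconstruction, conditional graphical models of this kind are usually written in the log-linear form below. The clique set C_G, the feature functions f_k, the weights λ_k and the normalizer Z(x) are assumed notation, not taken from the slide.

```latex
P(W \mid x) \;=\; \frac{1}{Z(x)}
  \exp\!\Big( \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k \, f_k(x, W_c) \Big),
\qquad
Z(x) \;=\; \sum_{W'} \exp\!\Big( \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k \, f_k(x, W'_c) \Big)
```

Here W_c denotes the segments of W that belong to clique c of the protein structural graph, and the sum in Z(x) ranges over all possible segmentations of x.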
7
Advantages of Conditional Graphical Models
- Flexible feature definition and kernelization
- Different loss functions and regularizers
- Convex functions for globally optimal solutions
- Structured information, especially the long-range interactions
8
Training and Testing
- Training phase: learn the model parameters
  - Minimize the regularized log loss
  - Seek the parameters for which the empirical feature values agree with their expectation under the model
  - Iterative search algorithms have to be applied
- Testing phase: search for the segmentation that maximizes P(W | x)
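The training and testing phases can be illustrated on a toy linear-chain model. This is a minimal sketch, not the thesis implementation: the feature set (emission and transition counts), the toy sizes, the L2 regularizer, plain gradient descent and brute-force enumeration of all labelings are assumptions chosen to keep the example short. The gradient of the regularized log loss is the expected feature vector minus the empirical one, which is the "empirical values agree with the expectation" condition.

```python
import itertools
import numpy as np

N_LABELS, N_OBS = 3, 3  # toy sizes (hypothetical)

def features(x, y):
    """Feature vector: emission counts (label, observation), then transition counts."""
    emit = np.zeros((N_LABELS, N_OBS))
    trans = np.zeros((N_LABELS, N_LABELS))
    for i, (xi, yi) in enumerate(zip(x, y)):
        emit[yi, xi] += 1
        if i > 0:
            trans[y[i - 1], yi] += 1
    return np.concatenate([emit.ravel(), trans.ravel()])

def all_labelings(n):
    """Every label sequence of length n (only feasible for tiny n)."""
    return list(itertools.product(range(N_LABELS), repeat=n))

def loss_and_grad(theta, x, y, l2):
    """Regularized log loss and its gradient for one training pair (x, y)."""
    cands = all_labelings(len(x))
    F = np.array([features(x, c) for c in cands])  # one row per labeling
    scores = F @ theta
    log_z = np.logaddexp.reduce(scores)            # log partition function
    probs = np.exp(scores - log_z)
    empirical = features(x, y)
    expected = probs @ F                           # model expectation of features
    loss = log_z - empirical @ theta + 0.5 * l2 * (theta @ theta)
    grad = expected - empirical + l2 * theta
    return loss, grad

def train(x, y, l2=0.01, lr=0.05, steps=300):
    """Iterative search: plain gradient descent on the convex loss."""
    theta = np.zeros(N_LABELS * N_OBS + N_LABELS * N_LABELS)
    losses = []
    for _ in range(steps):
        loss, grad = loss_and_grad(theta, x, y, l2)
        losses.append(loss)
        theta -= lr * grad
    return theta, losses

def decode(theta, x):
    """Testing phase: the labeling with the highest score, i.e. argmax of P(y | x)."""
    return max(all_labelings(len(x)), key=lambda c: features(x, c) @ theta)
```

In practice the enumeration would be replaced by dynamic programming (forward-backward for the expectations, Viterbi for decoding), and gradient descent by a quasi-Newton method.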
9
Thesis Work
- Conditional graphical models for protein structure prediction
10
Model Roadmap
- Models: conditional random fields, kernel CRFs, segmentation CRFs, chain graph model, factorial segmentation CRFs
- Diagram labels: kernelized, segmentation, locally normalized, factorized
11
Outline
- Conditional graphical models for protein structure prediction
12
Task Definition and Evaluation Materials
- Given a protein sequence, predict its secondary structure assignments
- Three classes: helix (H), sheet (E) and coil (C)
- Example:
    APAFSVSPASGACGPECA
    CCEEEEECCCCCHHHCCC
[Protein Secondary Structure Prediction]
13
CGM on Secondary Structure Prediction [Liu, Carbonell, Klein-Seetharaman and Gopalakrishnan, Bioinformatics 2004]
- Structural component: individual residues
- Segmentation definition: W = (n, Y), where n is the number of residues and Yi = H, E or C
- Specific models
  - Conditional random fields (CRFs)
  - Kernel conditional random fields (kCRFs)
[Protein Secondary Structure Prediction]
14
Target Directions
- Input features (PSI-BLAST profile)
- Prediction combination (CRFs)
- Feature exploration (KCRFs)
- Learning algorithm (SVM)
- Beta-sheet detection
[Protein Secondary Structure Prediction]
15
Prediction Combination
- Previous work
  - Window-based label combination: rule-based algorithm
  - Window-based score combination: SVM, neural networks
- Other graphical models for score combination
  - Maximum entropy Markov model (MEMM)
  - Higher-order MEMMs (HOMEMM)
  - Pseudo state duration MEMMs (PSMEMM)
- Diagrams: MEMM, HOMEMM
[Protein Secondary Structure Prediction]
16
Experiment Results
- Graphical models are consistently better than the window-based approaches
- CRFs perform the best among the four graphical models
[Protein Secondary Structure Prediction - Prediction Combination]
17
Experiment Results
- Prediction accuracy for beta-sheets
- Offset from the correct register for the correctly predicted beta-strand pairs (CB513)
- Considerable improvement in prediction performance
[Protein Secondary Structure Prediction - Beta-sheet detection]
18
Experiment Results
- Prediction accuracy for KCRFs with vertex cliques (V) and edge cliques (E)
- PSI-BLAST profile with RBF kernel
[Protein Secondary Structure Prediction - Feature exploration]
19
Summary
- Conditional graphical models for protein secondary structure prediction
  - Conditional random fields (CRFs)
  - Kernel conditional random fields (KCRFs)
[Protein Secondary Structure Prediction]
20
Outline
- Conditional graphical models for protein structure prediction
21
Task Definition and Evaluation Materials
- Input:
    ..APAFSVSPASGACGPECA..
    ..NNEEEEECCCCCHHHCCC..
- Question: contains the structural motif? Yes
- Structural motif
  - Regular arrangement of secondary structural elements
  - Super-secondary structure, or protein fold
[Structural Motif Recognition]
22
CGM for Structural Motif Recognition
- Structural component: secondary structure elements
- Protein structural graph
  - Nodes for the states of secondary structural elements of unknown lengths
  - Edges for interactions between nodes in 3-D
- Example: β-α-β motif
- Tradeoff between fidelity of the model and graph complexity
[Structural Motif Recognition]
23
CGM for Structural Motif Recognition
- Segmentation definition
  - Yi: the state of segment i
  - Si: the length of segment i
  - M: the number of segments
- Segmentation conditional random fields (SCRFs)
  - One form of P(W | x) for any graph
  - A simpler form of P(W | x) for a simplified graph
[Structural Motif Recognition]
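The two SCRF equations are not preserved in the transcript. The following is a hedged sketch of what they plausibly look like, taking each segment as W_i = (Y_i, S_i). The clique set C_G, edge set E_G, features f_k, weights λ_k and normalizer Z(x) are assumed notation, and the restriction of the simplified graph to pairwise cliques is itself an assumption.

```latex
% General graph: sum over all cliques of the protein structural graph
P(W \mid x) = \frac{1}{Z(x)}
  \exp\!\Big( \sum_{c \in C_G} \sum_{k} \lambda_k \, f_k(x, W_c) \Big)

% Simplified graph (assumed: cliques reduce to pairs of interacting segments)
P(W \mid x) = \frac{1}{Z(x)}
  \exp\!\Big( \sum_{(i,j) \in E_G} \sum_{k} \lambda_k \, f_k(x, W_i, W_j) \Big),
\qquad W_i = (Y_i, S_i), \quad i = 1, \dots, M
```

Under the pairwise form, each feature looks at two segments at once, which is how a long-range interaction between distant β-strands can enter the score.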
24
Protein Folds with Structural Repeats
- Prevalent in proteins and important in functions
- Each repeat consists of structural motifs and insertions
- Challenges
  - Low sequence similarity in structural motifs
  - Long-range interactions
[Structural Motif Recognition]
25
CGM for Structural Motif Recognition
- Chain graph
  - A graph consisting of directed and undirected graphs
  - Given a variable set V that forms multiple subgraphs U
- Segmentation definition
  - Two-layer segmentation W = {M, {Ξi}, T}
  - M: the number of envelopes
  - Ξi: the segmentation of envelope i
  - Ti: the state of envelope i (repeat or non-repeat)
- Chain graph model
[Structural Motif Recognition]
26
Experiments
- Right-handed β-helix fold
  - An elongated helix-like structure whose successive rungs are composed of three parallel β-strands (B1, B2, B3 strands)
  - T2 turn: a unique two-residue turn
  - Low sequence identity
- Leucine-rich repeats (LRR)
  - Solenoid-like regular arrangement of beta-strands and an alpha-helix, connected by coils
  - Relatively high sequence identity (many leucines)
[Structural Motif Recognition]
27
Experiment Results
- Cross-family validation for classifying β-helix proteins
- SCRFs can score all known β-helices higher than non-β-helices
[Structural Motif Recognition: SCRFs model]
28
Experiment Results
- Predicted segmentation for known β-helices
[Structural Motif Recognition: SCRFs model]
29
Experiment Results
- Histograms of scores for known β-helices against the PDB-minus dataset
- 18 non-β-helix proteins have a score higher than 0
  - 13 from the β class and 5 from the α/β class
- Most confusing proteins: β-sandwiches and left-handed β-helices
[Structural Motif Recognition: SCRFs model]
30
Experiment Results
- Verification on recently crystallized structures
- Successfully identified as β-helix proteins:
  - 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase, score 10.47
  - 1PXZ: Jun A 1, The Major Allergen From Cedar Pollen, score 32.35
  - GP14 of Shigella bacteriophage, score 15.63
[Structural Motif Recognition: SCRFs model]
31
Experiment Results
- Cross-family validation for classifying β-helix proteins
- The chain graph model can score all known β-helices higher than non-β-helices
[Structural Motif Recognition: Chain graph model]
32
Experiment Results
- Cross-family validation for classifying LRR proteins
- The chain graph model can score all known LRRs higher than non-LRRs
[Structural Motif Recognition: Chain graph model]
33
Experiment Results
- Predicted segmentation for known β-helices and LRRs
[Structural Motif Recognition: Chain graph model]
34
Further Experiments
- Virus proteins
  - Virus: a noncellular biological entity that can reproduce only within a host cell
  - Dynamical properties
- Various kinds of viruses
  - Adenovirus (common cold)
  - Bacteriophage (which infects bacteria)
  - DNA virus, RNA virus
- Pictured: Gp41, core protein of HIV (1AIK)
[Structural Motif Recognition]
35
Summary
- Segmented conditional graphical models for structural motif recognition
  - Segmentation conditional random fields
  - Chain graph model
- Successful applications
  - Right-handed beta-helices
  - Leucine-rich repeats
- Further verification
  - Virus spike folds and others
[Structural Motif Recognition]
36
Outline
- Conditional graphical models for protein structure prediction
37
Quaternary Structure Prediction
- Quaternary structures
  - Multiple chains associated together through noncovalent bonds or disulfide bonds
- Classes of quaternary structures
  - Based on the number of subunits
  - Based on the identity of the subunits: homo-oligomers* and hetero-oligomers
- Very little related research
- Input:
    ..APAFSVSPASGACGPECA..
    ..NNEEEEECCCCCHHHCCC..
- Question: contains the quaternary structures? Yes
38
Proposed Work
- Structural component
  - Secondary structure elements
  - Super-secondary structure elements
- Protein structural graph
  - Nodes for the states of secondary structural or super-secondary structural elements of unknown lengths
  - Edges for interactions between nodes in 3-D
- Protocol: nodes representing secondary structure elements must involve long-range interactions
[Quaternary structure prediction]
39
Proposed Work
- Segmentation definition: W = (M, {Wi}), where M is the number of segments and Wi is the configuration of segment i
- Factorial segmentation conditional random fields
  - R: the number of chains in quaternary structures
[Quaternary structure prediction]
40
Experiment Design
- Triple beta-spirals
  - Described by van Raaij et al. in Nature (1999)
  - Clear sequence repeats
  - Two proteins with crystallized structures and about 20 without structure annotation
- Triple beta-helices
  - Described by van Raaij et al. in JMB (2001)
  - Structurally similar to the beta-helix without clear sequence similarity
  - Two proteins with crystallized structures
  - Information from beta-helices can be used
- Both folds are characterized by unusual stability to heat, protease, and detergent
[Quaternary structure prediction]
41
Summary
- Conditional graphical models for protein structure prediction
42
Expected Contribution
- Computational contribution
  - Conditional graphical models for structured data prediction
  - “First effective models for the long-range interactions”
- Biological contribution
  - Improvement in protein structure prediction
  - Hypotheses on potential proteins with biologically important folds
  - Aids for function analysis and drug design
43
Timeline
- Jun, 2005 -- Aug, 2005
  - Data collection for triple beta-spirals and triple beta-helices
  - Preliminary investigation of virus protein folds
- Sep, 2005 -- Nov, 2005
  - Implementing and testing the model on synthetic data and real data
  - Determining the specific virus proteins to work on
- Nov, 2005 -- Jan, 2006
  - Virus protein fold recognition
  - Analyzing the properties of virus protein folds
- Feb, 2006 -- June, 2006
  - Virus protein fold recognition
  - Investigating the possibility of protein function prediction or information extraction
- July, 2006 -- Aug, 2006
  - Writing up the thesis
45
Features
- Node features
  - Regular expression template, HMM profiles
  - Secondary structure prediction scores
  - Segment length
- Inter-node features
  - β-strand side-chain alignment scores
  - Preferences for parallel alignment scores
  - Distance between adjacent B23 segments
- Features are general and easy to extend
[Structural Motif Recognition]
46
Evaluation Measure
- Q3 (accuracy)
- Precision, recall
- Segment overlap quantity (SOV)
- Matthews correlation coefficient, from the table of predicted (P) against true (T) labels:

         P+   P-
    T+   p    u
    T-   o    n
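Most of these measures are short enough to compute directly. The sketch below uses the standard definitions of Q3, precision, recall and the Matthews correlation coefficient. Reading the slide's letters as p (true positives), u (under-predictions, i.e. false negatives), o (over-predictions, i.e. false positives) and n (true negatives) is an assumption. The function names are illustrative, and SOV is omitted because its definition is longer.

```python
import math

def q3(true, pred):
    """Fraction of residues whose predicted class (H, E or C) matches the true class."""
    if len(true) != len(pred):
        raise ValueError("label strings must have equal length")
    return sum(t == q for t, q in zip(true, pred)) / len(true)

def confusion(true, pred, label):
    """Per-class counts: p (TP), u (FN, under-predicted), o (FP, over-predicted), n (TN)."""
    p = sum(t == label and q == label for t, q in zip(true, pred))
    u = sum(t == label and q != label for t, q in zip(true, pred))
    o = sum(t != label and q == label for t, q in zip(true, pred))
    n = sum(t != label and q != label for t, q in zip(true, pred))
    return p, u, o, n

def precision_recall(p, u, o):
    """Precision = p / (p + o), recall = p / (p + u); 0.0 when undefined."""
    precision = p / (p + o) if p + o else 0.0
    recall = p / (p + u) if p + u else 0.0
    return precision, recall

def mcc(p, u, o, n):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((p + o) * (p + u) * (n + o) * (n + u))
    return (p * n - o * u) / denom if denom else 0.0

true = "CCEEEEECCCCCHHHCCC"
pred = "CCCEEEECCCCCHHHCCC"  # one strand residue predicted as coil
counts_e = confusion(true, pred, "E")
```

For this pair, one of 18 residues is wrong, so Q3 is 17/18, and the strand class has p = 4, u = 1, o = 0, n = 13.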
47
Local Information
- PSI-BLAST profile
  - Position-specific scoring matrices (PSSM)
  - Linear transformation [Kim & Park, 2003]
- SVM classifier with RBF kernel
- Feature #1 (Si): prediction score for each residue Ri
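The profile preprocessing can be sketched as follows. The slide does not give the transformation, so the piecewise-linear scaling here (0 below -5, 1 above 5, and 0.5 + 0.1x in between) is an assumption about the form used by Kim & Park. The window half-width, the zero padding and both function names are hypothetical choices for illustration.

```python
import numpy as np

def scale_pssm(pssm):
    """Map raw PSSM scores to [0, 1] with an assumed piecewise-linear transform."""
    return np.clip(0.5 + 0.1 * np.asarray(pssm, dtype=float), 0.0, 1.0)

def window_features(scaled, half_width=7):
    """Concatenate the scaled profile rows in a window centered on each residue.

    Positions past either end of the sequence are zero-padded, so every residue
    gets a vector of length (2 * half_width + 1) * n_columns.
    """
    n_res, n_col = scaled.shape
    padded = np.zeros((n_res + 2 * half_width, n_col))
    padded[half_width:half_width + n_res] = scaled
    width = 2 * half_width + 1
    return np.stack([padded[i:i + width].ravel() for i in range(n_res)])

# Toy profile: 5 residues by 20 amino-acid columns of integer scores.
rng = np.random.default_rng(0)
toy_pssm = rng.integers(-8, 9, size=(5, 20))
toy_scaled = scale_pssm(toy_pssm)
toy_feats = window_features(toy_scaled, half_width=2)
```

Each row of `toy_feats` could then be passed to a classifier such as an SVM with an RBF kernel to produce the per-residue prediction score Si.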
48
Previous Work
- Huge literature over decades
  - Window-based methods
  - Hidden Markov models
- Major breakthroughs in recent years
  - Combine the predictions from neighboring residues or various methods (Cuff & Barton, 1999)
  - Explore evolutionary information from sequences (Jones, 1999)
  - Specific algorithms for beta-sheets, or inferring the pairing of beta-strands (Meiler & Baker, 2003)
[Protein Secondary Structure Prediction]