Conditional Graphical Models for Protein Structure Prediction


1 Conditional Graphical Models for Protein Structure Prediction
Yan Liu, Language Technologies Institute, School of Computer Science, Carnegie Mellon University. Oct 24, 2006. Welcome to my thesis defense on "Conditional Graphical Models for Protein Structure Prediction."

2 Snapshot of Cell Biology
Nobelprize.org. Protein sequence (DSCTFTTAAAAKAGKAKAG) → protein structure → protein function. On earth there are many kinds of living things, such as plants, animals, and human beings, and all of them are made up of cells. The nucleus of the cell stores the DNA, which is transcribed and translated into proteins. Proteins, as chains of amino acids, adopt unique three-dimensional structures in their native environment, and these structures ultimately determine their functions: proteins make up a large portion of living organisms and perform most of their important functions. Proteins are composed of amino acids and have a defined three-dimensional structure; this form dictates the function. Proteins are responsible for all the reactions and activities of the cell. The structure of each protein is encoded in the DNA in the cell nucleus. Cytoskeletal proteins control the shape and movement of the cell. Proteins are synthesized on ribosomes. Mitochondrial proteins are responsible for cell respiration and the synthesis of ATP, which provides cellular energy. Enzymes in the cell catalyze chemical reactions. Storage vesicles contain, and release, hormones and neurotransmitters, which act on receptors and control ion channels. In this way cells can communicate with each other and order proteins in the cell to work in concert with the entire organism.

3 Protein Structures and Functions
Example: triple beta-spiral fold. To emphasize the importance of protein structures to functions, here is an interesting example: the triple beta-spiral fold. The graph on the left shows the 3-D structure of the adenovirus fiber, which consists of a rigid shaft (a protein with the triple beta-spiral fold) and a knob; they are part of the virus capsid. The antenna-like fiber serves as the hands of the virus, attaching it to the cell surface, where the viral DNA can be injected. We will come back to this fold later in the discussion, but at this stage we can already see the importance of protein structures to functions, which motivates us to determine protein structures. The structures provide important information about function, motivating extensive work on identifying protein structures. The triple beta-spiral fold exists in different kinds of virus proteins without any sequence similarity; identifying more examples of this fold would demonstrate its common existence across virus proteins and furthermore indicate that they come from a common ancestor. Enlarged view of an adenovirus particle: the viral capsid is an icosahedron with 12 antenna-like fiber projections that function to attach the virus to the cell surface during infection; the viral DNA is packaged inside the particle. Adenovirus fibre shaft, virus capsid. Courtesy of Nobelprize.org.

4 Protein Structure Determination
Lab experiments: time- and labor-consuming. X-ray crystallography (Nobel Prize, Kendrew & Perutz, 1962); NMR spectroscopy (Nobel Prize, Kurt Wüthrich, 2002). The gap between sequence and structure necessitates computational methods of protein structure determination: 3,023,461 sequences vs. 36,247 resolved structures (1.2%). 1MBN, 1BUS. There are several techniques that can determine protein structures, such as X-ray crystallography, which emits electromagnetic radiation onto crystallized proteins, and NMR spectroscopy, which uses a magnetic field, similar to fMRI in clinical studies. These lab experiments are both time-consuming and labor-expensive, and many proteins cannot be resolved at all with current techniques. Since it is relatively easy to determine protein sequences, the gap between sequence and structure necessitates computational methods of protein structure determination, which is the main task of this thesis. Homology modeling: required accuracy 2 Å; MODELLER (Sali and Blundell), probabilistic constraints; SWISS-MODEL. Ab initio: Harold Scheraga; HP-models (H = hydrophobic, P = polar). Rosetta: I-Sites are sequence-structure motifs mined from the PDB; the current I-Site database has 261 of these motifs; HMMSTR is a set of hidden Markov models for predicting sequences of I-sites, with states including amino-acid types, secondary structure type, discretized φ/ψ angles, and structural "context". PROTEINASE INHIBITOR IIA FROM BULL SEMINAL PLASMA. X-ray: electromagnetic radiation emitted at crystallized structures. Nuclear magnetic resonance: measures the magnetic properties of each nucleus in a magnetic field; 10 minutes, or as long as a week or two, depending on the magnetic strength of the machine. Cryo-EM: for large molecules.

5 Protein Structure Hierarchy
We focus on predicting the topology of the structures from sequences. APAFSVSPASGACGPECA. Before digging into the details of the prediction algorithms, we start by introducing the current understanding of protein structures. Protein structure is defined at four conceptual levels in a hierarchy. The primary structure refers to the linear polymer of amino acids; there are 20 types of standard amino acids in nature, represented by English letters. The secondary structure of a protein can be thought of as the local conformation of the polypeptide chain, or intuitively as the building blocks of its three-dimensional structure. There are two dominant types of secondary structure, the alpha-helix and the (parallel or anti-parallel) beta-sheet; these exhibit a high degree of regularity and are connected by the remaining irregular regions, called loops. The tertiary structure of a protein is its global three-dimensional structure, usually represented as a set of 3-D coordinates for each atom. An important property is that protein sequences have been selected by the evolutionary process to achieve a unique, reproducible, and stable structure. Sometimes several protein chains (either identical or non-identical) unite and form chemical bonds between each other to reach a structurally stable unit. As we can see, there are many interesting and challenging problems concerning protein structures. In this thesis we focus on the following: given the protein sequence information only, our goal is to predict what the secondary structure elements are, how they arrange themselves in three-dimensional space, and how multiple chains associate into complexes. All experimentally solved 3-D structure data are deposited in a worldwide repository, the Protein Data Bank. Structural biologists have systematically annotated the structures and evolutionary relationships: SCOP, an acronym for Structural Classification of Proteins, is a database that manually annotates the structural and evolutionary relationships of proteins; a family groups proteins with clear evolutionary relationships.

6 Major Challenges Protein structures are non-linear
Long-range dependencies. Structural similarity often does not indicate sequence similarity; sequence alignment reaches the twilight zone (under 25% similarity). β-α-β motif. There are two major challenges in computationally predicting protein structures: one is the long-range interaction problem; the other is structure conservation without sequence conservation, that is, structural similarity often does not indicate sequence similarity. To attack these challenges, we use structural motif recognition as a setting. "The need to identify distant sequential similarities in order to gain structural insight can be a major challenge." RMSD: 1.9 Å, sequence identity: 16%; a difficult problem. Ubiquitin (blue), Ubx-Faf1 (gold).

7 Previous Work Sequence similarity perspective
Sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997]; profile HMMs, e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998]; window-based methods, e.g. PSIPRED [Jones, 2001]. Physical forces perspective: homology modeling or threading, e.g. Threader [Jones, 1998]. Structural biology perspective: methods carefully designed for specific structures, e.g. αα- and ββ-hairpins, β-turns and the β-helix [Efimov, 1991; Wilmot and Thornton, 1990; Bradley et al, 2001]. Previous work on protein structure prediction can be summarized as three approaches. The sequence-similarity methods fail to capture the structural properties of protein folds that lack sequence similarity. The generative models based on physical free energy rely strongly on the validity of the assumptions made in the free-energy definition. The structure-specific methods have no principled probabilistic model to formulate the structural properties of proteins; they use informative features without a clear mapping to structures (e.g. polar, hydrophobic, aromatic) and are hard to generalize because of the variety of informative features. Motivated by previous work in protein structure prediction and by conditional random fields, we propose the generalized conditional graphical model.

8 Structured Prediction
Many prediction tasks involve outputs with correlations or constraints. Structure: sequence, tree, grid. Input: John ate the cat. / SEQUENCEXS…WGIKQLQAR. Output: HHHCCCEEE…EECCCCEEE. From a machine learning perspective, protein structure prediction belongs to a general task known as structured prediction: given structured input (a sequence, or a graph-structured object such as an array of pixels), predict a classification label for each node. This is of fundamental importance in many areas, e.g. speech, natural language processing, text analysis, web search, and biosequence analysis, with potential for significant theoretical and practical advances. We approach structured prediction as learning a mapping from input x to a structured output y, that is, an output that falls in a joint space with constraints or correlations. For example, in protein secondary structure prediction we have sequential structure: we seek a mapping from strings of amino acids to reasonable strings of secondary structure assignments. In parsing we have recursive structure: we look for a mapping from sentences to parse trees. In image segmentation we have spatial structure: the input is a mesh of points, and the output is a spatially coherent segmentation of those points into a set of object classes.

9 Graphical Models A graphical model is a graph representation of probability dependencies [Pearl 1993; Jordan 1999]: nodes are random variables, and edges are dependency relations. Based on the directionality of the edges, we have directed graphical models (Bayesian networks) and undirected graphical models (Markov random fields). Graphical models are a natural choice for structured prediction because of their convenient representation of probability dependencies via graphs: the edges represent the constraints or correlations qualitatively, and the potentials (i.e. the features) represent them quantitatively. One of the simplest forms of structure is a chain, and a series of graphical models have been developed to model sequential data; among these, CRFs have both theoretical advantages and strong empirical successes. Compared with HMMs, which make Markov and output-independence assumptions, CRFs make no assumptions about the data generation; compared with MEMMs, CRFs use global normalization to find a globally consistent solution. The chain-structured potentials are detailed on the next slide.

10 Conditional Random Fields
Hidden Markov model (HMM) [Rabiner, 1989]. Conditional random fields (CRFs) [Lafferty et al, 2001]: model the conditional probability directly; allow arbitrary dependencies in the observation; adapt to different loss functions and regularizers; promising results in multiple applications. One of the simplest forms of structure is a chain, and a series of graphical models have been developed to model sequential data; among these, CRFs have demonstrated both theoretical advantages and strong empirical successes. Unlike the HMM, which defines the joint probability of the labels and observations as a product of emission and transition probabilities, a CRF takes a discriminative training approach and defines the conditional probability of the labels given the observation as an exponential function of the features f, with weights λ and normalizer Z. The dependencies are captured via the features, and one common way to define a feature is to factorize it over nodes and edges. Because a CRF models the conditional probability directly, without assumptions about the data, it enjoys a series of nice properties. In the chain-structured Markov network (the linear-chain CRF), the conditional probability of y given x is a product of node and edge potentials. The node potentials roughly correspond to emission probabilities in an HMM, and the edge potentials to transition probabilities; however, these potentials need not sum to 1 as in HMMs, they simply need to be positive functions. A natural way to represent a node potential is a log-linear combination of basis functions; each basis function can be an indicator asking a question like "is pixel p on and is the letter 'z'?" (note that the weights can be negative). The edge potentials use indicators such as "is the current letter 'z' and the next one 'a'?". Products of such potentials are again log-linear combinations of basis functions, so we can stack all the parameters and basis functions into compact vectors; the basis functions then become counts, like how many times 'z' is followed by 'a', or how many times pixel p is on when the letter is 'z'.
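To make the conditional form concrete, here is a minimal sketch of a linear-chain CRF log-probability, assuming the weighted node and edge features have already been collected into score matrices (the array names and shapes are illustrative assumptions, not the thesis code):

```python
import numpy as np

def crf_log_prob(node_scores, trans_scores, labels):
    """Log P(y | x) for a linear-chain CRF.

    node_scores: (T, K) array; node_scores[t, k] is the weighted sum of
        node features lambda . f(y_t = k, x, t) at position t.
    trans_scores: (K, K) array of weighted edge features for y_{t-1} -> y_t.
    labels: length-T list of gold label indices.
    """
    T, K = node_scores.shape
    # Unnormalized log score of the given label sequence.
    score = node_scores[0, labels[0]]
    for t in range(1, T):
        score += trans_scores[labels[t - 1], labels[t]] + node_scores[t, labels[t]]
    # log Z(x) via the forward algorithm, kept in log space for stability.
    alpha = node_scores[0].copy()
    for t in range(1, T):
        m = alpha.max()
        alpha = node_scores[t] + m + np.log(np.exp(alpha - m) @ np.exp(trans_scores))
    return score - (alpha.max() + np.log(np.exp(alpha - alpha.max()).sum()))
```

The potentials only need to be positive (their logs are the scores above), which is exactly the extra freedom over HMM emission and transition probabilities described on this slide.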

11 Protein Structure Prediction
Dependency between residues (single observations). Dependency between components (subsequences of observations). However, it is not appropriate to use the original CRF model for our task, given the special properties of protein structures. Therefore, in this thesis we develop a graphical model framework for protein structure prediction that retains all the advantages of the CRF model and can also capture the structural dependencies.

12 Outline Brief introduction to protein structures
Graphical models for structured prediction Conditional graphical models for protein structure prediction General framework Specific models Experiment results Conclusion and discussion

13 Our Solution: Conditional Graphical Models
Local dependency; long-range dependency. Outputs Y = {M, {Wi}}, where Wi = {pi, qi, si}. Feature definition: node features, local interaction features, long-range interaction features. Here is the general framework of conditional graphical models. Given the protein structure level we want to predict, we generate an undirected graph, referred to as the protein structure graph in the later discussion. In the graph, the nodes represent the structural components, and the edges indicate the dependencies between nodes: local dependencies, intuitively the peptide bonds connecting the amino acids, or long-range dependencies, intuitively the chemical bonds between residues that are far away in sequence order. These dependencies are captured via the features, including node potentials and edge potentials. To use graphical models for protein structure prediction, we first need to define the graph that encodes the structure information. Long-range interactions: residues that are distant in the primary structure, with an unknown number of insertions between them.

14 Conditional Graphical Models (II)
Conditional probability given the observed sequence x is defined as shown in the sketch below. Prediction: search for the segmentation that maximizes P(y|x). Training phase: learn the model parameters λ by minimizing the regularized negative log loss, using iterative search algorithms that seek the direction in which the empirical feature values agree with their expectations. To use graphical models for protein structure prediction, we first need to define the graph that encodes the structure information. Long-range interactions: residues that are distant in the primary structure, with an unknown number of insertions between them.
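The formula on this slide was an image in the original deck; what follows is a hedged reconstruction of the standard form used in the SCRF papers, with the notation assumed (cliques c of the protein structure graph G, features f_k, weights λ_k):

$$P_\lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{c \in G} \sum_k \lambda_k f_k(x, y_c)\Big), \qquad Z(x) = \sum_{y'} \exp\Big(\sum_{c \in G} \sum_k \lambda_k f_k(x, y'_c)\Big).$$

Minimizing the regularized negative log loss then gives the gradient

$$\frac{\partial}{\partial \lambda_k}\Big(-\log P_\lambda(y \mid x) + \frac{\|\lambda\|^2}{2\sigma^2}\Big) = \mathbb{E}_{y' \sim P_\lambda(\cdot \mid x)}\Big[\sum_c f_k(x, y'_c)\Big] - \sum_c f_k(x, y_c) + \frac{\lambda_k}{\sigma^2},$$

which vanishes exactly when the empirical feature values agree with their model expectations, as stated above.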

15 Major Components Graph topology Efficient inference Features
Secondary structure prediction: CRF, kernel CRF. Tertiary fold recognition: segmentation CRF, chain graph model. Quaternary fold recognition: linked segmentation CRF. Efficient inference: prefer exact inference with O(nd) complexity; otherwise resort to approximate inference. Features: the framework allows flexible and rich feature definitions. There are three major challenges in applying the model described above. The first is graph topology, which involves two aspects: how to obtain the constraint structures, and how to make better use of the topology of the graph for accurate predictions. The first aspect involves domain knowledge, so we skip the detailed discussion and assume that we are given the topology of the graph beforehand; the second aspect is discussed in detail in the next few slides. The second challenge is efficient inference for computing the feature expectations and the argmax solutions. We prefer exact inference with polynomial time complexity; when that is not possible, we resort to approximations general enough to handle most cases, and I will highlight the specific algorithms. The third challenge is feature definition; the details can be found in the thesis.

16 Protein Secondary Structure Prediction
Given a protein sequence, predict its secondary structure assignments Three classes: helix (H), sheets (E) and coil (C) Input: APAFSVSPASGACGPECA Output: CCEEEEECCCCCHHHCCC

17 CRF on Secondary Structure Prediction [Liu et al, Bioinformatics 2004]
C C E E … C. Node semantics: secondary structure assignment. Graphical model: conditional random fields (CRFs) or kernel CRFs. Inference algorithm: efficient inference exists, such as the forward-backward or Viterbi algorithm (see the sketch below). This reduces our conditional graphical model to a CRF, which assigns a label to each position.
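For the chain case, the Viterbi decoder mentioned above is a few lines of dynamic programming. This sketch reuses the score matrices from the CRF sketch earlier (illustrative, not the thesis code):

```python
import numpy as np

def viterbi(node_scores, trans_scores):
    """MAP label sequence for a linear-chain CRF."""
    T, K = node_scores.shape
    delta = node_scores[0].copy()        # best score ending in each label
    back = np.zeros((T, K), dtype=int)   # backpointers
    for t in range(1, T):
        cand = delta[:, None] + trans_scores    # (prev label, current label)
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + node_scores[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):        # trace back the best path
        path.append(int(back[t, path[-1]]))
    return path[::-1]                    # e.g. indices for the H/E/C labels
```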

18 Protein Fold Recognition and Alignment
Protein fold: an identifiable, regular arrangement of secondary structure elements. Different from previous simple fold classification; provides important information and novel biological insights. Training phase / testing phase. Input: ..APAFSVSPASGACGPECA.. Output 1: Does the target fold exist? Yes. Output 2: ..NNEEEEECCCCCHHHCCC.. There are different aspects of protein structure prediction; in this talk I focus on protein fold recognition and alignment, including folds at the tertiary and quaternary levels. Definition: given a protein fold, usually represented by a training set of instances of that fold, we want to predict whether a given protein sequence contains the fold and, if so, provide its alignment against the fold. The task is important because the outputs can be used to produce the final 3-D structures; more importantly, it gives interesting insights to biologists, which is the main motivation for working on this task.

19 Conditional Graphical Model for Fixed Template Fold [Liu et al, RECOMB 2005]
β-α-β motif. Node semantics: secondary structure elements of variable lengths. Graphical model: segmentation conditional random fields (SCRFs). Inference: forward-backward and Viterbi-like algorithms can be derived under some assumptions (a decoding sketch follows below). We start with the simple case of a fixed-template fold, that is, one in which we know the specific layout of all the structural components, such as the β-α-β motif.
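Because SCRF nodes are segments of variable length, decoding searches over segment boundaries as well as states. Below is a minimal semi-Markov Viterbi sketch, assuming per-segment and per-transition score callbacks; the function names and the maximum-length cap are illustrative assumptions:

```python
import numpy as np

def scrf_viterbi(n, states, max_len, seg_score, trans_score):
    """Best segmentation of positions 0..n-1 into labeled segments.

    seg_score(s, p, q): weighted node features of a segment with state s
        covering positions p..q-1 (length preference, profile match, ...).
    trans_score(s_prev, s): weighted features on adjacent segments.
    Runs in O(n * max_len * len(states)**2).
    """
    K = len(states)
    best = np.full((n + 1, K), -np.inf)   # best[q, s]: last segment ends at q
    back = {}
    for q in range(1, n + 1):
        for s in range(K):
            for l in range(1, min(max_len, q) + 1):
                p = q - l
                node = seg_score(states[s], p, q)
                if p == 0:
                    if node > best[q, s]:
                        best[q, s], back[(q, s)] = node, (0, None)
                else:
                    for sp in range(K):
                        cand = best[p, sp] + trans_score(states[sp], states[s]) + node
                        if cand > best[q, s]:
                            best[q, s], back[(q, s)] = cand, (p, sp)
    q, s = n, int(best[n].argmax())       # trace back the segments
    segments = []
    while q > 0:
        p, sp = back[(q, s)]
        segments.append((p, q, states[s]))
        q, s = p, (sp if sp is not None else 0)
    return segments[::-1]
```

Replacing the inner maximizations with (log-)sums gives the forward-backward analogue used for the feature expectations.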

20 Conditional Graphical Model for Repetitive Fold Recognition [Liu et al, ICML 2005]
Node semantics: two-layer segmentation Y = {M, {Ξi}, T}. Level 1: the envelope, i.e. one repeat; level 2: the components of one repeat. Graphical model: chain graph model, a graph consisting of directed and undirected edges. Inference: forward-backward algorithm and Viterbi-like algorithm. Complex structures result in complex graphs, and a naive formulation would incur huge computational costs. Based on the repetitive patterns, we define a two-layer segmentation. Reflected in the protein structure graph, the top nodes Ti represent the state of each component (whether it is a repeat or not); on the bottom, the specific configuration of each node Ξi is modeled by a smaller-scale SCRF. Combining the two layers, we reach the chain graph model, a combination of directed and undirected graphs: a separate motif model estimates the conditional probability of Ti, while an SCRF models the second part.

21 Conditional Graphical Model for Quaternary Fold Recognition [Liu et al, IJCAI 2007]
Node semantics: secondary structure elements and/or simple folds. Graphical model: linked segmentation CRF (L-SCRF). Fixed template and/or repetitive subunits; inter-chain and intra-chain interactions. In a quaternary protein fold we have multiple chains, so the graph contains both inter-chain and intra-chain edges. We therefore introduce the linked SCRF model, in which the conditional probability is defined as an exponential function of features over single nodes and pairs of nodes. Notice that we define the joint probability of the labels yi over all the chains, because the folds are stabilized by the participation of all the proteins. This results in a complex graph to which approximate inference has to be applied. Quaternary structures: multiple sequences with tertiary structures associated together to form a stable structure, with a structure-stabilization mechanism similar to tertiary structures; very limited research work to date; complex structures; few positive training data.

22 Approximate Inference
Varying dimensionality requires reversible jump MCMC sampling [Green, 1995; Schmidler et al, 2001]. Four types of Metropolis proposals: state switching, position switching, segment split, segment merge. Simulated annealing reversible jump MCMC [Andrieu et al, 2000]: replace the sampling step with RJMCMC; theoretically converges to the global optimum. Sampling algorithms are pursued because of their simplicity and their ability to handle random variables of varying dimension, which a naive formulation cannot. State switching: given a segmentation yi = (Mi, wi), select a segment j uniformly from [1, M] and a state value s' uniformly from the state set S; set y*i = yi except that s*i,j = s'. Position switching: given a segmentation yi = (Mi, wi), select a segment j uniformly from [1, M] and a position assignment d' ~ U[di,j-1 + 1, di,j+1 - 1]; set y*i = yi except that d*i,j = d'. Segment split: given a segmentation yi = (Mi, wi), propose y*i = (M*i, w*i) with M*i = Mi + 1 segments by splitting the j-th segment, where j is sampled uniformly from [1, M]; set w*i,k = wi,k for k = 1, ..., j - 1, and w*i,k+1 = wi,k for k = j + 1, ..., Mi; sample a value assignment v ~ P(v) and compute (w*i,j, w*i,j+1, v') = Ψ(wi,j, v). Segment merge: given a segmentation yi = (Mi, wi), propose M*i = Mi - 1 by merging the j-th and (j+1)-th segments, where j is sampled uniformly from [1, M - 1]; set w*i,k = wi,k for k = 1, ..., j - 1, and w*i,k-1 = wi,k for k = j + 1, ..., Mi; sample a value assignment v' ~ P(v') and compute (w*i,j, v) = Ψ^-1(wi,j, wi,j+1, v'). The acceptance rate for the proposed transition then follows the Metropolis-Hastings ratio (see the sketch below).
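A minimal sketch of one Metropolis-Hastings step over segmentations, with the four move types abstracted behind callbacks. The Jacobian and dimension-matching terms for split/merge are assumed to be folded into log_q_ratio; this is illustrative, not the thesis implementation:

```python
import math
import random

def rjmcmc_step(y, log_score, moves, temperature=1.0):
    """One reversible-jump Metropolis step over segmentations y.

    log_score(y): log of the unnormalized (L-)SCRF score of y.
    moves: list of (propose, log_q_ratio) pairs implementing state switch,
        position switch, segment split, and segment merge.
    """
    propose, log_q_ratio = random.choice(moves)
    y_new = propose(y)
    if y_new is None:                 # move not applicable (e.g. merge with M = 1)
        return y
    log_accept = (log_score(y_new) - log_score(y)) / temperature \
                 + log_q_ratio(y, y_new)
    if math.log(random.random()) < min(0.0, log_accept):
        return y_new
    return y
```

For simulated annealing RJMCMC, the temperature is lowered toward zero over iterations, so the chain concentrates on the maximizing segmentation.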

23 Conditional Graphical Models for Protein Structure Prediction
To summarize, we develop a conditional graphical model framework for protein structure prediction, with the graph tailored to the target protein structure.

24 Model Roadmap
Generalized as conditional graphical models: starting from conditional random fields, kernelization yields kernel CRFs; introducing segment correlations yields segmentation CRFs; trading off local and global information yields the chain graph model; and introducing inter-chain segment correlations yields linked segmentation CRFs. We have just described these models as motivated by different kinds of protein folds; from the machine learning perspective, we start from the CRF and introduce each extension in turn, all generalized as conditional graphical models.

25 Outline Brief introduction to protein structures
Graphical models for structured prediction Conditional graphical models for protein structure prediction Experiment results Fold recognition Fold alignment prediction Discovery of potential membership proteins Conclusion and discussion

26 Experiments: Target Fold
Right-handed β-helix fold [Yoder et al, 1993]: bacterial infection of plants, binding the O-antigen, and so on; three parallel β-strands (B1, B2, B3) per rung, with T2 a conserved two-residue turn. Leucine-rich repeats (LRR) [Kobe & Deisenhofer, 1994]: a structural framework for protein-protein interaction; each repeat consists of β-strands and an α-helix connected by coils. These folds perform important functions and are complex structures with repetitive motifs; one has sequence similarity among its members, and one does not. Structural similarity without sequence similarity reaches the twilight zone of sequence-based algorithms.

27 Experiments: Target Quaternary Fold
Triple beta-spirals [van Raaij et al, Nature 1999]: virus fibers in adenovirus, reovirus, and PRD1. Double barrel trimer [Benson et al, 2004]: coat protein of adenovirus, PRD1, STIV, and PBCV. Structural similarity without sequence similarity reaches the twilight zone of sequence-based algorithms. Reasons for choosing these two: computationally, they are examples of folds that are important but have a limited number of solved examples; biologically, both are folds of virus proteins, and their common existence in viruses attacking different species reveals important evolutionary information and suggests a common ancestor of all viruses.

28 Tertiary Fold Recognition: β-Helix fold
Histogram and ranks for known β-helices against PDB-minus dataset. In the experiments we study the effectiveness of our model via three evaluation measures. The first is the fold recognition task, which is similar to an information retrieval task: we want to see whether our model ranks positive examples higher than negative ones. The graph on the left shows the histogram of scores predicted by SCRFs for discriminating the β-helix proteins from the negative ones, using the log-ratio score between the best segmentation and the null state; one color marks the known β-helices and the other the negative set. The table on the right shows the ranks generated by other algorithms; the higher the rank, the better. The chain graph model is an approximation to the SCRF model; it reduces the real running time of SCRFs by around 50 times.

29 Quaternary Fold Recognition: Triple β-Spirals
Histogram and ranks for known triple β-spirals against PDB-minus dataset

30 Quaternary Fold Recognition: Double Barrel-Trimer
Histogram and ranks for known double barrel-trimers against PDB-minus dataset. We can see that predicting the DBT fold is extremely difficult. Our method gives higher ranks to 3 of the 4 known DBT proteins, although we are unable to reach a clear separation between the DBT proteins and the rest. The results are within our expectations, because the lack of signal features and the unclear understanding of the inter-chain interactions make the prediction significantly harder. We believe more improvement can be achieved by combining the results from multiple algorithms.

31 Fold Alignment Prediction: β-Helix
Predicted alignment for known β-helices on cross-family validation

32 Fold Alignment Prediction: LLR and Triple β-Spirals
Predicted alignments for known LRRs using the chain graph model (left) and for triple β-spirals using L-SCRFs

33 Discovery of Potential β-helices
Hypothesized potential β-helices from the UniProt reference databases; full list can be accessed at. Verification on proteins whose structures were later resolved, from different organisms: 1YP2, potato tuber ADP-glucose pyrophosphorylase; 1PXZ, the major allergen from cedar pollen; GP14 of Shigella bacteriophage as a β-helix protein. 93 top-ranked proteins; good at predicting non-homologous sequences. Pollen from cedar and cypress trees is a major cause of seasonal hypersensitivity in humans in several regions of the Northern Hemisphere.

34 Conclusion Thesis Statement Strong claims Weak claims
Conditional graphical models are effective for protein structure prediction (thesis statement). Strong claims: effective representation of protein structural properties; flexibility to incorporate different kinds of informative features; efficient inference algorithms for large-scale applications. Weak claims: ability to handle long-range interactions; best performance bounded by prior knowledge. In our exploration, we have demonstrated the effectiveness of conditional graphical models for general secondary structure prediction on globular proteins; for tertiary fold (motif) recognition on two specific folds, the right-handed β-helix and leucine-rich repeats (mostly non-globular proteins); and for quaternary fold recognition on two specific folds, triple β-spirals and the double barrel trimer. We therefore confirm the thesis statement: conditional graphical models are theoretically justified and empirically effective for protein structure prediction, independent of the protein structure hierarchy.

35 Contribution and Limitation
Contribution to machine learning: enrichment of graphical models; a formulation to incorporate domain knowledge. Contribution to computational biology: effective protein structure prediction and fold recognition; solutions for the long-range interactions (inter-chain and intra-chain). Limitations: manual feature extraction; difficulty in verification; high complexity. In this thesis, our primary goal is to develop effective machine learning algorithms for protein structure prediction. In addition, we aim to design novel models that best capture the properties of protein structures, rather than naively applying existing algorithms, so that we can contribute both computationally and biologically.

36 Future Work + Computational biology Machine Learning
Protein structure prediction + protein function and protein-protein interaction prediction; drug target design. Graph-based semi-supervised learning; graph topology learning. Immediate extensions and long-term plans: how to use the abundant unlabeled data to help the prediction (semi-supervised and active learning); which protein functions and which protein-protein interactions to target (bio-literature and multiple information sources). In parallel with the applications, I am also interested in developing theory and algorithms in machine learning, such as graphical models, semi-supervised learning, active learning, and metric learning: graphical models for complex probabilistic models, prior knowledge incorporation, fast inference algorithms, graph-topology learning; graph-based semi-supervised learning, cluster-based active learning, metric learning via dimension reduction and kernels; reducing the amount of labeled data required for self-learning intrusion detection systems; active learning for structured data.

37 Acknowledgement Jaime Carbonell, Eric Xing, John Lafferty, Vanathi Gopalakrishnan, Chris Langmead, Yiming Yang, Roni Rosenfeld, Peter Weigele, Jonathan King, Judith Klein-Seetharaman, Ivet Bahar, James Conway, and many more. And fellow graduate students …


39 Features for Tertiary Fold Recognition
Node features: regular expression templates, HMM profiles, secondary structure prediction scores, segment length. Inter-node features: β-strand side-chain alignment scores, preferences for parallel alignment, distance between adjacent B23 segments. The features are general and easy to extend.

40 Features for Protein Fold Recognition

41 Discovery of Potential Double Barrel-Trimer
Potential proteins suggested in [Benson, 2005]; further verification needs experimental data.

42 Inference Algorithm for SCRF
Forward-backward algorithm*. Viterbi algorithm*. The core quantity is the probability that a segment in state y_r ends at position r, given the observations x_{l+1}, x_{l+2}, …, x_r and that the previous state y_l ends at position l (a hedged reconstruction of the recursion follows below).
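A hedged reconstruction of the starred recursions, with notation assumed rather than taken from the slide: let $\alpha(r, y)$ sum the scores of all partial segmentations whose last segment has state $y$ and ends at position $r$. Then

$$\alpha(r, y) = \sum_{l < r} \sum_{y'} \alpha(l, y') \exp\Big(\sum_k \lambda_k f_k(x, y', y, l+1, r)\Big),$$

with $Z(x) = \sum_y \alpha(n, y)$. The backward variable is defined symmetrically, and replacing the sums over $(l, y')$ with maximizations yields the Viterbi-like algorithm.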

43 Contrastive Divergence

44 Reversible jump MCMC Algorithm
Three types of proposals. Position switching: randomly select a segment j and a new position assignment d_j^(i+1) ~ U(d_{j-1}^(i), d_{j+1}^(i)). Segment split: randomly select a segment j and split it into two segments, where (d_j^(i+1), d_{j+1}^(i+1)) = G(d_{j-1}^(i), u^(i)) with u^(i) ~ U. Segment merge: randomly select a segment j and merge segments j and j+1. Simulated annealing reversible jump MCMC for computing y = argmax P(y|x) [Andrieu et al, 2000]. The L-SCRF model is very general, so we want a correspondingly general inference algorithm: MCMC sampling, extended for state spaces of varying dimension.

45 Simulated annealing reversible jump MCMC

46 Protein Structural Graph for Beta-helix

47 Protein Structure Determination
Lab experiments (time- and labor-consuming): X-ray crystallography, NMR spectroscopy, electron microscopy, and more. Computational methods: homology modeling (≥ 30% sequence similarity); fold recognition (< 30% sequence similarity); ab initio methods (no template structure needed). An active research area in multiple scientific fields. Homology modeling: required accuracy 2 Å; MODELLER (Sali and Blundell), probabilistic constraints; SWISS-MODEL. Ab initio: Harold Scheraga; HP-models (H = hydrophobic, P = polar). Rosetta: I-Sites are sequence-structure motifs mined from the PDB; the current I-Site database has 261 of these motifs; HMMSTR is a set of hidden Markov models for predicting sequences of I-sites, with states including amino-acid types, secondary structure type, discretized φ/ψ angles, and structural "context".

48 Evaluation Measures Q3 (accuracy), Precision, Recall
Segment overlap measure (SOV); Matthews correlation coefficient. Minimal sketches of Q3 and MCC follow below.
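Minimal sketches of two of these measures, assuming H/E/C label strings (the helper names are illustrative):

```python
def q3(pred, true):
    """Fraction of residues whose 3-class (H/E/C) label is correct."""
    assert len(pred) == len(true)
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def mcc(pred, true, positive):
    """Matthews correlation coefficient for one class vs. the rest."""
    tp = sum(p == positive and t == positive for p, t in zip(pred, true))
    tn = sum(p != positive and t != positive for p, t in zip(pred, true))
    fp = sum(p == positive and t != positive for p, t in zip(pred, true))
    fn = sum(p != positive and t == positive for p, t in zip(pred, true))
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(q3("HHHCCCEEE", "HHHCCCEEC"))        # 8/9 correct -> 0.888...
print(mcc("HHHCCCEEE", "HHHCCCEEC", "E"))  # per-class MCC for sheets
```

SOV additionally rewards overlapping segments rather than per-residue matches, so it is less sensitive to off-by-one boundary errors.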

49 Outline Brief introduction to protein structures
Discriminative graphical models Generalized discriminative graphical models for protein fold recognition Experiment results Conclusion and discussion

50 Graphical Models for Structured Prediction
Conditional random fields: model the conditional probability directly, not the joint probability; allow arbitrary dependencies in the observation (e.g. long-range, overlapping); adapt to different loss functions and regularizers; promising results in multiple applications. Recent developments: alternative estimation algorithms (Collins, 2002; Dietterich et al, 2004); alternative loss functions and the use of kernels (Taskar et al, 2003; Altun et al, 2003; Tsochantaridis et al, 2004); a Bayesian formulation (Qi and Minka, 2005) and a semi-Markov version (Sarawagi and Cohen, 2004). Motivated by the idea of CRFs and by the structural properties of protein folds, we developed our framework: what if the constraints are defined over segments instead of individual positions?

51 Local Information PSI-BLAST profile, SVM classifier with RBF kernel
Position-specific scoring matrices (PSSM) with a linear transformation [Kim & Park, 2003]. SVM classifier with an RBF kernel. Feature #1 (Si): the prediction score for each residue Ri (a sketch of this pipeline follows below).
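A hedged sketch of that local-information pipeline: PSSM windows fed to an RBF-kernel SVM whose per-residue class probabilities become the feature S_i. The scikit-learn API, the window size, and the random training data are illustrative assumptions, not the thesis setup:

```python
import numpy as np
from sklearn.svm import SVC

def pssm_windows(pssm, w=6):
    """Stack a (2w+1)-residue window of PSSM rows per position (zero-padded)."""
    n, a = pssm.shape                   # n residues x 20 amino-acid scores
    padded = np.vstack([np.zeros((w, a)), pssm, np.zeros((w, a))])
    return np.stack([padded[i:i + 2 * w + 1].ravel() for i in range(n)])

# Hypothetical data: a PSI-BLAST PSSM and H/E/C labels for 100 residues.
X = pssm_windows(np.random.randn(100, 20))
y = np.random.choice(list("HEC"), size=100)
clf = SVC(kernel="rbf", probability=True).fit(X, y)
S = clf.predict_proba(X)                # per-residue scores S_i for the CRF
```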

52 Structured Data Prediction
Data in many applications have inherent structures. Examples: text sequence and parsing tree; protein sequence and protein structure; 3-D image and segmented objects. Input: John ate the cat. / SEQUENCEXS…WGIKQLQAR. Given structured input (sequences or arrays of pixels), predict a classification label for each node. This is of fundamental importance in many areas, e.g. speech, natural language processing, text analysis, web search, and biosequence analysis, with potential for significant theoretical and practical advances.

53 Tertiary Fold Recognition
Structural motif: an identifiable arrangement of secondary structure elements; also called super-secondary structure, or protein fold. Structural motif recognition: given a structural motif, predict its presence in a test sequence and produce the sequence-to-structure alignment, based on sequence only. Different from genome-wide high-throughput prediction, we are especially interested in specific motifs such as the β-α-β motif.

54 Training and Testing Training phase: learn the model parameters λ
Minimizing the regularized negative log loss, using iterative search algorithms that seek the direction in which the empirical feature values agree with their expectations. Testing phase: search for the segmentation that maximizes P(y|x).

55 Future Work
Protein sequence (DSCTFTTAAAAKAGKAKAG) → protein structure → protein function → drug design.

56 Experiment Setup
Cross-family validation: focus on the identification of non-homologous proteins. Negative set: the PDB-minus set, the non-homologous proteins in the Protein Data Bank (less than 25% sequence similarity) minus the positive set. Features: node features (regular expression templates, predicted secondary structure scores, segment length, hydrophobicity, and so on); inter-node features (β-strand side-chain alignment scores, distances between nodes, and so on). The features are general and easy to extend.

57 Major Challenges Protein structures are non-linear
Long-range dependencies. Structural similarity often does not indicate sequence similarity; sequence alignment reaches the twilight zone (under 25% similarity). β-α-β motif. To attack these challenges, we use structural motif recognition as a setting. "The need to identify distant sequential similarities in order to gain structural insight can be a major challenge." RMSD: 1.9 Å, sequence identity: 16%; a difficult problem. Ubiquitin (blue), Ubx-Faf1 (gold).

58 Conditional Random Fields [Lafferty et al, 2001]
Model p(label y | observation x), not the joint distribution. Global normalization over undirected graphical models. Allow arbitrary dependencies in the observation (e.g. long-range, overlapping). Adaptive to different loss functions and regularizers.

59 Quaternary Fold Quaternary structures
Multiple sequences with tertiary structures associated together to form a stable structure Similar structure stabilization mechanism as tertiary structures Very limited research work to date Complex structures Few positive training data Triple beta-spirals Tumor necrosis factor (TNF)

60 Protein Quaternary Fold Recognition
Training / Testing. Input: ..APAFSVSPASGACGPECA.. Output 1: Does the target fold exist? Yes. Output 2: .. Seq 1: APA FSVSPA … SGACGP ECAESG. Seq 2: DSCTFT…TAAAAKAGKAKCSTITL. Inter-chain and intra-chain dependencies.

61 Conditional Random Fields
Promising results in multiple applications: tagging (Collins, 2002) and parsing (Sha and Pereira, 2003); information extraction (Pinto et al, 2003); image processing (Kumar and Hebert, 2004); DNA sequence analysis (Bockhorst & Craven, 2005). No one had considered applying this model to protein structure prediction.

