Presentation transcript:

Carnegie Mellon School of Computer Science
Slide 1: Protein Tertiary and Quaternary Fold Recognition: A ML Approach
Jaime Carbonell
Joint work with: Yan Liu (IBM), Vanathi Gopalakrishnan (U Pitt), Peter Weigele (MIT)
Language Technologies Institute, Carnegie Mellon University
Machine Learning Lunch, 11-April-2007

Slide 2: Snapshot of Cell Biology
[Figure: protein sequence (e.g. DSCTFTTAAAAKAGKAKAG), protein structure, and protein function. Image source: Nobelprize.org]

Slide 3: PROTEINS: Sequence → Structure → Function (Normal)
Primary sequence:
MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA
[Figure: primary sequence → folding → 3D structure → complex function within network of proteins]
(Borrowed from: Judith Klein-Seetharaman)

Slide 4: PROTEINS: Sequence → Structure → Function (Disease)
Primary sequence:
MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA
[Figure: primary sequence → folding → 3D structure → complex function within network of proteins]

Slide 5: Example Protein Structures
[Figure panels: adenovirus fibre shaft; virus capsid; triple beta-spiral fold in the adenovirus fiber shaft]

Slide 6: Predicting Protein Structures
- Protein structure is a key determinant of protein function.
- Resolving protein structures experimentally in vitro by crystallography is very expensive, and NMR can only resolve very small proteins.
- The gap between known protein sequences and structures:
  - 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
  - Therefore we need to predict structures in silico.

Slide 7: Quaternary Folds and Alignments
- Protein fold
  - Identifiable regular arrangement of secondary structural elements
- Thus far, a limited number of protein folds have been discovered (~1000)
  - Very little research on quaternary folds
- Complex structures and little labeled data
[Figure: quaternary fold recognition. Seq 1: APA FSVSPA … SGACGP ECAESG; Seq 2: DSCTFT…TAAAAKAGKAKCSTITL]

Biology task | Protein fold | Membership and non-membership proteins | Will the protein take the fold?
AI task | Pattern to be induced | Training data (seq-struc pairs + physics) | Does the pattern appear in the testing sequence?

Slide 8: Previous Work
- Sequence similarity perspective
  - Sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997]
  - Profile HMMs, e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998]
  - Window-based methods, e.g. PSIPRED [Jones, 2001]
- Physical forces perspective
  - Homology modeling or threading, e.g. Threader [Jones, 1998]
- Structural biology perspective
  - Painstakingly hand-engineered methods for specific structures, e.g. αα- and ββ-hairpins, β-turn and β-helix [Efimov, 1991; Wilmot and Thornton, 1990; Bradley et al, 2001]
Limitations noted on the slide:
- Generative models based on a rough approximation of free energy perform very poorly on complex structures.
- Very hard to generalize due to built-in constants and fixed features.
- Fail to capture structural properties and long-range dependencies.

Slide 9: Conditional Random Fields
- Hidden Markov model (HMM) [Rabiner, 1989]
- Conditional random fields (CRFs) [Lafferty et al, 2001]
  - Model the conditional probability directly (discriminative models, directly optimizable)
  - Allow arbitrary dependencies in the observation
  - Adaptive to different loss functions and regularizers
  - Promising results in multiple applications
  - But need to scale up (computationally) and extend to long-distance dependencies
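To make "model the conditional probability directly" concrete, here is a minimal sketch of a linear-chain CRF in plain Python. It is an illustration, not the talk's implementation. The inputs `node` and `trans` are hypothetical: `node[t][k]` stands for the weighted feature sum at position t for label k (it may depend on the whole observation), and `trans[i][j]` for the weighted score of moving from label i to label j.

```python
import math
from itertools import product

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(node, trans, labels):
    """Unnormalized log-score of one label sequence."""
    score = node[0][labels[0]]
    for t in range(1, len(labels)):
        score += trans[labels[t - 1]][labels[t]] + node[t][labels[t]]
    return score

def log_partition(node, trans):
    """Forward algorithm: log Z(x) in O(T * K^2) instead of O(K^T)."""
    num_labels = len(trans)
    alpha = list(node[0])
    for t in range(1, len(node)):
        alpha = [
            node[t][j] + logsumexp([alpha[i] + trans[i][j] for i in range(num_labels)])
            for j in range(num_labels)
        ]
    return logsumexp(alpha)

def log_prob(node, trans, labels):
    """log p(y | x) = score(x, y) - log Z(x)."""
    return sequence_score(node, trans, labels) - log_partition(node, trans)
```

Because the normalizer Z(x) is computed per observation sequence, the scores in `node` can use arbitrary, overlapping features of x, which is the property the slide contrasts with HMMs.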

Slide 10: Our Solution: Conditional Graphical Models
- Outputs: Y = {M, {W_i}}, where W_i = {p_i, q_i, s_i}
- Feature definition
  - Node feature
  - Local interaction feature
  - Long-range interaction feature
[Figure labels: long-range dependency; local dependency]
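The output Y = {M, {W_i}} can be pictured as a small data structure. The sketch below assumes a reading the slide does not spell out: M is the number of segments, p_i and q_i are the first and last residue positions of segment i, and s_i is its structural state. The class names and the tiling check are hypothetical, and the state labels are borrowed from the alignment slide purely as examples.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Segment:
    start: int  # p_i (assumed: first residue index, inclusive)
    end: int    # q_i (assumed: last residue index, inclusive)
    state: str  # s_i (assumed: structural state label)

@dataclass
class Segmentation:
    segments: List[Segment]

    @property
    def M(self) -> int:
        """Number of segments."""
        return len(self.segments)

    def is_valid(self, length: int) -> bool:
        """True if the segments tile positions 0..length-1 with no gaps or overlaps."""
        pos = 0
        for seg in self.segments:
            if seg.start != pos or seg.end < seg.start:
                return False
            pos = seg.end + 1
        return pos == length
```

Under this reading, the number of output variables M changes from one candidate labelling to the next, which is why inference later needs moves that add and remove segments.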

Slide 11: Linked Segmentation CRF
- Nodes: secondary structure elements and/or simple folds
- Edges: local interactions and long-range inter-chain and intra-chain interactions
- L-SCRF: the conditional probability of y given x is defined as [equation not captured in the transcript]
[Figure label: joint labels]

Slide 12: Linked Segmentation CRF (II)
- Classification: [equation not captured in the transcript]
- Training: learn the model parameters λ
  - Minimize the regularized negative log loss
  - Iterative search algorithms seek the direction in which the empirical feature values agree with their expectations
- Complex graphs result in huge computational complexity
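The training bullets can be written out for a generic log-linear conditional model. This is a sketch under stated assumptions rather than the talk's exact objective: the feature functions f_k, the training set size N, and the Gaussian (L2) regularizer with variance σ² are assumptions. The slide only says "regularized negative log loss" with parameters λ.

```latex
\begin{aligned}
p_\lambda(y \mid x) &= \frac{1}{Z_\lambda(x)} \exp\Big(\sum_k \lambda_k f_k(x, y)\Big),
\qquad Z_\lambda(x) = \sum_{y'} \exp\Big(\sum_k \lambda_k f_k(x, y')\Big) \\
L(\lambda) &= -\sum_{n=1}^{N} \log p_\lambda\big(y^{(n)} \mid x^{(n)}\big) + \frac{\lVert \lambda \rVert^2}{2\sigma^2} \\
\frac{\partial L}{\partial \lambda_k} &= \sum_{n=1}^{N} \Big( \mathbb{E}_{p_\lambda(y \mid x^{(n)})}\big[f_k(x^{(n)}, y)\big] - f_k\big(x^{(n)}, y^{(n)}\big) \Big) + \frac{\lambda_k}{\sigma^2}
\end{aligned}
```

Apart from the regularization term, the gradient vanishes exactly when the expected feature values under the model match the empirical ones, which is the agreement condition the slide describes. The expectation requires summing over all labellings y, and that sum is what becomes expensive on complex graphs.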

Slide 13: Approximate Inference of L-SCRF
- Most approximation algorithms cannot handle a variable number of nodes in the graph, but we need variable graph topologies, so...
- Reversible jump MCMC sampling [Green, 1995; Schmidler et al, 2001] with four types of Metropolis operators
  - State switching
  - Position switching
  - Segment split
  - Segment merge
- Simulated annealing reversible jump MCMC [Andrieu et al, 2000]
  - Replace the sample with RJ MCMC
  - Theoretically converges to the global optimum
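The four operators can be sketched as moves on a list of (start, end, state) segments, combined with an annealed acceptance rule. This is a deliberately simplified illustration and not the algorithm from the talk: it omits the proposal-ratio term that a faithful reversible jump sampler needs in its acceptance probability, so it behaves as a stochastic search rather than an exact sampler. The scoring function, the cooling schedule, and the toy target are all hypothetical.

```python
import math
import random

STATES = ["B1", "B2", "T1", "T2"]  # example state names taken from the alignment slide

def propose(segs, rng):
    """Apply one random move to segs [(start, end, state), ...]; None if not applicable."""
    segs = list(segs)
    move = rng.choice(["state", "position", "split", "merge"])
    i = rng.randrange(len(segs))
    start, end, state = segs[i]
    if move == "state":  # state switching
        segs[i] = (start, end, rng.choice([s for s in STATES if s != state]))
    elif move == "position":  # shift the boundary with the next segment by one
        if i + 1 >= len(segs):
            return None
        _, next_end, next_state = segs[i + 1]
        new_end = end + rng.choice([-1, 1])
        if new_end < start or new_end >= next_end:
            return None  # would leave one of the two segments empty
        segs[i] = (start, new_end, state)
        segs[i + 1] = (new_end + 1, next_end, next_state)
    elif move == "split":  # segment split
        if end == start:
            return None
        cut = rng.randrange(start, end)  # last index of the left piece
        segs[i:i + 1] = [(start, cut, state), (cut + 1, end, rng.choice(STATES))]
    else:  # segment merge with the next segment
        if i + 1 >= len(segs):
            return None
        segs[i:i + 2] = [(start, segs[i + 1][1], state)]
    return segs

def anneal(score, length, steps=2000, t0=1.0, seed=0):
    """Annealed search over segmentations; returns the best one seen and its score."""
    rng = random.Random(seed)
    current = [(0, length - 1, STATES[0])]
    cur_s = score(current)
    best, best_s = current, cur_s
    for k in range(steps):
        temp = t0 / (1 + k)  # hypothetical cooling schedule
        cand = propose(current, rng)
        if cand is None:
            continue
        cand_s = score(cand)
        if cand_s >= cur_s or rng.random() < math.exp((cand_s - cur_s) / temp):
            current, cur_s = cand, cand_s
            if cur_s > best_s:
                best, best_s = current, cur_s
    return best, best_s

# Toy objective (hypothetical): reward positions matching a hidden labelling,
# with a small penalty per segment.
TARGET = ["B1"] * 4 + ["T1"] * 2 + ["B2"] * 4

def toy_score(segs):
    hits = sum(TARGET[p] == s for a, b, s in segs for p in range(a, b + 1))
    return hits - 0.5 * len(segs)
```

The split and merge moves are what let the search change the number of segments, which fixed-dimension samplers cannot do. In a real L-SCRF, `score` would be the model's log-potential for the candidate segmentation.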

Slide 14: Features for Protein Fold Recognition

Slide 15: Tertiary Fold Recognition: β-Helix Fold
[Figure: histogram and ranks for known β-helices against the PDB-minus dataset]
- The chain graph model reduces the real running time of the SCRF model by a factor of around 50.

Slide 16: Fold Alignment Prediction: β-Helix
[Figure: predicted alignment for known β-helices on cross-family validation]

Slide 17: Discovery of New Potential β-helices
- Run the structural predictor to seek potential β-helices in UniProt (structurally unresolved) databases
  - Full list (98 new predictions) can be accessed at
- Verification on 3 proteins, from different organisms, whose structures were later resolved experimentally
  - 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase
  - 1PXZ: The Major Allergen From Cedar Pollen
  - GP14 of Shigella bacteriophage as a β-helix protein
  - Not a single false positive!

Slide 18: Experiments: Target Quaternary Folds
- Triple beta-spirals [van Raaij et al, Nature 1999]
  - Virus fibers in adenovirus, reovirus and PRD1
- Double barrel trimer [Benson et al, 2004]
  - Coat protein of adenovirus, PRD1, STIV, PBCV

Slide 19: Experiment Results: Fold Recognition
[Figure panels: double barrel trimer; triple beta-spirals]

Slide 20: Experiment Results: Alignment Prediction
Triple beta-spirals
- Four states: B1, B2, T1 and T2
- Correct alignment: B1: i-o; B2: a-h
[Figure: predicted alignment, with B1 and B2 marked]

Slide 21: Experiment Results: Discovery of New Membership Proteins
- Predicted membership proteins of triple beta-spirals can be accessed at
- Membership proteins of the double barrel trimer suggested by biologists [Benson, 2005] compared with L-SCRF predictions

Slide 22: Concluding Remarks
- Conditional graphical models for protein structure prediction
  - Effective representation for protein structural properties
  - Feasibility of incorporating different kinds of informative features
  - Efficient inference algorithms for large-scale applications
- A major extension compared with previous work
  - Knowledge representation through graphical models
  - Ability to handle long-range interactions within one chain and between chains
- Future work
  - Automatic learning of graph topology
  - Active learning, including minority-class discovery