Protein Classification
Given a new protein, can we place it in its "correct" position within an existing protein hierarchy?
Methods:
BLAST / PSI-BLAST
Profile HMMs
Supervised machine learning methods
[Figure: hierarchy of proteins into folds, superfamilies, and families, with a new protein to be placed]
PSI-BLAST
Given a sequence query x and a database D:
1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to sequences y with at least some minimum significance
3. Construct a position-specific scoring matrix M; each sequence y is given a weight so that many similar sequences cannot dominate a position (Henikoff & Henikoff 1994)
4. Using the matrix M, search D for more matches
5. Iterate steps 1-4 until convergence
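A minimal sketch of this iterate-and-collect loop in Python. The scoring functions (pairwise_score, profile_score) and the significance threshold are assumed to be supplied by the caller, hits are treated as already aligned to the query without gaps, and the profile uses plain frequencies rather than the Henikoff & Henikoff weights and pseudocounts of real PSI-BLAST:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(query, hits):
    """Column-wise residue frequencies over the query and its aligned hits
    (real PSI-BLAST adds sequence weighting and pseudocounts here)."""
    columns = [Counter() for _ in query]
    for seq in [query] + hits:
        for i, aa in enumerate(seq[:len(query)]):
            columns[i][aa] += 1
    total = len(hits) + 1
    return [{aa: col[aa] / total for aa in AMINO_ACIDS} for col in columns]

def psi_blast_like(query, database, pairwise_score, profile_score,
                   threshold, max_rounds=10):
    """Steps 1-5: pairwise search, collect significant hits, build profile M,
    re-search with M, and iterate until no new hits are found."""
    hits = set()
    for _ in range(max_rounds):
        if not hits:   # first round: plain pairwise alignment of x against D
            new = {y for y in database if pairwise_score(query, y) >= threshold}
        else:          # later rounds: search D with the position-specific matrix
            profile = build_profile(query, sorted(hits))
            new = {y for y in database if profile_score(profile, y) >= threshold}
        if new <= hits:    # convergence: no new significant matches
            break
        hits |= new
    return build_profile(query, sorted(hits))
```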
Classification with Profile HMMs
[Figure: fold / superfamily / family hierarchy, with a new protein to be placed]
The Fisher Kernel
Fisher score: U_X = ∇_θ log P(X | H_1, θ)
Quantifies how each parameter contributes to generating X
For two different sequences X and Y, we can compare U_X and U_Y:
D²_F(X, Y) = (1/2σ²) |U_X - U_Y|²
Given this distance function, K(X, Y) is defined as a similarity measure:
K(X, Y) = exp(-D²_F(X, Y))
σ is set so that the average distance of training sequences X_i ∈ H_1 to sequences X_j ∈ H_0 is 1
The Fisher Kernel
To train a classifier for a given family H_1:
1. Build a profile HMM, H_1
2. U_X = ∇_θ log P(X | H_1, θ)   (Fisher score)
3. D²_F(X, Y) = (1/2σ²) |U_X - U_Y|²   (distance)
4. K(X, Y) = exp(-D²_F(X, Y))   (akin to a dot product)
5. L(X) = Σ_{X_i ∈ H_1} λ_i K(X, X_i) - Σ_{X_j ∈ H_0} λ_j K(X, X_j)
6. Iteratively adjust λ to optimize J(λ) = Σ_{X_i ∈ H_1} λ_i (2 - L(X_i)) - Σ_{X_j ∈ H_0} λ_j (2 + L(X_j))
To classify a query X:
Compute U_X
Compute K(X, X_i) for all training examples X_i with λ_i ≠ 0 (few)
Decide based on whether L(X) > 0
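A sketch of the classification step in Python, assuming the Fisher-score vectors U_X have already been extracted from the trained profile HMM (the forward-backward computation that produces them is omitted) and that the λ weights come from training; fisher_kernel and discriminant are illustrative names, not library functions:

```python
import numpy as np

def fisher_kernel(u_x, u_y, sigma):
    """K(X, Y) = exp(-|U_X - U_Y|^2 / (2 sigma^2)), from precomputed Fisher scores."""
    d2 = np.sum((u_x - u_y) ** 2) / (2.0 * sigma ** 2)
    return np.exp(-d2)

def discriminant(u_query, pos_scores, pos_lambdas, neg_scores, neg_lambdas, sigma):
    """L(X) = sum_i lambda_i K(X, X_i) - sum_j lambda_j K(X, X_j).
    pos_scores / neg_scores hold the Fisher-score vectors of the training
    sequences with nonzero lambda (the few that matter at query time)."""
    pos = sum(lam * fisher_kernel(u_query, u, sigma)
              for lam, u in zip(pos_lambdas, pos_scores))
    neg = sum(lam * fisher_kernel(u_query, u, sigma)
              for lam, u in zip(neg_lambdas, neg_scores))
    return pos - neg

# Classify: predict membership in the family H_1 if L(X) > 0.
# in_family = discriminant(u_query, pos_scores, pos_lambdas,
#                          neg_scores, neg_lambdas, sigma) > 0
```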
QUESTION: What is the running time of the Fisher kernel SVM on a query X?
k-mer based SVMs
Leslie, Eskin, Weston, Noble; NIPS 2002
Highlights:
The Fisher kernel K(X, Y) = exp(-(1/2σ²) |U_X - U_Y|²) requires an expensive profile alignment: U_X = ∇_θ log P(X | H_1, θ) costs O(|X| |H_1|)
Instead, the new kernel K(X, Y) just "counts up" k-mers with mismatches in common between X and Y: O(|X|) in practice
Off-the-shelf SVM software can be used
k-mer based SVMs
For a given word size k and mismatch tolerance l, define
K(X, Y) = # distinct k-long word occurrences with ≤ l mismatches in common between X and Y
Define the normalized kernel K'(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))
An SVM can be learned by supplying this kernel function
Example (k = 3, l = 1): X = ABACARDI, Y = ABRADABI
K(X, Y) = 4; K'(X, Y) = 4 / sqrt(7 · 7) = 4/7
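A brute-force Python sketch of one simple reading of this kernel: count pairs of k-mer occurrences, one from X and one from Y, that agree up to l mismatches, then normalize. The slide's exact counting convention (and the trie-based implementation that makes the computation roughly linear-time) may differ, so the example values above are not asserted by the code:

```python
import math

def mismatch_kernel(x, y, k=3, l=1):
    """Count pairs of k-mer occurrences (one from x, one from y)
    that differ in at most l positions; O(|x| |y| k) brute force."""
    count = 0
    for i in range(len(x) - k + 1):
        for j in range(len(y) - k + 1):
            mismatches = sum(a != b for a, b in zip(x[i:i + k], y[j:j + k]))
            if mismatches <= l:
                count += 1
    return count

def normalized_mismatch_kernel(x, y, k=3, l=1):
    """K'(X, Y) = K(X, Y) / sqrt(K(X, X) * K(Y, Y))."""
    kxy = mismatch_kernel(x, y, k, l)
    return kxy / math.sqrt(mismatch_kernel(x, x, k, l) * mismatch_kernel(y, y, k, l))

# Slide example (exact value depends on the counting convention used):
print(normalized_mismatch_kernel("ABACARDI", "ABRADABI", k=3, l=1))
```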
SVMs will find a few support vectors
After training, the SVM has determined a small set of sequences, the support vectors, which are the only ones that need to be compared with a query sequence X
Benchmarks
Semi-Supervised Methods GENERATIVE SUPERVISED METHODS
Semi-Supervised Methods DISCRIMINATIVE SUPERVISED METHODS
Semi-Supervised Methods
UNSUPERVISED METHODS
Mixture of centers: data generated by a fixed set of centers (how many?)
Semi-Supervised Methods
Some examples are labeled
Assume labels vary smoothly among all examples
SVMs and other discriminative methods may make significant mistakes due to lack of data
Semi-Supervised Methods
Some examples are labeled
Assume labels vary smoothly among all examples
Attempt to "contract" the distances within each cluster while keeping inter-cluster distances larger
Semi-Supervised Methods
1. Kuang, Ie, Wang, Siddiqi, Freund, Leslie 2005: a PSI-BLAST profile-based method
2. Weston, Leslie, Elisseeff, Noble, NIPS 2003: cluster kernels
(semi) 1. Profile k-mer based SVMs
For each sequence X, obtain the PSI-BLAST profile Q(X) = {p_i(β); β an amino acid, 1 ≤ i ≤ |X|}
For every k-mer in X, x_j … x_{j+k-1}, define the σ-neighborhood
M_{k,σ}(Q[x_j … x_{j+k-1}]) = {b_1 … b_k | -Σ_{i=0…k-1} log p_{j+i}(b_i) < σ}
Define K(X, Y): for each word b_1 … b_k contained in the neighborhoods of m profile k-mers of X and n profile k-mers of Y, add m·n
In practice each k-mer can have ≤ 2 mismatches, and K(X, Y) can be computed quickly, in O(k(|X| + |Y|))
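A naive Python sketch of the σ-neighborhood test and the resulting kernel, assuming the per-position distributions p_i(·) from the PSI-BLAST profile are given as a list of dicts and the candidate words are enumerated explicitly (the paper instead walks a mismatch tree to reach the stated running time):

```python
import math

def in_neighborhood(profile_kmer, word, sigma):
    """profile_kmer: k dicts p_i(aa) taken from Q(X); word: a k-letter string.
    sigma-neighborhood test: -sum_i log p_i(word[i]) < sigma."""
    score = -sum(math.log(max(p.get(aa, 0.0), 1e-10))
                 for p, aa in zip(profile_kmer, word))
    return score < sigma

def profile_kmer_kernel(profile_x, profile_y, words, k, sigma):
    """Naive K(X, Y): for each candidate word, multiply the number of profile
    k-mers of X and of Y whose sigma-neighborhood contains it, and sum."""
    total = 0
    for w in words:
        m = sum(in_neighborhood(profile_x[j:j + k], w, sigma)
                for j in range(len(profile_x) - k + 1))
        n = sum(in_neighborhood(profile_y[j:j + k], w, sigma)
                for j in range(len(profile_y) - k + 1))
        total += m * n
    return total
```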
(semi) 1. Discriminative motifs
According to this kernel K(X, Y), sequence X is mapped to Φ_{k,σ}(X), a vector in 20^k dimensions:
Φ_{k,σ}(X)(b_1 … b_k) = # k-mers in Q(X) whose neighborhood includes b_1 … b_k
Then the SVM learns a discriminating "hyperplane" with normal vector v:
v = Σ_{i=1…N} (±) λ_i Φ_{k,σ}(X^(i))
Consider a profile k-mer Q[x_j … x_{j+k-1}]; its contribution to v is ~ ⟨Φ_{k,σ}(Q[x_j … x_{j+k-1}]), v⟩
Consider a position i in X: count up the contributions of all words containing x_i:
g(x_i) = Σ_{j=1…k} max{0, ⟨Φ_{k,σ}(Q[x_{i-k+j} … x_{i-1+j}]), v⟩}
Sort these contributions across all positions of all sequences to pick important positions, or discriminative motifs
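A Python sketch of the per-position score g(x_i), assuming the normal vector v is given as a dict from words to weights and that a neighborhood membership test (such as in_neighborhood above) is supplied; all parameter names here are illustrative:

```python
def position_contributions(profile, v, words, k, neighborhood_contains):
    """g(x_i): sum, over the k-mer windows that contain position i, of the
    positive part of <Phi(Q[x_j..x_{j+k-1}]), v>.
    profile: per-position distributions from Q(X); v: dict word -> weight;
    neighborhood_contains(profile_kmer, word): sigma-neighborhood test."""
    n = len(profile)
    window_score = [
        sum(v.get(w, 0.0) for w in words
            if neighborhood_contains(profile[j:j + k], w))
        for j in range(n - k + 1)
    ]
    g = [0.0] * n
    for i in range(n):
        # windows j with j <= i <= j + k - 1 (clipped to the sequence ends)
        for j in range(max(0, i - k + 1), min(i, n - k) + 1):
            g[i] += max(0.0, window_score[j])
    return g
```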
(semi) 2. Cluster Kernels
Two (more!) methods
1. Neighborhood
  1. For each X, run PSI-BLAST to get the set of similar sequences Nbd(X)
  2. Define Φ_nbd(X) = (1/|Nbd(X)|) Σ_{X' ∈ Nbd(X)} Φ_original(X')
     "Counts of all k-mers matching, with at most 1 difference, all sequences that are similar to X"
  3. K_nbd(X, Y) = (1/(|Nbd(X)| · |Nbd(Y)|)) Σ_{X' ∈ Nbd(X)} Σ_{Y' ∈ Nbd(Y)} K(X', Y')
2. Bagged mismatch
  1. Run k-means clustering n times, giving p = 1, …, n assignments c_p(X)
  2. For every X and Y, count the fraction of times they are bagged together:
     K_bag(X, Y) = (1/n) Σ_p 1(c_p(X) = c_p(Y))
  3. Combine the "bag fraction" with the original comparison K(·,·):
     K_new(X, Y) = K_bag(X, Y) · K(X, Y)
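A short sketch of the bagged-mismatch variant in Python, assuming the base mismatch-kernel matrix and a fixed-length feature vector per sequence (e.g. k-mer counts) are already available; scikit-learn's KMeans stands in here for whatever clustering procedure was actually used:

```python
import numpy as np
from sklearn.cluster import KMeans

def bagged_kernel(base_kernel, features, n_runs=10, n_clusters=5, seed=0):
    """K_new = K_bag * K, where K_bag(X, Y) is the fraction of the n_runs
    k-means clusterings that assign X and Y to the same cluster."""
    m = len(features)
    bag = np.zeros((m, m))
    for p in range(n_runs):
        labels = KMeans(n_clusters=n_clusters, n_init=1,
                        random_state=seed + p).fit_predict(features)
        bag += (labels[:, None] == labels[None, :])
    bag /= n_runs
    return bag * base_kernel
```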
Some Benchmarks
Google-like homology search
The internet and the network of protein homologies have something in common: both are scale-free networks
Given a query X, Google ranks webpages by a flow algorithm:
From each webpage W, linked neighbors receive flow
At time t+1, W sends to its neighbors the flow it received at time t
This is a finite, ergodic, aperiodic Markov chain
Its stationary distribution can be found efficiently, as the left eigenvector with eigenvalue 1:
Start with an arbitrary probability distribution and repeatedly multiply by the transition matrix
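A compact illustration of that last point, assuming the transition matrix P has already been built from the link structure (no damping/teleportation term is included):

```python
import numpy as np

def stationary_distribution(P, n_iter=200):
    """Power iteration for the stationary distribution of a finite, ergodic,
    aperiodic Markov chain. P[i, j] is the probability of moving from page i
    to page j (each row sums to 1); the result pi satisfies pi = pi @ P."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)        # start from an arbitrary distribution
    for _ in range(n_iter):
        pi = pi @ P                 # repeatedly multiply by the transition matrix
    return pi / pi.sum()
```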
Google-like homology search
Weston, Elisseeff, Zhu, Leslie, Noble, PNAS 2004
RANKPROP algorithm for protein homology
First, compute a matrix K_ij of PSI-BLAST homology between proteins i and j, normalized so that Σ_j K_ji = 1
1. Initialization: y_1(0) = 1; y_i(0) = 0 for i ≠ 1
2. For t = 0, 1, …
3.   For i = 2 to m
4.     y_i(t+1) = K_1i + α Σ_j K_ji y_j(t)
In the end, let y_i be the ranking score for the similarity of sequence i to sequence 1 (α = 0.95 works well)
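A minimal Python sketch of this propagation, assuming the normalized PSI-BLAST similarity matrix K is already computed; the query is indexed as sequence 0 here (sequence 1 on the slide), and the loop runs for a fixed number of iterations rather than testing convergence:

```python
import numpy as np

def rankprop(K, alpha=0.95, n_iter=50):
    """K[j, i]: normalized PSI-BLAST similarity sent from protein j to protein i
    (so each column of K sums to 1). Returns ranking scores y_i for the
    similarity of every sequence to the query, which is sequence 0 here."""
    m = K.shape[0]
    y = np.zeros(m)
    y[0] = 1.0
    for _ in range(n_iter):
        # y_i(t+1) = K_{0i} + alpha * sum_j K_{ji} y_j(t)
        y_new = K[0, :] + alpha * (K.T @ y)
        y_new[0] = 1.0              # keep the query's own score fixed
        y = y_new
    return y
```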
Google-like homology search For a given protein family, what fraction of true members of the family are ranked higher than the first 50 non-members?
Protein Structure Prediction
Protein Structure Determination
Experimental: X-ray crystallography, NMR spectroscopy
Computational: structure prediction (the Holy Grail)
Sequence implies structure; therefore, in principle, we can predict the structure from the sequence alone
Protein Structure Prediction
Ab initio: use just first principles (energy, geometry, and kinematics)
Homology: find the best match to a database of sequences with known 3D structure
Threading
Meta-servers and other methods
Ab initio Prediction
Sampling the global conformation space
  Lattice models / discrete-state models
  Molecular dynamics
Picking native conformations with an energy function
  Solvation model: how the protein interacts with water
  Pair interactions between amino acids
Predicting secondary structure
  Local homology
  Fragment libraries
Lattice String Folding
HP model: the main modeled force is hydrophobic attraction
Folding is NP-hard on both the 2-D square and the 3-D cubic lattice
Constant-factor approximation algorithms exist
Not so relevant biologically
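For concreteness, a small Python sketch of the HP-model energy on the 2-D square lattice: every pair of H residues that sit on adjacent lattice sites but are not adjacent along the chain contributes -1 (the helper name and representation are just for illustration):

```python
def hp_energy(sequence, path):
    """sequence: string over {'H', 'P'}; path: list of (x, y) lattice
    coordinates, one per residue, forming a self-avoiding walk."""
    coords = {pos: i for i, pos in enumerate(path)}
    assert len(coords) == len(path), "conformation must be self-avoiding"
    energy = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != 'H':
            continue
        for nbr in ((x + 1, y), (x, y + 1)):   # each lattice contact checked once
            j = coords.get(nbr)
            if j is not None and sequence[j] == 'H' and abs(i - j) > 1:
                energy -= 1
    return energy

# Example: four H residues folded into a 2x2 square have one non-bonded
# H-H contact, so the energy is -1.
print(hp_energy("HHHH", [(0, 0), (1, 0), (1, 1), (0, 1)]))
```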
Lattice String Folding