Protein Classification
Given a new protein, can we place it in its "correct" position within an existing protein hierarchy?
Methods:
BLAST / PSI-BLAST
Profile HMMs
Supervised machine learning methods
[Figure: hierarchy of proteins into folds, superfamilies, and families, with a new protein to be placed]
PSI-BLAST
Given a sequence query x and a database D:
1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to sequences y with at least some minimum significance
3. Construct a position-specific scoring matrix M; each sequence y is given a weight so that many similar sequences cannot dominate a position (Henikoff & Henikoff 1994)
4. Using the matrix M, search D for more matches
5. Iterate steps 1-4 until convergence
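A minimal sketch of this iterate-and-collect loop in Python. The scoring functions (pairwise_score, profile_score) and the significance threshold are assumed to be supplied by the caller, hits are treated as already aligned to the query without gaps, and the profile uses plain frequencies rather than the Henikoff & Henikoff weights and pseudocounts of real PSI-BLAST:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(query, hits):
    """Column-wise residue frequencies over the query and its aligned hits
    (real PSI-BLAST adds sequence weighting and pseudocounts here)."""
    columns = [Counter() for _ in query]
    for seq in [query] + hits:
        for i, aa in enumerate(seq[:len(query)]):
            columns[i][aa] += 1
    total = len(hits) + 1
    return [{aa: col[aa] / total for aa in AMINO_ACIDS} for col in columns]

def psi_blast_like(query, database, pairwise_score, profile_score,
                   threshold, max_rounds=10):
    """Steps 1-5: pairwise search, collect significant hits, build profile M,
    re-search with M, and iterate until no new hits are found."""
    hits = set()
    for _ in range(max_rounds):
        if not hits:   # first round: plain pairwise alignment of x against D
            new = {y for y in database if pairwise_score(query, y) >= threshold}
        else:          # later rounds: search D with the position-specific matrix
            profile = build_profile(query, sorted(hits))
            new = {y for y in database if profile_score(profile, y) >= threshold}
        if new <= hits:    # convergence: no new significant matches
            break
        hits |= new
    return build_profile(query, sorted(hits))
```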
Classification with Profile HMMs
[Figure: fold / superfamily / family hierarchy, with a new protein to be placed]
The Fisher Kernel
Fisher score: U_X = ∇_θ log P(X | H_1, θ)
Quantifies how each parameter contributes to generating X
For two different sequences X and Y, we can compare U_X and U_Y:
D²_F(X, Y) = (1/2σ²) |U_X - U_Y|²
Given this distance function, K(X, Y) is defined as a similarity measure:
K(X, Y) = exp(-D²_F(X, Y))
σ is set so that the average distance of training sequences X_i ∈ H_1 to sequences X_j ∈ H_0 is 1
The Fisher Kernel
To train a classifier for a given family H_1:
1. Build a profile HMM, H_1
2. U_X = ∇_θ log P(X | H_1, θ)   (Fisher score)
3. D²_F(X, Y) = (1/2σ²) |U_X - U_Y|²   (distance)
4. K(X, Y) = exp(-D²_F(X, Y))   (akin to a dot product)
5. L(X) = Σ_{X_i ∈ H_1} λ_i K(X, X_i) - Σ_{X_j ∈ H_0} λ_j K(X, X_j)
6. Iteratively adjust λ to optimize J(λ) = Σ_{X_i ∈ H_1} λ_i (2 - L(X_i)) - Σ_{X_j ∈ H_0} λ_j (2 + L(X_j))
To classify a query X:
Compute U_X
Compute K(X, X_i) for all training examples X_i with λ_i ≠ 0 (few)
Decide based on whether L(X) > 0
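A sketch of the classification step in Python, assuming the Fisher-score vectors U_X have already been extracted from the trained profile HMM (the forward-backward computation that produces them is omitted) and that the λ weights come from training; fisher_kernel and discriminant are illustrative names, not library functions:

```python
import numpy as np

def fisher_kernel(u_x, u_y, sigma):
    """K(X, Y) = exp(-|U_X - U_Y|^2 / (2 sigma^2)), from precomputed Fisher scores."""
    d2 = np.sum((u_x - u_y) ** 2) / (2.0 * sigma ** 2)
    return np.exp(-d2)

def discriminant(u_query, pos_scores, pos_lambdas, neg_scores, neg_lambdas, sigma):
    """L(X) = sum_i lambda_i K(X, X_i) - sum_j lambda_j K(X, X_j).
    pos_scores / neg_scores hold the Fisher-score vectors of the training
    sequences with nonzero lambda (the few that matter at query time)."""
    pos = sum(lam * fisher_kernel(u_query, u, sigma)
              for lam, u in zip(pos_lambdas, pos_scores))
    neg = sum(lam * fisher_kernel(u_query, u, sigma)
              for lam, u in zip(neg_lambdas, neg_scores))
    return pos - neg

# Classify: predict membership in the family H_1 if L(X) > 0.
# in_family = discriminant(u_query, pos_scores, pos_lambdas,
#                          neg_scores, neg_lambdas, sigma) > 0
```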
QUESTION: What is the running time of the Fisher kernel SVM on a query X?
k-mer based SVMs
Leslie, Eskin, Weston, Noble; NIPS 2002
Highlights:
The Fisher kernel K(X, Y) = exp(-(1/2σ²) |U_X - U_Y|²) requires an expensive profile alignment: U_X = ∇_θ log P(X | H_1, θ) costs O(|X| |H_1|)
Instead, the new kernel K(X, Y) just "counts up" k-mers with mismatches in common between X and Y: O(|X|) in practice
Off-the-shelf SVM software can be used
k-mer based SVMs
For a given word size k and mismatch tolerance l, define
K(X, Y) = # distinct k-long word occurrences with ≤ l mismatches in common between X and Y
Define the normalized kernel K'(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))
An SVM can be learned by supplying this kernel function
Example (k = 3, l = 1): X = ABACARDI, Y = ABRADABI
K(X, Y) = 4; K'(X, Y) = 4 / sqrt(7 · 7) = 4/7
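A brute-force Python sketch of one simple reading of this kernel: count pairs of k-mer occurrences, one from X and one from Y, that agree up to l mismatches, then normalize. The slide's exact counting convention (and the trie-based implementation that makes the computation roughly linear-time) may differ, so the example values above are not asserted by the code:

```python
import math

def mismatch_kernel(x, y, k=3, l=1):
    """Count pairs of k-mer occurrences (one from x, one from y)
    that differ in at most l positions; O(|x| |y| k) brute force."""
    count = 0
    for i in range(len(x) - k + 1):
        for j in range(len(y) - k + 1):
            mismatches = sum(a != b for a, b in zip(x[i:i + k], y[j:j + k]))
            if mismatches <= l:
                count += 1
    return count

def normalized_mismatch_kernel(x, y, k=3, l=1):
    """K'(X, Y) = K(X, Y) / sqrt(K(X, X) * K(Y, Y))."""
    kxy = mismatch_kernel(x, y, k, l)
    return kxy / math.sqrt(mismatch_kernel(x, x, k, l) * mismatch_kernel(y, y, k, l))

# Slide example (exact value depends on the counting convention used):
print(normalized_mismatch_kernel("ABACARDI", "ABRADABI", k=3, l=1))
```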
SVMs will find a few support vectors
After training, the SVM has determined a small set of sequences, the support vectors, which are the only ones that need to be compared with a query sequence X
Benchmarks
Semi-Supervised Methods GENERATIVE SUPERVISED METHODS
Semi-Supervised Methods DISCRIMINATIVE SUPERVISED METHODS
Semi-Supervised Methods
UNSUPERVISED METHODS
Mixture of centers: data generated by a fixed set of centers (how many?)
Semi-Supervised Methods
Some examples are labeled
Assume labels vary smoothly among all examples
SVMs and other discriminative methods may make significant mistakes due to lack of data
Semi-Supervised Methods
Some examples are labeled
Assume labels vary smoothly among all examples
Attempt to "contract" the distances within each cluster while keeping inter-cluster distances larger
Semi-Supervised Methods
1. Kuang, Ie, Wang, Siddiqi, Freund, Leslie 2005: a PSI-BLAST profile-based method
2. Weston, Leslie, Elisseeff, Noble, NIPS 2003: cluster kernels
(semi) 1. Profile k-mer based SVMs
For each sequence X, obtain the PSI-BLAST profile Q(X) = {p_i(β); β an amino acid, 1 ≤ i ≤ |X|}
For every k-mer in X, x_j … x_{j+k-1}, define the σ-neighborhood
M_{k,σ}(Q[x_j … x_{j+k-1}]) = {b_1 … b_k | -Σ_{i=0…k-1} log p_{j+i}(b_i) < σ}
Define K(X, Y): for each word b_1 … b_k contained in the neighborhoods of m profile k-mers of X and n profile k-mers of Y, add m·n
In practice each k-mer can have ≤ 2 mismatches, and K(X, Y) can be computed quickly, in O(k(|X| + |Y|))
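A naive Python sketch of the σ-neighborhood test and the resulting kernel, assuming the per-position distributions p_i(·) from the PSI-BLAST profile are given as a list of dicts and the candidate words are enumerated explicitly (the paper instead walks a mismatch tree to reach the stated running time):

```python
import math

def in_neighborhood(profile_kmer, word, sigma):
    """profile_kmer: k dicts p_i(aa) taken from Q(X); word: a k-letter string.
    sigma-neighborhood test: -sum_i log p_i(word[i]) < sigma."""
    score = -sum(math.log(max(p.get(aa, 0.0), 1e-10))
                 for p, aa in zip(profile_kmer, word))
    return score < sigma

def profile_kmer_kernel(profile_x, profile_y, words, k, sigma):
    """Naive K(X, Y): for each candidate word, multiply the number of profile
    k-mers of X and of Y whose sigma-neighborhood contains it, and sum."""
    total = 0
    for w in words:
        m = sum(in_neighborhood(profile_x[j:j + k], w, sigma)
                for j in range(len(profile_x) - k + 1))
        n = sum(in_neighborhood(profile_y[j:j + k], w, sigma)
                for j in range(len(profile_y) - k + 1))
        total += m * n
    return total
```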
(semi) 1. Discriminative motifs
According to this kernel K(X, Y), sequence X is mapped to Φ_{k,σ}(X), a vector in 20^k dimensions:
Φ_{k,σ}(X)(b_1 … b_k) = # k-mers in Q(X) whose neighborhood includes b_1 … b_k
Then the SVM learns a discriminating "hyperplane" with normal vector v:
v = Σ_{i=1…N} (±) λ_i Φ_{k,σ}(X^(i))
Consider a profile k-mer Q[x_j … x_{j+k-1}]; its contribution to v is ~ ⟨Φ_{k,σ}(Q[x_j … x_{j+k-1}]), v⟩
Consider a position i in X: count up the contributions of all words containing x_i:
g(x_i) = Σ_{j=1…k} max{0, ⟨Φ_{k,σ}(Q[x_{i-k+j} … x_{i-1+j}]), v⟩}
Sort these contributions across all positions of all sequences to pick important positions, or discriminative motifs
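A Python sketch of the per-position score g(x_i), assuming the normal vector v is given as a dict from words to weights and that a neighborhood membership test (such as in_neighborhood above) is supplied; all parameter names here are illustrative:

```python
def position_contributions(profile, v, words, k, neighborhood_contains):
    """g(x_i): sum, over the k-mer windows that contain position i, of the
    positive part of <Phi(Q[x_j..x_{j+k-1}]), v>.
    profile: per-position distributions from Q(X); v: dict word -> weight;
    neighborhood_contains(profile_kmer, word): sigma-neighborhood test."""
    n = len(profile)
    window_score = [
        sum(v.get(w, 0.0) for w in words
            if neighborhood_contains(profile[j:j + k], w))
        for j in range(n - k + 1)
    ]
    g = [0.0] * n
    for i in range(n):
        # windows j with j <= i <= j + k - 1 (clipped to the sequence ends)
        for j in range(max(0, i - k + 1), min(i, n - k) + 1):
            g[i] += max(0.0, window_score[j])
    return g
```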
(semi) 2. Cluster Kernels
Two (more!) methods
1. Neighborhood
  1. For each X, run PSI-BLAST to get the set of similar sequences Nbd(X)
  2. Define Φ_nbd(X) = (1/|Nbd(X)|) Σ_{X' ∈ Nbd(X)} Φ_original(X')
     "Counts of all k-mers matching, with at most 1 difference, all sequences that are similar to X"
  3. K_nbd(X, Y) = (1/(|Nbd(X)| · |Nbd(Y)|)) Σ_{X' ∈ Nbd(X)} Σ_{Y' ∈ Nbd(Y)} K(X', Y')
2. Bagged mismatch
  1. Run k-means clustering n times, giving p = 1, …, n assignments c_p(X)
  2. For every X and Y, count the fraction of times they are bagged together:
     K_bag(X, Y) = (1/n) Σ_p 1(c_p(X) = c_p(Y))
  3. Combine the "bag fraction" with the original comparison K(·,·):
     K_new(X, Y) = K_bag(X, Y) · K(X, Y)
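A short sketch of the bagged-mismatch variant in Python, assuming the base mismatch-kernel matrix and a fixed-length feature vector per sequence (e.g. k-mer counts) are already available; scikit-learn's KMeans stands in here for whatever clustering procedure was actually used:

```python
import numpy as np
from sklearn.cluster import KMeans

def bagged_kernel(base_kernel, features, n_runs=10, n_clusters=5, seed=0):
    """K_new = K_bag * K, where K_bag(X, Y) is the fraction of the n_runs
    k-means clusterings that assign X and Y to the same cluster."""
    m = len(features)
    bag = np.zeros((m, m))
    for p in range(n_runs):
        labels = KMeans(n_clusters=n_clusters, n_init=1,
                        random_state=seed + p).fit_predict(features)
        bag += (labels[:, None] == labels[None, :])
    bag /= n_runs
    return bag * base_kernel
```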
Some Benchmarks
Google-like homology search
The internet and the network of protein homologies have something in common: both are scale-free networks
Given a query X, Google ranks webpages by a flow algorithm:
From each webpage W, linked neighbors receive flow
At time t+1, W sends to its neighbors the flow it received at time t
This is a finite, ergodic, aperiodic Markov chain
Its stationary distribution can be found efficiently, as the left eigenvector with eigenvalue 1:
Start with an arbitrary probability distribution and repeatedly multiply by the transition matrix
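A compact illustration of that last point, assuming the transition matrix P has already been built from the link structure (no damping/teleportation term is included):

```python
import numpy as np

def stationary_distribution(P, n_iter=200):
    """Power iteration for the stationary distribution of a finite, ergodic,
    aperiodic Markov chain. P[i, j] is the probability of moving from page i
    to page j (each row sums to 1); the result pi satisfies pi = pi @ P."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)        # start from an arbitrary distribution
    for _ in range(n_iter):
        pi = pi @ P                 # repeatedly multiply by the transition matrix
    return pi / pi.sum()
```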
Google-like homology search
Weston, Elisseeff, Zhu, Leslie, Noble, PNAS 2004
RANKPROP algorithm for protein homology
First, compute a matrix K_ij of PSI-BLAST homology between proteins i and j, normalized so that Σ_j K_ji = 1
1. Initialization: y_1(0) = 1; y_i(0) = 0 for i ≠ 1
2. For t = 0, 1, …
3.   For i = 2 to m
4.     y_i(t+1) = K_1i + α Σ_j K_ji y_j(t)
In the end, let y_i be the ranking score for the similarity of sequence i to sequence 1 (α = 0.95 works well)
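A minimal Python sketch of this propagation, assuming the normalized PSI-BLAST similarity matrix K is already computed; the query is indexed as sequence 0 here (sequence 1 on the slide), and the loop runs for a fixed number of iterations rather than testing convergence:

```python
import numpy as np

def rankprop(K, alpha=0.95, n_iter=50):
    """K[j, i]: normalized PSI-BLAST similarity sent from protein j to protein i
    (so each column of K sums to 1). Returns ranking scores y_i for the
    similarity of every sequence to the query, which is sequence 0 here."""
    m = K.shape[0]
    y = np.zeros(m)
    y[0] = 1.0
    for _ in range(n_iter):
        # y_i(t+1) = K_{0i} + alpha * sum_j K_{ji} y_j(t)
        y_new = K[0, :] + alpha * (K.T @ y)
        y_new[0] = 1.0              # keep the query's own score fixed
        y = y_new
    return y
```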
Google-like homology search For a given protein family, what fraction of true members of the family are ranked higher than the first 50 non-members?
Protein Structure Prediction
Protein Structure Determination
Experimental: X-ray crystallography, NMR spectroscopy
Computational: structure prediction (the Holy Grail)
Sequence implies structure; therefore, in principle, we can predict the structure from the sequence alone
Protein Structure Prediction
Ab initio: use just first principles (energy, geometry, and kinematics)
Homology: find the best match to a database of sequences with known 3D structure
Threading
Meta-servers and other methods
Ab initio Prediction
Sampling the global conformation space
  Lattice models / discrete-state models
  Molecular dynamics
Picking native conformations with an energy function
  Solvation model: how the protein interacts with water
  Pair interactions between amino acids
Predicting secondary structure
  Local homology
  Fragment libraries
Lattice String Folding
HP model: the main modeled force is hydrophobic attraction
Folding is NP-hard on both the 2-D square and the 3-D cubic lattice
Constant-factor approximation algorithms exist
Not so relevant biologically
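For concreteness, a small Python sketch of the HP-model energy on the 2-D square lattice: every pair of H residues that sit on adjacent lattice sites but are not adjacent along the chain contributes -1 (the helper name and representation are just for illustration):

```python
def hp_energy(sequence, path):
    """sequence: string over {'H', 'P'}; path: list of (x, y) lattice
    coordinates, one per residue, forming a self-avoiding walk."""
    coords = {pos: i for i, pos in enumerate(path)}
    assert len(coords) == len(path), "conformation must be self-avoiding"
    energy = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != 'H':
            continue
        for nbr in ((x + 1, y), (x, y + 1)):   # each lattice contact checked once
            j = coords.get(nbr)
            if j is not None and sequence[j] == 'H' and abs(i - j) > 1:
                energy -= 1
    return energy

# Example: four H residues folded into a 2x2 square have one non-bonded
# H-H contact, so the energy is -1.
print(hp_energy("HHHH", [(0, 0), (1, 0), (1, 1), (0, 1)]))
```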
Lattice String Folding