1
Combining HMMs with SVMs
CISC 841 Bioinformatics Combining HMMs with SVMs Li Liao, CISC841, F07
2
HMM gradients
Fisher score: $U_X = \nabla_\theta \log P(X \mid H, \theta)$
The gradient of a sequence X with respect to a given model is computed using the forward-backward algorithm. Each dimension corresponds to one parameter of the model. The feature space is tailored to the sequences from which the model was trained.
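As a minimal sketch of how such a gradient could be computed, the Python below runs forward-backward on a toy HMM and returns one score per emission parameter, using the form E_s(c)/e_s(c) - sum_k E_s(k) (expected emission counts divided by emission probability, minus the expected state occupancy). The two-state model, its numbers, and the function names are assumptions made for illustration; no scaling is applied, so it is only suitable for short sequences.

```python
# Hypothetical two-state toy HMM; all values are illustrative assumptions.
START = [0.5, 0.5]                                  # begin -> state
TRANS = [[0.9, 0.1], [0.2, 0.8]]                    # state -> state
EMIT = [{"A": 0.8, "B": 0.2}, {"A": 0.3, "B": 0.7}]  # state -> symbol


def forward_backward(x, start, trans, emit):
    """Return forward table, backward table and P(x) (unscaled)."""
    n, n_states = len(x), len(start)
    fwd = [[0.0] * n_states for _ in range(n)]
    bwd = [[1.0] * n_states for _ in range(n)]
    for s in range(n_states):
        fwd[0][s] = start[s] * emit[s][x[0]]
    for t in range(1, n):
        for s in range(n_states):
            fwd[t][s] = emit[s][x[t]] * sum(
                fwd[t - 1][r] * trans[r][s] for r in range(n_states))
    for t in range(n - 2, -1, -1):
        for s in range(n_states):
            bwd[t][s] = sum(
                trans[s][r] * emit[r][x[t + 1]] * bwd[t + 1][r]
                for r in range(n_states))
    return fwd, bwd, sum(fwd[n - 1])


def emission_fisher_score(x, start, trans, emit):
    """One gradient component per emission parameter (state, symbol)."""
    fwd, bwd, px = forward_backward(x, start, trans, emit)
    n_states = len(start)
    # expected[s][c]: expected number of times state s emits symbol c in x
    expected = [{c: 0.0 for c in emit[s]} for s in range(n_states)]
    for t, c in enumerate(x):
        for s in range(n_states):
            expected[s][c] += fwd[t][s] * bwd[t][s] / px
    score = {}
    for s in range(n_states):
        occupancy = sum(expected[s].values())
        for c in emit[s]:
            score[(s, c)] = expected[s][c] / emit[s][c] - occupancy
    return score


fisher_vector = emission_fisher_score("ABBAB", START, TRANS, EMIT)
```

The resulting dictionary has one entry per emission parameter, which matches the statement that each dimension of the feature space corresponds to one model parameter.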
3
SVM-Fisher discrimination
A probabilistic hidden Markov model is trained from some example sequences x1, x2, x3, …, xN.
Usually the probability model $P(x_i \mid \theta)$ (or a function of $P(x_i \mid \theta)$) is used as a measure of sequence-model membership, and a threshold on this measure decides membership.
The Fisher vector is the vector of gradients of $P(x_i \mid \theta)$ (or of a function of it, such as $\log P(x_i \mid \theta)$) with respect to the parameters of the model: $U_{x_i} = \nabla_\theta P(x_i \mid \theta)$
One can take the training example sequences (positive set) and other sequences that are known to be non-members (negative set), and transform them into Fisher vectors.
A support vector machine (SVM) can be trained on the positive and negative Fisher vectors, and then used to classify other sequences.
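To illustrate the last step, here is a small sketch that trains a linear SVM on labelled vectors by batch subgradient descent on the regularised hinge loss. This trainer is a stand-in chosen for brevity, not the package used in the slides; it has no bias term and no kernel, and the two-dimensional "Fisher vectors" are made-up toy data.

```python
def train_linear_svm(vectors, labels, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on the L2-regularised hinge loss (no bias)."""
    dim = len(vectors[0])
    n = len(vectors)
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [lam * wi for wi in w]          # gradient of the regulariser
        for x, y in zip(vectors, labels):
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin < 1.0:                   # point violates the margin
                for d in range(dim):
                    grad[d] -= y * x[d] / n
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def classify(w, x):
    """Return +1 (member) or -1 (non-member) for vector x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


# Toy stand-ins for positive and negative Fisher vectors (assumed data).
positive = [[2.0, 2.0], [3.0, 1.0], [2.0, 3.0]]
negative = [[-2.0, -2.0], [-1.0, -3.0], [-3.0, -1.0]]
weights = train_linear_svm(positive + negative, [1] * 3 + [-1] * 3)
```

A new sequence would first be converted to its Fisher vector with the same HMM and then passed to `classify`.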
4
Application: Protein remote homology detection
5
SVM-Pairwise method
[Figure: SVM-Pairwise pipeline. (1) Protein homologs (positive train) and protein non-homologs (negative train) are converted into positive and negative pairwise score vectors. (2) A support vector machine is trained on these vectors. (3) The testing data, a target protein of unknown function, receives a binary classification.]
6
Experiment: known protein families
Jaakkola, Diekhans and Haussler 1999
7
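Sample family sizes
Columns: Family ID | Positive train | Positive test | Negative train | Negative test (the Family ID values are missing)
Positive train | Positive test | Negative train | Negative test
12 | 6 | 2890 | 1444
10 | 8 | 2408 | 1926
29 | 7 | 3477 | 839
26 | 23 | 2256 | 1994
Remaining values (row boundaries not recoverable): 113 3895 275 17 2686 1579 46 3732 567 11 140 307 3894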
8
A measure of sensitivity and specificity
[Figure: three example rankings with ROC = 1, ROC = 0.67 and ROC = 0]
ROC: the receiver operating characteristic score is the normalized area under a curve that plots true positives as a function of false positives.
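The ROC score described here can be sketched in a few lines of Python: sort the test sequences by classifier score, walk down the ranking, and add up the true positives seen so far each time a false positive appears. The function name and the toy inputs are assumptions, and ties between scores are not handled specially.

```python
def roc_score(scores, labels):
    """Normalized area under the true-positive vs false-positive curve.

    scores: classifier outputs, higher means more likely positive.
    labels: 1 for positives, 0 for negatives. Assumes no tied scores.
    """
    ranked = [lab for _, lab in
              sorted(zip(scores, labels), key=lambda pair: -pair[0])]
    n_pos = sum(ranked)
    n_neg = len(ranked) - n_pos
    true_pos = 0
    area = 0
    for lab in ranked:
        if lab == 1:
            true_pos += 1          # curve moves up
        else:
            area += true_pos       # curve moves right: add a column
    return area / (n_pos * n_neg)


perfect = roc_score([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
middling = roc_score([0.9, 0.8, 0.2, 0.1], [0, 1, 0, 0])
worst = roc_score([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1])
```

With these toy rankings the three calls correspond to a perfect ranking, a ranking with one negative above the only positive, and a fully inverted ranking.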
9
Application: Discriminating signal peptide from transmembrane proteins
10
Feature selection
We expect gradients w.r.t. transition parameters to be better discrimination features. We look for those transitions that are used differentially by TM proteins and SP proteins:
- transform each signal peptide sequence (1275) into a Fisher vector w.r.t. transition parameters and find the resultant vector
- transform each TM sequence into a Fisher vector w.r.t. transition parameters and find the resultant vector
- compare the two resultant vectors
[Figure labels: SignalP, TM protein]
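One way the three steps above might look in code is sketched below. The slides do not say how the two resultant vectors are compared; dividing each resultant by its class size and ranking transitions by absolute difference is an assumption made here, as are the function names and the toy data.

```python
def resultant(vectors):
    """Component-wise sum of a list of equal-length Fisher vectors."""
    return [sum(column) for column in zip(*vectors)]


def rank_transitions(tm_vectors, sp_vectors, names):
    """Rank transition parameters by how differently the classes use them.

    Assumption: resultants are divided by class size so that the unequal
    set sizes do not dominate, then compared by absolute difference.
    """
    tm_mean = [v / len(tm_vectors) for v in resultant(tm_vectors)]
    sp_mean = [v / len(sp_vectors) for v in resultant(sp_vectors)]
    gaps = [(abs(t - s), name)
            for t, s, name in zip(tm_mean, sp_mean, names)]
    return [name for _, name in sorted(gaps, reverse=True)]


# Toy Fisher vectors over three hypothetical transitions t0, t1, t2.
tm_toy = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
sp_toy = [[0.0, 0.1, 2.0]]
ranking = rank_transitions(tm_toy, sp_toy, ["t0", "t1", "t2"])
```

The transitions at the top of the ranking would be the candidates to keep as SVM features.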
11
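Gradients of P(s|x)
In pattern recognition problems, we are interested in $P(s \mid x, \theta)$ rather than $P(x \mid \theta)$.
$$U_{s|x} = \nabla_\theta \log P(s \mid x, \theta) = \nabla_\theta \log P(s, x \mid \theta) - \nabla_\theta \log P(x \mid \theta)$$
First term:
$$P(s, x) = a_{B s_1} e_{s_1}(x_1) \cdot a_{s_1 s_2} e_{s_2}(x_2) \cdot a_{s_2 s_3} e_{s_3}(x_3) \cdots = \prod_i \left( \frac{\theta_i}{\sum_a \theta_a} \right)^{n_i(s,x)}$$
where $n_i(s,x)$ is the number of times $\theta_i$ is used, and $\sum_a \theta_a = 1$.
$$\frac{\partial P(x, s)}{\partial \theta_k} = \frac{(1 - \theta_k)\, n_k(s,x)}{\theta_k}\, P(s, x)$$
Dividing by $P(s, x)$ gives
$$\frac{\partial \log P(s, x)}{\partial \theta_k} = \frac{m_k(x)}{\theta_k} - m_k(x)$$
where $m_k(x) = n_k(s,x)$ is the number of times $\theta_k$ is used in x following the given path s.
Second term:
$$P(x) = \sum_\pi P(x, \pi), \qquad P(x, \pi) = a_{0 \pi_1} e_{\pi_1}(x_1) \cdot a_{\pi_1 \pi_2} e_{\pi_2}(x_2) \cdot a_{\pi_2 \pi_3} e_{\pi_3}(x_3) \cdots = \prod_i \left( \frac{\theta_i}{\sum_a \theta_a} \right)^{n_i(\pi,x)}$$
where $n_i(\pi,x)$ is the number of times $\theta_i$ is used, and $\sum_a \theta_a = 1$.
$$\frac{\partial \log P(x)}{\partial \theta_k} = \frac{1}{P(x)} \sum_\pi \frac{\partial P(x, \pi)}{\partial \theta_k}$$
But
$$\frac{\partial P(x, \pi)}{\partial \theta_k} = \frac{(1 - \theta_k)\, n_k(\pi,x)}{\theta_k}\, P(x, \pi)$$
Thus
$$\frac{\partial \log P(x)}{\partial \theta_k} = \sum_\pi \frac{(1 - \theta_k)\, n_k(\pi,x)}{\theta_k} \frac{P(x, \pi)}{P(x)} = \sum_\pi \frac{(1 - \theta_k)\, n_k(\pi,x)}{\theta_k}\, P(\pi \mid x) = \frac{n_k(x)}{\theta_k} - n_k(x)$$
where $n_k(x) = \sum_\pi n_k(\pi,x)\, P(\pi \mid x)$ is the expected number of times $\theta_k$ is used in x following any path.
Finally, the component of $U_{s|x}$ for parameter $\theta_k$ is
$$\frac{m_k(x)}{\theta_k} - m_k(x) - \frac{n_k(x)}{\theta_k} + n_k(x)$$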
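The final expression can be checked on a toy model small enough to enumerate every path. The sketch below computes the path counts m_k, the expected counts n_k by brute-force enumeration, and the resulting score for each transition parameter. The two-state HMM and its numbers are assumptions; a real implementation would obtain n_k from forward-backward rather than enumeration.

```python
from itertools import product

# Hypothetical two-state HMM; all numbers are illustrative assumptions.
START = [0.6, 0.4]                                  # a_{B,s}
TRANS = [[0.7, 0.3], [0.2, 0.8]]                    # a_{s,t}, the theta_k here
EMIT = [{"A": 0.9, "B": 0.1}, {"A": 0.3, "B": 0.7}]  # e_s(symbol)
KEYS = [(s, t) for s in range(2) for t in range(2)]


def joint(x, path):
    """P(x, path) as a product of start, transition and emission terms."""
    p = START[path[0]] * EMIT[path[0]][x[0]]
    for t in range(1, len(x)):
        p *= TRANS[path[t - 1]][path[t]] * EMIT[path[t]][x[t]]
    return p


def transition_counts(path):
    """Number of times each transition parameter is used along one path."""
    counts = {k: 0 for k in KEYS}
    for s, t in zip(path, path[1:]):
        counts[(s, t)] += 1
    return counts


def expected_counts(x):
    """n_k(x): counts averaged over all paths, weighted by P(path | x)."""
    paths = list(product(range(2), repeat=len(x)))
    px = sum(joint(x, p) for p in paths)
    expected = {k: 0.0 for k in KEYS}
    for p in paths:
        weight = joint(x, p) / px
        for k, c in transition_counts(p).items():
            expected[k] += weight * c
    return expected


def path_score(x, path):
    """m_k/theta_k - m_k - n_k/theta_k + n_k for every transition k."""
    m = transition_counts(path)
    n = expected_counts(x)
    return {k: (m[k] - n[k]) / TRANS[k[0]][k[1]] - (m[k] - n[k])
            for k in KEYS}


example = path_score("ABBA", (0, 1, 1, 0))
```

Because n_k is the posterior average of m_k, averaging these scores over all paths with weights P(path | x) gives zero, which is a convenient sanity check.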
12
Classification experiment
10-fold cross-validation experiment using:
- positive set (247 TM proteins)
- negative set (1275 signal-peptide-containing proteins)
The SVM-light package is used.
[Figure: classification pipeline. Subsets of the 247 TM proteins and 1275 SP proteins are passed through TMMOD (sequence to vector, $x \to U_{s|x}$), then to SVM Learn and the SVM Classifier, which labels a query sequence (?).]
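A 10-fold split of the 247 + 1275 sequences could be generated as in the sketch below. The slides do not describe how the folds were drawn; assigning items to folds round-robin by index is an assumption, chosen because it spreads both classes across all folds when positives are listed before negatives.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k-fold cross-validation.

    Items are assigned to folds round-robin by index (an assumption).
    """
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds)
                 if f != held_out for i in fold]
        yield train, test


N_TM, N_SP = 247, 1275          # set sizes from the experiment
splits = list(k_fold_indices(N_TM + N_SP, k=10))
```

Each of the ten iterations trains on nine folds and tests on the held-out fold, so every sequence is tested exactly once.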
13
Discrimination results
About a third (68 of 185) of the SP proteins that TMMOD incorrectly classified as TM proteins are identified correctly by TMMOD + SVM-Fisher.
Method | TM proteins incorrectly classified as SP proteins | SP proteins incorrectly classified as TM proteins
Phobius | 7.7% (19/247) | 3.5% (45/1275)
SignalP-NN | 42.9% | 2.3%
SignalP-HMM | 19.0% | 1.4%
TMMOD | 6.1% (15/247) | 14.5% (185/1275)
TMMOD + SVM-Fisher | | 9.2% (117/1275)
14
Application: Protein-Protein Interaction Prediction
15
Interaction Profile Hidden Markov Model (ipHMM)
Friedrich et al. (2006)
16
Likelihood Score Vector
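Knowledge transfer:
- Build the ipHMM from proteins whose structural information is available.
- Align the sequences of proteins whose structural information is not available to the model.
Likelihood score vector: $\langle LS_{a_i,A},\ LS_{a_i,B},\ LS_{b_j,A},\ LS_{b_j,B} \rangle$
Fisher score vector: $U(x) = \nabla_\theta \log P(x \mid \theta)$, with components
$$U_{ij} = \frac{E_j(i)}{e_j(i)} - \sum_k E_j(k)$$
where $e_j(i)$ is the probability of emitting symbol i in state j and $E_j(i)$ is the expected number of times symbol i is emitted in state j.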
19
Data set
Friedrich et al. (2006): 2018 proteins in 36 domain families
Scheme | Mean ROC score
FS_NM | 0.7487
LS | 0.7997
FS_IM | 0.8202
FS_IM + LS | 0.8626
20
Conclusions
Structural information at binding sites enhances protein-protein interaction prediction.
The interaction profile HMM can transfer structural information.
Fisher scores extracted from domain profiles further enhance protein-protein interaction prediction for proteins with no available structural information.