Combining HMMs with SVMs

CISC 841 Bioinformatics Combining HMMs with SVMs Li Liao, CISC841, F07

HMM gradients
Fisher score: $U_X = \nabla_\theta \log P(X \mid H, \theta)$
The gradient of the log-likelihood of a sequence X with respect to the parameters of a given model is computed using the forward-backward algorithm. Each dimension corresponds to one parameter of the model. The feature space is tailored to the sequences from which the model was trained.
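As a rough illustration of how forward-backward yields such gradients, the sketch below computes the derivative of log P(X) with respect to each emission probability of a toy two-state HMM. All model numbers are made up, the emission probabilities are treated as free parameters (no normalization correction is applied), and no scaling is used, so it only suits short sequences.

```python
import math

# Toy 2-state HMM; every number here is a made-up illustration value.
STATES = (0, 1)
START = (0.6, 0.4)
TRANS = ((0.7, 0.3), (0.4, 0.6))
EMIT = ({"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8})

def forward(x, emit=EMIT):
    """f[t][j] = P(x_1..x_t, state_t = j)."""
    f = [[START[j] * emit[j][x[0]] for j in STATES]]
    for sym in x[1:]:
        prev = f[-1]
        f.append([emit[j][sym] * sum(prev[i] * TRANS[i][j] for i in STATES)
                  for j in STATES])
    return f

def backward(x, emit=EMIT):
    """b[t][i] = P(x_{t+1}..x_T | state_t = i)."""
    b = [[1.0 for _ in STATES]]
    for sym in reversed(x[1:]):
        nxt = b[0]
        b.insert(0, [sum(TRANS[i][j] * emit[j][sym] * nxt[j] for j in STATES)
                     for i in STATES])
    return b

def likelihood(x, emit=EMIT):
    return sum(forward(x, emit)[-1])

def log_likelihood(x, emit=EMIT):
    return math.log(likelihood(x, emit))

def emission_gradient(x):
    """d log P(x) / d e_j(sym) = (expected emissions of sym in state j) / e_j(sym)."""
    f, b = forward(x), backward(x)
    px = sum(f[-1])
    grad = {}
    for j in STATES:
        for sym in EMIT[j]:
            # Posterior-weighted count of positions where state j emits sym.
            expected = sum(f[t][j] * b[t][j]
                           for t in range(len(x)) if x[t] == sym) / px
            grad[(j, sym)] = expected / EMIT[j][sym]
    return grad

fisher_like_vector = emission_gradient("abba")
```

Each key of `fisher_like_vector` is one (state, symbol) parameter, so the vector has one dimension per emission parameter of the model.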

SVM-Fisher discrimination
A probabilistic hidden Markov model $\theta$ is trained from some example sequences $x_1, x_2, x_3, \ldots, x_N$.
Usually the model probability $P(x_i|\theta)$ (or a function of $P(x_i|\theta)$) is used as a measure of sequence-model membership, and a threshold on this measure decides membership.
The Fisher vector is the vector of gradients of $P(x_i|\theta)$ (or of a function of $P(x_i|\theta)$, typically $\log P(x_i|\theta)$) with respect to the parameters of the model:
$$U_{x_i} = \nabla_\theta P(x_i|\theta)$$
One can take the training example sequences (positive set) and other sequences that are known to be non-members (negative set), and transform them into Fisher vectors.
A Support Vector Machine (SVM) can then be trained on the positive and negative Fisher vectors and used to classify other sequences.

Application: Protein remote homology detection

SVM-Pairwise method
[Figure: SVM-pairwise workflow. Step 1: protein homologs (positive train) and protein non-homologs (negative train) are converted into positive and negative pairwise score vectors. Step 2: a support vector machine is trained on these vectors. Step 3: the testing data, a target protein of unknown function, receives a binary classification.]

Experiment: known protein families
Jaakkola, Diekhans and Haussler 1999

Sample family sizes
Family ID   Positive train   Positive test   Negative train   Negative test
…           12               6               2890             1444
…           10               8               2408             1926
…           29               7               3477             839
…           26               23              2256             1994
Remaining entries (some cells missing): 113 3895 275 17 2686 1579 46 3732 567 11 140 307 3894

A measure of sensitivity and specificity
ROC: the receiver operating characteristic score is the normalized area under a curve that plots true positives as a function of false positives.
[Figure: example curves with ROC = 1, ROC = 0.67, and ROC = 0]
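A minimal sketch of this score, assuming binary labels (1 for positive, 0 for negative) and no tied scores: the examples are ranked by decreasing score, the curve steps up on each positive and right on each negative, and the accumulated area is divided by its maximum possible value. The function name `roc_score` is chosen here for illustration.

```python
def roc_score(scores, labels):
    """Normalized area under the true-positive vs. false-positive curve."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    true_pos = 0
    area = 0
    for _, label in ranked:
        if label == 1:
            true_pos += 1     # curve steps up by one true positive
        else:
            area += true_pos  # curve steps right; add the current column height
    return area / (n_pos * n_neg)

perfect = roc_score([4, 3, 2, 1], [1, 1, 0, 0])   # all positives ranked first
partial = roc_score([4, 3, 2, 1], [1, 1, 0, 1])   # one positive ranked below a negative
worst = roc_score([4, 3, 2, 1], [0, 0, 1, 1])     # all negatives ranked first
```

The three toy rankings correspond to the three cases in the figure: a perfect ranking, a ranking with two of three positives above the negative, and a fully inverted ranking.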

Application: Discriminating signal peptide from transmembrane proteins

Feature selection
We expect gradients w.r.t. transition parameters to be better discrimination features. We look for those transitions that are used differentially by TM proteins and SP proteins:
- transform each signal peptide sequence (1275) into a Fisher vector w.r.t. the transition parameters and find the resultant vector
- transform each TM sequence into a Fisher vector w.r.t. the transition parameters and find the resultant vector
- compare the two resultant vectors
[Figure panels: SignalP, TM protein]
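The comparison of resultant vectors could look roughly like the sketch below. The Fisher vectors and transition names are made-up toy values, and dividing each resultant by its set size (so that sets of different sizes are comparable) is an assumption made here, not something stated on the slide.

```python
def resultant(vectors):
    """Component-wise sum of equal-length vectors."""
    return [sum(component) for component in zip(*vectors)]

def rank_transitions(sp_vectors, tm_vectors, names):
    """Rank transitions by how differently the two classes use them.

    Assumption: each resultant is divided by its set size before comparing.
    """
    sp_mean = [v / len(sp_vectors) for v in resultant(sp_vectors)]
    tm_mean = [v / len(tm_vectors) for v in resultant(tm_vectors)]
    diffs = [(name, s - t) for name, s, t in zip(names, sp_mean, tm_mean)]
    # Largest absolute difference first: most discriminative transition.
    return sorted(diffs, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical transition names and toy Fisher vectors.
names = ["t1", "t2", "t3"]
sp_vectors = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
tm_vectors = [[0.0, 0.5, 2.0], [0.0, 1.5, 2.0]]
ranking = rank_transitions(sp_vectors, tm_vectors, names)
```

Transitions near the top of `ranking` are candidates for the discriminative feature subset; a transition used equally by both classes ends up last.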

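Gradients of P(s|x)
In pattern recognition problems, we are interested in $P(s|x,\theta)$ rather than $P(x|\theta)$:
$$U_{s|x} = \nabla_\theta \log P(s|x,\theta) = \nabla_\theta \log P(s,x|\theta) - \nabla_\theta \log P(x|\theta)$$

First term:
$$P(s,x) = a_{B s_1} e_{s_1}(x_1)\, a_{s_1 s_2} e_{s_2}(x_2)\, a_{s_2 s_3} e_{s_3}(x_3) \cdots = \prod_i \left(\frac{\theta_i}{\sum_a \theta_a}\right)^{n_i(s,x)}$$
where $n_i(s,x)$ is the number of times $\theta_i$ is used, and $\sum_a \theta_a = 1$. Then
$$\frac{\partial P(s,x)}{\partial \theta_k} = \frac{1-\theta_k}{\theta_k}\, n_k(s,x)\, P(s,x)$$
so that
$$\frac{\partial \log P(s,x)}{\partial \theta_k} = \frac{1-\theta_k}{\theta_k}\, n_k(s,x) = \frac{m_k(x)}{\theta_k} - m_k(x)$$
where $m_k(x) = n_k(s,x)$ is the number of times $\theta_k$ is used in x following the given path s.

Second term:
$$P(x) = \sum_\pi P(x,\pi), \qquad P(x,\pi) = a_{0\pi_1} e_{\pi_1}(x_1)\, a_{\pi_1\pi_2} e_{\pi_2}(x_2)\, a_{\pi_2\pi_3} e_{\pi_3}(x_3) \cdots = \prod_i \left(\frac{\theta_i}{\sum_a \theta_a}\right)^{n_i(\pi,x)}$$
where $n_i(\pi,x)$ is the number of times $\theta_i$ is used along path $\pi$, and $\sum_a \theta_a = 1$. We have
$$\frac{\partial \log P(x)}{\partial \theta_k} = \frac{1}{P(x)} \sum_\pi \frac{\partial P(x,\pi)}{\partial \theta_k}$$
But
$$\frac{\partial P(x,\pi)}{\partial \theta_k} = \frac{1-\theta_k}{\theta_k}\, n_k(\pi,x)\, P(x,\pi)$$
Thus
$$\frac{\partial \log P(x)}{\partial \theta_k} = \sum_\pi \frac{1-\theta_k}{\theta_k}\, n_k(\pi,x)\, \frac{P(x,\pi)}{P(x)} = \sum_\pi \frac{1-\theta_k}{\theta_k}\, n_k(\pi,x)\, P(\pi|x) = \frac{n_k(x)}{\theta_k} - n_k(x)$$
where $n_k(x) = \sum_\pi n_k(\pi,x)\, P(\pi|x)$ is the expected number of times $\theta_k$ is used in x following any path.

Finally:
$$U_{s|x} = \frac{m_k(x)}{\theta_k} - m_k(x) - \frac{n_k(x)}{\theta_k} + n_k(x)$$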
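To make the final expression concrete, here is a small sketch that evaluates it for one transition parameter of a toy two-state HMM. The model numbers, state names, and symbols are invented for illustration, and the expected count n_k(x) is obtained by brute-force enumeration of all paths rather than by forward-backward, which is only feasible for very short sequences.

```python
from itertools import product

# Toy 2-state HMM; all probabilities are made-up illustration values.
STATES = ("A", "B")
START = {"A": 0.5, "B": 0.5}
TRANS = {("A", "A"): 0.75, ("A", "B"): 0.25,
         ("B", "A"): 0.5, ("B", "B"): 0.5}
EMIT = {("A", "h"): 0.5, ("A", "p"): 0.5,
        ("B", "h"): 0.25, ("B", "p"): 0.75}

def joint(path, x):
    """P(x, path) for one state path."""
    p = START[path[0]] * EMIT[(path[0], x[0])]
    for i in range(1, len(x)):
        p *= TRANS[(path[i - 1], path[i])] * EMIT[(path[i], x[i])]
    return p

def count(path, k):
    """Number of times transition k = (from, to) is used along the path."""
    return sum(1 for step in zip(path, path[1:]) if step == k)

def expected_count(x, k):
    """n_k(x): posterior-weighted count of transition k over all paths."""
    paths = list(product(STATES, repeat=len(x)))
    px = sum(joint(p, x) for p in paths)
    return sum(count(p, k) * joint(p, x) for p in paths) / px

def u_s_given_x(s, x, k):
    """m_k/theta_k - m_k - n_k/theta_k + n_k for transition k and path s."""
    theta = TRANS[k]
    m = count(s, k)
    n = expected_count(x, k)
    return m / theta - m - n / theta + n

x = "hpph"
k = ("A", "B")
score = u_s_given_x(("A", "B", "B", "A"), x, k)
```

A path that uses the transition more often than expected under the model gets a positive component, and one that uses it less often gets a negative component; averaged over the posterior P(s|x), the component is zero.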

Classification experiment
10-fold cross-validation experiment using
- positive set (247 TM proteins)
- negative set (1275 signal-peptide-containing proteins)
The SVM-light package is used.
[Figure: each sequence x is converted to a vector $U_{s|x}$ using TMMOD; SVM Learn trains an SVM classifier on subsets of the 247 TM proteins and 1275 SP proteins, and the classifier is applied to a query sequence (?).]
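The data preparation for such an experiment might be sketched as follows. The helper names are hypothetical, splitting each class separately across folds (stratification) is an assumption made here, and the line format follows SVM-light's sparse input convention of a target followed by `index:value` pairs with indices starting at 1.

```python
import random

def svmlight_line(label, vector):
    """Format one example as '<target> <index>:<value> ...' (zeros omitted)."""
    feats = " ".join(f"{i}:{v:g}" for i, v in enumerate(vector, start=1) if v != 0)
    return f"{label:+d} {feats}".rstrip()

def stratified_folds(n_pos, n_neg, k=10, seed=0):
    """Split positive and negative example indices into k disjoint folds.

    Assumption: each class is shuffled and dealt out separately so every
    fold keeps roughly the overall positive/negative ratio.
    """
    rng = random.Random(seed)
    pos = list(range(n_pos))
    neg = list(range(n_neg))
    rng.shuffle(pos)
    rng.shuffle(neg)
    return [{"pos": pos[i::k], "neg": neg[i::k]} for i in range(k)]

# Set sizes from the experiment: 247 TM proteins, 1275 SP proteins.
folds = stratified_folds(247, 1275, k=10)
example = svmlight_line(+1, [0.5, 0.0, -2.0])
```

In each of the ten rounds, one fold would be held out for testing and the other nine written out as training lines, with TM proteins labelled +1 and SP proteins labelled -1.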

Discrimination results
A third (68) more of the SP proteins that were incorrectly classified as TM proteins are identified correctly (185 misclassified by TMMOD versus 117 by TMMOD + SVM-Fisher).

Method                TM proteins incorrectly classified as SP   SP proteins incorrectly classified as TM
Phobius               7.7% (19/247)                              3.5% (45/1275)
SignalP-NN            42.9%                                      2.3%
SignalP-HMM           19.0%                                      1.4%
TMMOD                 6.1% (15/247)                              14.5% (185/1275)
TMMOD + SVM-Fisher                                               9.2% (117/1275)

Application: Protein-Protein Interaction Prediction

Interaction Profile Hidden Markov Model (ipHMM)
Friedrich et al. (2006)

Likelihood Score Vector
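Knowledge transfer:
- Build an ipHMM from proteins whose structural information is available.
- Align the sequences of proteins whose structural information is not available to the model.
Likelihood score vector: $\langle LS_{a_i,A},\ LS_{a_i,B},\ LS_{b_j,A},\ LS_{b_j,B} \rangle$
Fisher score vector: $U(x) = \nabla_\theta \log P(x|\theta)$, with components
$$U_{ij} = \frac{E_j(i)}{e_j(i)} - \sum_k E_j(k)$$
where $e_j(i)$ is the emission probability of symbol $i$ in state $j$ and $E_j(i)$ is the expected number of times symbol $i$ is emitted from state $j$.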

Data set
Friedrich et al. (2006): 2018 proteins in 36 domain families

Scheme        Mean ROC score
FS_NM         0.7487
LS            0.7997
FS_IM         0.8202
FS_IM + LS    0.8626

Conclusions
- Structural information at binding sites enhances protein-protein interaction prediction.
- The interaction profile HMM can transfer structural information.
- Fisher scores extracted from domain profiles further enhance protein-protein interaction prediction for proteins with no available structural information.
