Using a Mixture of Probabilistic Decision Trees for Direct Prediction of Protein Functions
Paper by Umar Syed and Golan Yona, Department of Computer Science, Cornell University
Presentation by Andrejus Parfionovas, Department of Mathematics & Statistics, USU
2. Classical methods to predict the structure of a new protein:
- Sequence comparison to known proteins in search of similarities
  - Sequences often diverge and become unrecognizable
- Structure comparison to known structures in the PDB database
  - Structural data is sparse and not available for newly sequenced genes
3. What other features can be used to improve prediction?
- Domain content
- Subcellular location
- Tissue specificity
- Species type
- Pairwise interaction
- Enzyme cofactors
- Catalytic activity
- Expression profiles, etc.
4. With so many features, it is important:
- To extract the relevant information
  - Directly from the sequence
  - From the predicted secondary structure
  - From features extracted from databases
- To combine the data in a feasible model
  - A mixture model of Probabilistic Decision Trees (PDTs) was used
5. Features extracted directly from the sequence (as percentages):
- Composition of the 20 individual amino acids
- Composition of 16 amino acid groups (+/- charged, polar, aromatic, hydrophobic, acidic, etc.)
- The 20 most informative dipeptides
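A minimal sketch of how such composition features could be computed from a raw amino acid string. The residue groups and the use of all 400 dipeptides below are illustrative assumptions; the paper defines its own 16 groups and keeps only the 20 most informative dipeptides.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative residue groups; the paper's exact 16 groups are not reproduced here.
GROUPS = {
    "positive": set("KRH"),
    "negative": set("DE"),
    "aromatic": set("FWY"),
    "hydrophobic": set("AVLIMFWC"),
}

def sequence_features(seq):
    """Return percentage-based composition features for one protein sequence."""
    seq = seq.upper()
    n = len(seq)
    counts = Counter(seq)

    # Percentage of each individual amino acid (20 features).
    features = {f"aa_{a}": 100.0 * counts[a] / n for a in AMINO_ACIDS}

    # Percentage of each residue group.
    for name, members in GROUPS.items():
        features[f"grp_{name}"] = 100.0 * sum(counts[a] for a in members) / n

    # Dipeptide percentages (all 400 here; the paper keeps only the 20 most informative).
    dipep = Counter(seq[i:i + 2] for i in range(n - 1))
    for a, b in product(AMINO_ACIDS, repeat=2):
        features[f"dp_{a}{b}"] = 100.0 * dipep[a + b] / max(n - 1, 1)

    return features

print(sequence_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")["aa_K"])
```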
6. Features predicted from the sequence:
- Secondary structure predicted by PSIPRED:
  - Coil
  - Helix
  - Strand
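If the PSIPRED prediction is available as the usual per-residue C/H/E string, composition features can be derived from it as below; the slide does not say how the paper aggregates the prediction, so this is just one plausible encoding.

```python
def ss_composition(psipred_string):
    """Fraction of coil (C), helix (H), and strand (E) in a PSIPRED-style prediction."""
    n = len(psipred_string)
    return {label: psipred_string.count(label) / n for label in "CHE"}

print(ss_composition("CCHHHHHHCCEEEEECC"))
```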
7. Features extracted from the SWISSPROT database:
- Binary features (presence/absence)
  - Alternative products
  - Enzyme cofactors
  - Catalytic activity
- Nominal features
  - Tissue specificity (2 different definitions)
  - Subcellular location
  - Organism and species classification
- Continuous features
  - Number of patterns exhibited by each protein (the "complexity" of a protein)
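A purely illustrative record of how one protein's mixed-type features might be collected; the field names and values are assumptions, and None marks a missing attribute, which the decision trees tolerate (next slide).

```python
# Assumed feature record for one protein, mixing the three feature types on the slide.
protein_features = {
    # binary (presence/absence)
    "alternative_products": 1,
    "enzyme_cofactor": 0,
    "catalytic_activity": 1,
    # nominal
    "tissue_specificity": "liver",
    "subcellular_location": None,   # missing in SWISSPROT for this entry
    "organism_class": "Mammalia",
    # continuous
    "num_patterns": 7,              # the "complexity" of the protein
}
```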
8. Mixture model of PDTs (Probabilistic Decision Trees)
- Can handle nominal data
- Robust to errors
- Missing data is allowed
9. How to select an attribute for a decision node?
- Use entropy to measure the impurity
- The impurity must decrease after the split
- Alternative measure: the de Mántaras distance metric (has a lower bias towards attributes with low split information)
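A minimal sketch of the entropy-based criterion: the attribute whose split yields the largest reduction in impurity (information gain) is preferred. The toy labels and attribute values are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels (impurity of a node)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Reduction in entropy after splitting a node on one attribute."""
    total = len(labels)
    branches = {}
    for label, value in zip(labels, attribute_values):
        branches.setdefault(value, []).append(label)
    split_entropy = sum(len(b) / total * entropy(b) for b in branches.values())
    return entropy(labels) - split_entropy

# Toy split: the attribute separates the classes fairly well, so the gain is positive.
labels = ["kinase", "kinase", "protease", "protease", "kinase"]
values = ["yes", "yes", "no", "no", "no"]
print(round(information_gain(labels, values), 3))
```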
10. Enhancements of the algorithm:
- Dynamic attribute filtering
- Discretizing numerical features
- Multiple values for attributes
- Missing attributes
- Binary splitting
- Leaf weighting
- Post-pruning
- 10-fold cross-validation
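As a sketch of the "discretizing numerical features" step only: the cut points below are simple quantiles, not necessarily the criterion used in the paper, so treat this as an assumed stand-in.

```python
def quantile_bins(values, n_bins=4):
    """Cut points that split a numeric feature into roughly equal-sized bins."""
    ordered = sorted(values)
    return [ordered[i * len(ordered) // n_bins] for i in range(1, n_bins)]

def discretize(value, cut_points):
    """Map a numeric value to the index of its bin."""
    return sum(value >= c for c in cut_points)

complexity = [1, 2, 2, 3, 5, 8, 13, 21, 34, 55]   # e.g. the "number of patterns" feature
cuts = quantile_bins(complexity)
print(cuts, discretize(13, cuts))
```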
11. The probabilistic framework
- An attribute is selected with a probability that depends on its information gain
- The trees are weighted by their performance
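A minimal sketch of both ideas. The slide does not give the exact probability function, so gain-proportional sampling is an assumption; likewise, weighting each tree's vote by its validation performance is one straightforward reading of the mixture.

```python
import random

def sample_attribute(gains):
    """Pick an attribute at random, with probability proportional to its information gain.

    Assumed form: the paper only says the probability depends on the gain.
    """
    attrs = list(gains)
    weights = [max(gains[a], 0.0) for a in attrs]
    if sum(weights) == 0:
        return random.choice(attrs)
    return random.choices(attrs, weights=weights, k=1)[0]

def mixture_predict(tree_predictions, tree_weights):
    """Combine class-probability dicts from several trees, weighted by performance."""
    combined = {}
    total = sum(tree_weights)
    for preds, w in zip(tree_predictions, tree_weights):
        for label, p in preds.items():
            combined[label] = combined.get(label, 0.0) + w * p / total
    return combined

gains = {"hydrophobic_pct": 0.42, "subcellular_location": 0.31, "dipeptide_AK": 0.05}
print(sample_attribute(gains))

# Two trees vote on the class of one protein, weighted by their validation accuracy.
preds = [{"kinase": 0.9, "protease": 0.1}, {"kinase": 0.4, "protease": 0.6}]
print(mixture_predict(preds, tree_weights=[0.81, 0.65]))
```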
12. Evaluation of decision trees
- Accuracy = (tp + tn) / total
- Sensitivity = tp / (tp + fn)
- Selectivity = tp / (tp + fp)
- Jensen-Shannon divergence score
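A small sketch of the confusion-matrix measures and the standard Jensen-Shannon divergence; how exactly the paper turns the divergence into a score is not spelled out on the slide.

```python
import math

def evaluate(tp, tn, fp, fn):
    """Confusion-matrix summaries listed on the slide."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "selectivity": tp / (tp + fp),   # also known as precision
    }

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (base 2, in [0, 1])."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(evaluate(tp=80, tn=90, fp=10, fn=20))
print(round(js_divergence([0.7, 0.3], [0.5, 0.5]), 3))
```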
13. Handling skewed distributions (unequal class sizes)
- Re-weight cases by 1 / (# of class counts)
  - Increases the impurity and the number of false positives
- Mixed entropy
  - Uses the average of the weighted and unweighted information gain to split and prune trees
- Interlaced entropy
  - Start with weighted samples and later use the unweighted entropy
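A sketch of the weighted entropy and one simple reading of "mixed entropy" as a 50/50 blend of the unweighted and weighted quantities; the paper's precise definition may differ.

```python
import math
from collections import Counter, defaultdict

def weighted_entropy(labels, weights):
    """Entropy where each example contributes its weight instead of a unit count."""
    mass = defaultdict(float)
    for label, w in zip(labels, weights):
        mass[label] += w
    total = sum(mass.values())
    return -sum((m / total) * math.log2(m / total) for m in mass.values() if m > 0)

def class_balancing_weights(labels):
    """Weight each example by 1 / (size of its class), so small classes count equally."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

def mixed_entropy(labels, weights):
    """50/50 blend of the unweighted and weighted entropies ("average" on the slide)."""
    unit = [1.0] * len(labels)
    return 0.5 * weighted_entropy(labels, unit) + 0.5 * weighted_entropy(labels, weights)

labels = ["kinase"] * 9 + ["protease"]        # skewed 9:1 class distribution
weights = class_balancing_weights(labels)
print(round(weighted_entropy(labels, weights), 3))   # 1.0: classes look balanced after re-weighting
print(round(mixed_entropy(labels, weights), 3))
```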
14. Model selection (simplification)
- Occam's razor: of two models with the same result, choose the simpler one
- Bayesian approach: the most probable model has the maximum posterior probability
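The Bayesian criterion on the slide can be written out as follows, with M a candidate model (tree) and D the training data (notation assumed here, not taken from the paper):

```latex
P(M \mid D) = \frac{P(D \mid M)\,P(M)}{P(D)} \propto P(D \mid M)\,P(M),
\qquad
M^{*} = \arg\max_{M} P(D \mid M)\,P(M)
```

A prior P(M) that favours smaller trees makes the Occam's razor preference explicit.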
15. Learning strategy optimization

Configuration              | Sensitivity (selectivity) | Accepted/rejected
Basic (C4.5)               | 0.35                      | initial
de Mántaras metric         | 0.36                      | accepted
Binary branching           | 0.45                      | accepted
Weighted entropy           | 0.56                      | accepted
10-fold cross-validation   | 0.65                      | accepted
20-fold cross-validation   | 0.64                      | rejected
JS-based post-pruning      | 0.70                      | accepted
Sen/sel post-pruning       | 0.68                      | rejected
Weighted leaves            | 0.68                      | rejected
Mixed entropy              | 0.63                      | rejected
Dipeptide information      | 0.73                      | accepted
Probabilistic trees        | 0.81                      | accepted
16. Pfam classification test (comparison to BLAST)
- PDT performance: 81%
- BLAST performance: 86%
- Main reasons for the gap:
  - Nodes remain impure because weighted entropy stops learning too early
  - Important branches were eliminated by post-pruning when the validation set is small
17. EC classification test (comparison to BLAST)
- PDT performance on average: 71%
- BLAST performance was often lower
18. Conclusions
- Many protein families cannot be defined by sequence similarity alone
- The new method makes use of other features (structure, dipeptides, etc.)
- Besides classification, PDTs allow feature selection for further use
- Results are comparable to BLAST
19. Modifications and improvements
- Use global optimization for pruning
- Use probabilities for attribute values
- Use boosting techniques (combine weighted trees)
- Use the Gini index to measure node impurity
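For the last suggestion, a minimal sketch of the Gini index as an alternative impurity measure to the entropy used earlier:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

# A pure node has impurity 0; a 50/50 two-class node has the maximum value of 0.5.
print(gini_impurity(["kinase"] * 4))               # 0.0
print(gini_impurity(["kinase", "protease"] * 2))   # 0.5
```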