Weka solution for the 2004 KDD Cup Protein Homology Prediction task
Bernhard Pfahringer
Weka Group, University of Waikato, New Zealand
The problem
  Detect homologous protein sequences
  153 training sequences x ~1000 candidate sequences each ==> every pair is classified as match or not
  Very skewed: only 1296 matches (< 1%!)
  BUT: excellent attributes
The attributes
Algorithms doing well
  2-fold cross-validation, looking only at predictive accuracy:
    –Linear SVM (with a logistic model on the output for better probabilities, Platt 1999)
    –10 AdaBoosted unpruned decision trees
    –Random rules (~ RandomForest; ECML 2004 rule learning workshop)
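The "logistic model on the output" idea can be sketched as follows: a sigmoid is fitted to the raw SVM scores so that they can be read as probabilities. This is a simplified illustration, not Platt's exact procedure and not Weka's implementation: it uses plain gradient descent on the log loss and omits Platt's smoothed targets. The names `fit_platt` and `platt_probability`, the learning rate, and the toy scores are all hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss.

    Assumption: plain batch gradient descent stands in for the
    optimizer used in the original method.
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = 0.0
        grad_b = 0.0
        for f, y in zip(scores, labels):
            err = sigmoid(a * f + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * f
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def platt_probability(score, a, b):
    # Map one raw SVM output to a probability with the fitted sigmoid.
    return sigmoid(a * score + b)


# Toy SVM outputs (hypothetical) with their true 0/1 labels.
toy_scores = [-2.0, -1.5, -0.5, 0.3, 1.0, 2.5]
toy_labels = [0, 0, 0, 1, 0, 1]
a, b = fit_platt(toy_scores, toy_labels)
```

In practice the sigmoid would be fitted on held-out or cross-validated SVM outputs rather than on the training scores themselves.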
Performance criteria
  Top1: fraction of blocks with a homologous sequence ranked first (max)
  RMSE: root mean squared error (min)
  RKL: average rank of the lowest-ranked homologous sequence (min)
  APR: average of the average precision in each block (max)
  Only RMSE depends on absolute values; for all other criteria a good ranking is sufficient
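The four criteria can be sketched in a few lines. This is an illustration under stated assumptions, not the official scoring program: every block is assumed to contain at least one homolog, ties in the ranking keep input order, and RMSE is averaged per block like the other criteria. All function names and the toy blocks are hypothetical.

```python
import math


def rank_order(scores):
    # Indices sorted by descending score; ties keep input order (assumption).
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def top1(scores, labels):
    # 1 if the top-ranked sequence in the block is a homolog, else 0.
    return 1.0 if labels[rank_order(scores)[0]] == 1 else 0.0


def rkl(scores, labels):
    # Rank (1-based) of the lowest-ranked homolog in the block.
    order = rank_order(scores)
    return max(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)


def average_precision(scores, labels):
    # Mean of precision-at-rank taken at the rank of each homolog.
    hits = 0
    precisions = []
    for rank, i in enumerate(rank_order(scores), start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


def rmse(scores, labels):
    # Depends on the absolute predicted values, not only on the ranking.
    return math.sqrt(sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores))


def evaluate(blocks):
    # blocks: list of (scores, labels) pairs, one pair per block.
    n = len(blocks)
    return {
        "TOP1": sum(top1(s, y) for s, y in blocks) / n,
        "RMSE": sum(rmse(s, y) for s, y in blocks) / n,
        "RKL": sum(rkl(s, y) for s, y in blocks) / n,
        "APR": sum(average_precision(s, y) for s, y in blocks) / n,
    }


# Two toy blocks: predicted match probabilities and true 0/1 labels.
demo_blocks = [
    ([0.9, 0.2, 0.6, 0.1], [1, 0, 1, 0]),
    ([0.3, 0.8, 0.5], [1, 0, 0]),
]
result = evaluate(demo_blocks)
```

Rescaling all scores by any increasing function leaves Top1, RKL, and APR unchanged and only moves RMSE, which is the point made on the slide.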
Unique Solution
  Voted ensemble of three classifiers:
    –Linear SVM + logistic model on output
    –10 AdaBoosted unpruned J48 trees
    –10^5 random rules
  Non-standard voting:
    –IF SVM and RandomRules agree ==> average their probabilities
    –ELSE use Booster as tie-breaker
  Lucky (first on Proteins, 18th on Physics)
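One possible reading of the voting rule, as a sketch. The slides do not say what "agree" means or which probability is output after a tie-break, so two assumptions are made here: two classifiers agree when their probabilities fall on the same side of 0.5, and after a tie-break the booster's probability is averaged with that of whichever classifier sides with it. The function name and the threshold parameter are hypothetical.

```python
def voted_probability(p_svm, p_rr, p_boost, threshold=0.5):
    """Combine three match probabilities into one.

    Assumption: "agree" means the same side of `threshold`; the
    tie-break output is the mean of the booster and its ally.
    """
    svm_vote = p_svm >= threshold
    rr_vote = p_rr >= threshold
    if svm_vote == rr_vote:
        # SVM and random rules agree: average their probabilities.
        return (p_svm + p_rr) / 2.0
    # They disagree: the booster decides which side wins.
    boost_vote = p_boost >= threshold
    ally = p_svm if svm_vote == boost_vote else p_rr
    return (ally + p_boost) / 2.0
```

For example, `voted_probability(0.9, 0.2, 0.3)` sides with the random rules because the booster also votes "no match". Returning the booster's probability alone after a tie-break would be an equally plausible reading of the slide.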
Ensemble performance
  [Table: columns Top1, RMSE, RKL, APR; rows Boost, SMOlin, RR10^, Voted; values missing]
Attribute ranks
  RR:  A53 A63 A55 A58 A3 A34 A54
  DS:  A53 A55 A58 A59 A60 A54 A3
  SVM: A53 A58 A3 A60 A59 A8 A45
What I should have done
  Optimize separately
  Bagging for better probability estimates
  More data engineering (e.g. PCA, …)
  View it as an outlier detection problem
  Utilize block structure?
(Standard) Lessons
  Data engineering (good attributes) essential
  Ensembles are more robust
  Weka is not just an educational tool
    –[at least some parts scale well]
  Java/open source DM tools are competitive
  But: could improve Weka considerably (volunteers and/or sponsors, get in touch :-)
Finally
  A big “THANK YOU” to the organizers of the KDD Cup 2004!