1
Weka solution for the 2004 KDD Cup Protein Homology Prediction task
Bernhard Pfahringer
Weka Group, University of Waikato, New Zealand
2
The problem
Detect homologous protein sequences
153 training sequences * ~1000 sequences ==> 145751 pairs classified as match or not
Very skewed: only 1296 matches (< 1%!)
BUT: excellent attributes
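To make the skew concrete, here is a small sketch using only the two counts from the slide. It shows why accuracy alone says little here: a classifier that always answers "no match" is already right on more than 99% of the pairs. The variable names are illustrative.

```python
# Counts taken from the slide.
pairs = 145751
matches = 1296

# Fraction of positive (homologous) pairs.
match_rate = matches / pairs

# Accuracy of a trivial classifier that always predicts "no match".
baseline_accuracy = 1.0 - match_rate

print(f"match rate:        {match_rate:.2%}")
print(f"baseline accuracy: {baseline_accuracy:.2%}")
```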
3
The attributes
6
Algorithms doing well
2-fold cross-validation, looking only at predictive accuracy:
– Linear SVM (with logistic model on output for better probabilities, Platt 1999)
– 10 AdaBoosted unpruned decision trees
– Random rules (~ RandomForest, ECML 2004 Rule Learning Workshop)
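The "logistic model on output" idea (Platt 1999) can be sketched as fitting a one-dimensional logistic regression that maps raw SVM scores to probabilities. The code below is a minimal illustration, not the Weka implementation: the function names `fit_platt` and `calibrate` are hypothetical, the toy scores are made up, plain gradient descent is swapped in for the optimizer used in the Platt paper, and the paper's smoothed targets are omitted.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit p(match | f) = sigmoid(a * f + b) on raw SVM scores f.

    Minimizes the negative log-likelihood by batch gradient descent
    (an assumption made for brevity).
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for f, t in zip(scores, labels):
            err = sigmoid(a * f + b) - t  # dNLL/dz for one example
            grad_a += err * f
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(score, a, b):
    # Turn one raw SVM score into a probability estimate.
    return sigmoid(a * score + b)

# Toy, made-up SVM scores (not from the KDD Cup data); labels overlap a bit
# so the fitted slope stays finite.
toy_scores = [-2.0, -1.5, -1.0, -0.2, 0.3, 0.4, 1.0, 1.8]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
a, b = fit_platt(toy_scores, toy_labels)
```

With the fitted `a` and `b`, `calibrate(2.0, a, b)` lands above 0.5 and `calibrate(-2.0, a, b)` below it, which is the property a probability-based criterion such as RMSE relies on.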
7
Performance criteria
Top1: fraction of blocks with a homologous sequence ranked top1 (max)
RMSE: root mean squared error (min)
RKL: average rank of the lowest ranked homologous sequence (min)
APR: average of the average precision in each block (max)
Only RMSE depends on absolute values; for all other criteria a good ranking is sufficient
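The four criteria can be sketched in a few lines, following the definitions on the slide. This is an illustration under stated assumptions, not the official KDD Cup scorer: the function name `block_metrics` and the `(block_id, label, score)` row layout are hypothetical, ties are broken by input order, blocks without any match are skipped for the ranking criteria, and RMSE is taken over all rows.

```python
import math
from collections import defaultdict

def block_metrics(rows):
    """rows: iterable of (block_id, label, score), label in {0, 1}."""
    blocks = defaultdict(list)
    for block_id, label, score in rows:
        blocks[block_id].append((score, label))

    top1_hits, rkl_sum, apr_sum, n_blocks = 0, 0.0, 0.0, 0
    sq_err, n_rows = 0.0, 0
    for items in blocks.values():
        # RMSE uses the absolute score values of every row.
        for score, label in items:
            sq_err += (score - label) ** 2
            n_rows += 1
        # Rank by descending score (stable sort keeps input order on ties).
        labels = [label for _, label in sorted(items, key=lambda x: -x[0])]
        if 1 not in labels:
            continue  # assumption: blocks without a match are skipped
        n_blocks += 1
        top1_hits += labels[0]  # 1 if the top-ranked item is a match
        # 1-based rank of the lowest-ranked match in this block.
        rkl_sum += max(i for i, l in enumerate(labels, 1) if l == 1)
        # Average precision: mean of precision@rank over the match ranks.
        hits, prec_sum = 0, 0.0
        for i, l in enumerate(labels, 1):
            if l == 1:
                hits += 1
                prec_sum += hits / i
        apr_sum += prec_sum / hits
    if n_blocks == 0 or n_rows == 0:
        raise ValueError("need at least one block containing a match")
    return {
        "Top1": top1_hits / n_blocks,
        "RMSE": math.sqrt(sq_err / n_rows),
        "RKL": rkl_sum / n_blocks,
        "APR": apr_sum / n_blocks,
    }

# Made-up toy data: two blocks, (block_id, label, score).
toy_rows = [
    (1, 1, 0.9), (1, 0, 0.8), (1, 1, 0.3), (1, 0, 0.1),
    (2, 0, 0.7), (2, 1, 0.6), (2, 0, 0.2),
]
result = block_metrics(toy_rows)
```

Rescaling all scores monotonically changes only the RMSE entry of `result`, which matches the slide's remark that the other three criteria depend on ranking alone.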
8
Unique Solution
Voted ensemble of three classifiers:
– Linear SVM + logistic model on output
– AdaBoosted 10 unpruned J48 trees
– 10^5 random rules
Non-standard voting:
– If SVM and RandomRules agree ==> average their probabilities
– ELSE use Booster as tie-breaker
Lucky (first on Proteins, 18th on Physics)
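One possible reading of the non-standard voting rule, as a sketch. The slide does not say what "agree" means or which probability is emitted when the booster breaks a tie, so both choices here are assumptions: agreement is taken as predicting the same class at a 0.5 threshold, and on disagreement the output is the probability of whichever of the two (SVM or RandomRules) sides with the booster. The function name `vote` is hypothetical.

```python
def vote(p_svm, p_rr, p_boost, threshold=0.5):
    """Combine three match probabilities into one.

    p_svm, p_rr, p_boost: estimated probabilities of "match" from the
    linear SVM, the random rules and the boosted trees respectively.
    """
    svm_pos = p_svm >= threshold
    rr_pos = p_rr >= threshold
    if svm_pos == rr_pos:
        # SVM and RandomRules predict the same class: average them.
        return (p_svm + p_rr) / 2.0
    # They disagree: the booster's class decides which one to trust
    # (assumption: emit that member's probability unchanged).
    boost_pos = p_boost >= threshold
    return p_svm if svm_pos == boost_pos else p_rr
```

For example, `vote(0.9, 0.2, 0.8)` returns the SVM's 0.9 because the booster sides with the SVM, while `vote(0.9, 0.2, 0.1)` returns the RandomRules value 0.2.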
9
Ensemble performance
          Top1     RMSE     RKL     APR
Boost10   0.79333  0.03690  500.68  0.68582
SMOlin    0.88667  0.03699   64.41  0.82581
RR10^5    0.89333  0.04142   53.77  0.83733
Voted     0.90667  0.03833   52.45  0.84118
10
Attribute ranks
      1    2    3    4    5    6    7
RR    A53  A63  A55  A58  A3   A34  A54
DS    A53  A55  A58  A59  A60  A54  A3
SVM   A53  A58  A3   A60  A59  A8   A45
11
What I should have done
Optimize separately
Bagging for better probability estimates
More data engineering (e.g. PCA, …)
View it as an outlier detection problem
Utilize block structure?
12
(Standard) Lessons
Data engineering (good attributes) essential
Ensembles are more robust
Weka is not just an educational tool
– [at least some parts scale well]
Java/open source DM tools are competitive
But: could improve Weka considerably (volunteers and/or sponsors, get in touch :-)
13
Finally
A big “THANK YOU” to the organizers of the KDD Cup 2004!