
Weka solution for the 2004 KDD Cup Protein Homology Prediction task Bernhard Pfahringer Weka Group, University of Waikato, New Zealand.


1 Weka solution for the 2004 KDD Cup Protein Homology Prediction task
Bernhard Pfahringer
Weka Group, University of Waikato, New Zealand

2 The problem
Detect homologous protein sequences
153 training sequences * ~1000 candidate sequences each ==> 145751 pairs, each classified as match or not
Very skewed: only 1296 matches (< 1%!)
BUT: excellent attributes
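The skew quoted on this slide can be checked with a few lines of arithmetic. The sketch below uses only the counts given above; the variable names are illustrative, and the "always predict no match" baseline is added here as an assumption for illustration, not something taken from the talk.

```python
# Counts quoted on the slide.
n_blocks = 153      # training sequences (one block of candidates per sequence)
n_pairs = 145751    # candidate pairs across all blocks
n_matches = 1296    # pairs labelled as homologous

avg_block_size = n_pairs / n_blocks       # should be consistent with "~1000"
match_rate = n_matches / n_pairs          # fraction of positive pairs
majority_accuracy = 1.0 - match_rate      # accuracy of always predicting "no match"

print(f"average block size: {avg_block_size:.1f}")
print(f"match rate: {match_rate:.4%}")
print(f"accuracy of the trivial 'no match' classifier: {majority_accuracy:.4%}")
```

Since the trivial classifier is already more than 99% accurate, plain accuracy separates models only weakly on data this skewed.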

3 The attributes

4

5

6 Algorithms doing well
2-fold cross-validation, looking only at predictive accuracy:
– Linear SVM (with a logistic model on the output for better probabilities, Platt 1999)
– 10 AdaBoosted unpruned decision trees
– Random rules (~ RandomForest, ECML 2004 rule learning workshop)
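The "logistic model on output" idea can be sketched as fitting a one-dimensional sigmoid that maps raw SVM scores to probabilities. The code below is a minimal illustration only: it uses plain gradient descent on the log loss, whereas Platt's published procedure uses a different optimizer and smoothed targets, and it is not the Weka implementation. The function names, learning rate, and toy scores are all assumptions made for this example.

```python
import math

def fit_sigmoid(scores, labels, lr=0.1, epochs=2000):
    """Fit p = 1 / (1 + exp(-(a * score + b))) by gradient descent on log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s   # derivative of log loss w.r.t. a
            grad_b += (p - y)       # derivative of log loss w.r.t. b
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrated(score, a, b):
    """Map a raw classifier score to a probability with the fitted sigmoid."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Hypothetical raw SVM outputs (signed distances) with 0/1 labels.
toy_scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
toy_labels = [0, 0, 1, 0, 1, 1]

a, b = fit_sigmoid(toy_scores, toy_labels)
for s in toy_scores:
    print(f"score {s:+.1f} -> probability {calibrated(s, a, b):.3f}")
```

The sigmoid is monotone in the score, so it leaves the ranking unchanged and only affects criteria that depend on absolute values.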

7 Performance criteria
Top1: fraction of blocks with a homologous sequence ranked first (maximize)
RMSE: root mean squared error (minimize)
RKL: average rank of the lowest-ranked homologous sequence (minimize)
APR: average of the average precision in each block (maximize)
Only RMSE depends on absolute values; for all other criteria a good ranking is sufficient.
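The four criteria can be written down directly from the definitions on this slide. The following is a sketch under stated assumptions, not the official KDD Cup evaluation code: a block is represented as a list of (score, label) pairs with label 1 for a match, every block is assumed to contain at least one match, ties are broken by input order, and RMSE is pooled over all pairs. All function names and the toy data are made up for illustration.

```python
import math

def rank_block(block):
    """Sort one block's (score, label) pairs from highest to lowest score."""
    return sorted(block, key=lambda pair: pair[0], reverse=True)

def top1(blocks):
    """Fraction of blocks whose top-ranked pair is a match."""
    hits = sum(1 for b in blocks if rank_block(b)[0][1] == 1)
    return hits / len(blocks)

def rmse(blocks):
    """Root mean squared error between scores and 0/1 labels, over all pairs."""
    errors = [(score - label) ** 2 for b in blocks for score, label in b]
    return math.sqrt(sum(errors) / len(errors))

def rkl(blocks):
    """Average rank (1 = best) of the lowest-ranked match in each block."""
    total = 0
    for b in blocks:
        ranked = rank_block(b)
        total += max(i for i, (_, label) in enumerate(ranked, start=1) if label == 1)
    return total / len(blocks)

def apr(blocks):
    """Mean over blocks of the average precision at each match position."""
    total = 0.0
    for b in blocks:
        hits, precisions = 0, []
        for i, (_, label) in enumerate(rank_block(b), start=1):
            if label == 1:
                hits += 1
                precisions.append(hits / i)
        total += sum(precisions) / len(precisions)
    return total / len(blocks)

# Two tiny hypothetical blocks of (predicted probability, true label).
toy_blocks = [
    [(0.9, 1), (0.8, 0), (0.3, 1), (0.1, 0)],
    [(0.2, 1), (0.7, 0), (0.1, 0)],
]
print(top1(toy_blocks), rmse(toy_blocks), rkl(toy_blocks), apr(toy_blocks))
```

Only `rmse` reads the score values themselves; the other three functions use the scores solely to order each block, which matches the remark that a good ranking is sufficient for them.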

8 Unique solution
Voted ensemble of three classifiers:
– Linear SVM + logistic model on output
– 10 AdaBoosted unpruned J48 trees
– 10^5 random rules
Non-standard voting:
– If SVM and RandomRules agree ==> average their probabilities
– Else use the booster as tie-breaker
Lucky (first on Proteins, 18th on Physics)
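One possible reading of the voting rule is sketched below. The slide does not say what "agree" means or what value is returned after a tie-break, so two assumptions are made here: two classifiers agree when their probabilities fall on the same side of a 0.5 threshold, and on disagreement the output is the probability of whichever of the two the booster sides with. The function name and threshold are hypothetical.

```python
def vote(p_svm, p_rr, p_boost, threshold=0.5):
    """Combine three match probabilities with the slide's non-standard vote.

    Assumptions (not stated in the talk): agreement is judged at `threshold`,
    and on disagreement the booster picks which of the two probabilities wins.
    """
    svm_says_match = p_svm >= threshold
    rr_says_match = p_rr >= threshold
    if svm_says_match == rr_says_match:
        # SVM and random rules agree: average their probabilities.
        return (p_svm + p_rr) / 2.0
    # They disagree: the booster acts as tie-breaker.
    boost_says_match = p_boost >= threshold
    return p_svm if svm_says_match == boost_says_match else p_rr

print(vote(0.8, 0.6, 0.1))  # SVM and RR agree, booster is ignored
print(vote(0.8, 0.2, 0.9))  # disagreement, booster sides with the SVM
print(vote(0.8, 0.2, 0.1))  # disagreement, booster sides with the random rules
```

Under this reading the booster's own probability never enters the output; it only decides which of the other two probabilities is used.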

9 Ensemble performance
          Top1     RMSE     RKL     APR
Boost10   0.79333  0.03690  500.68  0.68582
SMOlin    0.88667  0.03699   64.41  0.82581
RR10^5    0.89333  0.04142   53.77  0.83733
Voted     0.90667  0.03833   52.45  0.84118

10 Attribute ranks
Rank  1    2    3    4    5    6    7
RR    A53  A63  A55  A58  A3   A34  A54
DS    A53  A55  A58  A59  A60  A54  A3
SVM   A53  A58  A3   A60  A59  A8   A45

11 What I should have done
Optimize separately
Bagging for better probability estimates
More data engineering (e.g. PCA, …)
View it as an outlier detection problem
Utilize block structure?

12 (Standard) Lessons
Data engineering (good attributes) is essential
Ensembles are more robust
Weka is not just an educational tool
– [at least some parts scale well]
Java/open source DM tools are competitive
But: could improve Weka considerably (volunteers and/or sponsors, get in touch :-)

13 Finally
A big “THANK YOU” to the organizers of the KDD Cup 2004!

