UCSpv: Principled Voting in UCS Rule Populations
Gavin Brown, Tim Kovacs, James Marshall
Context
Ensemble: any collection of predictors which aggregate predictions by voting
LCS: a dynamically updated population of predictors (which aggregate predictions…)
Our view: one LCS is an ensemble
Our long-term goal: transfer ideas between LCS and ensembles
Related work by Barry, Drugowitsch, Bull
Overview
UCS is a version of XCS adapted for supervised learning (Bernadó & Garrell-Guiu, 2003).
We implemented it and noted two undocumented features.
We adapted the voting method from the AdaBoost ensemble method and compared it to the original UCS method.
Preliminary results show as good or better performance on clean data, but UCSpv can be beaten by well-tuned UCS on very noisy or class-imbalanced data.
Replicating UCS
Two undocumented issues related to voting:
–Contribution of inexperienced rules is discounted
–System prediction for an action is divided by the sum of numerosities advocating it
Both are derived from XCS (although the latter is not a standard part of XCS)
Thank you to Albert Orriols-Puig and Kamran Shafi
UCS Voting
Error rate of rule i:
Fitness of rule i:
where γ implements inexperience discount and ν tunes vote
System prediction for a class c:
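The voting scheme named on this slide can be sketched as follows. The slide's own equations are not available here, so every formula in the sketch is an assumption: error rate is taken as the fraction of matched examples a rule got wrong, fitness as accuracy raised to the power ν, the inexperience discount γ as a fixed factor applied below an experience threshold, and the system prediction as the numerosity-weighted fitness divided by the sum of numerosities advocating the class. The names `Rule`, `theta` and `gamma_low` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    action: int          # class the rule advocates
    correct: int         # matched examples it classified correctly
    experience: int      # matched examples it has seen
    numerosity: int = 1  # number of copies in the population


def error_rate(rule: Rule) -> float:
    # Assumed form: fraction of matched examples misclassified.
    return 1.0 - rule.correct / rule.experience


def fitness(rule: Rule, nu: float = 10.0, theta: int = 10,
            gamma_low: float = 0.01) -> float:
    # Assumed form: gamma * (1 - error)^nu, where gamma is a step
    # discount for rules with experience below the threshold theta.
    if rule.experience == 0:
        return 0.0
    gamma = 1.0 if rule.experience >= theta else gamma_low
    return gamma * (1.0 - error_rate(rule)) ** nu


def system_prediction(match_set, c, nu=10.0, theta=10, gamma_low=0.01):
    # Numerosity-weighted fitness of the rules advocating class c,
    # divided by the sum of numerosities advocating c.
    advocates = [r for r in match_set if r.action == c]
    total_numerosity = sum(r.numerosity for r in advocates)
    if total_numerosity == 0:
        return 0.0
    weighted = sum(fitness(r, nu, theta, gamma_low) * r.numerosity
                   for r in advocates)
    return weighted / total_numerosity


def classify(match_set, classes, nu=10.0, theta=10, gamma_low=0.01):
    # Greedy choice: the class with the highest system prediction.
    return max(classes,
               key=lambda c: system_prediction(match_set, c, nu, theta, gamma_low))
```

For example, `classify([Rule(1, 20, 20, 2), Rule(0, 10, 20, 1)], classes=[0, 1])` picks class 1, because the perfectly accurate rule has fitness 1 while the 50%-accurate rule's fitness is 0.5 raised to the power ν.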
UCS Voting
UCS is supervised so there’s no need to take exploratory actions (unlike in RL)
We greedily take the action with the highest vote
Voting process modeled on XCS
–While XCS separated fitness and prediction, UCS combines them again
Error rate used by Frey & Slate (1991)
UCS Fitness Tuning
Larger ν = steeper curve
[Figure: fitness curves for ν = 5 and ν = 50]
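The effect of ν can be illustrated numerically. This is a minimal sketch assuming fitness is accuracy raised to the power ν (an assumption about the form of the curve, since the plot itself is not reproduced here); `ucs_fitness` is a hypothetical helper name.

```python
def ucs_fitness(accuracy: float, nu: float) -> float:
    # Assumed fitness form: accuracy ** nu.
    return accuracy ** nu


# Tabulate both settings from the slide over a few accuracy values.
for acc in (0.5, 0.8, 0.9, 0.99, 1.0):
    print(f"accuracy={acc:.2f}  nu=5: {ucs_fitness(acc, 5):.4f}  "
          f"nu=50: {ucs_fitness(acc, 50):.4f}")
```

Under this assumption, ν = 50 drives the vote of any rule that is not almost perfectly accurate close to zero, whereas ν = 5 still gives moderately accurate rules a noticeable say.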
Replicating UCS Accuracy
11 multiplexer, no noise, balanced classes
Training set classification, 20 replicates
Results (next slide) fit original UCS implementation
Higher ν gives better accuracy (to a point)
Replicating UCS: Inexperience Discount
If we do not discount the vote of inexperienced rules, UCS does not quite converge
Inexperience Discount
Why is it needed?
–A rule which correctly classifies 1 example (and matches no other) has error rate zero.
–Without discount it gets maximum vote.
–GA continues to generate bad rules which briefly disrupt voting.
XCS does not have the same need because fitness is initially low and rises slowly.
UCS method is all or nothing.
Marshall et al. (2007) introduce a smoother Bayesian scheme.
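The argument above can be checked with a small numeric sketch. It assumes a vote of the form γ · accuracy^ν and an all-or-nothing discount (γ = `gamma_low` below an experience threshold `theta`, otherwise 1). The threshold, the discount factor and the function name `vote` are all hypothetical choices for illustration.

```python
def vote(correct, experience, nu=10.0, theta=None, gamma_low=0.01):
    # Assumed vote: gamma * accuracy ** nu.
    # theta=None disables the inexperience discount entirely.
    accuracy = correct / experience
    gamma = 1.0
    if theta is not None and experience < theta:
        gamma = gamma_low
    return gamma * accuracy ** nu


lucky = (1, 1)       # one match, one correct classification
veteran = (95, 100)  # well-tested rule with 95% accuracy

# Without the discount the untested rule outvotes the veteran.
print(vote(*lucky), vote(*veteran))

# With the discount the untested rule is almost silenced.
print(vote(*lucky, theta=10), vote(*veteran, theta=10))
```

The step-shaped γ reflects the "all or nothing" behaviour described on the slide; a smoother scheme would replace the step with a gradual function of experience.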
Replicating UCS %[B]
Agrees with original UCS paper(s)
No effect of ν on %[B]?
So we should optimise GA selective pressure independently of voting! (Future work…)
Derivation of Vote Weights
Derived from AdaBoost
Optimal for a particular exponential bound on cost function
Other bounds appear in literature
Limited to 2 classes
Doesn’t need the training set distribution update mechanism of AdaBoost, so it’s simpler
Bound on Error Function
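The equation for this slide is not reproduced here. As a point of reference only, the standard exponential bound from the AdaBoost literature is given below; whether the slide used exactly this form is an assumption. It takes two classes y_n ∈ {−1, +1}, rule outputs h_i(x) ∈ {−1, +1}, vote weights α_i, and N training examples.

```latex
f(x) = \sum_i \alpha_i \, h_i(x), \qquad H(x) = \operatorname{sign} f(x)

\frac{1}{N} \sum_{n=1}^{N} \mathbf{1}\!\left[ H(x_n) \neq y_n \right]
\;\le\;
\frac{1}{N} \sum_{n=1}^{N} \exp\!\left( -\, y_n f(x_n) \right)
```

The bound holds because a misclassified example has y_n f(x_n) ≤ 0, so its exponential term is at least 1, while every correctly classified example contributes a positive term.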
Weights to Minimize the Bound
Derived weight of rule i in vote:
α tends to infinity as ε tends to 0
If ε = 0 we set α to a constant α’
This adds another parameter, but:
–α’ gives us a (weak) way to tune vote
–We see α’ disappears later
We call this UCSpv – UCS with Principled Voting
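A minimal sketch of a weight with these properties follows. It uses the usual AdaBoost weight, α = ½ ln((1 − ε)/ε), which is an assumption about the slide's formula (the slide's version may differ, for example by a constant factor). The cap `alpha_prime` stands in for α’, its default value is arbitrary, and the mirrored treatment of ε = 1 is a further assumption added only to keep the function total.

```python
import math


def ucspv_weight(eps: float, alpha_prime: float = 5.0) -> float:
    # Assumed AdaBoost-style weight: 0.5 * ln((1 - eps) / eps).
    if eps <= 0.0:
        return alpha_prime       # alpha would be infinite, so use the constant
    if eps >= 1.0:
        return -alpha_prime      # assumption: mirror the cap for eps = 1
    return 0.5 * math.log((1.0 - eps) / eps)


def ucspv_vote(match_set, alpha_prime: float = 5.0) -> int:
    # match_set: list of (predicted_class, error_rate) pairs,
    # with predicted_class in {-1, +1} (the two-class restriction).
    total = sum(h * ucspv_weight(eps, alpha_prime) for h, eps in match_set)
    return 1 if total >= 0 else -1


# One accurate rule against two weak rules voting the other way.
print(ucspv_vote([(1, 0.1), (-1, 0.4), (-1, 0.4)]))
```

In this sketch a rule with ε = 0.5 gets weight zero, so a rule no better than chance has no say in the vote.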
Comparison of Fitness Functions
UCS With/out Discount (11mux)
UCSpv With/out Discount (11mux)
Class-imbalanced Training Data
[Figure panels: 8:1 unbalanced; 16:1 unbalanced]
Noisy Training Data
[Figure panels: 15% noise (note that UCS requires fine tuning of the ν parameter); 5% noise]
Conclusions So Far
UCSpv matches/outperforms UCS except
–high noise (15%) and high imbalance (16:1)
–AND when UCS is optimally parameterised
Why?
–Noise and imbalance violate assumptions and make UCSpv weights sub-optimal
–ν is much better than α’ at tuning voting: ν affects the whole curve but α’ only one point
“Monks3” Problem
Correcting for Noise
If UCSpv weights become suboptimal with noisy data, can we correct for noise?
A crude method is to factor in an estimate of noise κ:
Monks3 with κ=0.5
Confidence in Accuracy
Accuracy estimates are wrong when training data is unrepresentative:
–when training set is very small (sampling error)
–when class labels are noisy
–when target concept is non-stationary
–when distributions over train and test sets differ
–probably other cases too …
Various forms of correction
–Inexperience discounts & probability smoothing
–Confidence intervals
α’ Disappears
If we correct accuracy estimates with a confidence measure then accuracy is never 1 in realistic learning problems (see Marshall et al. 2007)
Hence α is never infinite and α’ is not needed
α’ was not a good way of tuning UCSpv anyway
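The point can be illustrated with a simple stand-in for a confidence correction. The sketch below uses add-one (Laplace) smoothing of the error estimate, which is not necessarily the scheme of Marshall et al. (2007), and the AdaBoost-style weight ½ ln((1 − ε)/ε), which is an assumed form. The function names are hypothetical.

```python
import math


def smoothed_error(correct: int, experience: int) -> float:
    # Add-one smoothing: one imaginary error and one imaginary success,
    # so the estimate is never exactly 0 or 1.
    wrong = experience - correct
    return (wrong + 1) / (experience + 2)


def smoothed_weight(correct: int, experience: int) -> float:
    # Assumed AdaBoost-style weight applied to the smoothed error.
    eps = smoothed_error(correct, experience)
    return 0.5 * math.log((1.0 - eps) / eps)


# A rule that has never made a mistake still gets a finite weight,
# and the weight grows with the amount of evidence behind it.
print(smoothed_weight(1, 1), smoothed_weight(100, 100))
```

Because the smoothed error is strictly between 0 and 1, the logarithm is always defined and no special constant is needed for error-free rules.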
Summary
Uncovered inexperience discount and normalisation by numerosity in UCS
Applied principled weights from AdaBoost
On balanced noiseless data (11mux)
–Accuracy as good or better than UCS
–One less parameter to set
Imbalanced and noisy data (11mux, Monks3)
–UCS can be tuned better and can outperform UCSpv on the most noisy/imbalanced data if parameterised just right
Future Work
How to tune UCSpv more effectively when assumptions of derived weights do not hold
Other bounds on cost function
How to separately tune selective pressure in the genetic algorithm
Principled approaches to factor in confidence in accuracy
Generally: exchange between LCS and ensembles
References
E. Bernadó & J. Garrell-Guiu (2003). Accuracy-Based Learning Classifier Systems: Models, Analysis and Applications to Classification Tasks.
P. Frey & D. Slate (1991). Letter Recognition using Holland-Style Adaptive Classifiers.
J.A.R. Marshall, G. Brown & T. Kovacs (2007). Bayesian Estimation of Rule Accuracy in UCS.