UCSpv: Principled Voting in UCS Rule Populations
Gavin Brown, Tim Kovacs, James Marshall

Context
Ensemble: any collection of predictors which aggregate predictions by voting
LCS: a dynamically updated population of predictors (which aggregate predictions…)
Our view: one LCS is an ensemble
Our long-term goal: transfer ideas between LCS and ensembles
Related work by Barry, Drugowitsch, Bull

Overview
UCS is a version of XCS adapted for supervised learning (Bernadó & Garrell-Guiu, 2003).
We implemented it and noted two undocumented features.
We adapted the voting method from the AdaBoost ensemble method and compared it to the original UCS method.
Preliminary results show as good or better performance on clean data, but UCSpv can be beaten by a well-tuned UCS on very noisy or class-imbalanced data.

Replicating UCS
Two undocumented issues related to voting:
–The contribution of inexperienced rules is discounted
–The system prediction for an action is divided by the sum of numerosities advocating it
Both are derived from XCS (although the latter is not a standard part of XCS)
Thank you to Albert Orriols-Puig and Kamran Shafi

UCS Voting
Error rate of rule i:
Fitness of rule i: (where γ implements the inexperience discount and ν tunes the vote)
System prediction for a class c:
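The formulas on this slide were images and did not survive extraction. A sketch of the standard UCS-style definitions the slide appears to describe (following Bernadó & Garrell-Guiu, 2003; the exact form of the γ discount here is our assumption):

    \varepsilon_i = \frac{\text{incorrect}_i}{\text{experience}_i}
    F_i = \gamma_i \,(1 - \varepsilon_i)^{\nu}, \quad \text{with } \gamma_i < 1 \text{ if rule } i \text{ is inexperienced and } \gamma_i = 1 \text{ otherwise}
    \text{vote}(c) = \frac{\sum_{i:\,\text{class}_i = c} F_i \cdot \text{num}_i}{\sum_{i:\,\text{class}_i = c} \text{num}_i}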

UCS Voting
UCS is supervised, so there is no need to take exploratory actions (unlike in RL)
We greedily take the action with the highest vote
Voting process modelled on XCS
–While XCS separated fitness and prediction, UCS combines them again
Error rate as used by Frey & Slate (1991)
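As a concrete illustration, here is a minimal Python sketch of the voting scheme described on the last two slides. The fitness exponent ν, the multiplicative γ discount and the experience threshold are assumptions for illustration; the numerosity normalisation and greedy class choice follow the slides:

    from dataclasses import dataclass
    from collections import defaultdict

    @dataclass
    class Rule:
        cls: str          # class the rule advocates
        correct: int      # correct classifications so far
        experience: int   # number of examples the rule has matched
        numerosity: int   # copies of the rule in the population

    def fitness(rule, nu=10, gamma=0.1, exp_threshold=10):
        """UCS-style fitness: accuracy raised to nu, discounted if inexperienced."""
        acc = rule.correct / rule.experience if rule.experience > 0 else 0.0
        f = acc ** nu
        if rule.experience < exp_threshold:   # inexperience discount
            f *= gamma
        return f

    def classify(match_set):
        """Greedily pick the class with the highest numerosity-normalised vote."""
        votes, nums = defaultdict(float), defaultdict(int)
        for rule in match_set:
            votes[rule.cls] += fitness(rule) * rule.numerosity
            nums[rule.cls] += rule.numerosity
        return max(votes, key=lambda c: votes[c] / nums[c])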

UCS Fitness Tuning: larger ν gives a steeper fitness curve (plots shown for ν = 5 and ν = 50)

Replicating UCS: Accuracy
11 multiplexer, no noise, balanced classes
Training set classification, 20 replicates
Results (next slide) fit the original UCS implementation
Higher ν gives better accuracy (to a point)

Replicating UCS: Inexperience Discount
If we do not discount the vote of low-fitness rules, UCS does not quite converge

Inexperience Discount
Why is it needed?
–A rule which correctly classifies 1 example (and matches no other) has error rate zero.
–Without the discount it gets maximum vote.
–The GA continues to generate bad rules which briefly disrupt voting.
XCS does not have the same need because fitness is initially low and rises slowly; the UCS method is all or nothing.
Marshall et al. (2007) introduce a smoother Bayesian scheme.

Replicating UCS: %[B]
Agrees with the original UCS paper(s)
No effect of ν on %[B]?
So we should optimise GA selective pressure independently of voting! (Future work…)

Derivation of Vote Weights
Derived from AdaBoost
Optimal for a particular exponential bound on the cost function
Other bounds appear in the literature
Limited to 2 classes
Does not need the training-set distribution update mechanism of AdaBoost, so it is simpler

Bound on Error Function
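The bound itself was shown as an image and is not in the transcript. For two classes with labels y ∈ {−1, +1}, the standard AdaBoost-style exponential bound the derivation presumably refers to is (our reconstruction, not taken from the slide):

    \mathbb{1}[H(x) \neq y] \;\le\; e^{-y f(x)}, \qquad f(x) = \textstyle\sum_i \alpha_i h_i(x), \quad H(x) = \mathrm{sign}(f(x))

so the training error is bounded by the average of e^{-y_n f(x_n)} over the training set, which the weights α_i are chosen to minimise.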

Weights to Minimize the Bound
Derived weight of rule i in the vote:
α tends to infinity as ε tends to 0
If ε = 0 we set α to a constant α'
This adds another parameter, but:
–α' gives us a (weak) way to tune the vote
–We see α' disappears later
We call this UCSpv – UCS with Principled Voting
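A minimal Python sketch of such a weight, assuming the standard AdaBoost form α = ½ ln((1 − ε)/ε) (the exact formula on the slide was an image); the cap at the constant α' when ε = 0 follows the slide:

    import math

    def ucspv_weight(error_rate, alpha_prime=10.0):
        """AdaBoost-style vote weight for a rule, capped at alpha' when the measured error is zero."""
        if error_rate <= 0.0:
            return alpha_prime                        # alpha would be infinite; use the constant alpha'
        error_rate = min(error_rate, 1.0 - 1e-12)     # guard against log(0) for always-wrong rules
        return 0.5 * math.log((1.0 - error_rate) / error_rate)

For example, a rule with ε = 0.1 votes with weight ½ ln(9) ≈ 1.1, a rule at chance level (ε = 0.5) gets weight 0, and a perfectly accurate rule gets α' instead of an infinite weight.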

Comparison of Fitness Functions

UCS With/out Discount (11mux)

UCSpv With/out Discount (11mux)

Class-imbalanced Training Data (plots for 8:1 and 16:1 class imbalance)

Noisy Training Data (plots for 5% and 15% noise; at 15% noise UCS requires fine tuning of the ν parameter)

Conclusions So Far
UCSpv matches or outperforms UCS except under
–high noise (15%) and high imbalance (16:1)
–AND when UCS is optimally parameterised
Why?
–Noise and imbalance violate the assumptions of the derivation and make the UCSpv weights sub-optimal
–ν is much better than α' at tuning the vote: ν affects the whole curve, but α' only one point

“Monks3” Problem

Correcting for Noise
If the UCSpv weights become suboptimal with noisy data, can we correct for the noise?
A crude method is to factor in an estimate of the noise, κ:
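The corrected weight was shown as an image and is not in the transcript. One plausible reading, purely our assumption, is to add the noise estimate κ to the measured error before computing the AdaBoost-style weight, so that even rules with zero measured error receive a finite, noise-aware vote:

    \alpha_i = \tfrac{1}{2}\,\ln\frac{1 - \varepsilon_i}{\varepsilon_i + \kappa}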

Monks3 with κ=0.5

Confidence in Accuracy
Accuracy estimates are wrong when the training data is unrepresentative:
–when the training set is very small (sampling error)
–when class labels are noisy
–when the target concept is non-stationary
–when the distributions over the training and test sets differ
–probably other cases too…
Various forms of correction:
–inexperience discounts & probability smoothing
–confidence intervals
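As an illustration of the probability-smoothing idea, a minimal sketch in the spirit of the Bayesian scheme of Marshall et al. (2007); the uniform Beta(1,1) prior is our choice for the example, not necessarily the one used in that work:

    def smoothed_accuracy(correct, experience, prior_correct=1, prior_wrong=1):
        """Posterior mean accuracy under a Beta prior; never exactly 0 or 1 for finite experience."""
        return (correct + prior_correct) / (experience + prior_correct + prior_wrong)

A rule that has matched one example and got it right is then estimated at (1+1)/(1+2) ≈ 0.67 rather than 1.0, so its AdaBoost-style weight stays finite, which is why α' disappears on the next slide.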

α' Disappears
If we correct accuracy estimates with a confidence measure, then accuracy is never 1 in realistic learning problems (see Marshall et al., 2007)
Hence α is never infinite and α' is not needed
α' was not a good way of tuning UCSpv anyway

Summary
Uncovered the inexperience discount and the normalisation by numerosity in UCS
Applied principled weights from AdaBoost
On balanced noiseless data (11mux):
–accuracy as good as or better than UCS
–one less parameter to set
On imbalanced and noisy data (11mux, Monks3):
–UCS can be tuned better and can outperform UCSpv on the most noisy/imbalanced data if parameterised just right

Future Work
How to tune UCSpv more effectively when the assumptions behind the derived weights do not hold
Other bounds on the cost function
How to separately tune selective pressure in the genetic algorithm
Principled approaches to factoring confidence into accuracy estimates
Generally: exchange of ideas between LCS and ensembles

References
E. Bernadó & J. Garrell-Guiu (2003). Accuracy-Based Learning Classifier Systems: Models, Analysis and Applications to Classification Tasks.
P. Frey & D. Slate (1991). Letter Recognition Using Holland-Style Adaptive Classifiers.
J.A.R. Marshall, G. Brown & T. Kovacs (2007). Bayesian Estimation of Rule Accuracy in UCS.