Issues with Data Mining
Data Mining involves Generalization
- Data mining (machine learning) learns generalizations of the instances in the training data
  - E.g. a decision tree learnt from the weather data captures generalizations about the prediction of values for the Play attribute
- This means generalizations predict (or describe) the behaviour of instances beyond the training data
- This in turn means knowledge is extracted from raw data using data mining
- This knowledge drives the end-user's decision-making process
Generalization as Search
- The process of generalization can be viewed as searching a space of all possible patterns or models for a pattern that fits the data (a toy sketch of such a search follows below)
- This view provides a standard framework for understanding all data mining techniques
  - E.g. decision tree learning involves searching through all possible decision trees
- Lecture 4 shows two example decision trees that fit the weather data; one of them is a better generalization than the other (Example 2)
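To make the search view concrete, here is a toy sketch (my illustration, not from the lecture): an exhaustive search over a very small pattern space, namely single-attribute rule sets in the style of the 1R algorithm. The function name and the dict-based instance representation are assumptions for illustration.

```python
from collections import Counter, defaultdict

def one_r(instances, target):
    """Exhaustively search the space of single-attribute rule sets:
    for each attribute, map each of its values to the majority class,
    and keep the attribute whose rule set makes the fewest errors."""
    best_attr, best_rules, best_errors = None, None, len(instances) + 1
    for attr in instances[0]:
        if attr == target:
            continue
        by_value = defaultdict(Counter)
        for inst in instances:
            by_value[inst[attr]][inst[target]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(1 for i in instances if rules[i[attr]] != i[target])
        if errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules

# e.g. one_r(weather_instances, "play") returns the single attribute
# whose value -> class rules best fit the training instances
```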
Bias
The important choices made in a data mining system are:
- representation language – the language chosen to represent the patterns or models
- search method – the order in which the space is searched
- model pruning method – the way overfitting to the training data is avoided
This means each data mining scheme involves:
- language bias
- search bias
- overfitting-avoidance bias
Language Bias
- Different languages are used for representing patterns and models, e.g. rules and decision trees
- A concept fits a subset of the training data; that subset can be described as a disjunction of rules
  - E.g. a classifier for the weather data can be represented as a disjunction of rules (see the sketch below)
- Languages differ in their ability to represent patterns and models
- This means that when a language with lower representation ability is used, the data mining system may not achieve good performance
- Domain knowledge (external to the training data) helps to cut down the search space
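As an illustration of a rule-based representation language, a weather-data classifier can be written as a disjunction of rules. The rule set below is a hypothetical sketch; the exact rules learnt depend on the scheme and the training sample.

```python
def play(outlook, humidity, windy):
    """A hypothetical weather classifier expressed as a disjunction
    of rules; the actual rules depend on the learner and the data."""
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    return "yes"  # default: remaining instances fall under "yes"
```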
Search Bias
- An exhaustive search over the search space is computationally expensive
- Search is sped up by using heuristics; e.g. pure child nodes indicate good tree stumps in decision tree learning
- By definition, heuristics cannot guarantee optimal patterns or models: using information gain may mislead us into selecting a suboptimal attribute at the root (a sketch of the heuristic follows below)
- More complex search strategies are possible: those that pursue several alternatives in parallel, and those that allow backtracking
- A high-level search bias:
  - General-to-specific: start with a root node and grow the decision tree to fit the specific data
  - Specific-to-general: choose specific examples in each class, then generalize the class by including k-nearest-neighbour examples
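A minimal sketch of the information-gain heuristic mentioned above (the function names are mine): the attribute that maximizes the gain is chosen for the split, which is exactly where a greedy search can commit to a suboptimal root.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_subsets):
    """Gain = entropy(parent) - weighted average entropy of children."""
    n = len(parent_labels)
    remainder = sum(len(s) / n * entropy(s) for s in child_label_subsets)
    return entropy(parent_labels) - remainder

# e.g. splitting ["yes", "yes", "no", "no"] into [["yes", "yes"], ["no", "no"]]
# gives a gain of 1.0 bit - the heuristic's idea of a good tree stump
```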
Overfitting-avoidance bias
- We want to search for the 'best' patterns and models; simple models are the best
- Two strategies (sketched below):
  - Start with the simplest model and stop building when the model starts to become complex
  - Start with a complex model and prune it to make it simpler
- Each strategy biases the search in a different way
- Biases are unavoidable in practice; each data mining scheme might involve a configuration of biases
- These biases may serve some problems well, but there is no universal best learning scheme! We saw this in our practicals with Weka
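The two strategies map onto familiar knobs in tree learners. A hedged sketch using scikit-learn (my choice of library for illustration; the lecture itself uses Weka):

```python
from sklearn.tree import DecisionTreeClassifier

# Strategy 1 (pre-pruning): stop building the model early by
# capping its complexity before it overfits.
forward = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)

# Strategy 2 (post-pruning): grow a complex tree, then prune it
# back via cost-complexity pruning (larger ccp_alpha = more pruning).
backward = DecisionTreeClassifier(ccp_alpha=0.01)
```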
Combining Multiple Models
- Because there is no ideal data mining scheme, it is useful to combine multiple models
- The idea of democracy: decisions are made based on collective wisdom
- Each data mining scheme acts like an expert, using its knowledge to make decisions
- Three general approaches: bagging, boosting, and stacking
- Bagging and boosting follow the same approach: take a vote on the class prediction from all the different schemes (see the sketch below)
- Bagging uses a simple average of votes, while boosting uses a weighted average: boosting gives more weight to more knowledgeable experts
- Boosting is generally considered the most effective
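A minimal sketch of the two voting styles (the names are illustrative; each "expert" is assumed to be a fitted model exposing a predict method for a single instance):

```python
from collections import Counter

def simple_vote(experts, x):
    """Bagging-style combination: every expert gets an equal vote."""
    return Counter(e.predict(x) for e in experts).most_common(1)[0][0]

def weighted_vote(experts, weights, x):
    """Boosting-style combination: more knowledgeable experts
    get a larger say in the final decision."""
    scores = {}
    for e, w in zip(experts, weights):
        c = e.predict(x)
        scores[c] = scores.get(c, 0.0) + w
    return max(scores, key=scores.get)
```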
Bias-Variance Decomposition
Assume:
- infinitely many training data sets of the same size, n
- infinitely many classifiers trained on the above data sets
For any learning scheme:
- bias = the expected error of the classifier, even after increasing the training data infinitely
- variance = the expected error due to the particular training set used
- total expected error = bias + variance
Combining multiple classifiers decreases the expected error by reducing the variance component.
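One common way to write the decomposition (shown here for squared error at a fixed point x, assuming a noise-free target f; the slide's additive form "bias + variance" matches this once bias is read as squared bias):

```latex
% h_D : classifier learnt from a random training set D of size n
% \bar{h}(x) = \mathbb{E}_D[h_D(x)] : the "average" prediction at x
\mathbb{E}_D\big[(h_D(x) - f(x))^2\big]
  = \underbrace{\big(\bar{h}(x) - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\big[(h_D(x) - \bar{h}(x))^2\big]}_{\text{variance}}
```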
Bagging
- Bagging stands for "bootstrap aggregating": it combines equally weighted predictions from multiple models
- Bagging exploits instability in learning schemes; instability means a small change in the training data results in a big change in the model
- Idealized version for a classifier:
  - Collect several independent training sets
  - Build a classifier from each training set, e.g. learn a decision tree from each
  - The class of a test instance is the prediction that receives the most votes from all the classifiers
- In practice it is not feasible to obtain several independent training sets
Bagging Algorithm
Involves two stages: model generation and classification.

Model generation:
  Let n be the number of instances in the training data.
  For each of t iterations:
    Sample n instances with replacement from the training data.
    Apply the learning algorithm to the sample.
    Store the resulting model.

Classification:
  For each of the t models:
    Predict the class of the instance using the model.
  Return the class that has been predicted most often.
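A minimal runnable sketch of the algorithm above (the learn callback and the list-of-instances representation are my assumptions; a learnt model is assumed to be a callable mapping an instance to a class):

```python
import random
from collections import Counter

def bagging_fit(training_data, learn, t):
    """Model generation: t bootstrap samples of size n, one model each."""
    n = len(training_data)
    models = []
    for _ in range(t):
        # sample n instances *with replacement* from the training data
        sample = [random.choice(training_data) for _ in range(n)]
        models.append(learn(sample))
    return models

def bagging_predict(models, instance):
    """Classification: return the class predicted most often."""
    votes = Counter(model(instance) for model in models)
    return votes.most_common(1)[0][0]
```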
Boosting
- Multiple data mining methods might complement each other, each performing well on a subset of the data
- Boosting combines complementing models using weighted voting
- Boosting is iterative: each new model is built to overcome the deficiencies of the earlier models
- Several variants of boosting exist; AdaBoost.M1 is based on the idea of giving weights to instances
- Boosting involves two stages: model generation and classification
Boosting (AdaBoost.M1)

Model generation:
  Assign equal weight to each training instance.
  For each of t iterations:
    Apply the learning algorithm to the weighted dataset and store the resulting model.
    Compute the error e of the model on the weighted dataset and store it.
    If e = 0 or e >= 0.5: terminate model generation.
    For each instance in the dataset:
      If the instance is classified correctly by the model:
        Multiply the weight of the instance by e/(1-e).
    Normalize the weights of all instances.

Classification:
  Assign a weight of zero to all classes.
  For each of the t (or fewer) models:
    Add -log(e/(1-e)) to the weight of the class predicted by the model.
  Return the class with the highest weight.
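A runnable sketch of the pseudocode (my naming; learn must accept instance weights, and a learnt model is a callable from instance to class). One deviation is noted in a comment: a model with e = 0 is discarded rather than stored, so that -log(e/(1-e)) stays finite at classification time.

```python
import math
from collections import defaultdict

def adaboost_m1_fit(instances, labels, learn, t):
    """Model generation: reweight instances so later models focus
    on the mistakes of earlier ones."""
    n = len(instances)
    weights = [1.0 / n] * n          # equal weight to each instance
    models = []
    for _ in range(t):
        model = learn(instances, labels, weights)
        # error e = total weight of misclassified instances
        e = sum(w for x, y, w in zip(instances, labels, weights)
                if model(x) != y)
        if e == 0 or e >= 0.5:       # terminate model generation
            break                    # (dropping e == 0 avoids log(0) below)
        models.append((model, e))
        for i, (x, y) in enumerate(zip(instances, labels)):
            if model(x) == y:                 # classified correctly:
                weights[i] *= e / (1 - e)     # shrink the weight
        total = sum(weights)
        weights = [w / total for w in weights]  # normalize
    return models

def adaboost_m1_predict(models, x):
    """Classification: weighted vote with weight -log(e/(1-e))."""
    scores = defaultdict(float)
    for model, e in models:
        scores[model(x)] += -math.log(e / (1 - e))
    return max(scores, key=scores.get)
```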
Stacking
- Bagging and boosting combine models of the same type, e.g. a set of decision trees; stacking is applied to models of different types
- Voting may not work when the different models do not perform comparably well: voting is problematic when two out of three classifiers perform poorly
- Stacking uses a metalearner to combine the different base learners (see the sketch below)
  - Base learners: level-0 models
  - Metalearner: level-1 model
- The predictions of the base learners are fed as inputs to the metalearner
- Base-learner predictions on the training data cannot be used as input to the metalearner; instead, cross-validation results from the base learners are used
- Because classification is done by the base learners, metalearners use simple learning schemes
Combining models using Weka
- Weka offers methods to perform bagging, boosting, and stacking over classifiers
- In the Explorer, under the Classify tab, expand the 'meta' section of the hierarchical menu
- AdaBoostM1 (one of the boosting methods) on the Iris data misclassifies only 7 out of 150 instances
- You are encouraged to try these methods on your own