Rotem Golan Department of Computer Science Ben-Gurion University of the Negev, Israel.

Competition overview • What is a Bayesian Network? • Learning Bayesian Networks through evolution • ECOC and recursive entropy-based discretization • Decision trees and C4.5 • A new prediction model • Boosting and K-fold cross validation • References

Competition overview A database of 60 music performers has been prepared for the competition. The material is divided into six categories: classical music, jazz, blues, pop, rock and heavy metal. For each performer, music pieces have been collected. All music pieces are partitioned into 20 segments and parameterized. The feature vector consists of 191 parameters.

Competition overview (Cont.) Our goal is to estimate the music genre of newly given fragments of music tracks. Input: a training set of 12,495 vectors with their genre labels, and a test set of 10,269 vectors without labels. Output: 10,269 labels (Classical, Jazz, Rock, Blues, Metal or Pop), one for each vector in the test set. The metric used for evaluating the solutions is standard accuracy, i.e. the ratio of correctly classified samples to the total number of samples.

Competition overview • What is a Bayesian Network? • Learning Bayesian Networks through evolution • ECOC and recursive entropy-based discretization • Decision trees and C4.5 • A new prediction model • Boosting and K-fold cross validation • References

A long story You have a new burglar alarm installed at home. It is fairly reliable at detecting a burglary, but also responds on occasion to minor earthquakes. You also have two neighbors, John and Mary, who have promised to call you at work when they hear the alarm. John always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm and calls then, too. Mary, on the other hand, likes rather loud music and sometimes misses the alarm altogether. Given the evidence of who has or has not called, we would like to estimate the probability of a burglary.

A short representation

Observations In our algorithm, all the values of the network are known except the genre value, which we would like to estimate. The variables in our algorithm are continuous rather than discrete (except the genre variable). We divide the possible values of each variable into fixed-size intervals; the number of intervals changes throughout the evolution. We refer to this process as the discretization of the variable. We refer to the Conditional Probability Table of each variable (node) as its CPT.
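A minimal Python/NumPy sketch of this kind of fixed-size (equal-width) discretization, assuming the original code worked roughly this way; the function and variable names are illustrative, not taken from the project:

import numpy as np

def discretize(column, n_bins):
    # Map a continuous feature column to interval indices 0..n_bins-1
    # using equal-width intervals over the observed range of the column.
    lo, hi = column.min(), column.max()
    edges = np.linspace(lo, hi, n_bins + 1)
    return np.digitize(column, edges[1:-1])

# Example: 191 features, each with its own (evolved) number of intervals.
X = np.random.rand(1000, 191)                       # placeholder data
bins_per_feature = np.random.randint(5, 16, 191)    # the discretization array
X_disc = np.column_stack([discretize(X[:, j], bins_per_feature[j])
                          for j in range(X.shape[1])])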

Naïve Bayesian Network

Bayesian Network construction Once we have determined the chosen variables (how many and which), their fixed discretization, and the structure of the graph, we can easily compute the CPT values for each node in the graph from the training set. For each vector in the training set, we update all the network's CPTs by increasing the appropriate entry by one. After this pass, we divide each value by the sum of its row (normalization).
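A hedged sketch of this counting-and-normalization pass for the naïve network, in Python/NumPy; the alpha pseudo-count is my own stand-in for whatever zero-handling the original code used (a later slide mentions replacing zeroes with cpt_min/10), and the interface is illustrative:

import numpy as np

def fit_naive_bayes_cpts(X_disc, y, bins_per_feature, n_genres=6, alpha=1.0):
    # One CPT per feature: cpts[j][g, v] = P(feature j falls in interval v | genre g).
    cpts = [np.full((n_genres, b), alpha) for b in bins_per_feature]
    prior = np.full(n_genres, alpha)
    for x, g in zip(X_disc, y):                 # counting pass over the training set
        prior[g] += 1
        for j, v in enumerate(x):
            cpts[j][g, v] += 1
    prior /= prior.sum()                        # normalization
    cpts = [t / t.sum(axis=1, keepdims=True) for t in cpts]
    return prior, cpts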

Exact Inference in Bayesian Networks For each vector in the test set, we compute six different probabilities (multiplying the appropriate entries of all the network's CPTs) and choose the highest one as the genre of this vector. Each probability corresponds to a different assumed value of the genre variable (Rock, Pop, Blues, Jazz, Classical or Metal).
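Continuing the previous sketch, a minimal version of this inference step (log-probabilities are used here only to avoid numerical underflow; the product of CPT entries is the same computation described above):

import numpy as np

def predict_genre(x_disc, prior, cpts):
    # Score each of the six genres by multiplying the matching CPT entries
    # (done here as a sum of logs) and return the index of the best one.
    scores = np.log(prior)
    for j, v in enumerate(x_disc):
        scores = scores + np.log(cpts[j][:, v])
    return int(np.argmax(scores))

genres = ["Classical", "Jazz", "Rock", "Blues", "Metal", "Pop"]
# label = genres[predict_genre(test_vector, prior, cpts)]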

Competition overview • What is a Bayesian Network? • Learning Bayesian Networks through evolution • ECOC and recursive entropy-based discretization • Decision trees and C4.5 • A new prediction model • Boosting and K-fold cross validation • References

Preprocessing I divided the training set into two sets. A training set – used for constructing each Bayesian Network in the population. A validation set – used for computing the fitness of each network in the population. These sets have the same number of vectors for each category (Rock vectors, Pop vectors, etc.).
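A possible sketch of this category-balanced split in Python/NumPy; the 70/30 ratio is an assumption, since the slides do not state the actual proportions:

import numpy as np

def stratified_split(X, y, train_fraction=0.7, seed=0):
    # Split so each genre contributes the same fraction of its vectors
    # to the training part and to the validation part.
    rng = np.random.default_rng(seed)
    train_idx, valid_idx = [], []
    for genre in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == genre))
        cut = int(len(idx) * train_fraction)
        train_idx.extend(idx[:cut])
        valid_idx.extend(idx[cut:])
    return (X[train_idx], y[train_idx]), (X[valid_idx], y[valid_idx])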

The three dimensions of the evolutionary algorithm The three dimensions are: the number of variables, the choice of variables, and the fixed discretization of the variables. Every network in the population is a naïve Bayesian network, which means that its structure is already determined.

Fitness function In order to compute the fitness of a network, we estimate the genre of each vector in the validation set and compare it to its known genre. The metric used for computing the fitness is standard accuracy, i.e. the ratio of correctly classified vectors to the total number of vectors in the validation set.
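A one-function sketch of this fitness computation; network.predict is a hypothetical interface standing in for the CPT-based inference shown earlier:

import numpy as np

def fitness(network, X_valid, y_valid):
    # Standard accuracy: fraction of validation vectors whose predicted
    # genre matches the known genre.
    predictions = np.array([network.predict(x) for x in X_valid])
    return float(np.mean(predictions == y_valid))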

Selection In each generation, we choose at most population_size/2 different networks. We prefer networks that have the highest fitness and are distinct from each other. After choosing these networks, we use them to build a full-sized population by mutating each one of them; we use bitwise mutation to do so. Notice that we may use a mutated network to generate a new mutated network.

Mutation Bitwise mutation: a parent's encoding (its BitSet of selected variables and its discretization array, Dis) is copied and perturbed to produce a child with the same BitSet + Dis layout.
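A hedged sketch of such a bitwise mutation; the flip probability and the interval range are assumptions, not values taken from the project:

import numpy as np

def mutate(bitset, dis, p_flip=0.02, dis_range=(5, 15), seed=None):
    # Flip each bit of the variable-selection BitSet with probability p_flip,
    # and with the same probability redraw that variable's interval count.
    rng = np.random.default_rng(seed)
    child_bits, child_dis = bitset.copy(), dis.copy()
    flips = rng.random(len(bitset)) < p_flip
    child_bits[flips] = ~child_bits[flips]
    redraw = rng.random(len(dis)) < p_flip
    child_dis[redraw] = rng.integers(dis_range[0], dis_range[1] + 1, redraw.sum())
    return child_bits, child_dis

# bitset is a boolean array of length 191; dis holds the per-variable interval counts.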

Crossover Single-point crossover: Parent 1 and Parent 2 exchange the segments of their encodings after a single cut point, producing Child 1 and Child 2.
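A minimal sketch of single-point crossover over one genome array (the same operation would be applied to the BitSet and to the discretization array); names are illustrative:

import numpy as np

def single_point_crossover(parent1, parent2, seed=None):
    # Pick one cut point and swap the tails of the two parents' genomes.
    rng = np.random.default_rng(seed)
    point = rng.integers(1, len(parent1))
    child1 = np.concatenate([parent1[:point], parent2[point:]])
    child2 = np.concatenate([parent2[:point], parent1[point:]])
    return child1, child2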

Results (Cont.) Model - Naive Bayesian; Population size - 40; Generations; Variables - [1,191]; Discretization - [5,15]; First population score; Best score; Test set score; Website's score: preliminary result, final result; "Zeroes" = cpt_min/10

Observation Notice that there is a difference of approximately 10% between my score and the website's score. We will discuss this issue (overfitting) later on.

Adding the fourth dimension The fourth dimension is the structure of the Bayesian network. Now the population includes different Bayesian networks, i.e. networks with different structures, variable choices, variable counts, and discretization arrays.

Evolution operations The selection process is the same as in the previous algorithm. The crossover and mutation are similar: first we proceed as in the previous algorithm (handling the BitSet and the discretization array); then we add all the edges we can from the parent (mutation) or parents (crossover) to the child's graph; finally, we make sure that the child's graph is a connected acyclic graph.
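A sketch of the acyclicity part of that last check (the connectivity check is omitted); this is an assumed implementation, not the project's code:

def try_add_edge(adjacency, u, v):
    # Add the directed edge u -> v only if it keeps the graph acyclic,
    # i.e. only if v cannot already reach u.
    def reachable(src, dst):
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(adjacency.get(node, ()))
        return False

    if reachable(v, u):          # adding u -> v would close a cycle
        return False
    adjacency.setdefault(u, set()).add(v)
    return True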

Results Model - Bayesian Network; Population size - 20; Generations - crashed on generation 104; Variables - [1,191]; Discretization - [2,6]; First population score; Best score - ~; Website's score - it crashed

Memory problems The program was executed on amdsrv3, with a 4.5 GB memory limit. Even though the discretization interval is [2,6], the program crashed with a Java heap space error. As a result, I decided to decrease the population size from 20 to 10.

Results (Cont.) Model - Bayesian Network; Population size - 10; Generations - 800; Variables - [1,191]; Discretization - [2,10]; First population score; Best score; Website's score: preliminary score

Results (Cont.) Model - Bayesian Network; Population size - 10; Generations - 800; Variables - [1,191]; Discretization - [2,20]; First population score; Best score; Website's score: preliminary score

Overfitting As we increase the discretization interval, my score increases and the website's score decreases. One explanation is that enlarging the search space may cause the algorithm to find patterns that correlate strongly with the specific input data I received, while having no correlation at all with real-life data. One possible solution is to use k-fold cross-validation.

Final competition scores

My previous score

Competition overview • What is a Bayesian Network? • Learning Bayesian Networks through evolution • ECOC and recursive entropy-based discretization • Decision trees and C4.5 • A new prediction model • Boosting and K-fold cross validation • References

ECOC

ECOC properties

ECOC (Cont.) The code matrix assigns one codeword to each genre: Rock, Pop, Blues, Jazz, Classical, Metal.
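Since the actual code matrix is not reproduced in the transcript, here is a hypothetical ECOC sketch in Python/NumPy: a random ±1 codeword per genre and Hamming-distance decoding, just to illustrate the mechanism:

import numpy as np

genres = ["Rock", "Pop", "Blues", "Jazz", "Classical", "Metal"]
rng = np.random.default_rng(0)
code_matrix = rng.choice([-1, 1], size=(6, 10))   # hypothetical 6x10 codebook

def ecoc_decode(bit_predictions, code_matrix):
    # Each of the 10 binary classifiers votes -1/+1; pick the genre whose
    # codeword is closest (in Hamming distance) to the predicted bit string.
    distances = np.sum(code_matrix != np.asarray(bit_predictions), axis=1)
    return int(np.argmin(distances))

# Example: genres[ecoc_decode(predicted_bits, code_matrix)]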

Entropy

Recursive minimal entropy partitioning (Fayyad & Irani) The goal of this algorithm is to discretize all numeric attributes in the dataset into nominal attributes. The discretization is performed by selecting a bin boundary that minimizes the entropy of the induced partitions. The method is then applied recursively to both new partitions until a stopping criterion is reached. Fayyad and Irani use the Minimum Description Length principle to determine the stopping criterion.
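A compact sketch of the entropy-minimizing cut selection (the recursion on each side and the MDL stopping test are omitted); plain Shannon entropy in bits and midpoint cut points are assumptions of this sketch:

import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of a class-label array.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_entropy_split(values, labels):
    # Choose the boundary that minimizes the weighted class entropy of the
    # two induced partitions; a full Fayyad-Irani implementation would then
    # recurse on each side until the MDL criterion says to stop.
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_cut, best_score = None, np.inf
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        left, right = labels[:i], labels[i:]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if score < best_score:
            best_cut, best_score = (values[i - 1] + values[i]) / 2, score
    return best_cut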

RMEP (Cont.)

Results

Competition overview • What is a Bayesian Network? • Learning Bayesian Networks through evolution • ECOC and recursive entropy-based discretization • Decision trees and C4.5 • A new prediction model • Boosting and K-fold cross validation • References

Example of a decision tree

C4.5 algorithm

Splitting criteria

Results of C4.5 alone

Competition overview • What is a Bayesian Network? • Learning Bayesian Networks through evolution • ECOC and recursive entropy-based discretization • Decision trees and C4.5 • A new prediction model • Boosting and K-fold cross validation • References

A new prediction model

Results

Results (Cont.)

My new score

Competition overview • What is a Bayesian Network? • Learning Bayesian Networks through evolution • ECOC and recursive entropy-based discretization • Decision trees and C4.5 • A new prediction model • Boosting and K-fold cross validation • References

Boosting (AdaBoost) I've tried to use boosting as a tool for building an ensemble of naïve Bayesian networks. Each of these networks was trained with different training-set weights, according to the AdaBoost algorithm. Intuitively, AdaBoost updates the training-set weights based on the performance of the previously trained networks: it reduces the weights of instances that were correctly predicted by previous networks and increases the weights of instances that were not.
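A hedged sketch of that reweighting loop; the slides describe AdaBoost over naïve Bayesian networks, and the multi-class (SAMME-style) weight formula and the fit_weak_learner callable here are my assumptions, not the project's exact code:

import numpy as np

def adaboost_train(X, y, fit_weak_learner, n_rounds=10):
    # After each round, misclassified instances get larger weights so the
    # next weak learner (e.g. a naive Bayesian network) focuses on them.
    n, K = len(y), len(np.unique(y))
    w = np.full(n, 1.0 / n)
    learners, alphas = [], []
    for _ in range(n_rounds):
        h = fit_weak_learner(X, y, sample_weight=w)
        pred = h.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        if err >= 1 - 1.0 / K:                      # no better than chance: stop
            break
        alpha = np.log((1 - err) / max(err, 1e-12)) + np.log(K - 1)
        w *= np.exp(alpha * (pred != y))            # boost the hard instances
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas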

AdaBoost - training

AdaBoost - testing

AdaBoost - parameters

AdaBoost

K-fold Cross validation Cross-validation is a technique for assessing how the results of a statistical analysis will generalize to an independent data set. In k-fold cross-validation, the original sample is randomly partitioned into k subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data.
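A short sketch of plain k-fold cross-validation; fit and score are hypothetical callables, and k=10 is just a common default rather than a value stated in the slides:

import numpy as np

def k_fold_scores(X, y, fit, score, k=10, seed=0):
    # Each fold is used exactly once as validation data while the other
    # k-1 folds form the training data; return the k scores for averaging.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        valid = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        scores.append(score(model, X[valid], y[valid]))
    return scores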

K-fold Cross validation (Cont.)

Competition overview • What is a Bayesian Network? • Learning Bayesian Networks through evolution • ECOC and recursive entropy-based discretization • Decision trees and C4.5 • A new prediction model • Boosting and K-fold cross validation • References

References