Predicting Income from Census Data using Multiple Classifiers
Presented by: Arghya Kusum Das, Arnab Ganguly, Manohar Karki, Saikat Basu, Subhajit Sidhanta
CSC 7333 Project, Spring '13, Louisiana State University

Agenda
- Objective
- Data
- Methods
  - Artificial Neural Network
  - Normal Bayes Classifier
  - Decision Trees
  - Boosted Trees
  - Random Forest
- Results
- Comparisons
- Observations

Objective
- Analyze census data to determine certain trends.
- The prediction task is to determine whether a person makes over $50K a year.
- Analyze the accuracy and run time of different machine learning algorithms.

Data
- 48842 instances (train = 32561, test = 16281); 45222 if instances with unknown values are removed (train = 30162, test = 15060)
- Duplicate or conflicting instances: 6
- 2 classes: >50K, <=50K
- Probability of the label '>50K': 23.93% (24.78% without unknowns)
- 14 attributes, both continuous and discrete-valued
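These counts match the UCI Adult dataset files. A minimal loading sketch, assuming the standard adult.data / adult.test files and the UCI column names (the slides do not show any loading code):

```python
import pandas as pd

# Column names from the UCI Adult dataset documentation.
COLS = ["age", "workclass", "fnlwgt", "education", "education-num",
        "marital-status", "occupation", "relationship", "race", "sex",
        "capital-gain", "capital-loss", "hours-per-week", "native-country",
        "income"]

# '?' marks unknown values in the raw files; adult.test has an extra
# header line, hence skiprows=1.
train = pd.read_csv("adult.data", names=COLS, na_values="?",
                    skipinitialspace=True)
test = pd.read_csv("adult.test", names=COLS, na_values="?",
                   skipinitialspace=True, skiprows=1)

print(len(train), len(test))                    # 32561 16281
print(len(train.dropna()), len(test.dropna()))  # 30162 15060
```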

The Attributes
Age, Workclass, fnlwgt, Education, Education-num, Marital-status, Occupation, Relationship, Race, Sex, Capital-gain, Capital-loss, Hours-per-week, Native-country

Data Snapshot (sample records from the dataset; screenshot not reproduced)

Artificial Neural Network
- The sigmoid function is used as the squashing function.
- Three layers; the second and third layers have 10 nodes each.
- Terminate when the no. of epochs exceeds 1000 or the rate of change of the network weights falls below a small threshold.
- Learning rate = 0.1
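A minimal sketch of this configuration, using scikit-learn's MLPClassifier as a stand-in (the slides name neither the library used nor the exact weight-change threshold):

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Sigmoid ("logistic") activations, two 10-node hidden layers,
# learning rate 0.1, at most 1000 epochs. tol approximates the
# slide's weight-change stopping threshold (exact value not given).
ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10, 10),
                  activation="logistic",
                  solver="sgd",
                  learning_rate_init=0.1,
                  max_iter=1000,
                  tol=1e-4,
                  random_state=0),
)
# ann.fit(X_train, y_train) would then train on the encoded census
# features; X_train and y_train are assumed to exist.
```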

Normal Bayes Classifier
- The classifier assumes that:
  - the features are fairly independent in nature, and
  - the attributes are normally distributed.
- It is not necessary for the attributes to be independent, but the classifier yields better results when they are.
- The data distribution function is assumed to be a Gaussian mixture, with one component per class.
- Training data → mean vectors and covariance matrices for every class → predict.
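Per-class mean vectors and covariance matrices are exactly what quadratic discriminant analysis estimates; a sketch using scikit-learn's QuadraticDiscriminantAnalysis as a stand-in, with placeholder data (the slides do not name an implementation):

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# One Gaussian per class: fitting estimates a mean vector and a
# covariance matrix for each class; prediction picks the class with
# the highest posterior under those Gaussians.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))          # placeholder features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # placeholder labels

nb = QuadraticDiscriminantAnalysis(store_covariance=True)
nb.fit(X, y)
print(nb.means_.shape)           # (2, 4): one mean vector per class
print(nb.covariance_[0].shape)   # (4, 4): one covariance matrix per class
```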

Decision Trees
- Regression tree → partitions continuous values.
- Maximum depth of tree = 25
- Minimum sample count = 5
- Maximum no. of categories = 15
- No. of cross-validation folds = 15
- CART (Classification and Regression Trees) is used as the tree algorithm:
  - rules for splitting the data at a node based on the value of a variable,
  - stopping rules for deciding that a node is terminal,
  - prediction of the target variable at terminal nodes.
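A hedged mapping of these settings onto scikit-learn's CART implementation (the category cap and cross-validation-fold pruning resemble OpenCV-style tree parameters and have no direct scikit-learn equivalent):

```python
from sklearn.tree import DecisionTreeClassifier

# CART-style tree with the slide's depth and sample-count limits.
tree = DecisionTreeClassifier(max_depth=25,
                              min_samples_split=5,
                              random_state=0)
```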

Boosted Trees
- The Real AdaBoost algorithm is used.
- Misclassified events → reweight them → build and optimize a new tree with the reweighted events → score each tree → use the tree scores as weights and average over all trees.
- Weak classifier:
  - a classifier with an error rate only slightly better than random guessing;
  - no. of weak classifiers used = 10.
- Trim rate:
  - threshold for eliminating samples with boosting weight < 1 − trim rate;
  - trim rate used = 0.95.
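A sketch of a 10-learner boosted ensemble, using scikit-learn's AdaBoostClassifier as a stand-in for Real AdaBoost (the trim-rate option is an OpenCV-style optimization with no direct equivalent here):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Ten weak learners; decision stumps are the classic "slightly
# better than random guessing" base classifier.
boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=10,
    random_state=0,
)
```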

Random Forest
- Another ensemble learning method: a collection of tree predictors forms a forest.
- It first grows many decision trees. To classify a new object from an input vector:
  1. the vector is classified by each of the trees in the forest, and
  2. the mode of the predicted classes is chosen.
- All the trees are trained with the same parameters but on different training sets.

Random Forest (contd.)
- No. of variables randomly selected at each node and used to find the best split(s) = 4
- Maximum no. of trees in the forest = 100
- Forest accuracy = 0.01
- Terminate when the no. of iterations exceeds 50 or the error percentage exceeds 0.1.
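A sketch of these settings with scikit-learn's RandomForestClassifier (the "forest accuracy" stopping rule looks like an OpenCV-style out-of-bag-error criterion and is only approximated here):

```python
from sklearn.ensemble import RandomForestClassifier

# Up to 100 trees, 4 candidate variables examined at each split.
# scikit-learn has no OOB-error stopping rule; it instead exposes
# oob_score_ for inspection after fitting.
forest = RandomForestClassifier(n_estimators=100,
                                max_features=4,
                                oob_score=True,
                                random_state=0)
```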

Results
Two tables were presented, one with the unknown data included and one with it excluded. For each method (Neural Network, Normal Bayes, Decision Tree, Boosted Tree, Random Forest) they reported the number of correct and wrong classifications, class 0 and class 1 false positives, run time, and accuracy. (The numeric values were not preserved in this transcript.)

Comparisons (unknown data included)

Observations
- Removing irrelevant attributes improves accuracy (curse of dimensionality).
  - Some attributes seemed to have little relevance to salary, for example Race and Sex.
  - Removing these attributes improves accuracy by 0.21% for Decision Trees.
  - For Random Forest, accuracy improves by 0.33%.
  - For Boosted Trees, accuracy falls slightly, by 0.12%.
  - For ANN, accuracy improves by 1.12%.
- Bayes Classifier: removing correlated attributes improves accuracy.
  - Education-num is highly related to Education; removing Education-num improves accuracy by 0.83% (see the sketch below).
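The Education / Education-num redundancy can be verified directly; a small sketch, assuming the UCI adult.data file as above (in that data, education-num is a one-to-one ordinal recoding of education):

```python
import pandas as pd

# Each education category should map to exactly one education-num
# value, confirming the two attributes carry the same information.
cols = ["age", "workclass", "fnlwgt", "education", "education-num",
        "marital-status", "occupation", "relationship", "race", "sex",
        "capital-gain", "capital-loss", "hours-per-week",
        "native-country", "income"]
df = pd.read_csv("adult.data", names=cols, skipinitialspace=True)
print(df.groupby("education")["education-num"].nunique().max())  # 1
```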

Thank you!!!