A Machine Learning Analysis of US Census Salary Data.

Presentation transcript:

“Factors” of Income: A Machine Learning Analysis of US Census Salary Data. By Jordan Evans-Kaplan. Econ 490: Machine Learning.

Motivation Studying Census data allows economists to grasp the driving forces behind income. My goal was to classify individuals into two income brackets, “<=$50K” and “>$50K”, and identify the variables most critical for predicting annual income. Among the most effective predictors were: Marital Status, Age, and Capital Gains. Future applications of this work include estimating economic health in regions where income data is unavailable but the predictor variables are. This allows data scientists to analyze the economic health of entire populations, rather than just one component. Potential downside: high computing/processing cost on large datasets.

Literature Review Kohavi, “Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid,” Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. Landmark paper that began the study of Census income classification. Chakrabarty, “A Statistical Approach to Adult Census Income Level Prediction.” Recent paper that set the current record on this dataset at 88.16% accuracy.

What is XGBoost? Stands for: eXtreme Gradient Boosting. XGBoost is a powerful iterative learning algorithm based on gradient boosting. It is robust, highly versatile, and supports custom objective (loss) functions.
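To make the iterative idea concrete, here is a toy, from-scratch sketch of the gradient-boosting principle XGBoost builds on: for squared-error loss, each round fits a depth-1 “stump” to the residuals of the current ensemble and adds it, scaled by a learning rate (eta). This is an illustration only, not XGBoost's actual implementation; the data and function names are made up for the example.

```python
# Toy gradient boosting with stumps (squared-error loss).
# Each round fits a one-split tree to the current residuals.

def fit_stump(x, residuals):
    """Pick the split threshold that minimizes squared error."""
    best = None
    for t in sorted(set(x))[:-1]:  # a split must leave both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, rounds=50, eta=0.3):
    """Additive model: start from the mean, then keep correcting residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)
        stumps.append(stump)
        pred = [pi + eta * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + eta * sum(s(xi) for s in stumps)

# Tiny 1-D example: two clusters of target values
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
model = boost(x, y)
```

After enough rounds, the ensemble's predictions settle near the target values in each cluster, which is the same additive-correction mechanism XGBoost performs at scale with deeper trees and regularization.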

How does XGBoost work? It is a tree-based boosting algorithm. 4 critical parameters for tuning: η (eta), the “learning rate,” or shrinkage applied to each new tree; max_depth, which controls the “height” of the tree via splits; γ (gamma), the minimum loss reduction required for the model to justify a split; and λ (lambda), L2 (ridge) regularization on leaf weights.
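Using XGBoost's actual parameter names, the four knobs above might be collected into a parameter dictionary like the following. The values shown are XGBoost's usual defaults, used here only as illustrative placeholders, not the presentation's tuned settings.

```python
# The four tuning knobs, written with XGBoost's parameter names.
# Values are illustrative placeholders (XGBoost's usual defaults).
params = {
    "objective": "binary:logistic",  # two income brackets
    "eta": 0.3,        # learning rate (shrinkage on each new tree)
    "max_depth": 6,    # maximum tree height via splits
    "gamma": 0.0,      # minimum loss reduction required to split
    "lambda": 1.0,     # L2 (ridge) regularization on leaf weights
}
```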

Why use XGBoost? All of the advantages of gradient boosting, plus more: it is a frequent Kaggle data-competition champion and utilizes CPU parallel processing by default. Two main reasons for use: low runtime and high model performance.

Tuning XGBoost To produce the optimal XGBoost model, a grid search was run over a hyper-grid of candidate parameters. Tuning the model against these possibilities produced the following graph: the figure shows the accuracy in each of the 9 grid cells. The highest accuracy was obtained in the bottom-right quadrant, where gamma is 1 and max tree depth is 3. The blue line represents the shrinkage rate (eta) of 1/3.
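The grid-search loop itself can be sketched as follows. The candidate value lists and the score() function are hypothetical stand-ins; only the winning combination (gamma = 1, max tree depth = 3, with eta fixed at 1/3) comes from the results reported above.

```python
from itertools import product

# Sketch of the grid-search idea: score every (gamma, max_depth)
# pair and keep the most accurate one. Candidate values below are
# assumptions, not the presentation's actual hyper-grid.
eta = 1 / 3                 # shrinkage rate, held fixed
gammas = [0.0, 0.5, 1.0]
max_depths = [3, 5, 7]

def score(gamma, max_depth):
    """Stand-in for cross-validated accuracy of a model trained
    with these parameters; shaped so that the reported winner
    (gamma=1, max_depth=3) comes out on top."""
    return 0.87 - 0.01 * abs(gamma - 1.0) - 0.005 * (max_depth - 3)

best_gamma, best_depth = max(product(gammas, max_depths),
                             key=lambda p: score(*p))
```

In practice score() would train and evaluate a real model on a held-out split for each of the 9 cells, exactly as the figure summarizes.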

Model Results:

Conclusion: XGBoost accuracy before tuning: 84.66%. Tuned XGBoost accuracy: 87.0946%. Models such as these allow for robust income classification and identify the key variables associated with higher income brackets.

Thank You! Special appreciation goes out to the Econ 490 Machine Learning staff: Nazanin Khazra and Abdollah Farhoodi; and to the designer of XGBoost, Tianqi Chen. Personal contact information: Jordan Evans-Kaplan, Email: Evanska2@Illinois.edu