Predicting Income of Customers

1. Motivations of the study
- Develop models predicting income of customers
- Overcome limitations of statistical models
- Develop models with better performance
- Develop models with better understandability
- Compare statistical models with data mining models

Data Mining for Prediction of Customer Income
This paper describes how data mining techniques can be used to improve the accuracy of predicting a customer's income from a database containing information such as name, age, and profession. Currently, a statistical method such as a regression model classifies only 84.03% of cases correctly.

2. Methodologies
- Decision tree analysis
- Neural networks
- Regression analysis

Decision Tree Analysis
The decision tree algorithm is an inductive learning technique that structures decision-making rules as a tree in order to solve classification and prediction problems. The data is iteratively split into regions based on attribute-based criteria.
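To make the idea of an attribute-based split concrete, here is a minimal sketch of choosing one split point on a continuous attribute by information gain. It is an illustration only, not the procedure used by the software in this study; the function names `entropy` and `best_threshold` and the toy age/income values are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best binary split
    of the form `value <= threshold` on one continuous attribute."""
    base = entropy(labels)
    best = (None, 0.0)
    # The largest value is skipped: it would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical toy data: age and income class (1 = >50K, 2 = <=50K).
ages = [22, 25, 31, 38, 45, 52]
income = [2, 2, 2, 1, 1, 1]
threshold, gain = best_threshold(ages, income)
```

A tree builder would apply this search to every attribute, split on the best one, and repeat on each resulting region until a stopping rule is met.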

Decision Tree as a Popular Tool in DM

The Form of Decision Tree

Decision Tree and Rules

Decision Tree Algorithms
Two families of algorithms are used to build decision trees: 1) algorithms based on artificial intelligence techniques, such as ID3 (Iterative Dichotomiser), C4.5 (mostly used under Unix-based operating systems), and C5.0 (available as a standalone Windows-based rule induction system entitled See5; Quinlan, 1997); and 2) statistics-based algorithms, i.e., CART (Classification And Regression Trees) and CHAID (Chi-square Automatic Interaction Detection).
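The two statistics-based algorithms are commonly associated with different split measures: CART typically uses Gini impurity for classification, while CHAID relies on a chi-square test of independence. The sketch below computes both quantities in plain Python. The contingency counts are hypothetical and are not taken from the study's data.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column variables.
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = sex (Female, Male), columns = income (>50K, <=50K).
table = [[10, 40], [30, 20]]
statistic = chi_square(table)
node_impurity = gini([1, 1, 2, 2])
```

A larger chi-square statistic indicates a stronger association between the candidate attribute and the income class, which is what makes the attribute attractive as a split.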

3. Data Collection
For our study we selected the United States Census (5%) 1990 Public Use Microdata Sample (Census 1990). This data, divided into 18 files, contains the entire 5% sample released into the public domain from the 1990 U.S. Census, in STATA 6.0 format. Combined, these 18 files include about 4.5 million males and 5 million females, totaling 9.1 million records.

Data description and measurement of variables
As a pilot study, this work uses a total of 30,000 records from public-domain US census data. The dependent variable, income, has only two class values: >50K (1) and <=50K (2).

Decision Variables: 14 variables (13 explanatory variables and the income class)
1. age: continuous variable
2. workclass: present working class. Private(1), Self-emp-not-inc(2), Self-emp-inc(3), Federal-gov(4), Local-gov(5), State-gov(6), Without-pay(7), Never-worked(8).
3. education: Bachelors(1), Some-college(2), 11th(3), HS-grad(4), Prof-school(5), Assoc-acdm(6), Assoc-voc(7), 9th(8), 7th-8th(9), 12th(10), Masters(11), 1st-4th(12), 10th(13), Doctorate(14), 5th-6th(15), Preschool(16).
4. edunum: continuous variable (number of years of education completed)
5. ms: marital status. Married-civ-spouse(1), Divorced(2), Never-married(3), Separated(4), Widowed(5), Married-spouse-absent(6), Married-AF-spouse(7).
6. occupation: Tech-support(1), Craft-repair(2), Other-service(3), Sales(4), Exec-managerial(5), Prof-specialty(6), Handlers-cleaners(7), Machine-op-inspct(8), Adm-clerical(9), Farming-fishing(10), Transport-moving(11), Priv-house-serv(12), Protective-serv(13), Armed-Forces(14).
7. relationship: Wife(1), Own-child(2), Husband(3), Not-in-family(4), Other-relative(5), Unmarried(6).
8. race: White(1), Asian-Pac-Islander(2), Amer-Indian-Eskimo(3), Other(4), Black(5).
9. sex: Female(1), Male(2).
10. gain: continuous variable (income gain with respect to the previous year)
11. loss: continuous variable (income loss with respect to the previous year)
12. hoursperweek: continuous variable (number of working hours per week)
13. country: United-States(1), Cambodia(2), England(3), Puerto-Rico(4), Canada(5), Germany(6), Outlying-US(Guam-USVI-etc)(7), India(8), Japan(9), Greece(10), South(11), China(12), Cuba(13), Iran(14), Honduras(15), Philippines(16), Italy(17), Poland(18), Jamaica(19), Vietnam(20), Mexico(21), Portugal(22), Ireland(23), France(24), Dominican-Republic(25), Laos(26), Ecuador(27), Taiwan(28), Haiti(29), Columbia(30), Hungary(31), Guatemala(32), Nicaragua(33), Scotland(34), Thailand(35), Yugoslavia(36), El-Salvador(37), Trinadad&Tobago(38), Peru(39), Hong(40), Holand-Netherlands(41).
14. income: >50K (1), <=50K (2) (dependent output classification variable)
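The coding scheme above can be applied to a raw record with simple lookup tables. The sketch below shows this for three of the categorical variables and for the continuous ones. The record layout (a dictionary of strings keyed by variable name) and the name `encode_record` are assumptions made for illustration, not part of the study.

```python
# Lookup tables copied from the variable list; only three are shown here.
SEX = {"Female": 1, "Male": 2}
RACE = {"White": 1, "Asian-Pac-Islander": 2, "Amer-Indian-Eskimo": 3,
        "Other": 4, "Black": 5}
INCOME = {">50K": 1, "<=50K": 2}

CODEBOOKS = {"sex": SEX, "race": RACE, "income": INCOME}
CONTINUOUS = {"age", "edunum", "gain", "loss", "hoursperweek"}

def encode_record(raw):
    """Map a raw record (dict of strings) to numeric codes.

    Categorical fields are replaced by their integer code and
    continuous fields are parsed as floats. Unknown fields raise KeyError.
    """
    encoded = {}
    for name, value in raw.items():
        if name in CODEBOOKS:
            encoded[name] = CODEBOOKS[name][value.strip()]
        elif name in CONTINUOUS:
            encoded[name] = float(value)
        else:
            raise KeyError(f"no coding rule for field {name!r}")
    return encoded

# Hypothetical raw record, not taken from the census files.
record = {"age": "39", "sex": "Male", "race": "White",
          "hoursperweek": "40", "income": "<=50K"}
encoded = encode_record(record)
```

Integer codes of this kind carry no natural order for nominal variables such as race or country, so a model that treats them as numeric magnitudes may need a different encoding.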

3. Tests
30,000 records are randomly selected. Training and testing are conducted using SAS Enterprise Miner 9.1. Internally, the data is divided into a 60% training set and a 40% test set. Three prediction models (decision tree, neural network, and linear regression) are tested and compared for performance.
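The 60/40 partition and the accuracy measure used to compare models can be sketched in a few lines of Python. This is an illustration of the idea only; the study performed the split inside SAS Enterprise Miner, and the function names and the fixed seed below are assumptions.

```python
import random

def split_60_40(records, seed=0):
    """Shuffle a copy of the records and cut it into 60% training, 40% test."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    cut = len(shuffled) * 60 // 100         # integer arithmetic avoids rounding issues
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Fraction of cases whose predicted class equals the actual class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Placeholder record identifiers standing in for the 30,000 sampled records.
records = list(range(30000))
train, test = split_60_40(records)
```

Each of the three models would be fitted on `train` only, and `accuracy` would then be computed on the predictions for `test`, so that the comparison reflects performance on unseen records.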

Induced Rules

Performance Comparison

4. Applications
- Fraud in Tax Report
- Customer Segmentation
- Target Marketing
- Identifying Churn and Loyal Customers
- Customer Retention
- Customer Acquisition

5. Conclusion
The major contribution of this study is predicting customer income by examining three data mining techniques: decision tree, neural networks, and linear regression. It also compares their classification performance and predictive accuracy on US census data.
