1
Predicting Income of Customers
2
1. Motivations of the study
- Develop models predicting income of customers
- Overcome limitations of statistical models
- Develop models with better performance
- Develop models with better understandability
- Compare statistical models with data mining models
3
Data Mining for Prediction of Customer Income
This paper describes how data mining techniques can improve the accuracy of predicting a customer's income from a database containing information such as name, age, and profession. Currently, a statistical method such as a regression model classifies only 84.03% of cases correctly.
4
2. Methodologies
- Decision tree analysis
- Neural networks
- Regression analysis
5
Decision Tree Analysis
A decision tree algorithm is an inductive learning technique that organizes decision rules into a tree structure to solve classification and prediction problems. The data is split iteratively into regions according to attribute-based criteria.
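The iterative splitting described above can be illustrated with a minimal sketch. It assumes a single continuous attribute and uses Gini impurity (the criterion used by CART) to score candidate thresholds; the slides do not state which criterion the study used, and the function names and toy data here are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the list is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best 'value <= threshold' split."""
    n = len(values)
    best = (None, gini(labels))  # baseline: no split at all
    # Skip the largest value: splitting there would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# Toy data (made up for illustration): age and income class (1 = >50K, 2 = <=50K).
ages = [22, 25, 30, 41, 47, 52]
income = [2, 2, 2, 1, 1, 1]
threshold, impurity = best_split(ages, income)
print(threshold, impurity)
```

A full tree applies the same search recursively to each resulting region until a stopping rule is met.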
6
Decision Tree as a Popular Tool in DM
7
The Form of Decision Tree
8
Decision Tree and Rules
9
Decision Tree Algorithms
Two families of algorithms are used to build decision trees: 1) algorithms from artificial intelligence, such as ID3 (Iterative Dichotomiser 3), C4.5 (mostly used under Unix-based operating systems), and C5.0 (available as a standalone Windows-based rule induction system called See5; Quinlan, 1997); and 2) statistics-based algorithms, namely CART (Classification And Regression Trees) and CHAID (Chi-squared Automatic Interaction Detection).
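As an illustration of how the first family chooses a split, ID3 selects the attribute with the highest information gain, that is, the largest reduction in entropy of the class labels. The sketch below is a minimal version of that calculation for one categorical attribute; the function names and the toy records are hypothetical and are not taken from the study.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy of the labels minus the weighted entropy after grouping by attribute value."""
    groups = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        groups[value].append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy records (made up): income class 1 = >50K, 2 = <=50K.
income = [2, 2, 1, 1]
informative = [1, 1, 2, 2]    # attribute that separates the classes perfectly
uninformative = [1, 2, 1, 2]  # attribute unrelated to the class

gain_informative = information_gain(informative, income)
gain_uninformative = information_gain(uninformative, income)
print(gain_informative, gain_uninformative)
```

CART and CHAID follow the same select-and-split pattern but score candidate splits with Gini impurity and a chi-squared test, respectively.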
10
3. Data Collection
For our study we selected the United States Census (5%) 1990 Public Use Microsample data (Census 1990). This data, which was divided into 18 files, contained the entire 5% sample placed in the public domain from the 1990 U.S. Census, in STATA 6.0 format. Combined, these 18 files included about 4.5 million males and 5 million females, for a reported total of 9.1 million records.
11
Data description and measurement of variables
As a pilot study, this work uses a total of 30,000 records from the public-domain US census data. The dependent variable, income, takes only two class values: > 50K (1) and <= 50K (2).
12
Decision Variables: 14 variables (13 explanatory variables and the income class)
1. age: continuous variable
2. workclass: present working class; Private(1), Self-emp-not-inc(2), Self-emp-inc(3), Federal-gov(4), Local-gov(5), State-gov(6), Without-pay(7), Never-worked(8)
3. education: Bachelors(1), Some-college(2), 11th(3), HS-grad(4), Prof-school(5), Assoc-acdm(6), Assoc-voc(7), 9th(8), 7th-8th(9), 12th(10), Masters(11), 1st-4th(12), 10th(13), Doctorate(14), 5th-6th(15), Preschool(16)
4. edunum: continuous variable (number of years of education completed)
5. ms: marital status; Married-civ-spouse(1), Divorced(2), Never-married(3), Separated(4), Widowed(5), Married-spouse-absent(6), Married-AF-spouse(7)
13
6. occupation: Tech-support(1), Craft-repair(2), Other-service(3), Sales(4), Exec-managerial(5), Prof-specialty(6), Handlers-cleaners(7), Machine-op-inspct(8), Adm-clerical(9), Farming-fishing(10), Transport-moving(11), Priv-house-serv(12), Protective-serv(13), Armed-Forces(14)
7. relationship: Wife(1), Own-child(2), Husband(3), Not-in-family(4), Other-relative(5), Unmarried(6)
8. race: White(1), Asian-Pac-Islander(2), Amer-Indian-Eskimo(3), Other(4), Black(5)
9. sex: Female(1), Male(2)
10. gain: continuous variable (income gain with respect to the previous year)
11. loss: continuous variable (income loss with respect to the previous year)
12. hoursperweek: continuous variable (number of working hours per week)
14
13. country: United-States(1), Cambodia(2), England(3), Puerto-Rico(4), Canada(5), Germany(6), Outlying-US(Guam-USVI-etc)(7), India(8), Japan(9), Greece(10), South(11), China(12), Cuba(13), Iran(14), Honduras(15), Philippines(16), Italy(17), Poland(18), Jamaica(19), Vietnam(20), Mexico(21), Portugal(22), Ireland(23), France(24), Dominican-Republic(25), Laos(26), Ecuador(27), Taiwan(28), Haiti(29), Columbia(30), Hungary(31), Guatemala(32), Nicaragua(33), Scotland(34), Thailand(35), Yugoslavia(36), El-Salvador(37), Trinadad&Tobago(38), Peru(39), Hong(40), Holand-Netherlands(41)
14. income: > 50K (1), <= 50K (2) (dependent output classification variable)
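The integer coding of the categorical variables above can be sketched as a set of lookup tables. This is only an illustration: the code tables are a subset copied from the variable list, while the `encode_record` helper and the sample record are hypothetical and do not come from the study.

```python
# Code tables copied from the variable list (subset only, for brevity).
WORKCLASS = {
    "Private": 1, "Self-emp-not-inc": 2, "Self-emp-inc": 3, "Federal-gov": 4,
    "Local-gov": 5, "State-gov": 6, "Without-pay": 7, "Never-worked": 8,
}
SEX = {"Female": 1, "Male": 2}
INCOME = {">50K": 1, "<=50K": 2}

CODE_TABLES = {"workclass": WORKCLASS, "sex": SEX, "income": INCOME}

def encode_record(record):
    """Replace categorical labels with integer codes; continuous fields pass through."""
    return {
        name: CODE_TABLES[name][value] if name in CODE_TABLES else value
        for name, value in record.items()
    }

# Hypothetical record, made up for illustration.
raw = {"age": 39, "workclass": "State-gov", "sex": "Male",
       "hoursperweek": 40, "income": "<=50K"}
encoded = encode_record(raw)
print(encoded)
```

Note that codes such as these are arbitrary labels, not ordered quantities, so a modeling tool should treat the coded fields as nominal rather than numeric.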
15
3. Tests
30,000 records are randomly selected. Training and testing are conducted using SAS Enterprise Miner 9.1. Internally, the data is divided into a 60% training set and a 40% test set. Three prediction models (decision tree, neural network, and linear regression) are tested and compared for performance.
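The 60/40 partition can be sketched in a few lines. This is not the SAS Enterprise Miner procedure, which the slides do not detail; it is a plain random shuffle-and-cut, and the function name, seed, and use of record indices in place of real records are assumptions for illustration.

```python
import random

def split_records(records, train_fraction=0.6, seed=0):
    """Shuffle a copy of the records and cut it into (train, test) parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Record indices stand in for the 30,000 sampled census records.
record_ids = range(30000)
train, test = split_records(record_ids)
print(len(train), len(test))
```

Each model is then fitted on the training part only and scored on the held-out test part, so the compared accuracies reflect records the models have not seen.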
16
Induced Rules
17
Performance Comparison
18
4. Applications
- Fraud in Tax Reports
- Customer Segmentation
- Target Marketing
- Identifying Churn and Loyal Customers
- Customer Retention
- Customer Acquisition
19
5. Conclusion
The major contribution of this study is predicting customer income with three data mining techniques (decision tree, neural networks, and linear regression) and comparing their classification performance and predictive accuracy on US census data.