Mail-Order Company in the USA › Would like to find out whether there is a way › To reduce mailing costs › By analyzing past data
Business Objectives: › To find out which customers are good candidates to purchase products › To explore the data to determine the company’s valuable customers
Assess the Situation: › Data: one CSV file from 3 data sources (Census Group A, Census Group B, Tax Filers) › Personnel: six MTech students, minimum experience in data mining › Software: MS Excel, Clementine, Data Scope
Data Mining Goals › Predict which variables affect the customer buying decision › Build models and compare their cost against mailing to randomly chosen customers › Suggest a model that achieves a >1% mailing response
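The comparison against randomly chosen customers can be sketched as a cost-per-response calculation. This is a minimal illustration, not the team's actual evaluation: the unit cost, mailing sizes and responder counts below are hypothetical, and only the ~1% baseline response rate comes from the slides.

```python
def response_rate(responders: int, mailed: int) -> float:
    """Fraction of mailed customers who responded."""
    return responders / mailed


def cost_per_response(mailed: int, responders: int, unit_cost: float) -> float:
    """Total mailing cost divided by the number of responses obtained."""
    return mailed * unit_cost / responders


# Hypothetical campaign figures, for illustration only.
UNIT_COST = 0.50                           # assumed cost of one mail piece
random_mailed, random_resp = 10_000, 100   # ~1% baseline response
model_mailed, model_resp = 10_000, 250     # assumed model-selected mailing

baseline_rate = response_rate(random_resp, random_mailed)
model_rate = response_rate(model_resp, model_mailed)
lift = model_rate / baseline_rate          # how much better than random

random_cost = cost_per_response(random_mailed, random_resp, UNIT_COST)
model_cost = cost_per_response(model_mailed, model_resp, UNIT_COST)

print(f"lift: {lift:.2f}")
print(f"cost per response, random: {random_cost:.2f}, model: {model_cost:.2f}")
```

A model is worth suggesting only if its response rate exceeds the baseline, which is the same as its cost per response being lower for the same unit cost.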
Activities and Days/Resources › Data Preparation: prepare Excel/CSV file (n/a) › Data Understanding: explore each variable, perform some normalizations, derive new useful variables (each team member for 4 days) › Knowledge Discovery: generate decision tree, suggest the most important variables (each team member for 2 days) › Modelling: build predictive model, iterate steps to improve results (each team member for 3 days) › Reporting: consolidate all results (2 persons for 2 days)
First Insights Discovery › Total number of records is 2158 › Distribution by Objective › Distribution by Gender
Data Quality Problems › Some columns are normalized, others are not › All values are numeric, which makes them harder to visualize › Much of the data is incomplete › Recency, number of transactions and dollars of spending are missing for individual products
Describe Data › Gross properties of data: the data is extracted from a larger set with a response rate of ~1%. It contains all 1079 responders and 1079 randomly chosen non-responders › Relationship between attributes: firstmonth and tenure have a linear relationship, thus tenure can be omitted.
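One way to check a claim like the firstmonth/tenure relationship is a Pearson correlation. The sketch below is illustrative only: the five sample values and the 0.95 cut-off are assumptions, not figures from the data set.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy values, assumed for illustration: tenure falls as firstmonth rises.
firstmonth = [1, 13, 25, 40, 61]
tenure = [60, 48, 36, 21, 0]

r = pearson(firstmonth, tenure)
# Assumed rule of thumb: |r| above 0.95 means one field adds little information.
tenure_is_redundant = abs(r) > 0.95
print(f"r = {r:.3f}, drop tenure: {tenure_is_redundant}")
```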
Select Data: variables chosen › Clean Data: some normalizations › Construct Data: choose the variables as input › Data Transformation: rescaling, deriving new variables
Reduce redundancy caused by data integration › Replace lowincome and highincome with IncomeGroup. › Replace gender1, gender2 and gender3 with Gender. › Discard V171 (total taxfilers with unemployment benefits) › Discard V175, V181, V184, V190, V193 and V196; they equal the male data plus the female data
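The redundancy reduction above could look roughly like the following. The label encodings (1/2/3 for Gender, "low"/"high"/"other" for IncomeGroup) and the column names `total`, `male` and `female` are hypothetical stand-ins; the slides do not say how the merged fields were coded or which columns hold the male and female figures.

```python
def collapse_flags(row, flag_labels, default):
    """Return the label of the first indicator column set to 1, else default."""
    for column, label in flag_labels:
        if row.get(column) == 1:
            return label
    return default


def is_redundant_total(rows, total, male, female, tol=1e-9):
    """True when the total column equals male + female in every row."""
    return all(abs(r[total] - (r[male] + r[female])) <= tol for r in rows)


# Two made-up records; "total", "male" and "female" are hypothetical names.
rows = [
    {"gender1": 1, "gender2": 0, "gender3": 0, "lowincome": 0, "highincome": 1,
     "total": 30.0, "male": 12.0, "female": 18.0},
    {"gender1": 0, "gender2": 1, "gender3": 0, "lowincome": 1, "highincome": 0,
     "total": 7.0, "male": 3.0, "female": 4.0},
]

for r in rows:
    # Assumed encodings: 1..3 for gender, text labels for income.
    r["Gender"] = collapse_flags(
        r, [("gender1", 1), ("gender2", 2), ("gender3", 3)], default=0)
    r["IncomeGroup"] = collapse_flags(
        r, [("lowincome", "low"), ("highincome", "high")], default="other")

# A total that is always male + female carries no extra information.
drop_total = is_redundant_total(rows, "total", "male", "female")
print([(r["Gender"], r["IncomeGroup"]) for r in rows], drop_total)
```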
Rescaling › Log() of totalspend and totaltrans to reduce the effect of large values. Derive Data › Derive ActAccInMostRecMon (number of active accounts in the most recent month) from the product recency data › Derive the ratio of low taxfiler income from V156-V163 › Value = V156/sum(V156:V163) › Convert the value to 5 categories.
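The rescaling and the derived income ratio can be sketched as below. Two choices here are assumptions rather than the team's method: `math.log1p` (log of 1 + x) is swapped in for a plain log so that zero spend does not fail, and the 5 categories are taken as equal-width bins on [0, 1], since the slides do not give the cut points.

```python
import math

INCOME_COLUMNS = [f"V{i}" for i in range(156, 164)]  # V156 .. V163


def log_rescale(value: float) -> float:
    """Compress large values; log1p is an assumed variant that accepts 0."""
    return math.log1p(value)


def low_income_ratio(row) -> float:
    """Value = V156 / sum(V156:V163); 0.0 when the region has no tax filers."""
    total = sum(row[c] for c in INCOME_COLUMNS)
    return row["V156"] / total if total else 0.0


def to_category(ratio: float, n_bins: int = 5) -> int:
    """Map a ratio in [0, 1] to a category 1..n_bins (assumed equal-width bins)."""
    return min(int(ratio * n_bins), n_bins - 1) + 1


# Made-up region: 10 tax filers in each of the eight income bands.
region = {c: 10 for c in INCOME_COLUMNS}
ratio = low_income_ratio(region)      # 10 / 80
category = to_category(ratio)
print(ratio, category, log_rescale(0.0))
```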
Histogram of new variable with Objective overlaid
Inverse correlation between English- and French-speaking regions › No region with significant Tagalog, Spanish or other language-speaking populations › Can probably discard amtspanish, amttagalog, amtsingres, amtengnon, amtmultilin › Cluster/segment English/French areas
Linear relationship for English and French across Census A & B › Can merge amtenglish and bhlenglish › Can merge amtfrench and bhlfrench
Linear relationship › Merge acflonepar & bfslonepar › Filter out noisy data
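The slides do not say how the paired Census A and Census B fields were merged. One simple option, shown here purely as an assumption, is to average the two values per region and report the largest disagreement so that noisy regions can be inspected first. The sample values are made up.

```python
def merge_fields(a, b):
    """Merge two near-identical fields by averaging them record by record."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def max_abs_diff(a, b):
    """Largest per-record disagreement between the two source fields."""
    return max(abs(x - y) for x, y in zip(a, b))


# Made-up proportions for three regions.
amtenglish = [0.90, 0.10, 0.55]
bhlenglish = [0.88, 0.12, 0.55]

english = merge_fields(amtenglish, bhlenglish)
gap = max_abs_diff(amtenglish, bhlenglish)
print(english, gap)
```

The same two helpers would apply unchanged to amtfrench and bhlfrench.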
Most data below 0.1 › Objective remains constant throughout › Not important to the business objective, so discard
Lack of data from other age groups › Very specific targeted marketing to the female group › Normalize values from 0 to 0.1 if necessary › Objective improves as proportion increases
Objective clearly improves when afp1child is on the lower end of the normal curve
7 regions with acfwchcom = 0.19 and objective = 1
Most regions have above 60% married couples, assuming normalized data › acftotmar and acfhuswife mirror one another › Can discard either field › Filter noisy data › Categorical: lone-parent and husband-wife
As with the other census and taxfiler data, these data represent the distribution of the region.
There is a similar trend: the amount of construction in the two periods is more or less the same. The sample population represents only a small number of the people covered by the construction data in the region.
Those who do regular maintenance have neither major nor minor repairs
Those who have major repairs tend to have fewer minor repairs.
This sample population represents the majority of people of English or British ethnic origin in the region. Those who have British ethnic origin also have English ethnic origin. There are fewer people of English ethnic origin than of British ethnic origin.
This data represents only a very small number of people of French ethnic origin.
Both have the same trend; some who did not answer for family income answered for household income
Both of them have the same description. Need to check which one is which.
The population sample is mostly locals