Lecture 4 MARK2039 Winter 2006 George Brown College Wednesday 9-12
2 Assignment 3-
3 Assignment 3 1) continued
4 Assignment 3 2) Listed below is a table containing a number of variables. Explain why each variable is useful or not useful in a future analysis. Table columns: Variable, # of records, Data Field Format, # of unique values, # of missing values. Rows: 1st 3 digits of postal code, 100000, character, 10000; household size, 100000, numeric; Credit score, 100000, numeric; mortgage account, 100000, character, 20; Product code, 100000, character, 50000; Median Income of Postal Code of record, 100000, numeric, 200000
5 Assignment 3 2)continued
6 Recap - Data, Data, Data: What phase of the data mining process are we in? Data formats? –Examples? Data transformations? What do we mean here? Examples? In all data mining projects, what must the final data values be?
7 Recap - Data, Data, Data: What phase of the data mining process are we in? Data types? What are they? What is discrete vs. index vs. continuous, and how do they relate to data type? –Birthdate –Gender –Product category –Model rank –Spending percentile –Income –Promotion Date –Model Score
8 Recap Let’s take a look at postal code. How would you use the info here? Create binary variables for every postal code value. Is there another, better way to group?
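One possible answer to the grouping question is sketched below. It assumes Canadian postal codes and groups them by their first letter before creating the 0/1 variables, so the number of binary columns stays small. The function name `first_letter_dummies` and the sample codes are hypothetical, not from the slides.

```python
def first_letter_dummies(postal_codes):
    """Return one dict of 0/1 indicators per record, one indicator per first letter seen."""
    letters = sorted({pc.strip().upper()[0] for pc in postal_codes})
    rows = []
    for pc in postal_codes:
        first = pc.strip().upper()[0]
        rows.append({f"pc_{letter}": int(first == letter) for letter in letters})
    return rows


# Hypothetical sample: four distinct postal codes collapse into three groups.
codes = ["M5V 2T6", "M4B 1B3", "H2X 1Y4", "V6B 4Y8"]
dummies = first_letter_dummies(codes)
```

With one binary variable per full postal code, this sample would need four columns; grouping by first letter needs three, and the gap grows quickly on a real file.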
9 Types of Data Nominal, Ordinal, Interval. Nominal is basically a yes/no variable, or a variable whose outcomes have no order or sense of magnitude. –Derived variables are coded as 0,1. –Give me an example of this and how you would create a nominal variable for data mining. Assume you are analyzing response rate trends for customers.
10 Types of Data Ordinal –There is order to the values of the variable –Give me examples of this and what it would look like in a data mining exercise. Assume you are analyzing response rate trends for customers. Interval –There is a sense of magnitude between two values. –Give me examples of this and what it would look like in a data mining exercise. Assume you are analyzing response rate trends for customers. How do ordinal and interval differ? Explain it within the context of a data mining exercise where we analyze response.
11 Data Usefulness When is Data Useful? –Few Missing values –Variable does not consist primarily of one value –Non-Numeric Data consists of only a few values which can be properly grouped into more meaningful categories
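The usefulness criteria above can be checked mechanically. The sketch below is only an illustration: the function name `profile_field` and the two thresholds (20% missing, 95% dominated by one value) are assumptions chosen for the example, not values given in the lecture.

```python
from collections import Counter


def profile_field(values, max_missing=0.20, max_dominant=0.95):
    """Summarize one field: missing rate, unique values, share of the most common value."""
    n = len(values)
    present = [v for v in values if v not in (None, "")]
    missing_rate = (n - len(present)) / n if n else 1.0
    counts = Counter(present)
    # Share of non-missing records taken by the single most frequent value.
    dominant_share = counts.most_common(1)[0][1] / len(present) if present else 1.0
    return {
        "missing_rate": missing_rate,
        "unique_values": len(counts),
        "dominant_share": dominant_share,
        # Thresholds are illustrative assumptions, not rules from the lecture.
        "useful": missing_rate <= max_missing and dominant_share <= max_dominant,
    }


# Hypothetical fields: gender has few missing values and two balanced categories;
# the flag field is almost entirely one value.
gender = ["M", "F", "F", "M", "F", None, "M", "F", "M", "F"]
flag = ["Y"] * 99 + ["N"]
gender_profile = profile_field(gender)
flag_profile = profile_field(flag)
```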
12 Examples-Analytical Perspective What fields are useful and why?
13 Examples Closer look at income. Closer look at gender. Create a data mining response rate trend with each variable. For both variables, demonstrate how no response rate trend might exist.
14 Examples Closer look at Customer Type. Closer look at Product Type. Create a data mining response rate trend with each variable. For both variables, demonstrate how no response rate trend might exist.
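A response rate trend is simply the response rate computed within each value of a variable. The sketch below uses invented records (the field names `customer_type`, `gender`, and `responded` are assumptions) in which customer type shows a trend and gender shows none.

```python
from collections import defaultdict


def response_rate_by(records, field):
    """Return {value of field: responders / total records} for each value."""
    totals = defaultdict(int)
    responders = defaultdict(int)
    for rec in records:
        totals[rec[field]] += 1
        responders[rec[field]] += rec["responded"]
    return {value: responders[value] / totals[value] for value in totals}


# Invented sample: responded is 1 for a responder, 0 otherwise.
records = [
    {"customer_type": "new", "gender": "M", "responded": 1},
    {"customer_type": "new", "gender": "F", "responded": 1},
    {"customer_type": "new", "gender": "M", "responded": 1},
    {"customer_type": "new", "gender": "F", "responded": 0},
    {"customer_type": "existing", "gender": "M", "responded": 0},
    {"customer_type": "existing", "gender": "F", "responded": 1},
    {"customer_type": "existing", "gender": "M", "responded": 0},
    {"customer_type": "existing", "gender": "F", "responded": 0},
]

by_type = response_rate_by(records, "customer_type")
by_gender = response_rate_by(records, "gender")
```

Here `by_type` separates the groups (75% vs. 25%), while `by_gender` is flat at 50% for both values, which is what "no response rate trend" looks like.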
15 Examples What variables would be useful here? What would be the number of unique values? What would some of these look like in a data mining response rate analysis exercise?
16 Examples What variables would be useful here? What would some of these look like in a data mining response rate analysis exercise?
17 Examples-Marketing Perspective A mortgage company is conducting a campaign to its high value customers. One of the key characteristics of value is high income, which is self-reported at time of application. As a marketer, how will you use this information and what do you need to consider? What might the results be if you applied this learning to a marketing campaign?
18 Examples-Marketing Perspective An insurance company is marketing an insurance product to people over the age of 60. Listed below is a report indicating the distribution of age. As a marketer, how will you use this information? What might the results be if you applied this learning to a marketing campaign?
19 Examples-Marketing Perspective A retail company has over 1000 product SKUs. After investigation, it has been determined that the 1st digit represents a broader product category. You have been asked to design the product layout for all stores. As a marketer, how will you use this information?
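Rolling SKUs up to the broader category is a one-line derivation. The sketch below assumes SKUs are stored as strings; the SKU values and the name `category_of` are invented for illustration.

```python
from collections import Counter


def category_of(sku):
    """Return the broad product category, taken as the first digit of the SKU."""
    return sku[0]


# Invented SKUs: six products fall into three broad categories.
skus = ["1045", "1203", "2310", "2999", "2001", "7710"]
category_counts = Counter(category_of(sku) for sku in skus)
```

Working with a handful of categories instead of more than 1000 individual SKUs is what makes a layout or response rate analysis tractable.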
20 Examples-Marketing Perspective What can be done here, if anything and what else can we consider in terms of using gender and income information? What might it look like in a data mining exercise?
21 Examples-Marketing Perspective You have postal code information for each customer. You are asked to design customer reports by province. How would you do this? What would this look like in a data mining response rate analysis exercise?
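One way to build the province report, sketched under the assumption that the file holds Canadian postal codes, is to map the first letter of each code to its province or territory. The lookup below follows Canada Post's first-letter postal districts; the function name `province_of` and the sample codes are hypothetical.

```python
from collections import Counter

# First letter of a Canadian postal code -> province/territory.
# "X" covers both the Northwest Territories and Nunavut.
PROVINCE_BY_FIRST_LETTER = {
    "A": "NL", "B": "NS", "C": "PE", "E": "NB",
    "G": "QC", "H": "QC", "J": "QC",
    "K": "ON", "L": "ON", "M": "ON", "N": "ON", "P": "ON",
    "R": "MB", "S": "SK", "T": "AB", "V": "BC",
    "X": "NT/NU", "Y": "YT",
}


def province_of(postal_code):
    """Derive the province from a postal code; unrecognized codes map to 'Unknown'."""
    cleaned = postal_code.strip().upper()
    return PROVINCE_BY_FIRST_LETTER.get(cleaned[:1], "Unknown")


# Hypothetical customer postal codes.
postal_codes = ["M5V 2T6", "K1A 0B1", "H2X 1Y4", "V6B 4Y8", "t2p 1j9"]
customers_by_province = Counter(province_of(pc) for pc in postal_codes)
```

The derived province can then be used as a grouping variable, for example to compare response rates across provinces.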
22 Examples-Data Mining Perspective You have the following variables and values –Gender: 'M': Male, 'F': Female –Income: 'B': <20, 'F': –, 'R': 40-60, 'S': 60-80, 'T': , 'Z': 100+ What must be done here? What would this look like in a data mining response rate analysis exercise?
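Since the final data values must be numeric, the character codes have to be recoded. The sketch below is one possible recoding: gender becomes a 0/1 flag and income becomes an ordinal rank. It assumes the income codes are listed in ascending order of income; the ranges for 'F' and 'T' are not shown on the slide and are not filled in here. The names `GENDER_FLAG`, `INCOME_RANK`, and `encode` are hypothetical.

```python
# Nominal variable -> 0/1 flag.
GENDER_FLAG = {"M": 0, "F": 1}

# Ordinal variable -> rank, assuming the codes are listed from lowest to highest income.
INCOME_RANK = {"B": 1, "F": 2, "R": 3, "S": 4, "T": 5, "Z": 6}


def encode(record):
    """Turn the character codes of one customer record into numeric values."""
    return {
        "is_female": GENDER_FLAG[record["gender"]],
        "income_rank": INCOME_RANK[record["income"]],
    }


encoded = encode({"gender": "F", "income": "R"})
```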
23 Concepts Operational Database –Customer DB –Transactional DB Data Warehouse Data Mart Analysis Flat file vs. OLAP External Data Overlays –Postal Code Overlays –Survey/Registration Data
24 Databases Operational Databases vs. Data Warehouses vs. Data Marts vs. Analytical File Operational data consists of information from the source systems –Customer File –Transaction System –Finance System –Operations –Human Resources –Etc. In practice, what do you think an operations database is really dealing with?
25 Databases Data warehouse –Pulls elements and fields from each source system –May summarize, organize, or aggregate information within each system to present the information in a more meaningful way –Warehouse can comprise information from disparate areas of the company –What do we mean by this?
26 Databases Data mart –Can in many cases be very similar to a data warehouse in the way that information is summarized and aggregated –Pulls elements and fields from each source system –May summarize, organize, or aggregate information within each system to present the information in a more meaningful way –Is usually focused solely on one functional area of the company, e.g. a Marketing data mart Let’s think of some information that might be contained in a data mart.
27 Databases Diagram: Customer, Transaction, Finance, etc. feed the Data Warehouse, which feeds Data Mart - Marketing, Data Mart - Finance, Data Mart - etc.
28 Database Structure In Database Design, most databases are relational –Creates a key which becomes a database index –This index or key becomes the link between different files: Customer, Transaction, Promotion. Customer ID is the link between all the tables. Why do we need to think about the notion of relational?
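The role of the key can be illustrated without a database. In the sketch below, a Python dict keyed on customer ID stands in for the database index, and each transaction or promotion row is linked back to its customer through that key. All table contents and the name `join_on_customer` are invented for illustration.

```python
# Three hypothetical tables that share the customer ID key.
customers = [
    {"cust_id": 1, "income": 55000},
    {"cust_id": 2, "income": 72000},
]
transactions = [
    {"cust_id": 1, "amount": 20.0},
    {"cust_id": 2, "amount": 35.5},
    {"cust_id": 1, "amount": 12.5},
]
promotions = [{"cust_id": 2, "promo": "spring_mailer"}]

# The "index": direct lookup of a customer row by its key.
customer_index = {c["cust_id"]: c for c in customers}


def join_on_customer(rows):
    """Attach customer attributes to each row whose cust_id exists in the customer table."""
    return [
        {**customer_index[row["cust_id"]], **row}
        for row in rows
        if row["cust_id"] in customer_index
    ]


joined_transactions = join_on_customer(transactions)
joined_promotions = join_on_customer(promotions)
```

Because the lookup goes through the index rather than scanning the customer table for every row, the join stays fast as the files grow.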
29 Database Structure Relational DB –Database indexes allow very quick processing of data when joining and merging files together The key in all database design is to create a database that optimizes processing of all information. In database design, you want the right data to be stored which is useful from a data mining perspective From a marketing standpoint, can you think of some examples? Why is this important from a data mining standpoint?
30 Database Structure Other approaches used in speeding up database processing –Inverted flat files This technology allows each field to be indexed Very common amongst the leading-edge DB suppliers today. Is much faster at processing data than traditional relational DB technology Again, why is this relevant from a data mining perspective?
31 Databases Analytical File For most data mining applications, your analytic file needs to be in the format of one record per customer with all known attributes. Generally, the database is not in that format. ECTL – extraction, clean, transform, load – is the process/methodology for preparing data for data mining. Typically a flat file is used for analysis. What do you think is the most important concept for data mining: Databases or Analytical File? How do they work together?
32 Databases File 1: -Cust ID -Income -Age -Household Size File 2: -Cust ID -Trans. Type -Trans. Date -Trans. Amt
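Turning these two files into an analytical file means summarizing File 2 to one row per customer and attaching the result to File 1. The sketch below is a minimal illustration with invented records; the summary fields `trans_count`, `total_amt`, and `last_trans_date` and the function name `build_analytical_file` are assumptions.

```python
from datetime import date

# File 1: one record per customer (invented values).
file1 = [
    {"cust_id": 1, "income": 55000, "age": 34, "household_size": 2},
    {"cust_id": 2, "income": 72000, "age": 51, "household_size": 4},
]
# File 2: many records per customer (invented values).
file2 = [
    {"cust_id": 1, "trans_type": "purchase", "trans_date": date(2005, 11, 15), "trans_amt": 20.0},
    {"cust_id": 1, "trans_type": "purchase", "trans_date": date(2006, 1, 10), "trans_amt": 12.5},
]


def build_analytical_file(customers, transactions):
    """Return one row per customer: File 1 attributes plus a summary of File 2."""
    summary = {}
    for t in transactions:
        s = summary.setdefault(
            t["cust_id"], {"trans_count": 0, "total_amt": 0.0, "last_trans_date": None}
        )
        s["trans_count"] += 1
        s["total_amt"] += t["trans_amt"]
        if s["last_trans_date"] is None or t["trans_date"] > s["last_trans_date"]:
            s["last_trans_date"] = t["trans_date"]
    # Customers with no transactions still get a row, with an empty summary.
    empty = {"trans_count": 0, "total_amt": 0.0, "last_trans_date": None}
    return [{**c, **summary.get(c["cust_id"], empty)} for c in customers]


analytical_file = build_analytical_file(file1, file2)
```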
33 Databases In building databases, the notion of continuity management is important In the context of household or customers on a database, continuity management is the process by which you are able to track customers through events in time. Why is this important?
34 Analytical file All data mining algorithms want their input in tabular form – rows & columns as in a spreadsheet or database table. If we saw data like this, what typically needs to be done? Assume reference number is the customer I.D. What does continuity mean here?
35 What the Data Should Look Like A customer “snapshot” = a single row Each row represents the customer and whatever might be useful for data mining
36 What the Data Should Look Like The columns –Contain data that describe aspects of the customer (e.g., sales $ and quantity for each of product A, B, C) –Contain the results of calculations referred to as derived variables (e.g., total sales $) Derived variables are Total Price in 1st chart and # of months since last purchase in 2nd chart
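Derived variables are computed from columns already on the customer row. The sketch below adds two of them, total sales $ and number of months since last purchase, to an invented row. The field names, the `as_of` reference date, and the choice to count whole calendar months (ignoring the day of the month) are assumptions made for the example.

```python
from datetime import date


def months_between(earlier, later):
    """Count calendar months from earlier to later, ignoring the day of the month."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def add_derived(row, as_of):
    """Return a copy of the customer row with two derived variables added."""
    out = dict(row)
    out["total_sales"] = row["sales_a"] + row["sales_b"] + row["sales_c"]
    out["months_since_last_purchase"] = months_between(row["last_purchase"], as_of)
    return out


# Invented customer snapshot with sales $ for products A, B, and C.
row = {
    "cust_id": 1,
    "sales_a": 100.0,
    "sales_b": 50.0,
    "sales_c": 25.0,
    "last_purchase": date(2005, 11, 15),
}
snapshot = add_derived(row, as_of=date(2006, 2, 1))
```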
37 Sourcing the Data from External Data Sources Typical Data Sources - External Geo-demographic information –Statistics Canada (aggregated level data) Census data Taxfiler data Geo-demographic Cluster Codes –Generation 5 – Mosaic –Equifax -Psyte Survey Data –ICOM
38 Sourcing the Data (Extraction) Census data Data collected every 5 years. Enumeration Area level. ~ 250 households on average. ~ 440 households in large urban areas. ~125 households in rural areas. ~ 50,000 EA’s in Canada Can be converted to postal code level and appended to your file. Type of data -immigration/ethnicity/language patterns -occupation -education -income/gender/age/employment -religion
39 Sourcing the Data (Extraction) Taxfiler data. Data collected every year. Postal walk level. ~ 450 households on average. ~ 26,000 Postal Walks in Canada. Contains data from previous year's tax returns. Income by source and type: employment, investment. RRSP contributions and room, etc. Can also be appended to your files at postal code level.
40 Sourcing the Data (Extraction) Geo-Demographic cluster codes. Uses Stats Can data in most cases plus other external data overlays to determine postal code cluster groups –Quebec farm families –Young and Struggling –Empty Nesters –Upper Income Family-Oriented Equifax –High credit risk –Medium credit Risk –Etc.
41 Sourcing the Data - Stats Can Type Table Columns: Postal Area, Median Income, Avg. Age, Avg. Household Size, % French. One row per postal area (row values not recoverable).
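Appending an overlay of this kind is a lookup on the postal area. The sketch below assumes the postal area is the first three characters of the postal code; the statistics shown are invented placeholder values, not Statistics Canada figures, and the names `STAT_FIELDS` and `append_area_stats` are hypothetical.

```python
STAT_FIELDS = ("median_income", "avg_age", "avg_household_size", "pct_french")

# Invented overlay table keyed by postal area (placeholder numbers only).
area_stats = {
    "M5V": {"median_income": 60000, "avg_age": 34.0, "avg_household_size": 1.8, "pct_french": 2.0},
    "H2X": {"median_income": 45000, "avg_age": 36.5, "avg_household_size": 1.6, "pct_french": 60.0},
}


def append_area_stats(customer, stats_by_area):
    """Attach area-level statistics to one customer; unmatched areas get None values."""
    area = customer["postal_code"].replace(" ", "").upper()[:3]
    stats = stats_by_area.get(area, dict.fromkeys(STAT_FIELDS))
    return {**customer, "postal_area": area, **stats}


matched = append_area_stats({"cust_id": 1, "postal_code": "m5v 2t6"}, area_stats)
unmatched = append_area_stats({"cust_id": 2, "postal_code": "V6B 4Y8"}, area_stats)
```

Note that every customer in the same postal area receives the same values, so these fields describe the neighbourhood rather than the individual.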
42 Sourcing the Data (Extraction) Typical Data Sources - External Business to Business “Firmographics” –SIC, Number of Employees, Revenue etc. –Sources: D&B CBI / InfoCanada Scott’s
43 Sourcing the Data (Extraction) Typical Data Sources - Survey Attitudinal - Needs, preferences, social values, opinions. Behavioral - Buying habits, lifestyle, brand usage. For most data mining projects, we want to assign a value to all customers; therefore the information used must be available for all customers –survey-based information generally cannot be used, as it typically can only be applied to a small portion of the database
44 Sourcing the Data (Extraction) Typical Data Sources - Survey ICOM –Surveys to approx. 10MM Canadians –Fully updated every 2 years –Contains attitudes and purchase behaviours across all industry sectors What do you think the value is here?
45 Examples A marketer wants to target high risk cancels for a retention campaign for a Telco. Information is contained in legacy database systems containing a customer file, transaction file, and call detail file. As a marketer and analyst, answer the following requirements –5 Key Data fields from above files that should be created in analytical exercise –Create a diagram or schema of how this data would be linked into an analytical file –What resources would you need and why? People Software
46 Examples How would the previous example change if the information was available in a data mart or warehouse?
47 Examples A university is conducting a fund raising campaign to its alumni ( members). On its database, it has the following information: –Age of alumnus –Year graduated –Degree and specialization –Donation value –Current Address It has also collected information from a survey. 10% of members have responded to the survey, with the following %’s of members answering the following information: –Current Occupation-5% –Current Income-8% –Why they give?-7% –How much they give As a marketer and analyst, how would you use the information to conduct a campaign to its high value donors?
48 Examples A computer company collects information from all customers who purchase a new product. This new product information is collected through a product registration form which the customer fills in at point of purchase. This information relates to the following: –Product preferences, income, household size and hobbies All customer tombstone information as well as purchase information related to products bought has been summarized and stored on a data mart. As a marketer and analyst, how would you use the information to develop a cross-sell campaign?
49 Examples A credit card company has customers containing tombstone information and detailed transactional information on their database customers have addresses. 10% of customers have responded to a survey in which 5% have indicated that they consider themselves loyal customers. Web activity of these loyal customers indicates that many of them have clicked on travel-related packages. As a marketer and analyst, how would you use this information to sell travel-related insurance?