1 Lecture 4 MARK2039 Winter 2006 George Brown College Wednesday 9-12
2 Assignment 4
4 Recap
Where are we within the data mining phase?
Types of data:
–Nominal
–Ordinal
–Interval
What are some key things to look for in determining whether or not a variable is good for data mining analysis?
Databases:
–Why do we need to have some understanding of databases?
–How does a database facilitate the data miner's work?
5 Database Structure
In database design, most databases are relational:
–A key is created which becomes a database index
–This index or key becomes the link between different files
Customer, Transaction and Promotion tables: Customer ID is the link between all the tables
Why do we need to think about the notion of a relational design?
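The idea, sketched minimally below with Python's sqlite3 (the table and column names are illustrative, not the course database): Customer ID is the key shared by every table, and indexing that key is what lets the database link the files quickly.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Each file carries the same Customer ID key (illustrative schema)
cur.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, income REAL, age INTEGER)")
cur.execute("CREATE TABLE transactions (cust_id INTEGER, trans_date TEXT, trans_amt REAL)")
cur.execute("CREATE TABLE promotion (cust_id INTEGER, promo_code TEXT, promo_date TEXT)")

# Indexing the key is what makes joins across the files fast
cur.execute("CREATE INDEX idx_trans_cust ON transactions (cust_id)")
cur.execute("CREATE INDEX idx_promo_cust ON promotion (cust_id)")

# Customer ID links all three tables together
rows = cur.execute("""
    SELECT c.cust_id, c.income, t.trans_amt, p.promo_code
    FROM customer c
    JOIN transactions t ON t.cust_id = c.cust_id
    JOIN promotion p ON p.cust_id = c.cust_id
""").fetchall()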
6 Database Structure
Relational DB:
–Database indexes allow very quick processing of data when joining and merging files together
The key in all database design is to create a database that optimizes the processing of all information
In database design, you want to store the right data, i.e. data that is useful from a data mining perspective
From a marketing standpoint, can you think of some examples?
Why is this important from a data mining standpoint?
7 Database Structure
Other approaches used to speed up database processing:
–Inverted flat files
This technology allows each field to be indexed
Very common amongst the leading-edge DB suppliers today
Much faster at processing data than traditional relational DB technology
Again, why is this relevant from a data mining perspective?
8 Databases and the Analytical File
For most data mining applications, your analytical file needs to be in the format of one record per customer with all known attributes
Generally, the database is not in that format
ECTL (extract, clean, transform, load) is the process/methodology for preparing data for data mining
The result is typically a flat file used for analysis
What do you think is the most important concept for data mining: the database or the analytical file?
How do they work together?
9 Databases
File 1:
-Cust ID
-Income
-Age
-Household Size
File 2:
-Cust ID
-Trans. Type
-Trans. Date
-Trans. Amt
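A minimal sketch of how two such files become a one-record-per-customer analytical file; the file and column names below are illustrative, not the actual course data. The transaction file is rolled up to one row per Cust ID and merged onto the customer file.

import pandas as pd

# Illustrative files: File 1 = customer attributes, File 2 = transactions
customers = pd.read_csv("file1_customers.csv")    # cust_id, income, age, household_size
trans = pd.read_csv("file2_transactions.csv")     # cust_id, trans_type, trans_date, trans_amt
trans["trans_date"] = pd.to_datetime(trans["trans_date"])

# Roll the transactions up to one row per customer
trans_summary = trans.groupby("cust_id").agg(
    total_spend=("trans_amt", "sum"),
    num_trans=("trans_amt", "count"),
    last_trans_date=("trans_date", "max"),
).reset_index()

# One record per customer with all known attributes (left join keeps customers with no transactions)
analytical_file = customers.merge(trans_summary, on="cust_id", how="left")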
10 Databases
In building databases, the notion of continuity management is important
In the context of households or customers on a database, continuity management is the process by which you are able to track customers through events in time
Why is this important?
11 Analytical File
All data mining algorithms want their input in tabular form – rows and columns, as in a spreadsheet or database table
If we saw data like this, what would typically need to be done?
Assume the reference number is the customer I.D.
What does continuity mean here?
12 What the Data Should Look Like
A customer "snapshot" = a single row
Each row represents the customer and whatever might be useful for data mining
13 What the Data Should Look Like
The columns:
–Contain data that describe aspects of the customer (e.g., sales $ and quantity for each of products A, B and C)
–Contain the results of calculations, referred to as derived variables (e.g., total sales $)
The derived variables are Total Price in the 1st chart and # of months since last purchase in the 2nd chart
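A minimal sketch of those two derived variables, assuming illustrative column names (sales_a/sales_b/sales_c for the per-product sales $, last_purchase_date for the most recent purchase) on a one-row-per-customer file:

import pandas as pd

# One row per customer; column names are illustrative
analytical_file = pd.read_csv("analytical_file.csv", parse_dates=["last_purchase_date"])

# Derived variable 1: total sales $ across products A, B and C
analytical_file["total_sales"] = analytical_file[["sales_a", "sales_b", "sales_c"]].sum(axis=1)

# Derived variable 2: # of months since last purchase, relative to a snapshot date
snapshot_date = pd.Timestamp("2006-01-31")   # hypothetical as-of date for the snapshot
days_since = (snapshot_date - analytical_file["last_purchase_date"]).dt.days
analytical_file["months_since_last_purchase"] = (days_since / 30.44).round(1)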
14 Sourcing the Data from External Data Sources
Typical Data Sources - External
Geo-demographic information:
–Statistics Canada (aggregated-level data)
 Census data
 Taxfiler data
Geo-demographic cluster codes:
–Generation 5 – Mosaic
–Equifax – Psyte
Survey data:
–ICOM
15 Sourcing the Data (Extraction)
Census data
Data collected every 5 years at the Enumeration Area (EA) level:
–~250 households on average
–~440 households in large urban areas
–~125 households in rural areas
–~50,000 EA's in Canada
Can be converted to postal code level and appended to your file
Type of data:
-immigration/ethnicity/language patterns
-occupation
-education
-income/gender/age/employment
-religion
16 Sourcing the Data (Extraction)
Taxfiler data
Data collected every year at the postal walk level:
–~450 households on average
–~26,000 postal walks in Canada
Contains data from previous year's tax returns:
-income by source and type (employment, investment)
-RRSP contributions and room, etc.
Can also be appended to your files at postal code level
17 Sourcing the Data (Extraction)
Geo-demographic cluster codes
Use Stats Can data in most cases, plus other external data overlays, to determine postal code cluster groups:
–Quebec Farm Families
–Young and Struggling
–Empty Nesters
–Upper Income Family-Oriented
Equifax:
–High credit risk
–Medium credit risk
–Etc.
18 Sourcing the Data - Stats Can Type Table
Postal Area | Median Income | Avg. Age | Avg. Household Size | % French
Area 1 | 42,000 | 40 | 2 | 10.00%
Area 2 | 50,000 | 35 | 1 | 85.00%
Area 3 | 37,000 | 43 | 3 | 5.00%
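A minimal sketch of how such a table gets appended to the customer file, assuming illustrative file and column names: the Stats Can data is keyed on the postal area, so a simple merge attaches median income, average age, household size and % French to every customer in that area.

import pandas as pd

# Illustrative files: one row per customer, one row per postal area
customers = pd.read_csv("analytical_file.csv")        # includes a postal_area column
statscan = pd.read_csv("statscan_postal_area.csv")    # postal_area, median_income, avg_age, avg_hh_size, pct_french

# Append the aggregated geo-demographic data to every customer in that area
analytical_file = customers.merge(statscan, on="postal_area", how="left")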
19 Sourcing the Data (Extraction)
Typical Data Sources - External
Business-to-business "firmographics":
–SIC, number of employees, revenue, etc.
–Sources:
 D&B
 CBI / InfoCanada
 Scott's
20 Sourcing the Data (Extraction)
Typical Data Sources - Survey
Attitudinal: needs, preferences, social values, opinions
Behavioural: buying habits, lifestyle, brand usage
For most data mining projects, we want to assign a value to all customers; therefore the information used must be available for all customers
–Survey-based information generally cannot be used, as it typically applies to only a small portion of the database
21 Sourcing the Data (Extraction)
Typical Data Sources - Survey
ICOM:
–Surveys sent to approx. 10MM Canadians
–Fully updated every 2 years
–Contains attitudinal and purchase behaviours across all industry sectors
What do you think the value is here?
22 Examples
A marketer wants to target high-risk cancels for a retention campaign for a telco. The information is contained in legacy database systems comprising a customer file, a transaction file and a call detail file. As a marketer and analyst, address the following requirements:
–5 key data fields from the above files that should be created in the analytical exercise
–Create a diagram or schema of how this data would be linked into an analytical file
–What resources would you need and why?
 People
 Software
23 Examples
How would the previous example change if the information was available in a data mart or warehouse?
24 Examples
A university is conducting a fundraising campaign to its alumni (100,000 members). On its database, it has the following information:
–Age of alumnus
–Year graduated
–Degree and specialization
–Donation value
–Current address
It has also collected information from a survey. 10% of members have responded to the survey, with the following percentages of members answering each question:
–Current occupation - 5%
–Current income - 8%
–Why they give - 7%
–How much they give
As a marketer and analyst, how would you use this information to conduct a campaign to its high-value donors?
25 Examples
A computer company collects information from all customers who purchase a new product. This new-product information is collected through a product registration form which the customer fills in at point of purchase. The information relates to the following:
–Product preferences, income, household size and hobbies
All customer tombstone information, as well as purchase information related to products bought, has been summarized and stored on a data mart.
As a marketer and analyst, how would you use this information to develop a cross-sell campaign?
26 Examples
A credit card company has 100,000 customers with tombstone information and detailed transactional information on its database. 50,000 customers have email addresses. 10% of the 50,000 customers have responded to a survey, in which 5% have indicated that they consider themselves loyal customers. Web activity of these loyal customers indicates that many of them have clicked on travel-related packages.
Database information contains:
–Age, gender, income, where they spend, recency of spend, frequency of spend and amount of spend
As a marketer and analyst, how would you use this information to sell travel-related insurance?
27 Creating the Analytical File - Reviewing Data Dumps
Initial dump of the 1st few records
28 Creating the Analytical File - Reviewing Data Dumps
Initial dump of the 1st few records
29 Creating the Analytical File - Reviewing Data Dumps
View of the Transaction File
30 Creating the Analytical File - Reviewing Data Dumps
View of the Promo History File
31 Creating the Analytical File - Reviewing Data Dumps
Using your marketing knowledge, give examples of variables that we might create from the last three slides:
–Slide 14
–Slide 15
–Slide 16
32 Creating the Analytical File - Data Hygiene and Cleansing
Once the data has been dumped in order to view records, data hygiene and cleansing typically have to take place
Two key deliverables:
–Clean name and address information
–Standard rules for coding of data values
33 Creating the Analytical File - Data Hygiene and Cleansing
Clean name and address information:
–Market to the right individual
–Create match keys
34 Creating the Analytical File - Name and Address Standardization
Clean name and address information:
–Market to the right individual
–Create match keys
–Name and address standardization
Raw record:
BankID: 987654321
Name: JONH SMITH JR.
Address1: 123 WILLIAMS STRET
Address2: 2ND FLOOR
Address3: TRT., O.N. M5G-1F3
Country: CDN
UnIndivID: 123456789
After parsing into standard fields (values not yet corrected):
BankID: 987654321
PreName:
FirstName:
Surname: JONH SMITH JR.
PostName:
Street1: 123 WILLIAMS STRET
Street2: 2ND FLOOR
City: TRT
Province: O.N.
Postal Code: M5G-1F3
Country: CANADA
UnIndivID: 123456789
Origin: Bank
35 Creating the Analytical File - Name and Address Standardization
DATA CLEANING: address correction, name parsing, genderizing, casing
Before cleaning:
BankID: 987654321
PreName:
FirstName:
Surname: JONH SMITH JR.
PostName:
Street1: 123 WILLIAMS STRET
Street2: 2ND FLOOR
City: TRT
Province: O.N.
Postal Code: M5G-1F3
Country: CANADA
UnIndivID: 123456789
Origin: Bank
After cleaning:
BankID: 987654321
PreName: Mr.
FirstName: John
Surname: Smith
PostName: Jr.
Street1: 200-123 Williams Street
Street2:
City: Toronto
Province: ON
Postal Code: M5G 1F3
Country: Canada
UnIndivID: 123456789
Origin: Bank
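In practice this cleaning is done with specialized address hygiene software, but a minimal sketch of two of the steps (name parsing and postal code casing) is shown below in Python; the prename/postname lists and rules are illustrative, and spelling correction and genderizing would need reference tables not shown here.

import re

PRENAMES = {"MR", "MRS", "MS", "DR"}
POSTNAMES = {"JR", "SR", "II", "III"}

def parse_name(raw):
    # Split a raw name string into prename, first name, surname and postname, with proper casing
    tokens = [t.strip(".") for t in raw.upper().split()]
    prename = tokens.pop(0).title() + "." if tokens and tokens[0] in PRENAMES else ""
    postname = tokens.pop().title() + "." if tokens and tokens[-1] in POSTNAMES else ""
    first = tokens[0].title() if tokens else ""
    surname = " ".join(t.title() for t in tokens[1:])
    return {"PreName": prename, "FirstName": first, "Surname": surname, "PostName": postname}

def clean_postal_code(raw):
    # Standardize a Canadian postal code to the "A1A 1A1" layout
    pc = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    return pc[:3] + " " + pc[3:] if len(pc) == 6 else raw

record = parse_name("JONH SMITH JR.")            # note: fixing the JONH/John misspelling needs reference data
record["Postal Code"] = clean_postal_code("M5G-1F3")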
36 Creating the Analytical File - Merge/Purge of Names
What are the reasons for creating unique customer match keys?
–Generating a marketing list
–Conducting analysis
Should the match keys be the same for both of the above scenarios?
What are the situations where match keys are numeric?
37 Creating the Analytical File - Merge/Purge of Names
Common fields to use in creating match keys:
–First name, surname, unique individual ID, postal code
–Credit card number
–DUNS number for businesses
–Phone number
Unique IDs or numeric IDs are the preferred choice when creating match keys
Let's take a closer look at creating match keys using name and address
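A minimal sketch of a name-and-address match key, under illustrative rules (first initial + first five letters of the surname + postal code); a production merge/purge tool would use fuzzier matching than this.

import re

def match_key(first_name, surname, postal_code):
    # Illustrative rule: first initial + first 5 letters of surname + cleaned postal code
    pc = re.sub(r"[^A-Za-z0-9]", "", postal_code).upper()
    return (first_name[:1] + surname[:5]).upper() + pc

# Records with small variations collapse to the same key and can be merged/purged
print(match_key("Richard", "Boire", "H4B 2E5"))   # RBOIREH4B2E5
print(match_key("Rick", "Boire", "H4B-2E5"))      # RBOIREH4B2E5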
38 Creating the Analytical File - Merge/Purge of Names
Let's take a look at 6 records and see what this means.
39 Creating the Analytical File - Merge/Purge of Names
Example: you have one record here:
–Richard Boire - 4628 Mayfair Ave. H4B2E5
–How would you use the above information for a back-end analysis if I were a responder to an acquisition campaign?
–What about if you were conducting analysis on me as an existing customer who responded to a cross-sell campaign?
–How about if you wanted to send me a direct mail piece?
40 Creating the Analytical File - Data Standardization
Refers to a process where values of a common variable from different files are mapped to the same value. Some common examples:
SIC Code Industry Classification Table
–Industry categories have a common set of codes
Postal Code variable
–A postal code has to have 6 characters in the pattern alpha, numeric, alpha, numeric, alpha, numeric, where the alphas exclude the letters D, F, I, O, Q and U
Give some examples of bad postal codes vs. good postal codes.
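A minimal sketch of checking postal codes against that pattern; the helper below is hypothetical and simply encodes the letter rules above (with W and Z additionally excluded from the first position, matching the list of valid first letters on the next slide).

import re

# alpha-numeric-alpha numeric-alpha-numeric; D, F, I, O, Q, U never appear
POSTAL_RE = re.compile(r"^[ABCEGHJKLMNPRSTVXY][0-9][A-CEGHJ-NPR-TV-Z] ?[0-9][A-CEGHJ-NPR-TV-Z][0-9]$")

def is_valid_postal_code(pc):
    return bool(POSTAL_RE.match(pc.strip().upper()))

print(is_valid_postal_code("M5G 1X3"))   # True: good postal code
print(is_valid_postal_code("M5G-1X3"))   # False: hyphen instead of a space
print(is_valid_postal_code("D5G 1X3"))   # False: D is not a valid first letter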
41 Creating the Analytical File - Data Standardization
Here is an example of how disposition codes for telemarketing outcomes might be handled
42 Creating the Analytical File - Data Standardization
Postal code standardization:
–Six-character code in the pattern alpha, numeric, alpha, numeric, alpha, numeric
–Valid 1st letters: A, B, C, E, G, H, J, K, L, M, N, P, R, S, T, V, X, Y
SIC (Standard Industrial Classification) code:
–4-digit code used to classify all companies into a standard set of industries
43 Creating the Analytical File - Data Standardization
Example:
–You have been asked to build a retention model
–You have two years' worth of transaction data; changes in the product category codes occurred six months ago
–Key information that you would look at would be as follows:
 Income category
 Product category
 Transaction codes
 Transaction amount
 Postal code
 Transaction date
 Gender
What would you need to do?
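One likely step, sketched below: recode the old product category codes so that the full two years of transactions use one scheme. The mapping table, file name and cutover date here are purely hypothetical.

import pandas as pd

# Hypothetical mapping from the old product category codes to the new scheme
OLD_TO_NEW = {"A1": "ELEC", "A2": "ELEC", "B1": "HOME", "C9": "OTHER"}

trans = pd.read_csv("transactions_2yrs.csv", parse_dates=["trans_date"])
cutover = pd.Timestamp("2005-07-01")   # hypothetical date the codes changed

# Recode only the rows that still carry the old codes
old_rows = trans["trans_date"] < cutover
trans.loc[old_rows, "product_category"] = trans.loc[old_rows, "product_category"].replace(OLD_TO_NEW)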
44 Creating the Analytical File - Geocoding
Geocoding is the process that assigns a latitude-longitude coordinate to an address. Once a latitude-longitude coordinate is assigned, the address can be displayed on a map or used in a spatial search.
Data miners often use these coordinates to calculate such things as "distance to the nearest store".
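A minimal sketch of the "distance to the nearest store" calculation, assuming the customer and store addresses have already been geocoded; the coordinates below are hypothetical, and the haversine formula gives the great-circle distance in kilometres.

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two latitude/longitude points, in kilometres
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = math.sin((lat2 - lat1) / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

stores = [(43.6532, -79.3832), (45.5019, -73.5674)]   # hypothetical store coordinates
customer = (43.7000, -79.4000)                        # hypothetical geocoded customer address

# Derived variable: distance to the nearest store
dist_to_nearest_store = min(haversine_km(*customer, *store) for store in stores)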
45 GeoProfile example: demographic analysis around a store location (population count, age distribution, average age)
46 Creating the Analytical File - What is Geocoding?
Let's look at a sample of what some of this data might look like.
How do we use this data to create meaningful variables?
47 Creating the Analytical File - What is Geocoding?
Example:
–A retailer has the following information:
 Name and address of its customers
 Address of its stores
 Stats Can information
–As a marketer, how would you intelligently use this information?
48 Customer Profiling - Frequency Distribution
The report below uses the first digit of the postal code to assign customers to a region. For example, postal codes beginning with 'G', 'H' or 'J' represent the Quebec region.
Region | # of Customers | % of Total
Prairie Provinces | 25 M | 2.5%
Quebec | 100 M | 10%
Ontario | 350 M | 35%
West | 25 M | 2.5%
Missing Values | 500 M | 50%
Total | 1 MM | 100%
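A minimal sketch of producing that kind of report, assuming an illustrative postal_code column; the first letter drives the region assignment, and missing or unrecognized codes fall into "Missing Values". The mapping below is illustrative and not the complete Canada Post scheme.

import pandas as pd

# Illustrative first-letter-to-region mapping
REGION_BY_FIRST_LETTER = {
    "G": "Quebec", "H": "Quebec", "J": "Quebec",
    "K": "Ontario", "L": "Ontario", "M": "Ontario", "N": "Ontario", "P": "Ontario",
    "R": "Prairie Provinces", "S": "Prairie Provinces",
    "T": "West", "V": "West",
}

customers = pd.read_csv("analytical_file.csv")   # includes a postal_code column
first_letter = customers["postal_code"].fillna("").astype(str).str.strip().str.upper().str[:1]
customers["region"] = first_letter.map(REGION_BY_FIRST_LETTER).fillna("Missing Values")

# Frequency distribution: counts and % of total by region
report = customers["region"].value_counts().to_frame("num_customers")
report["pct_of_total"] = (100 * report["num_customers"] / len(customers)).round(1)
print(report)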
49 Frequency Distribution
50 Frequency Distribution