
April 11, 2008
Data Mining Competition 2008
The 4th Annual Business Intelligence Symposium
Hualin Wang
Manager of Advanced Analytics, Retail Marketing Insights
Alliance Data, Columbus, Ohio

About Alliance Data
Alliance Data develops data-driven solutions that help partners build lasting relationships with their customers. As one of the largest providers of retail and co-brand card services, loyalty and marketing solutions, payment processing, and business process outsourcing, we serve the retail, petroleum, utility, financial services and hospitality markets.

Approach Summary
Exploratory Data Analysis
- Identify data issues
- Re-code variables
- Transform variables
- Frequency, UNIVARIATE, BIVARIATE, ANOVA analysis, etc.
Modeling Methodology
- LOGISTIC & PROBIT regression models
- Develop a set of regression models of both types on bootstrap samples with a range of weights for responders and non-responders.
Ensemble Models
- Ensemble the set of LOGISTIC & PROBIT models

Exploratory Data Analysis
- Missing Imputation – Substitute missing values with the mean, median, mode, ‘logical’ values, and others based on bivariate results.
  Notes: Twenty variables are formatted differently in the training and test datasets. For example, some variables have the value ‘YE’ in one dataset and ‘YES’ in the other. X2 has the value HILLSBOROUGH in one set and HILLSBOROUG in the other.
- Univariate / Bivariate – Check distributions, extreme values, trends and other patterns.
- Significance Investigation – Conduct contingency table analysis to understand whether character variables and their levels are significant in predicting response.
- Information Value – Compute information values.
- Clustering Analysis – Reveal correlation among numerical variables.
Play the MUSIC gracefully or face it!
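
The slide does not say how the information values were computed. A minimal sketch, assuming the common definition built on weight of evidence (the log ratio of a level's share of responders to its share of non-responders), might look like this. The function name, the sign convention and the `smoothing` adjustment are assumptions for illustration, not part of the presentation.

```python
import math
from collections import Counter

def information_value(levels, responses, smoothing=0.5):
    """Return (iv, woe_by_level) for one character variable.

    levels: category label per observation; responses: 0/1 flag per observation.
    smoothing is an assumed adjustment that avoids log(0) when a level
    has no responders or no non-responders.
    """
    resp, nonresp = Counter(), Counter()
    for level, y in zip(levels, responses):
        if y == 1:
            resp[level] += 1
        else:
            nonresp[level] += 1

    all_levels = sorted(set(resp) | set(nonresp))
    k = len(all_levels)
    total_resp = sum(resp.values()) + smoothing * k
    total_nonresp = sum(nonresp.values()) + smoothing * k

    iv = 0.0
    woe = {}
    for level in all_levels:
        p_resp = (resp[level] + smoothing) / total_resp
        p_non = (nonresp[level] + smoothing) / total_nonresp
        # Weight of evidence: positive when the level is richer in responders.
        woe[level] = math.log(p_resp / p_non)
        # Each level contributes a non-negative amount to the total.
        iv += (p_resp - p_non) * woe[level]
    return iv, woe
```

A variable whose levels all have the same responder share gets an information value of zero under this definition, and larger values indicate stronger separation between responders and non-responders.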

Variable Creation
- Capping – Extreme tails are typically capped to reduce their undue influence and to produce more robust parameter estimates.
- Binning – Small and insignificant levels of character variables are regrouped.
- Box-Cox Transformations – These transformations are commonly included, especially the square root and logarithm.
- Johnson Transformations – Performed on numeric variables to make them more ‘normal’.
- Weight of Evidence – Created for character variables and binned numeric variables.
- Interaction – Explore possible interactions with the help of decision tree analyses.
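
As an illustration of the first and third items, here is a small sketch of percentile capping and the Box-Cox transform. The cut-off percentiles and the nearest-rank percentile rule are assumptions; the presentation does not state which thresholds were used.

```python
import math

def cap(values, lower_pct=0.01, upper_pct=0.99):
    """Clip values to empirical percentiles (nearest-rank rule, an assumption)."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(round(lower_pct * (n - 1)))]
    hi = ordered[int(round(upper_pct * (n - 1)))]
    return [min(max(v, lo), hi) for v in values]

def box_cox(x, lam):
    """Box-Cox transform of a positive value.

    lam == 0 gives the logarithm; lam == 0.5 gives the square root
    up to a shift and scale, i.e. 2 * (sqrt(x) - 1).
    """
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam
```

For example, `cap(list(range(1, 101)) + [10000], 0.0, 0.99)` pulls the single extreme value down to 100 and leaves the other values unchanged.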

Modeling Methodology
Step 1 – Pick an integer from 3 through 16 and draw 10 bootstrap samples.
Step 2 – Develop a LOGISTIC model on each sample with the responders’ weight equal to the integer and the non-responders’ weight equal to 1.
Step 3 – Average the 10 probabilities to produce an ensemble LOGISTIC model.
In this way, we create 14 ensemble LOGISTIC models, one for each integer from 3 through 16.
Steps 4-6 – Similarly, we obtain 14 ensemble PROBIT models.
Together there are 28 models.
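
Steps 1 to 3 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the capitalised model names in the slides suggest statistical-package procedures, whereas the fitting routine here is a simple gradient ascent, and the bootstrap samples are assumed to be drawn with replacement at the size of the training set. All function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Clamp the score so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def predict(beta, xi):
    """Predicted response probability for one observation."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], xi)))

def fit_weighted_logistic(X, y, w, epochs=300, lr=0.5):
    """Gradient ascent on the weighted log-likelihood (intercept first)."""
    beta = [0.0] * (len(X[0]) + 1)
    total_w = sum(w)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for xi, yi, wi in zip(X, y, w):
            err = wi * (yi - predict(beta, xi))
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        beta = [b + lr * g / total_w for b, g in zip(beta, grad)]
    return beta

def ensemble_logistic(X, y, responder_weight, n_samples=10, seed=0):
    """Fit one model per bootstrap sample and average their probabilities."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_samples):
        # Assumption: sample n rows with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        # Responders get the chosen integer weight, non-responders weight 1.
        wb = [float(responder_weight) if yi == 1 else 1.0 for yi in yb]
        models.append(fit_weighted_logistic(Xb, yb, wb))

    def score(xi):
        return sum(predict(m, xi) for m in models) / len(models)
    return score
```

Calling `ensemble_logistic(X, y, k)` once for each `k` in `range(3, 17)` would give the 14 ensemble models of this type. A probit counterpart would replace `sigmoid` with the standard normal CDF and is not shown here.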

Ensemble Models
Use each of the 28 models to rank the 95,960 observations in the test dataset from 95,960 down to 1 in order of decreasing predicted probability, so that the observation with the highest probability receives rank 95,960.
The final score for each observation is the average of its 28 ranks.
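
The rank averaging described on this slide can be sketched as follows. How ties were handled is not stated in the presentation; breaking them by position is an assumption made here, and the function names are hypothetical.

```python
def descending_ranks(probs):
    """Highest probability gets rank N, lowest gets rank 1.

    Ties are broken by original position (an assumption).
    """
    n = len(probs)
    order = sorted(range(n), key=lambda i: probs[i], reverse=True)
    ranks = [0] * n
    for position, i in enumerate(order):
        ranks[i] = n - position
    return ranks

def average_rank_score(model_probs):
    """model_probs: one list of probabilities per model, same observations in each."""
    rank_lists = [descending_ranks(p) for p in model_probs]
    n_models = len(rank_lists)
    n_obs = len(rank_lists[0])
    return [sum(r[i] for r in rank_lists) / n_models for i in range(n_obs)]
```

Averaging ranks rather than raw probabilities puts the LOGISTIC and PROBIT outputs, and models fitted with different responder weights, on a common scale before they are combined.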

What has been considered throughout the process?
The two judgment criteria: the c-statistic and the response rate in the top 10K.
- The response rate in the top 10K requires a model to push the responders to the top as much as possible. The rank-ordering capability in the middle may not be strong.
- The c-statistic criterion requires a model to rank order the whole population. See the chart on the right.
Modeling methods: There are a few options for modeling the response, such as LOGISTIC models, PROBIT models (or any one in the family), decision trees, SVM, TreeNet and neural networks. I decided to use the one that I had used before and that was known to work well in a similar situation. A slight difference is that this time I combined both LOGISTIC and PROBIT models instead of choosing one over the other.
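
To make the two criteria concrete, here is a small sketch of how they could be computed from model scores and 0/1 response flags. It assumes the usual definition of the c-statistic as the share of responder and non-responder pairs in which the responder scores higher, with ties counted as one half. The pairwise loop is quadratic and only meant for small illustrations, and the function names are hypothetical.

```python
def c_statistic(scores, responses):
    """Share of (responder, non-responder) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, responses) if y == 1]
    neg = [s for s, y in zip(scores, responses) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one responder and one non-responder")
    concordant = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))

def top_k_response_rate(scores, responses, k):
    """Response rate among the k highest-scoring observations."""
    if k <= 0:
        raise ValueError("k must be positive")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:k]
    return sum(responses[i] for i in top) / len(top)
```

The first measure depends on the ordering of the whole population, while the second looks only at the head of the list, which is why a model can do well on one and less well on the other.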

My Experiences
- Play the MUSIC gracefully or face it! It usually pays off to develop disciplined procedures to discover and deal with data issues.
- Develop models with different methods and then combine them. In general, ensemble models outperform models built with any single method.
- Spend a good amount of time trying to discover trends, patterns and other true data relationships. Make good use of them in modeling.

Thanks!
Many thanks:
- To the Data Mining Program at the University of Central Florida and BlueCross BlueShield of Florida for organizing and sponsoring the competition. Special thanks to Professor Su for his analytical work and timely responses to our inquiries.
- To the 4th Annual Business Intelligence Symposium for providing the opportunity for us to present and discuss the problem and the competition.