Model Building and Validation
An overview using the discriminant analysis technique

Assumption for this lecture
There are several types of models, but this lecture assumes we are building one with a two-valued (binary) dependent variable.
– e.g. Predict who will respond to a mailing: the dependent variable has two values, responder/non-responder.
– e.g. Predict who is at risk for a heart attack: the dependent variable is had a heart attack/did not have a heart attack.

What will it tell us?
The model is built from past data and generates a score that predicts how likely something is to occur.
– (What is the probability that this person will respond to the mailing?)

The Modeling Process
Sample design
Data collection and cleaning
Sample selection
Data aggregation
Build the model
Test the model

Sample Design
What data do you need?
Where is it?
How much is needed?
What is the dependent variable?

Data Collection and Cleaning
Read and validate the data.
Deal with missing values.
Delete unwanted records and variables.
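
The three cleaning steps can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the column names (id, age, y, notes), the valid age range, and the choice to flag missing values with an indicator column are all hypothetical, since the slides do not say how missing values should be handled.

```python
import csv
import io

# Hypothetical raw extract; column names and values are made up for illustration.
RAW = """id,age,y,notes
1,34,1,ok
2,,0,ok
3,212,0,ok
4,51,x,ok
"""

def clean(text):
    """Read, validate, flag missing values, and drop unwanted records/variables."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        # Delete records whose dependent variable is not a valid 0/1 code.
        if rec["y"] not in ("0", "1"):
            continue
        # Validate age: blank or non-numeric values are treated as missing.
        age = int(rec["age"]) if rec["age"].strip().isdigit() else None
        # Assumption: ages outside 18..110 are data errors, treated as missing.
        if age is not None and not 18 <= age <= 110:
            age = None
        # Keep only the variables needed; "notes" is dropped.
        rows.append({
            "id": int(rec["id"]),
            "age": age,
            "age_missing": int(age is None),
            "y": int(rec["y"]),
        })
    return rows

cleaned = clean(RAW)
```

Keeping an explicit age_missing flag instead of deleting the record is one option among several; it fits the dummy-variable approach used later, but imputing a value or dropping the record may suit other data better.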

Selecting a sample
Choose a sample to analyze.
For 0/1 regression (the discriminant analysis equivalent), use approximately equal numbers of records of each type.
Select twice the number of records you need to build the model, so that you can set aside 50% of the data for validation.
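
A minimal sketch of this sampling rule, assuming records are plain dictionaries with a 0/1 field named y (a hypothetical name). The function draws twice the required number from each class, then puts half in the analysis sample and half in the validation sample.

```python
import random

def balanced_split(records, label_key, n_per_class, seed=0):
    """Draw 2 * n_per_class records per class, then split each class 50/50."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    analysis, validation = [], []
    for value in (0, 1):
        pool = [r for r in records if r[label_key] == value]
        if len(pool) < 2 * n_per_class:
            raise ValueError(f"not enough records with {label_key}={value}")
        drawn = rng.sample(pool, 2 * n_per_class)  # sampling without replacement
        analysis.extend(drawn[:n_per_class])
        validation.extend(drawn[n_per_class:])
    return analysis, validation

# Hypothetical population: 10% responders (y = 1), 90% non-responders (y = 0).
records = [{"id": i, "y": 1 if i % 10 == 0 else 0} for i in range(1000)]
analysis, validation = balanced_split(records, "y", n_per_class=40)
```

Both samples end up with equal numbers of 0s and 1s even though responders are rare in the population, so model scores reflect the balanced sample rather than the true response rate.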

Data Aggregation
Data from multiple sources are merged.
– This may occur as a first step before data cleaning, depending on the situation.
New variables are defined.
– (e.g. ratio of satisfactory trades to total trades)
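
As an illustration, the sketch below merges two hypothetical sources keyed by an account id and derives the ratio of satisfactory trades to total trades. The field names and the decision to leave the ratio undefined when there are no trades are assumptions, not part of the lecture.

```python
# Two hypothetical sources keyed by account id.
accounts = {101: {"age": 34}, 102: {"age": 58}, 103: {"age": 41}}
trades = {
    101: {"sat_trades": 8, "total_trades": 10},
    102: {"sat_trades": 0, "total_trades": 0},
}

def merge_and_derive(accounts, trades):
    """Inner-join the two sources and add the derived sat_ratio variable."""
    merged = []
    for key, acct in accounts.items():
        if key not in trades:
            continue  # no trade history for this account: leave it out
        row = {"id": key, **acct, **trades[key]}
        total = row["total_trades"]
        # Assumption: with zero total trades the ratio is undefined (None).
        row["sat_ratio"] = row["sat_trades"] / total if total else None
        merged.append(row)
    return merged

merged = merge_and_derive(accounts, trades)
```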

Model Building
Break up each independent variable into classes. Each class should have roughly 2 to 10% of the observations.
Run crosstabs of each variable with the dependent variable.
Redefine the independent variable as multiple dummy (0/1) variables.
Run the regression with the dummies.
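
The first three steps can be sketched as follows. The cut points and class labels are hypothetical and chosen only for illustration; in practice they would be set so that each class holds a reasonable share of the observations. Leaving out one class as the reference category is a common convention when the regression includes an intercept, and is an assumption here rather than something the slides state.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical cut points: under 31, 31-39, 40-55, 56 and over.
AGE_EDGES = [31, 40, 56]
AGE_LABELS = ["age_18_30", "age_31_39", "age_40_55", "age_56_plus"]

def age_class(age):
    """Map a raw age to its class label."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]

def crosstab(rows):
    """Count records for every (age class, dependent variable) pair."""
    return Counter((age_class(r["age"]), r["y"]) for r in rows)

def age_dummies(age, reference="age_18_30"):
    """Return 0/1 dummies for every class except the reference class."""
    cls = age_class(age)
    return {label: int(label == cls) for label in AGE_LABELS if label != reference}

# Tiny made-up sample, only to exercise the functions.
rows = [
    {"age": 25, "y": 1}, {"age": 28, "y": 0}, {"age": 35, "y": 1},
    {"age": 45, "y": 0}, {"age": 60, "y": 0},
]
table = crosstab(rows)
dummies = age_dummies(45)
```

The crosstab shows how the 0/1 outcome is distributed within each class, which helps decide whether neighboring classes behave alike and could be combined before the regression is run.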

Example: Data looks like this
Bad/Good (Y) | Age (X1) | # Trades (X2) | Ratio of Sat. Trades to Total Trades (X3), %
[data rows not preserved in the transcript]

It is transformed to look like this:
Bad/Good (Y) | Age01 (18 to 30) (X1) | Age02 (40 to 55) (X2) | Age03 (56+) (X3)

Model Building, contd.
Eliminate variables that are not significant, until you have a model whose variables are all significant and intuitively meaningful.

Testing the model
Perform the Kolmogorov-Smirnov (K-S) test to check how well the model performs on:
– The analysis sample
– The validation sample
– The total sample
If it separates the 0s and the 1s well in each of the three cases, you have a good model.
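
For reference, the two-sample K-S statistic is the largest gap between the cumulative score distributions of the 0s and the 1s. It ranges from 0 (no separation) to 1 (complete separation). The sketch below computes it in plain Python; the scores are made up, and what counts as separating "well" is left to the analyst, since the slides give no threshold.

```python
def ks_statistic(scores_0, scores_1):
    """Largest absolute gap between the two empirical cumulative distributions."""
    if not scores_0 or not scores_1:
        raise ValueError("both groups need at least one score")
    a, b = sorted(scores_0), sorted(scores_1)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every score <= x (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

# Hypothetical model scores for the 0s and the 1s in one sample.
scores_for_0s = [0.10, 0.20, 0.35, 0.40]
scores_for_1s = [0.30, 0.60, 0.70, 0.90]
ks = ks_statistic(scores_for_0s, scores_for_1s)
```

Running the same function on the analysis, validation, and total samples gives the three values to compare. A K-S value that drops sharply from the analysis sample to the validation sample suggests the model is fitted too closely to the analysis data.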