Statistical Learning for Complex Survey Data:


Statistical Learning for Complex Survey Data: Using Cross-Validation for Variable Selection in Generalized Linear Models Darryl V. Creel

I want to develop a model. How do I determine which independent variables I should include in my model?

P-value based approaches
  Forward selection
  Backward elimination
  Stepwise
  Hosmer-Lemeshow
Relative quality statistics (indirect estimation)
  Akaike information criterion
  Bayesian information criterion
  Mallows' Cp
Direct estimation of the model error
  Validation data set
  V-fold cross-validation

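Reminder: what is v-fold cross-validation? Here is an example of 5-fold cross-validation.

Iteration  Fold = 1  Fold = 2  Fold = 3  Fold = 4  Fold = 5
1          Test      Training  Training  Training  Training
2          Training  Test      Training  Training  Training
3          Training  Training  Test      Training  Training
4          Training  Training  Training  Test      Training
5          Training  Training  Training  Training  Test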
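The fold layout in this reminder can be sketched in a few lines of Python. This is a minimal illustration rather than the presenter's code: the function names `assign_folds` and `train_test_splits` are hypothetical, and folds are assigned completely at random, which ignores the survey design.

```python
import random

def assign_folds(n, v, seed=0):
    """Return n fold labels (1..v) in random order, with fold sizes as equal as possible."""
    labels = [(i % v) + 1 for i in range(n)]
    random.Random(seed).shuffle(labels)
    return labels

def train_test_splits(labels, v):
    """Yield (iteration, training indices, test indices); iteration i tests on fold i."""
    for fold in range(1, v + 1):
        test = [i for i, f in enumerate(labels) if f == fold]
        train = [i for i, f in enumerate(labels) if f != fold]
        yield fold, train, test

labels = assign_folds(n=20, v=5)
for iteration, train, test in train_test_splits(labels, 5):
    # Each iteration holds out one fold (4 units) and trains on the other four (16 units).
    print(iteration, len(train), len(test))
```

Every unit is used for testing exactly once and for training in the other v - 1 iterations, which is what the Test/Training pattern in the table expresses.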

How was the data generated?

Population size = 10,000 and sample size = 400, selected using Poisson sampling.
Seven independent N(0,1) random variables: x1-x6 and error.
Three coefficients, b1-b3, drawn from N(0, 0.16).
psi makes the error homoscedastic (psi = 0) or heteroscedastic (psi = 0.2 or 0.5).
Y = 1 + b1*x1 + b2*x2 + b3*x3 + (1 + psi*x1 + psi*x2)*error
Informative sampling: the probability of selection (POS) depends on Y.

  z <- N(1+y, 0.25)                    # pseudocode: z is normally distributed around 1 + y
  k <- 1/(1+exp(2.5-0.5*z))            # size variable
  sumPopK <- sum(k)                    # population total of the size variable
  probSel <- sampSize*k/sumPopK        # selection probability proportional to k
  ranUni <- runif(n = popSize, min = 0, max = 1)
  sampInd <- ifelse(ranUni <= probSel, 1, 0)     # Poisson sampling: independent draw per unit
  psuWt <- ifelse(sampInd == 1, 1/probSel, 0)    # design weight = inverse selection probability
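For readers who want to experiment with this design, here is a rough Python translation of the steps above, using only the standard library. It is a sketch under assumptions, not the presenter's simulation: the second argument of N(., .) is read as a variance (so N(0, 0.16) has standard deviation 0.4 and N(1+y, 0.25) has standard deviation 0.5), and the function names and seeds are hypothetical.

```python
import math
import random

def generate_population(pop_size=10_000, psi=0.2, seed=1):
    """Generate the population described on the slide (variance reading of N is an assumption)."""
    rng = random.Random(seed)
    b = [rng.gauss(0.0, 0.4) for _ in range(3)]          # b1-b3 from N(0, 0.16), sd = 0.4
    population = []
    for _ in range(pop_size):
        x = [rng.gauss(0.0, 1.0) for _ in range(6)]      # x1-x6
        error = rng.gauss(0.0, 1.0)
        y = (1 + b[0] * x[0] + b[1] * x[1] + b[2] * x[2]
             + (1 + psi * x[0] + psi * x[1]) * error)
        z = rng.gauss(1 + y, 0.5)                        # z from N(1 + y, 0.25), sd = 0.5
        k = 1 / (1 + math.exp(2.5 - 0.5 * z))            # size variable
        population.append({"x": x, "y": y, "k": k})
    return population

def poisson_sample(population, samp_size=400, seed=2):
    """Select each unit independently with probability proportional to its size variable."""
    rng = random.Random(seed)
    sum_pop_k = sum(unit["k"] for unit in population)
    sample = []
    for unit in population:
        prob_sel = samp_size * unit["k"] / sum_pop_k
        if rng.random() <= prob_sel:
            sample.append({**unit, "prob_sel": prob_sel, "weight": 1 / prob_sel})
    return sample

population = generate_population()
sample = poisson_sample(population)
print(len(sample), round(sum(unit["weight"] for unit in sample)))
```

Because each unit is drawn independently, the realized sample size varies around 400 rather than equaling it, and the weights sum to roughly, not exactly, the population size.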

What does the distribution of cross-validation mean square error look like treating the data as if it was from a simple random sample?

What is the minimum mean cross-validation error treating the data as if it was from a simple random sample?

numVars  cvErrorMin
1        1.33858
2        1.18070
3        1.04503
4        1.04180
5        1.03917
6        1.03687

What are the first ten models with the lowest model mean cross-validation mean square error treating the data as if it was from a simple random sample?

ModelVars                     numVars  cvErrorMean
x1 + x2 + x3 + x4 + x5 + x6   6        1.03687
x1 + x2 + x3 + x4 + x5        5        1.03917
x1 + x2 + x3 + x4 + x6        5        1.03949
x1 + x2 + x3 + x5 + x6        5        1.04010
x1 + x2 + x3 + x4             4        1.04180
x1 + x2 + x3 + x5             4        1.04238
x1 + x2 + x3 + x6             4        1.04274
x1 + x2 + x3                  3        1.04503
x1 + x2 + x4 + x5 + x6        5        1.17146
x1 + x2 + x4 + x5             4        1.17421
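A ranking like the one above can be produced by fitting every subset of the candidate variables and scoring each by its cross-validation mean square error. The sketch below is one plausible way to do that and is not taken from the presentation: it uses unweighted ordinary least squares, a single random fold assignment, a small made-up data set with four candidate variables, and hypothetical function names.

```python
import itertools
import random

def solve(a, b):
    """Solve the linear system a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - s) / m[r][r]
    return beta

def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

def cv_mse(x, y, cols, folds, v):
    """Average over the v folds of the test-fold mean square error for one model."""
    design = [[1.0] + [row[c] for c in cols] for row in x]   # intercept plus chosen columns
    fold_mse = []
    for fold in range(v):
        train = [i for i, f in enumerate(folds) if f != fold]
        test = [i for i, f in enumerate(folds) if f == fold]
        beta = fit_ols([design[i] for i in train], [y[i] for i in train])
        sq = [(y[i] - sum(b * d for b, d in zip(beta, design[i]))) ** 2 for i in test]
        fold_mse.append(sum(sq) / len(sq))
    return sum(fold_mse) / v

def rank_models(x, y, v=5, seed=0):
    """Score every non-empty subset of columns and return (error, size, name), best first."""
    n, p = len(x), len(x[0])
    folds = [i % v for i in range(n)]
    random.Random(seed).shuffle(folds)
    results = []
    for size in range(1, p + 1):
        for cols in itertools.combinations(range(p), size):
            name = " + ".join(f"x{c + 1}" for c in cols)
            results.append((cv_mse(x, y, cols, folds, v), size, name))
    return sorted(results)

# Made-up illustration data: only x1 and x2 affect y.
rng = random.Random(1)
x = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [1 + 2 * row[0] - 1.5 * row[1] + rng.gauss(0, 0.5) for row in x]
ranking = rank_models(x, y)
for error, size, name in ranking[:5]:
    print(f"{name:<20} {size} {error:.5f}")
```

With six candidate variables, as in the slides, the same loop would score 63 models. Repeating the fold assignment with different seeds would give a distribution of cross-validation errors per model rather than a single value.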

What does the distribution of cross-validation mean square error look like for a Poisson sample?

What is the minimum mean cross-validation error for a Poisson sample?

numVars  cvErrorMin
1        1.37361
2        1.22905
3        1.08761
4        1.09346
5        1.09955
6        1.10648

What are the first ten models with the lowest model mean cross-validation mean square error for a Poisson sample?

ModelVars                     numVars  cvErrorMean
x1 + x2 + x3                  3        1.08761
x1 + x2 + x3 + x6             4        1.09346
x1 + x2 + x3 + x5             4        1.09380
x1 + x2 + x3 + x4             4        1.09444
x1 + x2 + x3 + x5 + x6        5        1.09955
x1 + x2 + x3 + x4 + x6        5        1.10015
x1 + x2 + x3 + x4 + x5        5        1.10086
x1 + x2 + x3 + x4 + x5 + x6   6        1.10648
x1 + x3                       2        1.22905
x1 + x2                       2        1.23545

What does the distribution of the model mean of the cross-validation mean square error look like for a Poisson sample?

What does the distribution of the model mean of the cross-validation mean square error look like treating the data as if it was from a simple random sample?

Considerations for v-fold cross-validation with data from a complex survey design.

How do you create the v-folds for cross-validation?
  Random
  Sorted weights
Once you have the v-folds and you start partitioning the data into training and test data based on the v-folds, how do you treat the weights?
  Do you ignore them?
  Use them as is?
  Ratio adjust them to sum to the population?
Weighted MSE
Do not blindly apply these methods.
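Three of these options can be made concrete with small helper functions. The Python sketch below is illustrative only, and the slides do not say how each option was implemented. In particular, "sorted weights" is interpreted here as sorting units by weight and dealing them into folds in rotation so that each fold sees a similar spread of weights; that reading, along with all function names, is an assumption.

```python
def folds_by_sorted_weights(weights, v):
    """Sort units by weight, then deal fold labels 1..v in rotation (assumed interpretation)."""
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    folds = [0] * len(weights)
    for rank, i in enumerate(order):
        folds[i] = rank % v + 1
    return folds

def ratio_adjust(weights, population_total):
    """Scale weights by a common factor so they sum to the population total."""
    factor = population_total / sum(weights)
    return [w * factor for w in weights]

def weighted_mse(y, y_hat, weights):
    """Weighted mean square error: sum of w * (y - y_hat)^2 divided by the sum of w."""
    numerator = sum(w * (a - b) ** 2 for a, b, w in zip(y, y_hat, weights))
    return numerator / sum(weights)

weights = [10, 40, 20, 30, 50, 60]
folds = folds_by_sorted_weights(weights, v=3)
print(folds)  # [1, 1, 2, 3, 2, 3]

# Hold out fold 1; the training weights then sum to 160 instead of the full 210.
train_weights = [w for w, f in zip(weights, folds) if f != 1]
adjusted = ratio_adjust(train_weights, population_total=sum(weights))
print(sum(train_weights), round(sum(adjusted), 6))  # 160 210.0

print(weighted_mse([1, 2, 3], [1, 1, 1], [1, 1, 2]))  # 2.25
```

The ratio adjustment matters because removing a test fold removes part of the weight total, so the training weights no longer represent the full population unless they are rescaled.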

RTI International