Robert Anderson SAS JMP

Missing Genuine Effects Is Bad, but Identifying False Effects Can Be Worse

Today’s talk
- Quick introduction to modelling and cross-validation
- Demo in JMP using simulated data and cross-validation
  - To show it working and not working
  - To show the benefit of using multiple validation columns
- Results from sensitivity studies on cross-validation success
  - Using simulated data and many runs under a variety of conditions

What do we mean by a model?
[Diagram: factors/inputs x1, x2, ..., xn enter a system (black box), which produces responses/outputs y1, y2, y3.]
Equation: y = f(x) + Error
- The model is just an equation or expression that defines the relationship between the inputs and the outputs.
- Scientists and engineers need to be able to find the best possible model and correctly identify which factors are genuinely important and which are not.
- Often the greatest concern is that an important or vital factor will be missed. However, statistical modelling methods frequently identify factors which are statistically significant but not genuinely active, and that can be an even worse problem.

Prediction Profiler Allows the Model to be Visualized
This is the prediction profiler for a model obtained from analysing historical data.
Model equation: Y = 2*X1 - 2.5*X2 + 3*X3 + 3*(X3*X4) - 2*(X5)^2
- Linear terms: 2*X1, -2.5*X2, 3*X3
- Interaction term: 3*(X3*X4)
- Squared term: -2*(X5)^2
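As a small illustration of what the profiler is visualizing, the model equation on this slide can be written as a plain function. This is only a sketch: the function name `predict_y` is hypothetical, and the last term is read as X5 squared, as the "Squared term" label indicates.

```python
def predict_y(x1, x2, x3, x4, x5):
    """Evaluate the slide's model equation for one setting of the five factors."""
    linear = 2 * x1 - 2.5 * x2 + 3 * x3   # linear terms
    interaction = 3 * (x3 * x4)           # interaction term
    squared = -2 * x5 ** 2                # squared term
    return linear + interaction + squared


# Because of the interaction term, the effect of changing X3 from 0 to 1
# depends on where X4 is set.
effect_x3_when_x4_high = predict_y(0, 0, 1, 1, 0) - predict_y(0, 0, 0, 1, 0)
effect_x3_when_x4_low = predict_y(0, 0, 1, -1, 0) - predict_y(0, 0, 0, -1, 0)
print(effect_x3_when_x4_high, effect_x3_when_x4_low)
```

With X4 at +1 the step in X3 changes Y by 6, while with X4 at -1 it changes Y by 0, which is the kind of behaviour a profiler shows as a changing slope.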

Identifying which terms to include in model
Implications of finding the incorrect model terms?
Decision: include or exclude a variable or term in a model.

| True situation (actual) | Include variable or term | Exclude variable or term |
|---|---|---|
| Variable or term is genuinely important | True Positive: correct decision made; no adverse implications | False Negative (missing a real effect): important effect is missed; poorer understanding; can’t explain all the variation; need to continue looking |
| Variable or term is not genuinely important | False Positive (identifying a false effect): non-genuine effect included; incorrect understanding; wastes time and effort; unexplained variation missed | True Negative |
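When the true model is known, as with simulated data, the four outcomes on this slide can be tallied directly by comparing the terms a modelling method selected against the terms that are genuinely active. The sketch below is illustrative only; `classify_terms` and the term names are hypothetical.

```python
def classify_terms(candidate_terms, true_terms, selected_terms):
    """Sort every candidate term into one of the four include/exclude outcomes."""
    candidates = set(candidate_terms)
    true_set = set(true_terms)
    selected = set(selected_terms)
    return {
        "true_positive": sorted(selected & true_set),    # genuine and included
        "false_negative": sorted(true_set - selected),   # genuine but excluded
        "false_positive": sorted(selected - true_set),   # not genuine but included
        "true_negative": sorted(candidates - true_set - selected),
    }


candidates = ["X1", "X2", "X3", "X4", "X5", "X6"]
result = classify_terms(
    candidates,
    true_terms=["X1", "X2", "X3"],
    selected_terms=["X1", "X2", "X6"],
)
for outcome, terms in result.items():
    print(outcome, terms)
```

In this made-up case X3 is a missed real effect and X6 is a false effect.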

Cross-validation in JMP Pro
Cross-validation is a way to suppress over-fitting and to reduce the chance of a model containing non-genuine or false effects.
Data randomly split into 3 samples. How the data will be used (validation methodology):
- “Training” sample: Most of the data will be used to build (or train) the model.
- “Validation” sample: Some data will be held back to ensure that the model is not ‘over fitted’ and is the best possible model using that model building technique.
- “Test” sample: Some data will be held back and not used in the model building process at all. This data will allow a fair comparison of how accurate the predictions from competing models are likely to be.
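Outside JMP, the random three-way split described here can be sketched with NumPy. The function name `make_validation_column` and the 60/20/20 proportions are assumptions chosen for illustration; the slide does not state the proportions used.

```python
import numpy as np


def make_validation_column(n_rows, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return a randomly ordered array of 'Training'/'Validation'/'Test' labels."""
    rng = np.random.default_rng(seed)
    n_train = int(round(fractions[0] * n_rows))
    n_valid = int(round(fractions[1] * n_rows))
    n_test = n_rows - n_train - n_valid  # the remainder becomes the test sample
    labels = np.array(["Training", "Validation", "Test"])
    column = np.repeat(labels, [n_train, n_valid, n_test])
    rng.shuffle(column)  # random assignment of rows to the three samples
    return column


validation_column = make_validation_column(100)
print({name: int((validation_column == name).sum())
       for name in ("Training", "Validation", "Test")})
```

Changing `seed` gives a different random assignment of the same rows, which is the idea behind trying several validation columns.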

Measuring your model’s performance
- R² is used to measure the performance of your model.
- JMP stops adding terms to the model when the validation R² reaches a maximum. This suppresses over-fitting.
[Plot: explanatory power of model (low to high) against model complexity (low to high), for the training sample and the validation sample. 8 model terms gives the maximum validation R².]
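The stopping rule on this slide can be imitated with a simple forward selection. The sketch below is not JMP's algorithm: it assumes ordinary least squares, adds at each step the column that most improves the training R², and then keeps the step with the highest validation R². All names and the simulated data are hypothetical.

```python
import numpy as np


def r_squared(y, y_hat):
    """1 - SS_residual / SS_total, computed on whichever sample is passed in."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)


def fit_predict(X_train, y_train, X_valid, columns):
    """Least-squares fit (with intercept) on the chosen columns; predict both samples."""
    def design(X):
        return np.column_stack([np.ones(len(X))] + [X[:, j] for j in columns])
    beta, *_ = np.linalg.lstsq(design(X_train), y_train, rcond=None)
    return design(X_train) @ beta, design(X_valid) @ beta


def forward_select(X_train, y_train, X_valid, y_valid, max_terms=8):
    """Grow the model one term at a time and record training/validation R^2."""
    selected, path = [], []
    remaining = list(range(X_train.shape[1]))
    for _ in range(min(max_terms, len(remaining))):
        # Candidate that gives the largest training R^2 when added.
        best = max(remaining, key=lambda j: r_squared(
            y_train, fit_predict(X_train, y_train, X_valid, selected + [j])[0]))
        selected.append(best)
        remaining.remove(best)
        fit_train, fit_valid = fit_predict(X_train, y_train, X_valid, selected)
        path.append((list(selected),
                     r_squared(y_train, fit_train),
                     r_squared(y_valid, fit_valid)))
    # Keep the model size at which the validation R^2 peaks.
    best_step = max(range(len(path)), key=lambda i: path[i][2])
    return path[best_step][0], path


# Hypothetical data: 10 candidate columns, only columns 0, 1 and 2 are active.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = 2 * X[:, 0] - 2.5 * X[:, 1] + 3 * X[:, 2] + rng.normal(size=200)
chosen, path = forward_select(X[:140], y[:140], X[140:], y[140:])
for terms, r2_train, r2_valid in path:
    print(len(terms), round(r2_train, 3), round(r2_valid, 3))
print("terms kept:", sorted(chosen))
```

The printed path shows the pattern in the plot: training R² can only rise as terms are added, whereas validation R² rises while genuine terms are entering and then flattens or falls.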

Let’s look at an example in JMP now

Benefit of Using Cross-Validation
- Simulated data was used so that the correct model was known.
- Over-fitted model obtained when validation isn’t used.
  - Over-fitted model includes many statistically significant terms which are non-genuine and false signals.
- Correct model is obtained when validation is used.
[Figure label: Actual model used to simulate the data]

Some simulation studies to see how sensitive the validation method is to certain parameters
- The results of the following simulation studies were obtained by drawing random samples from a 1000-row randomly generated dataset in which the response Y was simulated using a column formula of the form shown below.
- In each of the simulation studies, a single validation column was tried and the number of times the correct model was obtained was recorded.
Model equation: Y =
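The data-generation step can be sketched as follows. The column formula itself is not preserved in this transcript, so the linear form, the coefficients and the noise level below are purely illustrative assumptions, as are the function names. Only the 1000 rows, the 30 columns, the 3 active terms and the 0.7/0.3 training/validation split are taken from the study settings quoted on the slides.

```python
import numpy as np


def simulate_dataset(n_rows=1000, n_columns=30, active=(0, 1, 2),
                     coefficients=(2.0, -2.5, 3.0), noise_sd=1.0, seed=1):
    """Random X columns; Y depends on the 'active' columns only (assumed linear form)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, n_columns))
    signal = X[:, list(active)] @ np.array(coefficients)
    y = signal + rng.normal(scale=noise_sd, size=n_rows)
    return X, y


def draw_sample(X, y, sample_size, train_fraction=0.7, seed=0):
    """Draw rows without replacement and split them into training and validation."""
    rng = np.random.default_rng(seed)
    rows = rng.choice(len(y), size=sample_size, replace=False)
    n_train = int(round(train_fraction * sample_size))
    train, valid = rows[:n_train], rows[n_train:]
    return (X[train], y[train]), (X[valid], y[valid])


X_all, y_all = simulate_dataset()
(train_X, train_y), (valid_X, valid_y) = draw_sample(X_all, y_all, sample_size=50)
print(X_all.shape, train_X.shape, valid_X.shape)
```

Repeating `draw_sample` with different seeds, fitting a model each time, and counting how often the selected terms match the active ones gives a success rate of the kind reported in the studies.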

The Effect of Sample Size on Cross-validation Success
Each data point represents the percentage of correct models obtained from 10 trials using simulated data and a single validation column.
- Sample size = varied
- Effect size S/N ratio = 2
- Training/Validation ratio = 0.7/0.3
- Number of active terms = 3
- Number of columns = 30

The Effect of Effect Size on Cross-validation Success
- Sample size = 50
- Effect size S/N ratio = varied
- Training/Validation ratio = 0.7/0.3
- Number of active terms = 3
- Number of columns = 30

The Effect of Training/Validation Proportions on Cross-validation Success
- Sample size = 50
- Effect size S/N ratio = varied
- Training/Validation ratio = varied
- Number of active terms = 3
- Number of columns = 30

The Effect of Model Complexity on Cross-validation Success
- Sample size = 50
- Effect size S/N ratio = varied
- Training/Validation ratio = 0.7/0.3
- Number of active terms = varied
- Number of columns = 30

The Effect of the Number of Variables on Cross-validation Success
- Sample size = 50
- Effect size S/N ratio = varied
- Training/Validation ratio = 0.7/0.3
- Number of active terms = 3
- Number of columns = varied

Conclusions
- If you are building models from historical or observational data, you should be using cross-validation.
- If you use cross-validation, you shouldn’t rely on a single validation column; you should try multiple validation columns.
- The simplest and most frequently occurring model using multiple validation columns is likely to be the ‘correct’ model.
- Cross-validation suppresses overfitting (or finding non-genuine effects), but it doesn’t always prevent it.
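One way to act on the third conclusion is to record the set of terms selected under each validation column and pick the set that occurs most often, preferring the model with fewer terms when counts tie. The sketch below assumes the models have already been fitted; `most_frequent_model` and the example term lists are hypothetical.

```python
from collections import Counter


def most_frequent_model(models):
    """Return the most common set of terms; ties go to the model with fewer terms."""
    counts = Counter(frozenset(terms) for terms in models)
    best = min(counts, key=lambda m: (-counts[m], len(m), sorted(m)))
    return sorted(best), counts[best]


# Terms selected under five different validation columns (made-up results).
models = [
    ["X1", "X2", "X3"],
    ["X1", "X2", "X3", "X7"],
    ["X3", "X2", "X1"],
    ["X1", "X2", "X3"],
    ["X1", "X2", "X3", "X9", "X12"],
]
terms, occurrences = most_frequent_model(models)
print(terms, occurrences)
```

Here the three-term model appears under three of the five validation columns, while each larger model appears only once.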