Presentation on theme: "Robert Anderson SAS JMP"— Presentation transcript:

1 Missing Genuine Effects Is Bad, but Identifying False Effects Can Be Worse
Robert Anderson, SAS JMP

2 Today's talk
- Quick introduction to modelling and cross-validation
- Demo in JMP using simulated data and cross-validation: to show it working and not working, and to show the benefit of using multiple validation columns
- Results from sensitivity studies on cross-validation success, using simulated data and many runs under a variety of conditions

3 What do we mean by a model?
[Diagram: factors/inputs x1, x2, …, xn enter a system (black box) and produce responses/outputs y1, y2, y3; equation: y = f(x) + error]
The model is just an equation or expression that defines the relationship between the inputs and the outputs. Scientists and engineers need to be able to find the best possible model and correctly identify which factors are genuinely important and which are not. Often the greatest concern is that an important or vital factor will be missed. However, statistical modelling methods frequently identify factors which are statistically significant but not genuinely active, and that can be an even worse problem.
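The idea of "responses = f(inputs) + error" can be sketched in a few lines of code. The factor count, the function f, and all coefficients below are made-up illustrations, not anything from the talk:

```python
# A minimal sketch of "y = f(x) + error", with a hypothetical true
# relationship f. Everything here is illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))            # factors/inputs x1, x2, x3

def f(x):
    # Hypothetical "black box": only x1 and x2 actually matter here
    return 2 * x[:, 0] - x[:, 1]

error = rng.normal(scale=0.5, size=100)  # unexplained variation
y = f(x) + error                         # responses/outputs
```

The modelling task in the rest of the talk is the reverse direction: given only x and y, recover which terms belong in f.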

4 Prediction Profiler Allows the Model to be Visualized
This is the prediction profiler for a model obtained from analysing historical data.
Model equation: Y = …*X1 – 2.5*X2 + 3*X… + …*(X3*X4) – 2*(X5)²
The equation contains linear terms, an interaction term (X3*X4), and a squared term (X5)².

5-8 Identifying which terms to include in the model
Implications of finding the incorrect model terms? The true situation (is the variable or term genuinely important?) crossed with the decision (include or exclude the variable or term in the model) gives four outcomes:
- True Positive (genuinely important term included): correct decision made; no adverse implications.
- True Negative (unimportant term excluded): correct decision made; no adverse implications.
- False Negative (genuinely important term excluded, i.e. missing a real effect): important effect is missed; poorer understanding; can't explain all the variation; need to continue looking.
- False Positive (unimportant term included, i.e. identifying a false effect): non-genuine effect included; incorrect understanding; wastes time and effort; unexplained variation missed.
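The four outcomes above can be scored mechanically when the true model is known (as in the simulations later in the talk). A small helper, with purely illustrative term indices:

```python
# Score a selected set of model terms against the true set,
# counting the outcomes from the 2x2 table above.
def score_selection(true_terms, selected_terms):
    true_terms, selected_terms = set(true_terms), set(selected_terms)
    return {
        "true_positive": len(true_terms & selected_terms),
        "false_negative": len(true_terms - selected_terms),  # missed real effect
        "false_positive": len(selected_terms - true_terms),  # false effect included
    }

# Example: true model uses terms {0, 1, 2}; the fit picked {0, 1, 5}
counts = score_selection({0, 1, 2}, {0, 1, 5})
```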

9 Cross-validation in JMP Pro
Cross-validation is a way to suppress over-fitting and to reduce the chance of a model containing non-genuine or false effects. The data are randomly split into three samples:
- Training sample: most of the data, used to build (or train) the model.
- Validation sample: data held back to ensure that the model is not over-fitted and is the best possible model using that model-building technique.
- Test sample: data held back and not used in the model-building process at all, allowing a fair comparison of how accurate the predictions from competing models are likely to be.
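The three-way random split can be sketched as below. This is not JMP's actual implementation, and the 60/20/20 proportions are illustrative values, not taken from the talk:

```python
# Sketch of a random training/validation/test assignment column.
import numpy as np

def make_validation_column(n_rows, props=(0.6, 0.2, 0.2), seed=0):
    """Label each of n_rows as Training/Validation/Test at random.

    props = (train, validation, test) proportions -- illustrative only.
    """
    rng = np.random.default_rng(seed)
    labels = np.array(["Training", "Validation", "Test"])
    cuts = np.cumsum(props)        # cumulative cut points: 0.6, 0.8, 1.0
    u = rng.random(n_rows)         # one uniform draw per row
    return labels[np.searchsorted(cuts, u)]

col = make_validation_column(1000)
```

In JMP this assignment lives in a "validation column"; re-running with a different seed gives a different validation column, which matters for the multiple-validation-column recommendation later.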

10-13 Measuring your model's performance
R² is used to measure the performance of your model. JMP stops adding terms to the model when the validation R² reaches a maximum; this suppresses over-fitting.
[Plot: explanatory power of the model (R², low to high) versus model complexity (low to high), with curves for the training sample and the validation sample; in this example, 8 model terms gives the maximum validation R².]
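The stopping rule described above can be sketched as forward selection that tracks validation R². This is a simplified stand-in for what JMP does, not its actual algorithm, and the data are made up:

```python
# Forward selection by least squares, stopping when validation R^2 peaks.
import numpy as np

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

def forward_select(X_tr, y_tr, X_va, y_va):
    """Greedily add columns of X; stop when validation R^2 stops improving."""
    chosen, best_va = [], -np.inf
    remaining = list(range(X_tr.shape[1]))
    while remaining:
        scores = []
        for j in remaining:  # try each candidate term
            cols = chosen + [j]
            A = np.column_stack([np.ones(len(y_tr)), X_tr[:, cols]])
            beta, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
            Av = np.column_stack([np.ones(len(y_va)), X_va[:, cols]])
            scores.append((r2(y_va, Av @ beta), j))
        score, j = max(scores)
        if score <= best_va:   # validation R^2 has peaked: stop adding terms
            break
        best_va = score
        chosen.append(j)
        remaining.remove(j)
    return chosen, best_va

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)  # only terms 0, 1 active
tr, va = slice(0, 140), slice(140, 200)
terms, va_r2 = forward_select(X[tr], y[tr], X[va], y[va])
```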

14 Let’s look at an example in JMP now

15 Benefit of Using Cross-Validation
Simulated data was used, so the correct model was known. An over-fitted model is obtained when validation isn't used; the correct model is obtained when validation is used. The over-fitted model includes many statistically significant terms which are non-genuine, false signals. [Slide shows the actual model used to simulate the data.]

16 Some simulation studies to see how sensitive the validation method is to certain parameters
The results of the following simulation studies were obtained by drawing random samples from a 1000-row randomly generated dataset in which the response Y was simulated using a column formula of the form shown below. In each simulation study, a single validation column was tried and the number of times the correct model was obtained was recorded.
Model equation: Y =
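A simulation study of this shape can be sketched as below. The selection method (rank terms by |correlation| on the training sample, then choose the model size with the best validation R²) and all default settings are illustrative stand-ins, not the talk's actual code or formula:

```python
# Hedged sketch of a cross-validation sensitivity study: how often does a
# single random validation split recover exactly the true set of terms?
import numpy as np

def recovery_rate(n_rows=50, n_cols=30, n_active=3, snr=2.0,
                  train_frac=0.7, n_trials=10, seed=0):
    """Fraction of trials in which the selected terms equal the true ones.

    Effect size is modelled simply as the coefficient magnitude `snr`
    (an assumption for illustration).
    """
    rng = np.random.default_rng(seed)
    hits, true = 0, set(range(n_active))
    for _ in range(n_trials):
        X = rng.normal(size=(n_rows, n_cols))
        y = X[:, :n_active].sum(axis=1) * snr + rng.normal(size=n_rows)
        n_tr = int(train_frac * n_rows)
        tr, va = slice(0, n_tr), slice(n_tr, n_rows)
        # Rank candidate terms by |correlation| with y on the training sample
        corr = np.abs([np.corrcoef(X[tr, j], y[tr])[0, 1] for j in range(n_cols)])
        order = np.argsort(-corr)
        best_r2, best_k = -np.inf, 0
        for k in range(1, n_cols + 1):   # pick model size by validation R^2
            cols = order[:k]
            A = np.column_stack([np.ones(n_tr), X[tr][:, cols]])
            beta, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
            Av = np.column_stack([np.ones(n_rows - n_tr), X[va][:, cols]])
            resid = y[va] - Av @ beta
            r2 = 1 - resid @ resid / ((y[va] - y[va].mean()) ** 2).sum()
            if r2 > best_r2:
                best_r2, best_k = r2, k
        if set(order[:best_k]) == true:
            hits += 1
    return hits / n_trials

rate = recovery_rate()
```

Each of the following slides varies one of these parameters while holding the others fixed.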

17 The Effect of Sample Size on Cross-validation Success
Each data point represents the percentage of correct models obtained from 10 trials using simulated data and a single validation column.
- Sample size: varied
- Effect size (S/N ratio): 2
- Training/validation ratio: 0.7/0.3
- Number of active terms: 3
- Number of columns: 30

18 The Effect of Effect Size on Cross-validation Success
- Sample size: 50
- Effect size (S/N ratio): varied
- Training/validation ratio: 0.7/0.3
- Number of active terms: 3
- Number of columns: 30

19 The Effect of Training/Validation Proportions on Cross-validation Success
- Sample size: 50
- Effect size (S/N ratio): varied
- Training/validation ratio: varied
- Number of active terms: 3
- Number of columns: 30

20 The Effect of Model Complexity on Cross-validation Success
- Sample size: 50
- Effect size (S/N ratio): varied
- Training/validation ratio: 0.7/0.3
- Number of active terms: varied
- Number of columns: 30

21 The Effect of the Number of Variables on Cross-validation Success
- Sample size: 50
- Effect size (S/N ratio): varied
- Training/validation ratio: 0.7/0.3
- Number of active terms: 3
- Number of columns: varied

22 Conclusions
- If you are building models from historical or observational data, you should be using cross-validation.
- If you use cross-validation, you shouldn't rely on a single validation column; you should try multiple validation columns.
- The simplest and most frequently occurring model across multiple validation columns is likely to be the 'correct' model.
- Cross-validation suppresses over-fitting (finding non-genuine effects), but it doesn't always prevent it.
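The multiple-validation-column recommendation can be sketched as below: rerun the same selection procedure under several different random splits and keep the most frequently occurring model. The simple correlation-ranking selector and all data settings are illustrative, not JMP's actual method:

```python
# Repeat model selection over several random validation columns and
# take the most frequently occurring model as the consensus.
import numpy as np
from collections import Counter

def select_terms(X, y, val_mask):
    """Rank terms by |correlation| on training rows; size by validation R^2."""
    tr, va = ~val_mask, val_mask
    corr = np.abs([np.corrcoef(X[tr, j], y[tr])[0, 1] for j in range(X.shape[1])])
    order = np.argsort(-corr)
    best_r2, best_k = -np.inf, 1
    for k in range(1, X.shape[1] + 1):
        cols = order[:k]
        A = np.column_stack([np.ones(tr.sum()), X[tr][:, cols]])
        beta, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
        Av = np.column_stack([np.ones(va.sum()), X[va][:, cols]])
        r2 = 1 - ((y[va] - Av @ beta) ** 2).sum() / ((y[va] - y[va].mean()) ** 2).sum()
        if r2 > best_r2:
            best_r2, best_k = r2, k
    return frozenset(order[:best_k])

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 15))
y = 2 * X[:, 0] + 2 * X[:, 1] + 2 * X[:, 2] + rng.normal(size=100)

models = Counter()
for seed in range(10):  # 10 different random validation columns
    mask = np.random.default_rng(seed).random(100) < 0.3
    models[select_terms(X, y, mask)] += 1
consensus = models.most_common(1)[0][0]  # most frequently occurring model
```

A model that survives many different splits is much less likely to contain false effects than one that looked best under a single split.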

