Is My Model Valid? Using Simulation to Understand Your Model and If It Can Accurately Predict Events
Brad Foulkes
JMP Discovery Summit 2016
All Rights Reserved. No part of this document may be reproduced, transmitted, stored in a retrieval system nor translated into any human or computer language, in any form or by any means, electronic, mechanical, magnetic, optical, chemical, manual, or otherwise, without the prior written permission of the General Electric Company.
Agenda
- Types of models we’re talking about
- How can we usually tell if a model is valid?
- Using survival models to predict an event
- How does simulation help?
- Building a simple script to simulate
- Let’s do an example
Types of models we’re talking about
Discrete events, i.e. will it or won’t it happen?
- Will a part fail at XYZ time?
- Will I roll a 7?
- Will I get a ticket?
- Will someone leave the company?
- Will the student graduate?
- Will I survive a heart attack?
Types of models we’re talking about
[Figure: Weibull plot]
- Survival, Reliability, Weibull
- Logistic regression
- Neural Net
- Decision tree / Bootstrap Forest
- Anything with a probability of occurrence
Types of models we’re talking about
- Event prediction, also where there are distinct groups
- Anything with a probability of occurrence
- Survival models, logistic regression, bootstrap forest, etc.
[Figure labels: P(t) of occurrence; Event]
How can we usually tell if a model is valid?
- Review goodness of fit, R-squared, AICc
- ROC curves
- Confusion matrix
What is wrong with these? For some data sets:
- Results can be misleading
- It is unclear how far off the model is
- They can be off for unbalanced data sets (i.e. a low number of events)
[Figure: AUC = 0.987; Accuracy 92.7%; Accuracy 91.2%; … but missed most of the actual events]
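To make the unbalanced-data problem concrete, here is a minimal sketch using made-up confusion-matrix counts (they are hypothetical and are not the numbers behind the slide). It shows how accuracy can look strong while most actual events are missed.

```python
# Hypothetical confusion-matrix counts for an unbalanced data set.
# These are illustrative values only, not data from the presentation.
true_pos = 5      # actual events predicted as events
false_neg = 45    # actual events the model missed
false_pos = 20    # non-events predicted as events
true_neg = 930    # non-events predicted as non-events

total = true_pos + false_neg + false_pos + true_neg

accuracy = (true_pos + true_neg) / total        # share of all cases classified correctly
recall = true_pos / (true_pos + false_neg)      # share of actual events that were caught
precision = true_pos / (true_pos + false_pos)   # share of predicted events that were real

print(f"accuracy  = {accuracy:.3f}")
print(f"recall    = {recall:.3f}")
print(f"precision = {precision:.3f}")
```

With only 50 events in 1,000 cases, the large number of correctly classified non-events dominates the accuracy figure, so it says little about how well events are predicted.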
Using survival models to predict events
- Start with a Weibull model
- Look at the risk at each event
- With no clear difference, is the model valid?
- For some groups, the model is good; others are outside the confidence bounds (CBs)
[Figure labels: Event Time; Event Count; Groupings of data]
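As a rough illustration of "the risk at each event", the sketch below evaluates the two-parameter Weibull cumulative distribution function, F(t) = 1 - exp(-(t/scale)^shape), at a few times. The shape and scale values and the list of times are assumptions chosen for illustration, not fitted values from the talk.

```python
import math

def weibull_cdf(t, shape, scale):
    """Probability that the event has occurred by time t under a
    two-parameter Weibull model: 1 - exp(-(t / scale) ** shape)."""
    return 1.0 - math.exp(-((t / scale) ** shape))

# Hypothetical fitted parameters (assumed for this sketch).
shape = 1.5      # Weibull shape (beta)
scale = 1000.0   # Weibull scale (eta), in the same units as the times below

# Hypothetical unit ages at which to evaluate the risk.
unit_ages = [100.0, 400.0, 800.0, 1500.0]
risks = [weibull_cdf(t, shape, scale) for t in unit_ages]

for t, r in zip(unit_ages, risks):
    print(f"t = {t:7.1f}  risk = {r:.4f}")
```

These per-unit risks are the probabilities that a simulation would compare against random draws.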
What is simulating an event?
- Each event has a probability of occurring (its risk)
- Draw a random number between 0 and 1; when it is less than the risk, it counts as an event
- Add up the number of events over several runs to figure out the average and standard deviation for the probability of an event
- Random chance!
Reliability simulation modeling
- Given a certain probability (0.25), sum the number of times the random value was less than that probability
- Then divide by the total number of iterations
- SimProb.jmp from sample data sets
- This works well for predicting and understanding 1 event, but what about multiple events? Or the same event on multiple units?
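The counting procedure described above can be sketched in a few lines of Python. This is not the SimProb.jmp script; the function name, the iteration count, and the seed are assumptions made for the example.

```python
import random

def estimate_probability(p, iterations, seed=None):
    """Count how often a uniform random draw falls below p,
    then divide by the number of iterations."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(iterations) if rng.random() < p)
    return hits / iterations

# Probability 0.25 as on the slide; 10,000 iterations and seed 1 are assumed.
estimate = estimate_probability(0.25, 10_000, seed=1)
print(f"estimated probability = {estimate:.4f}")
```

With enough iterations the estimate settles close to the input probability, which is a quick check that the simulation logic is behaving as intended.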
Simulating many units
- If 10 units with different probabilities are simulated, the same principles can be used to determine the number of failures
- Now, a mean and standard deviation can be found for the group
- What about running multiple groups?
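A minimal sketch of the many-units case follows, assuming ten hypothetical unit probabilities and independent units. It also compares the simulated mean and standard deviation with the closed-form values for a sum of independent Bernoulli trials (mean = sum of p, variance = sum of p(1 - p)), which is a handy sanity check.

```python
import random
import statistics

def simulate_failures(probabilities, iterations, seed=None):
    """For each iteration, draw one uniform random number per unit and
    count how many units fail (draw < that unit's probability)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(iterations):
        counts.append(sum(1 for p in probabilities if rng.random() < p))
    return counts

# Hypothetical failure probabilities for 10 units (assumed values).
unit_probs = [0.05, 0.10, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50, 0.60]

counts = simulate_failures(unit_probs, 10_000, seed=1)
sim_mean = statistics.mean(counts)
sim_sd = statistics.pstdev(counts)

# Closed-form values for independent units, used only as a cross-check.
expected_mean = sum(unit_probs)
expected_sd = sum(p * (1 - p) for p in unit_probs) ** 0.5

print(f"simulated: mean = {sim_mean:.3f}, sd = {sim_sd:.3f}")
print(f"expected:  mean = {expected_mean:.3f}, sd = {expected_sd:.3f}")
```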
Interpreting the results
- Using the Western Electric (WeCo) rules for SPC, you can determine whether the data fall outside the confidence of the model
- Example using Blenders.jmp, showing WeCo rules
- Ref: https://en.wikipedia.org/wiki/Western_Electric_rules
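One possible way to apply this idea is sketched below: each group's observed event count is converted to a z-score against that group's simulated mean and standard deviation, and the first two zone rules are checked (a point beyond 3 sigma; two of three consecutive points beyond 2 sigma on the same side). Only these two rules are implemented, the function names are invented for the example, and the observed counts and simulated statistics are hypothetical.

```python
def z_scores(observed, sim_means, sim_sds):
    """Standardize each observed count against its simulated mean and sd."""
    return [(o - m) / s for o, m, s in zip(observed, sim_means, sim_sds)]

def weco_flags(z):
    """Return (rule1, rule2) index lists.
    rule1: points more than 3 sigma from the center.
    rule2: indices where the 3-point window ending at that index holds
           at least 2 points beyond 2 sigma on the same side."""
    rule1 = [i for i, zi in enumerate(z) if abs(zi) > 3]
    rule2 = []
    for i in range(2, len(z)):
        window = z[i - 2:i + 1]
        above = sum(1 for w in window if w > 2)
        below = sum(1 for w in window if w < -2)
        if above >= 2 or below >= 2:
            rule2.append(i)
    return rule1, rule2

# Hypothetical observed event counts for 8 groups, with hypothetical
# simulated means and standard deviations for each group.
observed = [9, 11, 10, 17, 12, 15, 15, 10]
sim_means = [10.0] * 8
sim_sds = [2.0] * 8

z = z_scores(observed, sim_means, sim_sds)
rule1, rule2 = weco_flags(z)
print("rule 1 violations at groups:", rule1)
print("rule 2 violations at groups:", rule2)
```

Groups flagged by either rule are candidates for "the model does not describe this group well", in the same spirit as points outside the limits on a control chart.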
Running the simulation - Normally
For loops can be used to reevaluate the number of events that occur
1. Set up a comparison column to identify “events”
2. Iterate to count the number of events each time
3. Append each trial to a summary table
4. Calculate the statistics on the entire data set
For 10K iterations, this code creates 10K summary tables and runs the formulas 10K times
Vectorization, i.e. using matrices to speed up your code
[Figure: code before and after vectorization ("From this… to this")]
- Vectorization just means that instead of doing one calculation at a time, you run groups of them at the same time
- It can move the computationally tough tasks to a group, so they happen less frequently
- In short, it’s just linear algebra
A few references on the topic
- http://www.jmp.com/support/help/Expression_Data.shtml
- https://en.wikipedia.org/wiki/Array_programming
- http://www.r-bloggers.com/how-to-use-vectorization-to-streamline-simulations/
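The talk's own scripts are written in JSL and are not reproduced here. As a stand-in, the sketch below shows the same "one at a time" versus "all at once" contrast in Python with NumPy, which is an assumed dependency; the unit probabilities, iteration count, and seed are hypothetical.

```python
import numpy as np

# Hypothetical failure probabilities for 10 units (assumed values).
unit_probs = np.array([0.05, 0.10, 0.10, 0.15, 0.20,
                       0.25, 0.30, 0.40, 0.50, 0.60])
iterations = 10_000
rng = np.random.default_rng(1)

# One at a time: a fresh set of draws and a fresh count per iteration.
loop_counts = np.empty(iterations, dtype=int)
for i in range(iterations):
    draws = rng.random(unit_probs.size)
    loop_counts[i] = np.sum(draws < unit_probs)

# Vectorized: draw an (iterations x units) matrix once, compare every row
# to the probabilities in one step, then sum across each row.
all_draws = rng.random((iterations, unit_probs.size))
vec_counts = (all_draws < unit_probs).sum(axis=1)

print(f"loop:       mean = {loop_counts.mean():.3f}, sd = {loop_counts.std():.3f}")
print(f"vectorized: mean = {vec_counts.mean():.3f}, sd = {vec_counts.std():.3f}")
```

Both versions estimate the same quantity; the vectorized one replaces 10,000 small comparisons with a single matrix comparison, which is where the speed-up usually comes from.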
Saving a vector into a column
- To use a vector in a column, the data type needs to change to “Expression”
- Change the column type in the column viewer: “Vector” type or “None” type
- Either type will work in this situation
- Vector type is available in JMP13
- An expression like this can then be used
[Figure: example expression]
The new code…
1. Set up a comparison column to identify “events”
2. Create new summary table
3. Create a subset of each group
4. Calculate & store the statistics on each group
Roughly the same length of code, but much more efficient
For 10K iterations, this code creates 1 table and runs the formulas once
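The four steps above might look roughly like this in Python with NumPy (an assumed dependency). This is a sketch of the idea rather than the presenter's JSL; the group labels, unit probabilities, and the dictionary standing in for the summary table are all hypothetical.

```python
import numpy as np

# Hypothetical units: a failure probability and a group label for each.
probs = np.array([0.05, 0.10, 0.20, 0.30, 0.15, 0.25, 0.40, 0.50])
groups = np.array(["A", "A", "A", "B", "B", "B", "B", "B"])
iterations = 10_000
rng = np.random.default_rng(1)

# Step 1: one comparison for all iterations and all units at once.
events = rng.random((iterations, probs.size)) < probs

# Step 2: a single summary "table" (here, a plain dictionary).
summary = {}

for g in np.unique(groups):
    # Step 3: select the columns belonging to this group.
    mask = groups == g
    counts = events[:, mask].sum(axis=1)   # events per iteration in the group

    # Step 4: calculate and store the statistics for the group.
    summary[str(g)] = {
        "units": int(mask.sum()),
        "mean": float(counts.mean()),
        "sd": float(counts.std()),
    }

for name, stats in summary.items():
    print(name, stats)
```

The random draws and the comparison happen once for the whole data set; only the cheap per-group sums are repeated.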
Running the final script
All sorts of options can be added in:
- Sub-setting of data
- Run multiple different groupings
- Using only data after a certain date
- Enter a model, select a formula
- Conditional risk/reliability
An example: Worcester Heart Attack Study (WHAS)
https://www.umass.edu/statdata/statdata/stat-survival.html
If you came into the hospital with a heart attack, would you survive?
WHAS - Bootstrap Forest model
[Figure: AUC=0.9; AUC=0.81]
Confusion matrix seems off in predicting survivals
WHAS - Simulation
A few references
- https://en.wikipedia.org/wiki/Bernoulli_trial
- https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
- https://en.wikipedia.org/wiki/Backtesting
- https://en.wikipedia.org/wiki/Monte_Carlo_method
- https://en.wikipedia.org/wiki/Western_Electric_rules
- The Certified Reliability Engineer Handbook, Benbow and Broome, 11/28/2008, ASQ Quality Press, pages 146-150
- https://en.wikipedia.org/wiki/Vectorization_(mathematics)
- https://en.wikipedia.org/wiki/Array_programming
- https://www.value-at-risk.net/title-page/
- https://www.umass.edu/statdata/statdata/stat-survival.html
Summary
- Predicting events is tough
- Depending on the data set, it can be tough to know if the model is any good
- Probability simulation can help or at least provide another path to try
- Vectorization can speed up the simulation
Questions?