Is My Model Valid? Using Simulation to Understand Your Model and If It Can Accurately Predict Events Brad Foulkes JMP Discovery Summit 2016.

All Rights Reserved. No part of this document may be reproduced, transmitted, stored in a retrieval system nor translated into any human or computer language, in any form or by any means, electronic, mechanical, magnetic, optical, chemical, manual, or otherwise, without the prior written permission of the General Electric Company.

Agenda
- Types of models we’re talking about
- How can we usually tell if a model is valid?
- Using survival models to predict an event
- How does simulation help?
- Building a simple script to simulate
- Let’s do an example

Types of models we’re talking about
Discrete events, i.e. will it or won’t it happen?
- Will a part fail at XYZ time?
- Will I roll a 7?
- Will I get a ticket?
- Will someone leave the company?
- Will the student graduate?
- Will I survive a heart attack?

Types of models we’re talking about
- Survival, Reliability, Weibull
- Logistic regression
- Neural Net
- Decision tree / Bootstrap Forest
- Anything with a probability of occurrence
[Figure: Weibull plot]

Types of models we’re talking about
- Event prediction, also where there are distinct groups
- Anything with a probability of occurrence
- Survival models, logistic regression, bootstrap forest, etc.
[Figure labels: P(t) of occurrence; Event]

How can we usually tell if a model is valid?
- Review goodness of fit, R-squared, AICc
- ROC curves
- Confusion matrix
What is wrong with these? For some data sets:
- Results can be misleading
- Unsure of how far off the model is
- Can be off for unbalanced data sets (i.e. low number of events)
[Figure labels: AUC = 0.987; Accuracy 92.7%; Accuracy 91.2%] ... but missed most of the actual events
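To see how accuracy can look strong while most events are missed, here is a minimal Python sketch. The counts are hypothetical and are not taken from the slides; they only illustrate an unbalanced data set with few events.

```python
def confusion_summary(tp, fp, fn, tn):
    """Return (accuracy, recall) from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Recall: share of actual events that the model caught.
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, recall


# Hypothetical counts: 1,000 cases, only 50 of which are actual events.
accuracy, recall = confusion_summary(tp=10, fp=15, fn=40, tn=935)
print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

Here the large number of correctly predicted non-events dominates the accuracy figure, while recall shows that only a fifth of the actual events were caught.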

Using survival models to predict events
- Start with a Weibull model
- Look at the risk at each event
- With no clear difference, is the model valid?
- For some groups the model is good; others are outside the confidence bounds (CBs)
[Figure labels: Event Time; Event Count; Groupings of data]
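As a rough illustration of "risk at each event", the sketch below evaluates the cumulative failure probability of a two-parameter Weibull model at a given time. The slides do not state the fitted parameters, so the shape and scale values here are placeholders chosen for illustration.

```python
import math


def weibull_failure_probability(t, shape, scale):
    """Two-parameter Weibull CDF: probability that failure occurs by time t."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))


# Placeholder parameters (assumed, not from the talk).
shape, scale = 2.0, 1000.0
for t in (250, 500, 1000):
    risk = weibull_failure_probability(t, shape, scale)
    print(f"t={t}: risk={risk:.4f}")
```

At t equal to the scale parameter the risk is 1 - exp(-1), about 63.2%, whatever the shape, which is a quick sanity check on any implementation.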

What is simulating an event?
- Start from the probability of an event occurring
- Draw a random number; when the draw is less than the risk of the event, it counts as an event
- Add up the number of events over several runs to find the average and standard deviation of the probability of an event
[Figure label: Random Chance!]
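A minimal Python sketch of this idea follows. The function names, the risk of 0.10, and the run sizes are illustrative assumptions, not values from the talk.

```python
import random
import statistics


def simulate_event(risk, rng):
    """One trial: the event occurs when a uniform draw falls below the risk."""
    return rng.random() < risk


def estimate_over_runs(risk, trials_per_run, runs, seed=0):
    """Repeat the experiment and summarize the per-run event proportions."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(runs):
        events = sum(simulate_event(risk, rng) for _ in range(trials_per_run))
        estimates.append(events / trials_per_run)
    return statistics.mean(estimates), statistics.stdev(estimates)


# Assumed inputs: a 10% risk, 50 runs of 1,000 trials each.
mean_p, sd_p = estimate_over_runs(risk=0.10, trials_per_run=1_000, runs=50)
print(f"mean={mean_p:.4f}, sd={sd_p:.4f}")
```

The mean should land close to the risk that was fed in, and the standard deviation shows how much a single run can wander from it by chance alone.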

Reliability simulation modeling
- Given a certain probability (0.25), sum the number of times the random value was less than that probability
- Then divide by the total number of iterations
- SimProb.jmp from sample data sets
This works well for predicting and understanding 1 event, but what about multiple events? Or the same event on multiple units?
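Written as a formula, the estimate described here is the share of uniform draws that fall below the probability p. The notation (N iterations, draws u_i) is introduced for illustration and does not appear on the slide; the standard error is the usual binomial result.

```latex
\hat{p} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\{u_i < p\},
\qquad u_i \sim \mathrm{Uniform}(0,1),
\qquad
\operatorname{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{N}}
```

For p = 0.25 and, say, N = 10,000 iterations, the standard error is about 0.0043, so the simulated proportion should sit within roughly one percentage point of 0.25.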

Simulating many units
- If 10 units with different probabilities are simulated, the same principles can be used to determine the number of failures
- Now a mean and standard deviation can be found for the group
What about running multiple groups?
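For reference, the simulated mean and standard deviation of the group's failure count can be checked against closed-form values, assuming the units fail independently (an assumption the slide does not state). With unit probabilities p_1, ..., p_n:

```latex
X = \sum_{i=1}^{n} X_i, \quad X_i \sim \mathrm{Bernoulli}(p_i)
\;\Longrightarrow\;
\mathbb{E}[X] = \sum_{i=1}^{n} p_i,
\qquad
\mathrm{SD}(X) = \sqrt{\sum_{i=1}^{n} p_i\,(1 - p_i)}
```

If the simulation's summary statistics differ noticeably from these values, the script is more likely at fault than the model.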

Interpreting the results
- Using the Western Electric (WeCo) rules for SPC, you can determine whether the observed data fall outside the range the model predicts
- Example using Blenders.jmp, showing WeCo rules
Ref: https://en.wikipedia.org/wiki/Western_Electric_rules
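As a sketch of how such a check might be coded, the Python below applies only the first two WeCo rules (one point beyond 3 sigma; two of three consecutive points beyond 2 sigma on the same side) to observed counts, using a center and sigma that would come from the simulation. The observed values, center, and sigma are hypothetical.

```python
def weco_flags(values, center, sigma):
    """Return indices flagged by WeCo rule 1 and rule 2.

    Rule 1: a single point more than 3 sigma from the center.
    Rule 2: two of three consecutive points more than 2 sigma from the
            center on the same side (index of the window's last point).
    """
    z = [(v - center) / sigma for v in values]
    rule1 = [i for i, zi in enumerate(z) if abs(zi) > 3]
    rule2 = []
    for i in range(2, len(z)):
        window = z[i - 2 : i + 1]
        above = sum(zi > 2 for zi in window)
        below = sum(zi < -2 for zi in window)
        if above >= 2 or below >= 2:
            rule2.append(i)
    return rule1, rule2


# Hypothetical observed event counts per group; center and sigma are
# assumed to be the simulated mean and standard deviation.
observed = [10, 11, 9, 17, 10, 14.5, 14.2, 10]
rule1, rule2 = weco_flags(observed, center=10, sigma=2)
print("rule 1:", rule1, "rule 2:", rule2)
```

The remaining WeCo rules follow the same windowed pattern with different thresholds and run lengths.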

Running the simulation - Normally
For loops can be used to reevaluate the number of events that occur:
1. Set up a comparison column to identify “events”
2. Iterate to count the number of events each time
3. Append each trial to a summary table
4. Calculate the statistics on the entire data set
For 10K iterations, this code creates 10K summary tables and runs the formulas 10K times
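The talk's own script is not reproduced in this transcript, so the following Python is only an illustrative stand-in for the loop-based approach. The unit risks and the function name are assumptions.

```python
import random
import statistics


def simulate_with_loops(risks, iterations, seed=0):
    """Loop-based simulation: one pass over every unit per iteration."""
    rng = random.Random(seed)
    summary = []  # stands in for the summary table
    for _ in range(iterations):
        events = 0
        for risk in risks:
            # Comparison step: a draw below the unit's risk is an event.
            if rng.random() < risk:
                events += 1
        summary.append(events)  # append this trial to the summary
    # Statistics on the entire set of trials.
    return statistics.mean(summary), statistics.stdev(summary)


# Hypothetical per-unit risks.
risks = [0.05, 0.10, 0.25, 0.40]
mean_events, sd_events = simulate_with_loops(risks, iterations=10_000)
print(f"mean={mean_events:.3f}, sd={sd_events:.3f}")
```

Every iteration repeats the comparison and the bookkeeping, which is the overhead the talk goes on to remove.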

Vectorization – i.e. using matrices to speed up your code
[Figure: code before and after vectorization, labeled "From this ... to this"]
- Vectorization just means that instead of doing one calculation at a time, you run groups of them at the same time
- It can move the computationally tough tasks to a group, so they happen less frequently
- In short, it’s just linear algebra
A few references on the topic:
http://www.jmp.com/support/help/Expression_Data.shtml
https://en.wikipedia.org/wiki/Array_programming
http://www.r-bloggers.com/how-to-use-vectorization-to-streamline-simulations/
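A hedged sketch of the same idea in Python with NumPy (a swapped-in tool; the talk itself works in JMP): all random draws are generated as one matrix and compared against the risks in a single operation. The risks and function name are illustrative assumptions.

```python
import numpy as np


def simulate_vectorized(risks, iterations, seed=0):
    """Draw an (iterations x units) matrix once and compare it in one step."""
    rng = np.random.default_rng(seed)
    risks = np.asarray(risks, dtype=float)
    draws = rng.random((iterations, risks.size))
    # Broadcasting compares every row of draws against the risk vector.
    counts = (draws < risks).sum(axis=1)
    return float(counts.mean()), float(counts.std(ddof=1))


# Hypothetical per-unit risks.
risks = [0.05, 0.10, 0.25, 0.40]
mean_events, sd_events = simulate_vectorized(risks, iterations=10_000)
print(f"mean={mean_events:.3f}, sd={sd_events:.3f}")
```

There is no explicit loop over iterations or units; the comparison and the row sums each run once over the whole matrix.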

Saving a vector into a column
- To use a vector in a column, the data type needs to change to “Expression”
- Change the column type in the column viewer: “Vector” type or “None” type
- Either type will work in this situation
- Vector type is available in JMP 13
- An expression like this can then be used: [Figure: example column expression]

The new code…
1. Set up a comparison column to identify “events”
2. Create new summary table
3. Create a subset of each group
4. Calculate & store the statistics on each group
Roughly the same length of code, but much more efficient
For 10K iterations, this code creates 1 table and runs the formulas once
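To illustrate the per-group version, the NumPy sketch below performs the comparison once for all units and then subsets the resulting matrix by group. The group labels, risks, and function name are hypothetical, and this is a Python analogue rather than the talk's JMP script.

```python
import numpy as np


def simulate_by_group(risks, groups, iterations, seed=0):
    """Compare once, then summarize the event counts for each group."""
    rng = np.random.default_rng(seed)
    risks = np.asarray(risks, dtype=float)
    groups = np.asarray(groups)
    # Single comparison for every unit and iteration.
    events = rng.random((iterations, risks.size)) < risks
    summary = {}
    for name in np.unique(groups):
        # Subset the columns that belong to this group.
        counts = events[:, groups == name].sum(axis=1)
        summary[str(name)] = (float(counts.mean()), float(counts.std(ddof=1)))
    return summary


# Hypothetical units: two in group A, two in group B.
summary = simulate_by_group(
    risks=[0.1, 0.1, 0.5, 0.5],
    groups=["A", "A", "B", "B"],
    iterations=10_000,
)
for name, (mean_events, sd_events) in summary.items():
    print(f"{name}: mean={mean_events:.3f}, sd={sd_events:.3f}")
```

The only remaining loop runs once per group, not once per iteration.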

Running the final script
All sorts of options can be added in:
- Sub-setting of data
- Run multiple different groupings
- Using only data after a certain date
- Enter a model, select a formula
- Conditional risk/reliability
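For the conditional risk option, one common definition is the probability of failing in the next interval given survival so far. The slides do not spell out the formula used, so the following is a standard form stated as an assumption, with F the model's cumulative failure probability and R = 1 - F its reliability:

```latex
P(T \le t + \Delta \mid T > t)
= \frac{F(t + \Delta) - F(t)}{1 - F(t)}
= 1 - \frac{R(t + \Delta)}{R(t)}
```

This conditional value would then replace the unconditional risk in the comparison against the random draw.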

An example: Worcester Heart Attack Study
https://www.umass.edu/statdata/statdata/stat-survival.html
If you came into the hospital with a heart attack, would you survive?

WHAS – Bootstrap Forest model
[Figure labels: AUC=0.9; AUC=0.81]
Confusion matrix seems off in predicting survivals

WHAS - Simulation

A few references
- https://en.wikipedia.org/wiki/Bernoulli_trial
- https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
- https://en.wikipedia.org/wiki/Backtesting
- https://en.wikipedia.org/wiki/Monte_Carlo_method
- https://en.wikipedia.org/wiki/Western_Electric_rules
- The Certified Reliability Engineer Handbook, Benbow and Broome, 11/28/2008, ASQ Quality Press, pages 146-150
- https://en.wikipedia.org/wiki/Vectorization_(mathematics)
- https://en.wikipedia.org/wiki/Array_programming
- https://www.value-at-risk.net/title-page/
- https://www.umass.edu/statdata/statdata/stat-survival.html

Summary
- Predicting events is tough
- Depending on the data set, it can be tough to know if the model is any good
- Probability simulation can help or at least provide another path to try
- Vectorization can speed up the simulation

Questions?