Saskatoon SAS User Group
Efficiency and data mining?
Agenda Background Case Study
It means different things to different people?
  Predictive analytics, data science, statistics, machine learning, data mining
  Uses a variety of tools; Data Scientist; Business Analyst; Heavy Excel user; IT; Management; Executive; Consistent answers; Tries to avoid next migraine
  How do we manage this?
  Show me the easy button
  Show me the power
So what?
  Data Scientist:
    Modern machine learning algorithms
    Quickly build hundreds or thousands of models
    Reusable assets and best practices
  Business Users:
    Sound, reliable, analytically backed decisions
    Analytics integrated into day-to-day operations
    Easy to use and understand interfaces
    Easily combine analytical models and rules into business decisions in a single interface
The Data Mining Process: CRISP-DM Methodology
  CRISP-DM is a good methodology.
  SEMMA is a process in Enterprise Miner. It aligns well with CRISP-DM.
  This process is your friend. Use it. Iterate. Fail fast.
SEMMA Process
  Sample, Explore, Modify, Model, Assess, Deploy
Building a predictive model: 3 approaches
  Rapid Predictive Modeler (RPM)
    Preconfigured Enterprise Miner workflow in Enterprise Guide
    Easy
    Quick
    Good models
    Auditable and reusable
  Enterprise Miner
    Visual workflows
    Powerful
    Medium difficulty
    Great models
    Auditable and reusable
  Programming
    Difficult to learn
    Some data scientists prefer this
    Not suitable for the business analyst
The Data Mining Process: How to add efficiency
  Understand the problem
  Understand the data
  Use visualization early in the process
  Don’t be afraid to build models. Start with RPM.
  Fail fast
Agenda Background Case Study
The Data Mining Process: Case study
  "We have a problem! Use actionable, in-memory, big-data, cloud, machine-learning analytics to fix it."
  "You mean use predictive modeling to find the trucks that are going to blow up?"
  "Last time it was altitude related."
40,000 vehicles; the fleet is ageing
Trucks are equipped with telematics
The data scientist is on vacation
Dataset = 1.5 GB (2M rows)! My spreadsheet won’t open it...
Business Analyst, Data Scientist
Case study: What I am going to show you
  Use visualization early in the process to formulate a strategy
  Demo 1
    Visual exploration of timeline
    Cluster analysis
Case study: What I am going to show you
  Don’t be afraid to model
  Rapid Predictive Modeler
  Enterprise Miner
  Demo 2
    Feature engineering
    2-minute model
    Enterprise Model
Case study: What I am going to show you
  This is how we derive value from the model
  Demo 3
    Create score code
    Geospatial representation of scored data
Sample & Explore Data
  Missing data is a landmine. Identify and remediate.
  Visualize: reconstruct a timeline
  Explore before subsetting or filtering
  Demo 1
    Visual exploration of timeline
    Cluster analysis
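A minimal sketch of the "identify" half of the missing-data advice, written in Python rather than the SAS tools used in the demo. The field names and sample rows are hypothetical, and treating `None`, an empty string, or an absent key as "missing" is an assumption.

```python
def missing_counts(rows, columns):
    """Count missing values (None, empty string, or absent key) per column."""
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value == "":
                counts[col] += 1
    return counts

# Hypothetical telematics records; the field names are illustrative only.
readings = [
    {"truck_id": "T1", "oil_temp": 92.0, "altitude": 480.0},
    {"truck_id": "T2", "oil_temp": None, "altitude": 510.0},
    {"truck_id": "T3", "oil_temp": 88.5, "altitude": ""},
    {"truck_id": "T4", "oil_temp": None},
]

report = missing_counts(readings, ["truck_id", "oil_temp", "altitude"])
for col, n in report.items():
    print(f"{col}: {n} of {len(readings)} missing")
```

Running a count like this before any subsetting shows which fields need remediation and which can be trusted when reconstructing the timeline.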
Sample & Explore Data
  Cluster analysis in Visual Analytics
  Now that I understand the data, I have a plan:
    Sample only alternator faults
    Focus on recent data. Using all the history may pollute my model.
Modify, Model, Assess
  Use Rapid Predictive Modeler to fail fast
  Look at the variable importance chart
  Engineer features into the data
  Mitigate the risk of overfitting (holdouts, model selection criteria)
  Demo 2
    Feature engineering
    RPM
    Advanced EM model
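The holdout idea on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not what RPM or Enterprise Miner does internally: the 25% holdout fraction, the sample cases, and the threshold "model" are all hypothetical.

```python
import random

def holdout_split(rows, holdout_fraction=0.25, seed=42):
    """Shuffle a copy of rows and split it into (training, holdout)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def misclassification_rate(actual, predicted):
    """Percentage of cases where the predicted label differs from the actual one."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("no cases to evaluate")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return 100.0 * wrong / len(actual)

# Hypothetical labelled cases: (odometer_km, fault) pairs, illustrative only.
cases = [(km, int(km > 300_000)) for km in range(50_000, 550_000, 25_000)]
train, holdout = holdout_split(cases)

def predict(km):
    """Stand-in "model": a fixed threshold rule, not fitted to anything."""
    return int(km > 350_000)

rate = misclassification_rate(
    [fault for _, fault in holdout],
    [predict(km) for km, _ in holdout],
)
print(f"{len(train)} training rows, {len(holdout)} holdout rows")
print(f"holdout misclassification rate: {rate:.2f}%")
```

The point of the split is that the rate is measured only on rows the model never saw, which is what makes it a fair basis for comparing candidate models.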
Modify Data: Engineered Features
  Binning into deciles
    Altitude
    Engine hours
    Years in service
    Odometer mileage
    Oil temp
    Water temp
  Computed variables
    RPM
    Days since service origin
    Water temp * Oil temp
  Binning into quartiles
    Speed
    RPM
    Water temp * Oil temp
    Days since service origin
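A rough Python sketch of two of the engineered feature types listed above: computed variables (days since service origin, water temp * oil temp) and quantile binning. The records, field names, and reference date are hypothetical. The rank-based binning rule is an assumption and may assign ties differently from the SAS tools used in the demo.

```python
from datetime import date

def quantile_bins(values, n_bins):
    """Rank-based binning: returns a bin number from 1 to n_bins per value.

    Ties are split by input order because the sort is stable.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values) + 1
    return bins

# Hypothetical truck records; field names and values are illustrative only.
trucks = [
    {"id": "T1", "in_service": date(2016, 1, 1), "water_temp": 80.0, "oil_temp": 90.0, "speed": 95.0},
    {"id": "T2", "in_service": date(2010, 6, 15), "water_temp": 85.0, "oil_temp": 100.0, "speed": 70.0},
    {"id": "T3", "in_service": date(2013, 9, 30), "water_temp": 78.0, "oil_temp": 88.0, "speed": 82.0},
    {"id": "T4", "in_service": date(2008, 2, 1), "water_temp": 90.0, "oil_temp": 110.0, "speed": 60.0},
]
AS_OF = date(2017, 1, 1)  # assumed reference date for the age calculation

# Computed variables
for truck in trucks:
    truck["days_since_service_origin"] = (AS_OF - truck["in_service"]).days
    truck["water_x_oil"] = truck["water_temp"] * truck["oil_temp"]

# Binning into quartiles (use n_bins=10 for deciles)
speed_quartile = quantile_bins([t["speed"] for t in trucks], 4)
for truck, q in zip(trucks, speed_quartile):
    truck["speed_quartile"] = q
    print(truck["id"], truck["days_since_service_origin"], truck["water_x_oil"], q)
```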
Modify, Model, Assess
We improve the model by iterating

Step                                   Misclassification rate (%)   Improvement (%)   Champion model
Just do it - Model on full dataset     10.30                        n/a               Logistic regression
RPM - Regression on segmented data     8.56                         16.89             Logistic regression (segmented dataset; sampled)
RPM - Intermediate                     8.02                         6.31              Decision tree 2
RPM - Advanced                         7.27                         9.35              Decision tree 3
Add feature engineered variables       6.94                         4.54              (not stated)
Use Enterprise Miner                   6.46                         6.92              Ensemble (neural network and decision tree)
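The improvement figures in this table appear to be the relative reduction in misclassification rate against the previous step. The short Python sketch below recomputes them from the reported rates under that reading. The overall figure at the end is derived here and does not appear on the slide.

```python
# Misclassification rates (%) as reported on the slide, in step order.
steps = [
    ("Just do it - Model on full dataset", 10.30),
    ("RPM - Regression on segmented data", 8.56),
    ("RPM - Intermediate", 8.02),
    ("RPM - Advanced", 7.27),
    ("Add feature engineered variables", 6.94),
    ("Use Enterprise Miner", 6.46),
]

def relative_improvement(previous, current):
    """Percentage reduction of the rate relative to the previous step."""
    return 100.0 * (previous - current) / previous

improvements = [
    round(relative_improvement(prev_rate, rate), 2)
    for (_, prev_rate), (_, rate) in zip(steps, steps[1:])
]
for (name, rate), gain in zip(steps[1:], improvements):
    print(f"{name}: {rate:.2f}% ({gain:.2f}% better than previous step)")

# Derived here, not stated on the slide: first step versus last step.
overall = round(relative_improvement(steps[0][1], steps[-1][1]), 2)
print(f"overall improvement: {overall:.2f}%")
```

Each step's gain is measured against the step before it, so the individual percentages do not add up to the overall figure.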
Pre-release version of SAS Visual Data Mining and Machine Learning
Deploy
  How will the model output be used by someone who knows nothing about data science?
  Score code is useful. A model is not.
  Visualize the output
  Demo 3
    Create score code
    Geospatial representation of scored data
Deploy
  Out of a truck fleet of 2000+:
    72 have fault codes on alternators
    12 are prioritized for maintenance based on the prediction
  This is where they are
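To illustrate how scored output turns into a maintenance list, here is a small Python sketch. It is not the score code produced in the demo: the logistic form, the coefficients, the input fields, and the sample trucks are all hypothetical stand-ins for whatever the modeling tool would export.

```python
import math

# Hypothetical coefficients, for illustration only.
INTERCEPT = -6.0
WEIGHTS = {"engine_hours_k": 0.25, "altitude_km": 0.8}

def score(truck):
    """Return a predicted fault probability between 0 and 1."""
    z = INTERCEPT + sum(weight * truck[name] for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

def prioritize(trucks, top_n):
    """Rank trucks by predicted probability, highest first, and keep top_n."""
    return sorted(trucks, key=score, reverse=True)[:top_n]

# Hypothetical trucks that already show an alternator fault code.
flagged = [
    {"id": "T1", "engine_hours_k": 20.0, "altitude_km": 1.5},
    {"id": "T2", "engine_hours_k": 8.0, "altitude_km": 0.2},
    {"id": "T3", "engine_hours_k": 16.0, "altitude_km": 2.0},
    {"id": "T4", "engine_hours_k": 12.0, "altitude_km": 0.5},
]

shortlist = prioritize(flagged, top_n=2)
for truck in shortlist:
    print(truck["id"], f"{score(truck):.3f}")
```

The person consuming the output only needs the ranked shortlist (and, as on the slide, a map of where those trucks are), not the model itself.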
The Data Mining Process: How to add efficiency
  Use visualization early in the process
  Don’t be afraid to build models; it is easy. Start with RPM.
  Fail fast
Ideas? Questions?