Saskatoon SAS user group

Presentation transcript:

Saskatoon SAS user group Efficiency and data mining?

Agenda Background Case Study


Predictive Analytics … Data Science … Statistics … Machine Learning … Data Mining
It means different things to different people?

Roles: Data Scientist, Business Analyst, Heavy Excel user, IT, Management, Executive

What they say:
- Uses a variety of tools
- Consistent answers
- Tries to avoid next migraine
- How do we manage this?
- Show me the easy button
- Show me the power

So what?

Data Scientist:
- Modern machine learning algorithms
- Quickly build hundreds or thousands of models
- Reusable assets and best practices

Business Users:
- Sound, reliable, analytically backed decisions
- Analytics integrated into day-to-day operations
- Easy to use and understand interfaces
- Easily combine analytical models and rules into business decisions in a single interface

The Data Mining Process: CRISP-DM Methodology
- CRISP-DM is a good methodology.
- SEMMA is a process in Enterprise Miner. It aligns well with CRISP-DM.
- This process is your friend. Use it. Iterate. Fail fast.

SEMMA Process: Sample, Explore, Modify, Model, Assess, Deploy

Building a predictive model: 3 Approaches

Rapid Predictive Modeler (RPM)
- Preconfigured Enterprise Miner workflow in Enterprise Guide
- Easy
- Quick
- Good models
- Auditable and reusable

Enterprise Miner
- Visual workflows
- Powerful
- Medium difficulty
- Great models
- Auditable and reusable

Programming
- Difficult to learn
- Some Data Scientists prefer this
- Not suitable for the business analyst

The Data Mining Process: How to add efficiency
- Understand the problem
- Understand the data
- Use visualization early in the process
- Don’t be afraid to build models; start with RPM
- Fail fast

Agenda Background Case Study

The Data Mining Process: Case study
- "We have a problem! Use actionable, in-memory, big-data, cloud, machine-learning analytics to fix it."
- "You mean use predictive modeling to find the trucks that are going to blow up?"
- "Last time it was altitude related."

Case study: the situation
- 40 000 vehicles; the fleet is ageing
- Trucks are equipped with telematics
- The data scientist is on vacation
- Dataset = 1.5 GB (2M rows). "My spreadsheet won’t open it..."

Roles: Business Analyst, Data Scientist

Case study: What I am going to show you (Sample, Explore)
Use visualization early in the process to formulate a strategy.

Demo 1
- Visual exploration of timeline
- Cluster analysis

Case study: What I am going to show you (Modify, Model, Assess)
Don’t be afraid to model. Tools: Rapid Predictive Modeler, Enterprise Miner.

Demo 2
- Feature engineering
- 2-minute model
- Enterprise Miner model

Case study: What I am going to show you (Deploy)
This is how we derive value from the model.

Demo 3
- Create score code
- Geospatial representation of scored data

Sample & Explore Data
- Missing data is a landmine. Identify and remediate.
- Visualize: reconstruct a timeline.
- Explore before subsetting or filtering.

Demo 1
- Visual exploration of timeline
- Cluster analysis
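The "identify" half of the missing-data advice can be sketched in a few lines. This is a minimal illustration in plain Python, not the Visual Analytics workflow shown in the demo; the records, the column names (`altitude`, `oil_temp`) and the helper `missing_share` are all hypothetical.

```python
# Hypothetical telematics records; None marks a missing reading.
rows = [
    {"truck_id": 1, "altitude": 520.0, "oil_temp": 96.0},
    {"truck_id": 2, "altitude": None, "oil_temp": 101.5},
    {"truck_id": 3, "altitude": 610.0, "oil_temp": None},
    {"truck_id": 4, "altitude": None, "oil_temp": 98.2},
]

def missing_share(records):
    """Return the fraction of missing (None) values for each column."""
    columns = records[0].keys()
    return {
        col: sum(1 for r in records if r.get(col) is None) / len(records)
        for col in columns
    }

report = missing_share(rows)
# List the worst columns first so they can be remediated before modeling.
for col, share in sorted(report.items(), key=lambda kv: -kv[1]):
    print(f"{col}: {share:.0%} missing")
```

How to remediate (drop, impute, or flag) depends on the variable and is left open here.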

Sample & Explore Data: Cluster Analysis in Visual Analytics
Now that I understand the data, I have a plan:
- Sample only alternator faults.
- Focus on recent data. Using all the history may pollute my model.
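The sampling plan (alternator faults only, recent data only) amounts to a two-condition filter. The sketch below assumes a hypothetical event layout with `fault` and `event_date` fields and an arbitrary 365-day window; the presentation does not say how "recent" was defined.

```python
from datetime import date, timedelta

# Hypothetical fault events; field names and values are illustrative only.
events = [
    {"truck_id": 101, "fault": "alternator", "event_date": date(2017, 3, 2)},
    {"truck_id": 102, "fault": "brakes", "event_date": date(2017, 3, 5)},
    {"truck_id": 103, "fault": "alternator", "event_date": date(2014, 7, 19)},
    {"truck_id": 104, "fault": "alternator", "event_date": date(2017, 1, 11)},
]

def select_recent(records, fault, as_of, window_days):
    """Keep records of one fault type that fall inside the recent window."""
    cutoff = as_of - timedelta(days=window_days)
    return [
        r for r in records
        if r["fault"] == fault and r["event_date"] >= cutoff
    ]

# The 365-day window is an assumption chosen for the example.
sample = select_recent(events, "alternator", as_of=date(2017, 4, 1), window_days=365)
print([r["truck_id"] for r in sample])
```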

Modify, Model, Assess
- Use Rapid Predictive Modeler to fail fast.
- Look at the variable importance chart.
- Engineer features into the data.
- Mitigate the risk of overfitting (holdouts, model selection criteria).

Demo 2
- Feature engineering
- RPM Advanced
- EM Model
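The holdout idea mentioned above can be sketched as follows: set part of the data aside, fit on the rest, and report the misclassification rate only on the part the model never saw. This is a generic illustration, not what RPM or Enterprise Miner do internally; the 30% holdout fraction, the seed and both helper names are assumptions.

```python
import random

def split_holdout(records, holdout_fraction=0.3, seed=42):
    """Shuffle a copy of the records and split into (train, holdout)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def misclassification_rate(actual, predicted):
    """Share of cases where the predicted label differs from the actual one."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

train, holdout = split_holdout(list(range(10)))
print(len(train), len(holdout))

# Toy labels: two of the four predictions are wrong.
print(misclassification_rate([1, 0, 1, 1], [1, 1, 1, 0]))
```

A model that scores much better on the training rows than on the holdout rows is a sign of overfitting.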

Modify Data: Engineered Features

Binning into deciles
- Altitude
- Engine hours
- Years in service
- Odometer mileage
- Oil temp
- Water temp

Computed variables
- RPM
- Days since service origin
- Water temp * Oil temp

Binning into quartiles
- Speed
- RPM
- Water temp * Oil temp
- Days since service origin
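As a rough illustration of these feature types, the sketch below bins a variable into deciles by rank and derives two computed variables (the temperature product and days since service origin). It is plain Python rather than the Enterprise Miner nodes used in the demo; the sample values, the reference date and the helper `quantile_bins` are hypothetical.

```python
from datetime import date

def quantile_bins(values, n_bins):
    """Assign each value a bin from 1 to n_bins by rank, so bins are about equal in size.

    Tied values may land in different bins; acceptable for a sketch.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values) + 1
    return bins

# Hypothetical altitude readings, binned into deciles.
altitudes = [120, 950, 430, 2210, 780, 1500, 60, 1890, 340, 1120]
altitude_decile = quantile_bins(altitudes, 10)
print(altitude_decile)

# Computed variables for one hypothetical truck.
truck = {"water_temp": 88.0, "oil_temp": 95.0, "service_start": date(2012, 4, 1)}
as_of = date(2017, 4, 1)  # assumed reference date
truck["temp_product"] = truck["water_temp"] * truck["oil_temp"]
truck["days_in_service"] = (as_of - truck["service_start"]).days
print(truck["temp_product"], truck["days_in_service"])
```

Passing `n_bins=4` to the same helper gives quartiles.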

Modify, Model, Assess: We improve the model by iterating

Step                                 Misclassification rate (%)   Improvement vs. previous step (%)   Champion model
Just do it: model on full dataset    10.30                        -                                   Logistic regression
RPM: regression on segmented data    8.56                         16.89                               Logistic regression (segmented dataset; sampled)
RPM: Intermediate                    8.02                         6.31                                Decision tree 2
RPM: Advanced                        7.27                         9.35                                Decision tree 3
Add feature engineered variables     6.94                         4.54                                -
Use Enterprise Miner                 6.46                         6.92                                Ensemble (neural network and decision tree)
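The improvement figures above are consistent with a relative reduction in misclassification rate from one step to the next, that is, (previous - current) / previous. The short check below recomputes them from the reported rates; the function name and the abbreviated step labels are illustrative.

```python
# Misclassification rates (%) as reported for each step, in order.
steps = [
    ("Model on full dataset", 10.30),
    ("RPM regression on segmented data", 8.56),
    ("RPM Intermediate", 8.02),
    ("RPM Advanced", 7.27),
    ("Add feature engineered variables", 6.94),
    ("Use Enterprise Miner", 6.46),
]

def relative_improvement(previous, current):
    """Percentage reduction in error rate relative to the previous step."""
    return (previous - current) / previous * 100

improvements = [
    round(relative_improvement(prev[1], curr[1]), 2)
    for prev, curr in zip(steps, steps[1:])
]
for (name, _), pct in zip(steps[1:], improvements):
    print(f"{name}: {pct:.2f}% better than the previous step")

# Cumulative gain from the first model to the last.
overall = relative_improvement(steps[0][1], steps[-1][1])
print(f"Overall: {overall:.1f}%")
```

Taken end to end, going from 10.30% to 6.46% is a reduction of roughly 37% in misclassified cases.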

Pre-release version of SAS Visual Data Mining and Machine Learning

Deploy
- How will the model output be used by someone who knows nothing about data science?
- Score code is useful. A model is not.
- Visualize the output.

Demo 3
- Create score code
- Geospatial representation of scored data

Deploy
Out of a truck fleet of 2000+:
- 72 have fault codes on alternators
- 12 are prioritized for maintenance based on the prediction

This is where they are.
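One way scored output can become a maintenance list is to rank the flagged trucks by predicted failure probability and keep the top few. The presentation does not say how the 12 trucks were selected, so the top-N rule, the field name `p_failure` and the scores below are all assumptions made for illustration.

```python
# Hypothetical scored trucks; p_failure stands in for the score code's output.
scored = [
    {"truck_id": 11, "p_failure": 0.12},
    {"truck_id": 12, "p_failure": 0.81},
    {"truck_id": 13, "p_failure": 0.47},
    {"truck_id": 14, "p_failure": 0.93},
    {"truck_id": 15, "p_failure": 0.05},
    {"truck_id": 16, "p_failure": 0.66},
]

def prioritize(trucks, top_n):
    """Return the top_n trucks with the highest predicted failure probability."""
    return sorted(trucks, key=lambda t: t["p_failure"], reverse=True)[:top_n]

shortlist = prioritize(scored, top_n=2)
print([t["truck_id"] for t in shortlist])
```

A probability threshold would work equally well; the point is that the business user receives a short ranked list, not the model itself.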

The Data Mining Process: How to add efficiency
- Use visualization early in the process
- Don’t be afraid to build models. It is easy; start with RPM
- Fail fast

Ideas? Questions?