Data Science that’s scale

Presentation transcript:

Data Science that’s scale
Marcin Szeliga

SQLSat Kyiv Team
- Yevhen Nedashkivskyi
- Alesya Zhuk
- Eugene Polonichko
- Oksana Borysenko
- Mykola Pobyivovk
- Oksana Tkach

Sponsor Sessions
Start at 13:10. Don’t miss them, they might provide some interesting and valuable information!

                 Room A     Room B      Room C
13:10 - 13:30    DevArt     Microsoft   Eleks
13:30 - 13:50    DB Best    Intapp      DataArt

NULL means no session in that room at that time

Our Awesome Sponsors

Session will begin very soon :) Please complete the evaluation form from your pocket after the session. Your feedback will help us to improve future conferences and speakers will appreciate your feedback! Enjoy the conference!

Marcin Szeliga
- Data philosopher
- 20 years of experience with SQL Server
- Data Platform MVP & MCT
- Microsoft Certified Solutions Expert
  - Data Platform
  - Data Management and Analytics
  - Cloud Platform and Infrastructure
- marcin.szeliga@datacommunity.pl

Agenda
- Tools: MRO (Microsoft R Open), MRC (Microsoft R Client), MRS (Microsoft R Server)
- Tips & tricks on performing a data science experiment with 540 lines of R code
  - Data ingestion
  - Data preparation
  - Data profiling
  - Data enhancement
  - Data modeling
  - Model evaluation
  - Model improvement
  - Model operationalization

Microsoft R Open (MRO)
- Based on R Open (Revolution R Open, to be precise)
- Free and open source R distribution
- Compatible with all R-related software
- MRAN website: https://mran.revolutionanalytics.com/
- Enhanced and distributed by Microsoft
  - Intel MKL library
  - Reproducible R Toolkit
  - ParallelR
  - RHadoop
  - AzureML

Microsoft R Client (MRC)
- Free, community-supported data science tool for high-performance analytics: http://aka.ms/rclient/download
- Built on top of Microsoft R Open (MRO)
- Brings together ScaleR technology and its proprietary functions
- Allows you to work with production data only locally
  - Data to be processed must fit in local memory
  - Processing is limited to two threads for ScaleR functions
- R Tools for Visual Studio (RTVS) is an integrated development environment available as a free add-in for any edition of Visual Studio: https://www.visualstudio.com/vs/rtvs/

Microsoft R Server (MRS) 9
- R for the enterprise
- Available for download from MSDN and Visual Studio Dev Essentials
- Adds support for
  - Remote execution
  - Remote compute contexts
  - Data chunking
  - Additional threads for multithreaded processing
  - Parallel processing and streaming
- R Server platforms
  - R Server for Hadoop
  - R Server for Teradata DB
  - R Server for Linux
  - R Server for Windows
  - SQL Server R Services

What’s new in MRS 9
- MRS 9.0 brings
  - State-of-the-art machine learning algorithms (MicrosoftML library)
    - Fast linear learner, with support for L1 and L2 regularization
    - Fast boosted decision tree
    - Fast random forest
    - Logistic regression, with support for L1 and L2 regularization
    - GPU-accelerated Deep Neural Networks (DNNs) with convolutions
    - Binary classification using a One-Class Support Vector Machine
  - Simplified operationalization of R models (MRSDeploy)
  - New data sources for Apache Hive and Parquet
- MRS 9.1 adds
  - Pre-trained cognitive models for sentiment analysis and image featurization
  - New platform: Apache Spark on an HDInsight cluster
  - Real-time scoring

Data Science approach: follow the data
SOURCE: CRISP-DM 1.0 http://www.crisp-dm.org/download.htm
DESIGN: Nicole Leaper http://www.nicoleleaper.com

Solving problems with machine learning
- "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E" (Tom Mitchell)
  - Thanks to the law of large numbers this is possible
  - Hoeffding's inequality allows us to measure how trustworthy the results are
- Our goal is to classify transactions as fraudulent or not based on historical and demographic data
- The training data are a set of cases (observations), each of which is described by:
  - A list of attributes (input, explanatory, x, independent variables, or just features)
  - Label classes (output, explained, y, dependent variable, or just labels)
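The slide does not spell out Hoeffding's inequality, so here is a minimal sketch of how it can be used, written in Python rather than the R used in the talk. It assumes independent cases with outcomes bounded in [0, 1] (for example, "fraud / not fraud"), in which case P(|observed rate - true rate| > epsilon) <= 2 * exp(-2 * epsilon^2 * n). The function names are hypothetical.

```python
import math


def hoeffding_bound(n: int, epsilon: float) -> float:
    """Upper bound on P(|observed rate - true rate| > epsilon) for n i.i.d. cases in [0, 1]."""
    # A probability can never exceed 1, so the raw bound is capped.
    return min(1.0, 2.0 * math.exp(-2.0 * epsilon ** 2 * n))


def cases_needed(epsilon: float, delta: float) -> int:
    """Smallest n for which the Hoeffding bound drops to delta or below."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


# The bound tightens quickly as the number of cases grows.
for n in (100, 1_000, 10_000):
    print(n, hoeffding_bound(n, epsilon=0.05))

# Cases required to be within 5 percentage points with 99% confidence.
print(cases_needed(epsilon=0.05, delta=0.01))
```

The bound is distribution-free, which is why it tends to be pessimistic: in practice fewer cases are often enough, but the sketch shows why more data makes a measured error rate more trustworthy.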

Data ingestion
- Tidy your data
  - Each variable forms a column
  - Each observation forms a row
  - Each type of observational unit forms a table
- Ensure fast access to the data
  - Convert data into a compressed format optimized for fast processing
  - XDF (eXternal Data Frame) is a binary file format with an R interface that optimizes row and column processing and analysis
- Move computation to where the data is stored
  - Set a remote compute context
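To make the three tidy-data rules concrete, the sketch below reshapes a small "wide" table (one column per month) into a tidy one (one row per observation). It is plain Python for illustration only; the table, the column names, and the `to_tidy` helper are all hypothetical and are not taken from the talk's R code.

```python
# Hypothetical "wide" table: one row per customer, one column per month.
wide = [
    {"customer": "A", "2017-01": 120.0, "2017-02": 80.0},
    {"customer": "B", "2017-01": 15.5, "2017-02": 0.0},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own (id, variable, value) observation row."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy_rows


tidy = to_tidy(wide, "customer", "month", "amount")
for record in tidy:
    print(record)
```

After reshaping, "month" and "amount" are ordinary variables, so filtering, grouping, and modeling code no longer needs to know which months happen to exist as columns.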

Data preparation
- Predictive models should provide the most accurate and reliable predictions
  - Feel free to add variables, transform them, and play with model parameters
- Find more data
  - Datasets can be combined if they have at least one common variable
- Impute missing values
  - Try to minimize changes in the distribution of variables
- Correct bad data
  - Data that does not comply with business rules or common sense
- Deal with outliers
  - Unusual values fall more than 1.5*IQR below the first quartile or above the third quartile
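The imputation and outlier rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the talk's R code: missing values are encoded as `None`, the median is used as the replacement (one common choice that limits, but does not eliminate, distortion of the distribution), and the helper names and sample amounts are hypothetical.

```python
import statistics


def impute_median(values):
    """Replace missing values (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_fences(values):
    """Return the (low, high) fences located 1.5*IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def flag_outliers(values):
    """Return the values that fall outside the 1.5*IQR fences."""
    low, high = iqr_fences(values)
    return [v for v in values if v < low or v > high]


# Hypothetical transaction amounts with one missing value and one extreme value.
amounts = [12.0, 15.0, None, 14.0, 13.0, 16.0, 250.0, 15.0]
clean = impute_median(amounts)
print(clean)
print(flag_outliers(clean))
```

Whether a flagged value should be removed, capped, or kept is a domain decision; in fraud detection the unusual cases are often exactly the ones of interest.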

Data profiling
- For each variable
  - Check how much information it contains (variance will help)
  - Assess its quality (range, number of missing observations, duplicates, outliers)
- Search for patterns
  - If a systematic relationship exists between two variables, it will appear as a pattern in the data
  - When you spot a pattern, ask yourself
    - Could this pattern be due to coincidence?
    - How can you describe the relationship implied by the pattern?
    - How strong is the relationship implied by the pattern?
    - What other variables might affect the relationship?
- Descriptive statistics are simplifications; graphs tend to be more relevant and easier to interpret
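A per-variable profile and a first check for a relationship between two variables might look like the Python sketch below. It is an assumption-laden illustration: missing values are encoded as `None`, the Pearson coefficient is used as one possible measure of how strong a linear relationship is, and the function names and sample data are hypothetical.

```python
import statistics


def profile(values):
    """Summarize one variable: size, missing values, duplicates, range, variance."""
    observed = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(observed),
        "duplicates": len(observed) - len(set(observed)),
        "min": min(observed),
        "max": max(observed),
        "variance": statistics.pvariance(observed),
    }


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant numeric sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / (spread_x * spread_y) ** 0.5


summary = profile([10.0, 20.0, None, 20.0, 40.0])
print(summary)
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```

A coefficient near zero only rules out a linear relationship, which is one reason the slide recommends looking at graphs as well as summary numbers.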

Data enhancement
- Add features
  - If you have knowledge of a given domain, you can calculate them on the basis of other attributes
  - Computed variables can also be the result of technical transformations
- Split data into train, test and control sets (cross-validation is even better but slower)
  - The train set is used to detect patterns
  - The test set is used to detect errors, which are always present in the data
  - The control set is used only once, for a final assessment of the data mining model
- If the distribution of the output variable is heavily skewed, you should balance it
  - Accuracy paradox: a model with 99.99% accuracy can be completely useless
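The three-way split and the balancing step can be sketched as follows. This is a Python illustration, not the talk's R code: the 60/20/20 proportions, the fixed seed, random undersampling as the balancing method, and all names are assumptions made for the example.

```python
import random


def three_way_split(cases, train=0.6, test=0.2, seed=42):
    """Shuffle the cases and cut them into train, test and control sets."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],  # whatever remains becomes the control set
    )


def undersample(cases, label_of, seed=42):
    """Balance a binary label by randomly dropping cases from the larger class."""
    positives = [c for c in cases if label_of(c)]
    negatives = [c for c in cases if not label_of(c)]
    minority, majority = sorted((positives, negatives), key=len)
    kept = random.Random(seed).sample(majority, len(minority))
    return minority + kept


# Hypothetical cases: (id, is_fraud), with 10% positives.
cases = [(i, i % 10 == 0) for i in range(1000)]
train_set, test_set, control_set = three_way_split(cases)
balanced_train = undersample(train_set, label_of=lambda case: case[1])
print(len(train_set), len(test_set), len(control_set), len(balanced_train))
```

Only the train set is balanced in this sketch; the test and control sets keep the original class distribution so that the measured quality reflects what the model will face in production.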

Data modeling
- Classification and regression are methods of supervised learning
  - Source data contains ground truth
- Most data mining algorithms can be used for both tasks
  - Logistic regression
  - Linear regression
  - Boosted decision tree
  - Random forest
  - Neural net

Model evaluation
- There is no single best model
  - Ongoing evaluation of model performance is a must
- The best models are simple models that fit the data well
  - We need a balance between accuracy and simplicity
- In a binary classification scenario, the target variable has only two possible outcomes
  - One is called positive (p), the other negative (n)
- Since the true value of the output variable is known for each case, we can simply submit these records for classification and compare the predictions with the true values
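Comparing predictions with true values for a binary target amounts to counting four kinds of outcomes. The Python sketch below shows one way to do that; the function name and the sample labels are hypothetical, and `True` is assumed to stand for the positive class.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for boolean labels."""
    counts = Counter(zip(actual, predicted))
    return {
        "TP": counts[(True, True)],    # positive case predicted as positive
        "FN": counts[(True, False)],   # positive case predicted as negative
        "FP": counts[(False, True)],   # negative case predicted as positive
        "TN": counts[(False, False)],  # negative case predicted as negative
    }


# Hypothetical true labels and model predictions for eight cases.
actual = [True, False, False, True, False, False, True, False]
predicted = [True, False, True, False, False, False, True, False]
matrix = confusion_matrix(actual, predicted)
print(matrix)
```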

Model evaluation cont.
- From the confusion matrix you can derive a series of measures assessing the quality of the classifier

  \[ \text{accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \]
  \[ \text{precision} = \frac{TP}{TP + FP} \]
  \[ \text{recall} = \frac{TP}{TP + FN} \]
  \[ F\text{-score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]

- A single measure is handy
  - A high F-score means high precision and recall
  - AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
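The measures on this slide translate directly into code. The Python sketch below is illustrative only: returning 0.0 when a ratio is undefined is an assumed convention, AUC is computed by brute-force pair counting (ties count as half) to mirror the probabilistic definition rather than for speed, and the names and sample numbers are hypothetical.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F-score from confusion matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention: 0.0 if undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


def auc(actual, scores):
    """Share of (positive, negative) pairs in which the positive case scores higher."""
    positives = [s for a, s in zip(actual, scores) if a]
    negatives = [s for a, s in zip(actual, scores) if not a]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# A model that never predicts fraud, evaluated on 1 fraud case in 10,000.
paradox = metrics(tp=0, fn=1, fp=0, tn=9999)
print(paradox)

ranking_quality = auc([True, True, False, False], [0.9, 0.4, 0.5, 0.1])
print(ranking_quality)
```

The first example scores 99.99% accuracy while finding no fraud at all (recall and F-score are 0), which is why accuracy alone is a poor guide on heavily skewed data.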

Model improvement
- Model quality depends on many elements
  - Gathering relevant and representative source data
  - Proper data preparation
  - Enriching training data
  - Selecting the appropriate algorithm
  - Hyperparameter tuning
- Do you remember the data mining life cycle?

Model operationalization

Thank you
- We moved from raw data to an intelligent fraud detection system in one hour
  - Take some time to walk through the code at your own pace
- Please evaluate all sessions
- After this session, you can speak with me
  - In the conference venue
  - Via social media: https://www.linkedin.com/in/marcinszeliga/
  - Through email: marcin.szeliga@datacommunity.pl