Data Science that scales
Marcin Szeliga
SQLSat Kyiv Team: Yevhen Nedashkivskyi, Alesya Zhuk, Eugene Polonichko, Oksana Borysenko, Mykola Pobyivovk, Oksana Tkach
Sponsor Sessions (start at 13:10)
Don't miss them, they may provide some interesting and valuable information!

Time          | Room A  | Room B    | Room C
13:10 - 13:30 | DevArt  | Microsoft | Eleks
13:30 - 13:50 | DB Best | Intapp    | DataArt

NULL means no session in that room at that time.
Our Awesome Sponsors
The session will begin very soon :) Please complete the evaluation form in your pocket after the session. Your feedback will help us improve future conferences, and the speakers will appreciate it! Enjoy the conference!
Marcin Szeliga, data philosopher
20 years of experience with SQL Server
Data Platform MVP & MCT
Microsoft Certified Solutions Expert: Data Platform, Data Management and Analytics, Cloud Platform and Infrastructure
marcin.szeliga@datacommunity.pl
Agenda
- Tools: MRO (Microsoft R Open), MRC (Microsoft R Client), MRS (Microsoft R Server)
- Tips & tricks on performing a data science experiment with 540 lines of R code: data ingestion, data preparation, data profiling, data enhancement, data modeling, model evaluation, model improvement, model operationalization
Microsoft R Open (MRO)
- Free and open source R distribution, based on open source R (Revolution R Open, to be precise)
- Compatible with all R-related software
- MRAN website: https://mran.revolutionanalytics.com/
- Enhanced and distributed by Microsoft: Intel MKL library, Reproducible R Toolkit, ParallelR, RHadoop, AzureML
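As an illustration of the Reproducible R Toolkit, here is a minimal sketch that pins package versions to an MRAN snapshot with the checkpoint package; the snapshot date and the ggplot2 call are arbitrary examples.

```r
# Reproducible R Toolkit: resolve packages exactly as they were
# published on MRAN on a given date (the date is just an example)
library(checkpoint)
checkpoint("2017-04-01")

# from this point on, library() calls use the snapshotted versions
library(ggplot2)
```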
Microsoft R Client (MRC)
- Free, community-supported data science tool for high-performance analytics: http://aka.ms/rclient/download
- Built on top of Microsoft R Open (MRO)
- Brings in the ScaleR technology and its proprietary functions
- Lets you work with production data, but only locally: the data to be processed must fit in local memory, and processing is limited to two threads for ScaleR functions
- R Tools for Visual Studio (RTVS) is an integrated development environment available as a free add-in for any edition of Visual Studio: https://www.visualstudio.com/vs/rtvs/
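A minimal sketch, assuming the RevoScaleR package that ships with R Client, for confirming the local compute context and the core cap mentioned above.

```r
library(RevoScaleR)

rxGetComputeContext()          # local sequential context by default in R Client
rxGetOption("numCoresToUse")   # the core limit applied to ScaleR functions
```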
Microsoft R Server (MRS) 9
- R for the enterprise
- Available for download from MSDN and Visual Studio Dev Essentials
- Adds support for: remote execution, remote compute contexts, data chunking, additional threads for multithreaded processing, parallel processing and streaming
- R Server platforms: R Server for Hadoop, R Server for Teradata DB, R Server for Linux, R Server for Windows, SQL Server R Services
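To make remote compute contexts concrete, here is a minimal sketch that pushes ScaleR computation into SQL Server R Services; the connection string, table and column names are hypothetical.

```r
library(RevoScaleR)

# hypothetical connection string: adjust server, database and credentials
sqlConnString <- "Driver=SQL Server;Server=myserver;Database=Fraud;Trusted_Connection=True"

# switch the compute context: ScaleR calls now run inside SQL Server
rxSetComputeContext(RxInSqlServer(connectionString = sqlConnString))

# summarize a table next to the data instead of pulling it to the client
tx <- RxSqlServerData(table = "Transactions", connectionString = sqlConnString)
rxSummary(~ amount, data = tx)
```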
What's new in MRS 9
- MRS 9.0 brings:
  - State-of-the-art machine learning algorithms (MicrosoftML library): fast linear learner with support for L1 and L2 regularization, fast boosted decision tree, fast random forest, logistic regression with support for L1 and L2 regularization, GPU-accelerated Deep Neural Networks (DNNs) with convolutions, binary classification using a One-Class Support Vector Machine
  - Simplified operationalization of R models (mrsdeploy)
- MRS 9.1 adds:
  - New data sources for Apache Hive and Parquet
  - Pre-trained cognitive models for sentiment analysis and image featurization
  - New platform: Apache Spark on an HDInsight cluster
  - Real-time scoring
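As a taste of the MicrosoftML library, here is a minimal sketch that trains one of the listed algorithms, a fast boosted decision tree, for binary classification; the train and test data frames and their columns are hypothetical.

```r
library(MicrosoftML)

# 'fraud' is a (hypothetical) binary label, the rest are features
model <- rxFastTrees(fraud ~ amount + age + country,
                     data = train,
                     type = "binary",
                     numTrees = 100)

# score unseen transactions
scores <- rxPredict(model, data = test)
```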
Data Science approach: follow the data
SOURCE: CRISP-DM 1.0, http://www.crisp-dm.org/download.htm
DESIGN: Nicole Leaper, http://www.nicoleleaper.com
Solving problems with machine learning
- "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E" (Tom Mitchell)
- Thanks to the law of large numbers this is possible
- Hoeffding's inequality allows us to measure how trustworthy the results are (see the bound below)
- Our goal is to classify transactions as fraudulent or not, based on historical and demographic data
- The training data are a set of cases (observations), each described by a list of attributes (input, explanatory, x, independent variables, or just features) and label classes (output, explained, y, dependent variable, or just labels)
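For reference, a standard form of Hoeffding's inequality for N independent samples, where $\nu$ is the measured in-sample frequency and $\mu$ the true probability:

```latex
P\left[\, \lvert \nu - \mu \rvert > \varepsilon \,\right] \le 2e^{-2\varepsilon^{2} N}
```

The bound tightens exponentially as N grows, which is why performance measured on a large dataset can be trusted.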
Data ingestion
- Tidy your data: each variable forms a column, each observation forms a row, each type of observational unit forms a table
- Ensure fast access to the data: convert it into a compressed format optimized for fast processing; XDF (eXternal Data Frame) is a binary file format with an R interface that optimizes row and column processing and analysis
- Move computation to where the data is stored: set a remote compute context
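A minimal sketch of the XDF conversion; the CSV and XDF file names are hypothetical.

```r
library(RevoScaleR)

# convert a raw CSV extract into a compressed, chunked XDF file
txXdf <- rxImport(inData = "transactions.csv",
                  outFile = "transactions.xdf",
                  overwrite = TRUE)

# inspect metadata and a few rows without loading the whole file into RAM
rxGetInfo(txXdf, getVarInfo = TRUE, numRows = 5)
```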
Data preparation
- Predictive models should provide the most accurate and reliable predictions, so feel free to add variables, transform them, and play with model parameters
- Find more data: datasets can be combined if they have at least one common variable
- Impute missing values: try to minimize changes in the variables' distributions
- Correct bad data: data that does not comply with business rules or common sense
- Deal with outliers: unusual values that fall more than 1.5 * IQR outside the quartiles (see the sketch below)
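A minimal sketch of median imputation and the 1.5 * IQR outlier rule; the df data frame and its amount column are hypothetical.

```r
# median imputation keeps the centre of the distribution intact
df$amount[is.na(df$amount)] <- median(df$amount, na.rm = TRUE)

# flag values more than 1.5 * IQR beyond the quartiles as outliers
q   <- quantile(df$amount, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]
df$is_outlier <- df$amount < q[1] - 1.5 * iqr | df$amount > q[2] + 1.5 * iqr
```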
Data profiling
- For each variable: check how much information it contains (variance will help); assess its quality (range, number of missing observations, duplicates, outliers)
- Search for patterns: if a systematic relationship exists between two variables, it will appear as a pattern in the data
- When you spot a pattern, ask yourself: Could this pattern be due to coincidence? How can you describe the relationship implied by the pattern? How strong is that relationship? What other variables might affect it?
- Descriptive statistics are simplifications; graphs tend to be more relevant and easier to interpret
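A minimal sketch of profiling with ScaleR, reusing the hypothetical txXdf file from the ingestion step.

```r
library(RevoScaleR)

# summary statistics (mean, std dev, min, max, missing counts) per variable
rxSummary(~., data = txXdf)

# the distribution of a single (hypothetical) variable
rxHistogram(~amount, data = txXdf)
```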
Data enhancement
- Add features: if you have knowledge of the domain, you can calculate them on the basis of other attributes; computed variables can also be the result of technical transformations
- Split the data into train, test and control sets (cross-validation is even better, but slower): the train set is used to detect patterns, the test set is used to detect errors always present in the data, and the control set is used only once, for a final assessment of the data mining model (see the sketch below)
- If the distribution of the output variable is heavily skewed, you should balance it; accuracy paradox: a model with 99.99% accuracy can be completely useless
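A minimal sketch of a random 70/20/10 split; the df data frame is hypothetical.

```r
set.seed(42)

# assign every row to one of the three sets
ix <- sample(c("train", "test", "control"), nrow(df),
             replace = TRUE, prob = c(0.7, 0.2, 0.1))

train   <- df[ix == "train", ]
test    <- df[ix == "test", ]
control <- df[ix == "control", ]
```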
Data modeling
- Classification and regression are methods of supervised learning: the source data contains the ground truth
- Most data mining algorithms can be used for both tasks: logistic regression, linear regression, boosted decision tree, random forest, neural net
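A minimal sketch of training one of the listed algorithms with ScaleR; the formula and the train/test data frames are hypothetical.

```r
library(RevoScaleR)

# logistic regression on the training set
logit <- rxLogit(fraud ~ amount + age, data = train)
summary(logit)

# predicted fraud probabilities for the test set
# (by default the prediction column is named after the response, fraud_Pred)
test$score <- rxPredict(logit, data = test)$fraud_Pred
```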
Model evaluation
- There is no single best model; ongoing evaluation of model performance is a must
- The best models are simple models that fit the data well: we need a balance between accuracy and simplicity
- In a binary classification scenario, the target variable has only two possible outcomes: one is called positive (p), the second negative (n)
- Since for each case the true value of the output variable is known, we can simply submit these records for classification and compare the predictions with the true values
Model evaluation cont.
- From the confusion matrix you can derive a series of measures assessing the quality of the classifier:
  $accuracy = \frac{TP + TN}{TP + FN + FP + TN}$
  $precision = \frac{TP}{TP + FP}$
  $recall = \frac{TP}{TP + FN}$
  $F\text{-}score = 2 \cdot \frac{precision \cdot recall}{precision + recall}$
- A single measure is handy: a high F-score means high precision and recall
- AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (see the sketch below)
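A minimal sketch computing the measures above from 0/1 labels; pred and actual are hypothetical vector names.

```r
# cells of the confusion matrix
TP <- sum(pred == 1 & actual == 1)
TN <- sum(pred == 0 & actual == 0)
FP <- sum(pred == 1 & actual == 0)
FN <- sum(pred == 0 & actual == 1)

accuracy  <- (TP + TN) / (TP + FN + FP + TN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f_score   <- 2 * precision * recall / (precision + recall)
```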
Model improvement
- Model quality depends on many elements: gathering relevant and representative source data, proper data preparation, enriching the train data, selecting the appropriate algorithm, hyperparameter tuning (see the sketch below)
- Do you remember the data mining life cycle?
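A minimal sketch of hyperparameter tuning as a grid search over tree depth; the formula, the train/test data frames and the 0/1 coding of fraud are hypothetical.

```r
library(RevoScaleR)

# try a few depths and report test accuracy for each
# (fraud is assumed to be coded 0/1, so the tree prediction acts as a score)
for (depth in c(2, 4, 8, 16)) {
  model <- rxDTree(fraud ~ amount + age, data = train, maxDepth = depth)
  score <- rxPredict(model, data = test, predVarNames = "score")$score
  acc   <- mean((score > 0.5) == (test$fraud == 1))
  cat("maxDepth =", depth, " accuracy =", round(acc, 4), "\n")
}
```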
Model operationalization
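To make operationalization concrete, here is a minimal sketch that publishes the trained model as a web service with the mrsdeploy package mentioned earlier; the server URL, credentials and scoring function are hypothetical.

```r
library(mrsdeploy)

# connect to an operationalized R Server node
remoteLogin("http://localhost:12800",
            username = "admin", password = "***", session = FALSE)

# a scoring function shipped together with the trained model object
scoreTx <- function(amount, age) {
  newdata <- data.frame(amount = amount, age = age)
  rxPredict(logit, data = newdata)$fraud_Pred
}

api <- publishService("fraudScoring",
                      code = scoreTx,
                      model = logit,
                      inputs = list(amount = "numeric", age = "numeric"),
                      outputs = list(answer = "numeric"),
                      v = "v1.0.0")
```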
Thank you
- We moved from raw data to an intelligent fraud detection system in one hour
- Take some time to walk through the code at your own pace
- Please evaluate all sessions
- After this session, you can speak with me: at the conference venue, via social media (https://www.linkedin.com/in/marcinszeliga/), or through email (marcin.szeliga@datacommunity.pl)