Azure Machine Learning: My first Data Science experiment using Azure Machine Learning
Speaker: Florian / fleid.fr, Cellenza, 156, bd Haussmann, Paris, France
For whom? Audience chart: from Beginner to Experienced, for Machine Learning and for Azure ML
Agenda: a quick word on Azure ML, then two experiments. First: I'm the owner of a flat/condo in Paris and I want to sell it! Second: I'm in marketing and I want my promotional emails to reach their targets through anti-spam software.
Azure ML in one schema: from Business Need to Business Value, through Modeling and Deployment. Data sources are in the cloud (HDInsight, SQL Server VM, SQL DB, Blobs & Tables) or local (local files, Excel files, …). Building blocks: storage space; ML Studio, the IDE for Machine Learning; API, publication as a web service; Microsoft Azure Marketplace, monetization; Web.
First: enable your ML Studio in the Azure portal, with an Azure account
Azure ML Studio
Before that: my 1st experiment. I want to sell my flat: Paris, France, 2 bedrooms, 55 m², … But at what price?
How to answer that?
Plot: Price (€) against Surface (m²), locating my flat. A fair price!
But how to generalize? Thousands of price points (facts), often hundreds of features (dimension attributes): surface, number of rooms, storage area, parking, exact location in town, floor (correlated to the presence of a lift), age of the building, empty or equipped, distance to metro / public transportation, distance to shops, … Machine Learning!
Machine Learning: building a system that will learn from the existing data, detecting patterns and trends, so that it can predict a continuous value! Supervised Learning >> Regression
Linear Regression (1 feature). Plot: Price (€) against Surface (m²), showing my flat and the market price. $y = ax + b$, where $y$ is the price and $x$ is the surface.
My ML System: my surface ($x$) goes in, Machine Learning applies $y = ax + b$, and a good price estimate ($y$) comes out. Plot: Price (€) against Surface (m²).
My ML System. Input: $x$, my surface. Output: $y$, an estimate of price. $h$: the hypothesis, with $y = h(x)$ and $y = ax + b$. Plot: Price (€) against Surface (m²).
Input: $x$, my surface. Output: $y$, an estimate of price. $h$: the hypothesis, $y = h(x)$. On the plot, $\theta_0$ is the intercept of the line $y = \theta_1 x + \theta_0$, so $h(x) = h_\theta(x) = \theta_0 + \theta_1 x$.
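The hypothesis above can be written as a one-line function. This is a minimal sketch: the parameter values are hypothetical and chosen only for illustration (they are not taken from the slides); only the 55 m² surface comes from the deck.

```python
def hypothesis(theta0: float, theta1: float, x: float) -> float:
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Hypothetical parameters, for illustration only:
# a base price of 50,000 EUR plus 8,000 EUR per square metre.
theta0 = 50_000.0
theta1 = 8_000.0

# Price estimate for the 55 m2 flat mentioned earlier in the deck.
estimate = hypothesis(theta0, theta1, 55.0)
```

Training the model means finding the values of `theta0` and `theta1` that fit the market data best.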
Parameters ranking: Cost Function. $J(\theta_i)$, the cost function, is a function of the thetas that calculates the total distance between my model and the training set; the lower the cost, the better the parameters. For $y = \theta_1 x + \theta_0$, the slide compares Model A ($\theta_0 = 1$, $\theta_1 = 0$) and Model B ($\theta_0 = 1$, $\theta_1 = 0.25$), with costs $J(\theta_0, \theta_1) = 25$ and $J(\theta_0, \theta_1) = 5$.
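As a rough illustration of ranking parameters by cost, the sketch below scores Model A and Model B with a squared-error cost. Two assumptions: the slide does not say which distance it uses, so the squared-error form is one common choice, and the training set is a hypothetical toy one, so the values will not match the 25 and 5 shown on the slide.

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Hypothetical training set (not from the slides): three points on y = 1 + 0.25x.
xs = [0.0, 4.0, 8.0]
ys = [1.0, 2.0, 3.0]

cost_a = cost(1.0, 0.0, xs, ys)   # Model A: theta0 = 1, theta1 = 0
cost_b = cost(1.0, 0.25, xs, ys)  # Model B: theta0 = 1, theta1 = 0.25

# The model with the lower cost is ranked first.
best = "A" if cost_a < cost_b else "B"
```

On this toy data Model B passes through every point, so its cost is zero and it is ranked ahead of Model A.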
The last piece of the puzzle: training set, model type, cost (error) function, and … an optimization method. $y = h(x)$, $h(x) = h_\theta(x) = \theta_0 + \theta_1 x$
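The slide does not name a specific optimization method. As one possible example, here is a sketch of batch gradient descent on the squared-error cost. The learning rate, iteration count and training set are hypothetical values picked so that this toy problem converges.

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Prediction error of the current model on every training point.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J = 1/(2m) * sum(error^2).
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Move both parameters a small step against the gradient.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Hypothetical training set: three points on y = 1 + 0.25x.
xs = [0.0, 4.0, 8.0]
ys = [1.0, 2.0, 3.0]
theta0, theta1 = gradient_descent(xs, ys)
```

With these settings the result ends up very close to `theta0 = 1` and `theta1 = 0.25`, the line the toy points were drawn from.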
My ML System. Input: $x$, my surface. Output: $y$, an estimate of price. $h$: the hypothesis, $y = h(x)$, with $y = ax + b$, found using a cost function and an optimization method. Plot: Price (€) against Surface (m²).
Demo 1
Variance and Bias: underfit (high bias) versus overfit (high variance)
My 2nd experiment. As a (spammer) marketing professional, how can I be sure that my (ads) high-value content (optimize the ROI) gets maximum viewing on the prospect listings I got (from that shady company)? In short: I want to know whether my messages are going to be flagged as spam before I send them.
Exposing the API to users in Excel: SPAM!
To get there…
Machine Learning: building a system that will learn from the existing data, detecting patterns and trends, so that it can predict a category! Supervised Learning >> Classification
What features for my classification? 1st experiment: surface, location, floor… Now? 1 line = 1 message, in a table with columns Label, Attribute 0, Attribute 1, Attribute 2, …, where each row is labeled Spam or Ham and holds a few numeric values.
Intuition
Label  offer  new  service  revolutionize  …
Spam   1      1    1        1
The data set: SpamAssassin (emails, unstructured text)
Standard approach. Normalization: url > #url; > #; $, £, € > #devise (currency). Removal of numbers, punctuation, stopwords and HTML tags. Lower case. Word length from 3 to 10 max. Stemming.
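The steps above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in the demo: the stopword list is a tiny hypothetical sample, `crude_stem` is a naive suffix stripper standing in for a real stemmer, and only the `#url` and `#devise` placeholders come from the slide.

```python
import re

# Tiny hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "you", "your", "this", "that", "with"}
# Placeholders named on the slide.
PLACEHOLDERS = {"#url", "#devise"}


def crude_stem(word):
    """Naive suffix stripping, a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(message):
    """Turn a raw message into a list of normalized tokens."""
    text = message.lower()                                    # lower case
    text = re.sub(r"<[^>]+>", " ", text)                      # drop HTML tags
    text = re.sub(r"(?:https?://|www\.)\S+", " #url ", text)  # url > #url
    text = re.sub(r"[$£€]", " #devise ", text)                # $, £, € > #devise
    tokens = []
    # Keeping only letter runs removes numbers and punctuation.
    for token in re.findall(r"#?[a-z]+", text):
        if token in PLACEHOLDERS:
            tokens.append(token)
            continue
        word = token.lstrip("#")
        if word in STOPWORDS or not 3 <= len(word) <= 10:
            continue                                          # stopwords, length 3 to 10
        tokens.append(crude_stem(word))                       # stemming
    return tokens


sample = "<p>Get 100 FREE offers for $5 at http://example.com/deal NOW!!!</p>"
tokens = normalize(sample)
```

For the sample message, the tokens keep the words "get", "free", "offer" and "now" plus the two placeholders; the numbers, the tags and the short or stop words are gone.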
Generation of the training corpus: 6000 emails > N reference words. We keep the top 10,000 by frequency of usage. The result is a set of 6000 lines with columns Label, Word 0, Word 1, Word 2, Word 3, …; each line is labeled Spam or Ham and is sparsely filled with small numbers (for example 2, 1, 1).
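A small sketch of that table construction, assuming the numbers in the table are per-message word counts (the slide does not say so explicitly). The three messages are hypothetical, already normalized, and the vocabulary is capped at 3 words instead of 10,000 to keep the example readable.

```python
from collections import Counter


def build_vocabulary(tokenized_messages, top_n):
    """Return the top_n most frequent words across all messages."""
    counts = Counter(token for tokens in tokenized_messages for token in tokens)
    return [word for word, _ in counts.most_common(top_n)]


def to_count_row(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one message."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Hypothetical corpus: (label, normalized tokens).
corpus = [
    ("Spam", ["free", "offer", "free", "now"]),
    ("Ham", ["meeting", "now"]),
    ("Spam", ["offer", "free", "free", "now"]),
]

vocabulary = build_vocabulary([tokens for _, tokens in corpus], top_n=3)
rows = [(label, to_count_row(tokens, vocabulary)) for label, tokens in corpus]
```

Each entry of `rows` is one line of the training set: the label followed by one count per reference word. Words outside the vocabulary ("meeting" here) are simply dropped.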
Implementation in Azure ML. The Hate: no module for normalizing; it has to be done beforehand, in an ETL-like data pipeline. The Love: we don't need to do a full normalization! Feature Hashing using Vowpal Wabbit.
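To give an idea of what feature hashing does, here is a sketch of the general technique. It is not the Vowpal Wabbit or Azure ML implementation: the hash function (SHA-256 from the standard library) and the 16-bucket size are assumptions chosen for a short, reproducible example.

```python
import hashlib


def hash_features(tokens, n_buckets=16):
    """Map tokens to a fixed-size count vector without building a vocabulary."""
    vector = [0] * n_buckets
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        # Use the first 4 bytes of the digest as an integer, then pick a bucket.
        index = int.from_bytes(digest[:4], "big") % n_buckets
        vector[index] += 1
    return vector


features = hash_features(["free", "offer", "free", "now"])
```

The same word always lands in the same bucket, so no reference word list is needed; the trade-off is that two different words may collide in one bucket.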
Demo 2. Sources: SpamAssassin; Coursera: Machine Learning by Andrew Ng (ex 6: spam detection with SVM); Classifying Emails as Spam or Ham using RTextTools, Dennis Lee (blog); AzureML Web Service Scoring with Excel and Power Query, Rui Quintino (blog)
To go further. For everyone: 1 month free trial. For MSDN subscribers: activate your Azure benefits. Download now: included in almost all Office licences (us/powerBI/support/default.aspx). NB: Power BI in Excel is not hosted at PowerBI.com; be aware of this when you try to download it.
To go further, the communities: sqlpass.org, sqlport.com, guss.pro