Welcome everyone. Been to good sessions, exciting ones coming up. My first SQL Bits session. Introduction to me.
Machine Learning: The Maths Behind
My introduction to ML. Data analytics world getting more interested in ML. Noticed a trend: treated as a black box. A shame. Need to understand it to use it effectively. Otherwise more frustration, missing the fun. Things I want to put right today. To show how quick it can be: 3 algorithms. 3 popular ones, used before, will use again.
Quick overview – classic classification problem. Variables & class.
Classification. Supervised: we know if it’s dog or cat. [Plot axes: height, weight]
Decision Tree Classification Widely used, popular, easy to understand.
Would I have survived?
Counts are survived : died.

All passengers 500 : 809 (Died)
    Men 142 : 640 (Died)
        1st Class 55 : 114 (Died)
        2nd Class 22 : 138 (Died)
        3rd Class 65 : 388 (Died)
    Women 308 : 112 (Survived)
        1st Class 124 : 4 (Survived)
        2nd Class 85 : 12 (Survived)
        3rd Class 99 : 96 (Survived)
    Children 50 : 57 (Died)
        1st Class 21 : 5 (Survived)
        2nd Class 12 : 8 (Survived)
        3rd Class 17 : 44 (Died)

14%
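A minimal sketch of how each leaf of the tree above gets its label: a majority vote over the counts at that leaf. The counts are copied from the slide; the names LEAVES and leaf_prediction are made up for this example, and sending ties to "Died" is an assumption (no leaf here is tied).

```python
# Leaf counts copied from the tree above, as (survived, died).
LEAVES = {
    ("Men", "1st"): (55, 114),
    ("Men", "2nd"): (22, 138),
    ("Men", "3rd"): (65, 388),
    ("Women", "1st"): (124, 4),
    ("Women", "2nd"): (85, 12),
    ("Women", "3rd"): (99, 96),
    ("Children", "1st"): (21, 5),
    ("Children", "2nd"): (12, 8),
    ("Children", "3rd"): (17, 44),
}


def leaf_prediction(survived, died):
    """Majority vote at a leaf. Returns (label, survival rate).

    Ties go to "Died"; that is an arbitrary choice for this sketch.
    """
    rate = survived / (survived + died)
    label = "Survived" if survived > died else "Died"
    return label, rate


for (group, pclass), (survived, died) in LEAVES.items():
    label, rate = leaf_prediction(survived, died)
    print(f"{group:9s} {pclass:4s} {label:9s} {rate:.0%}")
```

The leaf counts add back up to the root: 500 survived and 809 died.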
Classification Demo
Which variable do we split on? The best variable.
Variables: pclass, age, name, cabin, fare, boat, body, parch, ticket, survived, embarked, home.dest, sex, sibsp
The Gini Index
$Gini = 1 - \sum_{i=1}^{C} (p_i)^2$
where $p_i$ is the proportion of class $i$ and $C$ is the number of classes.
Gini index originally used to calculate income inequality. Derived 1912 – present. Low Gini index is desired.
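$g(P_s) = 1 - \left(\frac{500}{1309}\right)^2 - \left(\frac{809}{1309}\right)^2 \approx 0.4721$
$g(P_m) = 1 - \left(\frac{161}{843}\right)^2 - \left(\frac{682}{843}\right)^2 \approx 0.3090$
$g(P_f) = 1 - \left(\frac{339}{466}\right)^2 - \left(\frac{127}{466}\right)^2 \approx 0.3965$
$G = 0.4721 - \frac{843}{1309} \times 0.3090 - \frac{466}{1309} \times 0.3965 \approx 0.132$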
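The arithmetic above can be checked in a few lines. This is a sketch only: the function names gini and gini_gain are made up for the example, and the counts are the (survived, died) pairs from the slide.

```python
def gini(counts):
    """Gini index of a node, given the class counts at that node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_gain(parent, children):
    """Parent Gini minus the size-weighted Gini of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - weighted


# (survived, died) counts from the slide.
all_passengers = (500, 809)
male = (161, 682)
female = (339, 127)

print(round(gini(all_passengers), 4))                        # 0.4721
print(round(gini(male), 4))                                  # 0.309
print(round(gini(female), 4))                                # 0.3965
print(round(gini_gain(all_passengers, [male, female]), 3))   # 0.132
```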
sex = 0.132
Branches: sex = male, sex = female
Variable orderings shown: fare, pclass, sibsp, parch, age, embarked; age, pclass, fare, parch, embarked, sibsp; pclass, fare, sibsp, embarked, parch, age
pclass: values 1, 2, 3; split points 1.5, 2.5. Continue recursively until a subgroup reaches the minimum size or there is no improvement. Greedy. Cross-validation performed to trim.
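A sketch of the greedy step for pclass: try each split point (the midpoints 1.5 and 2.5) and keep the one with the largest drop in Gini. The per-class counts are summed from the leaf counts in the survival tree earlier; the helper names are made up for this example, and a real tree would repeat this over every variable at every node.

```python
def gini(counts):
    """Gini index of a node, given the class counts at that node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def midpoints(values):
    """Candidate split points: halfway between neighbouring distinct values."""
    ordered = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(ordered, ordered[1:])]


# (survived, died) per passenger class, summed from the tree's leaf counts.
BY_CLASS = {1: (200, 123), 2: (119, 158), 3: (181, 528)}


def split_gain(threshold):
    """Gini gain of splitting into pclass <= threshold and pclass > threshold."""
    left = [sum(c[i] for k, c in BY_CLASS.items() if k <= threshold) for i in (0, 1)]
    right = [sum(c[i] for k, c in BY_CLASS.items() if k > threshold) for i in (0, 1)]
    parent = [left[0] + right[0], left[1] + right[1]]
    n = sum(parent)
    weighted = sum(left) / n * gini(left) + sum(right) / n * gini(right)
    return gini(parent) - weighted


candidates = midpoints(BY_CLASS)  # iterating a dict yields its keys: 1, 2, 3
best = max(candidates, key=split_gain)
for t in candidates:
    print(t, round(split_gain(t), 4))
print("best split point:", best)
```

On these counts both pclass split points score well below the 0.132 found for sex.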
Used a lot in data analytics. Similar to classification.
Regression [Plot axes: profit, time]
Support Vector Regression. SVR similar to SVM, getting more popular. Works well with non-linear data.
How much is my car worth?
$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$
$b = r \frac{\sigma_y}{\sigma_x}$
$a = \bar{y} - b\bar{x}$
$y = a + bx$
Uses a kernel function, in this case the Radial Basis Function kernel.
[Plot: price (€), 0 to 15,000, against year, 1980 to 2015]
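The straight-line formulas above, as a minimal sketch. The function name fit_line and the (year, price) pairs are made up for illustration; they are not the data from the talk.

```python
from math import sqrt


def fit_line(xs, ys):
    """Return (a, b, r) for the line y = a + b*x, using b = r * sigma_y / sigma_x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    r = sxy / sqrt(sxx * syy)        # correlation coefficient
    sigma_x = sqrt(sxx / n)          # population standard deviations
    sigma_y = sqrt(syy / n)
    b = r * sigma_y / sigma_x        # slope
    a = y_bar - b * x_bar            # intercept
    return a, b, r


# Hypothetical (year, price in €) pairs, for illustration only.
years = [1995, 2000, 2005, 2010, 2015]
prices = [1500, 3000, 5500, 9000, 14000]

a, b, r = fit_line(years, prices)
print(f"price = {a:.0f} + {b:.1f} * year, r = {r:.3f}")
```

A straight line is the baseline here; the kernel is what lets SVR bend to follow non-linear data.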
[Plot: price (€) against year; prices marked at 0, 3,000, 10,000 and 15,000; years marked at 1980, 2003, 2011 and 2015]
[Plot: price (€), 0 to 15,000, against year, 1980 to 2015]
[Plot: price (€), 0 to 15,000, against year, 1980 to 2015, with the margin lines $f(x) + \varepsilon$ and $f(x) - \varepsilon$]
Points outside the margin are penalised. Principle of maximal margin. Important to prevent overfitting.
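A sketch of the two ingredients named so far: the ε margin and the Radial Basis Function kernel. The penalty shown is the standard ε-insensitive loss, where points inside the margin cost nothing. The function names, the ε of 500 and the gamma value are assumptions for this example, not settings from the demo.

```python
from math import exp


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the margin f(x) +/- epsilon, linear in the excess outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def rbf_kernel(x1, x2, gamma):
    """Radial Basis Function kernel for scalars: exp(-gamma * (x1 - x2)^2)."""
    return exp(-gamma * (x1 - x2) ** 2)


# Hypothetical prices in €, with a margin of 500 either side of the prediction.
print(epsilon_insensitive_loss(10000, 10400, epsilon=500))  # inside the margin
print(epsilon_insensitive_loss(10000, 11200, epsilon=500))  # 700 outside it

# Similarity between two years falls off with their distance.
print(rbf_kernel(2003, 2003, gamma=0.01))
print(rbf_kernel(2003, 2011, gamma=0.01))
```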
Regression Demo
[Plot: price (€), 0 to 200,000, against year, 1980 to 2015]
A third dimension… and so on: fourth, fifth, sixth.
Netflix clustering. Similar to HBO.
Clustering
K-Means Clustering K-Means – most popular, released decades ago, improvements.
What new music can I discover?
$\text{Euclidean distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
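A minimal K-Means sketch built on the Euclidean distance above: assign each point to its nearest centroid, move each centroid to the mean of its points, repeat. Starting from the first k points and running a fixed number of rounds are simplifications for this example, and the 2-D points are made up; they are not the music data from the demo.

```python
from math import dist  # Euclidean distance, Python 3.8+


def k_means(points, k, iterations=10):
    """Plain K-Means on tuples of numbers. Returns (centroids, clusters)."""
    centroids = list(points[:k])  # simplistic start: the first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```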
Clustering Demo. Better algorithms for this problem: Gaussian mixture models, hierarchical clustering.
We’ve gone through 3 algorithms in 1 hour. List of good sources…
github.com/matt-willis
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf
www.listendata.com
alex.smola.org/papers/2004/SmoSch04.pdf
data-flair.training/blogs
www.analyticsvidhya.com
www.kdnuggets.com
cowlet.org
hackernoon.com
r4ds.had.co.nz
blogs.adatis.co.uk
towardsdatascience.com
archive.ics.uci.edu/ml/datasets.html
www.r-bloggers.com
elitedatascience.com
www.datacamp.com
www.datasciencecentral.com
trevorstephens.com
kernelsvm.tripod.com
https://runawayhorse001.github.io/LearningApacheSpark/clustering.html
www.coursera.org
www.learnbymarketing.com
wiki.icub.org/images/8/82/OnlineSVR_Thesis.pdf
matt.willis@adatis.bg Email me. Speak to me. I hope you’ve learnt something. More importantly, changed your approach.