1
Introduction to ML with Apache Spark MLlib #javaone
2
@tmatyashovsky
3
I am not a data science engineer
4
Motivation
5
Given a verse from lyrics, recognize the genre
6
Pop vs. Heavy Metal
“I'm a rolling thunder, a pouring rain
I'm comin' on like a hurricane
My lightning's flashing across the sky
You're only young but you're gonna die
I won't take no prisoners, won't spare no lives
Nobody's putting up a fight
I got my bell, I'm gonna take you to hell
I'm gonna get you, Satan get you”
8
Ideas?
9
Ideas
Look for particular words like “fear”, “fight”, “kill”, “devil”, “death”, etc.?
Count the length of a verse?
Count unique words in a verse?
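The three ideas above can be tried without any machine learning. The following is a minimal plain-Java sketch; the class name `VerseIdeas`, the method names, and the keyword list (copied from the slide) are assumptions made for illustration, not code from the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hand-crafted features for a verse; a sketch of the ideas above, not a trained model.
final class VerseIdeas {

    // Keyword list copied from the slide; a real list would be longer (assumption).
    static final Set<String> DARK_WORDS =
            new HashSet<>(Arrays.asList("fear", "fight", "kill", "devil", "death"));

    // Lowercase the verse and split it on anything that is not a letter or an apostrophe.
    static List<String> words(String verse) {
        List<String> result = new ArrayList<>();
        for (String part : verse.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!part.isEmpty()) {
                result.add(part);
            }
        }
        return result;
    }

    // Idea 1: how many words of the verse are on the keyword list.
    static int darkWordCount(String verse) {
        int count = 0;
        for (String word : words(verse)) {
            if (DARK_WORDS.contains(word)) {
                count++;
            }
        }
        return count;
    }

    // Idea 2: length of the verse, measured in words.
    static int length(String verse) {
        return words(verse).size();
    }

    // Idea 3: number of distinct words in the verse.
    static int uniqueWords(String verse) {
        return new HashSet<>(words(verse)).size();
    }
}
```

Rules like these are easy to write but hard to tune by hand, which is the motivation for letting an algorithm learn the decision from labeled examples.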
10
Machine Learning?
11
Machine Learning in 15-20 mins
12
Machine Learning is the study of computer algorithms that improve automatically through experience
13
Supervised learning
Unsupervised learning
Reinforcement learning
14
Supervised Learning
15
Speakers’ Feedback Dataset
Date & time
Conference name
Speaker
Talk name
Track
Duration
Type
Overall impression
Overall rating
Number of slides
Time spent on live coding
Number of jokes
Etc.
16
Features:
Target variable:
Training example:
Training set:
Learning algorithms
Hypotheses:
Cost function:
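The symbols that followed each term on this slide were not preserved. As an assumption, the terms can be written in the notation commonly used in introductory ML courses; the squared-error cost shown here is one common choice for regression and is not necessarily the one on the original slide.

```latex
\begin{aligned}
\text{Features:} \quad & x = (x_1, \dots, x_n) \\
\text{Target variable:} \quad & y \\
\text{Training example:} \quad & \left(x^{(i)}, y^{(i)}\right) \\
\text{Training set:} \quad & \left\{ \left(x^{(i)}, y^{(i)}\right) \right\}_{i=1}^{m} \\
\text{Hypothesis:} \quad & h_\theta(x) \approx y \\
\text{Cost function:} \quad & J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2
\end{aligned}
```

A learning algorithm takes the training set and picks the parameters \theta that make the cost J(\theta) small.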
18
Regression
Speaker’s rating: score of the speaker based on xxx.
[Plot: speaker’s rating vs. number of jokes during a talk]
19
Linear Regression
20
OMG, Math at a conference?
21
Linear Regression
22
Linear Regression
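The formulas on the linear regression slides were not preserved. As a stand-in, here is a minimal plain-Java sketch that fits a straight line to one feature (for example, number of jokes) with the closed-form least-squares solution. The class name `SimpleLinearRegression` and its methods are hypothetical, and the closed form is swapped in for brevity; the talk may well have used a different fitting procedure.

```java
// One-feature linear regression: rating = intercept + slope * jokes (names are illustrative).
final class SimpleLinearRegression {

    final double intercept;
    final double slope;

    SimpleLinearRegression(double intercept, double slope) {
        this.intercept = intercept;
        this.slope = slope;
    }

    // Closed-form least squares: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
    static SimpleLinearRegression fit(double[] x, double[] y) {
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double sxy = 0.0;
        double sxx = 0.0;
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        return new SimpleLinearRegression(meanY - slope * meanX, slope);
    }

    // The hypothesis: a straight line evaluated at x.
    double predict(double x) {
        return intercept + slope * x;
    }
}
```

Any data passed to `fit` in an experiment would be made up; the speakers' feedback dataset from the talk is not available here.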
23
No magic, it’s just math
24
Regression
25
Classification
Impression: positive or negative (liked or not liked the speaker).
Number of jokes during a talk: quantity of jokes used.
[Plot: impression (positive/negative) vs. number of jokes during a talk]
26
Sigmoid (Logistic) Function
27
Logistic Regression
28
Logistic Regression
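The sigmoid and logistic regression formulas from these slides were not preserved. A minimal plain-Java sketch of the standard definitions follows: the sigmoid squashes a linear score into a probability between 0 and 1, and the prediction is 1 when that probability reaches 0.5. The class name `LogisticSketch`, the parameter layout (`theta[0]` as the intercept), and the 0.5 threshold are assumptions for illustration.

```java
// Logistic regression prediction step only; training (finding theta) is not shown.
final class LogisticSketch {

    // Sigmoid (logistic) function: maps any real number into the interval (0, 1).
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Probability of the positive class; theta[0] is the intercept, theta[j + 1] weights x[j].
    static double probability(double[] theta, double[] x) {
        double z = theta[0];
        for (int j = 0; j < x.length; j++) {
            z += theta[j + 1] * x[j];
        }
        return sigmoid(z);
    }

    // Classify as 1 (positive) when the probability is at least 0.5, otherwise 0.
    static int predict(double[] theta, double[] x) {
        return probability(theta, x) >= 0.5 ? 1 : 0;
    }
}
```

With hypothetical parameters `theta = {-2.5, 1.0}` and the number of jokes as the single feature, a talk with 5 jokes gets a positive score and is classified as 1, while a talk with 1 joke is classified as 0.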
29
No magic, it’s just math
31
Unsupervised Learning
32
Clustering
Number of clusters: K = 2, K = 5
[Plot: number of jokes during a talk vs. time (min.) spent on live coding]
33
K-Means
Initialize cluster centroids.
Assign each example to the cluster centroid closest to it.
Recalculate (move) each centroid as the average (mean) of the examples assigned to its cluster.
Repeat until the centroids no longer move.
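The K-Means steps above can be sketched in plain Java as follows. This is a minimal illustration, not Spark's implementation: the class name `KMeansSketch` is hypothetical, and the centroids are initialized from the first k points to keep the run deterministic, whereas real implementations usually pick the initial centroids randomly.

```java
// Minimal K-Means on points of any dimension; returns the cluster index of each point.
final class KMeansSketch {

    static int[] cluster(double[][] points, int k, int maxIterations) {
        int dims = points[0].length;

        // Step 1: initialize centroids (assumption: simply take the first k points).
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }

        int[] assignment = new int[points.length];
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Step 2: assign each example to the closest centroid.
            for (int i = 0; i < points.length; i++) {
                assignment[i] = closest(points[i], centroids);
            }

            // Step 3: recalculate each centroid as the mean of its assigned examples.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dims; d++) {
                    sums[assignment[i]][d] += points[i][d];
                }
            }
            boolean moved = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its centroid in place
                }
                for (int d = 0; d < dims; d++) {
                    double mean = sums[c][d] / counts[c];
                    if (mean != centroids[c][d]) {
                        moved = true;
                    }
                    centroids[c][d] = mean;
                }
            }

            // Step 4: stop once the centroids no longer move.
            if (!moved) {
                break;
            }
        }
        return assignment;
    }

    // Index of the centroid with the smallest squared Euclidean distance to the point.
    static int closest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double distance = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }
}
```

Each point could hold, for example, the number of jokes and the minutes of live coding for one talk; any such values used to try the sketch would be made up.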
34
K-Means
36
ML-based Solution
37
Pop vs. Heavy Metal
Collect a data set of lyrics:
Pop: Abba, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
Heavy metal: Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
Create a training set, i.e. label (0|1) + features
Train logistic regression (or another classification algorithm)
38
https://github.com/tmatyashovsky/spark-ml-samples
39
Feature Extraction
40
Bag of Words, TF-IDF, Word2Vec, GloVe
Bag of words: a single word is a one-hot encoding vector with the size of the dictionary. As a result, a lot of sparse vectors.
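To make the bag-of-words description concrete, here is a minimal plain-Java sketch: a dictionary maps each known word to an index, a single word becomes a one-hot vector, and a verse becomes the sum of its words' one-hot vectors (word counts). The class name `BagOfWordsSketch` and its methods are hypothetical, and words missing from the dictionary are simply ignored, which is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BagOfWordsSketch {

    // Build the dictionary: each distinct word gets the next free index, in order of first appearance.
    static Map<String, Integer> dictionary(List<String> corpusWords) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (String word : corpusWords) {
            dictionary.putIfAbsent(word, dictionary.size());
        }
        return dictionary;
    }

    // A single word as a one-hot vector with the size of the dictionary.
    static int[] oneHot(String word, Map<String, Integer> dictionary) {
        int[] vector = new int[dictionary.size()];
        Integer index = dictionary.get(word);
        if (index != null) {
            vector[index] = 1;
        }
        return vector;
    }

    // A list of words as the sum of their one-hot vectors, i.e. per-word counts.
    static int[] vectorize(List<String> words, Map<String, Integer> dictionary) {
        int[] vector = new int[dictionary.size()];
        for (String word : words) {
            Integer index = dictionary.get(word);
            if (index != null) {
                vector[index]++; // unknown words are skipped (assumption)
            }
        }
        return vector;
    }
}
```

With a realistic dictionary of tens of thousands of words, almost every position in these vectors is zero, which is the sparsity the slide refers to.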
41
Word2Vec
Produces unique fixed-size dense vectors.
Behind the scenes: a two-layer neural net that processes text.
Captures semantic and morphological similarity, so similar words are close in the vector space.
Similar words would be clustered together in the high-dimensional sphere.
42
Word2Vec
Similar scores (cos ~ 1), unrelated scores (cos ~ 0), opposite scores (cos ~ -1)
Example words: Open, Closed, Conference, Metal
If two words are very close to synonymous, you’d expect them to show up in similar contexts, and indeed synonymous words tend to be close. For two completely random words, the similarity is pretty close to 0. At the opposite end there is usually not an antonym, just noise. Used Google News Negative 300.
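The cos scores mentioned above are cosine similarities between word vectors. A minimal plain-Java sketch of the standard formula (dot product divided by the product of the vector lengths) follows; the class name `CosineSimilarity` is hypothetical, and any vectors passed in an experiment would be toy values rather than real Word2Vec embeddings.

```java
final class CosineSimilarity {

    // cos(a, b) = (a . b) / (|a| * |b|); ranges from -1 (opposite) through 0 (unrelated) to 1 (similar).
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Vectors pointing the same way score close to 1, perpendicular vectors score 0, and vectors pointing in opposite directions score close to -1, regardless of their lengths.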
43
“Love you” Similarity
Columns: Verse, Cosine Distance (values not preserved)
baby one more time
crazy for you
show me the meaning of being lonely
highway to hell
kill them all
My corpus words
44
https://github.com/tmatyashovsky/spark-ml-samples
45
The Best Model?
46
Model Selection: hyperparameter tuning
47
Evaluating Hypothesis
Under-fitting (high bias)
Appropriate fitting
Over-fitting (high variance)
48
K-folds Cross Validation
K = 3 | Training set (66.6%) | Test set (33.3%)
49
K-folds Cross Validation
K = 3 | Test set (33.3%) | Training set (66.6%)
50
K-folds Cross Validation
K = 3 | Training set (33.3%) | Test set (33.3%) | Training set (33.3%)
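The three K-fold slides show the same data split three ways, with a different third held out for testing each time. A minimal plain-Java sketch of that index bookkeeping follows. The class name `KFoldSketch` and its methods are hypothetical, and the folds are contiguous blocks for readability; library implementations typically assign examples to folds randomly.

```java
import java.util.ArrayList;
import java.util.List;

final class KFoldSketch {

    // Indices of the examples held out for testing in the given fold (folds are numbered from 0).
    static List<Integer> testIndices(int n, int k, int fold) {
        List<Integer> test = new ArrayList<>();
        int start = fold * n / k;
        int end = (fold + 1) * n / k;
        for (int i = start; i < end; i++) {
            test.add(i);
        }
        return test;
    }

    // Everything outside the held-out block is used for training.
    static List<Integer> trainIndices(int n, int k, int fold) {
        List<Integer> heldOut = testIndices(n, k, fold);
        List<Integer> train = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (!heldOut.contains(i)) {
                train.add(i);
            }
        }
        return train;
    }

    public static void main(String[] args) {
        // Print the three splits for 9 examples and K = 3.
        for (int fold = 0; fold < 3; fold++) {
            System.out.println("fold " + fold
                    + ": train=" + trainIndices(9, 3, fold)
                    + " test=" + testIndices(9, 3, fold));
        }
    }
}
```

The model is trained and evaluated once per fold, and the K evaluation metrics are averaged to compare hyperparameter settings.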
51
Phew, enough of theory!
52
Practice using Java
Let’s finally move on to the implementation, using a library or framework that helps us avoid tedious transformations and provides algorithms as well as feature extractors out of the box.
53
Weka Encog Aerosolve FlinkML
54
Speed, Generality, Cloud computing, Data processing, Ease of use
55
Component Stack
56
Spark MLlib is a library of ML algorithms and utilities designed to run in parallel on a Spark cluster
57
MLlib Design & Philosophy
Introduces a few new data types, e.g. vector (dense and sparse), labeled point, rating, etc.
Lets you invoke various algorithms on distributed datasets (RDD/Dataset)
58
spark.mllib: built on top of RDDs
spark.ml: built on top of Datasets
59
spark.mllib Features
Utilities: linear algebra, statistics, etc.
Feature extraction, feature transformation, etc.
Regression
Classification
Clustering
Collaborative filtering, e.g. alternating least squares
Dimensionality reduction
And many more
60
spark.ml Features
“All” spark.mllib features plus:
Pipelines
Persistence
Model selection and tuning:
Train validation split
K-folds cross validation
61
[Diagram: Raw data -> Dataset -> Transformer [parameters] -> Dataset -> Estimator [parameters]; Cross Validator [pipeline, evaluator, parameters]]
62
Pop vs. Heavy Metal Using Spark MLlib Pipeline
63
Spark ML Pipeline Lyrics
64
Raw Unknown Lyrics
I'm a rolling thunder, a pouring rain
I'm comin' on like a hurricane
My lightning's flashing across the sky
You're only young but you're gonna die
I won't take no prisoners, won't spare no lives
Nobody's putting up a fight
I got my bell, I'm gonna take you to hell
I'm gonna get you, Satan get you
65
Spark ML Pipeline
Stages (each passes a Dataset to the next): Lyrics -> Cleanser
66
Cleanser
I'm a rolling thunder, a pouring rain
I'm comin' on like a hurricane
My lightning's flashing across the sky
You're only young but you're gonna die
I won't take no prisoners, won't spare no lives
Nobody's putting up a fight
I got my bell, I'm gonna take you to hell
I'm gonna get you, Satan get you
67
Spark ML Pipeline
Stages (each passes a Dataset to the next): Lyrics -> Cleanser -> Numerator
68
Numerator
1 Im a rolling thunder a pouring rain
2 Im comin on like a hurricane
3 My lightnings flashing across the sky
4 Youre only young but youre gonna die
5 I wont take no prisoners wont spare no lives
6 Nobodys putting up a fight
7 I got my bell Im gonna take you to hell
8 Im gonna get you Satan get you
69
Spark ML Pipeline
Stages (each passes a Dataset to the next): Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover
70
Stop Words Remover
1 im a rolling thunder a pouring rain
2 im comin on like a hurricane
3 my lightnings flashing across the sky
4 youre only young but youre gonna die
5 i wont take no prisoners wont spare no lives
6 nobodys putting up a fight
7 i got my bell im gonna take you to hell
8 im gonna get you satan get you
71
Spark ML Pipeline
Stages (each passes a Dataset to the next): Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter
72
Stemmer
1 im rolling thunder pouring rain
2 im comin like hurricane
3 lightnings flashing across sky
4 youre young youre gonna die
5 wont take prisoners wont spare lives
6 nobodys putting fight
7 got bell im gonna take hell
8 im gonna get satan get
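The text transformations shown so far (strip punctuation, lowercase and split into words, drop stop words) can be reproduced in plain Java. This is a sketch for illustration only, not the Spark transformers used in the talk: the class name `LyricsCleaningSketch`, the regular expression, and the very short stop-word list are assumptions, and Spark's own `StopWordsRemover` ships a much longer default list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class LyricsCleaningSketch {

    // Tiny hypothetical stop-word list, just large enough for the example lyrics.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "i", "my", "on", "the", "only", "but", "no", "up", "you", "to"));

    // Cleanser: drop everything except letters and whitespace ("I'm" becomes "Im").
    static String cleanse(String line) {
        return line.replaceAll("[^A-Za-z\\s]", "").trim();
    }

    // Tokenizer: lowercase the line and split it on whitespace.
    static List<String> tokenize(String line) {
        return Arrays.asList(line.toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Stop words remover: keep only the tokens that are not on the stop-word list.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // The three steps chained together, joined back into one line of text.
    static String process(String line) {
        return String.join(" ", removeStopWords(tokenize(cleanse(line))));
    }
}
```

Stemming is left out of the sketch because it needs a real stemming algorithm rather than a few lines of string handling.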
73
Spark ML Pipeline
Stages (each passes a Dataset to the next): Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse]
74
Verser (sentencesInVerse = 4)
verse1
1 im roll thunder pour rain
2 im comin like hurrican
3 lightn flash across sky
4 your young your gonna die
verse2
5 wont take prison wont spare live
6 nobodi put fight
7 got bell im gonna take hell
8 im gonna get satan get
75
Verser (sentencesInVerse = 8)
verse1
1 im roll thunder pour rain
2 im comin like hurrican
3 lightn flash across sky
4 your young your gonna die
5 wont take prison wont spare live
6 nobodi put fight
7 got bell im gonna take hell
8 im gonna get satan get
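The effect of the sentencesInVerse parameter can be sketched in plain Java: consecutive sentences are grouped into chunks of the given size and each chunk is joined into one verse. The class name `VerserSketch` and the method `toVerses` are hypothetical; this illustrates the grouping only and is not the Spark transformer from the talk.

```java
import java.util.ArrayList;
import java.util.List;

final class VerserSketch {

    // Group consecutive sentences into verses of at most sentencesInVerse sentences each.
    static List<String> toVerses(List<String> sentences, int sentencesInVerse) {
        List<String> verses = new ArrayList<>();
        for (int start = 0; start < sentences.size(); start += sentencesInVerse) {
            int end = Math.min(start + sentencesInVerse, sentences.size());
            // Join the sentences of one verse with single spaces.
            verses.add(String.join(" ", sentences.subList(start, end)));
        }
        return verses;
    }
}
```

For the eight sentences above, a verse size of 4 yields two verses and a verse size of 8 yields one, so the parameter controls how much text each training example contains.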
76
Spark ML Pipeline
Stages (each passes a Dataset to the next): Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse] -> Word2Vec [Vector size]
77
Word2Vec (sentencesInVerse = 4)
[ , , ... ] feature1 [ , , ... ] feature2
78
Word2Vec (sentencesInVerse = 8)
[ , , , , ... , , ] feature1
79
Spark ML Pipeline
Stages (each passes a Dataset to the next): Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse] -> Word2Vec [Vector size] -> Logistic Regression [Max iterations, Reg parameter]
80
Logistic Regression
Probability: [0.9212126972383768, ]
Prediction: 0.0
81
Spark ML Pipeline
Stages (each passes a Dataset to the next): Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse] -> Word2Vec [Vector size] -> Logistic Regression [Max iterations, Reg parameter]
Cross Validator -> Model
82
CV Average Metrics [ , , , , ... , ]
83
Spark ML Pipeline
Stages (each passes a Dataset to the next): Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse] -> Word2Vec [Vector size] -> Logistic Regression [Max iterations, Reg parameter]
Cross Validator -> Model
84
Demo Time
85
Summary
86
Summary
ML is not as complex as it seems from an applied perspective
Existing libraries and frameworks reduce a lot of tedious work
For instance, Spark MLlib can help to build nice ML pipelines
87
Thank you! @tmatyashovsky
Design by @LejlekF