Introduction to ML with Apache Spark MLlib #javaone

@tmatyashovsky

I am not a data science engineer

Motivation

Given a verse from a song's lyrics, recognize its genre

Pop vs. Heavy Metal: “I'm a rolling thunder, a pouring rain / I'm comin' on like a hurricane / My lightning's flashing across the sky / You're only young but you're gonna die / I won't take no prisoners, won't spare no lives / Nobody's putting up a fight / I got my bell, I'm gonna take you to hell / I'm gonna get you, Satan get you”

Ideas?

Ideas: Look for particular words like “fear”, “fight”, “kill”, “devil”, “death”, etc.? Count the length of a verse? Count the unique words in a verse?

Machine Learning?

Machine Learning in 15-20 mins

Machine Learning is the study of computer algorithms that improve automatically through experience

Supervised learning, unsupervised learning, reinforcement learning

Supervised Learning

Speakers’ Feedback Dataset
Fields: date & time, conference name, speaker, talk name, track, duration, type, overall impression, overall rating, number of slides, time spent on live coding, number of jokes, etc.

Terminology: features, target variable, training example, training set, learning algorithms, hypotheses, cost function

Regression: predict the speaker’s rating (a score) from the number of jokes during a talk.
[Plot: speaker’s rating vs. number of jokes during a talk]

Linear Regression

OMG, Math at a conference?

Linear Regression

No magic, it’s just math
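To make the "just math" concrete, here is a minimal sketch of single-feature linear regression trained with batch gradient descent. The class name, the made-up jokes/rating numbers, the learning rate and the iteration count are all assumptions for illustration; they are not taken from the talk or from Spark MLlib.

```java
import java.util.Locale;

final class LinearRegressionSketch {
    // Hypothesis: h(x) = theta0 + theta1 * x
    static double predict(double theta0, double theta1, double x) {
        return theta0 + theta1 * x;
    }

    // Cost: J = 1 / (2m) * sum((h(x_i) - y_i)^2)
    static double cost(double theta0, double theta1, double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double error = predict(theta0, theta1, x[i]) - y[i];
            sum += error * error;
        }
        return sum / (2.0 * x.length);
    }

    // Batch gradient descent; returns {theta0, theta1}.
    static double[] fit(double[] x, double[] y, double learningRate, int iterations) {
        double theta0 = 0.0;
        double theta1 = 0.0;
        int m = x.length;
        for (int it = 0; it < iterations; it++) {
            double grad0 = 0.0;
            double grad1 = 0.0;
            for (int i = 0; i < m; i++) {
                double error = predict(theta0, theta1, x[i]) - y[i];
                grad0 += error;
                grad1 += error * x[i];
            }
            theta0 -= learningRate * grad0 / m;
            theta1 -= learningRate * grad1 / m;
        }
        return new double[] {theta0, theta1};
    }

    public static void main(String[] args) {
        // Made-up data: rating = 1 + 2 * jokes
        double[] jokes = {0, 1, 2, 3, 4};
        double[] rating = {1, 3, 5, 7, 9};
        double[] theta = fit(jokes, rating, 0.05, 5000);
        System.out.printf(Locale.ROOT, "theta0=%.3f theta1=%.3f cost=%.6f%n",
                theta[0], theta[1], cost(theta[0], theta[1], jokes, rating));
    }
}
```

In practice a library such as Spark MLlib does this fitting for you; the sketch only shows what is being minimized.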

Regression

Classification: predict whether the audience liked the speaker or not (positive or negative impression) from the quantity of jokes used.
[Plot: impression (positive / negative) vs. number of jokes during a talk]

Sigmoid (Logistic) Function
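As a small illustration, the sigmoid function maps any real number into the interval (0, 1), which is why its output can be read as a probability. The class name below is hypothetical.

```java
final class SigmoidSketch {
    // sigmoid(z) = 1 / (1 + e^(-z))
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        for (double z : new double[] {-10.0, -1.0, 0.0, 1.0, 10.0}) {
            System.out.println("sigmoid(" + z + ") = " + sigmoid(z));
        }
    }
}
```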

Logistic Regression

No magic, it’s just math
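A rough sketch of the math behind logistic regression for a single feature: the hypothesis is the sigmoid of a linear function, and the parameters are fitted by gradient descent on the log loss. The data (number of jokes, liked or not), the learning rate and the iteration count are made up for illustration, and the class is not part of any library.

```java
final class LogisticRegressionSketch {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Probability of the positive class: h(x) = sigmoid(theta0 + theta1 * x)
    static double predictProbability(double[] theta, double x) {
        return sigmoid(theta[0] + theta[1] * x);
    }

    // Batch gradient descent on the log loss; returns {theta0, theta1}.
    static double[] fit(double[] x, int[] y, double learningRate, int iterations) {
        double[] theta = {0.0, 0.0};
        int m = x.length;
        for (int it = 0; it < iterations; it++) {
            double grad0 = 0.0;
            double grad1 = 0.0;
            for (int i = 0; i < m; i++) {
                double error = predictProbability(theta, x[i]) - y[i];
                grad0 += error;
                grad1 += error * x[i];
            }
            theta[0] -= learningRate * grad0 / m;
            theta[1] -= learningRate * grad1 / m;
        }
        return theta;
    }

    public static void main(String[] args) {
        // Made-up data: few jokes -> not liked (0), many jokes -> liked (1)
        double[] jokes = {0, 1, 2, 3, 6, 7, 8, 9};
        int[] liked = {0, 0, 0, 0, 1, 1, 1, 1};
        double[] theta = fit(jokes, liked, 0.1, 5000);
        System.out.println("P(liked | 1 joke)  = " + predictProbability(theta, 1.0));
        System.out.println("P(liked | 8 jokes) = " + predictProbability(theta, 8.0));
    }
}
```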

Unsupervised Learning

Clustering with a chosen number of clusters, e.g. K = 2 or K = 5.
[Plot: number of jokes during a talk vs. time (min.) spent on live coding]

K-Means
Initialize cluster centroids.
Assign (index) each example to the cluster centroid closest to it.
Recalculate (move) each centroid as the average (mean) of the examples assigned to its cluster.
Repeat until the centroids no longer move.
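The K-Means steps can be sketched in plain Java as follows. This is only an illustration: the initial centroids are passed in explicitly (real implementations usually pick them randomly or with a smarter seeding strategy), and the sample points are made up.

```java
import java.util.Arrays;

final class KMeansSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int closestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        for (int k = 1; k < centroids.length; k++) {
            if (squaredDistance(point, centroids[k]) < squaredDistance(point, centroids[best])) {
                best = k;
            }
        }
        return best;
    }

    static double[][] fit(double[][] points, double[][] initialCentroids, int maxIterations) {
        int k = initialCentroids.length;
        int dims = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initialCentroids[c].clone();
        }
        for (int it = 0; it < maxIterations; it++) {
            // Assignment step: add each point to the sum of its closest centroid.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = closestCentroid(p, centroids);
                counts[c]++;
                for (int d = 0; d < dims; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move each centroid to the mean of its points.
            boolean moved = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its old centroid
                }
                for (int d = 0; d < dims; d++) {
                    sums[c][d] /= counts[c];
                }
                if (!Arrays.equals(sums[c], centroids[c])) {
                    centroids[c] = sums[c];
                    moved = true;
                }
            }
            if (!moved) {
                break; // centroids no longer move
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Made-up points: {number of jokes, minutes of live coding}
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {9, 8}, {8, 9}};
        double[][] centroids = fit(points, new double[][] {{1, 1}, {8, 8}}, 100);
        System.out.println(Arrays.deepToString(centroids));
    }
}
```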

K-Means

ML-based Solution

Pop vs. Heavy Metal
Collect a data set of lyrics:
Pop: ABBA, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
Heavy metal: Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
Create a training set, i.e. label (0|1) + features.
Train logistic regression (or another classification algorithm).

https://github.com/tmatyashovsky/spark-ml-samples

Feature Extraction

Options: Bag of Words, TF-IDF, Word2Vec, GloVe
Bag of words: each single word is a one-hot encoded vector whose size equals the size of the dictionary. As a result, you get a lot of sparse vectors.
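A small sketch of what one-hot and bag-of-words vectors look like. The dictionary is built from a made-up word list, and dense arrays are used only for readability; a real implementation would use sparse vectors, since most entries are zero.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BagOfWordsSketch {
    // Assigns each distinct word an index in order of first appearance.
    static Map<String, Integer> buildDictionary(List<String> words) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (String word : words) {
            if (!dictionary.containsKey(word)) {
                dictionary.put(word, dictionary.size());
            }
        }
        return dictionary;
    }

    // One-hot vector: all zeros except a 1 at the word's index.
    // A word missing from the dictionary yields an all-zero vector.
    static double[] oneHot(String word, Map<String, Integer> dictionary) {
        double[] vector = new double[dictionary.size()];
        Integer index = dictionary.get(word);
        if (index != null) {
            vector[index] = 1.0;
        }
        return vector;
    }

    // Bag of words: the sum of the one-hot vectors, i.e. word counts.
    static double[] bagOfWords(List<String> words, Map<String, Integer> dictionary) {
        double[] vector = new double[dictionary.size()];
        for (String word : words) {
            Integer index = dictionary.get(word);
            if (index != null) {
                vector[index] += 1.0;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary =
                buildDictionary(Arrays.asList("highway", "to", "hell", "to"));
        System.out.println(Arrays.toString(oneHot("hell", dictionary)));
        System.out.println(Arrays.toString(
                bagOfWords(Arrays.asList("to", "hell", "to"), dictionary)));
    }
}
```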

Word2Vec
Produces a unique fixed-size dense vector for each word.
Captures semantic and morphological similarity, so similar words are close in the vector space.
Behind the scenes it is a two-layer neural net that processes text. Similar words end up clustered together in the high-dimensional space.

Word2Vec: similar scores (cos ~ 1), unrelated scores (cos ~ 0), opposite scores (cos ~ -1). Example words on the slide: Open, Closed, Conference, Metal.
If two words are very close to synonymous, you’d expect them to show up in similar contexts, and indeed synonymous words tend to be close. For two completely random words, the similarity is pretty close to 0. On the opposite side there is usually not an antonym, just noise. The Google News Negative 300 vectors were used.
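For reference, cosine similarity between two word vectors can be computed as below. The vectors in `main` are tiny made-up examples, not real Word2Vec embeddings, and returning 0 for a zero-length vector is a choice made for this sketch.

```java
final class CosineSimilaritySketch {
    // cos(a, b) = (a . b) / (|a| * |b|), a value in [-1, 1]
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same size");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // sketch-specific choice for zero vectors
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine(new double[] {1, 2}, new double[] {2, 4}));   // same direction
        System.out.println(cosine(new double[] {1, 0}, new double[] {0, 1}));   // orthogonal
        System.out.println(cosine(new double[] {1, 2}, new double[] {-1, -2})); // opposite direction
    }
}
```

Cosine distance, as used when comparing verses, is commonly defined as 1 minus this similarity.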

“Love you” similarity (my corpus words): cosine distance to each verse
baby one more time
crazy for you
show me the meaning of being lonely
highway to hell
kill them all

https://github.com/tmatyashovsky/spark-ml-samples

The Best Model?

Model selection: hyperparameter tuning

Evaluating Hypothesis
Under-fitting (high bias), appropriate fitting, over-fitting (high variance)

K-fold Cross Validation (K = 3)
Split 1: training set (66.6%), test set (33.3%)
Split 2: test set (33.3%), training set (66.6%)
Split 3: training set (33.3%), test set (33.3%), training set (33.3%)
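A minimal sketch of how K-fold splitting partitions the data: every example lands in the test set exactly once and in the training set K - 1 times. Assigning example i to fold i % K is an assumption made to keep the sketch deterministic; libraries typically shuffle the data randomly first.

```java
import java.util.ArrayList;
import java.util.List;

final class KFoldSketch {
    // Indices of the examples used for testing in the given fold.
    static List<Integer> testIndices(int n, int k, int fold) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (i % k == fold) {
                indices.add(i);
            }
        }
        return indices;
    }

    // Indices of the examples used for training in the given fold.
    static List<Integer> trainingIndices(int n, int k, int fold) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (i % k != fold) {
                indices.add(i);
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        int n = 6;
        int k = 3;
        for (int fold = 0; fold < k; fold++) {
            System.out.println("fold " + fold
                    + " train=" + trainingIndices(n, k, fold)
                    + " test=" + testIndices(n, k, fold));
        }
    }
}
```

The model is trained and evaluated once per fold, and the K metric values are averaged.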

Phew, enough of theory!

Practice using Java: Let’s finally move on to the implementation, using a library or framework that helps us avoid tedious transformations and provides algorithms as well as feature extractors out of the box.

Weka, Encog, Aerosolve, FlinkML

Speed, generality, cloud computing, data processing, ease of use

Component Stack

Spark MLlib is a library of ML algorithms and utilities designed to run in parallel on a Spark cluster

MLlib Design & Philosophy
Introduces a few new data types, e.g. vector (dense and sparse), labeled point, rating, etc.
Allows invoking various algorithms on distributed datasets (RDD/Dataset).

spark.mllib: built on top of RDDs
spark.ml: built on top of Datasets

spark.mllib Features
Utilities: linear algebra, statistics, etc.
Feature extraction, feature transformation, etc.
Regression
Classification
Clustering
Collaborative filtering, e.g. alternating least squares
Dimensionality reduction
And many more

spark.ml Features
“All” spark.mllib features plus:
Pipelines
Persistence
Model selection and tuning: train-validation split, K-fold cross validation

[Diagram: raw data passes as Datasets through Transformer [parameters] and Estimator [parameters] stages; a Cross Validator is configured with [pipeline, evaluator, parameters]]

Pop vs. Heavy Metal Using Spark MLlib Pipeline

Spark ML Pipeline: Lyrics

Raw Unknown Lyrics: I'm a rolling thunder, a pouring rain / I'm comin' on like a hurricane / My lightning's flashing across the sky / You're only young but you're gonna die / I won't take no prisoners, won't spare no lives / Nobody's putting up a fight / I got my bell, I'm gonna take you to hell / I'm gonna get you, Satan get you

Spark ML Pipeline: Lyrics → Cleanser

Cleanser: I'm a rolling thunder, a pouring rain / I'm comin' on like a hurricane / My lightning's flashing across the sky / You're only young but you're gonna die / I won't take no prisoners, won't spare no lives / Nobody's putting up a fight / I got my bell, I'm gonna take you to hell / I'm gonna get you, Satan get you

Spark ML Pipeline: Lyrics → Cleanser → Numerator

Numerator
1 Im a rolling thunder a pouring rain
2 Im comin on like a hurricane
3 My lightnings flashing across the sky
4 Youre only young but youre gonna die
5 I wont take no prisoners wont spare no lives
6 Nobodys putting up a fight
7 I got my bell Im gonna take you to hell
8 Im gonna get you Satan get you

Spark ML Pipeline: Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover

Stop Words Remover
1 im a rolling thunder a pouring rain
2 im comin on like a hurricane
3 my lightnings flashing across the sky
4 youre only young but youre gonna die
5 i wont take no prisoners wont spare no lives
6 nobodys putting up a fight
7 i got my bell im gonna take you to hell
8 im gonna get you satan get you
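The first text-processing stages can be approximated in plain Java as below. This is a hedged sketch, not the code from the talk's repository: the class, the regular expression and the tiny stop word list are assumptions for illustration. Spark's own `Tokenizer` lower-cases and splits on whitespace, and its `StopWordsRemover` ships a much longer default English list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class LyricsPreprocessingSketch {
    // Tiny illustrative stop word list (an assumption, far from complete).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "the", "on", "my", "i", "you", "to", "but", "no", "up", "only"));

    // Cleanser: drop everything except letters and whitespace.
    static String cleanse(String line) {
        return line.replaceAll("[^\\p{L}\\s]", "").trim().replaceAll("\\s+", " ");
    }

    // Tokenizer: lower-case and split on whitespace.
    static List<String> tokenize(String line) {
        String normalized = line.toLowerCase(Locale.ROOT).trim();
        if (normalized.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(normalized.split("\\s+")));
    }

    // Stop words remover: keep only tokens that are not in the stop word list.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String raw = "I'm a rolling thunder, a pouring rain";
        String cleansed = cleanse(raw);
        List<String> tokens = tokenize(cleansed);
        System.out.println(cleansed);
        System.out.println(tokens);
        System.out.println(removeStopWords(tokens));
    }
}
```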

Spark ML Pipeline: Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover → Exploder → Stemmer → Uniter

Stemmer
1 im rolling thunder pouring rain
2 im comin like hurricane
3 lightnings flashing across sky
4 youre young youre gonna die
5 wont take prisoners wont spare lives
6 nobodys putting fight
7 got bell im gonna take hell
8 im gonna get satan get

Spark ML Pipeline: Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover → Exploder → Stemmer → Uniter → Verser [sentences in verse]

Verser (sentencesInVerse = 4)
verse1:
1 im roll thunder pour rain
2 im comin like hurrican
3 lightn flash across sky
4 your young your gonna die
verse2:
5 wont take prison wont spare live
6 nobodi put fight
7 got bell im gonna take hell
8 im gonna get satan get

Verser (sentencesInVerse = 8)
verse1:
1 im roll thunder pour rain
2 im comin like hurrican
3 lightn flash across sky
4 your young your gonna die
5 wont take prison wont spare live
6 nobodi put fight
7 got bell im gonna take hell
8 im gonna get satan get
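The grouping done by the Verser can be sketched as simple chunking of numbered sentences. The class and method names are hypothetical, and keeping a shorter last verse when the sentence count is not a multiple of `sentencesInVerse` is an assumption of this sketch, not necessarily what the talk's implementation does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class VerserSketch {
    // Groups consecutive sentences into verses of the given size.
    static List<List<String>> toVerses(List<String> sentences, int sentencesInVerse) {
        if (sentencesInVerse <= 0) {
            throw new IllegalArgumentException("sentencesInVerse must be positive");
        }
        List<List<String>> verses = new ArrayList<>();
        for (int start = 0; start < sentences.size(); start += sentencesInVerse) {
            int end = Math.min(start + sentencesInVerse, sentences.size());
            verses.add(new ArrayList<>(sentences.subList(start, end)));
        }
        return verses;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "im roll thunder pour rain",
                "im comin like hurrican",
                "lightn flash across sky",
                "your young your gonna die",
                "wont take prison wont spare live",
                "nobodi put fight",
                "got bell im gonna take hell",
                "im gonna get satan get");
        System.out.println(toVerses(sentences, 4).size() + " verses of 4 sentences");
        System.out.println(toVerses(sentences, 8).size() + " verse of 8 sentences");
    }
}
```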

Spark ML Pipeline: Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover → Exploder → Stemmer → Uniter → Verser [sentences in verse] → Word2Vec [vector size]

Word2Vec (sentencesInVerse = 4): one feature vector per verse (feature1, feature2)

Word2Vec (sentencesInVerse = 8): a single feature vector for the whole verse (feature1)
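Spark's Word2Vec model turns a whole document (here, a verse) into one vector by averaging the vectors of its words. The sketch below shows that averaging on made-up two-dimensional vectors; the class name is hypothetical, and skipping words without a vector is a simplification that may differ from how Spark treats out-of-vocabulary words.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class VerseVectorSketch {
    // Averages the vectors of all known words in the verse.
    static double[] average(List<String> words, Map<String, double[]> wordVectors, int vectorSize) {
        double[] result = new double[vectorSize];
        int found = 0;
        for (String word : words) {
            double[] vector = wordVectors.get(word);
            if (vector == null) {
                continue; // sketch-specific: ignore words without a vector
            }
            for (int d = 0; d < vectorSize; d++) {
                result[d] += vector[d];
            }
            found++;
        }
        if (found > 0) {
            for (int d = 0; d < vectorSize; d++) {
                result[d] /= found;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up 2-dimensional word vectors, for illustration only.
        Map<String, double[]> wordVectors = new HashMap<>();
        wordVectors.put("hell", new double[] {1.0, 0.0});
        wordVectors.put("satan", new double[] {3.0, 2.0});
        double[] verseVector = average(Arrays.asList("hell", "satan"), wordVectors, 2);
        System.out.println(Arrays.toString(verseVector));
    }
}
```

The resulting fixed-size vector is what the classifier receives as the features of a verse.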

Spark ML Pipeline: Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover → Exploder → Stemmer → Uniter → Verser [sentences in verse] → Word2Vec [vector size] → Logistic Regression [max iterations, reg parameter]

Logistic Regression
Probability: [0.9212126972383768, ...]
Prediction: 0.0

Spark ML Pipeline: Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover → Exploder → Stemmer → Uniter → Verser [sentences in verse] → Word2Vec [vector size] → Logistic Regression [max iterations, reg parameter], all wrapped in a Cross Validator that produces the Model

CV Average Metrics: one averaged metric value per evaluated parameter combination

Demo Time

Summary

Summary
ML is not as complex as it seems from an applied perspective.
Existing libraries and frameworks reduce a lot of tedious work.
For instance, Spark MLlib can help to build nice ML pipelines.

Thank you! @tmatyashovsky
Design by @LejlekF

References: Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia
