Introduction to ML with Apache Spark MLlib #javaone
@tmatyashovsky https://ua.linkedin.com/in/tarasmatyashovsky
I am not a data science engineer
Motivation
Given a verse from lyrics, recognize the genre
Pop vs. Heavy Metal “I'm a rolling thunder, a pouring rain I'm comin' on like a hurricane My lightning's flashing across the sky You're only young but you're gonna die I won't take no prisoners, won't spare no lives Nobody's putting up a fight I got my bell, I'm gonna take you to hell I'm gonna get you, Satan get you” https://github.com/tmatyashovsky/spark-ml-samples
Ideas?
Ideas
Look for particular words like “fear”, “fight”, “kill”, “devil”, “death”, etc.?
Count the length of a verse?
Count unique words in a verse?
Machine Learning?
Machine Learning in 15-20 mins
Machine Learning is the study of computer algorithms that improve automatically through experience
Supervised learning Unsupervised learning Reinforcement learning
Supervised Learning
Speakers’ Feedback Dataset Date & time Overall impression Conference name Overall rating Speaker Number of slides Talk name Time spent on live coding Track Number of jokes Duration Etc. Type
Features: Target variable: Training example: Training set: Learning algorithms Hypotheses: Cost function:
http://www.slideshare.net/liweiyang5/spark-mllib-training-material
Regression
[Plot: speaker’s rating (score of the speaker based on xxx) vs. number of jokes during a talk]
Linear Regression
OMG, Math at a conference?
Linear Regression
No magic, it’s just math
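The math behind the linear regression slides can be sketched in a few lines of plain Java. This is a minimal sketch, not Spark MLlib code: it assumes a single feature (number of jokes) and fits the line with the closed-form least-squares solution, which minimizes the squared-error cost. The class and method names are hypothetical.

```java
final class LinearRegressionSketch {
    /** Fits y = intercept + slope * x by ordinary least squares; returns {intercept, slope}. */
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double covXY = 0, varX = 0;
        for (int i = 0; i < n; i++) {
            covXY += (x[i] - meanX) * (y[i] - meanY);
            varX += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = covXY / varX; // assumes the x values are not all identical
        return new double[] {meanY - slope * meanX, slope};
    }

    /** Hypothesis: predicted rating for a given number of jokes. */
    static double predict(double[] model, double x) {
        return model[0] + model[1] * x;
    }
}
```

Calling `fit` on the training set and then `predict` on a new number of jokes gives the estimated rating.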
Regression
Classification
[Plot: positive vs. negative impression (liked or not liked the speaker) against the number of jokes during a talk]
Sigmoid (Logistic) Function
Logistic Regression
No magic, it’s just math
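The sigmoid function and logistic regression from the slides above can be sketched as follows. This is a plain-Java illustration under stated assumptions, not Spark MLlib code: one feature, batch gradient descent on the log-loss, no regularization. The names, learning rate and iteration count are hypothetical choices.

```java
final class LogisticRegressionSketch {
    /** Sigmoid (logistic) function: maps any real number into (0, 1). */
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Trains {bias, weight} with batch gradient descent on the log-loss. */
    static double[] train(double[] x, int[] label, double learningRate, int iterations) {
        double bias = 0, weight = 0;
        for (int it = 0; it < iterations; it++) {
            double gradBias = 0, gradWeight = 0;
            for (int i = 0; i < x.length; i++) {
                double error = sigmoid(bias + weight * x[i]) - label[i];
                gradBias += error;
                gradWeight += error * x[i];
            }
            bias -= learningRate * gradBias / x.length;
            weight -= learningRate * gradWeight / x.length;
        }
        return new double[] {bias, weight};
    }

    /** Probability that an example with feature x belongs to class 1. */
    static double probability(double[] model, double x) {
        return sigmoid(model[0] + model[1] * x);
    }
}
```

A probability above 0.5 would be read as class 1 (for example, a positive impression), below 0.5 as class 0.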
Unsupervised Learning
Clustering
Number of clusters: K = 2, K = 5
[Plot: number of jokes during a talk vs. time (min.) spent on live coding]
K-Means
Initialize cluster centroids
Assign each example to the cluster centroid closest to it
Recalculate (move) each centroid as the average (mean) of the examples assigned to its cluster
Repeat until the centroids no longer move
K-Means
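The K-Means steps can be sketched in plain Java as below. This is a minimal illustration, not the Spark MLlib implementation: the initial centroids are passed in by the caller (real implementations usually pick them randomly or with a smarter seeding), and the class name and iteration cap are hypothetical.

```java
final class KMeansSketch {
    /** Runs k-means from the given initial centroids and returns the final centroids. */
    static double[][] cluster(double[][] points, double[][] initialCentroids, int maxIterations) {
        int k = initialCentroids.length;
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initialCentroids[c].clone();
        }
        for (int it = 0; it < maxIterations; it++) {
            // Assignment step: add each point to the sum of its closest centroid.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int nearest = nearest(p, centroids);
                counts[nearest]++;
                for (int d = 0; d < dim; d++) {
                    sums[nearest][d] += p[d];
                }
            }
            // Update step: move each centroid to the mean of its assigned points.
            boolean moved = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // an empty cluster keeps its centroid
                for (int d = 0; d < dim; d++) {
                    double mean = sums[c][d] / counts[c];
                    if (mean != centroids[c][d]) moved = true;
                    centroids[c][d] = mean;
                }
            }
            if (!moved) break; // centroids no longer move
        }
        return centroids;
    }

    /** Index of the centroid with the smallest squared Euclidean distance to p. */
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double distance = 0;
            for (int d = 0; d < p.length; d++) {
                distance += (p[d] - centroids[c][d]) * (p[d] - centroids[c][d]);
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }
}
```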
ML-based Solution
Pop vs. Heavy Metal
Collect a data set of lyrics:
Pop: Abba, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
Heavy metal: Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
Create a training set, i.e. label (0|1) + features
Train logistic regression (or another classification algorithm)
Feature Extraction
Bag of Words, TF-IDF, Word2Vec, GloVe
Bag of words: a single word is a one-hot encoded vector with the size of the dictionary. As a result, we get a lot of sparse vectors.
http://spark.apache.org/docs/latest/ml-features.html#feature-extractors
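To make the bag-of-words idea concrete, here is a small plain-Java sketch. It is an illustration only, not the Spark feature extractor: the dictionary is simply the sorted set of distinct words, and all names are hypothetical. It shows why the vectors are sparse: each word sets a single position out of the whole dictionary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class BagOfWordsSketch {
    /** Builds a sorted dictionary of the distinct words. */
    static List<String> dictionary(List<String> words) {
        return new ArrayList<>(new TreeSet<>(words));
    }

    /** One-hot vector: all zeros except a 1 at the word's dictionary index. */
    static int[] oneHot(List<String> dictionary, String word) {
        int[] vector = new int[dictionary.size()];
        int index = dictionary.indexOf(word);
        if (index >= 0) vector[index] = 1; // unknown words stay all-zero
        return vector;
    }

    /** Bag of words for a verse: the sum of the one-hot vectors of its words. */
    static int[] bagOfWords(List<String> dictionary, List<String> verse) {
        int[] counts = new int[dictionary.size()];
        for (String word : verse) {
            int index = dictionary.indexOf(word);
            if (index >= 0) counts[index]++;
        }
        return counts;
    }
}
```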
Word2Vec
Produces unique fixed-size dense vectors
Behind the scenes: a two-layer neural net that processes text
Captures semantic and morphologic similarity, so similar words are close together (clustered) in the high-dimensional vector space
https://code.google.com/archive/p/word2vec/
Word2Vec
Similar scores (cos ~ 1), unrelated scores (cos ~ 0), opposite scores (cos ~ -1)
[Figure: angles between word vectors, with the example words Open, Closed, Conference and Metal]
If two words are very close to synonymous, you’d expect them to show up in similar contexts, and indeed synonymous words tend to be close. For two completely random words, the similarity is pretty close to 0. On the opposite side there is usually not an antonym, just noise.
Used Google News Negative 300.
http://bionlp-www.utu.fi/wv_demo/
http://blog.christianperone.com/wp-content/uploads/2013/09/cosinesimilarityfq1.png
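The cosine score used to compare word vectors is the dot product divided by the product of the vector lengths. A minimal plain-Java sketch follows; the class name is hypothetical and the vectors in any usage are toy values, not real Word2Vec output.

```java
final class CosineSimilaritySketch {
    /** Cosine of the angle between a and b: 1 = same direction, 0 = orthogonal, -1 = opposite. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB)); // assumes non-zero vectors
    }
}
```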
“Love you” Similarity
Verse                                  Cosine similarity
baby one more time                      0.482028
crazy for you                           0.437875
show me the meaning of being lonely     0.258147
highway to hell                        -0.1120049
kill them all                          -0.231876
My corpus: 8316 words
The Best Model?
Model Selection: hyperparameter tuning
Evaluating Hypothesis
Under-fitting (high bias), appropriate fitting, over-fitting (high variance)
http://mlwiki.org/index.php/Overfitting
K-folds Cross Validation (K = 3)
Iteration 1: training set (66.6%), test set (33.3%)
Iteration 2: test set (33.3%), training set (66.6%)
Iteration 3: training set (33.3%), test set (33.3%), training set (33.3%)
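The K-folds split can be sketched in plain Java as below. This is an illustration under a simplifying assumption: examples are assigned to folds round-robin by index, whereas real implementations typically shuffle the data first. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class KFoldSketch {
    /** For each of the k folds, the example indices that serve as its test set. */
    static List<List<Integer>> testFolds(int examples, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int fold = 0; fold < k; fold++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < examples; i++) {
            folds.get(i % k).add(i);
        }
        return folds;
    }

    /** The training indices for one fold: everything that is not in its test set. */
    static List<Integer> trainingIndices(int examples, int k, int fold) {
        List<Integer> training = new ArrayList<>();
        for (int i = 0; i < examples; i++) {
            if (i % k != fold) training.add(i);
        }
        return training;
    }
}
```

With K = 3, each example is used for testing exactly once and for training twice; the metric is then averaged over the three runs.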
Phew, enough of theory!
Practice using Java
Let’s finally move on to the implementation, using a library or framework that spares us tedious transformations and provides algorithms and feature extractors out of the box.
Weka Encog Aerosolve FlinkML https://github.com/josephmisiti/awesome-machine-learning
Speed, generality, cloud computing, data processing, ease of use
Component Stack https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
Spark MLlib is a library of ML algorithms and utilities designed to run in parallel on a Spark cluster
MLlib Design & Philosophy
Introduces a few new data types, e.g. vector (dense and sparse), labeled point, rating, etc.
Allows invoking various algorithms on distributed datasets (RDD/Dataset)
http://spark.apache.org/docs/latest/mllib-guide.html
spark.mllib: built on top of RDDs
spark.ml: built on top of Datasets
http://spark.apache.org/docs/latest/mllib-guide.html
spark.mllib Features
Utilities: linear algebra, statistics, etc.
Feature extraction, feature transformation, etc.
Regression
Classification
Clustering
Collaborative filtering, e.g. alternating least squares
Dimensionality reduction
And many more
http://spark.apache.org/docs/latest/mllib-guide.html
spark.ml Features
“All” spark.mllib features plus:
Pipelines
Persistence
Model selection and tuning: train-validation split, K-folds cross validation
http://spark.apache.org/docs/latest/ml-guide.html
[Diagram: raw data flows as Datasets through Transformers [parameters] and an Estimator [parameters]; a Cross Validator takes [pipeline, evaluator, parameters]]
http://spark.apache.org/docs/latest/ml-pipeline.html
Pop vs. Heavy Metal Using Spark MLlib Pipeline
Spark ML Pipeline
Lyrics
Raw Unknown Lyrics
I'm a rolling thunder, a pouring rain
I'm comin' on like a hurricane
My lightning's flashing across the sky
You're only young but you're gonna die
I won't take no prisoners, won't spare no lives
Nobody's putting up a fight
I got my bell, I'm gonna take you to hell
I'm gonna get you, Satan get you
Spark ML Pipeline (a Dataset is passed between stages)
Lyrics -> Cleanser
Cleanser
I'm a rolling thunder, a pouring rain
I'm comin' on like a hurricane
My lightning's flashing across the sky
You're only young but you're gonna die
I won't take no prisoners, won't spare no lives
Nobody's putting up a fight
I got my bell, I'm gonna take you to hell
I'm gonna get you, Satan get you
Spark ML Pipeline (a Dataset is passed between stages)
Lyrics -> Cleanser -> Numerator
Numerator
1 Im a rolling thunder a pouring rain
2 Im comin on like a hurricane
3 My lightnings flashing across the sky
4 Youre only young but youre gonna die
5 I wont take no prisoners wont spare no lives
6 Nobodys putting up a fight
7 I got my bell Im gonna take you to hell
8 Im gonna get you Satan get you
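The first two stages can be approximated in plain Java as below. This is only a sketch of the idea, not the transformers from the sample repository: it assumes the Cleanser drops every character that is neither a letter nor whitespace, and that the Numerator assigns consecutive row numbers to the non-empty lines. Both the regular expressions and the names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CleanserNumeratorSketch {
    /** Removes punctuation and digits, then collapses repeated whitespace. */
    static String cleanse(String line) {
        return line.replaceAll("[^\\p{L}\\s]", "").replaceAll("\\s+", " ").trim();
    }

    /** Cleanses each line and numbers the non-empty results starting from 1. */
    static Map<Integer, String> numerate(List<String> lines) {
        Map<Integer, String> numbered = new LinkedHashMap<>();
        int rowNumber = 1;
        for (String line : lines) {
            String cleaned = cleanse(line);
            if (!cleaned.isEmpty()) {
                numbered.put(rowNumber++, cleaned);
            }
        }
        return numbered;
    }
}
```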
Spark ML Pipeline (a Dataset is passed between stages)
Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover
Stop Words Remover
1 im a rolling thunder a pouring rain
2 im comin on like a hurricane
3 my lightnings flashing across the sky
4 youre only young but youre gonna die
5 i wont take no prisoners wont spare no lives
6 nobodys putting up a fight
7 i got my bell im gonna take you to hell
8 im gonna get you satan get you
Spark ML Pipeline (a Dataset is passed between stages)
Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter
Stemmer
1 im rolling thunder pouring rain
2 im comin like hurricane
3 lightnings flashing across sky
4 youre young youre gonna die
5 wont take prisoners wont spare lives
6 nobodys putting fight
7 got bell im gonna take hell
8 im gonna get satan get
Spark ML Pipeline (a Dataset is passed between stages)
Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse]
Verser (sentencesInVerse = 4)
verse1:
1 im roll thunder pour rain
2 im comin like hurrican
3 lightn flash across sky
4 your young your gonna die
verse2:
5 wont take prison wont spare live
6 nobodi put fight
7 got bell im gonna take hell
8 im gonna get satan get
Verser (sentencesInVerse = 8)
verse1:
1 im roll thunder pour rain
2 im comin like hurrican
3 lightn flash across sky
4 your young your gonna die
5 wont take prison wont spare live
6 nobodi put fight
7 got bell im gonna take hell
8 im gonna get satan get
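The effect of the sentencesInVerse parameter can be sketched in plain Java as below. This is a hypothetical stand-in for the Verser stage, not the code from the sample repository: it simply joins every group of consecutive sentences into one verse, with a shorter last verse if the count does not divide evenly.

```java
import java.util.ArrayList;
import java.util.List;

final class VerserSketch {
    /** Groups consecutive sentences into verses of at most sentencesInVerse sentences. */
    static List<String> toVerses(List<String> sentences, int sentencesInVerse) {
        List<String> verses = new ArrayList<>();
        for (int start = 0; start < sentences.size(); start += sentencesInVerse) {
            int end = Math.min(start + sentencesInVerse, sentences.size());
            verses.add(String.join(" ", sentences.subList(start, end)));
        }
        return verses;
    }
}
```

With the eight sentences above, a value of 4 yields two verses and a value of 8 yields one.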
Spark ML Pipeline (a Dataset is passed between stages)
Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse] -> Word2Vec [Vector size]
Word2Vec (sentencesInVerse = 4)
feature1: [0.036463763926011056, -0.013076733228398295, ... 0.03816963326281462]
feature2: [-0.013962931134021625, 0.049275818325650804, ... -0.058982484615766086]
Word2Vec (sentencesInVerse = 8)
feature1: [0.036463763926011056, -0.013076733228398295, 0.044362547532774695, 0.03816963326281462, ... -0.013962931134021625, 0.049275818325650804, -0.058982484615766086]
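Spark's Word2Vec model turns a whole document (here, a verse) into one feature vector by averaging the vectors of its words. A plain-Java sketch of that averaging step follows; the word vectors are supplied by the caller as a map, and the class name and toy values in any usage are hypothetical.

```java
import java.util.List;
import java.util.Map;

final class VerseVectorSketch {
    /** Averages the vectors of the verse's words; unknown words add nothing to the sum. */
    static double[] average(List<String> words, Map<String, double[]> wordVectors, int vectorSize) {
        double[] result = new double[vectorSize];
        if (words.isEmpty()) {
            return result;
        }
        for (String word : words) {
            double[] vector = wordVectors.get(word);
            if (vector == null) continue;
            for (int d = 0; d < vectorSize; d++) {
                result[d] += vector[d];
            }
        }
        for (int d = 0; d < vectorSize; d++) {
            result[d] /= words.size();
        }
        return result;
    }
}
```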
Spark ML Pipeline (a Dataset is passed between stages)
Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse] -> Word2Vec [Vector size] -> Logistic Regression [Max iterations, Reg parameter]
Logistic Regression
Probability: [0.9212126972383768, 0.07878730276162313]
Prediction: 0.0
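The prediction is the index of the largest entry in the probability vector, which for two classes is the same as thresholding at 0.5. A tiny plain-Java sketch, with a hypothetical class name:

```java
final class PredictionSketch {
    /** Returns the index of the most probable class as a double label. */
    static double predict(double[] probability) {
        int best = 0;
        for (int i = 1; i < probability.length; i++) {
            if (probability[i] > probability[best]) best = i;
        }
        return best;
    }
}
```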
Spark ML Pipeline (a Dataset is passed between stages)
Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse] -> Word2Vec [Vector size] -> Logistic Regression [Max iterations, Reg parameter]
Cross Validator -> Model
CV Average Metrics [0.8454839775240359, 0.9061236588248319, 0.9527128936788524, 0.9522790271664413, ... 0.9526248129757111, 0.9522790271664411] https://github.com/tmatyashovsky/spark-ml-samples
Demo Time
Summary
ML is not as complex as it seems from an applied perspective
Existing libraries and frameworks take a lot of tedious work off your hands
For instance, Spark MLlib can help to build nice ML pipelines
Thank you! @tmatyashovsky
Design by @LejlekF
References
https://www.quora.com/What-is-the-difference-between-supervised-and-unsupervised-learning-algorithms
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia
https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html
https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html
https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research
https://www.kaggle.com/c/dogs-vs-cats/
http://yann.lecun.com/exdb/mnist/
http://www.bcl.hamilton.ie/~barak/teach/F98/ECE547/hw1/index.html
http://www.slideshare.net/jeykottalam/pipelines-ampcamp
https://github.com/master/spark-stemming
https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-apache-spark.html
http://www.degeneratestate.org/posts/2016/Apr/20/heavy-metal-and-natural-language-processing-part-1/
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html
http://www.slideshare.net/liweiyang5/spark-mllib-training-material
https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
http://www.slideshare.net/databricks/combining-machine-learning-frameworks-with-apache-spark
https://databricks.com/blog/2015/10/20/audience-modeling-with-apache-spark-ml-pipelines.html
https://github.com/deeplearning4j/deeplearning4j
http://deeplearning4j.org/spark
http://mlwiki.org/index.php/Overfitting
http://bionlp-www.utu.fi/wv_demo/
https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/