
1 Introduction to ML with Apache Spark MLlib #javaone

2 @tmatyashovsky

3 I am not a data science engineer

4 Motivation

5 Given a verse from lyrics, recognize the genre

6 Pop vs. Heavy Metal “I'm a rolling thunder, a pouring rain I'm comin' on like a hurricane My lightning's flashing across the sky You're only young but you're gonna die I won't take no prisoners, won't spare no lives Nobody's putting up a fight I got my bell, I'm gonna take you to hell I'm gonna get you, Satan get you”

7 Pop vs. Heavy Metal “I'm a rolling thunder, a pouring rain I'm comin' on like a hurricane My lightning's flashing across the sky You're only young but you're gonna die I won't take no prisoners, won't spare no lives Nobody's putting up a fight I got my bell, I'm gonna take you to hell I'm gonna get you, Satan get you”

8 Ideas?

9 Ideas
Look for particular words like “fear”, “fight”, “kill”, “devil”, “death”, etc.?
Count the length of a verse?
Count the unique words in a verse?

10 Machine Learning?

11 Machine Learning in 15-20 mins

12 Machine Learning is the study of computer algorithms that improve automatically through experience

13 Supervised learning Unsupervised learning Reinforcement learning

14 Supervised Learning

15 Speakers’ Feedback Dataset
Talk data: Date & time, Conference name, Speaker, Talk name, Track, Duration, Type
Feedback data: Overall impression, Overall rating, Number of slides, Time spent on live coding, Number of jokes, etc.

16 Features: Target variable: Training example: Training set: Learning algorithms Hypotheses: Cost function:
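
The symbols that went with these terms are not in the transcript. As a rough guide only, one common textbook notation is sketched below. The symbols and the squared-error cost are assumptions chosen for illustration, not recovered from the slide.

```latex
\begin{aligned}
&\text{Features: } x \in \mathbb{R}^{n} \qquad \text{Target variable: } y \\
&\text{Training example: } (x^{(i)}, y^{(i)}) \qquad
 \text{Training set: } \{(x^{(i)}, y^{(i)})\}_{i=1}^{m} \\
&\text{Hypothesis: } h_\theta : x \mapsto \hat{y} \\
&\text{Cost function (squared error, as one example): }
 J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^{2}
\end{aligned}
```

A learning algorithm then searches for the parameters θ that make J(θ) small on the training set.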

17

18 Regression
Plot: speaker’s rating (score of the speaker based on xxx) against the number of jokes during a talk

19 Linear Regression

20 OMG, Math at a conference?

21 Linear Regression

22 Linear Regression
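
To make the "just math" point concrete, here is a minimal sketch of one-feature linear regression fitted by ordinary least squares in plain Java. The class name, method names, and the toy jokes/rating numbers are made up for illustration. This is not code from the talk or from Spark.

```java
final class LinearRegressionSketch {

    /** Fits y ~ intercept + slope * x by ordinary least squares; returns {intercept, slope}. */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (x, y) pairs of equal length");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // slope = covariance(x, y) / variance(x)
        double sxy = 0.0;
        double sxx = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("all x values are identical");
        }
        double slope = sxy / sxx;
        return new double[] {meanY - slope * meanX, slope};
    }

    /** Evaluates the fitted line at x. */
    static double predict(double[] theta, double x) {
        return theta[0] + theta[1] * x;
    }

    public static void main(String[] args) {
        // Hypothetical toy data: number of jokes vs. speaker rating.
        double[] jokes = {0, 1, 2, 3, 4};
        double[] rating = {1.0, 1.5, 2.0, 2.5, 3.0};
        double[] theta = fit(jokes, rating);
        System.out.printf("rating = %.2f + %.2f * jokes%n", theta[0], theta[1]);
    }
}
```

The closed-form fit is used here because it is short and deterministic. Libraries typically use iterative optimizers that generalize to many features.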

23 No magic, it’s just math

24 Regression

25 Classification
Plot: impression (positive or negative, i.e. liked or did not like the speaker) against the number of jokes during a talk

26 Sigmoid (Logistic) Function

27 Logistic Regression

28 Logistic Regression
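
As a rough sketch of what sits behind these slides, the plain-Java example below passes a linear score through the sigmoid function and fits the two parameters with batch gradient descent on the log loss. The class, the learning rate, the iteration count, and the toy data are assumptions for illustration. Spark's `LogisticRegression` uses its own optimizer and regularization.

```java
final class LogisticRegressionSketch {

    /** Sigmoid (logistic) function: maps any real number into (0, 1). */
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Batch gradient descent on the log loss for one feature; returns {bias, weight}. */
    static double[] fit(double[] x, int[] y, double learningRate, int iterations) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("x and y must be non-empty and of equal length");
        }
        double bias = 0.0;
        double weight = 0.0;
        int m = x.length;
        for (int it = 0; it < iterations; it++) {
            double gradBias = 0.0;
            double gradWeight = 0.0;
            for (int i = 0; i < m; i++) {
                double error = sigmoid(bias + weight * x[i]) - y[i];
                gradBias += error;
                gradWeight += error * x[i];
            }
            bias -= learningRate * gradBias / m;
            weight -= learningRate * gradWeight / m;
        }
        return new double[] {bias, weight};
    }

    /** Estimated probability that the label is 1. */
    static double probability(double[] theta, double x) {
        return sigmoid(theta[0] + theta[1] * x);
    }

    /** Predicts 1 when the estimated probability is at least 0.5, else 0. */
    static int predict(double[] theta, double x) {
        return probability(theta, x) >= 0.5 ? 1 : 0;
    }

    public static void main(String[] args) {
        // Hypothetical toy data: number of jokes vs. liked (1) or not liked (0).
        double[] jokes = {0, 1, 2, 3, 4, 5};
        int[] liked = {0, 0, 0, 1, 1, 1};
        double[] theta = fit(jokes, liked, 0.1, 5000);
        System.out.println("P(liked | 5 jokes) = " + probability(theta, 5));
    }
}
```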

29 No magic, it’s just math

30

31 Unsupervised Learning

32 Clustering
Number of clusters: K = 5, K = 2
Plot: number of jokes during a talk against time (min.) spent on live coding

33 K-Means
Initialize cluster centroids.
Assign (index) each example to the cluster centroid closest to it.
Recalculate (move) each centroid as the average (mean) of the examples assigned to its cluster.
Repeat until the centroids no longer move.
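
The K-Means loop described on this slide can be sketched in plain Java as follows. Taking the first k points as the initial centroids is an assumption made to keep the example deterministic, since the slide does not say how centroids are initialized. The class name and the toy points are likewise made up.

```java
final class KMeansSketch {

    /** Runs K-Means and returns the final centroids. */
    static double[][] fit(double[][] points, int k, int maxIterations) {
        if (k < 1 || k > points.length) {
            throw new IllegalArgumentException("need 1 <= k <= number of points");
        }
        int dim = points[0].length;
        // Initialization (assumption): use the first k points as centroids.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        for (int it = 0; it < maxIterations; it++) {
            // Step 1: assign each example to the closest centroid and accumulate sums.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] point : points) {
                int c = closest(centroids, point);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += point[d];
                }
            }
            // Step 2: move each centroid to the mean of its assigned examples.
            boolean moved = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    double mean = sums[c][d] / counts[c];
                    if (mean != centroids[c][d]) {
                        moved = true;
                    }
                    centroids[c][d] = mean;
                }
            }
            // Step 3: stop once the centroids no longer move.
            if (!moved) {
                break;
            }
        }
        return centroids;
    }

    /** Index of the centroid with the smallest squared Euclidean distance to the point. */
    static int closest(double[][] centroids, double[] point) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double distance = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical toy data: {number of jokes, minutes of live coding}.
        double[][] talks = {{1, 1}, {9, 9}, {1, 2}, {2, 1}, {8, 9}, {9, 8}};
        for (double[] centroid : fit(talks, 2, 100)) {
            System.out.println(java.util.Arrays.toString(centroid));
        }
    }
}
```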

34 K-Means

35

36 ML-based Solution

37 Pop vs. Heavy Metal
Collect a data set of lyrics:
Pop: Abba, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
Heavy metal: Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
Create a training set, i.e. label (0|1) + features
Train logistic regression (or another classification algorithm)

38 https://github.com/tmatyashovsky/spark-ml-samples

39 Feature Extraction

40 Bag of Words, TF-IDF, Word2Vec, GloVe
Bag of words: a single word is a one-hot encoded vector with the size of the dictionary. The result is a lot of sparse vectors.
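
A minimal plain-Java sketch of the bag-of-words idea: each word becomes a one-hot vector over a sorted dictionary, and a document is the sum of its words' one-hot vectors. The class and method names are hypothetical. Spark ships its own feature extractors for this.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

final class BagOfWordsSketch {

    /** Builds a sorted dictionary of the distinct words in all documents. */
    static List<String> vocabulary(List<List<String>> documents) {
        TreeSet<String> words = new TreeSet<>();
        for (List<String> document : documents) {
            words.addAll(document);
        }
        return new ArrayList<>(words);
    }

    /** One-hot vector: all zeros except a 1 at the word's dictionary position. */
    static int[] oneHot(List<String> vocabulary, String word) {
        int[] vector = new int[vocabulary.size()];
        int index = Collections.binarySearch(vocabulary, word);
        if (index >= 0) {
            vector[index] = 1; // words outside the dictionary stay all-zero
        }
        return vector;
    }

    /** Bag of words for a document: the sum of its words' one-hot vectors. */
    static int[] bagOfWords(List<String> vocabulary, List<String> document) {
        int[] counts = new int[vocabulary.size()];
        for (String word : document) {
            int index = Collections.binarySearch(vocabulary, word);
            if (index >= 0) {
                counts[index]++;
            }
        }
        return counts;
    }
}
```

With a realistic dictionary of tens of thousands of words, almost every entry of these vectors is zero, which is the sparsity the slide refers to.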

41 Word2Vec
Produces unique fixed-size dense vectors.
Behind the scenes it is a two-layer neural net that processes text.
Captures semantic and morphological similarity, so similar words are close together (clustered) in the high-dimensional vector space.

42 Word2Vec
Similar scores (cos ~ 1), unrelated scores (cos ~ 0), opposite scores (cos ~ -1)
Example words: Open, Conference, Closed, Metal
If two words are very close to synonymous, you would expect them to show up in similar contexts, and indeed synonymous words tend to be close. For two completely random words, the similarity is close to 0. At the opposite end (cos ~ -1) there is usually no antonym, just noise. Used Google News Negative 300.
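
The cos scores above are cosine similarities between word vectors. A minimal plain-Java sketch follows. The class name is made up, and the vectors you would pass in are assumed to come from a trained Word2Vec model.

```java
final class CosineSimilaritySketch {

    /** Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|). */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("a zero vector has no direction");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Vectors pointing the same way score 1, orthogonal vectors score 0, and vectors pointing in opposite directions score -1, regardless of their lengths.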

43 “Love you” Similarity
Table columns: Verse, Cosine Distance
Verses: baby one more time; crazy for you; show me the meaning of being lonely; highway to hell; kill them all
My corpus words

44 https://github.com/tmatyashovsky/spark-ml-samples

45 The Best Model?

46 Model Selection hyper parameter tuning

47 Evaluating Hypothesis
Under-fitting (high bias) Appropriate fitting Over-fitting (high variance)

48 K-folds Cross Validation
K = 3: Training set (66.6%), Test set (33.3%)

49 K-folds Cross Validation
K = 3: Test set (33.3%), Training set (66.6%)

50 K-folds Cross Validation
K = 3: Training set (33.3%), Test set (33.3%), Training set (33.3%)
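
The three slides above show each third of the data taking a turn as the test set. A plain-Java sketch of that index bookkeeping is below. The class is hypothetical and uses contiguous, unshuffled folds for simplicity, whereas real implementations usually assign examples to folds at random.

```java
import java.util.ArrayList;
import java.util.List;

final class KFoldSketch {

    /** Returns k folds; each fold is {trainingIndices, testIndices} over examples 0..n-1. */
    static List<int[][]> split(int n, int k) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("need 2 <= k <= n");
        }
        List<int[][]> folds = new ArrayList<>();
        for (int fold = 0; fold < k; fold++) {
            // The test block for this fold is the half-open range [testStart, testEnd).
            int testStart = fold * n / k;
            int testEnd = (fold + 1) * n / k;
            int[] test = new int[testEnd - testStart];
            int[] train = new int[n - test.length];
            int t = 0;
            int r = 0;
            for (int i = 0; i < n; i++) {
                if (i >= testStart && i < testEnd) {
                    test[t++] = i;
                } else {
                    train[r++] = i;
                }
            }
            folds.add(new int[][] {train, test});
        }
        return folds;
    }
}
```

The model is trained and evaluated once per fold, and the k evaluation metrics are then averaged.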

51 Phew, enough of theory!

52 Practice using Java
Let’s finally move on to the implementation, using a library or framework that helps us avoid tedious transformations and provides algorithms as well as feature extractors out of the box.

53 Weka Encog Aerosolve FlinkML

54 Speed, Generality, Cloud computing, Data processing, Ease of use

55 Component Stack

56 Spark MLlib
A library of ML algorithms and utilities designed to run in parallel on a Spark cluster

57 MLlib Design & Philosophy
Introduces a few new data types, e.g. vector (dense and sparse), labeled point, rating, etc.
Lets you invoke various algorithms on distributed datasets (RDD/Dataset)

58 spark.mllib: built on top of RDDs
spark.ml: built on top of Datasets

59 spark.mllib Features
Utilities: linear algebra, statistics, etc.
Feature extraction, feature transformation, etc.
Regression
Classification
Clustering
Collaborative filtering, e.g. alternating least squares
Dimensionality reduction
And many more

60 spark.ml Features
“All” spark.mllib features plus:
Pipelines
Persistence
Model selection and tuning: train-validation split, k-fold cross-validation

61 Diagram: Raw data -> Dataset -> Transformer [parameters] -> Dataset -> Estimator [parameters] -> Dataset, all driven by a Cross Validator [pipeline, evaluator, parameters]

62 Pop vs. Heavy Metal Using Spark MLlib Pipeline

63 Spark ML Pipeline Lyrics

64 Raw Unknown Lyrics I'm a rolling thunder, a pouring rain I'm comin' on like a hurricane My lightning's flashing across the sky You're only young but you're gonna die I won't take no prisoners, won't spare no lives Nobody's putting up a fight I got my bell, I'm gonna take you to hell I'm gonna get you, Satan get you

65 Spark ML Pipeline
Lyrics -> Cleanser -> Dataset

66 Cleanser I'm a rolling thunder, a pouring rain I'm comin' on like a hurricane My lightning's flashing across the sky You're only young but you're gonna die I won't take no prisoners, won't spare no lives Nobody's putting up a fight I got my bell, I'm gonna take you to hell I'm gonna get you, Satan get you

67 Spark ML Pipeline
Stages (each passes a Dataset to the next): Lyrics -> Cleanser -> Numerator

68 Numerator
1. Im a rolling thunder a pouring rain
2. Im comin on like a hurricane
3. My lightnings flashing across the sky
4. Youre only young but youre gonna die
5. I wont take no prisoners wont spare no lives
6. Nobodys putting up a fight
7. I got my bell Im gonna take you to hell
8. Im gonna get you Satan get you

69 Spark ML Pipeline
Stages (each passes a Dataset to the next): Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover

70 Stop Words Remover
1. im a rolling thunder a pouring rain
2. im comin on like a hurricane
3. my lightnings flashing across the sky
4. youre only young but youre gonna die
5. i wont take no prisoners wont spare no lives
6. nobodys putting up a fight
7. i got my bell im gonna take you to hell
8. im gonna get you satan get you

71 Spark ML Pipeline
Stages (each passes a Dataset to the next): Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter

72 Stemmer
1. im rolling thunder pouring rain
2. im comin like hurricane
3. lightnings flashing across sky
4. youre young youre gonna die
5. wont take prisoners wont spare lives
6. nobodys putting fight
7. got bell im gonna take hell
8. im gonna get satan get
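
The cleansing, tokenizing, and stop-word removal shown on the preceding slides can be approximated in plain Java as below. This is only a sketch of the idea: the class is hypothetical, the stop-word list is a tiny illustrative one, and the talk's pipeline uses Spark transformers rather than this code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TextPrepSketch {

    // Illustrative stop-word list only; real lists are much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "the", "on", "my", "i", "you", "to", "up", "no", "but", "only"));

    /** Cleanser: drops everything that is neither a letter nor whitespace. */
    static String cleanse(String line) {
        return line.replaceAll("[^\\p{L}\\s]", "").trim();
    }

    /** Tokenizer: lower-cases the line and splits it on whitespace. */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Stop words remover: keeps only the tokens that are not in STOP_WORDS. */
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String raw = "I'm a rolling thunder, a pouring rain";
        System.out.println(removeStopWords(tokenize(cleanse(raw))));
    }
}
```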

73 Spark ML Pipeline
Stages (each passes a Dataset to the next): Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse]

74 Verser (sentencesInVerse = 4)
verse1:
1. im roll thunder pour rain
2. im comin like hurrican
3. lightn flash across sky
4. your young your gonna die
verse2:
5. wont take prison wont spare live
6. nobodi put fight
7. got bell im gonna take hell
8. im gonna get satan get

75 Verser (sentencesInVerse = 8)
verse1:
1. im roll thunder pour rain
2. im comin like hurrican
3. lightn flash across sky
4. your young your gonna die
5. wont take prison wont spare live
6. nobodi put fight
7. got bell im gonna take hell
8. im gonna get satan get
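
The grouping done by the Verser can be sketched in plain Java as a simple chunking step. The class below is hypothetical, and putting leftover sentences into a shorter final verse is an assumption, since the slides only show cases that divide evenly.

```java
import java.util.ArrayList;
import java.util.List;

final class VerserSketch {

    /** Groups consecutive sentences into verses of at most sentencesInVerse sentences. */
    static List<List<String>> toVerses(List<String> sentences, int sentencesInVerse) {
        if (sentencesInVerse < 1) {
            throw new IllegalArgumentException("sentencesInVerse must be at least 1");
        }
        List<List<String>> verses = new ArrayList<>();
        for (int start = 0; start < sentences.size(); start += sentencesInVerse) {
            int end = Math.min(start + sentencesInVerse, sentences.size());
            // Assumption: a trailing group shorter than sentencesInVerse still forms a verse.
            verses.add(new ArrayList<>(sentences.subList(start, end)));
        }
        return verses;
    }
}
```

With the eight sentences above, `sentencesInVerse = 4` yields two verses and `sentencesInVerse = 8` yields one, so this parameter controls how many training examples a song produces and how much text each one carries.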

76 Spark ML Pipeline
Stages (each passes a Dataset to the next): Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse] -> Word2Vec [Vector size]

77 Word2Vec (sentencesInVerse = 4)
feature1: [ , , ... ]
feature2: [ , , ... ]

78 Word2Vec (sentencesInVerse = 8)
feature1: [ , , , , ... , , ]

79 Spark ML Pipeline
Stages (each passes a Dataset to the next): Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse] -> Word2Vec [Vector size] -> Logistic Regression [Max iterations, Reg parameter]

80 Logistic Regression
Probability: [0.9212126972383768, ...]
Prediction: 0.0

81 Spark ML Pipeline
Stages (each passes a Dataset to the next): Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse] -> Word2Vec [Vector size] -> Logistic Regression [Max iterations, Reg parameter]
Cross Validator -> Model

82 CV Average Metrics [ , , , , ... , ]

83 Spark ML Pipeline
Stages (each passes a Dataset to the next): Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse] -> Word2Vec [Vector size] -> Logistic Regression [Max iterations, Reg parameter]
Cross Validator -> Model

84 Demo Time

85 Summary

86 Summary
ML is not as complex as it seems from an applied perspective
Existing libraries and frameworks reduce a lot of tedious work
For instance, Spark MLlib can help to build nice ML pipelines

87 Thank you! @tmatyashovsky
Design by @LejlekF

88 References
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia

