Introduction to ML with Apache Spark MLlib


Introduction to ML with Apache Spark MLlib #javaone

@tmatyashovsky https://ua.linkedin.com/in/tarasmatyashovsky

I am not a data science engineer

Motivation

Given a verse from lyrics, recognize the genre

Pop vs. Heavy Metal “I'm a rolling thunder, a pouring rain I'm comin' on like a hurricane My lightning's flashing across the sky You're only young but you're gonna die I won't take no prisoners, won't spare no lives Nobody's putting up a fight I got my bell, I'm gonna take you to hell I'm gonna get you, Satan get you” https://github.com/tmatyashovsky/spark-ml-samples


Ideas?

Ideas: Look for particular words like “fear”, “fight”, “kill”, “devil”, “death”, etc.? Count the length of a verse? Count unique words in a verse?
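The three ideas above can be sketched as hand-crafted features in plain Java. This is only an illustration: the class name VerseFeatures and its methods are hypothetical and do not come from the talk's repository, and the keyword list is just the five words quoted on the slide.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not taken from the talk's repository.
class VerseFeatures {
    // Keywords quoted on the slide.
    static final Set<String> DARK_WORDS =
            new HashSet<>(Arrays.asList("fear", "fight", "kill", "devil", "death"));

    // Lowercases the verse, keeps only letters and whitespace, and splits on whitespace.
    static List<String> words(String verse) {
        String cleaned = verse.toLowerCase().replaceAll("[^a-z\\s]", "").trim();
        if (cleaned.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(cleaned.split("\\s+"));
    }

    // Idea 1: how many words of the verse are in the keyword list.
    static long darkWordCount(String verse) {
        return words(verse).stream().filter(DARK_WORDS::contains).count();
    }

    // Idea 2: length of the verse in words.
    static int length(String verse) {
        return words(verse).size();
    }

    // Idea 3: number of distinct words in the verse.
    static int uniqueWords(String verse) {
        return new HashSet<>(words(verse)).size();
    }

    public static void main(String[] args) {
        String verse = "I'm gonna get you, Satan get you";
        System.out.println(darkWordCount(verse) + " " + length(verse) + " " + uniqueWords(verse));
    }
}
```

Such features are easy to compute but depend heavily on the chosen keyword list, which is one reason to look at learned representations instead.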

Machine Learning?

Machine Learning in 15-20 mins

Machine Learning is the study of computer algorithms that improve automatically through experience

Supervised learning, unsupervised learning, reinforcement learning

Supervised Learning

Speakers’ Feedback Dataset: date & time, conference name, speaker, talk name, track, duration, type, overall impression, overall rating, number of slides, time spent on live coding, number of jokes, etc.

Terminology: features, target variable, training example, training set, learning algorithm, hypothesis, cost function

http://www.slideshare.net/liweiyang5/spark-mllib-training-material

Regression. X axis: number of jokes during a talk. Y axis: speaker’s rating. Score of the speaker based on xxx.

Linear Regression

OMG, Math at a conference?

Linear Regression

Linear Regression

No magic, it’s just math
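To make the "just math" point concrete, here is a minimal sketch of one-variable linear regression fitted with the closed-form least-squares solution. The class name and the numbers (jokes per talk vs. rating) are invented for illustration; the slides do not show the actual data or the fitting method used.

```java
// Hypothetical illustration of simple (one-feature) linear regression.
class SimpleLinearRegression {
    // Returns {intercept, slope} minimizing the sum of squared errors
    // for the model y ~ intercept + slope * x.
    static double[] fit(double[] x, double[] y) {
        int m = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < m; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= m;
        meanY /= m;

        double covariance = 0.0;
        double variance = 0.0;
        for (int i = 0; i < m; i++) {
            covariance += (x[i] - meanX) * (y[i] - meanY);
            variance += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = covariance / variance;
        double intercept = meanY - slope * meanX;
        return new double[] {intercept, slope};
    }

    static double predict(double[] model, double x) {
        return model[0] + model[1] * x;
    }

    public static void main(String[] args) {
        // Made-up training set: number of jokes -> speaker's rating.
        double[] jokes = {0, 1, 2, 3};
        double[] rating = {1, 3, 5, 7};
        double[] model = fit(jokes, rating);
        System.out.println("rating for 4 jokes: " + predict(model, 4));
    }
}
```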

Regression

Classification. X axis: number of jokes during a talk (quantity of jokes used). Y axis: impression, positive or negative (liked or not liked the speaker).

Sigmoid (Logistic) Function
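For reference, the sigmoid (logistic) function is 1 / (1 + e^(-z)); it maps any real number to the interval (0, 1). A minimal Java sketch follows; the 0.5 threshold in classify is a common convention assumed here, not something stated on the slide.

```java
// Hypothetical illustration of the sigmoid function.
class Sigmoid {
    // Maps any real z to a value strictly between 0 and 1.
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Turns the sigmoid output into a class label using a 0.5 threshold (assumed convention).
    static int classify(double z) {
        return sigmoid(z) >= 0.5 ? 1 : 0;
    }

    public static void main(String[] args) {
        for (double z = -4; z <= 4; z += 2) {
            System.out.println(z + " -> " + sigmoid(z));
        }
    }
}
```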

Logistic Regression

Logistic Regression

No magic, it’s just math
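A rough sketch of what "just math" means for logistic regression: a one-feature model trained with batch gradient descent on the average log-loss. Everything here is an assumption for illustration (class name, learning rate, iteration count, and the made-up data of jokes vs. liked/not liked); it is not how Spark MLlib trains its models.

```java
// Hypothetical one-feature logistic regression trained with batch gradient descent.
class TinyLogisticRegression {
    double weight = 0.0;
    double bias = 0.0;

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Estimated probability that the label is 1 for input x.
    double probability(double x) {
        return sigmoid(weight * x + bias);
    }

    // Repeatedly moves weight and bias against the gradient of the average log-loss.
    void fit(double[] x, int[] y, double learningRate, int iterations) {
        int m = x.length;
        for (int it = 0; it < iterations; it++) {
            double gradWeight = 0.0;
            double gradBias = 0.0;
            for (int i = 0; i < m; i++) {
                double error = probability(x[i]) - y[i];
                gradWeight += error * x[i];
                gradBias += error;
            }
            weight -= learningRate * gradWeight / m;
            bias -= learningRate * gradBias / m;
        }
    }

    int predict(double x) {
        return probability(x) >= 0.5 ? 1 : 0;
    }

    public static void main(String[] args) {
        // Made-up data: number of jokes -> liked the speaker (1) or not (0).
        double[] jokes = {0, 1, 2, 3, 6, 7, 8, 9};
        int[] liked = {0, 0, 0, 0, 1, 1, 1, 1};
        TinyLogisticRegression model = new TinyLogisticRegression();
        model.fit(jokes, liked, 0.1, 5000);
        System.out.println("P(liked | 8 jokes) = " + model.probability(8));
    }
}
```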

Unsupervised Learning

Clustering. Axes: number of jokes during a talk vs. time (min.) spent on live coding. Number of clusters: K = 2, K = 5.

K-Means: 1. Initialize cluster centroids. 2. Assign (index) each example to the cluster centroid closest to it. 3. Recalculate (move) each centroid as the average (mean) of the examples assigned to its cluster. 4. Repeat steps 2 and 3 until the centroids no longer move.
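The K-Means steps can be sketched in plain Java as follows. This is a minimal illustration with squared Euclidean distance and caller-supplied initial centroids; the class name, the handling of empty clusters, and the sample points are assumptions, not the talk's implementation.

```java
import java.util.Arrays;

// Hypothetical minimal K-Means (Lloyd's algorithm) for illustration.
class KMeans {
    // Returns the final centroids, starting from the given initial centroids.
    static double[][] fit(double[][] points, double[][] initialCentroids, int maxIterations) {
        int k = initialCentroids.length;
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initialCentroids[c].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: add each point to the running sum of its closest centroid.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = closest(p, centroids);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move each centroid to the mean of its assigned points.
            boolean moved = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    double mean = sums[c][d] / counts[c];
                    if (mean != centroids[c][d]) {
                        moved = true;
                    }
                    centroids[c][d] = mean;
                }
            }
            if (!moved) {
                break; // centroids no longer move
            }
        }
        return centroids;
    }

    // Index of the centroid with the smallest squared Euclidean distance to p.
    static int closest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double distance = 0.0;
            for (int d = 0; d < p.length; d++) {
                double diff = p[d] - centroids[c][d];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up points: {number of jokes, minutes of live coding}.
        double[][] talks = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {9, 8}, {8, 9}};
        double[][] centroids = fit(talks, new double[][] {{0, 0}, {10, 10}}, 100);
        System.out.println(Arrays.deepToString(centroids));
    }
}
```

The result depends on the initial centroids, so practical implementations usually pick them randomly and may run several restarts.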

K-Means

ML-based Solution

Pop vs. Heavy Metal. Collect a data set of lyrics. Pop: Abba, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc. Heavy metal: Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc. Create a training set, i.e. label (0|1) + features. Train logistic regression (or another classification algorithm). https://github.com/tmatyashovsky/spark-ml-samples


Feature Extraction

Feature extractors: Bag of Words, TF-IDF, Word2Vec, GloVe. Bag of words: a single word is a one-hot encoded vector with the size of the dictionary. As a result, there are a lot of sparse vectors. http://spark.apache.org/docs/latest/ml-features.html#feature-extractors
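A small plain-Java sketch of the bag-of-words idea: build a dictionary, then represent a tokenized text as a vector of word counts with one position per dictionary word. The class and method names are hypothetical, and this is not Spark's feature extractor API; it only shows why the vectors are as long as the dictionary and mostly zeros.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical bag-of-words illustration.
class BagOfWords {
    // Builds a dictionary (word -> vector position) from tokenized documents, in alphabetical order.
    static Map<String, Integer> dictionary(List<List<String>> documents) {
        TreeSet<String> vocabulary = new TreeSet<>();
        for (List<String> document : documents) {
            vocabulary.addAll(document);
        }
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String word : vocabulary) {
            index.put(word, index.size());
        }
        return index;
    }

    // Counts how often each dictionary word occurs in the document; unknown words are ignored.
    static int[] vectorize(List<String> document, Map<String, Integer> dictionary) {
        int[] vector = new int[dictionary.size()];
        for (String word : document) {
            Integer position = dictionary.get(word);
            if (position != null) {
                vector[position]++;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("gonna", "get", "you"),
                Arrays.asList("satan", "get", "you"));
        Map<String, Integer> dict = dictionary(corpus);
        System.out.println(dict);
        System.out.println(Arrays.toString(vectorize(Arrays.asList("get", "you", "get"), dict)));
    }
}
```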

Word2Vec. Produces unique fixed-size dense vectors. Behind the scenes it is a two-layer neural net that processes text. Captures semantic and morphological similarity, so similar words are close in the vector space. Similar words would be clustered together in the high-dimensional sphere. https://code.google.com/archive/p/word2vec/

Word2Vec. Similar scores (cos ~ 1), unrelated scores (cos ~ 0), opposite scores (cos ~ -1). Example words on the diagram: Open, Closed, Conference, Metal. If two words are very close to synonymous, you’d expect them to show up in similar contexts, and indeed synonymous words tend to be close. For two completely random words, the similarity is pretty close to 0. At the opposite end there is usually not an antonym, just noise. Used Google News Negative 300. http://bionlp-www.utu.fi/wv_demo/ http://blog.christianperone.com/wp-content/uploads/2013/09/cosinesimilarityfq1.png
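The cosine score used to compare word vectors is the dot product of two vectors divided by the product of their lengths. A minimal Java sketch, with a hypothetical class name and toy 2-dimensional vectors instead of real Word2Vec output:

```java
// Hypothetical illustration of cosine similarity between two vectors.
class CosineSimilarity {
    // Returns a value in [-1, 1]: ~1 same direction, ~0 unrelated, ~-1 opposite.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same size");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine(new double[] {1, 2}, new double[] {2, 4}));
        System.out.println(cosine(new double[] {1, 0}, new double[] {0, 1}));
        System.out.println(cosine(new double[] {1, 0}, new double[] {-1, 0}));
    }
}
```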

“Love you” Similarity (my corpus: 8316 words)
Verse                                   Cosine Distance
baby one more time                      0.482028
crazy for you                           0.437875
show me the meaning of being lonely     0.258147
highway to hell                         -0.1120049
kill them all                           -0.231876
https://github.com/tmatyashovsky/spark-ml-samples


The Best Model?

Model Selection: hyperparameter tuning

Evaluating Hypothesis: under-fitting (high bias), appropriate fitting, over-fitting (high variance). http://mlwiki.org/index.php/Overfitting

K-folds Cross Validation (K = 3)
Round 1: training set (66.6%), test set (33.3%)
Round 2: test set (33.3%), training set (66.6%)
Round 3: training set (33.3%), test set (33.3%), training set (33.3%)
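How the K rounds split the data can be sketched in a few lines of Java. This sketch assigns contiguous index ranges to folds to mirror the slides; the class and method names are hypothetical, and real libraries typically shuffle or sample the rows randomly instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of K-fold index splitting (contiguous folds, no shuffling).
class KFold {
    // Fold that example i belongs to when n examples are cut into k contiguous folds.
    static int foldOf(int i, int n, int k) {
        return i * k / n;
    }

    // Indices held out as the test set in the given round (fold).
    static List<Integer> testIndices(int n, int k, int fold) {
        List<Integer> test = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (foldOf(i, n, k) == fold) {
                test.add(i);
            }
        }
        return test;
    }

    // Indices used for training in the given round: everything outside the test fold.
    static List<Integer> trainIndices(int n, int k, int fold) {
        List<Integer> train = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (foldOf(i, n, k) != fold) {
                train.add(i);
            }
        }
        return train;
    }

    public static void main(String[] args) {
        for (int fold = 0; fold < 3; fold++) {
            System.out.println("test " + testIndices(9, 3, fold) + " train " + trainIndices(9, 3, fold));
        }
    }
}
```

Each example lands in the test set exactly once, and the model's metric is averaged over the K rounds.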

Phew, enough of theory!

Practice using Java. Let’s finally move on to the implementation, using a library or framework that helps us avoid tedious transformations and provides algorithms as well as feature extractors out of the box.

Weka, Encog, Aerosolve, FlinkML. https://github.com/josephmisiti/awesome-machine-learning

Speed, generality, cloud computing, data processing, ease of use

Component Stack https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html

Spark MLlib is a library of ML algorithms and utilities designed to run in parallel on a Spark cluster

MLlib Design & Philosophy. Introduces a few new data types, e.g. vector (dense and sparse), labeled point, rating, etc. Allows invoking various algorithms on distributed datasets (RDD/Dataset). http://spark.apache.org/docs/latest/mllib-guide.html

spark.mllib: built on top of RDDs. spark.ml: built on top of Datasets. http://spark.apache.org/docs/latest/mllib-guide.html

spark.mllib Features: utilities (linear algebra, statistics, etc.); feature extraction, feature transformation, etc.; regression; classification; clustering; collaborative filtering, e.g. alternating least squares; dimensionality reduction; and many more. http://spark.apache.org/docs/latest/mllib-guide.html

spark.ml Features: “all” spark.mllib features plus pipelines, persistence, and model selection and tuning (train-validation split, K-folds cross validation). http://spark.apache.org/docs/latest/ml-guide.html

[Pipeline diagram: raw data is passed as Datasets through Transformer [parameters] and Estimator [parameters] stages; a Cross Validator takes [pipeline, evaluator, parameters].] http://spark.apache.org/docs/latest/ml-pipeline.html

Pop vs. Heavy Metal Using Spark MLlib Pipeline

Spark ML Pipeline Lyrics https://github.com/tmatyashovsky/spark-ml-samples

Raw Unknown Lyrics I'm a rolling thunder, a pouring rain I'm comin' on like a hurricane My lightning's flashing across the sky You're only young but you're gonna die I won't take no prisoners, won't spare no lives Nobody's putting up a fight I got my bell, I'm gonna take you to hell I'm gonna get you, Satan get you https://github.com/tmatyashovsky/spark-ml-samples

Spark ML Pipeline: Lyrics -> Cleanser -> Dataset

Cleanser I'm a rolling thunder, a pouring rain I'm comin' on like a hurricane My lightning's flashing across the sky You're only young but you're gonna die I won't take no prisoners, won't spare no lives Nobody's putting up a fight I got my bell, I'm gonna take you to hell I'm gonna get you, Satan get you https://github.com/tmatyashovsky/spark-ml-samples

Spark ML Pipeline: Lyrics -> Cleanser -> Numerator -> Dataset

Numerator
1 Im a rolling thunder a pouring rain
2 Im comin on like a hurricane
3 My lightnings flashing across the sky
4 Youre only young but youre gonna die
5 I wont take no prisoners wont spare no lives
6 Nobodys putting up a fight
7 I got my bell Im gonna take you to hell
8 Im gonna get you Satan get you

Spark ML Pipeline: Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Dataset

Stop Words Remover
1 im a rolling thunder a pouring rain
2 im comin on like a hurricane
3 my lightnings flashing across the sky
4 youre only young but youre gonna die
5 i wont take no prisoners wont spare no lives
6 nobodys putting up a fight
7 i got my bell im gonna take you to hell
8 im gonna get you satan get you
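The first text-processing stages (cleansing, tokenizing, removing stop words) can be imitated in plain Java to see what each one does to a line. This is a sketch under assumptions: the class is hypothetical and is not the repository's Spark transformers, and the stop word list is only the set of words that disappear between the slides' examples, not the full list a library would use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical plain-Java imitation of the cleansing, tokenizing and stop-word stages.
class LyricsPreprocessor {
    // Assumed stop words: only those that vanish in the slides' example.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "on", "my", "the", "only", "but", "i", "no", "up", "you", "to"));

    // Cleanser: removes everything except letters and whitespace.
    static String cleanse(String line) {
        return line.replaceAll("[^A-Za-z\\s]", "");
    }

    // Tokenizer: lowercases the line and splits it on whitespace.
    static List<String> tokenize(String line) {
        String trimmed = line.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Stop words remover: drops tokens that are in the stop word list.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Runs all three stages and joins the remaining tokens with spaces.
    static String process(String line) {
        return String.join(" ", removeStopWords(tokenize(cleanse(line))));
    }

    public static void main(String[] args) {
        System.out.println(process("I'm a rolling thunder, a pouring rain"));
        System.out.println(process("I got my bell, I'm gonna take you to hell"));
    }
}
```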

Spark ML Pipeline: Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Dataset

Stemmer
1 im rolling thunder pouring rain
2 im comin like hurricane
3 lightnings flashing across sky
4 youre young youre gonna die
5 wont take prisoners wont spare lives
6 nobodys putting fight
7 got bell im gonna take hell
8 im gonna get satan get

Spark ML Pipeline: Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [sentences in verse] -> Dataset

Verser (sentencesInVerse = 4)
verse1:
1 im roll thunder pour rain
2 im comin like hurrican
3 lightn flash across sky
4 your young your gonna die
verse2:
5 wont take prison wont spare live
6 nobodi put fight
7 got bell im gonna take hell
8 im gonna get satan get

Verser (sentencesInVerse = 8)
verse1:
1 im roll thunder pour rain
2 im comin like hurrican
3 lightn flash across sky
4 your young your gonna die
5 wont take prison wont spare live
6 nobodi put fight
7 got bell im gonna take hell
8 im gonna get satan get
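The effect of the sentencesInVerse parameter can be shown with a small grouping function: consecutive sentences are joined into verses of the given size. The class below is a hypothetical plain-Java sketch, not the repository's Verser transformer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of grouping sentences into verses.
class VerseGrouper {
    // Joins every `sentencesInVerse` consecutive sentences into one verse;
    // the last verse may be shorter if the sentences do not divide evenly.
    static List<String> group(List<String> sentences, int sentencesInVerse) {
        List<String> verses = new ArrayList<>();
        for (int start = 0; start < sentences.size(); start += sentencesInVerse) {
            int end = Math.min(start + sentencesInVerse, sentences.size());
            verses.add(String.join(" ", sentences.subList(start, end)));
        }
        return verses;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "im roll thunder pour rain",
                "im comin like hurrican",
                "lightn flash across sky",
                "your young your gonna die",
                "wont take prison wont spare live",
                "nobodi put fight",
                "got bell im gonna take hell",
                "im gonna get satan get");
        System.out.println(group(sentences, 4));
        System.out.println(group(sentences, 8));
    }
}
```

A smaller value yields more, shorter training examples per song; a larger value yields fewer, longer ones, which is why it is treated as a tunable parameter.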

Spark ML Pipeline: Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [sentences in verse] -> Word2Vec [vector size] -> Dataset

Word2Vec (sentencesInVerse = 4)
feature1: [0.036463763926011056, -0.013076733228398295, ... 0.03816963326281462]
feature2: [-0.013962931134021625, 0.049275818325650804, ... -0.058982484615766086]

Word2Vec (sentencesInVerse = 8)
feature1: [0.036463763926011056, -0.013076733228398295, 0.044362547532774695, 0.03816963326281462, ... -0.013962931134021625, 0.049275818325650804, -0.058982484615766086]

Spark ML Pipeline: Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [sentences in verse] -> Word2Vec [vector size] -> Logistic Regression [max iterations, reg parameter] -> Dataset

Logistic Regression. Probability: [0.9212126972383768, 0.07878730276162313]. Prediction: 0.0

Spark ML Pipeline: Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [sentences in verse] -> Word2Vec [vector size] -> Logistic Regression [max iterations, reg parameter]; Cross Validator -> Model

CV Average Metrics [0.8454839775240359, 0.9061236588248319, 0.9527128936788524, 0.9522790271664413, ... 0.9526248129757111, 0.9522790271664411] https://github.com/tmatyashovsky/spark-ml-samples
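Reading such a list usually comes down to picking the entry with the best average metric. A minimal sketch, assuming that each entry corresponds to one parameter combination and that a higher value is better; the array below contains only the six values visible on the slide (the elided ones are left out), and the class name is hypothetical.

```java
// Hypothetical helper that finds the position of the best (largest) average metric.
class BestMetric {
    // Returns the index of the largest value; ties resolve to the first occurrence.
    static int argMax(double[] metrics) {
        int best = 0;
        for (int i = 1; i < metrics.length; i++) {
            if (metrics[i] > metrics[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Only the values visible on the slide; the elided values are not reproduced.
        double[] visibleMetrics = {
            0.8454839775240359, 0.9061236588248319, 0.9527128936788524,
            0.9522790271664413, 0.9526248129757111, 0.9522790271664411
        };
        int best = argMax(visibleMetrics);
        System.out.println("best index " + best + ", metric " + visibleMetrics[best]);
    }
}
```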


Demo Time

Summary

Summary: ML is not as complex as it seems from an applied perspective. Existing libraries and frameworks reduce a lot of tedious work. For instance, Spark MLlib can help to build nice ML pipelines.

Thank you! @tmatyashovsky @LejlekF Design by

References
https://www.quora.com/What-is-the-difference-between-supervised-and-unsupervised-learning-algorithms
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia
https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html
https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html
https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research
https://www.kaggle.com/c/dogs-vs-cats/
http://yann.lecun.com/exdb/mnist/
http://www.bcl.hamilton.ie/~barak/teach/F98/ECE547/hw1/index.html
http://www.slideshare.net/jeykottalam/pipelines-ampcamp
https://github.com/master/spark-stemming
https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-apache-spark.html
http://www.degeneratestate.org/posts/2016/Apr/20/heavy-metal-and-natural-language-processing-part-1/
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html
http://www.slideshare.net/liweiyang5/spark-mllib-training-material
https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
http://www.slideshare.net/databricks/combining-machine-learning-frameworks-with-apache-spark
https://databricks.com/blog/2015/10/20/audience-modeling-with-apache-spark-ml-pipelines.html
https://github.com/deeplearning4j/deeplearning4j
http://deeplearning4j.org/spark
http://mlwiki.org/index.php/Overfitting
http://bionlp-www.utu.fi/wv_demo/
https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/