Introduction to MLlib
CMPT 733, SPRING 2016, Jiannan Wang

Recap of Spark
Spark is a fast and general engine for large-scale data processing. It improves on MapReduce in two respects:
- Efficiency: in-memory computing
- Usability: high-level operators
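As a quick refresher, here is a minimal sketch of what "in-memory computing" and "high-level operators" look like in PySpark. It assumes the PySpark shell (where sc, the SparkContext, already exists); the file name "errors.txt" is hypothetical.

    # High-level operators: filter/map instead of hand-written MapReduce jobs.
    lines = sc.textFile("errors.txt")   # "errors.txt" is a made-up input file
    errors = lines.filter(lambda line: "ERROR" in line)

    # In-memory computing: cache() keeps the RDD in memory across actions,
    # so repeated computations below avoid re-reading from disk.
    errors.cache()
    print(errors.count())                                   # computes and caches
    print(errors.filter(lambda l: "timeout" in l).count())  # reuses cached data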

A Brief History of MLlib
MLbase (2012): started in the AMPLab, with the goal of making distributed machine learning easy.
MLlib enters Spark v0.8 (2013): one of the four big libraries built on top of Spark, with good coverage of distributed machine learning algorithms.
A new high-level API for MLlib (2015): spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.

MLlib's Mission
Making practical machine learning scalable and easy:
- Data is messy and often comes from multiple sources.
- Feature selection and parameter tuning are quite important.
- A model should perform well in production.
How did MLlib achieve the goal of scalability? By implementing distributed ML algorithms using Spark.

Distributed ML Algorithms in MLlib
- Feature extraction and transformation: TF-IDF, Word2Vec (Week 2)
- Classification and regression: SVMs, logistic regression, linear regression (Week 2)
- Optimization: limited-memory BFGS (L-BFGS) (Week 3)
- Clustering: k-means (Week 4)
- Dimensionality reduction: principal component analysis (PCA) (Week 4)
- Collaborative filtering: alternating least squares (ALS) (Week 5)
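To give a flavor of how these algorithms are invoked, here is a minimal k-means sketch using the RDD-based spark.mllib API. The toy 2-D data points are made up for illustration; sc is assumed from the PySpark shell.

    from pyspark.mllib.clustering import KMeans

    # Toy dataset; in practice this RDD would be loaded from HDFS.
    data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

    # Training runs as distributed Spark jobs over the RDD's partitions.
    model = KMeans.train(data, k=2, maxIterations=10)
    print(model.clusterCenters)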

Some Basic Concepts
- What is distributed ML?
- How is it different from non-distributed ML?
- How do we evaluate the performance of distributed ML algorithms?

What is Distributed ML?
A hot topic with many aliases:
- Scalable Machine Learning
- Large-scale Machine Learning
- Machine Learning for Big Data
Take a look at these courses if you want to learn more of the theory:
- SML: Scalable Machine Learning (UC Berkeley, 2012)
- Large-scale Machine Learning (NYU, 2013)
- Scalable Machine Learning (edX, 2015)

An intuitive explanation of distributed ML
Data often come in the form of an N × F table:
- N: # of training examples (e.g., tweets, images)
- F: # of features (e.g., bag of words, color histogram)
An ML algorithm can be thought of as a process that turns this table into a model (we will discuss this process later). Distributed ML studies how to make that process work in the following cases:
- Big N, small F
- Small N, big F
- Big N, big F
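As a concrete (hypothetical) illustration, in Spark this N × F table is typically an RDD of rows partitioned across machines; spark.mllib represents one labeled row as a LabeledPoint. The values below are made up:

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.linalg import Vectors

    # One row of the N x F table: a label plus F feature values.
    row = LabeledPoint(1.0, Vectors.dense([0.5, 1.2, 0.0]))

    # The full table: an RDD of N such rows, split across the cluster.
    table = sc.parallelize([row] * 4)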

How is it different from non-distributed ML?
- It requires distributed data storage and access. Thanks to HDFS and Spark!
- Network communication is often the bottleneck: non-distributed ML focuses on reducing CPU time and I/O cost, but distributed ML often seeks to reduce network communication.
- There are more design choices (see the sketch after this list):
  - Broadcast (recall Assignment 3B in CMPT 732)
  - Caching (which intermediate results should be cached in memory?)
  - Parallelization (which parts of an ML algorithm should be parallelized?)
  - …
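A minimal sketch of the first two design choices in PySpark; the model vector and data points are hypothetical stand-ins:

    import numpy as np

    # Broadcast: ship a read-only copy of the model to every worker once,
    # rather than once per task, to cut network communication.
    weights = sc.broadcast(np.array([0.1, 0.2, 0.3]))

    points = sc.parallelize([np.array([1.0, 0.0, 2.0]),
                             np.array([0.0, 1.0, 1.0])])

    # Caching: keep intermediate results in memory across iterations.
    features = points.cache()

    # Each worker reads the broadcast value locally via .value.
    scores = features.map(lambda x: float(np.dot(x, weights.value)))
    print(scores.collect())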

Performance metrics of distributed ML
Consider this scenario: you develop an ML application using MLlib and deploy it in your company's product. It works pretty well, but after some time…
1. Speedup. Your boss says: "Some customers complained that our product is slow. How much can we improve the speed by buying more machines?"
2. Scaleout. Your boss says: "More and more customers like to use our product. But, at the same time, we will collect/process more customer data. How many new machines do we need to buy in order to keep the same speed?"
[Plots omitted: ideal speedup grows linearly with # of machines; ideal scaleout keeps runtime flat as # of machines and data size grow together.]
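These metrics have standard definitions (not spelled out on the slide): if T(m, n) is the runtime on m machines with data of size n, speedup is T(1, n) / T(m, n) (ideally m), and scaleout is T(1, n) / T(m, m·n) (ideally 1). A toy calculation with made-up timings:

    # Hypothetical timings in seconds, for illustration only.
    t_1m_1x = 120.0   # T(1, n):  1 machine, base dataset
    t_8m_1x = 20.0    # T(8, n):  8 machines, same dataset
    t_8m_8x = 130.0   # T(8, 8n): 8 machines, 8x the data

    speedup = t_1m_1x / t_8m_1x   # 6.0; ideal would be 8.0
    scaleout = t_1m_1x / t_8m_8x  # ~0.92; ideal would be 1.0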

MLlib's Mission
Making practical machine learning scalable and easy:
- Data is messy and often comes from multiple sources.
- Feature selection and parameter tuning are quite important.
- A model should perform well in production.
How did MLlib achieve the goal of scalability? By implementing distributed machine learning algorithms using Spark.
How did MLlib achieve the goal of ease of use? With the new ML pipeline API.

ML Workflow
Our goal: Dataset → Load Data → Extract Features → Train → Model

ML Workflows are complex
The goal is the same (Dataset → Load Data → Extract Features → Train → Model), but in practice:
- Data often comes from multiple sources, which requires a lot of data transformation work.
- There are so many ML algorithms and parameters to choose from.

Pain points
1. The whole process generates a lot of RDDs, and manipulating and managing these RDDs is painful. In the feature selection stage, we often need to perform selection/projection/aggregation/group-by/join operations on RDDs.
2. Writing the workflow as a script makes it hard to reuse. Data often change over time, so we need to iterate on the old workflow to make it work for new datasets.
3. An ML workflow has a lot of parameters, and tuning them is painful. Almost every stage of the workflow has some parameters to tune (e.g., # of features, # of iterations, step size, gram size, regularization parameter).

The new ML pipeline API
Pain point 1: manipulating and managing a lot of RDDs is painful.
Basic idea: use a DataFrame (instead of an RDD) to represent the ML dataset. A DataFrame adds a schema and relational operations on top of an RDD, making the dataset easy to manipulate and easy to manage.
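A small sketch of the difference, with a made-up two-column dataset (sqlContext is the Spark 1.x SQL entry point from the PySpark shell):

    # As an RDD of tuples: schema-less, so every operation is a lambda.
    rdd = sc.parallelize([("spam", 57), ("ham", 12), ("spam", 3)])
    counts_rdd = rdd.reduceByKey(lambda a, b: a + b)

    # As a DataFrame: named columns and relational operators.
    df = sqlContext.createDataFrame(rdd, ["label", "count"])
    counts_df = df.groupBy("label").sum("count")
    counts_df.show()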

The new ML pipeline API (cont'd)
Pain point 2: writing the workflow as a script makes it hard to reuse.
Basic idea: abstract the ML stages into two components:
- Transformer: DataFrame → DataFrame
- Estimator: DataFrame → Model
A pipeline consists of a sequence of Transformers and Estimators.
- Create a new pipeline: pipeline = Pipeline(stages=[Transformer1, Transformer2, Estimator])
- Apply the pipeline to a dataset: model = pipeline.fit(dataset)
- Swap in a different stage (here, a new first transformer): pipeline' = Pipeline(stages=[Transformer1', Transformer2, Estimator])
- Apply to a new dataset: model' = pipeline'.fit(new_dataset)
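A concrete version of this pattern, using the standard spark.ml text-classification stages (the tiny dataset and column names are illustrative):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LogisticRegression

    # A tiny labeled dataset: (id, text, label).
    training = sqlContext.createDataFrame([
        (0, "spark is fast", 1.0),
        (1, "hadoop mapreduce", 0.0),
    ], ["id", "text", "label"])

    # Transformers: DataFrame -> DataFrame.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=1000)

    # Estimator: DataFrame -> Model.
    lr = LogisticRegression(maxIter=10, regParam=0.01)

    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
    model = pipeline.fit(training)          # fits the whole workflow at once
    predictions = model.transform(training) # the fitted pipeline acts as a Transformer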

The new ML pipeline API (cont'd)
Pain point 3: parameter tuning is painful.
Basic idea: grid search and cross validation.
Grid search enumerates every possible combination of the parameters. For example:
- numFeatures = 10, 100, 1000
- regParam = 0.1, 0.01
→ (10, 0.1), (10, 0.01), (100, 0.1), (100, 0.01), (1000, 0.1), (1000, 0.01)
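In spark.ml this grid is built with ParamGridBuilder; the sketch below reuses the hashingTF and lr stages from the previous example:

    from pyspark.ml.tuning import ParamGridBuilder

    # 3 x 2 = 6 parameter combinations, matching the enumeration above.
    paramGrid = (ParamGridBuilder()
                 .addGrid(hashingTF.numFeatures, [10, 100, 1000])
                 .addGrid(lr.regParam, [0.1, 0.01])
                 .build())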

The new ML pipeline API (cont'd)
Pain point 3: parameter tuning is painful.
Basic idea: grid search and cross validation.
Grid search enumerates every possible combination of the parameters; cross validation evaluates which combination performs the best.
Cross validation (3 folds): the dataset is split into three folds. Each fold in turn serves as the testing set while the other two are used for training, producing one evaluation metric per fold; the three metrics are then averaged.
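spark.ml wires the pipeline, the parameter grid, and the evaluation together in CrossValidator; a sketch continuing the example above:

    from pyspark.ml.tuning import CrossValidator
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    cv = CrossValidator(estimator=pipeline,
                        estimatorParamMaps=paramGrid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=3)

    # Fits the pipeline once per (fold, parameter combination) and
    # returns the model trained with the best-performing parameters.
    cvModel = cv.fit(training)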

spark.mllib vs. spark.ml
Advantage of spark.ml: spark.mllib is the old ML API in Spark and only focuses on making ML scalable; spark.ml is the new ML API and focuses on making ML both scalable and easy to use.
Disadvantage of spark.ml: it contains fewer ML algorithms than spark.mllib.
Which one should you use in your assignments?
- Use spark.ml in Assignment 1.
- Use spark.mllib in the other assignments.

Assignment 1 (http://tiny.cc/cmpt733-sp16-a1)
Part 1: Matrix multiplication
- Dense representation
- Sparse representation
Part 2: A simple ML pipeline
- Adding a parameter tuning component
Deadline: 23:59, Jan 11
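For Part 1, MLlib's linalg module illustrates what "dense" versus "sparse" means; the values below are made up, and the assignment's exact matrix format may differ:

    from pyspark.mllib.linalg import Vectors

    # Dense: store all entries, including zeros.
    dense = Vectors.dense([1.0, 0.0, 0.0, 3.0])

    # Sparse: store only non-zero entries as (size, indices, values).
    sparse = Vectors.sparse(4, [0, 3], [1.0, 3.0])

    # Both represent the same vector.
    print(dense.dot(sparse))  # 1.0*1.0 + 3.0*3.0 = 10.0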