Introduction to MLlib
CMPT 733, Spring 2017, Jiannan Wang
Recap of Spark
- Spark is a fast and general engine for large-scale data processing.
- It improves on MapReduce in two ways:
  - Efficiency: in-memory computing
  - Usability: high-level operators
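As a refresher, here is a minimal PySpark sketch of both points; the input file name and app name are illustrative, not from the slides.

# Minimal sketch, assuming a local Spark install and an input file "data.txt".
from pyspark import SparkContext

sc = SparkContext(appName="SparkRecap")
lines = sc.textFile("data.txt").cache()  # in-memory computing: keep the RDD cached

# High-level operators (flatMap/map/reduceByKey) replace hand-written MapReduce jobs.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.take(5))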
A Brief History of MLlib
- MLbase (2012): started in the AMPLab, with the goal of making distributed machine learning easy.
- MLlib enters Spark v0.8 (2013): one of the four big libraries built on top of Spark, with good coverage of distributed machine learning algorithms.
- A new high-level API for MLlib (2015): spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
MLlib's Mission
- Making practical machine learning scalable and easy:
  - Data is messy and often comes from multiple sources.
  - Feature selection and parameter tuning are quite important.
  - A model should perform well in production.
- How did MLlib achieve the goal of scalability? By implementing distributed ML algorithms using Spark.
Distributed ML Algorithms in MLlib
- Week 2: Feature engineering and classification (TF-IDF, Word2Vec, SVMs)
- Week 3: Clustering (k-means)
- Week 4: Collaborative filtering (Alternating Least Squares, ALS)
Some Basic Concepts
- What is distributed ML?
- How does it differ from non-distributed ML?
- How do we evaluate the performance of distributed ML algorithms?
What is Distributed ML?
- A hot topic with many aliases: scalable machine learning, large-scale machine learning, machine learning for big data.
- Take a look at these courses if you want to learn more of the theory:
  - SML: Scalable Machine Learning (UC Berkeley, 2012)
  - Large-scale Machine Learning (NYU, 2013)
  - Scalable Machine Learning (edX, 2015)
An Intuitive Explanation of Distributed ML
- Data often comes in the form of an N x F table:
  - N: # of training examples (e.g., tweets, images)
  - F: # of features (e.g., bag of words, color histogram)
- An ML algorithm can be thought of as a process that turns the table into a model (we will discuss this process later).
- Distributed ML studies how to make this process work when:
  - N is big, F is small
  - N is small, F is big
  - both N and F are big
How Does It Differ from Non-Distributed ML?
- It requires distributed data storage and access (thanks to HDFS and Spark!).
- Network communication is often the bottleneck: non-distributed ML focuses on reducing CPU time and I/O cost, but distributed ML often seeks to reduce network communication.
- There are more design choices (see the sketch below):
  - Broadcast (recall Assignment 3B in CMPT 732)
  - Caching (which intermediate results should be cached in memory?)
  - Parallelization (which parts of an ML algorithm should be parallelized?)
  - ...
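To make the first two design choices concrete, here is a small hedged sketch in PySpark; the lookup table and data are made up for illustration.

# Sketch: broadcast ships a small read-only table to each machine once,
# and cache() keeps a reused RDD in memory across actions.
from pyspark import SparkContext

sc = SparkContext(appName="CommDemo")
lookup = sc.broadcast({"a": 1, "b": 2})              # sent to each worker once

data = sc.parallelize(["a", "b", "a", "c"]).cache()  # reused twice below
print(data.map(lambda k: lookup.value.get(k, 0)).sum())  # first pass fills the cache
print(data.count())                                      # second pass reads the cached copy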
Performance Metrics of Distributed ML
Consider this scenario: you develop an ML application using MLlib and deploy it in your company's product. It works pretty well, but after some time...
1. Speedup. Your boss says: "Some customers complained that our product is slow. How much can we improve the speed by buying more machines?"
2. Scaleout. Your boss says: "More and more customers like to use our product. But, at the same time, we will collect and process more customer data. How many new machines do we need to buy in order to keep the same speed?"
[Figure: two plots of the ideal cases: speedup grows linearly with the number of machines, and scaleout stays flat as the number of machines and the data size grow together.]
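The slides do not define these formally, but a standard way to state them (my notation, with T(m, D) the running time on m machines over data of size D):

speedup(m) = T(1, D) / T(m, D),    ideal: speedup(m) = m
scaleout(m) = T(1, D) / T(m, m*D), ideal: scaleout(m) = 1

Ideal speedup means doubling the machines halves the time; ideal scaleout means doubling machines and data together keeps the running time constant.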
MLlib's Mission (Revisited)
- Making practical machine learning scalable and easy.
- How did MLlib achieve the goal of scalability? By implementing distributed ML algorithms using Spark.
- How did MLlib achieve the goal of ease of use? Through the new ML pipeline API.
ML Workflow
- Our goal: turn a dataset into a model.
- Stages: Load Data -> Extract Features -> Train -> Model
ML Workflows Are Complex
- The same stages (Load Data -> Extract Features -> Train) hide a lot of complexity:
  - Data often comes from multiple sources.
  - This requires a lot of data transformation work.
  - There are many ML algorithms and parameters to choose from.
Pain Points
1. The whole process generates a lot of RDDs, and manipulating and managing them is painful. In the feature selection stage, we often need to perform selection/projection/aggregation/group-by/join operations on RDDs.
2. Writing the workflow as a script makes it hard to reuse. Data often change over time, so we need to iterate on the old workflow to make it work for new datasets.
3. An ML workflow has a lot of parameters, and tuning them is painful. Almost every stage of the workflow has some parameters to tune (e.g., # of features, # of iterations, step size, gram size, regularization parameter).
The New ML Pipeline API
- Pain point 1: manipulating and managing a lot of RDDs is painful.
- Basic idea: use a DataFrame (instead of an RDD) to represent the ML dataset.
  - A DataFrame adds a schema and relational operations on top of an RDD.
  - Easy to manipulate, easy to manage.
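A minimal sketch of the difference, assuming Spark 2.x; the rows and column names are made up.

# The same rows as a plain RDD vs. a DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()
rows = [(0, "spark is fast", 1.0), (1, "hadoop mapreduce", 0.0)]

rdd = spark.sparkContext.parallelize(rows)                 # unnamed tuples, no schema
df = spark.createDataFrame(rows, ["id", "text", "label"])  # named, typed columns

# Relational operations come for free on the DataFrame side.
df.filter(df.label == 1.0).select("text").show()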
The New ML Pipeline API (cont'd)
- Pain point 2: writing the workflow as a script makes it hard to reuse.
- Basic idea: abstract ML stages into two components:
  - Transformer: DataFrame -> DataFrame
  - Estimator: DataFrame -> Model
- A pipeline consists of a sequence of Transformers and Estimators:
  - Create a new pipeline: pipeline = Pipeline(stages=[Transformer1, Transformer2, Estimator])
  - Apply the pipeline to a dataset: model = pipeline.fit(dataset)
  - Swap in a different stage: pipeline2 = Pipeline(stages=[Transformer1b, Transformer2, Estimator])
  - Apply it to a new dataset: model2 = pipeline2.fit(new_dataset)
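A concrete instance of this pattern, closely following the spark.ml documentation example: Tokenizer and HashingTF play the Transformer roles and LogisticRegression is the Estimator. The tiny training set is made up.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PipelineDemo").getOrCreate()
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
], ["id", "text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")      # Transformer
hashingTF = HashingTF(inputCol="words", outputCol="features")  # Transformer
lr = LogisticRegression(maxIter=10, regParam=0.01)             # Estimator

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)                  # fits the whole workflow at once
model.transform(training).select("text", "prediction").show()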
The New ML Pipeline API (cont'd)
- Pain point 3: parameter tuning is painful.
- Basic idea: grid search and cross validation.
- Grid search enumerates every possible combination of parameters:
  - numFeatures = 10, 100, 1000 and regParam = 0.1, 0.01
  - gives (10, 0.1), (10, 0.01), (100, 0.1), (100, 0.01), (1000, 0.1), (1000, 0.01)
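In spark.ml, this enumeration is done with ParamGridBuilder; a sketch, reusing hashingTF and lr from the pipeline example above.

# Continuing the pipeline example (hashingTF and lr already defined).
from pyspark.ml.tuning import ParamGridBuilder

paramGrid = (ParamGridBuilder()
             .addGrid(hashingTF.numFeatures, [10, 100, 1000])
             .addGrid(lr.regParam, [0.1, 0.01])
             .build())
print(len(paramGrid))  # 3 x 2 = 6 combinations, exactly as enumerated above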
The New ML Pipeline API (cont'd)
- Pain point 3: parameter tuning is painful.
- Basic idea: grid search and cross validation.
- Cross validation evaluates which combination performs best.
  - Example (3 folds): split the dataset into 3 folds; in each round, train on 2 folds and test on the held-out fold to get an evaluation metric; average the 3 metrics to score each parameter combination.
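spark.ml packages this as CrossValidator; a sketch, reusing pipeline, paramGrid, and training from the examples above.

# Continuing the examples above (pipeline, paramGrid, training already defined).
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)  # 3 folds, as in the example above
cvModel = cv.fit(training)       # trains/evaluates each combination, keeps the best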
spark.mllib vs. spark.ml
- Advantage of spark.ml:
  - spark.mllib is the old ML API in Spark; it only focuses on making ML scalable.
  - spark.ml is the new ML API in Spark; it focuses on making ML both scalable and easy to use.
- Disadvantage of spark.ml: it contains fewer ML algorithms than spark.mllib.
- Which one should you use in your assignments?
  - Use spark.ml in Assignment 1.
  - Use spark.mllib or spark.ml in the other assignments.
Assignment 1 (http://tiny.cc/cmpt733-a1)
- Part 1: Matrix multiplication, with a dense representation and a sparse representation (see the sketch below).
- Part 2: A simple ML pipeline, adding a parameter-tuning component.
- Deadline: 23:59, Jan 15
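For Part 1, the two representations correspond roughly to spark.ml's vector types. A minimal sketch, assuming Spark 2.x; the assignment may define its own matrix format.

from pyspark.ml.linalg import Vectors

dense = Vectors.dense([1.0, 0.0, 0.0, 3.0])     # stores all 4 entries explicitly
sparse = Vectors.sparse(4, [0, 3], [1.0, 3.0])  # size, nonzero indices, values
print(dense.dot(sparse))                        # 1*1 + 3*3 = 10.0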