Machine Learning on Spark

Slides:



Advertisements
Similar presentations
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin Ma,
Advertisements

Shark:SQL and Rich Analytics at Scale
CS525: Special Topics in DBs Large-Scale Data Management
Clustering k-mean clustering Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
K-means Clustering Ke Chen.
Basic Gene Expression Data Analysis--Clustering
K-means algorithm 1)Pick a number (k) of cluster centers 2)Assign every gene to its nearest cluster center 3)Move each cluster center to the mean of its.
Machine Learning on.NET F# FTW!. A few words about me  Mathias Brandewinder  Background: economics, operations research .NET developer.
UC Berkeley Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.
Spark: Cluster Computing with Working Sets
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
Collaborative Filtering in iCAMP Max Welling Professor of Computer Science & Statistics.
Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.
Slide 1 EE3J2 Data Mining Lecture 16 Unsupervised Learning Ali Al-Shahib.
Semi-Supervised Clustering Jieping Ye Department of Computer Science and Engineering Arizona State University
Introduction to Bioinformatics - Tutorial no. 12
Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.
Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.
K-means Clustering. What is clustering? Why would we want to cluster? How would you determine clusters? How can you do this efficiently?
Building Efficient Time Series Similarity Search Operator Mijung Kim Summer Internship 2013 at HP Labs.
SparkR: Enabling Interactive Data Science at Scale
Apache Mahout Feb 13, 2012 Shannon Quinn Cloud Computing CS
Unsupervised Learning. CS583, Bing Liu, UIC 2 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate.
CS525: Big Data Analytics Machine Learning on Hadoop Fall 2013 Elke A. Rundensteiner 1.
1 A K-Means Based Bayesian Classifier Inside a DBMS Using SQL & UDFs Ph.D Showcase, Dept. of Computer Science Sasi Kumar Pitchaimalai Ph.D Candidate Database.
“Study on Parallel SVM Based on MapReduce” Kuei-Ti Lu 03/12/2015.
Apache Mahout. Mahout Introduction Machine Learning Clustering K-means Canopy Clustering Fuzzy K-Means Conclusion.
Cache-Conscious Performance Optimization for Similarity Search Maha Alabduljalil, Xun Tang, Tao Yang Department of Computer Science University of California.
Evolutionary Algorithms for Finding Optimal Gene Sets in Micro array Prediction. J. M. Deutsch Presented by: Shruti Sharma.
Advanced Analytics on Hadoop Spring 2014 WPI, Mohamed Eltabakh 1.
Clustering Clustering is a technique for finding similarity groups in data, called clusters. I.e., it groups data instances that are similar to (near)
Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Apache Mahout Qiaodi Zhuang Xijing Zhang.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
Clustering Hongfei Yan School of EECS, Peking University 7/8/2009 Refer to Aaron Kimball’s slides.
6.S093 Visual Recognition through Machine Learning Competition Image by kirkh.deviantart.com Joseph Lim and Aditya Khosla Acknowledgment: Many slides from.
Given a set of data points as input Randomly assign each point to one of the k clusters Repeat until convergence – Calculate model of each of the k clusters.
Clustering Approaches Ka-Lok Ng Department of Bioinformatics Asia University.
Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods.
Big Data Infrastructure Week 8: Data Mining (1/4) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States.
Apache Mahout Industrial Strength Machine Learning Jeff Eastman.
Data Summit 2016 H104: Building Hadoop Applications Abhik Roy Database Technologies - Experian LinkedIn Profile:
Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA ( ) VAIBHAV JAISWAL( )
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Topic 4: Cluster Analysis Analysis of Customer Behavior and Service Modeling.
Raju Subba Open Source Project: Apache Spark. Introduction Big Data Analytics Engine and it is open source Spark provides APIs in Scala, Java, Python.
Hadoop Javad Azimi May What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data. It includes:
Homework 1 Tutorial Instructor: Weidong Shi (Larry), PhD
Big Data is a Big Deal!.
Semi-Supervised Clustering
Optimizing Parallel Algorithms for All Pairs Similarity Search
Industrial Strength Machine Learning Jeff Eastman
Spark Presentation.
Hadoop Clusters Tess Fulkerson.
Waikato Environment for Knowledge Analysis
Topic 3: Cluster Analysis
CMPT 733, SPRING 2016 Jiannan Wang
Revision (Part II) Ke Chen
KMeans Clustering on Hadoop Fall 2013 Elke A. Rundensteiner
Data Mining 資料探勘 分群分析 (Cluster Analysis) Min-Yuh Day 戴敏育
Clustering 77B Recommender Systems
Revision (Part II) Ke Chen
HPML Conference, Lyon, Sept 2018
Overview of big data tools
Topic 5: Cluster Analysis
CMPT 733, SPRING 2017 Jiannan Wang
Fast, Interactive, Language-Integrated Cluster Computing
Introduction to Machine learning
Presentation transcript:

Machine Learning on Spark Shivaram Venkataraman UC Berkeley

Machine learning Computer Science Statistics

Machine learning Spam filters Click prediction Recommendations Search ranking

Machine learning techniques Classification Clustering Regression Active learning Collaborative filtering

Implementing Machine Learning Machine learning algorithms are Complex, multi-stage Iterative MapReduce/Hadoop unsuitable Need efficient primitives for data sharing

Machine Learning using Spark Spark RDDs  efficient data sharing In-memory caching accelerates performance Up to 20x faster than Hadoop Easy to use high-level programming interface Express complex algorithms ~100 lines.

Machine learning techniques Classification Clustering Regression Active learning Collaborative filtering

K-Means Clustering using Spark Focus: Implementation and Performance

Grouping data according to similarity Clustering E.g. archaeological dig Clustering applications – applications – image recognition Distance North Grouping data according to similarity Distance East

Grouping data according to similarity Clustering E.g. archaeological dig Clustering applications – applications – image recognition Distance North Grouping data according to similarity Distance East

K-Means Algorithm E.g. archaeological dig Distance North Distance East Benefits Popular Fast Conceptually straightforward E.g. archaeological dig Clustering applications – applications – image recognition Distance North Distance East

K-Means: preliminaries Data: Collection of values Clustering applications – applications – image recognition data = lines.map(line=> parseVector(line)) Feature 2 Feature 1

K-Means: preliminaries Dissimilarity: Squared Euclidean distance Clustering applications – applications – image recognition Feature 2 dist = p.squaredDist(q) Feature 1

K-Means: preliminaries K = Number of clusters Clustering applications – applications – image recognition Feature 2 Data assignments to clusters S1, S2,. . ., SK Feature 1

K-Means: preliminaries K = Number of clusters Clustering applications – applications – image recognition Feature 2 Data assignments to clusters S1, S2,. . ., SK Feature 1

K-Means Algorithm • Initialize K cluster centers • Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its cluster’s data points. Clustering applications – applications – image recognition Feature 2 Feature 1

K-Means Algorithm • Initialize K cluster centers • Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its cluster’s data points. Clustering applications – applications – image recognition Feature 2 Feature 1

K-Means Algorithm • Initialize K cluster centers • Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its cluster’s data points. centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 Feature 1

K-Means Algorithm • Initialize K cluster centers • Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its cluster’s data points. centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 Feature 1

K-Means Algorithm • Initialize K cluster centers • Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its cluster’s data points. centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 Feature 1

K-Means Algorithm • Initialize K cluster centers • Repeat until convergence: Assign each cluster center to be the mean of its cluster’s data points. centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 closest = data.map(p => (closestPoint(p,centers),p)) Feature 1

K-Means Algorithm • Initialize K cluster centers • Repeat until convergence: Assign each cluster center to be the mean of its cluster’s data points. centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 closest = data.map(p => (closestPoint(p,centers),p)) Feature 1

K-Means Algorithm • Initialize K cluster centers • Repeat until convergence: Assign each cluster center to be the mean of its cluster’s data points. centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 closest = data.map(p => (closestPoint(p,centers),p)) Feature 1

K-Means Algorithm • Initialize K cluster centers • Repeat until convergence: centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() Feature 1

K-Means Algorithm • Initialize K cluster centers • Repeat until convergence: centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() newCenters = pointsGroup.mapValues( ps => average(ps)) Feature 1

K-Means Algorithm • Initialize K cluster centers • Repeat until convergence: centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() newCenters = pointsGroup.mapValues( ps => average(ps)) Feature 1

K-Means Algorithm • Initialize K cluster centers • Repeat until convergence: centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() newCenters = pointsGroup.mapValues( ps => average(ps)) Feature 1

K-Means Algorithm • Initialize K cluster centers • Repeat until convergence: centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 while (dist(centers, newCenters) > ɛ) closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() newCenters =pointsGroup.mapValues( ps => average(ps)) Feature 1

K-Means Algorithm • Initialize K cluster centers • Repeat until convergence: centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 while (dist(centers, newCenters) > ɛ) closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() newCenters =pointsGroup.mapValues( ps => average(ps)) Feature 1

K-Means Source Feature 2 Feature 1 centers = data.takeSample( false, K, seed) while (d > ɛ) { Clustering applications – applications – image recognition closest = data.map(p => (closestPoint(p,centers),p)) Feature 2 pointsGroup = closest.groupByKey() newCenters =pointsGroup.mapValues( ps => average(ps)) d = distance(centers, newCenters) centers = newCenters.map(_) } Feature 1

Ease of use Interactive shell: Useful for featurization, pre-processing data Lines of code for K-Means Spark ~ 90 lines – (Part of hands-on tutorial !) Hadoop/Mahout ~ 4 files, > 300 lines

Performance Logistic Regression K-Means [Zaharia et. al, NSDI’12]

Conclusion Spark: Framework for cluster computing Fast and easy machine learning programs K means clustering using Spark Hands-on exercise this afternoon ! Examples and more: www.spark-project.org