1
Machine Learning on Spark
Shivaram Venkataraman, UC Berkeley
2
Machine learning
• Computer Science
• Statistics
3
Machine learning
• Spam filters
• Click prediction
• Recommendations
• Search ranking
4
Machine learning techniques
• Classification
• Clustering
• Regression
• Active learning
• Collaborative filtering
5
Implementing Machine Learning
• Machine learning algorithms are
  - Complex, multi-stage
  - Iterative
• MapReduce/Hadoop unsuitable
• Need efficient primitives for data sharing
6
Machine Learning using Spark
• Spark RDDs: efficient data sharing
• In-memory caching accelerates performance
  - Up to 20x faster than Hadoop
• Easy-to-use high-level programming interface
  - Express complex algorithms in ~100 lines
8
K-Means Clustering using Spark
Focus: Implementation and Performance
9
Clustering
Grouping data according to similarity
E.g. archaeological dig
Clustering applications: image recognition
[Figure: axes Distance East and Distance North]
11
K-Means Algorithm
Benefits
• Popular
• Fast
• Conceptually straightforward
E.g. archaeological dig
[Figure: axes Distance East and Distance North]
12
K-Means: preliminaries
Data: Collection of values
    data = lines.map(line => parseVector(line))
[Figure: axes Feature 1 and Feature 2]
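The slide calls parseVector without showing it. A minimal sketch of what such a helper could look like, assuming one point per line with whitespace-separated features (the input format, the Array[Double] representation, and the object name ParseSketch are assumptions, not taken from the slides):

```scala
object ParseSketch {
  // Hypothetical stand-in for the slide's parseVector.
  // Assumes features are separated by whitespace; a blank line would
  // throw NumberFormatException, so filter those out beforehand.
  def parseVector(line: String): Array[Double] =
    line.trim.split("\\s+").map(_.toDouble)

  def main(args: Array[String]): Unit = {
    // A local Seq stands in for the RDD of text lines.
    val lines = Seq("1.0 2.0", "3.5 4.5")
    val data = lines.map(line => parseVector(line))
    data.foreach(v => println(v.mkString("(", ", ", ")")))
  }
}
```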
13
K-Means: preliminaries
Dissimilarity: Squared Euclidean distance
    dist = p.squaredDist(q)
[Figure: axes Feature 1 and Feature 2]
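Squared Euclidean distance is the sum over features of the squared differences. A small sketch on plain arrays, written as a free function rather than the method form p.squaredDist(q) used on the slide (the Array[Double] representation and the object name DistSketch are assumptions):

```scala
object DistSketch {
  // Sum over features of (p_i - q_i)^2; no square root is taken.
  def squaredDist(p: Array[Double], q: Array[Double]): Double = {
    require(p.length == q.length, "points must have the same number of features")
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum
  }
}
```

Skipping the square root does not change which center is closest, since the square root is monotonic.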
14
K-Means: preliminaries
K = Number of clusters
Data assignments to clusters S1, S2, ..., SK
[Figure: axes Feature 1 and Feature 2]
16
K-Means Algorithm
• Initialize K cluster centers
• Repeat until convergence:
  - Assign each data point to the cluster with the closest center.
  - Assign each cluster center to be the mean of its cluster’s data points.
[Figure: axes Feature 1 and Feature 2]
18
K-Means Algorithm
• Initialize K cluster centers
    centers = data.takeSample(false, K, seed)
• Repeat until convergence:
  - Assign each data point to the cluster with the closest center.
  - Assign each cluster center to be the mean of its cluster’s data points.
[Figure: axes Feature 1 and Feature 2]
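In the takeSample call, the first argument false requests sampling without replacement, so the K initial centers are distinct data points. A local sketch of the same idea using plain Scala collections instead of an RDD (the helper name initCenters and the object name InitSketch are hypothetical):

```scala
import scala.util.Random

object InitSketch {
  // Pick k distinct points from the data, reproducibly for a given seed.
  // Shuffling then taking the first k is sampling without replacement.
  def initCenters(data: Seq[Array[Double]], k: Int, seed: Long): Seq[Array[Double]] = {
    require(k <= data.length, "cannot sample more centers than data points")
    new Random(seed).shuffle(data).take(k)
  }
}
```

Fixing the seed makes runs repeatable, which helps when comparing results, since K-Means can converge to different clusterings from different starting centers.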
21
K-Means Algorithm
• Initialize K cluster centers
    centers = data.takeSample(false, K, seed)
• Repeat until convergence:
    closest = data.map(p => (closestPoint(p, centers), p))
  - Assign each cluster center to be the mean of its cluster’s data points.
[Figure: axes Feature 1 and Feature 2]
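The slide uses closestPoint as the key of each (key, point) pair but does not define it. One plausible sketch returns the index of the nearest center under squared Euclidean distance (the signature, the Array[Double] representation, and the object name ClosestSketch are assumptions):

```scala
object ClosestSketch {
  def squaredDist(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum

  // Index of the center nearest to p. Ties go to the lowest index,
  // because minBy keeps the first minimum it encounters.
  def closestPoint(p: Array[Double], centers: Seq[Array[Double]]): Int = {
    require(centers.nonEmpty, "need at least one center")
    centers.indices.minBy(i => squaredDist(p, centers(i)))
  }
}
```

Returning an index rather than the center itself gives a small, hashable key for the grouping step that follows.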
24
K-Means Algorithm
• Initialize K cluster centers
    centers = data.takeSample(false, K, seed)
• Repeat until convergence:
    closest = data.map(p => (closestPoint(p, centers), p))
    pointsGroup = closest.groupByKey()
[Figure: axes Feature 1 and Feature 2]
25
K-Means Algorithm
• Initialize K cluster centers
    centers = data.takeSample(false, K, seed)
• Repeat until convergence:
    closest = data.map(p => (closestPoint(p, centers), p))
    pointsGroup = closest.groupByKey()
    newCenters = pointsGroup.mapValues(ps => average(ps))
[Figure: axes Feature 1 and Feature 2]
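The update step groups points by their assigned cluster and averages each group. A local sketch with Scala collections, where groupBy stands in for the RDD's groupByKey and average is a hypothetical component-wise mean (the slides do not show its definition; the object name UpdateSketch is also an assumption):

```scala
object UpdateSketch {
  // Component-wise mean of a non-empty group of points.
  def average(ps: Seq[Array[Double]]): Array[Double] = {
    require(ps.nonEmpty, "cannot average an empty cluster")
    val sum = ps.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    sum.map(_ / ps.length)
  }

  // closest holds (clusterIndex, point) pairs, as built by the map step.
  def newCenters(closest: Seq[(Int, Array[Double])]): Map[Int, Array[Double]] =
    closest.groupBy(_._1).map { case (idx, pairs) => idx -> average(pairs.map(_._2)) }
}
```

On a real cluster, grouping moves every point across the network. A common alternative is to carry per-cluster sums and counts through reduceByKey, which combines values on each node before the shuffle.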
28
K-Means Algorithm
• Initialize K cluster centers
    centers = data.takeSample(false, K, seed)
• Repeat until convergence:
    while (dist(centers, newCenters) > ɛ)
      closest = data.map(p => (closestPoint(p, centers), p))
      pointsGroup = closest.groupByKey()
      newCenters = pointsGroup.mapValues(ps => average(ps))
[Figure: axes Feature 1 and Feature 2]
30
K-Means Source
    centers = data.takeSample(false, K, seed)
    while (d > ɛ) {
      closest = data.map(p => (closestPoint(p, centers), p))
      pointsGroup = closest.groupByKey()
      newCenters = pointsGroup.mapValues(ps => average(ps))
      d = distance(centers, newCenters)
      centers = newCenters.map(_)
    }
[Figure: axes Feature 1 and Feature 2]
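The slide code is shorthand: it omits declarations and helper definitions. The sketch below fills those in so the whole loop can run locally on plain Scala collections; it is not the tutorial's Spark program. All helper bodies, the Point alias, the object name LocalKMeans, and the choice to keep an empty cluster's previous center are assumptions.

```scala
import scala.util.Random

object LocalKMeans {
  type Point = Array[Double]

  def squaredDist(p: Point, q: Point): Double =
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum

  // Index of the nearest center; ties go to the lowest index.
  def closestPoint(p: Point, centers: IndexedSeq[Point]): Int =
    centers.indices.minBy(i => squaredDist(p, centers(i)))

  // Component-wise mean of a non-empty group of points.
  def average(ps: Seq[Point]): Point =
    ps.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / ps.length)

  def kMeans(data: Seq[Point], k: Int, epsilon: Double, seed: Long): IndexedSeq[Point] = {
    require(k >= 1 && k <= data.length, "need 1 <= k <= number of points")
    // Initialize: k distinct data points, chosen reproducibly.
    var centers: IndexedSeq[Point] = new Random(seed).shuffle(data).take(k).toIndexedSeq
    var d = Double.MaxValue
    while (d > epsilon) {
      // Assignment step: key each point by its nearest center.
      val closest = data.map(p => (closestPoint(p, centers), p))
      val pointsGroup = closest.groupBy(_._1)
      // Update step: mean of each group. Assumption: a cluster that
      // received no points keeps its previous center.
      val newCenters = centers.indices.map { i =>
        pointsGroup.get(i).map(g => average(g.map(_._2))).getOrElse(centers(i))
      }
      // Total squared movement of the centers in this iteration.
      d = centers.zip(newCenters).map { case (c, n) => squaredDist(c, n) }.sum
      centers = newCenters
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(0.0, 1.0, 2.0, 100.0, 101.0, 102.0).map(x => Array(x))
    kMeans(data, k = 2, epsilon = 1e-9, seed = 7L)
      .foreach(c => println(c.mkString("(", ", ", ")")))
  }
}
```

Each line of the loop body corresponds to one line of the slide code: map to (index, point) pairs, group by index, average each group, then measure how far the centers moved.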
31
Ease of use
• Interactive shell: useful for featurization, pre-processing data
• Lines of code for K-Means
  - Spark: ~90 lines (part of hands-on tutorial!)
  - Hadoop/Mahout: ~4 files, >300 lines
32
Performance
[Figure: panels Logistic Regression and K-Means] [Zaharia et al., NSDI’12]
33
Conclusion
• Spark: Framework for cluster computing
• Fast and easy machine learning programs
• K-Means clustering using Spark
• Hands-on exercise this afternoon!
Examples and more: