Published by Amanda Copeland. Modified over 9 years ago.
1
Apache Mahout
2
Mahout
– Introduction
– Machine Learning
– Clustering
– K-means
– Canopy Clustering
– Fuzzy K-Means
– Conclusion
3
What is Mahout?
Distributed machine learning libraries
– “scalable to reasonably large data sets”
– Runs on Hadoop
4
What?
Hadoop brings:
– Map/Reduce API
– HDFS
– In other words, scalability and fault-tolerance
Mahout brings:
– Library of machine learning algorithms
– Examples
5
Why Mahout?
Many Open Source ML libraries either:
– Lack Community
– Lack Documentation and Examples
– Lack Scalability
– Lack the Apache License ;-)
– Or are research-oriented
6
Clustering
Unsupervised
Find Natural Groupings
– Documents
– Search Results
– People
– Genetic traits in groups
– Many, many more uses
7
Types
Supervised
– Using labeled training data, create a function that predicts the output of unseen inputs
Unsupervised
– Using unlabeled data, create a function that predicts output
Semi-Supervised
– Uses labeled and unlabeled data
8
Example: Clustering Google News
9
K-means Algorithm
1) Pick a number (k) of cluster centers
2) Assign every element to its nearest cluster center
3) Move each cluster center to the mean of its assigned elements
4) Repeat steps 2-3 until convergence
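The four steps above can be sketched on a single machine in plain Python. This is only an illustration of the algorithm, not Mahout's Map/Reduce implementation or its API; the function names, the squared Euclidean distance, the random choice of initial centers, and the `max_iter` and `seed` parameters are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Hypothetical helper: cluster a list of tuples into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k cluster centers (assumption: k distinct input points).
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every element to its nearest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: move each center to the mean of its assigned elements
        # (an empty cluster keeps its previous center).
        new_centers = [
            mean(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        # Step 4: repeat until the centers stop moving.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

Calling `kmeans(points, 2)` on two well-separated groups of 2-D tuples returns the two centers and the list of members assigned to each.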
10
K-means Example
Figure 1: K-means algorithm. Training examples are shown as dots, and cluster centroids are shown as crosses.
11
Invocation using the command line takes the form:
12
Canopy Clustering
Canopy Clustering is a very simple, fast and surprisingly accurate method for grouping objects into clusters.
Define two thresholds
– Tight: T1
– Loose: T2
Put all records into a set S
While S is not empty
– Remove any record r from S and create a canopy centered at r
– For each other record r_i, compute cheap distance d from r to r_i
– If d < T2, place r_i in r’s canopy
– If d < T1, remove r_i from S
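The canopy procedure above can be sketched in a few lines of Python. The sketch keeps the slide's naming (T1 is the tight threshold, T2 the loose one, so T1 < T2), which is the reverse of the labeling used in some other descriptions of canopy clustering. It assumes that "each other record" means each record still in S, and it uses Manhattan distance as a stand-in for the "cheap distance"; the function and parameter names are hypothetical and this is not Mahout's implementation.

```python
def manhattan(a, b):
    """A cheap distance measure (assumption: Manhattan distance on tuples)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def canopy(points, t1_tight, t2_loose, distance=manhattan):
    """Hypothetical helper: return a list of (center, members) canopies."""
    remaining = list(points)  # the set S
    canopies = []
    while remaining:
        # Remove any record r from S and create a canopy centered at r.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            # Within the loose threshold: p joins r's canopy.
            if d < t2_loose:
                members.append(p)
            # Only records outside the tight threshold stay in S, so a record
            # between the two thresholds can appear in more than one canopy.
            if d >= t1_tight:
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies
```

For the points (0,0), (1,0), (2,0), (10,0), (11,0) with T1 = 1.5 and T2 = 3, the sketch yields three canopies centered at (0,0), (2,0) and (10,0); the point (2,0) belongs to the first canopy and also seeds its own, because it lies between the two thresholds.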
13
Canopy Clustering
SequenceFile (WritableComparable, VectorWritable)
Invocation using the command line takes the form:
14
Fuzzy K-Means
Fuzzy K-Means (also called Fuzzy C-Means) is an extension of K-Means, the popular simple clustering technique. Like K-Means, Fuzzy K-Means works on objects that can be represented in an n-dimensional vector space for which a distance measure is defined. The algorithm is similar to k-means.
Initialize k clusters
Until converged
– Compute the probability of a point belonging to a cluster, for every (point, cluster) pair
– Re-compute the cluster centers using the above membership probabilities of points to clusters
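A minimal single-machine sketch of the loop above follows. The slide does not say how the membership values are computed, so the sketch assumes the standard fuzzy c-means formulas: a fuzziness parameter m > 1, memberships u_ij = 1 / Σ_l (d_ij / d_il)^(2/(m-1)), and centers computed as means weighted by u_ij^m. The initial centers are passed in by the caller; the function name, the Euclidean distance, and the `tol` and `max_iter` parameters are hypothetical and do not reflect Mahout's API.

```python
import math


def fuzzy_kmeans(points, initial_centers, m=2.0, max_iter=100, tol=1e-6):
    """Hypothetical helper: return (centers, memberships) for a list of tuples."""
    centers = [tuple(float(x) for x in c) for c in initial_centers]
    k = len(centers)
    dims = len(points[0])
    exponent = 2.0 / (m - 1.0)  # assumes m > 1
    memberships = []
    for _ in range(max_iter):
        # Compute the membership of every point in every cluster.
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            if 0.0 in dists:
                # The point sits exactly on a center: full membership there.
                hit = dists.index(0.0)
                row = [1.0 if j == hit else 0.0 for j in range(k)]
            else:
                row = [
                    1.0 / sum((dj / dl) ** exponent for dl in dists)
                    for dj in dists
                ]
            memberships.append(row)
        # Re-compute each center as a mean weighted by membership ** m.
        new_centers = []
        for j in range(k):
            weights = [row[j] ** m for row in memberships]
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dims)
            ))
        # Converged when no center moved by more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships
```

Unlike k-means, every point keeps a membership value for every cluster, and each row of `memberships` sums to 1. Taking the largest entry in a row recovers a hard assignment when one is needed.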
15
Fuzzy K-Means
Invocation using the command line takes the form:
16
Conclusion
Mahout did not scale well
Mahout was not easy to learn
Mahout was not easily modifiable
For performance and efficiency, it is better to
– Understand the data set
– Understand data mining
– Understand the methodology
17
Thank you!