Apache Mahout. Mahout Introduction Machine Learning Clustering K-means Canopy Clustering Fuzzy K-Means Conclusion.

Apache Mahout

Mahout
– Introduction
– Machine Learning
– Clustering
– K-means
– Canopy Clustering
– Fuzzy K-Means
– Conclusion

What is Mahout?
Distributed machine learning libraries
– “Scalable to reasonably large data sets”
– Runs on Hadoop

What?
Hadoop brings:
– Map/Reduce API
– HDFS
– In other words, scalability and fault tolerance
Mahout brings:
– Library of machine learning algorithms
– Examples

Why Mahout?
Many open source ML libraries either:
– Lack community
– Lack documentation and examples
– Lack scalability
– Lack the Apache License ;-)
– Or are research-oriented

Clustering
Unsupervised
Find natural groupings
– Documents
– Search results
– People
– Genetic traits in groups
– Many, many more uses

Types
Supervised
– Using labeled training data, create a function that predicts the output of unseen inputs
Unsupervised
– Using unlabeled data, create a function that describes structure in the data, such as natural groupings
Semi-Supervised
– Uses both labeled and unlabeled data

Example: Clustering Google News

K-means Algorithm
1) Pick a number (k) of cluster centers
2) Assign every element to its nearest cluster center
3) Move each cluster center to the mean of its assigned elements
4) Repeat 2-3 until convergence
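The four steps above can be sketched in a few lines of plain Python. This is a single-machine illustration only, not Mahout's Map/Reduce implementation; the function name `kmeans`, the `initial_centers` parameter, and the choice of squared Euclidean distance are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    # Squared Euclidean distance between two equal-length tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_center(point, centers):
    # Index of the center closest to the point.
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def kmeans(points, k, initial_centers=None, max_iter=100, seed=0):
    # Step 1: pick k cluster centers (random data points unless given).
    if initial_centers is None:
        initial_centers = random.Random(seed).sample(points, k)
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        # Step 2: assign every element to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_center(p, centers)].append(p)
        # Step 3: move each center to the mean of its assigned elements.
        new_centers = []
        for i, members in enumerate(clusters):
            if not members:
                new_centers.append(centers[i])  # keep an empty cluster's center
                continue
            dim = len(members[0])
            new_centers.append(
                tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
            )
        # Step 4: stop once the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest_center(p, centers) for p in points]
    return centers, labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2, initial_centers=[(0, 0), (10, 10)])
```

With these two well-separated groups, the first three points end up in one cluster and the last three in the other.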

K-means Example
Figure 1: K-means algorithm. Training examples are shown as dots, and cluster centroids are shown as crosses.

Invocation using the command line takes the form:

Canopy Clustering
Canopy Clustering is a very simple, fast and surprisingly accurate method for grouping objects into clusters.
– Define two thresholds: tight T1 and loose T2
– Put all records into a set S
– While S is not empty:
  – Remove any record r from S and create a canopy centered at r
  – For each other record r_i, compute the cheap distance d from r to r_i
  – If d < T2, place r_i in r’s canopy
  – If d < T1, remove r_i from S
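The loop above can be sketched as follows, as a single-machine illustration rather than Mahout's own code. It assumes the slide's convention that T1 is the tight threshold and T2 the loose one (T1 < T2), and it reads "each other record" as each record still in S. The function name `canopy_clustering` and the caller-supplied `distance` function are hypothetical.

```python
def canopy_clustering(records, t1, t2, distance):
    # t1 is the tight threshold and t2 the loose one, as on the slide.
    if not t1 < t2:
        raise ValueError("expected tight threshold t1 < loose threshold t2")
    remaining = list(records)          # the set S
    canopies = []
    while remaining:
        center = remaining.pop(0)      # remove any record r from S
        members = [center]
        still_remaining = []
        for other in remaining:
            d = distance(center, other)    # cheap distance from r to r_i
            if d < t2:
                members.append(other)      # within the loose threshold: join canopy
            if d >= t1:
                still_remaining.append(other)  # outside the tight threshold: stays in S
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

# One-dimensional example with absolute difference as the cheap distance.
canopies = canopy_clustering([0, 1, 2, 10, 11], t1=1.5, t2=3,
                             distance=lambda a, b: abs(a - b))
```

A record that falls between the two thresholds joins the canopy but stays in S, so it can later seed or join another canopy. In the example, the record 2 belongs to the canopy centered at 0 and also becomes a canopy center itself.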

Canopy Clustering
SequenceFile (WritableComparable, VectorWritable)
Invocation using the command line takes the form:

Fuzzy K-Means
Fuzzy K-Means (also called Fuzzy C-Means) is an extension of K-Means, the popular simple clustering technique. Like K-Means, Fuzzy K-Means works on objects that can be represented in an n-dimensional vector space for which a distance measure is defined. The algorithm is similar to K-Means:
– Initialize k clusters
– Until converged:
  – Compute the probability of a point belonging to a cluster, for every (point, cluster) pair
  – Recompute the cluster centers using the above membership probabilities of points to clusters
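The slide does not give the update formulas, so the sketch below assumes the standard fuzzy c-means rules: the membership of a point in a cluster is proportional to its distance to that center raised to the power -2/(m-1), and each center is the mean of all points weighted by membership raised to m. The fuzziness parameter `m`, the tolerance `tol`, and the function names are assumptions for illustration. This is not Mahout's implementation.

```python
import math

def memberships(point, centers, m=2.0):
    # Membership of one point in every cluster; the values sum to 1.
    dists = [math.dist(point, c) for c in centers]
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        # The point sits exactly on a center: give it full membership there.
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(centers))]
    power = 2.0 / (m - 1.0)
    inverse = [d ** -power for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]

def fuzzy_kmeans(points, initial_centers, m=2.0, max_iter=100, tol=1e-6):
    centers = [tuple(c) for c in initial_centers]   # initialize k clusters
    dim = len(points[0])
    for _ in range(max_iter):
        # Compute the membership for every (point, cluster) pair.
        u = [memberships(p, centers, m) for p in points]
        # Recompute each center as a membership-weighted mean of all points.
        new_centers = []
        for i in range(len(centers)):
            weights = [row[i] ** m for row in u]
            total = sum(weights)
            if total == 0.0:
                new_centers.append(centers[i])  # no point belongs here; keep center
                continue
            new_centers.append(
                tuple(sum(w * p[d] for w, p in zip(weights, points)) / total
                      for d in range(dim))
            )
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:      # converged: centers have stopped moving
            break
    return centers, [memberships(p, centers, m) for p in points]

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, u = fuzzy_kmeans(points, initial_centers=[(0, 0), (10, 10)])
```

Unlike K-Means, every point keeps a nonzero membership in every cluster (unless it coincides with a center), so `u` holds one row of soft assignments per point rather than a single label.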

Fuzzy K-Means
Invocation using the command line takes the form:

Conclusion
– Mahout did not scale well
– Mahout was not easy to learn
– Mahout was not easily modifiable
For performance and efficiency, it is better to:
– Understand the data set
– Understand data mining
– Understand the methodology

Thank you!
