Industrial Strength Machine Learning Jeff Eastman

Apache Mahout Industrial Strength Machine Learning Jeff Eastman

Current Situation
  Large volumes of data are now available
  Platforms now exist to run computations over large datasets (Hadoop, HBase)
  Sophisticated analytics are needed to turn data into information people can use
  Active research community and proprietary implementations of “machine learning” algorithms
  The world needs scalable implementations of ML under open license - ASF

Where is ML Used Today
  Internet search clustering
  Knowledge management systems
  Social network mapping
  Taxonomy transformations
  Marketing analytics
  Recommendation systems
  Log analysis & event filtering
  SPAM filtering, fraud detection

History of Mahout
  Summer 2007
    Developers needed scalable ML
    Mailing list formed
  Community formed
    Apache contributors
    Academia & industry
    Lots of initial interest
  Project formed under Apache Lucene
    January 25, 2008

Who We Are (so far)
  Grant Ingersoll
  Dawid Weiss
  Ozgur Yilmazel
  Erik Hatcher
  Karl Wettin
  Jeff Eastman
  Ted Dunning
  Sean Owen
  Otis Gospodnetic
  Isabel Drost

Current Code Base
  Matrix & Vector library
    Hama collaboration for very large arrays
  Clustering
    Canopy
    K-Means
    Mean Shift
  Utilities
    Distance Measures
    Parameters

Example: K-Means
  Given K, assign the first K random points to be the initial cluster centers
  Assign subsequent points to the closest cluster using the supplied distance measure
  Compute the centroid of each cluster and iterate the previous step until the cluster centers converge within delta
  Run a final pass over the points to cluster them for output
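The four steps above can be sketched on a single machine. This is a minimal illustration in plain Python, not Mahout's code: the function names, the Euclidean default, the `max_iter` cap, and the choice to leave an empty cluster's center where it was are all assumptions, and "the first K random points" is read here as "the first K points of an already shuffled input".

```python
import math

def euclidean(a, b):
    """Default distance measure for this sketch; any callable could be supplied."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centers, distance):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: distance(point, centers[i]))

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))

def kmeans(points, k, distance=euclidean, delta=1e-4, max_iter=100):
    # Step 1: the first K points become the initial cluster centers.
    centers = [tuple(p) for p in points[:k]]
    for _ in range(max_iter):
        # Step 2: assign every point to its closest center.
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers, distance)].append(p)
        # Step 3: recompute centroids (an empty cluster keeps its old center).
        new_centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
        moved = max(distance(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if moved <= delta:  # every center moved by at most delta
            break
    # Step 4: final pass that labels each point for output.
    assignments = [nearest(p, centers, distance) for p in points]
    return centers, assignments
```

Calling `kmeans(points, 2)` on a list of coordinate tuples returns the final centers and, for each input point, the index of the cluster it belongs to.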

K-Means Map/Reduce Design
  Driver
    Runs multiple iteration jobs using mapper+combiner+reducer
    Runs final clustering job using only mapper
  Mapper
    Configure: Single file containing encoded Clusters
    Input: File split containing encoded Vectors
    Output: Vectors keyed by nearest cluster
  Combiner
    Input: Vectors keyed by nearest cluster
    Output: Cluster centroid vectors keyed by “cluster”
  Reducer (singleton)
    Input: Cluster centroid vectors
    Output: Single file containing Vectors keyed by cluster
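One iteration job of this design can be imitated in memory to show how data flows from mapper to combiner to reducer. The sketch below is plain Python and does not use Hadoop; all function names are hypothetical. It also makes one assumption the slide does not state: the combiner emits a partial (sum, count) pair per cluster rather than a finished centroid, so that the single reducer can merge partial results from several splits correctly.

```python
def sq_dist(a, b):
    """Squared Euclidean distance; enough for picking the nearest center."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mapper(clusters, split):
    """Emit (cluster_id, vector) for each vector in one input split."""
    for vector in split:
        cid = min(clusters, key=lambda c: sq_dist(vector, clusters[c]))
        yield cid, vector

def combiner(pairs):
    """Collapse one mapper's output to (cluster_id, (vector_sum, count))."""
    partial = {}
    for cid, vector in pairs:
        total, count = partial.get(cid, ((0.0,) * len(vector), 0))
        partial[cid] = (tuple(t + v for t, v in zip(total, vector)), count + 1)
    return list(partial.items())

def reducer(partials):
    """Merge partial sums from all combiners into new cluster centers."""
    merged = {}
    for cid, (total, count) in partials:
        old_total, old_count = merged.get(cid, ((0.0,) * len(total), 0))
        merged[cid] = (tuple(a + b for a, b in zip(old_total, total)),
                       old_count + count)
    return {cid: tuple(t / n for t in total) for cid, (total, n) in merged.items()}

def run_iteration(clusters, splits):
    """One iteration job: map and combine per split, then a single reduce."""
    partials = []
    for split in splits:
        partials.extend(combiner(mapper(clusters, split)))
    new_centers = reducer(partials)
    # Clusters that received no vectors keep their previous center.
    return {cid: new_centers.get(cid, center) for cid, center in clusters.items()}
```

Here `clusters` is a dict from cluster id to center and `splits` is a list of lists of vectors, standing in for the encoded cluster file and the file splits on the slide. The final clustering job would run only `mapper` with the converged centers.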

K-Means Hadoop Implementation
  KMeansDriver
    runJob()
    runIteration()
    isConverged()
    runCluster()
  KMeansMapper
    configure()
    map()
  KMeansCombiner
    reduce()
  KMeansReducer
  Cluster
    configure()
    formatCluster()
    decodeCluster()
    addPoint()
    computeCentroid()
    accessors
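To suggest what the Cluster methods listed above might be responsible for, here is a small Python sketch of a cluster that accumulates points and recomputes its centroid, plus a convergence check of the kind the driver would need. It is not the Mahout source: only the names `add_point`, `compute_centroid`, and `is_converged` echo the slide, while `recenter`, the fields, and every signature are assumptions made for illustration.

```python
import math

class Cluster:
    """Hypothetical cluster: a center plus a running sum of assigned points."""

    def __init__(self, cluster_id, center):
        self.cluster_id = cluster_id
        self.center = tuple(center)
        self._total = [0.0] * len(self.center)
        self._count = 0

    def add_point(self, point):
        """Fold one point into the running sum."""
        for i, value in enumerate(point):
            self._total[i] += value
        self._count += 1

    def compute_centroid(self):
        """Mean of the points added so far, or the current center if none."""
        if self._count == 0:
            return self.center
        return tuple(t / self._count for t in self._total)

    def recenter(self):
        """Move to the centroid, reset the accumulators, return the distance moved."""
        new_center = self.compute_centroid()
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(self.center, new_center)))
        self.center = new_center
        self._total = [0.0] * len(new_center)
        self._count = 0
        return shift

def is_converged(shifts, delta):
    """True when no cluster center moved by more than delta."""
    return all(shift <= delta for shift in shifts)
```

A driver loop in this style would call `add_point` while assigning vectors, collect the values returned by `recenter` for every cluster, and stop iterating once `is_converged` returns true.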

Algorithms Under Development
  Naïve Bayes
  Perceptron
  PLSI/EM
  Taste Collaborative Filtering Integration
  Genetic Programming
  Dirichlet Process Clustering

GSoC @ Mahout
  Many interesting submissions
  4 projects approved for Mahout
    “Mahout: Parallel implementation of [NB/SOM/RF] machine learning algorithms”, Farid Bourennani
    “Implementing Logistic Regression in Mahout”, Yun Jiang
    “Codename Mahout.GA for mahout-machine-learning”, Abdel Hakim Deneche
    “To implement Complementary Naïve Bayes and Expectation Maximization algorithm using Map Reduce for Multicore Systems”, Robin Anil

Conclusion
  This is just the beginning
  High demand for scalable machine learning
  Contributors needed who have
    Interest, enthusiasm & programming ability
    Test driven development readiness
    Comfort with the scary math (or bravery)
    Interest and/or proficiency with Hadoop
    Some large data sets you want to analyze
    Access to clusters that we could use for testing
