Introducing Apache Mahout Scalable Machine Learning for All! Grant Ingersoll Lucid Imagination
Overview What is Machine Learning? Mahout
Definition: “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” (Introduction to Machine Learning, E. Alpaydin). Subset of Artificial Intelligence. Draws on many other fields: comp. sci., biology, math, psychology, etc.
Types: Supervised: using labeled training data, create a function that predicts the output of unseen inputs. Unsupervised: using unlabeled data, create a function that predicts output. Semi-Supervised: uses both labeled and unlabeled data.
Characterizations Lots of Data Identifiable Features in that Data Too big/costly for people to handle People still can help
Clustering Unsupervised Find Natural Groupings Documents Search Results People Genetic traits in groups Many, many more uses
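To make "find natural groupings" concrete, here is a minimal k-means sketch in plain Python. It is an illustration only and does not use Mahout's API; the function name `kmeans`, the sample points, and the fixed iteration count are assumptions made for this example.

```python
import random
from math import dist

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: alternate between assigning points and moving centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster went empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical 2-D data with two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

No labels are supplied anywhere: the groups come purely from distances between the points, which is what makes this unsupervised.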
Example: Clustering Google News
Collaborative Filtering Unsupervised Recommend people and products User-User User likes X, you might too Item-Item People who bought X also bought Y
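The item-item idea ("people who bought X also bought Y") can be sketched by counting how often two items appear in the same user's purchases. This is a plain Python illustration, not Taste's implementation; the names `cooccurrence` and `also_bought` and the purchase data are made up for the example.

```python
from collections import defaultdict
from itertools import permutations

def cooccurrence(baskets):
    """Count, for every ordered item pair (x, y), how many users have both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        for x, y in permutations(set(items), 2):
            counts[x][y] += 1
    return counts

def also_bought(counts, item, n=3):
    """Items most often bought together with `item`, most frequent first."""
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(counts[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:n]]

# Hypothetical purchase histories, one list of items per user.
purchases = {
    "alice": ["book", "lamp", "pen"],
    "bob": ["book", "pen"],
    "carol": ["book", "pen", "mug"],
    "dave": ["lamp", "mug"],
}
counts = cooccurrence(purchases)
print(also_bought(counts, "book"))
```

Raw co-occurrence counts favor popular items; real recommenders usually normalize them with a similarity measure.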
Example: Collab Filtering Amazon.com
Classification/Categorization Many, many types Spam Filtering Named Entity Recognition Phrase Identification Sentiment Analysis Classification into a Taxonomy
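As an illustration of the spam-filtering case, the following is a small multinomial naive Bayes sketch in plain Python with add-one smoothing. It is not Mahout's classifier; the training messages, labels, and function names are assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def classify(model, text):
    """Pick the class with the highest log-probability under naive Bayes."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out a class.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labeled training data.
training = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train(training)
print(classify(model, "win a cash prize"))
print(classify(model, "monday team meeting"))
```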
Example: NER NER? Excerpt from Yahoo News
Example: Categorization
Info. Retrieval Learning Ranking Functions Learning Spelling Corrections User Click Analysis and Tracking
Other Image Analysis Robotics Games Higher level natural language processing Many, many others
What is Apache Mahout? A Mahout is an elephant trainer/driver/keeper, hence… [Hadoop logo] (and other distributed techniques) + Machine Learning = [Mahout logo]
What? Hadoop brings: Map/Reduce API, HDFS; in other words, scalability and fault-tolerance. Mahout brings: library of machine learning algorithms, examples.
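The Map/Reduce programming model can be sketched in a single process to show the shape of the computation. This is plain Python, not the Hadoop API: the `map_phase`, `shuffle`, and `reduce_phase` helpers are hypothetical stand-ins for work that Hadoop distributes across machines.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer once per key to that key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    for word in line.lower().split():
        yield word, 1

def sum_reducer(key, values):
    # Total the counts emitted for one word.
    return sum(values)

lines = ["mahout rides hadoop", "hadoop stores data", "mahout learns from data"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
print(counts)
```

Because each mapper call sees one record and each reducer call sees one key, both phases can be split across many machines, which is where the scalability comes from.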
Why Mahout? Many Open Source ML libraries either: Lack Community Lack Documentation and Examples Lack Scalability Lack the Apache License ;-) Or are research-oriented
Why Mahout? Intelligent Apps are the Present and Future Thus, Mahout’s Goal is: Scalable Machine Learning with Apache License
Current Status What’s in it: Simple Matrix/Vector library. Taste Collaborative Filtering. Clustering: Canopy, K-Means, Fuzzy K-Means, Mean-shift, Dirichlet. Classifiers: Naïve Bayes, Complementary NB. Evolutionary: integration with Watchmaker for fitness function.
How? Examples Taste Clustering Classification Evolutionary
Taste: Movie Recommendations Given ratings by users of movies, recommend other movies http://lucene.apache.org/mahout/taste.html#demo
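A user-based version of this task can be sketched as follows: find users whose ratings resemble yours, then predict your rating for unseen movies as a similarity-weighted average of theirs. This is plain Python rather than Taste's API; the ratings, user IDs, and the choice of cosine similarity are assumptions for illustration.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two users' rating dicts (missing = 0)."""
    dot = sum(a[m] * b[m] for m in set(a) & set(b))
    norm_a = sqrt(sum(r * r for r in a.values()))
    norm_b = sqrt(sum(r * r for r in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(ratings, user, n=2):
    """Rank movies the user has not rated by predicted rating, best first."""
    mine = ratings[user]
    totals, weights = {}, {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(mine, theirs)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for movie, rating in theirs.items():
            if movie not in mine:
                totals[movie] = totals.get(movie, 0.0) + sim * rating
                weights[movie] = weights.get(movie, 0.0) + sim
    # Predicted rating is the similarity-weighted average of neighbors' ratings.
    predicted = {m: totals[m] / weights[m] for m in totals}
    return sorted(predicted, key=lambda m: (-predicted[m], m))[:n]

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "u1": {"Alien": 5, "Brazil": 4},
    "u2": {"Alien": 5, "Brazil": 5, "Clue": 5, "Dune": 1},
    "u3": {"Alien": 1, "Clue": 2, "Dune": 5},
}
print(recommend(ratings, "u1"))
```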
Taste Demo: start the webapp with mvn jetty:run-war, then open http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=12&debug=true or http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=43&debug=true
Clustering: Synthetic Control Data http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series Each clustering impl. has an example Job for running in <MAHOUT_HOME>/examples o.a.mahout.clustering.syntheticcontrol.* Outputs clusters… See output.txt, synthetic_control data
Classification: NB and CNB Examples 20 Newsgroups http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups Wikipedia http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample
Evolutionary Traveling Salesman http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman Class Discovery http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery
What’s Next? More Examples Winnow/Perceptron (MAHOUT-85) Text Clustering Association Rules (MAHOUT-108) Logistic Regression Solr Integration (SOLR-769) GSOC
When, Who When? Now! Mahout is growing. Who? You! We want programmers who: are comfortable with math; like to work on hard problems. We want others to: kick the tires.
Where? http://lucene.apache.org/mahout http://cwiki.apache.org/MAHOUT Hadoop - http://hadoop.apache.org mahout-{user|dev}@lucene.apache.org http://www.lucidimagination.com/search/p:mahout
Resources “Programming Collective Intelligence” by Segaran. “Data Mining - Practical Machine Learning Tools and Techniques” by Witten and Frank. “Taming Text” by Ingersoll and Morton (open source tools for doing machine learning).