Introducing Apache Mahout Scalable Machine Learning for All! Grant Ingersoll
Agenda: What is Machine Learning? (Definitions, Types, Applications). Mahout: Why? How? Who?
What is Machine Learning? HAL 9000 (http://en.wikipedia.org/wiki/Image:Hal-9000.jpg)? Or the Terminator (http://upload.wikimedia.org/wikipedia/en/4/49/Terminator.jpg)? NOT!
How about? Google News
Or? Amazon.com
Definition: “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” (Intro. to Machine Learning by E. Alpaydin). Subset of Artificial Intelligence. Many other fields: comp. sci., biology, math, psychology, etc.
Characterizations: lots of data; identifiable features in that data; too big/costly for people to handle; people can still help.
Types: Supervised: using labeled training data, create a function that predicts the output of unseen inputs. Unsupervised: using unlabeled data, create a function that predicts output. Semi-supervised: uses both labeled and unlabeled data.
Classification/Categorization: spam filtering; named entity recognition; phrase identification; sentiment analysis; classification into a taxonomy.
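As an illustration of the spam-filtering case, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is not Mahout's implementation; the function names `train` and `classify` and the four training messages are made up for this example.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count documents per label and words per label from (label, text) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior: fraction of training documents carrying this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training data.
model = train([
    ("spam", "win cash now"),
    ("spam", "cheap cash offer"),
    ("ham", "meeting agenda for monday"),
    ("ham", "lunch on monday"),
])
print(classify(model, "cash offer now"))
print(classify(model, "agenda for lunch"))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once documents are longer than a few words.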
Clustering: find natural groupings: documents; search results; people; genetic traits in groups; many, many more uses.
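To make "find natural groupings" concrete, the following is a small single-machine sketch of k-means, one of the clustering algorithms named later in the deck. It is not Mahout's Map/Reduce implementation; the helper names, the random seeding, and the sample points are assumptions for illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, iterations=20, seed=0):
    """Alternate assignment and centroid update for a fixed number of rounds."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids, assign(points, centroids)

# Hypothetical 2-D data with two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
for members in clusters:
    print(sorted(members))
```

The assignment step is independent per point and the update step is a per-cluster average, which is why the algorithm maps naturally onto a map phase and a reduce phase.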
Collaborative Filtering: recommend people and products. User-User: user likes X, you might too. Item-Item: people who bought X also bought Y.
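The item-item idea ("people who bought X also bought Y") can be sketched with plain co-occurrence counts. This is only an illustration, not how Taste computes item similarity; `co_occurrence`, `also_bought`, and the purchase data are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

def co_occurrence(baskets):
    """Count how often each ordered pair of items appears in the same basket."""
    counts = defaultdict(lambda: defaultdict(int))
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            counts[a][b] += 1
    return counts

def also_bought(counts, item, top_n=3):
    """Items most often bought together with `item`, most frequent first."""
    related = counts.get(item, {})
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]

# Hypothetical purchase histories, one set of items per user.
baskets = [
    {"book", "lamp", "pen"},
    {"book", "pen"},
    {"book", "lamp"},
    {"lamp", "desk"},
]
counts = co_occurrence(baskets)
print(also_bought(counts, "book"))  # ['lamp', 'pen']
```

Raw counts favor popular items; a real recommender would normally normalize them with a similarity measure, but the counting step is the part that has to scale.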
Info. Retrieval: learning ranking functions; learning spelling corrections; user click analysis and tracking.
Other: image analysis; robotics; games; higher-level natural language processing; many, many others.
What is Apache Mahout? A Mahout is an elephant trainer/driver/keeper, hence… [image] + Machine Learning = [image] (and other distributed techniques)
What? Hadoop brings: Map/Reduce API; HDFS. In other words, scalability and fault-tolerance. Thus, Mahout’s goal is: scalable machine learning with the Apache License.
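For readers new to the Map/Reduce model, here is an in-process sketch of its three stages (map, shuffle, reduce) applied to word counting. It does not use the Hadoop API; the function names are made up, and a real job would run the map and reduce calls on different machines over data in HDFS.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

Because each map call sees one record and each reduce call sees one key, both phases can be spread across machines and rerun on failure, which is where the scalability and fault-tolerance come from.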
Why Mahout? Many open source ML libraries either lack community, lack documentation and examples, lack scalability, lack the Apache License ;-), or are research-oriented. Personal: learn more ML. Intelligent apps are the present and future. See the Hadoop talks tomorrow and Friday! Goal: overcome gaps the Apache Way!
Current Status: close to initial release; focused on examples, docs, bug fixes. What’s in it: simple Matrix/Vector library; Taste collaborative filtering; clustering (Canopy, K-Means, Fuzzy K-Means, Mean-shift); classifiers (Naïve Bayes, Complementary NB); evolutionary (integration with Watchmaker for fitness function).
How? Examples: Taste; clustering; classification; evolutionary.
Taste: Movie Recommendations Given ratings by users of movies, recommend other movies http://lucene.apache.org/mahout/taste.html#demo
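The task on this slide (given users' movie ratings, recommend other movies) can be sketched as user-based collaborative filtering with cosine similarity. This is an assumption-laden illustration rather than Taste's API or algorithm; the `cosine` and `recommend` helpers and the ratings are invented for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two {movie: rating} dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(ratings, user, top_n=2):
    """Rank unseen movies by the similarity-weighted average of other users' ratings."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue  # skip users with nothing in common
        for movie, rating in other_ratings.items():
            if movie in ratings[user]:
                continue  # only recommend movies the user has not rated
            scores[movie] = scores.get(movie, 0.0) + sim * rating
            weights[movie] = weights.get(movie, 0.0) + sim
    predicted = {m: scores[m] / weights[m] for m in scores}
    return sorted(predicted, key=lambda m: (-predicted[m], m))[:top_n]

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"Alien": 5, "Brazil": 4},
    "bob": {"Alien": 5, "Brazil": 4, "Casablanca": 5, "Dune": 2},
    "carol": {"Alien": 1, "Brazil": 2, "Dune": 5},
}
print(recommend(ratings, "alice"))  # ['Casablanca', 'Dune']
```

Bob rates the shared movies exactly as Alice does, so his opinion of the unseen movies carries more weight than Carol's.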
Clustering: Synthetic Control Data. http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series Each clustering impl. has an example Job for running in <MAHOUT_HOME>/examples: o.a.mahout.clustering.syntheticcontrol.* Outputs clusters… See output.txt, synthetic_control data.
Classification: NB and CNB Examples. 20 Newsgroups: http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups Wikipedia: http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample
Evolutionary. Traveling Salesman: http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman Class Discovery: http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery
What’s Next? Release 0.1! Shared Amazon images (others?); more examples; Winnow/Perceptron (MAHOUT-85); HBase and HAMA support; normalize I/O format for data; Solr integration (SOLR-769); other algorithms: SVM, linear regression, etc.
When, Where, Who. When? Now! Mahout is growing. Who? You! We want Java programmers who: are comfortable with math; like to work on large, hard problems. Where? http://lucene.apache.org/mahout http://cwiki.apache.org/MAHOUT mahout-{user|dev}@lucene.apache.org
Resources: “Programming Collective Intelligence” by Toby Segaran; “Data Mining: Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank; Hadoop: http://hadoop.apache.org; http://mloss.org/software/