Introducing Apache Mahout


Presentation transcript:

Introducing Apache Mahout Scalable Machine Learning for All! Grant Ingersoll Lucid Imagination

Overview What is Machine Learning? Mahout

Definition: “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” (Intro. to Machine Learning by E. Alpaydin). Subset of Artificial Intelligence. Many other fields: comp. sci., biology, math, psychology, etc.

Types. Supervised: Using labeled training data, create a function that predicts the output of unseen inputs. Unsupervised: Using unlabeled data, create a function that predicts output. Semi-Supervised: Uses labeled and unlabeled data.

Characterizations: Lots of Data; Identifiable Features in that Data; Too big/costly for people to handle; People still can help

Clustering: Unsupervised; Find Natural Groupings (Documents, Search Results, People, Genetic traits in groups); Many, many more uses
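To make "find natural groupings" concrete, here is a minimal sketch of k-means, one of the clustering algorithms the deck later lists for Mahout. It is plain single-machine Python, not Mahout's API; the names `assign` and `kmeans` are made up for illustration, and seeding the centroids with the first k points is a simplifying assumption.

```python
def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        # Squared Euclidean distance from p to every centroid.
        distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        clusters[distances.index(min(distances))].append(p)
    return clusters


def kmeans(points, k, iterations=20):
    """Lloyd's k-means on a list of equal-length numeric tuples."""
    # Assumption: the first k points serve as initial centroids.
    centroids = list(points[:k])
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        updated = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if updated == centroids:  # no centroid moved: converged
            break
        centroids = updated
    return centroids, assign(points, centroids)
```

Calling `kmeans(points, 2)` on a handful of 2-D tuples returns the final centroids and the points grouped per centroid. The same assign-then-update loop is what a distributed version would split across a cluster.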

Example: Clustering Google News

Collaborative Filtering: Unsupervised; Recommend people and products. User-User: User likes X, you might too. Item-Item: People who bought X also bought Y.
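The item-item idea ("people who bought X also bought Y") can be sketched with simple co-occurrence counts. This is an illustrative toy in plain Python, not the Taste API; `cooccurrence`, `recommend`, and the basket data layout are assumptions made for the example.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence(baskets):
    """Count, for each ordered item pair, how many users have both items."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user, baskets, top_n=3):
    """Rank items the user lacks by how often they co-occur with items the user has."""
    counts = cooccurrence(baskets)
    owned = set(baskets[user])
    scores = defaultdict(int)
    for item in owned:
        for other, n in counts[item].items():
            if other not in owned:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]
```

With baskets such as `{"alice": ["X", "Y"], "bob": ["X", "Y", "Z"], "carol": ["X"]}`, `recommend("carol", baskets)` ranks Y ahead of Z because more X buyers also bought Y. Production recommenders usually replace raw counts with a normalized similarity measure.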

Example: Collab Filtering Amazon.com

Classification/Categorization: Many, many types: Spam Filtering; Named Entity Recognition; Phrase Identification; Sentiment Analysis; Classification into a Taxonomy

Example: NER NER? Excerpt from Yahoo News

Example: Categorization

Info. Retrieval: Learning Ranking Functions; Learning Spelling Corrections; User Click Analysis and Tracking

Other: Image Analysis; Robotics; Games; Higher level natural language processing; Many, many others

What is Apache Mahout? A Mahout is an elephant trainer/driver/keeper, hence… Hadoop + Machine Learning = Mahout (and other distributed techniques)

What? Hadoop brings: Map/Reduce API; HDFS; in other words, scalability and fault-tolerance. Mahout brings: Library of machine learning algorithms; Examples.
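For readers new to Map/Reduce, the programming model can be imitated on one machine. The sketch below is plain Python and does not use Hadoop's API; `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical names chosen for illustration. On a real cluster the framework runs the map and reduce calls on many nodes and performs the shuffle itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each map call sees only one document and each reduce call sees only one key, the work can be spread across machines, which is where the scalability claimed on the slide comes from.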

Why Mahout? Many Open Source ML libraries either: Lack Community, Lack Documentation and Examples, Lack Scalability, Lack the Apache License ;-), or are research-oriented

Why Mahout? Intelligent Apps are the Present and Future. Thus, Mahout’s Goal is: Scalable Machine Learning with Apache License

Current Status. What’s in it: Simple Matrix/Vector library; Taste Collaborative Filtering; Clustering: Canopy/K-Means/Fuzzy K-Means/Mean-shift/Dirichlet; Classifiers: Naïve Bayes, Complementary NB; Evolutionary: Integration with Watchmaker for fitness function

How? Examples: Taste; Clustering; Classification; Evolutionary

Taste: Movie Recommendations. Given ratings by users of movies, recommend other movies. http://lucene.apache.org/mahout/taste.html#demo

Taste Demo http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=12&debug=true http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=43&debug=true mvn jetty:run-war

Clustering: Synthetic Control Data. http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series Each clustering impl. has an example Job for running in <MAHOUT_HOME>/examples: o.a.mahout.clustering.syntheticcontrol.* Outputs clusters… See output.txt, synthetic_control data

Classification: NB and CNB Examples. 20 Newsgroups: http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups Wikipedia: http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample
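As background for the NB examples, here is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one smoothing. It is a single-machine toy in plain Python, not Mahout's implementation; `train`, `classify`, and the whitespace tokenizer are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log posterior for the given text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothed log likelihood of the word given the class.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Training on a few labeled strings and calling `classify("some new text", model)` returns the most probable label. The counting in `train` consists of independent sums, which is why this kind of model fits a Map/Reduce setting well.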

Evolutionary: Traveling Salesman http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman Class Discovery http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery

What’s Next? More Examples; Winnow/Perceptron (MAHOUT-85); Text Clustering; Association Rules (MAHOUT-108); Logistic Regression; Solr Integration (SOLR-769); GSOC

When, Who. When? Now! Mahout is growing. Who? You! We want programmers who: Are comfortable with math; Like to work on hard problems. We want others to: Kick the tires.

Where? http://lucene.apache.org/mahout http://cwiki.apache.org/MAHOUT Hadoop - http://hadoop.apache.org mahout-{user|dev}@lucene.apache.org http://www.lucidimagination.com/search/p:mahout

Resources: “Programming Collective Intelligence” by Segaran; “Data Mining - Practical Machine Learning Tools and Techniques” by Witten and Frank; “Taming Text” by Ingersoll and Morton (open source tools for doing machine learning)