Introducing Apache Mahout

Presentation transcript:

Introducing Apache Mahout Scalable Machine Learning for All! Grant Ingersoll

Agenda: What is Machine Learning? (Definitions, Types, Applications). Mahout (Why? How? Who?)

What is Machine Learning? NOT! http://en.wikipedia.org/wiki/Image:Hal-9000.jpg Or? http://upload.wikimedia.org/wikipedia/en/4/49/Terminator.jpg

How about? Google News

Or? Amazon.com

Definition: “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” (Introduction to Machine Learning by E. Alpaydin). Subset of Artificial Intelligence. Many other fields: computer science, biology, math, psychology, etc.

Characterizations: lots of data; identifiable features in that data; too big/costly for people to handle; people can still help.

Types. Supervised: using labeled training data, create a function that predicts the output of unseen inputs. Unsupervised: using unlabeled data, create a function that predicts output. Semi-Supervised: uses labeled and unlabeled data.

Classification/Categorization: spam filtering; named entity recognition; phrase identification; sentiment analysis; classification into a taxonomy.
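To make the spam filtering case concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is plain Python written for illustration, not Mahout's implementation; the function names `train` and `classify` and the four training messages are assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Start from the log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            numerator = word_counts[label][word] + 1
            score += math.log(numerator / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set, invented for this sketch.
training = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(training)
print(classify(model, "buy cheap money"))
print(classify(model, "monday meeting"))
```

Working in log space keeps the product of many small word probabilities from underflowing, which matters once documents are longer than a few words.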

Clustering: find natural groupings in documents, search results, people, genetic traits in groups. Many, many more uses.
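As an illustration of finding natural groupings, the following is a small single-machine sketch of k-means, one of the clustering algorithms this deck lists under Current Status. It is not Mahout code: the `kmeans` function, the choice to seed centroids with the first k points (made only so the run is deterministic), and the sample points are all assumptions for this example.

```python
from math import dist

def kmeans(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    # Assumption: seed with the first k points so the result is reproducible.
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

No labels are supplied anywhere, which is what makes this an unsupervised technique: the groups emerge from distances between the points alone.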

Collaborative Filtering: recommend people and products. User-User: user likes X, you might too. Item-Item: people who bought X also bought Y.
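The item-item idea ("people who bought X also bought Y") can be sketched with simple co-occurrence counts. This is an illustrative Python example, not the Taste API; `cooccurrence`, `also_bought`, and the purchase data are hypothetical names and values introduced here.

```python
from collections import defaultdict
from itertools import permutations

def cooccurrence(baskets):
    """Count how often each ordered pair of items appears in the same basket."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts

def also_bought(counts, item, top_n=2):
    """Items most often bought together with `item`; ties break alphabetically."""
    ranked = sorted(counts[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]

# Hypothetical purchase histories, one basket per user.
purchases = {
    "alice": ["book", "lamp", "pen"],
    "bob": ["book", "pen"],
    "carol": ["book", "lamp"],
    "dave": ["pen", "ink"],
}
counts = cooccurrence(purchases)
print(also_bought(counts, "book"))
print(also_bought(counts, "pen"))
```

Raw counts favor popular items, so practical recommenders usually normalize them with a similarity measure; the sketch leaves that out to keep the core idea visible.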

Info. Retrieval: learning ranking functions; learning spelling corrections; user click analysis and tracking.

Other: image analysis; robotics; games; higher-level natural language processing; many, many others.

What is Apache Mahout? A Mahout is an elephant trainer/driver/keeper, hence… Hadoop + Machine Learning (and other distributed techniques) = Mahout

What? Hadoop brings: Map/Reduce API; HDFS. In other words, scalability and fault-tolerance. Thus, Mahout’s goal is: scalable machine learning with the Apache License.
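For readers new to the Map/Reduce model, the sketch below imitates its three stages (map, shuffle, reduce) in a single Python process using word counting. It does not use the Hadoop API and provides none of Hadoop's distribution or fault tolerance; the function names and sample documents are assumptions chosen for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input documents.
docs = ["Mahout rides Hadoop", "Hadoop stores data", "mahout learns from data"]
print(word_count(docs))
```

Because each map call sees only one document and each reduce call sees only one key, both stages can in principle be spread across many machines, which is the source of the scalability the slide refers to.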

Why Mahout? Many open source ML libraries either: lack community; lack documentation and examples; lack scalability; lack the Apache License ;-); or are research-oriented. Personal: learn more ML. Intelligent apps are the present and future. See the Hadoop talks tomorrow and Friday! Goal: overcome gaps the Apache Way!

Current Status: close to initial release; focused on examples, docs, bug fixes. What’s in it: simple Matrix/Vector library; Taste collaborative filtering; clustering (Canopy, K-Means, Fuzzy K-Means, Mean-shift); classifiers (Naïve Bayes, Complementary NB); evolutionary (integration with Watchmaker for fitness function).

How? Examples: Taste; Clustering; Classification; Evolutionary.

Taste: Movie Recommendations. Given ratings by users of movies, recommend other movies. http://lucene.apache.org/mahout/taste.html#demo

Clustering: Synthetic Control Data. http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series Each clustering impl. has an example Job for running in <MAHOUT_HOME>/examples: o.a.mahout.clustering.syntheticcontrol.* Outputs clusters… See output.txt, synthetic_control data.

Classification: NB and CNB Examples. 20 Newsgroups: http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups Wikipedia: http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample

Evolutionary. Traveling Salesman: http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman Class Discovery: http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery

What’s Next? Release 0.1! Shared Amazon Images (others?); more examples; Winnow/Perceptron (MAHOUT-85); HBase and HAMA support; normalize I/O format for data; Solr integration (SOLR-769); other algorithms: SVM, Linear Regression, etc.

When, Where, Who. When? Now! Mahout is growing. Who? You! We want Java programmers who: are comfortable with math; like to work on large, hard problems. Where? http://lucene.apache.org/mahout http://cwiki.apache.org/MAHOUT mahout-{user|dev}@lucene.apache.org

Resources: “Programming Collective Intelligence” by Toby Segaran; “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank; Hadoop - http://hadoop.apache.org; http://mloss.org/software/