Introduction to Scalable Machine Learning with Apache Mahout
Grant Ingersoll
February 15, 2010

Lucid Imagination, Inc.

Introduction
- You
  - Machine learning experience?
  - Business Intelligence?
  - Natural Lang. Processing?
  - Apache Hadoop?
- Me
  - Co-founder, Apache Mahout
  - Apache Lucene/Solr committer
  - Co-founder, Lucid Imagination

Topics
- What is Machine Learning?
- ML Use Cases
- What is Mahout?
- What can I do with it right now?
- Where's Mahout headed?

What is Machine Learning?
- Amazon.com
- Google News

Really it's…
- "Machine Learning is programming computers to optimize a performance criterion using example data or past experience" (Introduction to Machine Learning by E. Alpaydin)
- Subset of Artificial Intelligence
- Lots of related fields:
  - Information Retrieval
  - Stats
  - Biology
  - Linear algebra
  - Many more

Common Use Cases
- Recommend friends/dates/products
- Classify content into predefined groups
- Find similar content based on object properties
- Find associations/patterns in actions/behaviors
- Identify key topics in large collections of text
- Detect anomalies in machine output
- Rank search results
- Others?

Useful Terminology
- Vectors/Matrices
- Weights
- Sparse
- Dense
- Norms
- Features
- Feature reduction
- Occurrences and Cooccurrences
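To make the sparse/dense and norm terms concrete, here is a minimal Python sketch. The function names are hypothetical and chosen for illustration; this is not Mahout's vector API.

```python
def to_sparse(dense):
    """Keep only the non-zero entries of a dense vector as {index: weight}."""
    return {i: w for i, w in enumerate(dense) if w != 0}


def norm(sparse, p=2):
    """p-norm of a sparse vector: p=1 is the Manhattan length, p=2 the Euclidean length."""
    return sum(abs(w) ** p for w in sparse.values()) ** (1.0 / p)


def normalize(sparse, p=2):
    """Scale a sparse vector to unit p-norm; an all-zero vector is returned unchanged."""
    length = norm(sparse, p)
    if length == 0:
        return dict(sparse)
    return {i: w / length for i, w in sparse.items()}


# A dense vector stores every position; the sparse form stores only the non-zeros.
dense = [0, 3, 0, 0, 4]
sparse = to_sparse(dense)
```

For text data most weights are zero, so the sparse form is usually far smaller than the dense one.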

Getting Started with ML
- Get your data
- Decide on your features per your algorithm
- Prep the data
  - Different approaches for different algorithms
- Run your algorithm(s)
  - Lather, rinse, repeat
- Validate your results
  - Smell test, A/B testing, more formal methods

Apache Mahout
- An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
- Why Mahout? Many Open Source ML libraries either:
  - Lack Community
  - Lack Documentation and Examples
  - Lack Scalability
  - Lack the Apache License ;-)
  - Or are research-oriented

Focus: Machine Learning
[Component diagram; labels:]
- Applications
- Examples
- Recommenders
- Clustering
- Classification
- Freq. Pattern Mining
- Genetic
- Utilities: Lucene/Vectorizer
- Math: Vectors/Matrices/SVD
- Collections (primitives)
- Apache Hadoop

Focus: Scalable
- Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
  - Some algorithms won't scale to massive machine clusters
  - Others fit logically on a Map Reduce framework like Apache Hadoop
  - Still others will need other distributed programming models
  - Be pragmatic
- Most Mahout implementations are Map Reduce enabled
- Work in Progress
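As a rough illustration of what "fits logically on a Map Reduce framework" means, the sketch below runs a map phase, a shuffle, and a reduce phase in a single Python process. It counts term occurrences and uses no Hadoop API; all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (term, 1) pair for every term in one document."""
    for term in text.lower().split():
        yield term, 1


def reduce_phase(term, counts):
    """Reduce: combine all values emitted for one key."""
    return term, sum(counts)


def run_job(docs):
    """Simulate the shuffle step by grouping mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_phase(doc_id, text):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in groups.items())
```

An algorithm suits this model when each map call needs only its own record and each reduce call needs only the values for its own key, so both phases can be spread across machines.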

Prepare Data from Raw Content
- Data Sources:
  - Lucene integration
    bin/mahout lucenevector …
  - Document Vectorizer
    bin/mahout seqdirectory …
    bin/mahout seq2sparse …
  - Programmatically
    See the Utils module in Mahout
  - Database
  - File system

Recommendations
- Extensive framework for collaborative filtering
- Recommenders
  - User based
  - Item based
- Online and Offline support
  - Offline can utilize Hadoop
- Many different Similarity measures
  - Cosine, LLR, Tanimoto, Pearson, others
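Three of the similarity measures named above can be sketched in a few lines of Python. Each user is represented as a hypothetical {item: rating} dict; these are textbook definitions written for illustration, not Mahout's implementations, and LLR is left out for brevity.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two rating vectors (missing items count as 0)."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient: shared items over all items; ratings are ignored."""
    items_a, items_b = set(a), set(b)
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 0.0


def pearson(a, b):
    """Pearson correlation over the items both users rated."""
    common = sorted(set(a) & set(b))
    n = len(common)
    if n < 2:
        return 0.0
    xs = [a[k] for k in common]
    ys = [b[k] for k in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y) if dev_x and dev_y else 0.0


alice = {"A": 5, "B": 3, "C": 4}
bob = {"A": 4, "B": 2, "D": 5}
```

Tanimoto suits data without ratings (clicks, purchases), while cosine and Pearson use the rating values themselves.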

Clustering
- Document level
  - Group documents based on a notion of similarity
  - K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
  - Distance Measures
    - Manhattan, Euclidean, other
- Topic Modeling
  - Cluster words across documents to identify topics
  - Latent Dirichlet Allocation
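For readers new to K-Means, the following is a minimal single-machine sketch of the assign-then-update loop with a pluggable distance measure. The starting centroids are passed in explicitly so the run is deterministic; the function names and parameters are hypothetical and are not Mahout's driver options.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(p, q))


def kmeans(points, centroids, distance=euclidean, max_iter=20):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    # clusters holds the assignment computed in the last iteration that ran.
    return centroids, clusters
```

With the points (1, 1), (1, 2), (8, 8), (9, 8) and starting centroids (0, 0) and (10, 10), the loop settles after two passes with one centroid per visible group.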

Categorization
- Place new items into predefined categories:
  - Sports, politics, entertainment
- Mahout has several implementations
  - Naïve Bayes
  - Complementary Naïve Bayes
  - Decision Forests
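A small multinomial Naïve Bayes classifier shows the idea of placing new items into predefined categories. This sketch assumes whitespace tokenization and add-one (Laplace) smoothing, both my choices for illustration; it is not Mahout's implementation, and the training sentences are made up.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count documents per class and term occurrences per class."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        class_docs[label] += 1
        for term in text.lower().split():
            term_counts[label][term] += 1
            vocab.add(term)
    return class_docs, term_counts, vocab


def classify(model, text):
    """Return the class with the highest log-probability for the given text."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for term in text.lower().split():
            if term in vocab:  # terms never seen in training are skipped
                # Add-one smoothing keeps unseen (class, term) pairs from zeroing the score.
                score += math.log((term_counts[label][term] + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("sports", "the team won the game"),
    ("sports", "a late goal won the match"),
    ("politics", "the senate passed the bill"),
    ("politics", "voters elected a new senate"),
]
model = train(training)
```

Training is a single counting pass over the data, which is part of why this family of classifiers parallelizes well.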

Freq. Pattern Mining
- Identify frequently co-occurrent items
- Useful for:
  - Query Recommendations
    - Apple -> iPhone, orange, OS X
  - Related product placement
    - Beer and Diapers
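The co-occurrence idea can be shown with a brute-force pair count over a handful of made-up shopping baskets. This is deliberately the simplest possible approach, not a scalable pattern-mining algorithm, and `frequent_pairs` and `min_support` are names chosen for this sketch.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Count how often each pair of items appears together; keep pairs seen at least min_support times."""
    counts = Counter()
    for items in transactions:
        # Sort and de-duplicate so ("beer", "diapers") and ("diapers", "beer") are one key.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "diapers"],
    ["beer", "chips"],
]
```

Enumerating every pair grows quadratically with basket size, which is why large datasets call for smarter algorithms and distributed execution.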

Evolutionary
- Map-Reduce ready fitness functions for genetic programming
- Integration with Watchmaker
- Problems solved:
  - Traveling salesman
  - Class discovery
  - Many others

How To: Recommenders
- Data:
  - Users (abstract)
  - Items (abstract)
  - Ratings (optional)
- Load the data model
- Ask for Recommendations:
  - User-User
  - Item-Item
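The two steps, loading a data model and asking for recommendations, might look like this in a toy item-item recommender. Everything here is an assumption made for illustration: the (user, item, rating) row format, the function names, and the choice of Tanimoto similarity over the sets of users who rated each item. None of it is Mahout's recommender API.

```python
from collections import defaultdict


def load_data_model(rows):
    """Turn (user, item, rating) rows into {user: {item: rating}}."""
    model = defaultdict(dict)
    for user, item, rating in rows:
        model[user][item] = rating
    return dict(model)


def item_similarity(model, item_a, item_b):
    """Tanimoto similarity between the sets of users who rated each item."""
    users_a = {u for u, prefs in model.items() if item_a in prefs}
    users_b = {u for u, prefs in model.items() if item_b in prefs}
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0


def recommend(model, user, how_many=3):
    """Score each unseen item by a similarity-weighted average of the user's own ratings."""
    seen = model.get(user, {})
    all_items = {item for prefs in model.values() for item in prefs}
    scores = {}
    for candidate in all_items - set(seen):
        sims = {item: item_similarity(model, candidate, item) for item in seen}
        total = sum(sims.values())
        if total > 0:
            scores[candidate] = sum(sims[item] * seen[item] for item in seen) / total
    # Highest estimated rating first; ties are broken by item name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:how_many]


rows = [
    ("u1", "A", 5), ("u1", "B", 3),
    ("u2", "A", 4), ("u2", "B", 2), ("u2", "C", 5),
    ("u3", "B", 4), ("u3", "C", 1),
]
data_model = load_data_model(rows)
```

A user-user variant would instead find users similar to the target user and average their ratings for the unseen items.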

Ugly Demo I
- GroupLens Data: ug=true
- In other words: the reason why I work on servers, not UIs!

How to: Command Line
- Most algorithms have a Driver program
  - Shell script in $MAHOUT_HOME/bin helps with most tasks
- Prepare the Data
  - Different algorithms require different setup
- Run the algorithm
  - Single Node
  - Hadoop
- Print out the results
  - Several helper classes: LDAPrintTopics, ClusterDumper, etc.

Ugly Demo II - Prep
- Data Set: Reuters
- Convert to Text via boot-camp-preclass-training/
- Convert to Sequence File:
    bin/mahout seqdirectory --input --output --charset UTF-8
- Convert to Sparse Vector:
    bin/mahout seq2sparse --input /content/reuters/seqfiles/ --norm 2 --weight TF --output /content/reuters/seqfiles-TF/ --minDF 5 --maxDFPercent 90
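To show roughly what the sparse-vector step produces, here is a small Python sketch of term-frequency vectorization with document-frequency pruning and normalization. The parameter names mirror the seq2sparse flags above, but their meaning here is my assumption: `min_df` is a minimum number of documents containing a term, `max_df_percent` is a maximum percentage of documents containing it, and `norm` is the p of the p-norm used to scale each vector. The real tool may differ in detail, and the sample documents are made up.

```python
from collections import Counter


def tf_vectors(docs, min_df=1, max_df_percent=100, norm=2):
    """Build a term dictionary and one normalized sparse TF vector per document."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    n_docs = len(docs)
    kept = sorted(
        term for term, n in df.items()
        if n >= min_df and 100.0 * n / n_docs <= max_df_percent
    )
    dictionary = {term: index for index, term in enumerate(kept)}
    vectors = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(dictionary[t] for t in tokens if t in dictionary)
        length = sum(abs(w) ** norm for w in tf.values()) ** (1.0 / norm)
        vectors[doc_id] = {i: w / length for i, w in tf.items()} if length else {}
    return dictionary, vectors


docs = {
    "d1": "oil prices rise",
    "d2": "oil exports fall",
    "d3": "oil prices fall fall",
}
dictionary, vectors = tf_vectors(docs, min_df=2, max_df_percent=90)
```

In this example "oil" appears in every document and is pruned by the 90 percent ceiling, while "rise" and "exports" fall below the minimum document count, which is a simple form of feature reduction.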

Ugly Demo II: Topic Modeling
- Latent Dirichlet Allocation
    ./mahout lda --input /content/reuters/seqfiles-TF/vectors/ --output /content/reuters/seqfiles-TF/lda-output --numWords --numTopics 10
    ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics --input /content/reuters/seqfiles-TF/lda-output/state dict /content/reuters/seqfiles-TF/dictionary.file-0 --words 10 --output /content/reuters/seqfiles-TF/lda-output/topics --dictionaryType sequencefile
- Good feature reduction (stopword removal) required

Ugly Demo III: Clustering
- K-Means
  - Same Prep as UD II, except use TFIDF weight
    ./mahout kmeans --input /content/reuters/seqfiles-TFIDF/vectors/part k 15 --output /content/reuters/seqfiles-TFIDF/output-kmeans --clusters /content/reuters/seqfiles-TFIDF/output-kmeans/clusters
- Print out the clusters:
    ./mahout clusterdump --seqFileDir /content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ --pointsDir /content/reuters/seqfiles-TFIDF/output-kmeans/points/ --dictionary /content/reuters/seqfiles-TFIDF/dictionary.file-0 --dictionaryType sequencefile --substring 20

Ugly Demo IV: Frequent Pattern Mining
- Data:
    fpg -i /content/freqitemset/accidents.dat -o patterns -k 50 -method mapreduce -g 10 -regex [\ ]
    ./mahout seqdump --seqFile patterns/fpgrowth/part-r

What's Next?
- 0.3 release very soon
- Parallel Singular Value Decomposition (Lanczos)
- Stabilize APIs for 1.0 release
- Benchmarking
- Google Summer of Code?
- More Algorithms

Resources
- Slides and Full Details of Demos at: slides-and-demo-examples/
- More Examples in Mahout SVN in the examples directory

Resources
- trunk

Resources
- Mahout in Action by Owen and Anil
- Introducing Apache Mahout
- Programming Collective Intelligence by Toby Segaran
- Data Mining - Practical Machine Learning Tools and Techniques by Ian H. Witten and Eibe Frank
