Apache Mahout Feb 13, 2012 Shannon Quinn Cloud Computing CS 15-319.

Slides:



Advertisements
Similar presentations
Topic Identification in Forums Evaluation Strategy IA Seminar Discussion Ahmad Ammari School of Computing, University of Leeds.
Advertisements

Machine Learning on Spark
CS525: Special Topics in DBs Large-Scale Data Management
1 Image Classification MSc Image Processing Assignment March 2003.
Albert Gatt Corpora and Statistical Methods Lecture 13.
Distributed Approximate Spectral Clustering for Large- Scale Datasets FEI GAO, WAEL ABD-ALMAGEED, MOHAMED HEFEEDA PRESENTED BY : BITA KAZEMI ZAHRANI 1.
An Overview of Machine Learning
Recommender System with Hadoop and Spark
Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.
Introducing Apache Mahout Scalable Machine Learning for All! Grant Ingersoll Lucid Imagination.
Customizable Bayesian Collaborative Filtering Denver Dash Big Data Reading Group 11/19/2007.
Semi-Supervised Clustering Jieping Ye Department of Computer Science and Engineering Arizona State University
Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.
Introduction to machine learning
Algorithms for Data Analytics Chapter 3. Plans Introduction to Data-intensive computing (Lecture 1) Statistical Inference: Foundations of statistics (Chapter.
Health and CS Philip Chan. DNA, Genes, Proteins What is the relationship among DNA Genes Proteins ?
Iterative computation is a kernel function to many data mining and data analysis algorithms. Missing in current MapReduce frameworks is collective communication,
CS 5604 Spring 2015 Classification Xuewen Cui Rongrong Tao Ruide Zhang May 5th, 2015.
Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative.
B.Ramamurthy. Data Analytics (Data Science) EDA Data Intuition/ understand ing Big-data analytics StatsAlgs Discoveries / intelligence Statistical Inference.
Apache Mahout Industrial Strength Machine Learning Jeff Eastman.
CS525: Big Data Analytics Machine Learning on Hadoop Fall 2013 Elke A. Rundensteiner 1.
CPSC 502, Lecture 15Slide 1 Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 16 Nov, 3, 2011 Slide credit: C. Conati, S.
Active Learning An example From Xu et al., “Training SpamAssassin with Active Semi- Supervised Learning”
Apache Mahout. Mahout Introduction Machine Learning Clustering K-means Canopy Clustering Fuzzy K-Means Conclusion.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.
MC 2 : Map Concurrency Characterization for MapReduce on the Cloud Mohammad Hammoud and Majd Sakr 1.
Stochastic SVD on Hadoop Shannon Quinn (with thanks to Gunnar Martinsson and Nathan Halko of UC Boulder, and Joel Tropp of CalTech)
MACHINE LEARNING 8. Clustering. Motivation Based on E ALPAYDIN 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2  Classification problem:
Advanced Analytics on Hadoop Spring 2014 WPI, Mohamed Eltabakh 1.
Artificial Intelligence 8. Supervised and unsupervised learning Japan Advanced Institute of Science and Technology (JAIST) Yoshimasa Tsuruoka.
The Summary of My Work In Graduate Grade One Reporter: Yuanshuai Sun
Big data Usman Roshan CS 675. Big data Typically refers to datasets with very large number of instances (rows) as opposed to attributes (columns). Data.
Apache Mahout Qiaodi Zhuang Xijing Zhang.
Redpoll A machine learning library based on hadoop Jeremy CS Dept. Jinan University, Guangzhou.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2.
CS Statistical Machine learning Lecture 12 Yuan (Alan) Qi Purdue CS Oct
Biomedicine and Big Data Analyzing spatio-temporal patterns in biomedical data Normal Stiff Wavy.
Page 1 Cloud Study: Algorithm Team Mahout Introduction 박성찬 IDS Lab.
Guided By Ms. Shikha Pachouly Assistant Professor Computer Engineering Department 2/29/2016.
Given a set of data points as input Randomly assign each point to one of the k clusters Repeat until convergence – Calculate model of each of the k clusters.
Apache Mahout Industrial Strength Machine Learning Jeff Eastman.
Recommendation Systems ARGEDOR. Introduction Sample Data Tools Cases.
Machine Learning Lecture 4: Unsupervised Learning (clustering) 1.
Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA ( ) VAIBHAV JAISWAL( )
Accelerating K-Means Clustering with Parallel Implementations and GPU Computing Janki Bhimani Miriam Leeser Ningfang Mi
A Simple Approach for Author Profiling in MapReduce
Image taken from: slideshare
Homework 1 Tutorial Instructor: Weidong Shi (Larry), PhD
Machine Learning with Spark MLlib
Big Data is a Big Deal!.
Presented by: Javier Pastorino Fall 2016
Sushant Ahuja, Cassio Cristovao, Sameep Mohta
Semi-Supervised Clustering
Scalable Machine Learning
A Straightforward Author Profiling Approach in MapReduce
Industrial Strength Machine Learning Jeff Eastman
Introducing Apache Mahout
DATA SCIENCE Online Training at GoLogica
Multimodal Learning with Deep Boltzmann Machines
Introduction to Data Science Lecture 7 Machine Learning Overview
Hadoop Clusters Tess Fulkerson.
KMeans Clustering on Hadoop Fall 2013 Elke A. Rundensteiner
Instance Based Learning
Charles Tappert Seidenberg School of CSIS, Pace University
CSE 491/891 Lecture 25 (Mahout).
Semi-Supervised Learning
Introducing Apache Mahout
Presentation transcript:

Apache Mahout Feb 13, 2012 Shannon Quinn Cloud Computing CS

MapReduce Review C0C1C2C3 M0M1M2M3 IO0IO1IO2IO3 R0R1 FO0FO1 chunks mappers Reducers Map Phase Reduce Phase Shuffling Data Scalable programming model Map phase Shuffle Reduce phase MapReduce Implementations Google Hadoop Figure from lecture 6: MapReduce

MapReduce Review C0C1C2C3 M0M1M2M3 IO0IO1IO2IO3 R0R1 FO0FO1 chunks mappers Reducers Map Phase Reduce Phase Shuffling Data Scalable programming model Map phase Shuffle Reduce phase MapReduce Implementations Google Hadoop Figure from lecture 6: MapReduce This is our focus!

Apache Mahout A scalable machine learning library

Apache Mahout A scalable machine learning library Built on Hadoop

Apache Mahout A scalable machine learning library Built on Hadoop Philosophy of Mahout (and Hadoop by proxy)

What does Mahout do?

Recommendation

Classification

Clustering

Other Mahout Algorithms Dimensionality Reduction Regression Evolutionary Algorithms

Mahout 1.Recommendation 2.Classification 3.Clustering

Recommendation Overview Help users find items they might like based on historical preferences

Recommendation Overview Mathematically

Recommendation Overview Alice Bob Peter 514 ? *based on example by Sebastian Schelter

Recommendation Overview *based on example by Sebastian Schelter

Recommendation Overview 514 ? Bob *based on example by Sebastian Schelter

Recommendation Overview Bob *based on example by Sebastian Schelter

Recommendation in Mahout 1 st Map phase: process input *based on example by Sebastian Schelter

Recommendation in Mahout 1 st Map phase: process input 1 st Reduce phase: list by user *based on example by Sebastian Schelter

Recommendation in Mahout 2 nd Map phase: Emit co-occurred ratings *based on example by Sebastian Schelter

Recommendation in Mahout 2 nd Map phase: Emit co-occurred ratings 2 nd Reduce phase: Compute similarities *based on example by Sebastian Schelter

Mahout 1.Recommendation 2.Classification 3.Clustering

Classification Overview Assigning data to discrete categories

Classification Overview Assigning data to discrete categories Train a model on labeled data Spam Not spam

Classification Overview Assigning data to discrete categories Train a model on labeled data Run the model on new, unlabeled data Spam Not spam ?

Naïve Bayes Example

Prob (token | label) =

Naïve Bayes Example Not spam

Naïve Bayes Example President Obama’s Nobel Prize Speech Not spam

Naïve Bayes Example Spam

Naïve Bayes Example Spam Spam content

Naïve Bayes Example

“Order a trial Adobe chicken daily EAB-List new summer savings, welcome!”

Naïve Bayes in Mahout Complex!

Naïve Bayes in Mahout Complex! Training 1.Read the features

Naïve Bayes in Mahout Complex! Training 1.Read the features 2.Calculate per-document statistics

Naïve Bayes in Mahout Complex! Training 1.Read the features 2.Calculate per-document statistics 3.Normalize across categories

Naïve Bayes in Mahout Complex! Training 1.Read the features 2.Calculate per-document statistics 3.Normalize across categories 4.Calculate normalizing factor of each label

Naïve Bayes in Mahout Complex! Training 1.Read the features 2.Calculate per-document statistics 3.Normalize across categories 4.Calculate normalizing factor of each label Testing Classification

Other Classification Algorithms Stochastic Gradient Descent

Other Classification Algorithms Stochastic Gradient Descent Support Vector Machines

Other Classification Algorithms Stochastic Gradient Descent Support Vector Machines Random Forests

Mahout 1.Recommendation 2.Classification 3.Clustering

Clustering Overview Grouping unstructured data

Clustering Overview Grouping unstructured data Small intra-cluster distance

Clustering Overview Grouping unstructured data Small intra-cluster distance Large inter-cluster distance

K-Means Clustering Example

Cats Dogs

K-Means Clustering in Mahout + C0C1C2C3 M0M1M2M3 IO0IO1IO2IO3 R0R1 FO0FO1 chunks mappers Reducers Map Phase Reduce Phase Shuffling Data Figure from lecture 6: MapReduce

K-Means Clustering in Mahout Assume: # clusters <<< # points

K-Means Clustering in Mahout Assume: # clusters <<< # points M0M1M2M3

K-Means Clustering in Mahout Assume: # clusters <<< # points M0M1M2M3 R0R1

K-Means Clustering in Mahout Map phase: assign cluster IDs

K-Means Clustering in Mahout Map phase: assign cluster IDs Reduce phase: reset centroids

K-Means Clustering in Mahout Important notes --maxIter --convergenceDelta method

Other Clustering Algorithms Latent Dirichlet Allocation Topic models

Other Clustering Algorithms Latent Dirichlet Allocation Topic models Fuzzy K-Means Points are assigned multiple clusters

Other Clustering Algorithms Latent Dirichlet Allocation Topic models Fuzzy K-Means Points are assigned multiple clusters Canopy clustering Fast approximations of clusters

Other Clustering Algorithms Latent Dirichlet Allocation Topic models Fuzzy K-Means Points are assigned multiple clusters Canopy clustering Fast approximations of clusters Spectral clustering Treat points as a graph

Other Clustering Algorithms Latent Dirichlet Allocation Topic models Fuzzy K-Means Points are assigned multiple clusters Canopy clustering Fast approximations of clusters Spectral clustering Treat points as a graph K-Means & Eigencuts

Mahout in Summary

Scalable library

Mahout in Summary Scalable library Three primary areas of focus

Mahout in Summary Scalable library Three primary areas of focus Other algorithms

Mahout in Summary Scalable library Three primary areas of focus Other algorithms All in your friendly neighborhood MapReduce

Mahout in Summary Scalable library Three primary areas of focus Other algorithms All in your friendly neighborhood MapReduce

Thank you!