Apache Mahout

Outline: Mahout Introduction, Machine Learning, Clustering, K-means, Canopy Clustering, Fuzzy K-Means, Conclusion

What is Mahout?
Distributed machine learning libraries
- "Scalable to reasonably large data sets"
- Runs on Hadoop

What?
Hadoop brings:
- Map/Reduce API
- HDFS
- In other words, scalability and fault-tolerance
Mahout brings:
- Library of machine learning algorithms
- Examples

Why Mahout?
Many open source ML libraries either:
- Lack community
- Lack documentation and examples
- Lack scalability
- Lack the Apache License ;-)
- Or are research-oriented

Clustering
Unsupervised
Find natural groupings
- Documents
- Search results
- People
- Genetic traits in groups
- Many, many more uses

Types
Supervised
- Uses labeled training data to create a function that predicts the output of unseen inputs
Unsupervised
- Uses unlabeled data to create a function that predicts output
Semi-Supervised
- Uses both labeled and unlabeled data

Example: Clustering Google News

K-means Algorithm
1) Pick a number (k) of cluster centers
2) Assign every element to its nearest cluster center
3) Move each cluster center to the mean of its assigned elements
4) Repeat 2-3 until convergence
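The four steps above can be sketched in plain Python. This is only an illustrative, single-machine sketch and not Mahout's Map/Reduce implementation: the function names (`kmeans`, `squared_distance`), the choice of squared Euclidean distance, picking initial centers by random sampling, and the `max_iter` and `seed` parameters are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Illustrative k-means on a list of numeric tuples (assumed helper, not Mahout API)."""
    rng = random.Random(seed)
    # Step 1: pick k cluster centers (here: k distinct input points at random).
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every element to its nearest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: move each center to the mean of its assigned elements.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                mean = tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                new_centers.append(mean)
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[i])
        # Step 4: repeat until the centers stop moving (convergence).
        if new_centers == centers:
            break
        centers = new_centers
    # clusters holds the assignment made in the last assignment step.
    return centers, clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    final_centers, final_clusters = kmeans(data, k=2)
    print(final_centers)
    print(final_clusters)
```

Stopping when the centers no longer change is one simple convergence test; a distance tolerance would serve the same purpose on noisier data.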

K-means Example
Figure 1: K-means algorithm. Training examples are shown as dots, and cluster centroids are shown as crosses.

Invocation using the command line takes the form:

Canopy Clustering
Canopy Clustering is a very simple, fast and surprisingly accurate method for grouping objects into clusters.
- Define two thresholds (T1 < T2)
  - Tight: T1
  - Loose: T2
- Put all records into a set S
- While S is not empty
  - Remove any record r from S and create a canopy centered at r
  - For each other record r_i, compute the cheap distance d from r to r_i
  - If d < T2, place r_i in r's canopy
  - If d < T1, remove r_i from S
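The canopy procedure above can be sketched as follows. This is a minimal single-machine illustration, not Mahout's implementation. It follows the slide's naming, where T1 is the tight threshold and T2 the loose one, so T1 <= T2 is assumed. The function names, the use of Euclidean distance as the "cheap" distance, always taking the first remaining record as the next canopy center, and comparing only against records still in S are all assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance, standing in here for the 'cheap' distance measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_clustering(records, t1, t2, distance=euclidean):
    """Return a list of (center, members) canopies.

    t1 is the tight threshold and t2 the loose threshold (slide naming),
    so t1 <= t2 is required. Hypothetical helper, not a Mahout API.
    """
    if t1 > t2:
        raise ValueError("tight threshold t1 must not exceed loose threshold t2")
    remaining = list(records)  # the set S
    canopies = []
    while remaining:
        # Remove a record r from S and create a canopy centered at r.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for r in remaining:
            d = distance(center, r)
            if d < t2:
                # Within the loose threshold: r joins this canopy.
                members.append(r)
            if d >= t1:
                # Not within the tight threshold: r stays in S and may
                # later join other canopies or become a center itself.
                still_remaining.append(r)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


if __name__ == "__main__":
    data = [(0, 0), (0.5, 0), (2, 0), (10, 0)]
    for center, members in canopy_clustering(data, t1=1.0, t2=3.0):
        print(center, members)
```

Because a record inside the loose threshold but outside the tight one stays in S, canopies may overlap; that overlap is what makes canopies useful as a cheap pre-clustering step.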

Canopy Clustering
Input: SequenceFile (WritableComparable, VectorWritable)
Invocation using the command line takes the form:

Fuzzy K-Means
Fuzzy K-Means (also called Fuzzy C-Means) is an extension of K-Means, the popular simple clustering technique. Like K-Means, Fuzzy K-Means works on objects that can be represented in an n-dimensional vector space and for which a distance measure is defined.
The algorithm is similar to k-means:
- Initialize k clusters
- Until converged
  - Compute the probability of a point belonging to a cluster, for every (point, cluster) pair
  - Re-compute the cluster centers using the above probability membership values of points to clusters
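The loop above can be sketched with the standard Fuzzy C-Means update rules: the membership of a point in cluster i is proportional to d_i^(-2/(m-1)), and each center is the mean of all points weighted by membership^m. The slides do not state these formulas, so treat them, the fuzziness parameter `m`, the tolerance `tol`, the optional `initial_centers` argument, and all function names as assumptions of this illustration rather than Mahout's code.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def memberships(point, centers, m):
    """Membership of one point in every cluster; the values sum to 1."""
    dists = [distance(point, c) for c in centers]
    zero = [i for i, d in enumerate(dists) if d == 0.0]
    if zero:
        # The point sits exactly on one or more centers: share membership among them.
        return [1.0 / len(zero) if i in zero else 0.0 for i in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    inverse = [d ** -exponent for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]


def fuzzy_kmeans(points, k, m=2.0, max_iter=100, tol=1e-6, seed=0, initial_centers=None):
    """Illustrative fuzzy k-means; returns (centers, membership rows per point)."""
    if m <= 1.0:
        raise ValueError("fuzziness m must be greater than 1")
    # Initialize k clusters (random input points unless centers are supplied).
    if initial_centers is None:
        initial_centers = random.Random(seed).sample(points, k)
    centers = [tuple(float(x) for x in c) for c in initial_centers]
    dim = len(points[0])
    for _ in range(max_iter):
        # Compute the membership of every (point, cluster) pair.
        u = [memberships(p, centers, m) for p in points]
        # Re-compute each center as a mean weighted by membership ** m.
        new_centers = []
        for i in range(len(centers)):
            weights = [row[i] ** m for row in u]
            total = sum(weights)
            new_centers.append(
                tuple(sum(w * p[d] for w, p in zip(weights, points)) / total for d in range(dim))
            )
        shift = max(distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # converged: no center moved more than tol
            break
    return centers, [memberships(p, centers, m) for p in points]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    final_centers, final_u = fuzzy_kmeans(data, k=2)
    print(final_centers)
    print(final_u)
```

Unlike k-means, every point contributes to every center; larger values of `m` make the memberships softer, while values close to 1 approach hard k-means assignments.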

Fuzzy K-Means
Invocation using the command line takes the form:

Conclusion
Mahout did not scale well
Mahout was not easy to learn
Mahout was not easily modifiable
For performance and efficiency, it is better to:
- Understand the data set
- Understand data mining
- Understand the methodology

Thank you!