Canopy Clustering and K-Means Clustering. Machine Learning Big Data at Hacker Dojo. Anandha L Ranganathan (Anand).

Movie Dataset. Download the movie dataset. The data is in the format UserID::MovieID::Rating::Timestamp, for example:
1::1193::5::…
1::1194::4::…
1::1123::1::…

Similarity Measures:
- Jaccard similarity coefficient
- Cosine similarity

Jaccard Index. Similarity = (# of movies watched by both User A and User B) / (total # of movies watched by either user); in other words, |A ∩ B| / |A ∪ B|. For our application I am going to compare subsets of users z₁ and z₂, where z₁, z₂ ∈ Z.

Jaccard Similarity Coefficient.

import java.util.*;

double similarity(String[] s1, String[] s2) {
    List<String> lstSx = Arrays.asList(s1);
    List<String> lstSy = Arrays.asList(s2);
    // union: movies watched by either user
    Set<String> unionSxSy = new HashSet<>(lstSx);
    unionSxSy.addAll(lstSy);
    // intersection: movies watched by both users
    Set<String> intersectionSxSy = new HashSet<>(lstSx);
    intersectionSxSy.retainAll(lstSy);
    return intersectionSxSy.size() / (double) unionSxSy.size();
}
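For example (hypothetical movie-ID arrays, not from the deck), one shared movie out of four distinct movies gives 0.25:

String[] a = {"1193", "1194", "1123"};
String[] b = {"1193", "2000"};
double sim = similarity(a, b);   // |{1193}| / |{1193, 1194, 1123, 2000}| = 0.25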

Cosine Similarity. similarity(A, B) = (A · B) / (||A|| * ||B||), where ||A|| is the Euclidean norm. The simple (cheap) distance calculation is used for canopy clustering; the expensive one is used for K-means clustering.
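A minimal sketch of this formula over two equal-length rating vectors (the helper name is an assumption, not from the deck):

static double cosineSimilarity(double[] a, double[] b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];     // A · B
        normA += a[i] * a[i];   // ||A||^2
        normB += b[i] * b[i];   // ||B||^2
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}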

Canopy Clustering – Mapper. A canopy cluster is a subset of the total population, and the points in the cluster are movies. If a subset z₁ of the whole population rated movie M1, and the same subset also rated movie M2, then M1 and M2 belong to the same canopy cluster.

Canopy Cluster – Mapper.
1. The first point received becomes the center of a canopy.
2. Receive the next point; if its distance from a canopy center is less than T1, it becomes a point of that canopy.
3. If d(P1, P2) > T1, the point becomes a new canopy center.
4. If d(P1, P2) < T1, the point belongs to the canopy centered at P1.
5. Continue steps 2–4 until the mapper completes its job.
Distance is measured between 0 and 1. With the chosen T1 value I expect around 200 canopy clusters; T2 is smaller than T1.

Canopy Cluster – Mapper, pseudo code:

boolean pointStronglyBoundToCanopyCenter = false;
for (Canopy canopy : canopies) {
    double centerPoint = canopy.getPoint();
    // similarity above T1 means the point is strongly bound to an existing canopy
    if (distanceMeasure.similarity(centerPoint, movie_id) > T1) {
        pointStronglyBoundToCanopyCenter = true;
    }
}
if (!pointStronglyBoundToCanopyCenter) {
    // not close to any existing center: this point seeds a new canopy
    canopies.add(new Canopy(movie_id));
}

Data Massaging. Convert the data into the required format; here the ratings are regrouped so that each movie is paired with the list of users who rated it.
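A minimal Hadoop mapper sketch for this step (the class name and field positions are assumptions, not from the deck):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: turns "UserID::MovieID::Rating::Timestamp" lines into
// (movie_id, user_id) pairs, so a reducer can collect the rater list per movie.
public class RatingFormatMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("::");
        context.write(new Text(fields[1]), new Text(fields[0]));
    }
}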

Canopy Cluster – Mapper A (figure).

Threshold value (figure).

Reducer (figure): Mapper A – red centers; Mapper B – green centers.

Redundant centers within the threshold of each other (figure).

Add a small error: Threshold + ξ (figure).

So far we have found only the canopy centers. Run another MapReduce job to find the points that belong to each canopy center; the canopy clusters are ready when that job completes. What would they look like?
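A hedged sketch of that second pass, reusing the names from the mapper pseudo code above:

for (Point point : points) {
    for (Canopy canopy : canopies) {
        // attach the point to every canopy whose center is similar enough;
        // a point may fall into several overlapping canopies
        if (distanceMeasure.similarity(canopy.getPoint(), point) > T1) {
            canopy.add(point);
        }
    }
}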

Canopy Cluster – before the MR job: a sparse matrix (figure).

Canopy Cluster – after the MR job (figure).

Cells with value 1 are grouped together, and users are moved from their original locations.

K-Means Clustering. The output of canopy clustering becomes the input of K-means clustering. Apply the cosine similarity metric to find similar users. To compute cosine similarity, build a rating vector for each user over the movie catalog, as in the table below.

User A: Toy Story, Avatar, Jumanji, Heat
User B: Avatar, GoldenEye, Money Train, Mortal Kombat
User C: Toy Story, Jumanji, Money Train, Avatar

        Toy Story  Avatar  Jumanji  Heat  GoldenEye  Money Train  Mortal Kombat
User A      1        1        1      1        0           0             0
User B      0        1        0      0        1           1             1
User C      1        1        1      0        0           1             0

Vector(A) = (1, 1, 1, 1, 0, 0, 0)
Vector(B) = (0, 1, 0, 0, 1, 1, 1)
Vector(C) = (1, 1, 1, 0, 0, 1, 0)

similarity(A, B) = Vector(A) · Vector(B) / (||A|| * ||B||)
Vector(A) · Vector(B) = 1
||A|| * ||B|| = 2 * 2 = 4
1 / 4 = 0.25
similarity(A, B) = 0.25
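The same numbers fall out of the cosineSimilarity sketch from earlier:

double[] a = {1, 1, 1, 1, 0, 0, 0};
double[] b = {0, 1, 0, 0, 1, 1, 1};
double sim = cosineSimilarity(a, b);   // 1 / (2 * 2) = 0.25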

Find the k nearest neighbors from within the same canopy cluster; do not take any point from another canopy cluster. If you want a small number of neighbors, keep the number of K-means clusters greater than the number of canopy clusters. After a couple of MapReduce jobs, the K-means clusters are ready.

Find Nearest Cluster of a Point – Map:

public void addPointToCluster(Point point, Iterable<KMeansCluster> lstKMeansCluster) {
    KMeansCluster closestCluster = null;
    double closestDistance = canopyThresholdT1 / 3;   // initial cut-off from the deck
    for (KMeansCluster cluster : lstKMeansCluster) {
        double distance = distance(cluster.getCenter(), point);
        if (closestCluster == null || closestDistance > distance) {
            closestCluster = cluster;
            closestDistance = distance;
        }
    }
    closestCluster.add(point);
}

Find Convergence and Compute Centroid – Reduce:

public void computeConvergence(Iterable<Cluster> clusters) {
    for (Cluster cluster : clusters) {
        Point newCentroid = cluster.computeCentroid();
        if (cluster.getCentroid().equals(newCentroid)) {
            cluster.converged = true;   // the centroid stopped moving
        } else {
            cluster.setCentroid(newCentroid);
        }
    }
}

Run the two steps (find the nearest cluster of each point, then recompute centroids) until the centroids become static.
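A sketch of the overall iteration, assuming the two methods above (and treating Cluster and KMeansCluster as one type for brevity):

boolean allConverged = false;
while (!allConverged) {
    for (Point point : points) {
        addPointToCluster(point, clusters);   // map step: assignment
    }
    computeConvergence(clusters);             // reduce step: recompute centroids
    allConverged = true;
    for (Cluster cluster : clusters) {
        if (!cluster.converged) {
            allConverged = false;             // at least one centroid still moved
        }
    }
}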

All points – before clustering (figure).

Canopy clustering (figure).

Canopy clustering and K-means clustering (figure).

?