COSC6376 Cloud Computing Homework 1 Tutorial Instructor: Weidong Shi (Larry), PhD, Computer Science Department, University of Houston
Outline Homework 1 tutorial based on the Netflix dataset
Homework 1 K-means Clustering of Amazon Reviews Create related product items based on the Amazon review ratings Understand the K-means and canopy clustering algorithms and their relationship Implement these algorithms using Apache Spark Analyze the effect of running these algorithms on a large data set using Amazon Cloud
Tutorial based on Netflix Dataset K-means example using the Netflix dataset Rating dataset similar to Amazon reviews Amazon dataset: productid, userid, rating, timestamp, plus other metadata fields and review text Netflix dataset: movieid, userid, rating, timestamp
Netflix Prize Netflix provided a training dataset of 100,480,507 ratings that 480,189 users gave to 17,770 movies Netflix internal movie rating predictor: Cinematch, used for recommending movies $1,000,000 award to those who could improve the prediction by 10% (in terms of root mean squared error) Winner: BellKor's Pragmatic Chaos Another team: Ensemble Results equally good but submitted 20 minutes later
Competition Cancelled Researchers demonstrated that individuals could be identified by matching the Netflix datasets with film ratings posted online Netflix users filed a class-action lawsuit against Netflix for privacy violation Video Privacy Protection Act of 1988
Movie Dataset The data is in the format UserID::MovieID::Rating::Timestamp 1::1193::5::978300760 2::1194::4::978300762 7::1123::1::978300760
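A record in this format can be split on the :: delimiter; a minimal Python sketch (the function name is illustrative, the field order comes from the format above):

```python
def parse_rating(line):
    """Split a UserID::MovieID::Rating::Timestamp line into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), int(rating), int(timestamp)

print(parse_rating("1::1193::5::978300760"))  # (1, 1193, 5, 978300760)
```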
K-means Clustering Clustering problem description: iterate { Compute distance from all points to all k centers Assign each point to the nearest k-center Compute the average of all points assigned to each specific k-center Replace the k-centers with the new averages } Good survey: A.K. Jain et al., Data Clustering: A Review, ACM Computing Surveys, 1999
K-means Illustration Randomly select k centroids Assign cluster label of each point according to the distance to the centroids
K-means Illustration Reclustering Recalculate the centroids Repeat until the cluster labels do not change or the centroid changes are very small
Summary of K-means Determine the value of k Determine the initial k centroids Repeat until convergence - Determine membership: Assign each point to the closest centroid - Update centroid position: Compute the average of the assigned members
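The loop above can be sketched as plain single-machine Python on 2-D points. This is an illustrative sketch only; for reproducibility it deterministically seeds with the first k points, whereas real implementations usually use random or canopy-based seeds:

```python
def kmeans(points, k, max_iters=100, tol=1e-9):
    """Plain k-means: alternate membership and centroid-update steps."""
    centroids = points[:k]  # deterministic seeding for this sketch
    for _ in range(max_iters):
        # Membership step: assign each point to the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            (sum(p[0] for p in m) / len(m), sum(p[1] for p in m) / len(m))
            if m else centroids[i]            # keep empty clusters in place
            for i, m in enumerate(clusters)
        ]
        shift = max(abs(a[0] - b[0]) + abs(a[1] - b[1])
                    for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:                       # converged: centroids stable
            break
    return centroids
```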
The Setting The dataset is stored in HDFS We use MapReduce k-means to get the clustering result Implement each iteration as one MapReduce job Pass the k centroids to the Maps Map: assign a label to each record according to the distances to the k centroids <cluster id, record> Reduce: calculate the mean for each cluster, and replace the centroid with the new mean
Complexity The complexity is high: k * n * O(distance metric) * num(iterations) Moreover, it can be necessary to send large amounts of data to each mapper node. Depending on the bandwidth and memory available, this can be prohibitive.
Furthermore There are three big ways a data set can be large: There are a large number of elements in the set. Each element can have many features. There can be many clusters to discover Conclusion – Clustering can be huge, even when you distribute it.
Canopy Clustering Preliminary step to help parallelize computation. Clusters data into overlapping canopies using a very cheap distance metric. Efficient Accurate
Canopy Clustering While there are unmarked points { pick a point which is not strongly marked call it a canopy center mark all points within some threshold of it as in its canopy strongly mark all points within some tighter threshold }
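The center-selection loop above can be sketched as follows (an illustrative sketch; the looser canopy-membership threshold is applied in a later assignment pass, so only the tight threshold t2 appears here):

```python
def canopy_centers(points, t2, dist):
    """Pick canopy centers: each chosen center strongly marks every point
    within the tight threshold t2, removing it from future candidacy."""
    centers, candidates = [], list(points)
    while candidates:
        center = candidates[0]           # any not-strongly-marked point
        centers.append(center)
        # Strongly marked points can never become centers themselves.
        candidates = [p for p in candidates if dist(p, center) >= t2]
    return centers

print(canopy_centers([0, 1, 2, 10, 11, 20], 3, lambda a, b: abs(a - b)))
# [0, 10, 20]
```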
After the Canopy Clustering… Run k-means clustering as usual. Treat objects in separate canopies as being at infinite distance.
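The "infinite distance" rule means the expensive k-means metric is only computed when a point and a centroid share at least one canopy. A sketch of the membership step under that rule (1-D points and absolute distance stand in for the real metric):

```python
import math

def assign_with_canopies(points, centroids, point_canopies, centroid_canopies):
    """Membership step with the canopy shortcut: centroids outside the
    point's canopies are treated as infinitely far away."""
    labels = []
    for i, p in enumerate(points):
        best, best_d = -1, math.inf
        for c, centroid in enumerate(centroids):
            if point_canopies[i] & centroid_canopies[c]:  # share a canopy?
                d = abs(p - centroid)     # expensive metric in practice
                if d < best_d:
                    best, best_d = c, d
        labels.append(best)
    return labels

print(assign_with_canopies([0, 1, 10, 11], [0.5, 10.5],
                           [{0}, {0}, {1}, {1}], [{0}, {1}]))  # [0, 0, 1, 1]
```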
MapReduce Implementation: Problem – Efficiently partition a large data set (say… movies with user ratings!) into a fixed number of clusters using Canopy Clustering, K-Means Clustering, and a Euclidean distance measure. The Distance Metric The Canopy Metric ($) The K-Means Metric ($$$)
Steps Get Data into a form you can use (MR) Picking Canopy Centers (MR) Assign Data Points to Canopies (MR) Pick K-Means Cluster Centers K-Means algorithm (MR) Iterate!
Canopy Distance Function Canopy selection requires a simple distance function Number of rater IDs in common Close and far distance thresholds Close distance threshold: 8 rater IDs in common Far distance threshold: 2 rater IDs in common
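This cheap metric and the two thresholds from the slide can be expressed directly (function names are illustrative; since shared raters measure similarity, "close" means many raters in common):

```python
def common_raters(raters_a, raters_b):
    """Cheap canopy metric: number of rater IDs two movies share."""
    return len(set(raters_a) & set(raters_b))

# Thresholds from the slide: "close" (strongly marked) needs >= 8 shared
# raters; a movie joins a canopy at all with >= 2 shared raters.
def strongly_marked(a, b):
    return common_raters(a, b) >= 8

def in_canopy(a, b):
    return common_raters(a, b) >= 2
```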
K-means Distance Metric The set of ratings for a movie given by a set of users can be thought of as a vector A = [user1_score, user2_score, ..., userN_score] To evaluate the distance between two movies, A and B, use the similarity metric below, Similarity(A, B) = sum(A_i * B_i) / (sqrt(sum(A_i^2)) * sqrt(sum(B_i^2))) where the sums run over all users i, 0 <= i < n
Example Three vectors: Vector(A) = 1111000 Vector(B) = 0100111 Vector(C) = 1110010 Distance or similarity between A and B: similarity(A,B) = Vector(A) * Vector(B) / (||A||*||B||) Vector(A)*Vector(B) = 1 ||A||*||B|| = 2*2 = 4 Similarity(A,B) = 1/4 = 0.25
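The worked example can be checked with a small cosine-similarity function:

```python
import math

def cosine_similarity(a, b):
    """Similarity(A, B) = sum(A_i * B_i) / (||A|| * ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

A = [1, 1, 1, 1, 0, 0, 0]
B = [0, 1, 0, 0, 1, 1, 1]
print(cosine_similarity(A, B))  # 0.25
```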
Data Massaging Convert the data into the required format. In this case the data is converted into <MovieId, List of Users>, i.e. <MovieId, List<userId, rating>>
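This massaging step is a simple group-by on movie id; a sketch on parsed (userId, movieId, rating) tuples:

```python
from collections import defaultdict

def group_by_movie(ratings):
    """Convert (userId, movieId, rating) tuples into
    <MovieId, List<(userId, rating)>>."""
    by_movie = defaultdict(list)
    for user_id, movie_id, rating in ratings:
        by_movie[movie_id].append((user_id, rating))
    return dict(by_movie)

print(group_by_movie([(1, 1193, 5), (2, 1193, 4), (7, 1123, 1)]))
# {1193: [(1, 5), (2, 4)], 1123: [(7, 1)]}
```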
Canopy Cluster – Mapper A
Threshold Value
Reducer Mapper A – red center; Mapper B – green center
Redundant Centers within the Threshold of Each Other.
Add Small Error => Threshold+ξ
So far we have found only the canopy centers. Run another MR job to find the points that belong to each canopy center. The canopy clusters are ready when that job completes. What would it look like?
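The second job can be sketched as follows: each point is mapped to every center within the loose threshold t1, so canopies may overlap (1-D points and absolute distance stand in for the real metric):

```python
def assign_to_canopies(points, centers, t1, dist):
    """Map each point to every canopy center within the loose threshold
    t1; a point may fall into several overlapping canopies."""
    canopies = {c: [] for c in centers}
    for p in points:
        for c in centers:
            if dist(p, c) < t1:
                canopies[c].append(p)
    return canopies
```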
Canopy Cluster - Before MR job Sparse Matrix
Canopy Cluster – After MR job
Cells with value 1 are grouped together, and users are moved from their original locations
K-Means Clustering The output of canopy clustering becomes the input of k-means clustering. Apply the cosine similarity metric to find similar users. To compute cosine similarity, create a vector in the format <UserId, List<Movies>> <UserId, {m1,m2,m3,m4,m5}>
User A: Toy Story, Avatar, Jumanji, Heat User B: GoldenEye, Money Train, Mortal Kombat User C: Toy Story, Avatar, Jumanji, Heat, GoldenEye, Money Train, Mortal Kombat
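For binary movie vectors like these, cosine similarity reduces to |intersection| / sqrt(|A|*|B|); a sketch using the three users above (computed values, not from the slide):

```python
import math

# Movie sets for the three users above.
users = {
    "A": {"Toy Story", "Avatar", "Jumanji", "Heat"},
    "B": {"GoldenEye", "Money Train", "Mortal Kombat"},
    "C": {"Toy Story", "Avatar", "Jumanji", "Heat",
          "GoldenEye", "Money Train", "Mortal Kombat"},
}

def set_cosine(s, t):
    """Cosine similarity of binary vectors = |s & t| / sqrt(|s| * |t|)."""
    return len(s & t) / math.sqrt(len(s) * len(t))

print(round(set_cosine(users["A"], users["B"]), 3))  # 0.0  (no overlap)
print(round(set_cosine(users["A"], users["C"]), 3))  # 0.756
```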
Find the k nearest neighbors from the same canopy cluster. Do not take points from other canopy clusters if you want a small number of neighbors # of k-means clusters > # of canopy clusters. After a couple of MapReduce jobs, the k-means clusters are ready
All Points – Before Clustering
Canopy - Clustering
Canopy Clustering and K-means Clustering