COSC6376 Cloud Computing: Homework 1 Tutorial
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston

Outline
- Homework 1
- Tutorial based on the Netflix dataset

Homework 1: K-means Clustering of Amazon Reviews
- Group related product items based on the Amazon review ratings
- Understand the k-means and canopy clustering algorithms and their relationship
- Implement these algorithms using Apache Spark
- Analyze the effect of running these algorithms on a large data set using the Amazon cloud

Tutorial Based on the Netflix Dataset
- K-means example using the Netflix dataset
- Rating dataset similar to Amazon reviews
- Amazon datasets: productid, userid, rating, timestamp, plus other metadata fields and review texts
- Netflix dataset: movieid, userid, rating, timestamp

Netflix Prize
- Netflix provided a training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies
- Netflix's internal movie rating predictor, Cinematch, was used for recommending movies
- $1,000,000 award to those who could improve the prediction by 10% (in terms of root mean squared error)
- Winner: BellKor's Pragmatic Chaos. Another team, The Ensemble, produced equally good results but submitted 20 minutes later

Competition Cancelled
- Researchers demonstrated that individuals could be identified by matching the Netflix data set with film ratings posted online
- Netflix users filed a class action lawsuit against Netflix for privacy violation, citing the Video Privacy Protection Act of 1988

Movie Dataset
The data is in the format UserID::MovieID::Rating::Timestamp
1::1193::5::978300760
2::1194::4::978300762
7::1123::1::978300760
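
For concreteness, a minimal Python sketch of parsing one such record (the helper name is illustrative):

```python
# Parse one "UserID::MovieID::Rating::Timestamp" record.
def parse_rating(line):
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), int(rating), int(timestamp)

print(parse_rating("1::1193::5::978300760"))  # -> (1, 1193, 5, 978300760)
```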

K-means Clustering
Clustering problem description:
iterate {
    Compute the distance from all points to all k centers
    Assign each point to the nearest k-center
    Compute the average of all points assigned to each k-center
    Replace the k-centers with the new averages
}
Good survey: A. K. Jain et al., "Data Clustering: A Review," ACM Computing Surveys, 1999
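
For reference, a minimal single-machine sketch of this loop in Python, assuming points are equal-length numeric tuples and using squared Euclidean distance (the homework's actual implementation is in Spark):

```python
import random

def kmeans(points, k, iterations=20):
    """Plain k-means over points given as equal-length tuples of floats."""
    centers = random.sample(points, k)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Replace each center with the average of its assigned points.
        centers = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers
```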

K-means Illustration
- Randomly select k centroids
- Assign a cluster label to each point according to its distance to the centroids

K-means Illustration
- Recluster the points
- Recalculate the centroids
- Repeat until the cluster labels do not change, or the changes in the centroids are very small

Summary of K-means
- Determine the value of k
- Determine the initial k centroids
- Repeat until convergence:
  - Determine membership: assign each point to the closest centroid
  - Update centroid positions: compute the average of the assigned members

The Setting
- The dataset is stored in HDFS
- We use a MapReduce k-means implementation to get the clustering result
- Implement each iteration in one MapReduce process
- Pass the k centroids to the Maps
- Map: assign a label to each record according to its distances to the k centroids, emitting <cluster id, record>
- Reduce: calculate the mean for each cluster, and replace the centroid with the new mean
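
A framework-agnostic sketch of one such iteration as map and reduce functions (the function names and shapes are illustrative, not a specific Hadoop or Spark API):

```python
def kmeans_map(record, centroids):
    """Label a record with the id of its nearest centroid."""
    dists = [sum((a - b) ** 2 for a, b in zip(record, c)) for c in centroids]
    yield dists.index(min(dists)), record          # <cluster id, record>

def kmeans_reduce(cluster_id, records):
    """Compute the new centroid as the mean of the cluster's records."""
    records = list(records)
    mean = tuple(sum(xs) / len(records) for xs in zip(*records))
    yield cluster_id, mean                         # replaces the old centroid
```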

Complexity
The complexity is pretty high: k * n * O(distance metric) * num(iterations)
Moreover, it can be necessary to send tons of data to each mapper node. Depending on your bandwidth and available memory, this could be impossible.

Furthermore
There are three big ways a data set can be large:
- There are a large number of elements in the set
- Each element can have many features
- There can be many clusters to discover
Conclusion: clustering can be huge, even when you distribute it.

Canopy Clustering
- A preliminary step to help parallelize the computation
- Clusters data into overlapping canopies using a very cheap distance metric
- Efficient and accurate

Canopy Clustering
while there are unmarked points {
    pick a point which is not strongly marked; call it a canopy center
    mark all points within some threshold of it as in its canopy
    strongly mark all points within some stronger threshold
}
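
A minimal in-memory sketch of this loop, assuming a distance function where smaller means closer and two thresholds t1 > t2:

```python
def canopy_centers(points, distance, t1, t2):
    """Canopy selection: t1 = loose "in the canopy" threshold,
    t2 = tight "strongly marked" threshold, with t1 > t2."""
    remaining = set(range(len(points)))       # not yet strongly marked
    canopies = []                             # (center index, member indices)
    while remaining:
        center = remaining.pop()              # pick an unmarked point
        members = {i for i in remaining
                   if distance(points[center], points[i]) < t1}
        canopies.append((center, members | {center}))
        # Strongly mark: drop points within the tighter threshold, so they
        # cannot become centers later; others may join several canopies.
        remaining -= {i for i in members
                      if distance(points[center], points[i]) < t2}
    return canopies
```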

After the Canopy Clustering…
- Run k-means clustering as usual
- Treat objects in separate canopies as being at infinite distances
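
One way this "infinite distance" rule could be realized in the k-means assignment step is to consider only centers that share a canopy with the point; the helper below is a hypothetical sketch, not the canonical implementation:

```python
import math

def nearest_center(point, point_canopies, centers, center_canopies, distance):
    """Consider a center only if it shares a canopy with the point;
    all other centers are effectively at infinite distance.
    point_canopies and center_canopies[j] are sets of canopy ids."""
    best_id, best_d = None, math.inf
    for j, c in enumerate(centers):
        if point_canopies & center_canopies[j]:    # shared canopy?
            d = distance(point, c)
            if d < best_d:
                best_id, best_d = j, d
    return best_id    # None if the point shares no canopy with any center
```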

MapReduce Implementation
Problem: efficiently partition a large data set (say… movies with user ratings!) into a fixed number of clusters using canopy clustering, k-means clustering, and a Euclidean distance measure.
The distance metrics:
- The canopy metric ($)
- The k-means metric ($$$)

Steps
1. Get data into a form you can use (MR)
2. Pick canopy centers (MR)
3. Assign data points to canopies (MR)
4. Pick k-means cluster centers
5. Run the k-means algorithm (MR)
6. Iterate!

Canopy Distance Function
- Canopy selection requires a simple distance function: the number of rater IDs two movies have in common
- Close and far distance thresholds:
  - Close distance threshold: 8 rater IDs in common
  - Far distance threshold: 2 rater IDs in common
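
A sketch of this cheap metric in Python. Since it counts common raters it is really a similarity, so "close" means many raters in common; the exact >= conventions below are assumptions:

```python
T_CLOSE, T_FAR = 8, 2   # thresholds from the slide

def common_raters(raters_a, raters_b):
    """raters_*: sets of user ids who rated each movie."""
    return len(raters_a & raters_b)

def strongly_marked(raters_a, raters_b):
    # Within the tight threshold: remove from the candidate list.
    return common_raters(raters_a, raters_b) >= T_CLOSE

def in_canopy(raters_a, raters_b):
    # Within the loose threshold: member of the canopy.
    return common_raters(raters_a, raters_b) >= T_FAR
```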

K-means Distance Metric
The set of ratings for a movie given by a set of users can be thought of as a vector:
A = [user1_score, user2_score, ..., userN_score]
To evaluate the distance between two movies, A and B, use the cosine similarity metric:
Similarity(A, B) = sum(A_i * B_i) / (sqrt(sum(A_i^2)) * sqrt(sum(B_i^2)))
where the sums run over all components i, 0 <= i < n

Example
Three vectors:
Vector(A) = 1111000
Vector(B) = 0100111
Vector(C) = 1110010
Similarity between A and B:
Similarity(A, B) = Vector(A) * Vector(B) / (||A|| * ||B||)
Vector(A) * Vector(B) = 1
||A|| * ||B|| = 2 * 2 = 4
Similarity(A, B) = 1/4 = 0.25
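
The same computation in Python, reproducing the 0.25 result:

```python
import math

A = [1, 1, 1, 1, 0, 0, 0]
B = [0, 1, 0, 0, 1, 1, 1]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(A, B))   # 1 / (2 * 2) = 0.25
```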

Data Massaging
- Convert the data into the required format
- In this case the converted data is keyed by movie: <MovieId, List<userId, rating>>
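
A hedged PySpark sketch of this conversion, assuming an existing SparkContext sc and a ratings.dat file of UserID::MovieID::Rating::Timestamp lines:

```python
def to_movie_pair(line):
    user_id, movie_id, rating, _ = line.split("::")
    return movie_id, (user_id, int(rating))

# movieId -> iterable of (userId, rating) pairs
movie_vectors = (sc.textFile("ratings.dat")
                   .map(to_movie_pair)
                   .groupByKey())
```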

Canopy Cluster – Mapper A

Threshold Value

Reducer: Mapper A's red center, Mapper B's green center

Redundant Centers within the Threshold of Each Other.

Add Small Error => Threshold+ξ

So far we have found only the canopy centers. Run another MapReduce job to find the points that belong to each canopy center; the canopy clusters are ready when that job completes. What would it look like?
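
A sketch of that follow-up job, with the chosen canopy centers broadcast to every mapper and in_canopy standing in for the cheap metric from the earlier slide (names illustrative):

```python
def canopy_assign_map(point, centers, in_canopy):
    """Emit (center id, point) for every canopy the point falls in;
    canopies overlap, so a point can be emitted more than once."""
    for center_id, center in centers.items():
        if in_canopy(point, center):
            yield center_id, point

def canopy_assign_reduce(center_id, points):
    yield center_id, list(points)   # one finished canopy cluster
```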

Canopy Cluster - Before the MR Job (sparse matrix)

Canopy Cluster – After MR job

Cells with value 1 are grouped together, and users are moved from their original locations

K-means Clustering
- The output of the canopy clustering becomes the input of the k-means clustering
- Apply the cosine similarity metric to find similar users
- To compute cosine similarity, create a vector in the format <UserId, List<Movies>>, e.g. <UserId, {m1, m2, m3, m4, m5}>

Movies seen by each user:
- User A: Toy Story, Avatar, Jumanji, Heat
- User B: GoldenEye, Money Train, Mortal Kombat
- User C: Toy Story, Avatar, Jumanji, Heat, GoldenEye, Money Train, Mortal Kombat

As binary vectors over (Toy Story, Avatar, Jumanji, Heat, GoldenEye, Money Train, Mortal Kombat):
User A = 1111000
User B = 0000111
User C = 1111111

- Find the k nearest neighbors within the same canopy cluster
- Do not take points from another canopy cluster if you want a small number of neighbors
- The number of k-means clusters should be greater than the number of canopy clusters
- After a couple of MapReduce jobs, the k-means clusters are ready

All Points – Before Clustering

Canopy Clustering

Canopy Clustering and K-means Clustering