Analysis of Massive Data Sets
Prof. dr. sc. Siniša Srbljić
Doc. dr. sc. Dejan Škvorc
Doc. dr. sc. Ante Đerek
Faculty of Electrical Engineering and Computing, Consumer Computing Laboratory

Clustering
Goran Delač, PhD

Outline
□ The problem of clustering
  o Cluster representation
  o Traditional approaches
□ BFR
□ CURE

The problem of clustering
□ Why clustering?
  o To get a better understanding of the data set
  o Given a set of data points, we would like to understand their structure, i.e., to see which points are similar to one another compared to the other points in the data set (and which points are outliers)

The problem of clustering
□ What is clustering?
  o A set of points is grouped into clusters based on a notion of distance between the points
    ▪ Goal: group together points that are close to each other (similar!)
    ▪ Points in different clusters are distant / not similar
  o Distance measures [2]
    ▪ Euclidean distance, Cosine distance, Manhattan distance, Jaccard distance, Edit distance, Hamming distance, … (a few are sketched below)
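As a quick illustration, here is a minimal Python sketch of three of the measures listed above; the function names are ours, not part of the lecture.

```python
# A minimal sketch of three of the distance measures listed above.
import math

def euclidean(p, q):
    return math.dist(p, q)

def manhattan(p, q):
    return sum(abs(x - y) for x, y in zip(p, q))

def jaccard_distance(a, b):
    # Jaccard distance between two sets: 1 - |A ∩ B| / |A ∪ B|
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

print(euclidean((0, 0), (3, 4)))          # 5.0
print(manhattan((0, 0), (3, 4)))          # 7
print(jaccard_distance({1, 2}, {2, 3}))   # 0.666...
```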

The problem of clustering – examples of clustering in practice:
□ The Opte project (visualization of the Internet)
□ The Internet Map
□ Web usage data outline map of knowledge
□ National Map of Vegetation Ecoregions Produced Empirically Using Multivariate Spatial Clustering
□ Clustering Map of Biomedical Articles (text similarity)
□ Music profiles (Last.fm)
□ Clustering of 2,039 UK individuals into 17 clusters based only on genetic data
□ Social network graphs

The problem of clustering
□ Clustering is a hard problem!
  o High-dimensional data
  o A large number of data points (the data set may not fit into main memory)
□ Curse of dimensionality
  o In a high-dimensional space, almost all pairs of points are roughly equally distant from each other

Cluster representation
□ How to represent multiple points?
  o In Euclidean space, a cluster can be represented by its centroid – the average of its points
  o Distance to a cluster = distance to its centroid

Cluster representation
□ Non-Euclidean spaces
  o A centroid cannot be computed
  1. Clustroid – an actual representative point is chosen (a point from the data set)
    ▪ Treat the clustroid like a centroid, with an appropriate notion of proximity (Jaccard distance, Edit distance, …)
    ▪ How to pick a clustroid? Choose the point that is “closest” to all other points in the cluster, i.e. the point with the:
      - smallest maximum distance to the other points
      - smallest average distance to the other points
      - smallest sum of squares of distances to the other points (see the sketch below)
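A minimal sketch of the last option – picking the clustroid by smallest sum of squared distances – here with Jaccard distance over sets; the helper names are ours.

```python
def jaccard_distance(a, b):
    # Jaccard distance between two sets: 1 - |A ∩ B| / |A ∪ B|
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def clustroid(points):
    # Point with the smallest sum of squared distances to all other points
    return min(points, key=lambda p: sum(jaccard_distance(p, q) ** 2 for q in points))

sets = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"x", "y"}]
print(clustroid(sets))  # {'a', 'b', 'c'} is the most central point
```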

Cluster representation
2. Cluster cohesion (measures sketched below)
  o Diameter of a (merged) cluster
    ▪ Maximum distance between any two points in the cluster
  o Average distance
    ▪ Average over all pairs of points in the cluster
  o Density-based measures
    ▪ Number of points per measure of volume, e.g. the number of points divided by the diameter or by the average distance
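A minimal sketch of these cohesion measures for points in the plane; the density definition here (points per unit of diameter) is one of the variants named above.

```python
import itertools
import math

def diameter(cluster):
    # Maximum distance between any two points
    return max(math.dist(p, q) for p, q in itertools.combinations(cluster, 2))

def avg_distance(cluster):
    # Average distance over all pairs of points
    pairs = list(itertools.combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def density(cluster):
    # Number of points divided by the diameter
    return len(cluster) / diameter(cluster)

c = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(diameter(c), avg_distance(c), density(c))
```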

Hierarchical clustering
□ Agglomerative (bottom-up) – a sketch follows below
  o Initially, each point is its own cluster
  o Iteratively merge the two closest clusters
□ Divisive (top-down)
  o Start with a single cluster containing all points
  o Split clusters recursively
□ Complexity: a naive agglomerative implementation is O(n³); a careful implementation with a priority queue achieves O(n² log n)
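A minimal sketch of the naive O(n³) agglomerative variant for points in the plane, merging by centroid distance (one of several possible merge criteria); k is the target number of clusters.

```python
import math

def centroid(cluster):
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)

def agglomerative(points, k):
    clusters = [[p] for p in points]         # each point starts as its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):       # scan all pairs for the closest clusters
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge the closest pair
    return clusters

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(agglomerative(points, 2))
```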

Point assignment methods
□ Maintain a set of clusters
  o The number of clusters is known in advance
□ Points are assigned to their nearest cluster
□ Example: k-means clustering
□ Better suited for large data sets

K-means clustering
□ Euclidean space
1. Initialization
  ▪ Choose k points to be the initial centroids
  ▪ Assign each point to its nearest centroid
2. Recompute the centroids’ locations
3. Reassign points to the new centroids
  ▪ Points can move across cluster boundaries

K-means clustering
o Steps 2–3 are repeated until convergence (a sketch follows below)
  ▪ Points no longer change clusters
  ▪ OR some other stopping condition is met
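A minimal k-means sketch following steps 1–3 above, with random initialization and "no point changed cluster" as the stopping condition; the function name and seed parameter are ours.

```python
import math
import random

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # 1. initial centroids
    assignment = [None] * len(points)
    while True:
        moved = False
        for i, p in enumerate(points):                # assign to nearest centroid
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            if assignment[i] != nearest:
                assignment[i], moved = nearest, True
        if not moved:                                 # convergence: no reassignments
            return centroids, assignment
        for c in range(k):                            # 2. recompute centroid locations
            members = [p for i, p in enumerate(points) if assignment[i] == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(kmeans(points, 2))
```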

K-means clustering
□ How to choose k?
□ How to choose the k initial points from the data set?

K-means clustering
□ Choosing k
  o Intuitively – when we know the number of clusters in advance, or we want to partition the data into a specific number of clusters
  o Otherwise, try different values and observe cluster quality
    ▪ E.g. increasing k decreases the average distance to the centroid within a cluster; a good k lies where further increases stop helping much
[figure: average distance to centroid as a function of k]

K-means clustering
□ Choosing the k initial points
  o At random? Not a good idea – randomly picked points can easily land in the same natural cluster

K-means clustering
□ Choosing the k initial points
  o Sampling
    ▪ Take a representative sample of the data set
    ▪ Cluster it using hierarchical clustering to obtain k clusters
    ▪ Choose the k points closest to the resulting centroids
  o Dispersed points (see the sketch below)
    ▪ Pick the first point at random
    ▪ Choose the next point to be the one most distant from the already selected points (the one with the greatest minimal distance)
    ▪ Repeat until k points are selected
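A minimal sketch of the dispersed-points initialization; the function name is ours.

```python
import math
import random

def disperse_init(points, k, seed=0):
    chosen = [random.Random(seed).choice(points)]     # first point at random
    while len(chosen) < k:
        # next point: greatest minimal distance to the already selected points
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(nxt)
    return chosen

points = [(0, 0), (0, 1), (9, 9), (10, 9), (5, 5)]
print(disperse_init(points, 3))
```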

K-means clustering
□ Computational complexity: O(r · k · N) distance computations
  o k – number of centroids, N – number of points, r – number of rounds
  o The number of rounds (r) can get large
  o N can be really large in massive data sets
  o Can we get meaningful clustering results in just one pass?

BFR
□ Bradley, Fayyad, Reina
□ Based on the k-means clustering approach
□ Idea
  o Avoid storing information about all points, as they do not fit into main memory
  o Keep track of summarized cluster information instead

BFR
[figure: the data set resides in secondary storage and is read into main memory one chunk at a time; only cluster details, updated as each chunk is processed, are kept in memory – O(clusters) of state instead of O(data)]

BFR
□ Assumptions
  o Euclidean space
  o Cluster members are normally distributed around the centroid
    ▪ Standard deviations can vary across dimensions

BFR
□ Assumptions
  o Clusters are axis-aligned ellipses

BFR
□ Algorithm
1. Choose initial centroids
  ▪ Using some appropriate method, as in k-means clustering
2. Process the data set
  ▪ Read the data set one chunk at a time
  ▪ Maintain cluster information
3. Finalize cluster information

BFR
□ Cluster information
  o Discard set (DS)
    ▪ Points that are close enough to a centroid to be summarized and then discarded
  o Compression set (CS)
    ▪ Groups of points that are close together but not close to any cluster centroid
    ▪ Points are summarized in a compression set and discarded (they are not assigned to any cluster!)
  o Retained set (RS)
    ▪ Isolated points – kept in memory

BFR
[figure: points around each centroid are summarized in the discard set (DS); outlying groups form compression sets (CS); isolated points form the retained set (RS)]

BFR
□ Cluster information (DS + CS)
  o The number of points (N)
    ▪ Points summarized in the cluster / compression set
  o Vector SUM
    ▪ Element i is the sum of the point coordinates in dimension i
  o Vector SUMSQ
    ▪ Element i is the sum of the squares of the point coordinates in dimension i

BFR
□ Cluster information (DS + CS)
  o 2d + 1 values are stored to represent each cluster / compression set
    ▪ d is the number of dimensions
  o Centroid
    ▪ Can be calculated as c_i = SUM_i / N (not stored!)
    ▪ SUM_i is the sum of the coordinates in the i-th dimension
  o Variance
    ▪ v_i = SUMSQ_i / N - (SUM_i / N)^2 (not stored!)
    ▪ The standard deviation σ_i is the square root of v_i
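A minimal sketch of such a summary; the class name is ours, and the centroid and variance are computed on demand rather than stored, exactly as above.

```python
class ClusterSummary:
    def __init__(self, d):
        self.n = 0
        self.sum = [0.0] * d       # per-dimension sum of coordinates
        self.sumsq = [0.0] * d     # per-dimension sum of squared coordinates

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # v_i = SUMSQ_i / N - (SUM_i / N)^2
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]

cs = ClusterSummary(2)
for p in [(1.0, 2.0), (2.0, 2.0), (3.0, 2.0)]:
    cs.add(p)
print(cs.centroid(), cs.variance())   # [2.0, 2.0] [0.666..., 0.0]
```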

BFR
□ Cluster information (DS + CS) – benefits
  o (ds + cs) · (2d + 1) + rs · d values are stored at any time
    ▪ ds – number of clusters (fixed)
    ▪ cs – number of compression sets
    ▪ rs – size of the retained set
    ▪ As cs increases, rs decreases
  o For a large data set: (ds + cs) · (2d + 1) + rs · d << N · d

BFR
□ Cluster information (DS + CS) – benefits
  o It is easy to compute the new centroid from SUM_i
    ▪ Updates are simple additions to the existing vector
  o It is equally easy to compute the new variance using SUM_i and SUMSQ_i
  o Very efficient when combining sets (see the sketch below)
    ▪ Simple computation of the new centroid and variance
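Combining two summaries is just component-wise addition of N, SUM, and SUMSQ, after which the same formulas yield the merged centroid and variance; this reuses the hypothetical ClusterSummary class sketched above.

```python
def merge(a, b):
    # Component-wise addition of the (N, SUM, SUMSQ) statistics
    m = ClusterSummary(len(a.sum))
    m.n = a.n + b.n
    m.sum = [x + y for x, y in zip(a.sum, b.sum)]
    m.sumsq = [x + y for x, y in zip(a.sumsq, b.sumsq)]
    return m
```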

BFR
□ Data set chunk processing
1. Load a chunk of the data set into main memory
2. Find the points that are close enough to a cluster centroid
  ▪ Update the cluster data
  ▪ Discard the points

BFR
□ Data set chunk processing
3. Cluster the remaining points along with the points in the retained set (RS)
  ▪ Points are added to compression sets (CS)
  ▪ Any main-memory clustering algorithm can be applied
  ▪ Define criteria for when to add points to clusters and when to merge clusters
  ▪ CS data is updated and the points are discarded!
4. Check whether some compression sets (CS) can be merged

BFR
□ Finalizing cluster information
  o CS and RS sets can be treated as outliers and discarded, or
  o CS and RS are merged with a cluster whose centroid is close enough

BFR
□ When is a point close enough to be included into a cluster?
□ When to merge compression sets?

BFR
□ When is a point close enough? Use the Mahalanobis distance (MD)
  o The distance to the centroid, normalized by the cluster’s standard deviation in each dimension:
    MD(p) = sqrt( Σ_i ((p_i - c_i) / σ_i)^2 )
    where c_i is the centroid and σ_i the standard deviation of the cluster in dimension i
  o A point is considered close enough if its MD is below a chosen threshold, i.e. within a small number of standard deviations of the centroid
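A minimal sketch of this test, reusing the hypothetical ClusterSummary class above; the threshold value is an assumption, expressed in standard deviations.

```python
import math

def mahalanobis(point, summary):
    c, v = summary.centroid(), summary.variance()
    # Normalized distance; max(vi, 1e-12) guards against zero variance
    return math.sqrt(sum((x - ci) ** 2 / max(vi, 1e-12)
                         for x, ci, vi in zip(point, c, v)))

def close_enough(point, summary, threshold=3.0):
    return mahalanobis(point, summary) < threshold
```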

BFR
□ When to merge compression sets?
  o Calculate the combined variance of the two merged compression sets
    ▪ SUM and SUMSQ make this operation easy
  o If the combined variance is below a certain threshold, merge the compression sets
  o Many alternative approaches
    ▪ Treat dimensions differently, as they might differ in importance
    ▪ Consider cluster density
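One possible merge criterion, under the assumption that "variance below a threshold" is checked per dimension; it reuses the merge and ClusterSummary sketches above, and the threshold value is ours.

```python
def should_merge(a, b, max_variance=1.0):
    # Merge only if the combined summary stays tight in every dimension
    return all(v < max_variance for v in merge(a, b).variance())
```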

CURE
□ Clustering Using REpresentatives
  o Can handle massive data sets
  o Euclidean space
  o No assumptions about cluster shape!

CURE
□ Pass 1
1. Take a sample of the data set that fits into main memory
2. Cluster the sample
  ▪ Any algorithm that can handle oddly shaped clusters
  ▪ Usually hierarchical clustering
3. Choose n representative points for each cluster
  ▪ Points should be as distant from each other as possible (see the previously described dispersed-points method)
  ▪ Usually a small number of points is chosen

CURE
[figure: a cluster from pass 1 with its centroid and the chosen representative points]

CURE
□ Pass 1
4. Move each representative point some fraction of the distance towards the centroid (a sketch follows below)
  ▪ E.g. 20%
  ▪ The data set is not actually changed; these are simply synthetic points used by the algorithm!
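A minimal sketch of step 4; fraction=0.2 corresponds to the 20% example, and the function name is ours.

```python
def shrink(representatives, centroid, fraction=0.2):
    # Move each representative point `fraction` of the way towards the centroid
    return [tuple(x + fraction * (c - x) for x, c in zip(p, centroid))
            for p in representatives]

print(shrink([(0.0, 0.0), (10.0, 0.0)], (5.0, 0.0)))  # [(1.0, 0.0), (9.0, 0.0)]
```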

CURE
[figure: the same cluster after the representative points have been moved towards the centroid]

CURE
□ Pass 1
5. Check whether clusters can be merged
  ▪ Merge clusters if they have representative points that are sufficiently close
  ▪ Note that the distance between centroids is not taken into account – two clusters can even have the same centroid!

CURE
□ Pass 2
  o Go through all points p in the original data set
  o Assign p to a cluster c if p is closest to one of c’s representative points (see the sketch below)
    ▪ All representative points of all other clusters are more distant
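A minimal sketch of pass 2; the cluster representation here (a dict mapping a cluster id to its representative points) is our assumption.

```python
import math

def assign(point, clusters):
    # The cluster owning the nearest representative point wins,
    # regardless of where the centroids are
    return min(clusters,
               key=lambda c: min(math.dist(point, r) for r in clusters[c]))

clusters = {"a": [(1.0, 0.0), (9.0, 0.0)], "b": [(0.0, 20.0)]}
print(assign((2.0, 1.0), clusters))  # 'a'
```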

Literature
1. J. Leskovec, A. Rajaraman, and J. D. Ullman, "Mining of Massive Datasets", 2014, Chapter 7: "Clustering"
2. Distance and Similarity measures: rityMeasures.html