Published by Elmer Snow. Modified over 6 years ago.
1
Clustering
Anna Reithmeir | Data Mining Proseminar 2017
We will now take a look at another important
2
Content
What is clustering?
Cluster models
Cluster algorithms: partitional clustering, hierarchical clustering, others
Applications
Anna Reithmeir | Data Mining Proseminar | Clustering
3
General Idea
Goal: Find natural groupings in a given data set
Input: Multivariate dataset
Output: Clustering
- The output consists of the clusters, i.e. a cluster set
Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006
4
General Idea
Goal: Find natural groupings in a given data set
Input: Multivariate dataset
Output: Clustering
Unsupervised learning method
- No information about labels is available
- As a reminder: unsupervised learning methods describe hidden structure in the data
Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006
7
Cluster Models
Partitional clustering
Hierarchical clustering
- Clustering can be approached from different perspectives and with different goals, which is why several cluster models exist
8
Cluster Models
Partitional clustering: produces a partition of the dataset
Hierarchical clustering: produces a hierarchy of clusters
- In other words: a set of nested clusters
9
Cluster Models
Partitional clustering: produces a partition of the dataset
Hierarchical clustering: produces a hierarchy of clusters
Agglomerative methods: merge clusters iteratively
Divisive methods: divide clusters iteratively
- For each model, several algorithms have been introduced
10
Cluster models (overview diagram)
Partitional clustering
Hierarchical clustering: divisive clustering, agglomerative clustering
11
Partitional Clustering: K-Means
Introduced by MacQueen in 1967
Centroid-based, hard clustering
- One of the first clustering algorithms to be introduced
- Straightforward
- Centroid-based: clusters are represented by their centers
12
Partitional Clustering: K-Means
Introduced by MacQueen in 1967
Centroid-based, hard clustering
Minimizes the sum of squared distances from each data point to the mean of its cluster
The number of clusters, the distance function, and the initial cluster centers need to be specified
- In other words: each point is assigned to the cluster whose center is nearest
- Examples of distance functions: Euclidean, Hamming
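The quantity that k-means minimizes can be written out explicitly. The following is a sketch that assumes squared Euclidean distance; the symbols x_n (data points), mu_k (cluster means) and r_nk (hard assignment indicators) are notation introduced here for illustration and do not appear on the slides:

```latex
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2,
\qquad
r_{nk} =
\begin{cases}
1 & \text{if } x_n \text{ is assigned to cluster } k \\
0 & \text{otherwise}
\end{cases}
```

Because each point belongs to exactly one cluster, every row of r_nk contains a single 1, which is what makes this a hard clustering.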
13
K-means Algorithm
Step by step, k=2
Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006
14
K-means Algorithm
1. Initialize cluster centers
Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006
15
K-means Algorithm
1. Initialize cluster centers
2. Assign each point to the nearest cluster
Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006
16
K-means Algorithm
1. Initialize cluster centers
2. Assign each point to the nearest cluster
3. Recompute each cluster center as the mean of all data points in the cluster
- Result of the first iteration
Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006
17
K-means Algorithm
1. Initialize cluster centers
2. Assign each point to the nearest cluster
3. Recompute each cluster center as the mean of all data points in the cluster
4. Repeat steps 2 and 3 until the cluster centers no longer change
- Result of the next iteration
Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006
18
K-means Algorithm
1. Initialize cluster centers
2. Assign each point to the nearest cluster
3. Recompute each cluster center as the mean of all data points in the cluster
4. Repeat steps 2 and 3 until the cluster centers no longer change
- Final clustering after convergence
- You may have noticed that some points in the middle changed their cluster along the way
Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006
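The four steps of the algorithm can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: it assumes Euclidean distance, points given as tuples of numbers, and initial centers drawn at random from the data. The function name `kmeans` and its parameters are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (a sketch)."""
    rng = random.Random(seed)
    # Step 1: initialize cluster centers (assumption: k random data points).
    centers = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest cluster center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(coord) / len(members) for coord in zip(*members))
                new_centers.append(mean)
            else:
                # Assumption: an empty cluster simply keeps its old center.
                new_centers.append(centers[j])
        # Step 4: repeat until the centers do not change anymore.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

For example, `kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], k=2)` returns the two cluster centers together with one cluster index per input point.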
19
Hierarchical Clustering: Agglomerative Algorithms
Agglomerative methods: merge the two closest clusters in each step
Single-link (SLINK), Complete-link (CLINK)
- Used when we want a hierarchy instead of a partition
- General algorithm: start with each point in its own cluster, then repeatedly merge the two clusters with the smallest distance
- SLINK: the cluster distance is the distance between the two closest points
- CLINK: the cluster distance is the distance between the two furthest points
Image: Murphy, ‘Machine Learning: A Probabilistic Perspective’, 2012
20
Hierarchical Clustering: Agglomerative Algorithms
Agglomerative methods: merge the two closest clusters in each step
Single-link (SLINK), Complete-link (CLINK)
Definition of distance: SLINK, CLINK
- SLINK: merges regardless of the similarity inside a cluster, which can lead to clusters with a wide diameter
- CLINK: merges the pair of clusters with the smallest maximum distance, which leads to clusters with a small diameter
Image: Murphy, ‘Machine Learning: A Probabilistic Perspective’, 2012
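The two distance definitions and the general merge loop can be sketched as follows. This is a naive illustration under the assumption of Euclidean distance between points given as tuples; the function names `single_link`, `complete_link` and `agglomerate` are hypothetical, and the loop is far less efficient than the actual SLINK and CLINK algorithms.

```python
import math
from itertools import combinations


def single_link(a, b):
    """Single-link distance: distance between the two closest points."""
    return min(math.dist(p, q) for p in a for q in b)


def complete_link(a, b):
    """Complete-link distance: distance between the two furthest points."""
    return max(math.dist(p, q) for p in a for q in b)


def agglomerate(points, linkage, num_clusters):
    """Merge the two closest clusters until num_clusters remain."""
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge cluster j into cluster i (i < j, so deleting j is safe).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Stopping at a fixed number of clusters corresponds to cutting the hierarchy at one level; recording every merge instead would give the full hierarchy.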
21
Dendrograms
- A dendrogram visualizes the hierarchy of clusters
- Depending on the level at which it is cut, the data is split into a different number of clusters (level 5 -> 3)
- Example: hierarchy of tumor subclasses of breast cancer
Images: Jain, ‘Data Clustering: A Review’, 1999; Sorlie, ‘Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications’, 2001
22
Other Algorithms: Soft Clustering
Expectation Maximization:
Models clusters with a combination of probability distributions
Computes the maximum likelihood
- Combination of distributions -> mixture model
Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006
23
Other Algorithms: Soft Clustering
Expectation Maximization:
Models clusters with a combination of probability distributions
Computes the maximum likelihood
E-step: Calculate, for each point, the probability that it belongs to each distribution under the current parameters
M-step: Re-estimate the parameters of the distributions (shape and location) to maximize the likelihood
Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006
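The alternation between the two steps can be sketched for the simplest case, a mixture of one-dimensional Gaussians. This is an illustrative sketch only: the function names, the fixed iteration count and the small variance floor are assumptions made here, not part of the presentation.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50):
    """EM for a 1-D Gaussian mixture; returns parameters and soft assignments."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    resp = []
    for _ in range(iterations):
        # E-step: probability that each point belongs to each component.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate location, shape and weight of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            # Assumption: a small floor keeps the variance from collapsing to 0.
            variances[j] = var + 1e-6
            weights[j] = nj / len(data)
    return means, variances, weights, resp
```

Unlike k-means, each row of `resp` holds a probability per cluster rather than a single cluster index, which is what makes this a soft clustering.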
24
Other Algorithms: DBSCAN
Density Based Spatial Clustering of Applications with Noise Density based method User input: Max distance of ‘reachable points‘ Min number of points which form dense region of ‘core points‘ -different approach-> graph theory -categorizes core points(red), reachable points(yellow), outliers(blue) -takes noise in input into account-> robust to outliers Anna Reithmeir | Data Mining Proseminar | Clustering
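How the two user inputs drive the categorization into core points, reachable points and outliers can be sketched as follows. This is a simplified illustration assuming Euclidean distance and a brute-force neighbor search; the names `dbscan`, `eps` (max distance) and `min_pts` (min number of points) are chosen here for illustration.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for outliers."""
    NOISE = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            # Not a core point; may still become a reachable point later.
            labels[i] = NOISE
            continue
        # Point i is a core point: start a new cluster and expand it.
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # reachable point on the cluster border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                # j is itself a core point, so its neighbors are reachable too.
                queue.extend(nbrs)
    return labels
```

Note that the number of clusters is not an input: it follows from how many dense regions the data contains.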
25
- Now that we have seen how the algorithms work, let's compare them
- k=3 for k-means and hierarchical clustering: they always compute 3 clusters
- The smiley's mouth is represented by one cluster instead of three
- DBSCAN: discovers the number of clusters itself and groups the data the way we expect it
- In fact, DBSCAN received the ‘Test of Time Award’ in 2014 at the Knowledge Discovery and Data Mining conference, a leading conference in data mining
Image: http//: May 2017
26
Applications
What is clustering needed for?
Recommender systems
- Netflix, Spotify: you may have noticed that they recommend content for you
- Recommender systems use clustering, combined with other methods
- An important tool in online marketing and the personalization of online applications
Images: Anna Reithmeir
27
Applications
Medical imaging
Gene expression analysis
Tumor identification
- Distinguish between cancerous and non-cancerous tissue
- Identify gene expression patterns in human DNA
Image: http//: May 2017
28
Applications
Image segmentation
Image compression
- Segmentation: divide the image into regions of nearly similar color (original, k=10, k=3), which results in a color reduction
- Compression: storing a cluster ID is far more efficient than storing RGB values for each pixel
- Others: speech and face recognition, search engines, predictive analytics
Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006
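The storage saving from keeping a cluster ID per pixel can be made concrete with a small calculation. The sketch below assumes 8 bits per color channel, a fixed-width cluster ID, and that the k cluster colors are stored once as a palette; the function names and the 100 x 100 image size are hypothetical.

```python
def bits_per_cluster_id(k):
    """Bits needed for a fixed-width ID that distinguishes k clusters."""
    return max(1, (k - 1).bit_length())


def compressed_size_bits(num_pixels, k, channels=3, bits_per_channel=8):
    """Palette of k colors plus one cluster ID per pixel."""
    palette = k * channels * bits_per_channel
    indices = num_pixels * bits_per_cluster_id(k)
    return palette + indices


def original_size_bits(num_pixels, channels=3, bits_per_channel=8):
    """Full RGB value stored for every pixel."""
    return num_pixels * channels * bits_per_channel


num_pixels = 100 * 100  # hypothetical image size
for k in (10, 3):
    print(k, compressed_size_bits(num_pixels, k), original_size_bits(num_pixels))
```

Under these assumptions the image takes 240000 bits uncompressed, 40240 bits with k=10 and 20072 bits with k=3.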
29
Thank you for your attention!
- We have now come to the end. We have seen different clustering methods, each with its advantages and disadvantages.
- Especially as data is big and high-dimensional nowadays, these methods face new problems.