Clustering Anna Reithmeir Data Mining Proseminar 2017 we will now take a look at another important
Content. What is clustering? Cluster models. Cluster algorithms: partitional clustering, hierarchical clustering, others. Applications. Anna Reithmeir | Data Mining Proseminar | Clustering
General Idea. Goal: find natural groupings in a given data set. Input: multivariate dataset. Output: clustering, i.e. a set of clusters. Unsupervised learning method. -No information about labels is available. -As a reminder: unsupervised learning methods describe hidden structure in data. | Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006
Cluster Models. Partitional clustering: produces a partition of the dataset. Hierarchical clustering: produces a hierarchy of clusters, in other words a set of nested clusters. Agglomerative methods merge clusters iteratively; divisive methods divide clusters iteratively. -Clustering has been approached from different perspectives and with different goals, which is why several models exist. -For each model, several algorithms have been introduced.
Partitional Clustering: K-Means. Introduced by MacQueen in 1967. Centroid-based, hard clustering. Minimizes the sum of squared distances from each data point to the mean of its cluster. The number of clusters, the distance function, and the initial cluster centers need to be specified. -One of the first clustering algorithms introduced, and straightforward. -Centroid-based: clusters are represented by their centers. -In other words: each point is assigned to the cluster whose center is nearest. -Examples of distance functions: Euclidean, Hamming.
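The quantity that K-means minimizes can be written down directly. The following is a minimal sketch in plain Python, assuming squared Euclidean distance and points stored as tuples; the function name `within_cluster_sse` and the toy data are illustrative assumptions, not taken from the slides.

```python
def within_cluster_sse(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its cluster center."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Toy data (made up for illustration): two obvious groups on a line.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
centers = [(1.0, 0.0), (11.0, 0.0)]  # the mean of each group

# Each point assigned to its nearest center.
sse_good = within_cluster_sse(points, [0, 0, 1, 1], centers)

# A worse assignment of the same points to the same centers.
sse_bad = within_cluster_sse(points, [0, 1, 0, 1], centers)
```

Assigning every point to its nearest center gives the smaller value here, which is the behaviour the objective rewards.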
K-means Algorithm. Step by step, k=2. 1. Initialize the cluster centers. 2. Assign each point to the nearest cluster center. 3. Recompute each cluster center as the mean of all data points in the cluster. 4. Repeat steps 2 and 3 until the cluster centers no longer change. -The images show the result of the first iteration, the result of the next iteration, and the final clustering after convergence. -You may have noticed that some points in the middle changed cluster between iterations. | Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006
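The four steps of the algorithm can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a reference implementation: it assumes squared Euclidean distance, takes the initial centers from the caller, and leaves the center of an empty cluster unchanged (a choice made here, not stated on the slides). The names `kmeans` and `squared_distance` and the toy data are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centers, max_iter=100):
    """Plain K-means; returns (labels, centers)."""
    centers = [tuple(c) for c in initial_centers]  # step 1: initial centers
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest cluster center.
        labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = []
        for j, old_center in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(coord) / len(members) for coord in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(old_center)  # assumption: keep empty cluster's center
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Toy data (made up for illustration), k = 2.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels, centers = kmeans(points, initial_centers=[(0.0, 0.0), (10.0, 10.0)])
```

Because the result depends on the initial centers, a common practice is to run the algorithm several times with different initializations and keep the run with the smallest sum of squared distances.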
Hierarchical Clustering: Agglomerative Algorithms. Agglomerative methods merge the two closest clusters in each step. Examples: single-link (SLINK), complete-link (CLINK). They differ in the definition of the distance between clusters. -Used if we want a hierarchy instead of a partition. -General algorithm: start with each point in its own cluster, then repeatedly merge the two clusters with the smallest distance. -SLINK: the cluster distance is the distance between the two closest points; clusters are merged regardless of the similarity inside them, so they can have a wide diameter. -CLINK: the cluster distance is the distance between the two furthest points; merging the pair with the smallest of these maximum distances gives clusters with a small diameter. | Image: Murphy, ‘Machine Learning: A Probabilistic Perspective’, 2012
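The difference between the two distance definitions is easiest to see on a small example. The sketch below is a naive agglomerative procedure in plain Python, assuming Euclidean distance; it is not the optimized SLINK or CLINK algorithm, only the general merge loop with the two linkage rules plugged in. All function names and the toy data are illustrative assumptions.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def single_link(c1, c2):
    """Cluster distance = distance between the two closest points."""
    return min(euclidean(a, b) for a in c1 for b in c2)


def complete_link(c1, c2):
    """Cluster distance = distance between the two furthest points."""
    return max(euclidean(a, b) for a in c1 for b in c2)


def agglomerative(points, linkage, num_clusters):
    """Start with one cluster per point; merge the closest pair until num_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy 1-D data (made up for illustration) with slowly growing gaps.
points = [(0,), (10,), (22,), (36,), (60,)]
single = agglomerative(points, single_link, num_clusters=3)
complete = agglomerative(points, complete_link, num_clusters=3)
```

On this data single-link chains the first three points into one elongated cluster, while complete-link forms two compact pairs, which matches the wide-diameter versus small-diameter behaviour described in the notes.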
Dendrograms. -A dendrogram visualizes the hierarchy of clusters. -Depending on the level at which it is cut, the data splits into a different number of clusters (e.g. level 5 -> 3). -Example: hierarchy of tumor subclasses of breast cancer. | Images: Jain, ‘Data Clustering: A Review’, 1999; Sorlie, ‘Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications’, 2001
Other Algorithms: Soft Clustering. Expectation Maximization (EM): models clusters with a combination of probability distributions (combination -> mixture model). Computes a maximum likelihood estimate of the model parameters. E-step: compute, for each point, the probability of belonging to each distribution under the current parameters. M-step: re-estimate the parameters of the distributions, i.e. optimize their shape and location. | Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006
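The alternation between the two steps can be sketched for the simplest case, a mixture of one-dimensional Gaussians. This is a minimal illustration under several assumptions that are not on the slides: Gaussian components, caller-supplied initial parameters, a fixed number of iterations, and a small variance floor as a safeguard. The names `gaussian_pdf` and `em_gmm_1d` and the toy data are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50):
    """EM for a 1-D Gaussian mixture; returns (means, variances, weights, responsibilities)."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    resp = []
    for _ in range(iterations):
        # E-step: probability of each component for each point (soft assignment).
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, location (mean), and shape (variance).
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # assumption: floor to avoid collapse
    return means, variances, weights, resp


# Toy data (made up for illustration): two groups around 1 and 5.
data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
means, variances, weights, resp = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])

# A hard clustering can be read off by taking the most probable component.
hard_labels = [max(range(2), key=lambda j: r[j]) for r in resp]
```

Unlike K-means, each point keeps a probability for every cluster; the hard labels at the end are only a convenience for comparison.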
Other Algorithms: DBSCAN. Density-Based Spatial Clustering of Applications with Noise. Density-based method. User input: the maximum distance within which points count as ‘reachable’, and the minimum number of points needed to form a dense region around a ‘core point’. -A different approach, related to graph theory. -Categorizes points as core points (red), reachable points (yellow), and outliers (blue). -Takes noise in the input into account, so it is robust to outliers.
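The roles of the two user inputs become clearer in code. The following is a compact, unoptimized sketch of DBSCAN in plain Python, assuming Euclidean distance and counting a point as part of its own neighbourhood. The parameter names `eps` and `min_pts`, the label `-1` for noise, and the toy data are conventions chosen for this illustration.

```python
def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [
            j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2
        ]

    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # not a core point; may become a border point later
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # reachable from a core point, but not core itself
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point too: keep expanding
                queue.extend(nbrs)
    return labels


# Toy data (made up for illustration): two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The number of clusters is never passed in; it follows from the density parameters, and the isolated point is reported as noise rather than forced into a cluster.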
-Now that we have seen how the algorithms work, let's compare them. -With k=3, k-means and hierarchical clustering always compute exactly 3 clusters. -The smiley's mouth is represented by one cluster instead of three. -DBSCAN discovers the number of clusters itself and produces the clustering we would expect. -In fact, in 2014 DBSCAN was awarded the ‘Test of Time Award’ at the Knowledge Discovery and Data Mining conference, a leading conference in data mining. | Image: http://www.machinelearningtutorial.net, May 2017
Applications: Recommender systems. For what is clustering needed? -Netflix, Spotify: you may have noticed that they make recommendations for you. -Recommender systems use clustering, combined with other methods. -An important tool in online marketing and in the personalization of online applications. | Images: Anna Reithmeir
Applications: Medical imaging. Gene expression analysis. Tumor identification. -Distinguish between cancerous and non-cancerous tissue. -Identify gene expression patterns in human DNA. | Image: http://www.medicaldaily.com, May 2017
Applications: Image segmentation. Image compression. -Segmentation: divide the image into regions of nearly similar color (original, k=10, k=3), which results in color reduction. -Compression: storing a cluster ID per pixel is far more efficient than storing RGB values for each pixel. -Others: speech and face recognition, search engines, predictive analytics. | Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006
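The compression claim can be checked with a little arithmetic. The sketch below compares the bits needed for raw RGB storage with the bits needed for one cluster ID per pixel plus the table of k cluster colors. The image size of 240 x 180 pixels, 8 bits per color channel, and the function names are assumptions made for this illustration.

```python
def raw_bits(num_pixels, channels=3, bits_per_channel=8):
    """Bits needed to store full RGB values for every pixel."""
    return num_pixels * channels * bits_per_channel


def clustered_bits(num_pixels, k, channels=3, bits_per_channel=8):
    """Bits needed to store one cluster ID per pixel plus the k cluster colors."""
    bits_per_id = max(1, (k - 1).bit_length())  # smallest width that can encode k IDs
    palette_bits = k * channels * bits_per_channel
    return num_pixels * bits_per_id + palette_bits


num_pixels = 240 * 180  # assumed image size
raw = raw_bits(num_pixels)
with_k10 = clustered_bits(num_pixels, k=10)
with_k3 = clustered_bits(num_pixels, k=3)
```

Under these assumptions the raw image needs 1,036,800 bits, while k=10 needs 173,040 bits and k=3 needs 86,472 bits, at the cost of the color reduction visible in the segmented images.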
Thank you for your attention! -We have now come to an end. -We have seen different clustering methods, each with advantages and disadvantages. -Especially as data is big and high-dimensional nowadays, the methods face new problems.