Clustering. Anna Reithmeir, Data Mining Proseminar 2017


Clustering. Anna Reithmeir, Data Mining Proseminar 2017 -we will now take a look at another important data mining technique

Content: What is clustering? Cluster models. Cluster algorithms: partitional clustering, hierarchical clustering, others. Applications. Anna Reithmeir | Data Mining Proseminar | Clustering

General Idea. Goal: find natural groupings in a given data set. Input: a multivariate dataset. Output: a clustering -i.e. a set of clusters | Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006

General Idea. Goal: find natural groupings in a given data set. Input: a multivariate dataset. Output: a clustering. Clustering is an unsupervised learning method -no information about labels is given -as a reminder: unsupervised learning methods describe hidden structure in the data | Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006


Cluster Models. Partitional clustering and hierarchical clustering -clustering can be approached from different perspectives and with different goals, therefore several cluster models exist

Cluster Models. Partitional clustering: produces a partition of the dataset. Hierarchical clustering: produces a hierarchy of clusters -in other words, a set of nested clusters

Cluster Models. Partitional clustering: produces a partition of the dataset. Hierarchical clustering: produces a hierarchy of clusters. Agglomerative methods merge clusters iteratively; divisive methods divide clusters iteratively -for each model, several algorithms have been introduced

Cluster models: partitional clustering; hierarchical clustering, which splits into divisive clustering and agglomerative clustering

Partitional Clustering: K-Means. Introduced by MacQueen in 1967. Centroid-based, hard clustering -one of the first clustering algorithms introduced -straightforward -centroid-based: clusters are represented through their centers

Partitional Clustering: K-Means. Minimizes the sum of squared distances from each data point to the mean of its cluster. The number of clusters, the distance function, and the initial cluster centers need to be specified -in other words: each point is assigned to the cluster whose center is nearest -the distance function can be, e.g., Euclidean or Hamming distance
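The minimized objective can be written as the within-cluster sum of squares (notation following Bishop: \(r_{nk}\) is a binary indicator that is 1 iff point \(\mathbf{x}_n\) is assigned to cluster \(k\), and \(\boldsymbol{\mu}_k\) is the mean of cluster \(k\)):

```latex
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2
```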

K-means Algorithm. Step by step, with k=2 | Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006

K-means Algorithm. 1. Initialize the cluster centers | Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006

K-means Algorithm. 1. Initialize the cluster centers 2. Assign each point to the nearest cluster | Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006

K-means Algorithm. 1. Initialize the cluster centers 2. Assign each point to the nearest cluster 3. Recompute the cluster centers as the mean of all data points in each cluster. Result of the first iteration | Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006

K-means Algorithm. 1. Initialize the cluster centers 2. Assign each point to the nearest cluster 3. Recompute the cluster centers as the mean of all data points in each cluster 4. Repeat steps 2 and 3 until the cluster centers do not change anymore. Result of the next iteration | Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006

K-means Algorithm. 1. Initialize the cluster centers 2. Assign each point to the nearest cluster 3. Recompute the cluster centers as the mean of all data points in each cluster 4. Repeat steps 2 and 3 until the cluster centers do not change anymore -final clustering after convergence -you may have noticed that some points in the middle changed clusters | Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006
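The four steps above can be sketched in a few lines of plain Python. This is a minimal toy version, not the original MacQueen formulation; in particular, initializing the centers with the first k points is an assumption made for illustration:

```python
def kmeans(points, k, iters=100):
    """Minimal k-means sketch: alternate assignment (step 2) and update (step 3)."""
    centers = points[:k]  # step 1: initialize (here: simply the first k points)
    for _ in range(iters):
        # step 2: assign each point to the nearest cluster center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # step 3: recompute each center as the mean of its assigned points
        new = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
               for i, cl in enumerate(clusters)]
        if new == centers:  # step 4: stop once the centers no longer change
            break
        centers = new
    return centers

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers = kmeans(pts, k=2)
```

On these two well-separated blobs the centers converge to roughly (0.1, 0.1) and (5.0, 5.0); with a poor initialization, k-means can get stuck in a local minimum of the sum-of-squares objective.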

Hierarchical Clustering: Agglomerative Algorithms. Agglomerative methods merge the two closest clusters in each step. Examples: single-link (SLINK) and complete-link (CLINK) -used if we want a hierarchy instead of a partition -general algorithm: start with each point in its own cluster, then repeatedly merge the pair of clusters with the smallest distance -SLINK: the distance between two clusters is the distance between their two closest points -CLINK: the distance between their two furthest points | Image: Murphy, ‘Machine Learning: A Probabilistic Perspective’, 2012

Hierarchical Clustering: Agglomerative Algorithms. Definition of the inter-cluster distance in SLINK vs. CLINK -SLINK merges regardless of the similarity inside a cluster, which can lead to clusters with a wide diameter -CLINK merges the pair with the smallest of the maximum pairwise distances, which leads to clusters with a small diameter | Image: Murphy, ‘Machine Learning: A Probabilistic Perspective’, 2012
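A naive agglomerative loop can be written directly from this definition (an O(n³) toy sketch for illustration; the actual SLINK/CLINK algorithms are far more efficient). Passing `min` as the linkage gives single-link, `max` gives complete-link:

```python
def agglomerate(points, k, linkage=min):
    """Merge clusters until k remain. linkage=min -> single-link (SLINK),
    linkage=max -> complete-link (CLINK)."""
    def d(p, q):  # Euclidean distance between two points
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    clusters = [[p] for p in points]  # start: each point is its own cluster
    while len(clusters) > k:
        # find the pair of clusters with the smallest linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = linkage(d(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the two closest clusters
    return clusters

pts = [(0,), (1,), (2,), (10,), (11,)]
result = agglomerate(pts, 2, linkage=min)
```

Recording the sequence of merges (instead of stopping at k clusters) yields exactly the dendrogram shown on the next slide.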

Dendrograms -the hierarchy can be visualized as a dendrogram -depending on the level at which the dendrogram is cut, the data splits into different numbers of clusters (e.g. cutting at level 5 yields 3 clusters) -example: a hierarchy of tumor subclasses of breast cancer | Images: Jain, ‘Data Clustering: A Review’, 1999; Sorlie, ‘Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications’, 2001

Other Algorithms: Soft Clustering. Expectation Maximization: models the clusters with a combination of probability distributions and computes a maximum likelihood estimate -such a combination is called a mixture model | Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006

Other Algorithms: Soft Clustering. Expectation Maximization: models the clusters with a combination of probability distributions and computes a maximum likelihood estimate. E-step: compute the expected cluster assignments under the current distribution parameters. M-step: re-estimate the distributions’ shape and location to maximize the likelihood | Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006

Other Algorithms: DBSCAN (Density-Based Spatial Clustering of Applications with Noise). A density-based method. User input: the maximum distance within which points count as ‘reachable’, and the minimum number of points that form a dense region of ‘core points’ -a different, density-based approach -categorizes points as core points (red), reachable points (yellow), and outliers (blue) -takes noise in the input into account, so it is robust to outliers
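The categorization into core points, reachable points, and outliers can be sketched directly (a minimal quadratic-time toy version; the parameter names `eps` and `min_pts` follow the usual DBSCAN convention):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: cluster IDs >= 0, noise/outliers get -1."""
    def neighbors(i):  # indices of all points within eps of point i (incl. itself)
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]
    labels = [None] * len(points)  # None = unvisited, -1 = noise
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1         # not dense enough: tentatively noise
            continue
        labels[i] = cid            # i is a core point: start a new cluster
        frontier = nbrs
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cid    # former noise becomes a reachable border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts: # j is also a core point: expand through it
                frontier += jn
        cid += 1
    return labels

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (9.0, 9.0)]
labels = dbscan(pts, eps=0.5, min_pts=3)
```

Note that the number of clusters is never passed in: the two dense blobs get cluster IDs 0 and 1, and the isolated point at (9, 9) is labeled -1, i.e. noise.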

-now that we have seen how the algorithms work, let us compare them -with k=3, k-means and hierarchical clustering always compute exactly 3 clusters -the smiley’s mouth is then represented by one cluster instead of three -DBSCAN discovers the number of clusters itself, matching what we would expect -in fact, DBSCAN was awarded the ‘Test of Time Award’ in 2014 at the Knowledge Discovery and Data Mining (KDD) conference, the leading conference in data mining | Image: http://www.machinelearningtutorial.net, May 2017

Applications: Recommender Systems. What is clustering needed for? -you may have noticed that Netflix and Spotify recommend content for you -recommender systems use clustering, combined with other methods -clustering is an important tool in online marketing and the personalization of online applications | Images: Anna Reithmeir

Applications: Medical Imaging. Gene expression analysis, tumor identification -distinguish between cancerous and non-cancerous tissue -identify gene expression patterns in human DNA | Image: http://www.medicaldaily.com, May 2017

Applications: Image Segmentation and Image Compression -segmentation divides an image into regions of nearly similar color (original, k=10, k=3), which results in a color reduction -compression: storing a cluster ID per pixel is far more efficient than storing the RGB values for each pixel -other applications: speech and face recognition, search engines, predictive analytics | Image: Bishop, ‘Pattern Recognition and Machine Learning’, 2006
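The claimed saving is easy to quantify with back-of-the-envelope arithmetic (the 640x480 resolution and the palette bookkeeping below are illustrative assumptions, not from the slides):

```python
import math

def compressed_bits(n_pixels, k, bits_per_channel=8):
    """Bits needed when each pixel stores only a cluster ID into a k-color palette."""
    index_bits = math.ceil(math.log2(k))     # bits per pixel for the cluster ID
    palette_bits = k * 3 * bits_per_channel  # the k RGB cluster centers themselves
    return n_pixels * index_bits + palette_bits

n = 640 * 480
raw_bits = n * 24  # plain 24-bit RGB per pixel
# k=10 needs 4 bits per pixel, k=3 only 2 bits per pixel
ratios = (raw_bits / compressed_bits(n, 10), raw_bits / compressed_bits(n, 3))
```

For the slide's k=10 and k=3 examples this comes out to roughly 6x and 12x smaller than raw 24-bit RGB, at the cost of the visible color reduction.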

Thank you for your attention! We have now come to an end. We have seen different clustering methods, each with its own advantages and disadvantages. Especially as data becomes big and high-dimensional nowadays, these methods face new challenges.