Hierarchical clustering & Graph theory


Hierarchical clustering & Graph theory Single-link, coloring example

Introduction What is clustering? Often considered the most important unsupervised learning problem: finding structure in a collection of unlabeled data The process of organizing objects into groups whose members are similar in some way Example of distance-based clustering

Goals of Clustering Data reduction: finding representatives for homogeneous groups Natural data types: finding natural clusters and describing their properties Useful data classes: finding useful and suitable groupings Outlier detection: finding unusual data objects

Applications Marketing Biology Libraries Insurance City-planning Earthquake studies

Requirements Scalability Dealing with different types of attributes Discovering clusters with arbitrary shape Minimal requirements for domain knowledge to determine input parameters Ability to deal with noise and outliers Insensitivity to order of input records Ability to handle high dimensionality Interpretability and usability

Clustering Algorithms Exclusive Clustering K-means Overlapping Clustering Fuzzy C-means Hierarchical Clustering Probabilistic Clustering Mixture of Gaussians

Hierarchical Clustering (Agglomerative) Given a set of N items to be clustered and an N*N distance matrix, the basic process of hierarchical clustering is: Step 1. Assign each item to its own cluster, so we have N clusters from N items; the distance between clusters equals the distance between the items they contain Step 2. Find the closest pair of clusters and merge them into a single cluster (leaving N-1 clusters) Step 3. Compute the distances between the new cluster and each of the old clusters Step 4. Repeat steps 2 and 3 until all N items are combined into a single cluster

Illustration Ryan Baker

Different ways to calculate inter-cluster distance Single-linkage clustering Shortest distance from any member of one cluster to any member of the other cluster Complete-linkage clustering Greatest distance from any member of one cluster to any member of the other cluster Average-linkage clustering Average distance from any member of one cluster to any member of the other cluster UCLUS method by R. D'Andrade Median distance from any member of one cluster to any member of the other cluster

Single-linkage clustering example Cluster cities http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html

To Start Calculate the N*N proximity matrix D=[d(i,j)]. The clusterings are assigned sequence numbers k from 0 to N-1, and L(k) is the level of the kth clustering.

Algorithm Summary Step 1. Begin with the disjoint clustering at level L(0)=0 and sequence number m=0 Step 2. Find the most similar (smallest distance) pair of clusters (r), (s) in the current clustering, according to d[(r),(s)]=min d[(i),(j)] Step 3. Increment the sequence number: m=m+1. Merge clusters (r) and (s) into a single cluster, and set the level of this new clustering to L(m)=d[(r),(s)] Step 4. Update the proximity matrix D by deleting the rows and columns for (r) and (s) and adding a new row and column for the combined cluster (r,s). The proximity between the new cluster (r,s) and an old cluster (k) is defined by d[(k),(r,s)]=min{d[(k),(r)], d[(k),(s)]} Step 5. Repeat from step 2 if m<N-1; otherwise stop, as all objects are now in one cluster

Iteration 0 The table is the distance matrix D=[d(i,j)]. m=0 and L(0)=0 for all clusters.

Iteration 1 Merge MI with TO into MI/TO, L(MI/TO)=138, m=1

Iteration 2 Merge NA with RM into NA/RM, L(NA/RM)=219, m=2

Iteration 3 Merge BA with NA/RM into BA/NA/RM, L(BA/NA/RM)=255, m=3

Iteration 4 Merge FI with BA/NA/RM into FI/BA/NA/RM, L(FI/BA/NA/RM)=268, m=4
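The iterations above can be reproduced with a short plain-Python sketch of the single-linkage algorithm. Only four matrix entries are quoted in the iterations themselves; the full distance matrix (road distances in km) is taken from the tutorial linked earlier, so treat those figures as an assumption from that source.

```python
# Naive single-linkage agglomerative clustering (O(n^3)), following the
# algorithm summary above: repeatedly merge the closest pair of clusters,
# then update the proximity matrix with the minimum rule.

def single_linkage(labels, dist):
    """labels: item names; dist: symmetric pairwise distances, dist[(a, b)]."""
    look = lambda p, a, b: p[(a, b)] if (a, b) in p else p[(b, a)]
    clusters = [(name,) for name in labels]     # L(0)=0: every item is its own cluster
    prox = {(a, b): look(dist, a[0], b[0])
            for i, a in enumerate(clusters) for b in clusters[i + 1:]}
    merges = []                                 # (level, new cluster) per step
    while len(clusters) > 1:
        (r, s), level = min(prox.items(), key=lambda kv: kv[1])
        merged = tuple(sorted(r + s))
        clusters = [c for c in clusters if c not in (r, s)]
        new_prox = {}
        for i, a in enumerate(clusters):
            for b in clusters[i + 1:]:          # distances among untouched clusters
                new_prox[(a, b)] = look(prox, a, b)
            # Minimum rule: d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)])
            new_prox[(a, merged)] = min(look(prox, a, r), look(prox, a, s))
        clusters.append(merged)
        prox = new_prox
        merges.append((level, merged))
    return merges

# Distance matrix from the tutorial linked above (road distances, km).
cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
dist = {("BA","FI"): 662, ("BA","MI"): 877, ("BA","NA"): 255, ("BA","RM"): 412,
        ("BA","TO"): 996, ("FI","MI"): 295, ("FI","NA"): 468, ("FI","RM"): 268,
        ("FI","TO"): 400, ("MI","NA"): 754, ("MI","RM"): 564, ("MI","TO"): 138,
        ("NA","RM"): 219, ("NA","TO"): 869, ("RM","TO"): 669}
for level, cluster in single_linkage(cities, dist):
    print(level, "/".join(cluster))
```

Running it prints one merge per line at levels 138 (MI/TO), 219 (NA/RM), 255 (BA/NA/RM), 268 (adding FI), and a final merge of all six cities at 295.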

Hierarchical tree (Dendrogram) The process can be summarized by the following hierarchical tree

Complete-link clustering Complete-link distance between clusters Ci and Cj is the maximum distance between any object in Ci and any object in Cj The distance is defined by the two most dissimilar objects

Group average clustering Group average distance between clusters Ci and Cj is the average distance between any object in Ci and any object in Cj
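As a minimal sketch, the two inter-cluster distances just defined can be written directly. The 1-D point values and the absolute-difference metric are made up for illustration; any metric would do.

```python
# Inter-cluster distances between two clusters Ci and Cj, as defined above.

def complete_link(ci, cj):
    """Maximum distance between any object in Ci and any object in Cj."""
    return max(abs(x - y) for x in ci for y in cj)

def group_average(ci, cj):
    """Average distance over all cross-cluster pairs."""
    return sum(abs(x - y) for x in ci for y in cj) / (len(ci) * len(cj))

ci, cj = [1, 2], [5, 9]
print(complete_link(ci, cj))   # most dissimilar pair: |1 - 9| = 8
print(group_average(ci, cj))   # (4 + 8 + 3 + 7) / 4 = 5.5
```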

Demo http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html

Comparison
Single-link: Advantage: can handle non-elliptical shapes. Disadvantages: sensitive to noise and outliers; produces long, elongated clusters.
Complete-link: Advantages: more balanced clusters; less susceptible to noise. Disadvantages: tends to break large clusters; all clusters tend to have the same diameter, so small clusters get merged with large ones.
Group average: Advantage: less susceptible to noise and outliers. Disadvantage: biased towards globular clusters.

Graph Theory A graph is an ordered pair G=(V,E) comprising a set V of vertices or nodes together with a set E of edges or lines. The order of a graph is |V|, the number of vertices The size of a graph is |E|, the number of edges The degree of a vertex is the number of edges that connect to it
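These definitions map directly onto a small adjacency-set representation. The vertex and edge names below are made up for illustration:

```python
# A graph G=(V,E) stored as an adjacency set, exposing the three
# quantities defined above: order |V|, size |E|, and vertex degree.

class Graph:
    def __init__(self, vertices, edges):
        self.adj = {v: set() for v in vertices}
        for u, v in edges:                 # undirected: record both directions
            self.adj[u].add(v)
            self.adj[v].add(u)

    def order(self):                       # |V|, the number of vertices
        return len(self.adj)

    def size(self):                        # |E|; each edge counted twice in adj
        return sum(len(n) for n in self.adj.values()) // 2

    def degree(self, v):                   # number of edges incident to v
        return len(self.adj[v])

g = Graph("ABCD", [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
print(g.order(), g.size(), g.degree("C"))  # -> 4 4 3
```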

Seven Bridges of Königsberg Problem Leonhard Euler, published in 1736 Can we find a walk through the land masses A, B, C, D that crosses each bridge exactly once?

Euler Paths and Euler Circuits Euler simplified the map of bridges to a graph An Euler path traverses every edge in the graph exactly once An Euler circuit is an Euler path that starts and ends at the same vertex

Theorems Theorem: A connected graph has an Euler path (but not an Euler circuit) if and only if it has exactly 2 vertices of odd degree Theorem: A connected graph has an Euler circuit if and only if all vertices have even degree
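Assuming the graph is connected, both theorems reduce to counting odd-degree vertices. A sketch (the helper name `euler_type` is made up):

```python
# Classify a connected graph by the parity of its vertex degrees,
# per the two theorems above.

def euler_type(degrees):
    """degrees: list of vertex degrees of a connected graph."""
    odd = sum(1 for d in degrees if d % 2 == 1)
    if odd == 0:
        return "Euler circuit"       # every vertex has even degree
    if odd == 2:
        return "Euler path"          # exactly two odd-degree vertices
    return "neither"

# Königsberg: the land masses have degrees 5, 3, 3, 3 -> four odd vertices,
# so no Euler path exists, which is Euler's 1736 result.
print(euler_type([5, 3, 3, 3]))      # neither
print(euler_type([2, 2, 2, 2]))      # a cycle: Euler circuit
print(euler_type([1, 2, 2, 1]))      # a simple path: Euler path
```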

Coloring Problem Class scheduling conflicts and their representation as a graph http://web.math.princeton.edu/math_alive/5/Notes2.pdf

Proper Coloring A proper coloring assigns colors to the vertices of a graph so that no two adjacent vertices have the same color. 4-coloring example: http://web.math.princeton.edu/math_alive/5/Notes2.pdf

Greedy Coloring Algorithm Step 1. Color a vertex with color 1 Step 2. Pick an uncolored vertex v and color it with the lowest-numbered color that has not been used on any previously colored vertex adjacent to v Step 3. Repeat step 2 until all vertices are colored
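The steps above can be sketched as follows; the 5-cycle test graph is made up for illustration:

```python
# Greedy coloring, following the steps above: visit vertices in order and
# give each the lowest-numbered color unused among its colored neighbors.

def greedy_coloring(adj, order=None):
    """adj: dict vertex -> set of neighbors; returns dict vertex -> color (1, 2, ...)."""
    color = {}
    for v in order or list(adj):
        used = {color[u] for u in adj[v] if u in color}
        c = 1
        while c in used:               # lowest color not used by a neighbor
            c += 1
        color[v] = c
    return color

# A 5-cycle: every vertex has degree d = 2.
cycle5 = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
coloring = greedy_coloring(cycle5)
print(max(coloring.values()))          # colors used
# Proper coloring check: adjacent vertices always differ.
print(all(coloring[u] != coloring[v] for u in cycle5 for v in cycle5[u]))
```

On this graph greedy uses 3 colors, which is exactly the d+1 upper bound for d = 2 (and, for an odd cycle, also the true chromatic number).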

Coloring Steps

Chromatic Number & Greedy Algorithm The chromatic number of a graph is the minimum number of colors in a proper coloring of that graph. Greedy Coloring Theorem: if d is the largest of the degrees of the vertices in a graph G, then G has a proper coloring with d+1 or fewer colors, i.e. the chromatic number of G is at most d+1 (an upper bound) The actual chromatic number may be smaller than this upper bound

Related Algorithms Bellman-Ford Dijkstra Ford-Fulkerson Nearest neighbor Depth-first search Breadth-first search Prim

Resources Math Alive lecture notes on graph theory (Princeton): http://web.math.princeton.edu/math_alive/5/Notes2.pdf A tutorial on clustering algorithms: http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html Andrew Moore, K-means and Hierarchical Clustering: http://www.autonlab.org/tutorials/kmeans.html Ryan S.J.d. Baker, Big Data in Education, week 7 video lectures, Coursera: https://class.coursera.org/bigdata-edu-001/lecture Chris Caldwell, Graph Theory Tutorials: http://www.utm.edu/departments/math/graph/