DATA MINING: from data to information. Ronald Westra, Dep. Mathematics, Maastricht University.

Presentation transcript:

DATA MINING: from data to information. Ronald Westra, Dep. Mathematics, Maastricht University

CLUSTERING AND CLUSTER ANALYSIS Data Mining Lecture IV [Chapter 8: section 8.4, and Chapter 9, from Principles of Data Mining by Hand, Mannila, Smyth]

1. Clustering versus Classification
- classification: assign a pre-determined label to a sample
- clustering: derive the relevant labels for classification from the structure of a given dataset
- clustering: maximal intra-cluster similarity and maximal inter-cluster dissimilarity
Objectives:
- 1. segmentation of space
- 2. find natural subclasses

Examples of Clustering and Classification 1. Computer Vision

Examples of Clustering and Classification: 1. Computer Vision

Example of Clustering and Classification: 1. Computer Vision

Examples of Clustering and Classification: 2. Types of chemical reactions

Examples of Clustering and Classification: 2. Types of chemical reactions

Voronoi Clustering Georgy Fedoseevich Voronoy

Voronoi Clustering A Voronoi diagram (also called a Voronoi tessellation, Voronoi decomposition, or Dirichlet tessellation) is a special kind of decomposition of a metric space determined by distances to a specified discrete set of objects in the space, e.g., a discrete set of points.
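This nearest-object rule is easy to express in code. Below is a minimal NumPy sketch, not part of the original slides; the site and query coordinates are invented for illustration:

```python
# Sketch: assigning query points to Voronoi cells by the nearest-site rule.
# Assumes Euclidean distance; coordinates below are made up for illustration.
import numpy as np

sites = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])   # the discrete set of objects
points = np.array([[1.0, 0.5], [3.5, 2.5], [0.2, 2.9]])  # query points

# For every point, its Voronoi cell is the index of the nearest site.
dists = np.linalg.norm(points[:, None, :] - sites[None, :, :], axis=2)  # (n_points, n_sites)
cell = dists.argmin(axis=1)
print(cell)  # [0 2 2] for the coordinates above
```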

Voronoi Clustering

Voronoi Clustering

Voronoi Clustering

Partitional Clustering [book section 9.4]
- score functions
- centroid
- intra-cluster distance
- inter-cluster distance
- C-means [book page 303]

k-means clustering (also: C-means) The k-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster, i.e., its coordinates are the arithmetic mean, for each dimension separately, over all the points in the cluster.

k-means clustering (also: C-means) Example: the data set has three dimensions and the cluster contains two points, X = (x1, x2, x3) and Y = (y1, y2, y3). The centroid Z = (z1, z2, z3) is then given by z1 = (x1 + y1)/2, z2 = (x2 + y2)/2 and z3 = (x3 + y3)/2.
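As a quick check of the example above (not from the slides; the coordinates are arbitrary), the same centroid computation in Python:

```python
# Centroid of two 3-dimensional points: component-wise arithmetic mean.
X = (1.0, 2.0, 3.0)
Y = (3.0, 4.0, 7.0)
Z = tuple((x + y) / 2 for x, y in zip(X, Y))
print(Z)  # (2.0, 3.0, 5.0)
```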

k-means clustering (also: C-means) This is the basic structure of the algorithm (J. MacQueen, 1967):
1. Randomly generate k clusters and determine the cluster centers, or directly generate k seed points as cluster centers.
2. Assign each point to the nearest cluster center.
3. Recompute the new cluster centers.
4. Repeat steps 2-3 until some convergence criterion is met (usually: the assignment hasn't changed).

C-means [book page 303]

while there are changes in the clusters C_k
   % form clusters
   for k = 1,…,K do
      C_k = { x : ||x – r_k|| ≤ ||x – r_l|| for all l ≠ k }
   end
   % compute new cluster centroids
   for k = 1,…,K do
      r_k = mean({ x : x ∈ C_k })
   end
end
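The pseudocode above translates almost directly into NumPy. The following is a minimal sketch, not the book's implementation; the initialisation, tie-breaking and stopping rule are simplified choices of my own:

```python
# Sketch of the C-means / k-means loop above in NumPy.
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # directly generate K seed points as cluster centers (here: K random data points)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    labels = None
    for _ in range(n_iter):
        # form clusters: assign each point to the nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignment hasn't changed: converged
        labels = new_labels
        # compute new cluster centroids: mean of the points in each cluster
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels

# toy usage: three Gaussian blobs in the plane
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 3])])
centers, labels = kmeans(X, K=3)
```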

k-means clustering (also: C-means) The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. However, it does not necessarily yield the same result on each run: the resulting clusters depend on the initial assignment. The k-means algorithm maximizes inter-cluster (equivalently, minimizes intra-cluster) variance, but does not ensure that the solution found is more than a local minimum of the variance.

k-means clustering

k-means clustering (also: C-means)

Fuzzy c-means One of the problems of the k-means algorithm is that it gives a hard partitioning of the data, that is to say, each point is attributed to one and only one cluster. But points on the edge of a cluster, or near another cluster, may not belong to that cluster as strongly as points in its center.

Fuzzy c-means Therefore, in fuzzy clustering each point does not belong to exactly one cluster, but has a degree of belonging to every cluster, as in fuzzy logic. For each point x we have a coefficient u_k(x) giving its degree of membership in the k-th cluster. Usually the coefficients are normalized so that they sum to one over the clusters, Σ_k u_k(x) = 1, so that u_k(x) can be read as the probability of belonging to cluster k.

Fuzzy c-means With fuzzy c-means, the centroid of a cluster is computed as the mean of all points, weighted by their degree of belonging to the cluster, that is: center_k = Σ_x u_k(x)^m · x / Σ_x u_k(x)^m.

Fuzzy c-means The degree of belonging to a cluster is taken inversely proportional to the distance from the point to the cluster center; the coefficients are then normalized and "fuzzyfied" with a real parameter m > 1 so that they sum to 1. So: u_k(x) = 1 / Σ_j ( d(center_k, x) / d(center_j, x) )^(2/(m–1)).

Fuzzy c-means For m equal to 2, this is equivalent to normalising the coefficients linearly so that their sum is 1. When m is close to 1, the cluster center closest to the point is given much more weight than the others, and the algorithm becomes similar to k-means.

Fuzzy c-means The fuzzy c-means algorithm is very similar to the k-means algorithm:

Fuzzy c-means
1. Choose a number of clusters.
2. Assign to each point random coefficients for being in the clusters.
3. Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold):
   - Compute the centroid for each cluster, using the formula above.
   - For each point, compute its coefficients of being in the clusters, using the formula above.

Fuzzy C-means

u_jk is the membership of sample j in cluster k
c_k is the centroid of cluster k

while there are changes in the clusters C_k
   % compute new memberships
   for k = 1,…,K do
      for j = 1,…,N do
         u_jk = f(x_j – c_k)
      end
   end
   % compute new cluster centroids (weighted means)
   for k = 1,…,K do
      c_k = Σ_j u_jk x_j / Σ_j u_jk
   end
end
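For comparison with the k-means sketch above, here is a minimal NumPy sketch of fuzzy c-means using the standard inverse-distance membership update with fuzzifier m; the parameter values and helper names are my own, not taken from the slides:

```python
# Sketch of the fuzzy c-means loop above.
import numpy as np

def fuzzy_cmeans(X, K, m=2.0, eps=1e-5, n_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # random membership coefficients, each row (point) summing to one
    U = rng.random((len(X), K))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # centroid of each cluster: mean of all points weighted by membership^m
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # memberships from inverse distances, fuzzified with exponent 2/(m - 1)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)              # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < eps:  # coefficients changed by less than eps
            U = U_new
            break
        U = U_new
    return centers, U
```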

Fuzzy c-means The fuzzy c-means algorithm minimizes intra-cluster variance as well, but it has the same problems as k-means: the minimum found may be only a local minimum, and the results depend on the initial choice of weights.

Fuzzy c-means

Fuzzy c-means

The Correct Number of Clusters Algorithms like C-means and fuzzy C-means need the “correct” number K of clusters in your data set. In realistic cases it is mostly impossible to define what this number K should be. Therefore, the following approach is often used.

The Correct Number of Clusters Sum all distances between the points and their respective centroid: E = Σ_k Σ_{x ∈ C_k} d(x, r_k).

The Correct Number of Clusters Now plot this error E as a function of the number of clusters K. [plot: E versus K; the bend ("shoulder") in the curve marks the natural number of clusters]

The Correct Number of Clusters Note that adding clusters beyond the natural number of clusters in your data set hardly decreases the error any further; the shoulder in the plot therefore indicates the natural K. Now, how to define the error of your clustering? A solution is to sum all distances between the points and their respective centroid.
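A small sketch of this procedure, reusing the kmeans() sketch shown earlier (the error function name and the range of K are illustrative assumptions):

```python
# Sketch: compute the clustering error E(K) for a range of K and look for the shoulder.
import numpy as np

def clustering_error(X, centers, labels):
    # E = sum of distances between the points and their respective centroid
    return np.linalg.norm(X - centers[labels], axis=1).sum()

# errors = {K: clustering_error(X, *kmeans(X, K)) for K in range(1, 10)}
# Plotting errors against K, the shoulder suggests the natural number of clusters.
```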

Hierarchical Clustering [book section 9.5]
One major problem with partitional clustering is that the number of clusters (= #classes) must be pre-specified!!!
This poses the question: what IS the real number of clusters in a given set of data? Answer: it depends!
- Agglomerative methods: bottom-up
- Divisive methods: top-down
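A minimal sketch of agglomerative (bottom-up) clustering; SciPy is not mentioned in the slides and is used here only as a convenient illustration, with invented sample data:

```python
# Sketch: agglomerative hierarchical clustering with SciPy's average linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(30, 2)) for c in ([0, 0], [4, 0], [2, 3])])

Z = linkage(X, method='average')                  # repeatedly merge the two closest clusters
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the resulting tree into 3 clusters
print(labels[:5])
```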

Hierarchical Clustering Agglomerative hierarchical clustering

Hierarchical Clustering

Hierarchical Clustering

Hierarchical Clustering

Introduction to Bioinformatics 7.3 INFERRING TREES
* n taxa {t_1, …, t_n}
* D: matrix of pairwise genetic distances (+ JC correction)
* Additive distances: the distance over the path from i → j is d(i,j)
* (total) length of a tree: sum of all branch lengths

Introduction to Bioinformatics 7.3 INFERRING TREES
Finding branch lengths, the three-point formula. For three taxa A, B, C joined to a centre node by branches of length L_x, L_y, L_z (see figure):
L_x + L_y = d_AB
L_x + L_z = d_AC
L_y + L_z = d_BC
Solving gives:
L_x = (d_AB + d_AC – d_BC)/2
L_y = (d_AB + d_BC – d_AC)/2
L_z = (d_AC + d_BC – d_AB)/2
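A tiny worked example of the three-point formula (the three distances are invented):

```python
# Branch lengths from the three-point formula, using invented pairwise distances.
d_AB, d_AC, d_BC = 0.30, 0.50, 0.40
L_x = (d_AB + d_AC - d_BC) / 2   # branch from A to the centre
L_y = (d_AB + d_BC - d_AC) / 2   # branch from B to the centre
L_z = (d_AC + d_BC - d_AB) / 2   # branch from C to the centre
print(L_x, L_y, L_z)             # about 0.2, 0.1, 0.3; indeed L_x + L_y == d_AB, etc.
```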

Introduction to Bioinformatics 7.3 INFERRING TREES
Four-point formula. Four-point condition: d(1,2) + d(i,j) < d(i,1) + d(2,j) when (1,2) and (i,j) are neighbour couples.
Define R_i = Σ_j d(t_i, t_j) and M(i,j) = (n–2) d(i,j) – R_i – R_j  (this minimizes d(i,j) AND the total distance in the tree).
If i and j are neighbours, then M(i,j) < M(i,k) for all k not equal to j.
[figure: tree with the neighbour pairs (1,2) and (i,j) joined through the centre, branch lengths L_x, L_y, L_z, L_q]

NJ algorithm:
Input: n×n distance matrix D and an outgroup.
Output: rooted phylogenetic tree T.
Step 1: Compute the table M from D; select the smallest value of M to choose the two taxa to join.
Step 2: Join the two taxa t_i and t_j into a new vertex V; use the three-point formula to calculate the updated distance matrix D’ in which t_i and t_j are replaced by V.
Step 3: Compute the branch lengths from t_i and t_j to V using the three-point formula; set T(V,1) = t_i, T(V,2) = t_j, TD(t_i) = L(t_i,V) and TD(t_j) = L(t_j,V).
Step 4: The distance matrix D’ now contains n – 1 taxa. If more than 2 taxa are left, go to step 1. If two taxa are left, join them by a branch of length d(t_i,t_j).
Step 5: Define the root node as the branch connecting the outgroup to the rest of the tree. (Alternatively, determine the so-called "mid-point".)
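A compact sketch of one common formulation of this loop in Python; the dict-of-dicts data structure, the branch-length formula L_i = d(i,j)/2 + (R_i – R_j)/(2(n–2)) and the toy distances are my own choices, and the rooting on the outgroup (step 5) is only indicated by a comment:

```python
# Sketch of the neighbour-joining loop described in steps 1-4 above.
import itertools

def neighbour_joining(D):
    """D: symmetric {taxon: {taxon: distance}}; returns a list of (node, child, branch length)."""
    D = {a: dict(D[a]) for a in D}          # work on a copy
    edges = []
    while len(D) > 2:
        taxa = list(D)
        n = len(taxa)
        R = {a: sum(D[a][b] for b in taxa if b != a) for a in taxa}
        # M(i, j) = (n - 2) d(i, j) - R_i - R_j ; join the pair with the smallest M
        i, j = min(itertools.combinations(taxa, 2),
                   key=lambda p: (n - 2) * D[p[0]][p[1]] - R[p[0]] - R[p[1]])
        # branch lengths from the joined taxa to the new vertex V
        L_i = 0.5 * D[i][j] + (R[i] - R[j]) / (2 * (n - 2))
        L_j = D[i][j] - L_i
        V = (i, j)
        edges += [(V, i, L_i), (V, j, L_j)]
        # three-point formula gives the distance from V to every remaining taxon
        D[V] = {}
        for k in taxa:
            if k in (i, j):
                continue
            D[V][k] = D[k][V] = 0.5 * (D[i][k] + D[j][k] - D[i][j])
        for a in (i, j):
            del D[a]
            for k in D:
                D[k].pop(a, None)
    # two taxa left: join them by a single branch
    a, b = list(D)
    edges.append((a, b, D[a][b]))
    return edges   # root afterwards on the branch leading to the chosen outgroup

# Toy example: additive distances from the tree ((A:1, B:1):1, (C:2, D:4))
D = {'A': {'B': 2, 'C': 4, 'D': 6},
     'B': {'A': 2, 'C': 4, 'D': 6},
     'C': {'A': 4, 'B': 4, 'D': 6},
     'D': {'A': 6, 'B': 6, 'C': 6}}
print(neighbour_joining(D))
```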

Introduction to Bioinformatics 7.3 INFERRING TREES UPGMA and ultrametric trees: If the distance from the root to all leaves is equal, the tree is ultrametric. In that case we can use D instead of M, and the algorithm is called UPGMA (Unweighted Pair Group Method with Arithmetic mean). Ultrametricity must hold for the real tree, but due to noise this condition is violated in practice, which generates erroneous trees.
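UPGMA amounts to average-linkage clustering of the distance matrix; a minimal sketch with SciPy (an assumption, not mentioned in the slides), applied to the same toy distance matrix as above:

```python
# Sketch: UPGMA as average-linkage clustering on a condensed distance matrix.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

D = np.array([[0, 2, 4, 6],
              [2, 0, 4, 6],
              [4, 4, 0, 6],
              [6, 6, 6, 0]], dtype=float)       # toy distances for taxa A, B, C, D
tree = linkage(squareform(D), method='average')  # UPGMA merge order and heights
print(tree)
```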

Example of Clustering and Classification

1. Clustering versus Classification
- classification: assign a pre-determined label to a sample
- clustering: derive the relevant labels for classification from the structure of a given dataset
- clustering: maximal intra-cluster similarity and maximal inter-cluster dissimilarity
Objectives:
- 1. segmentation of space
- 2. find natural subclasses