
1 CLUSTERING
• Basic concepts: In clustering, or unsupervised learning, no training data with class labels are available. The goal becomes: group the data into a number of sensible clusters (groups). This unravels similarities and differences among the available data.
• Applications: engineering, bioinformatics, social sciences, medicine, data and web mining.
• To perform clustering of a data set, a clustering criterion must first be adopted. Different clustering criteria lead, in general, to different clusterings.

2 • A simple example
• Clustering criterion: the way the animals bear their progeny
1. {blue shark, sheep, cat, dog}
2. {lizard, sparrow, viper, seagull, goldfish, frog, red mullet}
• Clustering criterion: existence of lungs
1. {goldfish, red mullet, blue shark}
2. {sheep, sparrow, dog, cat, seagull, lizard, frog, viper}
• The same animals, under two different clustering criteria, yield two different two-cluster partitions.

3 • Clustering task stages
• Feature selection: choose information-rich features; aim for parsimony.
• Proximity measure: quantifies the terms "similar" and "dissimilar".
• Clustering criterion: a cost function or some other type of rule.
• Clustering algorithm: the set of steps followed to reveal the structure, based on the proximity measure and the adopted criterion.
• Validation of the results.
• Interpretation of the results.

4 • Depending on the similarity measure, the clustering criterion, and the clustering algorithm, different clusters may result. Subjectivity is a reality to live with from now on.
• A simple example (figure omitted): how many clusters are there, 2 or 4?

5 • Basic application areas for clustering
• Data reduction: all data vectors within a cluster are substituted (represented) by the corresponding cluster representative.
• Hypothesis generation.
• Hypothesis testing.
• Prediction based on groups.

6 • Clustering definitions
• Hard clustering: each point belongs to a single cluster.
Let $X = \{x_1, x_2, \ldots, x_N\}$. An $m$-clustering $R$ of $X$ is defined as the partition of $X$ into $m$ sets (clusters) $C_1, C_2, \ldots, C_m$, so that:
– $C_i \neq \emptyset$, $i = 1, \ldots, m$
– $\bigcup_{i=1}^{m} C_i = X$
– $C_i \cap C_j = \emptyset$, for $i \neq j$, $i, j = 1, \ldots, m$
In addition, data in $C_i$ are more similar to each other and less similar to the data in the rest of the clusters. Quantifying the terms "similar" and "dissimilar" depends on the types of clusters that are expected to underlie the structure of $X$.
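The three partition conditions can be checked mechanically. Below is a minimal Python sketch; the function name is_hard_clustering and the toy data are illustrative choices, not from the slides:

```python
def is_hard_clustering(X, clusters):
    """Check the three conditions of an m-clustering of X:
    non-empty clusters, union equal to X, pairwise disjointness."""
    if any(len(c) == 0 for c in clusters):      # C_i != empty set
        return False
    if set().union(*clusters) != set(X):        # union of all C_i == X
        return False
    total = sum(len(c) for c in clusters)
    return total == len(set(X))                 # sizes add up <=> disjoint

X = ["shark", "sheep", "cat", "dog", "lizard"]
clusters = [{"shark", "sheep", "cat", "dog"}, {"lizard"}]
print(is_hard_clustering(X, clusters))  # True
```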

7 • Fuzzy clustering: each point belongs to all clusters up to some degree. A fuzzy clustering of $X$ into $m$ clusters is characterized by $m$ functions
$u_j : X \rightarrow [0, 1]$, $j = 1, \ldots, m$, such that
$\sum_{j=1}^{m} u_j(x_i) = 1$ for $i = 1, \ldots, N$, and $0 < \sum_{i=1}^{N} u_j(x_i) < N$ for $j = 1, \ldots, m$.

8 These are known as membership functions. Thus, each $x_i$ belongs to any cluster "up to some degree", depending on the value of $u_j(x_i)$.
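One common way to store such a clustering is an $N \times m$ membership matrix whose $i$-th row holds $u_1(x_i), \ldots, u_m(x_i)$. The sketch below (layout and names are mine) checks the two properties just stated:

```python
import numpy as np

# Membership matrix U: U[i, j] = degree to which x_i belongs to cluster j.
U = np.array([[0.9, 0.1],
              [0.6, 0.4],
              [0.2, 0.8]])

# Each point's memberships over the m clusters sum to 1 ...
assert np.allclose(U.sum(axis=1), 1.0)
# ... and every membership value lies in [0, 1].
assert ((0.0 <= U) & (U <= 1.0)).all()

# Hard clustering is the special case where each row is an indicator vector.
hard_labels = U.argmax(axis=1)
print(hard_labels)  # [0 0 1]
```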

9 TYPES OF FEATURES
• With respect to their domain:
– Continuous (the domain is a continuous subset of $\mathbb{R}$).
– Discrete (the domain is a finite discrete set). Binary or dichotomous features are the special case where the domain consists of two possible values.
• With respect to the relative significance of the values they take:
– Nominal (the values code states, e.g., the sex of an individual).
– Ordinal (the values are meaningfully ordered, e.g., the rating of the services of a hotel: poor, good, very good, excellent).
– Interval-scaled (the difference of two values is meaningful but their ratio is meaningless, e.g., temperature).
– Ratio-scaled (the ratio of two values is meaningful, e.g., weight).

10 PROXIMITY MEASURES
• Between vectors
• A dissimilarity measure (between vectors of $X$) is a function $d : X \times X \rightarrow \mathbb{R}$ with the following properties:
– there exists $d_0$ such that $-\infty < d_0 \leq d(x, y) < +\infty$, for all $x, y \in X$
– $d(x, x) = d_0$, for all $x \in X$
– $d(x, y) = d(y, x)$, for all $x, y \in X$

11 If, in addition,
– $d(x, y) = d_0$ if and only if $x = y$, and
– $d(x, z) \leq d(x, y) + d(y, z)$, for all $x, y, z \in X$ (triangular inequality),
then $d$ is called a metric dissimilarity measure.

12 • A similarity measure (between vectors of $X$) is a function $s : X \times X \rightarrow \mathbb{R}$ with the following properties:
– there exists $s_0$ such that $-\infty < s(x, y) \leq s_0 < +\infty$, for all $x, y \in X$
– $s(x, x) = s_0$, for all $x \in X$
– $s(x, y) = s(y, x)$, for all $x, y \in X$

13 If, in addition,
– $s(x, y) = s_0$ if and only if $x = y$, and
– $s(x, y)\, s(y, z) \leq [s(x, y) + s(y, z)]\, s(x, z)$, for all $x, y, z \in X$,
then $s$ is called a metric similarity measure.
• Between sets: Let $D_i \subset X$, $i = 1, \ldots, k$, and $U = \{D_1, \ldots, D_k\}$. A proximity measure $\wp$ on $U$ is a function $\wp : U \times U \rightarrow \mathbb{R}$. A dissimilarity measure on $U$ has to satisfy the relations required of a dissimilarity measure between vectors, with the $D_i$'s used in place of $x, y$ (similarly for similarity measures).
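As a quick illustration of these axioms, the sketch below numerically spot-checks that the Euclidean distance behaves as a metric dissimilarity measure on random data; a sanity check, not a proof, and all names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 4))          # 50 random vectors in R^4
d = lambda a, b: np.linalg.norm(a - b)  # Euclidean distance, d_0 = 0

for x in pts:
    assert np.isclose(d(x, x), 0.0)                  # d(x, x) = d_0
for x, y in zip(pts, pts[::-1]):
    assert np.isclose(d(x, y), d(y, x))              # symmetry
for x, y, z in zip(pts, pts[1:], pts[2:]):
    assert d(x, z) <= d(x, y) + d(y, z) + 1e-12      # triangular inequality
print("Euclidean distance passes the metric spot checks")
```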

14 PROXIMITY MEASURES BETWEEN VECTORS
• Real-valued vectors
• Dissimilarity measures (DMs)
– Weighted $l_p$ metric DMs: $d_p(x, y) = \left( \sum_{i=1}^{l} w_i |x_i - y_i|^p \right)^{1/p}$
Interesting instances are obtained for:
– $p = 1$ (weighted Manhattan norm)
– $p = 2$ (weighted Euclidean norm)
– $p = \infty$: $d_\infty(x, y) = \max_{1 \leq i \leq l} w_i |x_i - y_i|$
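A NumPy sketch of the weighted $l_p$ metric and its three instances; the helper name weighted_lp is an illustrative choice:

```python
import numpy as np

def weighted_lp(x, y, w, p):
    """d_p(x, y) = (sum_i w_i * |x_i - y_i|^p)^(1/p);
    p = inf gives max_i w_i * |x_i - y_i|."""
    d = np.abs(x - y)
    if np.isinf(p):
        return np.max(w * d)
    return np.sum(w * d ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])
w = np.ones(3)                       # all weights 1: the unweighted case
print(weighted_lp(x, y, w, 1))       # Manhattan: 3.0
print(weighted_lp(x, y, w, 2))       # Euclidean: ~2.236
print(weighted_lp(x, y, w, np.inf))  # max-norm: 2.0
```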

15 Other measures:
– $d_G(x, y) = -\log_{10} \left( 1 - \frac{1}{l} \sum_{j=1}^{l} \frac{|x_j - y_j|}{b_j - a_j} \right)$, where $b_j$ and $a_j$ are the maximum and the minimum values of the $j$-th feature among the vectors of $X$ (dependence on the current data set)
– $d_Q(x, y) = \sqrt{ \frac{1}{l} \sum_{j=1}^{l} \left( \frac{x_j - y_j}{x_j + y_j} \right)^2 }$

16 • Similarity measures
– Inner product: $s_{\text{inner}}(x, y) = x^T y = \sum_{i=1}^{l} x_i y_i$
– Tanimoto measure: $s_T(x, y) = \dfrac{x^T y}{\|x\|^2 + \|y\|^2 - x^T y}$
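Both similarity measures are one-liners in NumPy; the sketch below (the function name tanimoto and the toy vectors are mine) shows them on a toy pair:

```python
import numpy as np

def tanimoto(x, y):
    """Tanimoto similarity: x.y / (||x||^2 + ||y||^2 - x.y)."""
    dot = np.dot(x, y)
    return dot / (np.dot(x, x) + np.dot(y, y) - dot)

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(np.dot(x, y))    # inner-product similarity: 14.0
print(tanimoto(x, y))  # 1.0 for identical vectors
```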

17 • Discrete-valued vectors
• Let $F = \{0, 1, \ldots, k-1\}$ be a set of symbols and $X = \{x_1, \ldots, x_N\} \subset F^l$.
• Let $A(x, y) = [a_{ij}]$, $i, j = 0, 1, \ldots, k-1$, where $a_{ij}$ is the number of places where $x$ has the $i$-th symbol and $y$ has the $j$-th symbol. NOTE: Several proximity measures can be expressed as combinations of the elements of $A(x, y)$.
• Dissimilarity measures:
– The Hamming distance (number of places where $x$ and $y$ differ): $d_H(x, y) = \sum_{i=0}^{k-1} \sum_{j=0,\, j \neq i}^{k-1} a_{ij}$, the sum of the off-diagonal elements of $A(x, y)$
– The $l_1$ distance: $d_1(x, y) = \sum_{i=1}^{l} |x_i - y_i|$
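A short Python sketch that builds $A(x, y)$ and reads the Hamming distance off its off-diagonal elements; names and toy data are illustrative:

```python
import numpy as np

def contingency(x, y, k):
    """A(x, y): a[i, j] = number of places where x holds symbol i
    and y holds symbol j (symbols drawn from {0, ..., k-1})."""
    a = np.zeros((k, k), dtype=int)
    for xi, yi in zip(x, y):
        a[xi, yi] += 1
    return a

x = [0, 1, 2, 1, 0]
y = [0, 2, 2, 1, 1]
A = contingency(x, y, k=3)
hamming = A.sum() - np.trace(A)  # off-diagonal entries count the mismatches
print(hamming)                   # 2
```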

18 • Similarity measures:
– Tanimoto measure: $s_T(x, y) = \dfrac{\sum_{i=1}^{k-1} a_{ii}}{n_x + n_y - \sum_{i=1}^{k-1} a_{ii}}$, where $n_x$ and $n_y$ are the numbers of nonzero coordinates of $x$ and $y$, respectively.
– Measures that exclude $a_{00}$: $\dfrac{\sum_{i=1}^{k-1} a_{ii}}{l}$ and $\dfrac{\sum_{i=1}^{k-1} a_{ii}}{l - a_{00}}$
– Measures that include $a_{00}$: $\dfrac{\sum_{i=0}^{k-1} a_{ii}}{l}$

19 • Mixed-valued vectors
Some of the coordinates of the vectors $x$ are real-valued and the rest are discrete. Methods for measuring the proximity between two such vectors $x_i$ and $x_j$:
• Adopt a proximity measure (PM) suitable for real-valued vectors.
• Convert the real-valued features to discrete ones and employ a discrete PM.
The more general case of mixed-valued vectors:
• Here nominal, ordinal, interval-scaled, and ratio-scaled features are treated separately.

20 The similarity function between $x_i$ and $x_j$ is:
$s(x_i, x_j) = \dfrac{\sum_{q=1}^{l} s_q(x_i, x_j)\, w_q}{\sum_{q=1}^{l} w_q}$
In the above definition:
• $w_q = 0$ if at least one of the $q$-th coordinates of $x_i$ and $x_j$ is undefined, or if both $q$-th coordinates are equal to 0. Otherwise $w_q = 1$.
• If the $q$-th coordinates are binary, $s_q(x_i, x_j) = 1$ if $x_{iq} = x_{jq} = 1$, and 0 otherwise.
• If the $q$-th coordinates are nominal or ordinal, $s_q(x_i, x_j) = 1$ if $x_{iq} = x_{jq}$, and 0 otherwise.
• If the $q$-th coordinates are interval- or ratio-scaled, $s_q(x_i, x_j) = 1 - \dfrac{|x_{iq} - x_{jq}|}{r_q}$, where $r_q$ is the length of the interval where the $q$-th coordinates of the vectors of the data set $X$ lie.
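This is essentially Gower's similarity coefficient. A Python sketch under the stated rules; the encoding of feature kinds, the use of None for an undefined coordinate, and all names are my assumptions:

```python
def mixed_similarity(xi, xj, kinds, ranges):
    """Gower-style similarity for mixed-valued vectors.
    kinds[q] is 'binary', 'nominal', or 'interval' for coordinate q;
    ranges[q] is r_q, the spread of coordinate q over the data set X;
    None marks an undefined (missing) coordinate."""
    num, den = 0.0, 0.0
    for q, kind in enumerate(kinds):
        a, b = xi[q], xj[q]
        # w_q = 0 when a coordinate is undefined or both coordinates equal 0.
        if a is None or b is None or (a == 0 and b == 0):
            continue
        den += 1.0  # w_q = 1
        if kind in ("binary", "nominal"):
            num += 1.0 if a == b else 0.0
        else:  # interval- or ratio-scaled
            num += 1.0 - abs(a - b) / ranges[q]
    return num / den

xi = [1, "red", 20.0]
xj = [1, "blue", 25.0]
kinds = ["binary", "nominal", "interval"]
print(mixed_similarity(xi, xj, kinds, ranges=[None, None, 50.0]))  # ~0.633
```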

21 • Fuzzy measures
Let $x, y \in [0, 1]^l$. Here the value of the $i$-th coordinate $x_i$ of $x$ is not the outcome of a measuring device.
• The closer the coordinate $x_i$ is to 1 (0), the more likely the vector $x$ possesses (does not possess) the $i$-th characteristic.
• As $x_i$ approaches 0.5, the certainty about whether $x$ possesses the $i$-th feature decreases.
A possible similarity measure that can quantify the above is:
$s(x_i, y_i) = \max(\min(1 - x_i, 1 - y_i), \min(x_i, y_i))$
Then:
$s_F^q(x, y) = \left( \sum_{i=1}^{l} s(x_i, y_i)^q \right)^{1/q}$
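A sketch of this measure as reconstructed above; treat the formula and the helper name fuzzy_similarity as assumptions:

```python
import numpy as np

def fuzzy_similarity(x, y, q=1):
    """Coordinate-wise s = max(min(1-x_i, 1-y_i), min(x_i, y_i)),
    aggregated as (sum_i s_i^q)^(1/q)."""
    s = np.maximum(np.minimum(1 - x, 1 - y), np.minimum(x, y))
    return np.sum(s ** q) ** (1.0 / q)

x = np.array([0.9, 0.1])
y = np.array([0.8, 0.2])
print(fuzzy_similarity(x, y))  # 1.6: both coordinates agree strongly
```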

22 • Missing data
For some vectors of the data set $X$, some feature values are unknown. Ways to face the problem:
• Discard all vectors with missing values (not recommended for small data sets).
• Find the mean value $m_i$ of the available $i$-th feature values over the data set and substitute the missing $i$-th feature values with $m_i$.
• Define $b_i = 0$ if both the $i$-th features $x_i, y_i$ are available, and 1 otherwise. Then
$\varphi(x, y) = \dfrac{l}{l - \sum_{i=1}^{l} b_i} \sum_{\text{all } i:\, b_i = 0} \varphi(x_i, y_i)$,
where $\varphi(x_i, y_i)$ denotes the PM between the two scalars $x_i$, $y_i$.
• Find the average proximity $\varphi_{\text{avg}}(i)$ between all feature vectors in $X$ along each component $i$. Then
$\psi(x, y) = \sum_{i=1}^{l} \psi(x_i, y_i)$,
where $\psi(x_i, y_i) = \varphi(x_i, y_i)$ if both $x_i$ and $y_i$ are available, and $\psi(x_i, y_i) = \varphi_{\text{avg}}(i)$ otherwise.
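A sketch of the third approach, using np.nan to mark missing values; the representation and names are mine:

```python
import numpy as np

def proximity_missing(x, y, phi):
    """Proximity between vectors with missing coordinates (np.nan),
    rescaled by l / (l - sum b_i) over the coordinates present in both."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    avail = ~(np.isnan(x) | np.isnan(y))  # b_i = 0 where both are present
    l = x.size
    partial = sum(phi(a, b) for a, b in zip(x[avail], y[avail]))
    return (l / avail.sum()) * partial

x = [1.0, np.nan, 3.0, 4.0]
y = [2.0, 5.0, np.nan, 6.0]
print(proximity_missing(x, y, phi=lambda a, b: abs(a - b)))  # 4/2 * 3 = 6.0
```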

23 PROXIMITY FUNCTIONS BETWEEN A VECTOR AND A SET
• Let $X = \{x_1, x_2, \ldots, x_N\}$, $C \subset X$, and $x \in X$.
• All points of $C$ contribute to the definition of $\wp(x, C)$:
– Max proximity function: $\wp_{\max}(x, C) = \max_{y \in C} \wp(x, y)$
– Min proximity function: $\wp_{\min}(x, C) = \min_{y \in C} \wp(x, y)$
– Average proximity function: $\wp_{\text{avg}}(x, C) = \dfrac{1}{n_C} \sum_{y \in C} \wp(x, y)$, where $n_C$ is the cardinality of $C$
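All three functions reduce to one pass over $C$. A Python sketch; the names, the Euclidean choice of proximity, and the toy data are illustrative:

```python
import numpy as np

def prox_to_set(x, C, prox, mode="avg"):
    """Max, min, and average proximity between a vector x and a set C,
    given a proximity measure prox between vectors."""
    values = [prox(x, y) for y in C]
    if mode == "max":
        return max(values)
    if mode == "min":
        return min(values)
    return sum(values) / len(C)  # average: divide by n_C

euclid = lambda a, b: float(np.linalg.norm(a - b))
C = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
x = np.array([0.5, 0.0])
print(prox_to_set(x, C, euclid, "min"))  # 0.5
print(prox_to_set(x, C, euclid, "avg"))  # 1.0
```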

24 • A representative $r_C$ of $C$ contributes to the definition of $\wp(x, C)$. In this case: $\wp(x, C) = \wp(x, r_C)$.
Typical representatives are:
– The mean vector: $m_p = \dfrac{1}{n_C} \sum_{y \in C} y$, where $n_C$ is the cardinality of $C$.
– The mean center: $m_c \in C$ such that $\sum_{y \in C} d(m_c, y) \leq \sum_{y \in C} d(z, y)$ for all $z \in C$, where $d$ is a dissimilarity measure.
– The median center: $m_{\text{med}} \in C$ such that $\text{med}(d(m_{\text{med}}, y) \mid y \in C) \leq \text{med}(d(z, y) \mid y \in C)$ for all $z \in C$.
NOTE: Other representatives (e.g., hyperplanes, hyperspheres) are useful in certain applications (e.g., object identification using clustering techniques).
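A sketch of the three representatives (names and toy data are mine); note how the mean vector need not belong to $C$, while the mean and median centers do:

```python
import numpy as np

def mean_vector(C):
    """m_p = (1 / n_C) * sum of the vectors in C (need not belong to C)."""
    return np.mean(C, axis=0)

def mean_center(C, d):
    """Point of C minimizing the summed dissimilarity to all of C."""
    return min(C, key=lambda z: sum(d(z, y) for y in C))

def median_center(C, d):
    """Point of C minimizing the median dissimilarity to all of C."""
    return min(C, key=lambda z: np.median([d(z, y) for y in C]))

d = lambda a, b: float(np.linalg.norm(a - b))
C = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([10.0, 0.0])]
print(mean_vector(C))     # [3.667, 0.]: pulled toward the outlier
print(mean_center(C, d))  # [1., 0.]: constrained to lie inside C
```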

25 PROXIMITY FUNCTIONS BETWEEN SETS
• Let $X = \{x_1, \ldots, x_N\}$, $D_i, D_j \subset X$, and $n_i = |D_i|$, $n_j = |D_j|$.
• All points of each set contribute to $\wp(D_i, D_j)$:
– Max proximity function (a measure, but not a metric, and only if $\wp$ is a similarity measure): $\wp_{\max}(D_i, D_j) = \max_{x \in D_i,\, y \in D_j} \wp(x, y)$
– Min proximity function (a measure, but not a metric, and only if $\wp$ is a dissimilarity measure): $\wp_{\min}(D_i, D_j) = \min_{x \in D_i,\, y \in D_j} \wp(x, y)$
– Average proximity function (not a measure, even if $\wp$ is a measure): $\wp_{\text{avg}}(D_i, D_j) = \dfrac{1}{n_i n_j} \sum_{x \in D_i} \sum_{y \in D_j} \wp(x, y)$

26 • Each set $D_i$ is represented by its representative vector $m_i$.
• Mean proximity function (a measure, provided that $\wp$ is a measure):
$\wp_{\text{mean}}(D_i, D_j) = \wp(m_i, m_j)$
• NOTE: Proximity functions between a vector $x$ and a set $C$ may be derived from the above functions by setting $D_i = \{x\}$.
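A combined sketch of the set-to-set functions from the last two slides; the names, the Euclidean proximity, and the toy data are illustrative:

```python
import numpy as np

def set_proximity(Di, Dj, prox, mode="avg"):
    """Max, min, and average proximity between the sets Di and Dj."""
    values = [prox(x, y) for x in Di for y in Dj]
    if mode == "max":
        return max(values)
    if mode == "min":
        return min(values)
    return sum(values) / (len(Di) * len(Dj))  # divide by n_i * n_j

def mean_proximity(Di, Dj, prox):
    """prox(m_i, m_j) with m_i, m_j the mean vectors of the two sets."""
    return prox(np.mean(Di, axis=0), np.mean(Dj, axis=0))

euclid = lambda a, b: float(np.linalg.norm(a - b))
Di = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
Dj = [np.array([4.0, 0.0]), np.array([5.0, 0.0])]
print(set_proximity(Di, Dj, euclid, "min"))  # 3.0
print(mean_proximity(Di, Dj, euclid))        # 4.0
```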

27 • Remarks:
• Different choices of proximity functions between sets may lead to totally different clustering results.
• Different proximity measures between vectors, used within the same proximity function between sets, may also lead to totally different clustering results.
• The only way to achieve a proper clustering is:
– by trial and error, and
– by taking into account the opinion of an expert in the field of application.