L2 and L1 Criteria for K-Means Bilinear Clustering
B. Mirkin, School of Computer Science, Birkbeck College, University of London
Advert of a Special Issue: The Computer Journal, "Profiling Expertise and Behaviour", deadline 15 Nov.

Outline: More of Properties than Methods
- Clustering, K-Means and issues
- Data recovery PCA model and clustering
- Data scatter decompositions for L2 and L1
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work

WHAT IS CLUSTERING; WHAT IS DATA
K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Mixed Data; Interpretation Aids
WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering
DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means; for Ward; Extensions to Other Data Types; One-by-One Clustering
DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters
GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability

Clustering, K-Means and Issues • Bilinear PCA model and clustering • Data Scatter Decompositions: Quadratic and Absolute • Contributions of nominal features • Explications of Quadratic criterion • One-by-one cluster extraction: Anomalous Patterns and iK-Means • Issue of the number of clusters • Comments on optimisation problems • Conclusion and future work

Example: W. Jevons' (1857) planet clusters, updated (Mirkin, 1996). Pluto doesn't fit in the two clusters of planets: it originated another cluster (2006).

Clustering algorithms:
- Nearest neighbour
- Ward's
- Conceptual clustering
- K-Means
- Kohonen SOM
- Spectral clustering
- …

K-Means: a generic clustering method. Entities are represented as multidimensional points.
0. Put K hypothetical centroids (seeds).
1. Assign points to the centroids according to the minimum-distance rule.
2. Put the centroids at the gravity centres of the clusters thus obtained.
3. Iterate steps 1 and 2 until convergence.
[Figure: a point cloud with K = 3 hypothetical centroids, shown over successive iterations.]

Final iteration: 4. Output the final centroids and clusters. [Figure: the final configuration.]
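
To make the generic loop above concrete, here is a minimal NumPy sketch of it; the function name, the random choice of seed entities and the convergence test by unchanged assignments are illustrative choices of this edit, not part of the original slides.

```python
import numpy as np

def k_means(Y, K, seeds=None, max_iter=100, rng=None):
    """Generic K-Means (steps 0-4 above): alternate the minimum-distance rule
    and the gravity-centre update until the assignments stop changing."""
    rng = np.random.default_rng(rng)
    Y = np.asarray(Y, dtype=float)
    N = len(Y)
    # 0. Put K hypothetical centroids (seeds); here: K distinct entities at random.
    centroids = Y[rng.choice(N, K, replace=False)] if seeds is None else np.asarray(seeds, dtype=float)
    labels = np.full(N, -1)
    for _ in range(max_iter):
        # 1. Minimum-distance rule: assign every point to its nearest centroid.
        d2 = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # 3. Convergence: stop once the assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 2. Move each centroid to the gravity centre (mean) of its cluster.
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = Y[labels == k].mean(axis=0)
    # 4. Output the final centroids and clusters.
    return centroids, labels
```

For example, k_means(Y, 3) mimics the K = 3 illustration; passing explicit `seeds` is how the Anomalous Pattern initialisation sketched later plugs in.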

Advantages of K-Means:
- Models typology building
- Computationally effective
- Can be utilised incrementally, 'on-line'
Shortcomings (?) of K-Means:
- Initialisation affects results
- Convex cluster shape

Initial Centroids: Correct (two-cluster case)

Initial Centroids: Correct. [Figure: initial and final configurations.]

Different Initial Centroids

Different Initial Centroids: Wrong. [Figure: initial and final configurations.]

Issues. K-Means gives no advice on:
- Number of clusters
- Initial setting
- Data normalisation
- Mixed variable scales
- Multiple data sets
K-Means gives limited advice on:
- Interpretation of results
These can all be addressed with the data recovery approach.

Clustering, K-Means and Issues • Data recovery PCA model and clustering • Data Scatter Decompositions: Quadratic and Absolute • Contributions of nominal features • Explications of Quadratic criterion • One-by-one cluster extraction: Anomalous Patterns and iK-Means • Issue of the number of clusters • Comments on optimisation problems • Conclusion and future work

Data recovery for data mining (discovery of patterns in data).
Types of data: similarity, temporal, entity-to-feature. Types of model: regression, principal components, clusters.
Model: Data = Model_Derived_Data + Residual.
Pythagoras: |Data|^m = |Model_Derived_Data|^m + |Residual|^m, m = 1, 2.
The better the fit, the better the model: a natural source of optimisation problems.

K-Means as a data recovery method

Representing a partition. Cluster k: centroid c_kv (v – feature); binary 1/0 membership z_ik (i – entity).

Basic equations (same as for PCA, but the score vectors z_k are constrained to be binary). y – data entry; z – 1/0 membership, not a score; c – cluster centroid; N – cardinality; i – entity; v – feature/category; k – cluster.
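
The model equation itself did not survive the transcript. In the notation just listed, and following Mirkin's data recovery formulation of K-Means, it can be reconstructed (treat this rendering as an assumption of this edit, not a verbatim slide) as:

```latex
y_{iv} \;=\; \sum_{k=1}^{K} c_{kv}\, z_{ik} \;+\; e_{iv},
\qquad z_{ik} \in \{0,1\}, \quad \sum_{k=1}^{K} z_{ik} = 1 \ \text{for every } i,
```

i.e. the same bilinear form as in PCA, with the score vectors z_k constrained to be binary membership vectors.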

Clustering: general and K-Means • Bilinear PCA model and clustering • Data Scatter Decompositions L2 and L1 • Contributions of nominal features • Explications of Quadratic criterion • One-by-one cluster extraction: Anomalous Patterns and iK-Means • Issue of the number of clusters • Comments on optimisation problems • Conclusion and future work

Quadratic data scatter decomposition (classic). K-Means: alternating least-squares (LS) minimisation. y – data entry; z – 1/0 membership; c – cluster centroid; N – cardinality; i – entity; v – feature/category; k – cluster.
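
The decomposition formula is likewise missing from the transcript. For the L2 case, with clusters S_k of cardinality N_k and within-cluster means c_kv, the quadratic data scatter decomposition being referred to is (a reconstruction consistent with the model above):

```latex
\sum_{i,v} y_{iv}^{2}
\;=\; \sum_{k=1}^{K}\sum_{v} N_k\, c_{kv}^{2}
\;+\; \sum_{k=1}^{K}\sum_{i \in S_k}\sum_{v} \bigl(y_{iv} - c_{kv}\bigr)^{2}.
```

The alternating least-squares minimisation of K-Means acts on the residual (second) term, the within-cluster squared error, so that the explained (first) term is maximised.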

Absolute data scatter decomposition (Mirkin, 1997); the c_kv are within-cluster medians.

Outline: Clustering: general and K-Means • Bilinear PCA model and clustering • Data Scatter Decompositions L2 and L1 • Implications for data pre-processing • Explications of Quadratic criterion • One-by-one cluster extraction: Anomalous Patterns and iK-Means • Issue of the number of clusters • Comments on optimisation problems • Conclusion and future work

Meaning of the data scatter (m = 1, 2): the sum of the contributions of the features – the basis for feature pre-processing (dividing by range rather than by the standard deviation). It is proportional to the summary variance (L2), or to the summary absolute deviation from the median (L1).

Standardisation of features: Y_ik = (X_ik – A_k)/B_k.
X – original data; Y – standardised data; i – entities; k – features.
A_k – shift of the origin, typically the average.
B_k – rescaling factor, traditionally the standard deviation, but the range may be better in clustering.
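
As a hedged illustration of this formula (the function name and the NumPy phrasing are choices of this edit, not of the slides), standardisation with the average as the shift and either the standard deviation or the range as the rescaling factor can be written as:

```python
import numpy as np

def standardise(X, scale="range"):
    """Y_ik = (X_ik - A_k) / B_k with A_k the column average and B_k its std or range."""
    X = np.asarray(X, dtype=float)
    A = X.mean(axis=0)                        # A_k: shift of the origin (the average)
    if scale == "std":
        B = X.std(axis=0)                     # traditional z-scoring
    else:
        B = X.max(axis=0) - X.min(axis=0)     # range: often preferable in clustering
    B = np.where(B == 0, 1.0, B)              # guard against constant features
    return (X - A) / B
```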

Normalising by the standard deviation decreases the effect of the more useful feature (feature 2); normalising by range keeps the effect of the distribution shape in T(Y). B = range × #categories (for the L2 case, under the equality-of-variables assumption). [Figure: two feature distributions with B1 = Std1 << B2 = Std2.]

Data standardisation:
- Categories coded as one/zero variables
- Subtracting the average
- All features: normalising by range
- Categories: sometimes also by the number of them

Illustration of data pre-processing: a mixed-scale data table.

Conventional quantitative coding + … data standardisation

No normalisation. [Figure: 'Tom Sawyer' illustration.]

Z-scoring (scaling by std). [Figure: 'Tom Sawyer' illustration.]

Normalising by range × #categories. [Figure: 'Tom Sawyer' illustration.]

Outline: Clustering: general and K-Means • Bilinear PCA model and clustering • Data Scatter Decompositions: Quadratic and Absolute • Contributions of nominal features • Explications of Quadratic criterion • One-by-one cluster extraction: Anomalous Patterns and iK-Means • Issue of the number of clusters • Comments on optimisation problems • Conclusion and future work

Contribution of a feature F to a partition (m = 2):
- proportional to the correlation ratio η² if F is quantitative;
- a contingency coefficient between the cluster partition and F if F is nominal: Pearson chi-square (Poisson-normalised) or Goodman–Kruskal tau-b (range-normalised).
Contrib(F) = … (made explicit on the following slides).

Contribution of a quantitative feature F to a partition (m = 2): proportional to the correlation ratio η².
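
The formula image is not in the transcript. For a centred quantitative feature v with within-cluster means c_kv and cluster sizes N_k, the contribution is the between-cluster part of its scatter, which by the usual analysis-of-variance identity equals the correlation ratio times the feature's total scatter:

```latex
\mathrm{Contrib}(v) \;=\; \sum_{k=1}^{K} N_k\, c_{kv}^{2}
\;=\; \eta^{2}(S, v)\, \sum_{i} y_{iv}^{2}.
```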

Contribution of a nominal feature – partition pair, L2 case: proportional to a contingency coefficient – Pearson chi-square (Poisson-normalised) or Goodman–Kruskal tau-b (range-normalised, B_j = 1). It still needs to be normalised by the square root of the number of categories to balance against the contribution of a numerical feature.
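
The slide's formula is lost; a reconstruction, to be read as an assumption consistent with the data recovery account rather than a verbatim copy: for a nominal feature with categories j coded by centred 0/1 dummies rescaled by B_j, with p_kj the proportion of entities in cluster k and category j and p_k, p_j the marginals,

```latex
\mathrm{Contrib}(F) \;=\; N \sum_{k=1}^{K}\sum_{j}
\frac{\bigl(p_{kj} - p_k\, p_j\bigr)^{2}}{p_k\, B_j^{2}},
```

which is the Pearson chi-square statistic when B_j = √p_j (Poisson normalisation), and N times the numerator of Goodman–Kruskal tau-b when B_j = 1 (range normalisation).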

Contribution of a nominal feature – partition pair, L1 case: a highly original contingency coefficient. It still needs to be normalised by the square root of the number of categories to balance against the contribution of a numerical feature.

Clustering: general and K-Means • Bilinear PCA model and clustering • Data Scatter Decompositions: Quadratic and Absolute • Contributions of nominal features • Explications of Quadratic criterion • One-by-one cluster extraction: Anomalous Patterns and iK-Means • Issue of the number of clusters • Comments on optimisation problems • Conclusion and future work

Equivalent criteria (1):
A. Bilinear residuals squared, MIN – minimising the difference between the data and the cluster structure.
B. Distance-to-centre squared, MIN – minimising the difference between the data and the cluster structure.

Equivalent criteria (2):
C. Within-group error squared, MIN – minimising the difference between the data and the cluster structure.
D. Within-group variance, MIN – minimising the within-cluster variance.

Equivalent criteria (3):
E. Semi-averaged within-cluster distances squared, MIN – minimising dissimilarities within clusters.
F. Semi-averaged within-cluster similarities, MAX – maximising similarities within clusters.
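
The formulas themselves are not in the transcript; for the squared Euclidean distance, the equivalence of the distance-to-centre criterion with the semi-averaged within-cluster distances of criterion E rests on the standard identity

```latex
\sum_{i \in S_k} \bigl\lVert y_i - c_k \bigr\rVert^{2}
\;=\; \frac{1}{2 N_k} \sum_{i \in S_k}\sum_{j \in S_k} \bigl\lVert y_i - y_j \bigr\rVert^{2},
```

with c_k the mean and N_k the cardinality of cluster S_k.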

Equivalent criteria (4):
G. Distant centroids, MAX – finding anomalous types.
H. Consensus partition, MAX – maximising the correlation between the sought partition and the given variables.

Equivalent criteria (5):
I. Spectral clusters, MAX – maximising the summary Rayleigh quotient over binary vectors.
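
As a reconstruction (the slide's formula is not in the transcript): with A = YYᵀ the entity-to-entity similarity matrix and z_k the 0/1 membership vectors, criterion I is the summary Rayleigh quotient

```latex
B(S) \;=\; \sum_{k=1}^{K} \frac{z_k^{\mathsf T} A\, z_k}{z_k^{\mathsf T} z_k},
\qquad A = Y Y^{\mathsf T},
```

maximised over binary, mutually non-overlapping z_k; relaxing the binary constraint gives the spectral (eigenvector) solution discussed later.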

Gower's controversy: 2N + 1 entities – N at c1, N at c2, and a single one at c3.
Two-cluster possibilities:
W(c1, c2 / c3) = N² d(c1, c2)
W(c1 / c2, c3) = N d(c2, c3)
so W(c1 / c2, c3) = o(W(c1, c2 / c3)) as N grows.
Separation over the grand mean/median rather than just over distances (in the most general setting for d).
[Figure: locations c1, c2, c3 with multiplicities N, N, 1.]

Outline: Clustering: general and K-Means • Bilinear PCA model and clustering • Data Scatter Decompositions: Quadratic and Absolute • Contributions of nominal features • Explications of Quadratic criterion • One-by-one cluster extraction strategy: Anomalous Patterns and iK-Means • Comments on optimisation problems • Issue of the number of clusters • Conclusion and future work

PCA-inspired Anomalous Pattern clustering: y_iv = c_v z_i + e_iv, where z_i = 1 if i ∈ S and z_i = 0 if i ∉ S. With the squared Euclidean distance, the centroid c of S must be anomalous, that is, interesting.

Spectral clustering can be non-optimal (1). Spectral clustering (becoming popular, in a different setting): find the maximum eigenvector x by maximising the Rayleigh quotient over all possible x; then define z_i = 1 if x_i > a and z_i = 0 if x_i ≤ a, for some threshold a.

Spectral clustering can be non-optimal (2). Example (for similarity data) showing that thresholding the maximum eigenvector need not recover the optimal binary z. [The numerical example did not survive the transcript.] This cannot be typical…

Initial setting with an Anomalous Pattern cluster. [Figure: 'Tom Sawyer' illustration.]

Anomalous Pattern clusters: iterations 0, 1, 2, 3. [Figure: 'Tom Sawyer' illustration.]

iK-Means: Anomalous Pattern clusters + K-Means. After extracting two clusters (how can one know that 2 is right?). [Figure: final configuration.]

iK-Means: defining K and the initial setting with iterative Anomalous Pattern clustering:
1. Find all Anomalous Pattern clusters.
2. Remove smaller (e.g., singleton) clusters.
3. Take the number of remaining clusters as K and initialise K-Means with their centres.
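
A minimal sketch of this initialisation, under the usual assumptions of the method (the reference point is the grand mean, and the initial Anomalous Pattern centroid is the entity farthest from it); the function names, the `min_size` threshold and the reuse of the `k_means` sketch above are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def anomalous_pattern(Y, active):
    """Extract one Anomalous Pattern cluster from the entities indexed by `active`."""
    ref = Y.mean(axis=0)                                   # reference point: grand mean
    pts = Y[active]
    seed = np.argmax(((pts - ref) ** 2).sum(axis=1))       # farthest entity from the reference
    centroid = pts[seed]
    members = np.zeros(len(pts), dtype=bool)
    members[seed] = True
    while True:
        # Entities closer to the AP centroid than to the reference point join the cluster.
        closer = ((pts - centroid) ** 2).sum(axis=1) < ((pts - ref) ** 2).sum(axis=1)
        if not closer.any():
            closer = members.copy()                        # degenerate case: keep the seed
        if np.array_equal(closer, members):
            break
        members = closer
        centroid = pts[members].mean(axis=0)
    return active[members], centroid

def ik_means_init(Y, min_size=2):
    """Iterate AP extraction; centres of the non-small clusters become the K-Means seeds."""
    Y = np.asarray(Y, dtype=float)
    remaining = np.arange(len(Y))
    seeds = []
    while len(remaining) > 0:
        cluster, centre = anomalous_pattern(Y, remaining)
        if len(cluster) >= min_size:
            seeds.append(centre)
        remaining = np.setdiff1d(remaining, cluster)
    return np.array(seeds)        # K = len(seeds); then run k_means(Y, len(seeds), seeds=seeds)
```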

Outline: Clustering: general and K-Means • Bilinear PCA model and clustering • Data Scatter Decompositions: Quadratic and Absolute • Contributions of nominal features • Explications of Quadratic criterion • One-by-one cluster extraction: Anomalous Patterns and iK-Means • Issue of the number of clusters • Comments on optimisation problems • Conclusion and future work

A study of eight number-of-clusters methods (joint work with Mark Chiang):
- Variance based: Hartigan (HK), Calinski & Harabasz (CH), Jump Statistic (JS)
- Structure based: Silhouette Width (SW)
- Consensus based: Consensus Distribution area (CD), Consensus Distribution mean (DD)
- Sequential extraction of APs (iK-Means): Least Squares (LS), Least Moduli (LM)
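
As one concrete example from the variance-based group, the Calinski & Harabasz index compares between-cluster to within-cluster scatter, CH(K) = [B/(K-1)] / [W/(N-K)]; a small sketch (the helper name is a choice of this edit):

```python
import numpy as np

def calinski_harabasz(Y, labels):
    """CH index: (between-cluster scatter / (K-1)) / (within-cluster scatter / (N-K))."""
    Y, labels = np.asarray(Y, dtype=float), np.asarray(labels)
    N, grand_mean = len(Y), Y.mean(axis=0)
    ks = np.unique(labels)
    K = len(ks)                    # assumes 1 < K < N
    B = sum(np.sum(labels == k) * np.sum((Y[labels == k].mean(axis=0) - grand_mean) ** 2) for k in ks)
    W = sum(np.sum((Y[labels == k] - Y[labels == k].mean(axis=0)) ** 2) for k in ks)
    return (B / (K - 1)) / (W / (N - K))

# Typical use: run k_means for K = 2..10 and keep the K with the largest CH value.
```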

Experimental results on 1000 × 15 data generated with 9 Gaussian clusters (three cluster-size patterns). For each method (HK, CH, JS, SW, CD, DD, LS, LM), the estimated number of clusters and the Adjusted Rand Index are reported under large and small spread; winners are marked as 1-time, 2-times or 3-times winners, with two winners counted each time. [The numerical table did not survive the transcript.]

Clustering: general and K-Means • Bilinear PCA model and clustering • Data Scatter Decompositions: Quadratic and Absolute • Contributions of nominal features • Explications of Quadratic criterion • One-by-one cluster extraction: Anomalous Patterns and iK-Means • Issue of the number of clusters • Comments on optimisation problems • Conclusion and future work

Some other data recovery clustering models:
- Hierarchical clustering: Ward agglomerative and Ward-like divisive; relation to wavelets and the Haar basis (1997)
- Additive clustering: partition and one-by-one clustering (1987)
- Biclustering: box clustering (1995)

Hierarchical clustering for conventional and spatial data. Model: the same; cluster structure: a 3-valued vector z representing a split. For a split S = S1 + S2 of a node S into children S1, S2:
z_i = 0 if i ∉ S, z_i = a if i ∈ S1, z_i = -b if i ∈ S2.
If a and b are taken so that z is centred, the node vectors of a hierarchy form an orthogonal base (an analogue of the SVD).
[Figure: node S split into children S1 and S2.]
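
The centring condition implied here is not written out in the transcript; a reconstruction (an assumption of this edit): for z supported on S with values a on S1 and -b on S2, and N1 = |S1|, N2 = |S2|, N_S = N1 + N2, centring plus unit length amount to

```latex
N_1\, a - N_2\, b = 0, \qquad N_1\, a^{2} + N_2\, b^{2} = 1
\;\;\Longrightarrow\;\;
a = \sqrt{\tfrac{N_2}{N_1 N_S}}, \quad b = \sqrt{\tfrac{N_1}{N_2 N_S}},
```

which is a Haar-type basis vector for the split, so the node vectors of a hierarchy are mutually orthonormal.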

Last:
- The data recovery, K-Means-wise model is an adequate tool that involves a wealth of interesting criteria for mathematical investigation.
- The L1 criterion remains a mystery, even for the most popular method, PCA.
- Greedy-wise approaches remain a vital element, both theoretically and practically.
- Evolutionary approaches have started sneaking in and should be given more attention.
- Extending the approach to other data types is a promising direction.