Intelligent and Adaptive Systems Research Group
A Novel Method of Estimating the Number of Clusters in a Dataset
Reza Zafarani and Ali A. Ghorbani
Faculty of Computer Science, University of New Brunswick, Fredericton, NB, Canada


What is Clustering?
The unsupervised division of patterns (data points, feature vectors, instances, ...) into groups of similar objects. Objects in the same group are more similar to one another, whereas objects in different groups are more dissimilar.

The Need for Clustering
- Analysis of data / finding (dis)similar data.
- Need for abstraction / reducing redundancy.
- Alleviating the effect of low computation power.
- Cost efficiency and business use cases.

Related Work
Early literature on dynamic clustering attempted to solve this problem by running algorithms for several values of K (the number of clusters); the best K among them is then determined from some coefficient or statistic:
- The distance between two cluster centroids, normalized by the clusters' standard deviations, can be used as a coefficient.
- The silhouette coefficient compares the average distance between points in the same cluster with the average distance between points in different clusters.
These coefficients are plotted as a function of K, and the best K is selected.
Probabilistic measures that determine the best model among mixture models can also be used; in this setting, the optimal K corresponds to the best-fitting model. Well-known criteria of this kind include BIC, MDL, and MML.

The Oracle of Clustering
Definition: a function, called the Oracle, that can predict whether two random data points belong to the same cluster or not.

Oracle Approximation
Thresholding: a simple yet effective approach to approximating the Oracle is to apply a threshold to the similarity function between data points. A justifiable threshold is a linear combination of the mean and standard deviation of the pairwise similarities.
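As a concrete illustration, here is a minimal sketch of the thresholding approximation. The slides fix neither the similarity function nor the form of the linear combination, so cosine similarity and a weighting constant `alpha` are assumptions made for the example:

```python
import numpy as np

def oracle_by_thresholding(data, alpha=1.0):
    """Approximate the Oracle: predict whether two points share a cluster
    by thresholding their pairwise similarity.

    `alpha` is a hypothetical weighting constant for the linear combination
    of mean and standard deviation mentioned in the slides.
    """
    # Cosine similarity between all pairs of points (one possible choice
    # of similarity function; the slides do not fix a specific one).
    unit = data / np.linalg.norm(data, axis=1, keepdims=True)
    sim = unit @ unit.T

    # Threshold = mean + alpha * std of the off-diagonal similarities.
    off_diag = sim[~np.eye(len(data), dtype=bool)]
    threshold = off_diag.mean() + alpha * off_diag.std()

    def oracle(i, j):
        # True -> points i and j are predicted to be in the same cluster.
        return sim[i, j] >= threshold

    return oracle

# Two well-separated groups of points.
pts = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
oracle = oracle_by_thresholding(pts, alpha=0.0)
print(oracle(0, 1), oracle(0, 2))  # same group vs. different groups
```

Any other similarity (e.g. a kernel, or negated Euclidean distance) can be substituted without changing the thresholding idea.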
To make this more accurate, dimensionality reduction methods can be applied to the data first.

Ensemble Clustering: two points are considered to be in the same cluster if a majority of different clustering algorithms place them in the same cluster. This method can be computationally inefficient.

Predicting the Number of Clusters Using the Oracle
Monte Carlo Sampling: given the Oracle function, Monte Carlo sampling of it can be used to estimate the probability that two random points lie in the same cluster,

    p = sum_{i=1..m} (|c_i| / n)^2,

where |c_i|, m, and n are the size of cluster i, the number of clusters, and the dataset size, respectively. The sampling is controlled using Chernoff bounds:

    Pr(|p_hat - p| >= eps) <= 2 exp(-2 N eps^2) <= delta,

where p is the actual probability, p_hat its empirical estimate, N is the sample size, and eps and delta are two prefixed constants. This problem links the area to research avenues in Partition Theory, more specifically to variations of Kloosterman sums and to summand distributions in integer partitions.

Discussions
- It is simple to see that, given this Oracle function, clustering can be done with O(mn) time complexity, where m is the number of clusters and n is the number of data points.
- The Oracle function can be approximated for different clustering algorithms (preferably for those with quadratic running times); if this approximation can take place, their running times can be reduced to O(mn).
- Transfer learning can be used to approximate this Oracle for algorithms with quadratic running times (the relation between the Oracles of different clustering algorithms can be learnt or approximated).

Future Work
- A reasonable way to predict the number of clusters m from these probabilities is to use methods from Partition Theory together with methods from Convex Optimization.
- Gap statistics are another area worth investigating for detecting the number of clusters; gap-statistic methods can be used to refine the threshold values used in Oracle prediction.
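The Monte Carlo step and the Chernoff-style control of the sample size can be sketched as follows. The oracle here is a stand-in answered from known labels, purely for illustration, and `eps`/`delta` play the role of the two prefixed constants; the sample-size formula is the standard Hoeffding-type consequence of the bound above:

```python
import math
import random

def sample_size(eps, delta):
    # With N samples, the empirical probability deviates from the true p
    # by more than eps with probability at most delta whenever
    # N >= ln(2 / delta) / (2 * eps^2).
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def estimate_same_cluster_prob(oracle, n, eps=0.05, delta=0.05, rng=None):
    """Monte Carlo estimate of p = Pr(two random points share a cluster)."""
    rng = rng or random.Random(0)
    N = sample_size(eps, delta)
    hits = 0
    for _ in range(N):
        i, j = rng.randrange(n), rng.randrange(n)
        hits += oracle(i, j)  # True counts as 1
    return hits / N

# Illustration: a stand-in oracle answered from known labels
# (two equal clusters of 50 points each), so the true p is 0.5.
labels = [0] * 50 + [1] * 50
oracle = lambda i, j: labels[i] == labels[j]
p_hat = estimate_same_cluster_prob(oracle, n=100)
print(round(p_hat, 2))
```

With eps = delta = 0.05 the bound asks for N = 738 samples, regardless of the dataset size; this independence from n is what makes the sampled Oracle cheap to query.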
- The transfer learning function can be learnt, and the optimum conditions under which it is learnable can be discovered.

References
Milligan, G., Cooper, M.: An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2) (1985)
Pelleg, D., Moore, A.: X-means: Extending K-means with efficient estimation of the number of clusters. Proceedings of the 17th International Conf. on Machine Learning (2000)
Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 63(2) (2001)
Guan, Y., Ghorbani, A., Belacel, N.: Y-means: A clustering method for intrusion detection. Proceedings of the Canadian Conference on Electrical and Computer Engineering (2003)
Erdős, P., Lehner, J.: The distribution of the number of summands in the partitions of a positive integer. Duke Math. J. 8(2) (1941)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press (2004)

Challenges in Clustering
- Dynamism
- Validity
- High dimensions (the curse of dimensionality)
- Subjectivity, e.g. the set {ship, bird, fish} can be clustered in two different ways
- Large data sets
- Complexity
- Proximity measures
- Initial conditions
- Ensemble techniques
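As a closing illustration of the O(mn) claim made under Discussions: given an Oracle, each point only needs to be compared against one representative per existing cluster, i.e. at most m queries per point. The label-based oracle below is a hypothetical stand-in for any of the approximations described earlier:

```python
def cluster_with_oracle(points, oracle):
    """Cluster by querying the Oracle against one representative per
    existing cluster: at most m queries per point, O(m*n) overall."""
    clusters = []  # each cluster is a list of point indices
    for p in range(len(points)):
        for cluster in clusters:
            if oracle(p, cluster[0]):   # compare with the representative
                cluster.append(p)
                break
        else:
            clusters.append([p])        # no match: start a new cluster
    return clusters

# Stand-in oracle answered from known labels, purely for illustration.
labels = [0, 0, 1, 2, 1, 0]
oracle = lambda i, j: labels[i] == labels[j]
print(cluster_with_oracle(labels, oracle))  # -> [[0, 1, 5], [2, 4], [3]]
```

Note that the result is only as good as the Oracle: with a perfect Oracle this recovers the true partition, while with a thresholded or ensemble approximation errors in individual answers can split or merge clusters.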