Document Analysis: Data Analysis and Clustering
Prof. Rolf Ingold, University of Fribourg
Master course, spring semester 2008

© Prof. Rolf Ingold 2 Outline
- Introduction to data clustering
- Unsupervised learning
- Background of clustering
- K-means clustering & fuzzy k-means
- Model-based clustering
- Gaussian mixtures
- Principal Component Analysis

© Prof. Rolf Ingold 3 Introduction to data clustering
- Statistically representative datasets implicitly carry valuable semantic information
  - the aim is to group samples into meaningful classes
- Several terminologies refer to this principle
  - data clustering, data mining
  - unsupervised learning
  - taxonomy analysis
  - knowledge discovery
  - automatic inference
- In this course we address two aspects
  - clustering: performing unsupervised classification
  - Principal Component Analysis: reducing the feature space

© Prof. Rolf Ingold 4 Application to document analysis
- Data analysis and clustering can potentially be applied at many levels of document analysis
  - at the pixel level, for foreground/background separation
  - on connected components, for segmentation or character/symbol recognition
  - on blocks, for document understanding
  - on entire pages, to perform document classification
  - ...

© Prof. Rolf Ingold 5 Unsupervised classification
- Unsupervised learning consists of inferring knowledge about classes
  - using unlabeled training data, i.e. samples that are not assigned to classes
- There are at least five good reasons for performing unsupervised learning
  - no labeled samples are available, or ground-truthing is too costly
  - it is a useful preprocessing step for producing ground-truthed data
  - the classes are not known a priori
  - for some problems the classes evolve over time
  - it is useful for studying which features are relevant

© Prof. Rolf Ingold 6 Background of data clustering (1)
- Clusters are formed according to different criteria
  - members of a class share the same or closely related properties
  - members of a class have small mutual distances or large similarities
  - members of a class are clearly distinguishable from members of other classes
- Data clustering can be performed on various data types
  - nominal types (categories)
  - discrete types
  - continuous types
  - time series

© Prof. Rolf Ingold 7 Background of data clustering (2)
- Data clustering requires similarity measures
  - various similarity and dissimilarity measures (often normalized to [0,1])
  - distances (satisfying the triangle inequality)
- Feature transformation and normalization are often required (see the sketch below)
- Clusters may be
  - center based: members are close to a representative model
  - chain based: members are close to at least one other member
- There is a distinction between
  - hard clustering: each sample is a member of exactly one class
  - fuzzy clustering: each sample has a membership value (probability) for each class
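As a small, illustrative complement to the points above, the sketch below standardizes features and computes a Euclidean distance together with one possible mapping of that distance to a similarity in [0,1]. The function names and the 1/(1+d) mapping are assumptions made here for illustration, not part of the course material.

```python
import numpy as np

def standardize(X):
    """Transform each feature to zero mean and unit variance before clustering."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def euclidean(a, b):
    """Euclidean distance: a dissimilarity measure satisfying the triangle inequality."""
    return np.linalg.norm(a - b)

def similarity(a, b):
    """One possible way to map a distance into a similarity value in [0, 1]."""
    return 1.0 / (1.0 + euclidean(a, b))
```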

© Prof. Rolf Ingold 8 K-means clustering
- k-means clustering is a popular algorithm for unsupervised classification; it assumes the following information to be available
  - the number of classes $c$
  - a set of unlabeled samples $x_1, \ldots, x_n$
- Each class $\omega_i$ is modeled by its center $\mu_i$, and each sample $x_k$ is assigned to the class of the nearest center
- The algorithm works as follows (a minimal implementation is sketched below)
  - initialize the centers $\mu_1, \ldots, \mu_c$ randomly
  - assign each sample $x_k$ to the class $\omega_i$ that minimizes $\|x_k - \mu_i\|^2$
  - update each center as the mean of its assigned samples, $\mu_i = \frac{1}{n_i} \sum_{x_k \in \omega_i} x_k$
  - stop when the class assignments no longer change
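A minimal NumPy sketch of the algorithm on this slide (initialize, assign, update, stop when assignments are stable). The function name, the choice of initial centers among the samples, and the iteration cap are illustrative assumptions, not the course's reference implementation.

```python
import numpy as np

def kmeans(X, c, max_iter=100, seed=0):
    """Plain k-means: X is an (n, d) array of samples, c the number of classes."""
    rng = np.random.default_rng(seed)
    # initialize the c centers with randomly chosen samples
    centers = X[rng.choice(len(X), size=c, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # assignment step: each x_k goes to the class minimizing ||x_k - mu_i||^2
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # class memberships no longer change
        labels = new_labels
        # update step: each center becomes the mean of its members
        for i in range(c):
            members = X[labels == i]
            if len(members) > 0:
                centers[i] = members.mean(axis=0)
    return centers, labels
```

For instance, `centers, labels = kmeans(X, c=3)` clusters an (n, d) array `X` into three classes.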

© Prof. Rolf Ingold 9 Illustration of the k-means algorithm

© Prof. Rolf Ingold 10 Convergence of the k-means algorithm
- The k-means algorithm always converges
  - but only to a local minimum of the squared-error criterion, which depends on the initialization of the centers

© Prof. Rolf Ingold 11 Fuzzy k-means
- Fuzzy k-means is a generalization of k-means that uses membership values $P^*(\omega_i \mid x_k)$, normalized so that $\sum_{i=1}^{c} P^*(\omega_i \mid x_k) = 1$ for every sample $x_k$
- The clustering method consists in minimizing the following cost, where the fuzziness exponent $b > 1$ is fixed: $J = \sum_{i=1}^{c} \sum_{k=1}^{n} \left[P^*(\omega_i \mid x_k)\right]^b \, \|x_k - \mu_i\|^2$
  - the centers are updated using $\mu_i = \dfrac{\sum_{k} [P^*(\omega_i \mid x_k)]^b \, x_k}{\sum_{k} [P^*(\omega_i \mid x_k)]^b}$
  - and the memberships are updated using $P^*(\omega_i \mid x_k) = \dfrac{(1/d_{ik})^{1/(b-1)}}{\sum_{j=1}^{c} (1/d_{jk})^{1/(b-1)}}$, with $d_{ik} = \|x_k - \mu_i\|^2$
  - (these two alternating updates are sketched in code below)
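A hedged sketch of the two alternating update steps, assuming the standard fuzzy k-means formulas written above with fuzziness exponent b. The random initialization of the memberships and the small constant guarding against division by zero are implementation choices, not part of the formulas.

```python
import numpy as np

def fuzzy_kmeans(X, c, b=2.0, max_iter=100, tol=1e-5, seed=0):
    """Fuzzy k-means sketch: returns centers (c, d) and memberships P (c, n)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # random memberships P*(w_i | x_k), normalized over i for every sample
    P = rng.random((c, n))
    P /= P.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        # center update: weighted means with weights P^b
        W = P ** b
        centers = (W @ X) / W.sum(axis=1, keepdims=True)
        # membership update from squared distances d_ik = ||x_k - mu_i||^2
        d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2) + 1e-12
        inv = (1.0 / d2) ** (1.0 / (b - 1.0))
        P_new = inv / inv.sum(axis=0, keepdims=True)
        if np.abs(P_new - P).max() < tol:
            return centers, P_new  # memberships have stabilized
        P = P_new
    return centers, P
```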

© Prof. Rolf Ingold 12 Model-based clustering
- In this approach, we assume the following information to be available
  - the number of classes $c$
  - the a priori probability $P(\omega_i)$ of each class
  - the shape of the class-conditional densities $p(x \mid \omega_j, \theta_j)$, with unknown parameters $\theta_j$
  - a dataset of unlabeled samples $\{x_1, \ldots, x_n\}$, assumed to be drawn
    - by first selecting a class $\omega_i$ with probability $P(\omega_i)$
    - and then drawing $x_k$ according to $p(x \mid \omega_i, \theta_i)$
- The goal is to estimate the parameter vector $\theta = (\theta_1, \ldots, \theta_c)^t$

© Prof. Rolf Ingold 13 Maximum likelihood estimation
- The goal is to estimate the $\theta$ that maximizes the likelihood of the set $D = \{x_1, \ldots, x_n\}$, that is $p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$
- Equivalently, we can maximize its logarithm, namely $l(\theta) = \sum_{k=1}^{n} \ln p(x_k \mid \theta)$

© Prof. Rolf Ingold 14 Maximum likelihood estimation (cont.)
- To find the solution, we require the gradient with respect to each $\theta_i$ to be zero: $\nabla_{\theta_i} l = \sum_{k=1}^{n} \frac{1}{p(x_k \mid \theta)} \nabla_{\theta_i} p(x_k \mid \theta) = 0$
- By assuming that $\theta_i$ and $\theta_j$ are functionally independent and writing the mixture density $p(x_k \mid \theta) = \sum_{j=1}^{c} p(x_k \mid \omega_j, \theta_j) P(\omega_j)$, we can state $\nabla_{\theta_i} p(x_k \mid \theta) = P(\omega_i) \, \nabla_{\theta_i} p(x_k \mid \omega_i, \theta_i)$
- By combining with Bayes' rule, $P(\omega_i \mid x_k, \theta) = \dfrac{p(x_k \mid \omega_i, \theta_i) P(\omega_i)}{p(x_k \mid \theta)}$, we finally obtain that for $i = 1, \ldots, c$ the estimate of $\theta_i$ must satisfy $\sum_{k=1}^{n} P(\omega_i \mid x_k, \theta) \, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \theta_i) = 0$

© Prof. Rolf Ingold 15 Maximum likelihood estimation (cont.)
- In most cases this equation cannot be solved analytically
- Instead, an iterative gradient-based (hill-climbing) approach can be used
  - to avoid convergence to a poor local optimum, a reasonable initial estimate should be used

© Prof. Rolf Ingold 16 Application to Gaussian mixture models
- We consider the case of a mixture of Gaussians in which the parameters $\mu_i$, $\Sigma_i$ and $P(\omega_i)$ have to be determined
- By applying the gradient method, $\mu_i$, $\Sigma_i$ and $P(\omega_i)$ can be re-estimated iteratively using
  - $\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})$
  - $\hat{\mu}_i = \dfrac{\sum_{k} \hat{P}(\omega_i \mid x_k, \hat{\theta}) \, x_k}{\sum_{k} \hat{P}(\omega_i \mid x_k, \hat{\theta})}$
  - $\hat{\Sigma}_i = \dfrac{\sum_{k} \hat{P}(\omega_i \mid x_k, \hat{\theta}) \, (x_k - \hat{\mu}_i)(x_k - \hat{\mu}_i)^t}{\sum_{k} \hat{P}(\omega_i \mid x_k, \hat{\theta})}$
  - where $\hat{P}(\omega_i \mid x_k, \hat{\theta}) = \dfrac{p(x_k \mid \omega_i, \hat{\mu}_i, \hat{\Sigma}_i) \, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\mu}_j, \hat{\Sigma}_j) \, \hat{P}(\omega_j)}$
- (one iteration of these updates is sketched in code below)
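One iteration of these re-estimation formulas can be sketched as follows; the sketch assumes SciPy is available for the Gaussian densities, and a practical implementation would add covariance regularization and iterate until the log-likelihood stabilizes.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em_step(X, priors, means, covs):
    """One re-estimation step for a Gaussian mixture.
    X: (n, d) samples, priors: (c,), means: (c, d), covs: (c, d, d)."""
    n = len(X)
    c = len(priors)
    # posteriors P(w_i | x_k, theta) obtained via Bayes' rule
    resp = np.stack([priors[i] * multivariate_normal.pdf(X, means[i], covs[i])
                     for i in range(c)])            # shape (c, n)
    resp /= resp.sum(axis=0, keepdims=True)
    # re-estimate P(w_i), mu_i and Sigma_i from the posteriors
    weights = resp.sum(axis=1)                      # (c,)
    new_priors = weights / n
    new_means = (resp @ X) / weights[:, None]
    new_covs = np.empty_like(np.asarray(covs, dtype=float))
    for i in range(c):
        diff = X - new_means[i]
        new_covs[i] = (resp[i, :, None] * diff).T @ diff / weights[i]
    return new_priors, new_means, new_covs
```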

© Prof. Rolf Ingold 17 Problem with local minima
- Maximum likelihood estimation by the iterative gradient method can converge to a local minimum rather than the global solution

© Prof. Rolf Ingold 18 Conclusion about clustering
- Unsupervised learning makes it possible to extract valuable information from unlabeled training data
  - it is very useful in practice
  - analytical approaches are generally not practicable
  - iterative methods can be used, but they sometimes converge to local minima
  - sometimes even the number of classes is not known; in such cases clustering can be run under several hypotheses and the best solution selected using information-theoretic criteria
- Clustering does not work well in high dimensions
  - it is therefore of interest to reduce the dimensionality of the feature space

© Prof. Rolf Ingold 19 Objective of Principal Component Analysis (PCA)
- From a Bayesian point of view, the more features are used, the more accurate the classification results can be
- But the higher the dimension of the feature space, the more difficult it is to obtain reliable models
- PCA can be seen as a systematic way to reduce the dimensionality of the feature space while minimizing the loss of information

© Prof. Rolf Ingold 20 Center of gravity
- Consider a set of samples described by their feature vectors $\{x_1, x_2, \ldots, x_n\}$
- The single point that best represents the entire set is the $x_0$ minimizing $J_0(x_0) = \sum_{k=1}^{n} \|x_0 - x_k\|^2$
- This point corresponds to the center of gravity, since setting the gradient to zero gives $x_0 = m = \frac{1}{n} \sum_{k=1}^{n} x_k$

© Prof. Rolf Ingold 21 Projection on a line
- The goal is to find the line through the center of gravity $m$ that best approximates the sample set $\{x_1, x_2, \ldots, x_n\}$
  - let $e$ be the unit vector of its direction; the equation of the line is then $x = m + a e$, where the scalar $a$ is the signed distance of $x$ from $m$
  - the optimal solution is obtained by minimizing the squared error $J_1(a_1, \ldots, a_n, e) = \sum_{k=1}^{n} \|(m + a_k e) - x_k\|^2$
- First, setting $\partial J_1 / \partial a_k = 0$ shows that the values $a_1, a_2, \ldots, a_n$ minimizing this function are $a_k = e^t (x_k - m)$, i.e. the orthogonal projection of $x_k$ onto the line

© Prof. Rolf Ingold 22 Scatter matrix
- The scatter matrix of the set $\{x_1, x_2, \ldots, x_n\}$ is defined as $S = \sum_{k=1}^{n} (x_k - m)(x_k - m)^t$
  - it differs from the sample covariance matrix only by a factor $n-1$ (a numerical check is sketched below)
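A quick numerical check of that last claim, assuming NumPy (np.cov normalizes by n-1 by default):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3))  # 100 samples, 3 features
m = X.mean(axis=0)                                   # center of gravity
S = (X - m).T @ (X - m)                              # scatter matrix
# the scatter matrix equals (n-1) times the sample covariance matrix
assert np.allclose(S, (len(X) - 1) * np.cov(X, rowvar=False))
```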

© Prof. Rolf Ingold 23 Finding the best line
- Substituting $a_k = e^t (x_k - m)$ into the squared error gives $J_1(e) = -\,e^t S e + \sum_{k=1}^{n} \|x_k - m\|^2$, where $S$ is the scatter matrix of the set $\{x_1, x_2, \ldots, x_n\}$
- The best line (the one minimizing $J_1$) is therefore obtained by maximizing $e^t S e$ subject to $\|e\| = 1$

© Prof. Rolf Ingold 24 Finding the best line (cont.)
- To minimize $J_1(e)$ we must maximize $e^t S e$ under the constraint $\|e\| = 1$
  - using the method of Lagrange multipliers, we consider $u = e^t S e - \lambda (e^t e - 1)$
  - differentiating with respect to $e$ gives $\frac{\partial u}{\partial e} = 2 S e - 2 \lambda e = 0$
  - we obtain $S e = \lambda e$ and $e^t S e = \lambda e^t e = \lambda$
- This means that to maximize $e^t S e$ we must select the eigenvector corresponding to the largest eigenvalue of the scatter matrix

© Prof. Rolf Ingold 25 Generalization to d dimensions
- Principal component analysis can be applied for any dimension $d$ up to the dimension of the original feature space
  - each sample is mapped onto a hyperplane defined by $x = m + \sum_{i=1}^{d} a_i e_i$
  - the objective function to minimize is $J_d = \sum_{k=1}^{n} \left\| \left( m + \sum_{i=1}^{d} a_{ki} e_i \right) - x_k \right\|^2$
  - the solution for $e_1, \ldots, e_d$ is given by the eigenvectors of the scatter matrix corresponding to the $d$ largest eigenvalues (a short sketch follows below)
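Putting the last few slides together, here is a minimal NumPy sketch of PCA: center the data, build the scatter matrix, keep the eigenvectors associated with the d largest eigenvalues, and project. The function name and returned values are illustrative choices.

```python
import numpy as np

def pca(X, d):
    """Project the samples in X (n, p) onto the d principal directions."""
    m = X.mean(axis=0)                     # center of gravity
    S = (X - m).T @ (X - m)                # scatter matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1][:d]  # indices of the d largest eigenvalues
    E = eigvecs[:, order]                  # columns e_1, ..., e_d
    A = (X - m) @ E                        # coefficients a_ki = e_i^t (x_k - m)
    return A, E, m

# a sample x_k can be approximated from its coefficients as m + A[k] @ E.T
```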