
1 EM Algorithm: Expectation Maximization Clustering Algorithm
Book: "Data Mining" (Morgan Kaufmann, Frank), pp. 218-227
Mining Lab. 김완섭, October 27, 2004

2 Content
Clustering
K-Means via EM
Mixture Model
EM Algorithm
Simple examples of EM
EM Application: WEKA
References

3 Clustering (1/2) What is clustering? Clustering vs. Classification
Clustering algorithms divide a data set into natural groups (clusters). Instances in the same cluster are similar to each other because they share certain properties (e.g., customer segmentation).
Clustering vs. Classification: classification is supervised learning with a target variable to be predicted, whereas clustering is unsupervised learning with no target variable to be predicted.

4 Clustering (2/2) Categorization of Clustering Methods
Partitioning methods: K-Means / K-Medoids / PAM / CLARA / CLARANS
Hierarchical methods: CURE / CHAMELEON / BIRCH
Density-based methods: DBSCAN / OPTICS
Grid-based methods: STING / CLIQUE / Wave-Cluster
Model-based methods: EM / COBWEB / Bayesian / Neural
Model-based clustering is also called probability-based or statistical clustering.

5 K-Means (1) Algorithm
Step 0: Select K objects as initial centroids.
Step 1 (Assignment): For each object, compute the distances to the K centroids and assign the object to the cluster whose centroid is closest.
Step 2 (New Centroids): Compute a new centroid for each cluster.
Step 3 (Convergence): Stop if the change in the centroids is less than the selected convergence criterion; otherwise repeat from Step 1.
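These steps translate almost line for line into code. The sketch below is an illustrative NumPy implementation, not code from the original presentation; as a usage example it reuses the six points of the worked calculation on the K-Means (4) slide.

import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    # Minimal K-Means sketch: assign to the nearest centroid, recompute centroids,
    # stop when the centroids no longer move. (No handling of empty clusters.)
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step 0: select K objects as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1 (Assignment): distance of every object to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2 (New Centroids): mean of the objects assigned to each cluster.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 3 (Convergence): stop if the centroids did not change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Data from the "K-Means (4) Calculation" slide, without the outlier.
centroids, labels = kmeans([(3, 4), (4, 4), (4, 2), (0, 2), (1, 1), (1, 0)], k=2)
print(centroids)   # expected to settle near <3.67, 3.33> and <0.67, 1>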

6 K-Means (2) Simple example
(Figure: input data → random centroids → repeated assignment and centroid-update steps until the assignments no longer change.)

7 K-Means (3) Weakness: sensitivity to outliers (noise)

8 K-Means (4) Calculation
Without the outlier: data = (3,4), (4,4), (4,2), (0,2), (1,1), (1,0).
Iteration 1: initial centroids <3.5, 4> (from (3,4), (4,4)) and <1.5, 1.25> (from (4,2), (0,2), (1,1), (1,0)); assignment: <3.5, 4> gets (3,4), (4,4), (4,2); <1.5, 1.25> gets (0,2), (1,1), (1,0).
Iteration 2: new centroids <3.67, 3.33> and <0.67, 1>; the assignment no longer changes.
With the outlier (100, 0) added:
Iteration 1: centroids <3.5, 4> and <21, 1>; assignment: <3.5, 4> gets (0,2), (1,1), (1,0), (3,4), (4,4), (4,2); <21, 1> gets only (100, 0).
Iteration 2: new centroids <2.17, 2.17> and <100, 0>; the outlier captures a centroid by itself and distorts the clustering.

9 K-Means (5) Comparison with EM
K-Means: hard clustering. An instance belongs to exactly one cluster. Based on Euclidean distance. Not robust to outliers or to differing value ranges.
EM: soft clustering. An instance belongs to several clusters, each with a membership probability (e.g., instance I: 0.7 in cluster C2 and 0.3 in cluster C1). Based on probability densities. Can handle both numeric and nominal attributes.

10 Mixture Model (1) A mixture is a set of k probability distributions, representing k clusters. Each distribution has its own mean and variance. The mixture model combines several normal distributions.

11 Mixture Model (2) With only one numeric attribute and two clusters A and B, the mixture has five parameters: the two means μ_A and μ_B, the two standard deviations σ_A and σ_B, and the mixing probability p_A (with p_B = 1 − p_A).

12 Mixture Model (3) Simple Example
Probability that an instance x belongs to cluster A (by Bayes' rule):
Pr[A | x] = f(x; μ_A, σ_A) · p_A / Pr[x]
where f(x; μ, σ) is the normal probability density function and Pr[x] = f(x; μ_A, σ_A) · p_A + f(x; μ_B, σ_B) · p_B.
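As a small illustration (not from the slides), the membership probability can be computed directly from the normal density; the parameter values in the example call are made up.

import math

def normal_pdf(x, mu, sigma):
    # Gaussian density f(x; mu, sigma).
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def prob_cluster_a(x, mu_a, sigma_a, p_a, mu_b, sigma_b, p_b):
    # Pr[A | x] by Bayes' rule for a two-component mixture.
    fa = normal_pdf(x, mu_a, sigma_a) * p_a
    fb = normal_pdf(x, mu_b, sigma_b) * p_b
    return fa / (fa + fb)

# Hypothetical parameters, for illustration only.
print(prob_cluster_a(52, mu_a=50, sigma_a=5, p_a=0.6, mu_b=65, sigma_b=2, p_b=0.4))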

13 Mixture Model (4) Probability Density Function
For a normal distribution, the Gaussian density function is
f(x; μ, σ) = 1 / (√(2π) σ) · exp(−(x − μ)² / (2σ²)).
Other distributions (e.g., the Poisson distribution) can also be used as mixture components.

14 Mixture Model (5) Probability Density Function
(Figure: the fitted density functions over successive iterations.)

15 EM Algorithm (1)
Step 1 (Initialization): Assign each record random cluster membership probabilities (weights).
Step 2 (Maximization Step, parameter adjustment): Re-create the cluster model by re-computing the parameters Θ (mean, variance) of each cluster's normal distribution from the weighted records.
Step 3 (Expectation Step, weight adjustment): Update each record's weights using the new parameters.
Step 4: Calculate the log-likelihood. If the value saturates, exit; if not, go to Step 2.
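For the one-attribute, two-cluster case, the four steps can be sketched as below. This is an illustrative NumPy implementation under the slide's assumptions (normal components, random initial weights); the variable names, tolerance, and sample data are not taken from the presentation.

import numpy as np

def em_two_gaussians(x, n_iter=100, tol=1e-6, seed=0):
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1 (Initialization): random cluster-A membership weights.
    w = rng.random(len(x))
    prev_ll = -np.inf
    for _ in range(n_iter):
        # Step 2 (Maximization): re-estimate priors, means and deviations
        # of the two normal distributions from the weighted records.
        p_a = w.mean()
        mu_a = (w * x).sum() / w.sum()
        mu_b = ((1 - w) * x).sum() / (1 - w).sum()
        sd_a = np.sqrt((w * (x - mu_a) ** 2).sum() / w.sum())
        sd_b = np.sqrt(((1 - w) * (x - mu_b) ** 2).sum() / (1 - w).sum())
        # Step 3 (Expectation): update each record's weight.
        f_a = p_a * np.exp(-(x - mu_a) ** 2 / (2 * sd_a ** 2)) / (np.sqrt(2 * np.pi) * sd_a)
        f_b = (1 - p_a) * np.exp(-(x - mu_b) ** 2 / (2 * sd_b ** 2)) / (np.sqrt(2 * np.pi) * sd_b)
        w = f_a / (f_a + f_b)
        # Step 4: log-likelihood; exit when it saturates.
        ll = np.log(f_a + f_b).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return {"mu_A": mu_a, "sd_A": sd_a, "mu_B": mu_b, "sd_B": sd_b, "p_A": p_a}

print(em_two_gaussians([80, 50, 85, 30, 95, 60]))   # illustrative 1-D data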

16 EM Algorithm (2) Initialization: random probabilities (M-Step example)

Num  Math  English  Cluster1  Cluster2
1    80    90       0.25      0.75
2    50    75       0.80      0.20
3    85    100      0.43      0.57
4    30    70       0.70      0.30
5    95             0.15      0.85
6    60             0.60      0.40
Sum                  2.93      3.07

17 EM Algorithm (3) M-Step: Parameters (Mean, Dev)
Estimating the parameters from the weighted instances. For cluster A with weights w_i:
μ_A = Σ w_i x_i / Σ w_i
σ_A² = Σ w_i (x_i − μ_A)² / Σ w_i
(and analogously for cluster B, using the weights 1 − w_i).
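In code the two formulas are one-liners. A minimal sketch, where the attribute values are illustrative and the weights are the Cluster 1 column of the table on the adjacent slides:

import numpy as np

x = np.array([80.0, 50.0, 85.0, 30.0, 95.0, 60.0])   # one attribute (illustrative values)
w = np.array([0.25, 0.80, 0.43, 0.70, 0.15, 0.60])   # cluster-A weights from the table

mu_a = (w * x).sum() / w.sum()                          # weighted mean
sd_a = np.sqrt((w * (x - mu_a) ** 2).sum() / w.sum())   # weighted standard deviation
print(mu_a, sd_a)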

18 EM Algorithm (3) M-Step: Parameters (Mean, Dev)

Num  Math  English  Cluster-A  Cluster-B
1    80    90       0.25       0.75
2    50    75       0.80       0.20
3    85    100      0.43       0.57
4    30    70       0.70       0.30
5    95             0.15       0.85
6    60             0.60       0.40
Sum                  2.93       3.07
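The column totals (2.93 and 3.07) are the expected cluster sizes; in the standard mixture-model M-step, dividing them by the number of records gives the new mixing probabilities. A quick check using the weights from the table:

weights_a = [0.25, 0.80, 0.43, 0.70, 0.15, 0.60]   # Cluster-A column
weights_b = [0.75, 0.20, 0.57, 0.30, 0.85, 0.40]   # Cluster-B column

n = len(weights_a)
print(sum(weights_a), sum(weights_b))              # the 2.93 and 3.07 shown above (up to rounding)
print(sum(weights_a) / n, sum(weights_b) / n)      # new priors: p_A ~ 0.49, p_B ~ 0.51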

19 EM Algorithm (4) E-Step: Weight
The weight of record x_i for cluster A is its membership probability under the current parameters:
w_i = Pr[A | x_i] = f(x_i; μ_A, σ_A) · p_A / ( f(x_i; μ_A, σ_A) · p_A + f(x_i; μ_B, σ_B) · p_B )

20 EM Algorithm (5) E-Step: Weight
Example: for record 1 (Math = 80, English = 90), the cluster weights are computed with the formula above, using the parameters estimated in the M-step.

21 EM Algorithm (6) Objective Function (check)
Log-likelihood function: the overall likelihood is the product, over all instances, of each instance's probability under the mixture; the logarithm is used to make the analysis easier.
1-dimensional data, 2 clusters A and B:
L = Σ_i log( p_A · f(x_i; μ_A, σ_A) + p_B · f(x_i; μ_B, σ_B) )
N-dimensional data, K clusters:
L = Σ_i log( Σ_k p_k · f(x_i; μ_k, Σ_k) ), where μ_k is the mean vector and Σ_k the covariance matrix of cluster k.
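A sketch of the general (N-dimensional, K-cluster) objective, using SciPy's multivariate normal density; the parameter values in the example are placeholders, not estimates from the slides.

import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, priors, means, covs):
    # Sum over instances of log( sum_k p_k * f(x; mu_k, Sigma_k) ).
    X = np.asarray(X, dtype=float)
    mixture = np.zeros(len(X))
    for p_k, mu_k, cov_k in zip(priors, means, covs):
        mixture += p_k * multivariate_normal.pdf(X, mean=mu_k, cov=cov_k)
    return np.log(mixture).sum()

# Placeholder 2-D, 2-cluster parameters for illustration only.
X = [[80, 90], [50, 75], [85, 100], [30, 70]]
print(log_likelihood(X,
                     priors=[0.5, 0.5],
                     means=[[82, 95], [40, 72]],
                     covs=[np.eye(2) * 100.0, np.eye(2) * 100.0]))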

22 EM Algorithm (7) Objective Function (check)
In the multivariate case, each cluster's density f(x; μ_k, Σ_k) is a multivariate normal with mean vector μ_k and covariance matrix Σ_k.

23 EM Algorithm (8) Termination
The procedure stops when the log-likelihood saturates.
(Figure: log-likelihood values Q0 ≤ Q1 ≤ Q2 ≤ Q3 ≤ Q4 plotted against the number of iterations.)

24 EM Algorithm (1) Simple Data
EM example: 6 data points (3 samples per class), 2 classes (circle, rectangle).

25 EM Algorithm (2) Likelihood function of the two component means Θ1, Θ2

26 EM Algorithm (3)

27 EM Example (1) Example dataset: 2 columns (Math, English), 6 records

Num  Math  English
1    80    90
2    60    75
3
4    30
5    100
6    15

28 EM Example (2)
Distribution of Math: mean 56.67.
Distribution of English: mean 82.5.
(Figure: the fitted distributions of the two attributes.)

29 EM Example (3) Random Cluster Weights

Num  Math  English  Cluster1  Cluster2
1    80    90       0.25      0.75
2    50    75       0.80      0.20
3    85    100      0.43      0.57
4    30    70       0.70      0.30
5    95             0.15      0.85
6    60             0.60      0.40
Sum                  2.93      3.07

30 EM Example (4) Iteration 1: Maximization Step (parameter adjustment)

31 EM Example (4)

32 EM Example (5) Iteration 2: Expectation Step (weight adjustment), then Maximization Step (parameter adjustment)

33 EM Example (6) Iteration 3: Expectation Step (weight adjustment), then Maximization Step (parameter adjustment)

34 EM Example (6) Iteration 3: Expectation Step (weight adjustment), then Maximization Step (parameter adjustment)

35 EM Application (1) Weka
Weka: an open-source data mining tool from the University of Waikato in New Zealand.
Experiment data: Iris data; real data (department customer data, modified customer data).

36 EM Application (2) IRIS Data
Attribute information: sepal length, sepal width, petal length, petal width (all in cm).
Class: Iris Setosa / Iris Versicolour / Iris Virginica.

37 EM Application (3) IRIS Data

38 EM Application (4) Weka Usage
Weka clustering package: weka.clusterers
Command-line execution:
java weka.clusterers.EM -t iris.arff -N 2
java weka.clusterers.EM -t iris.arff -N 2 -V
GUI execution:
java -jar weka.jar

39 EM Application (4) Weka Usage
Options for clustering in Weka:
-t <training file>       Specify training file
-T <test file>           Specify test file
-x <number of folds>     Specify number of folds for cross-validation
-s <random number seed>  Specify random number seed
-l <input file>          Specify input file for model
-d <output file>         Specify output file for model
-p                       Only output predictions for test instances

40 EM Application (5) Weka usage

41 EM Application (5) Weka usage - input file format (ARFF)

% Summary Statistics: Min / Max / Mean / SD / Class Correlation
% sepal length:
% sepal width:
% petal length: (high!)
% petal width:  (high!)

@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa

42 EM Application (6) Weka usage - output format

Number of clusters: 3

Cluster: 0  Prior probability:
Attribute: sepallength   Normal Distribution. Mean =   StdDev =
Attribute: sepalwidth    Normal Distribution. Mean =   StdDev =
Attribute: petallength   Normal Distribution. Mean =   StdDev =
Attribute: petalwidth    Normal Distribution. Mean =   StdDev =
Attribute: class         Discrete Estimator. Counts =  (Total = 53)
( 33%) ( 32%) ( 35%)
Log likelihood:
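For readers without Weka at hand, a rough cross-check of this kind of output can be produced with scikit-learn's GaussianMixture, which fits the same sort of normal-mixture model by EM. This is an alternative tool, not part of the original presentation; it assumes iris.arff is available locally and that SciPy and scikit-learn are installed.

import numpy as np
from scipy.io import arff
from sklearn.mixture import GaussianMixture

data, meta = arff.loadarff("iris.arff")
# Use the four numeric attributes; the class attribute is ignored for clustering.
X = np.array([[r[0], r[1], r[2], r[3]] for r in data], dtype=float)

gm = GaussianMixture(n_components=3, covariance_type="diag", random_state=1).fit(X)
for k in range(3):
    print("Cluster", k,
          "prior =", round(float(gm.weights_[k]), 2),
          "means =", gm.means_[k],
          "stddevs =", np.sqrt(gm.covariances_[k]))
print("Average log likelihood per instance:", gm.score(X))

The per-cluster priors, means, and standard deviations printed here correspond to the Prior probability / Mean / StdDev lines in the Weka output above.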

43 EM Application (6) Result Visualization

44 References
Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann. pp. 218-255.
Jiawei Han. Data Mining: Concepts and Techniques. Chapter 8.
Frank Dellaert. The Expectation Maximization Algorithm. February 2002.
A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models.

