EM Algorithm: Expectation-Maximization Clustering Algorithm
Book: Data Mining (Witten & Frank), Morgan Kaufmann, pp. 218-227
Mining Lab, 김완섭, October 27, 2004
Contents
- Clustering
- K-Means
- Mixture Model
- EM Algorithm
- Simple examples of EM
- EM Application: WEKA
- References
Clustering (1/2)
What is clustering? Clustering algorithms divide a data set into natural groups (clusters). Instances in the same cluster are similar to each other; they share certain properties (e.g., customer segmentation).
Clustering vs. classification: classification is supervised learning; clustering is unsupervised learning, with no target variable to be predicted.
Clustering (2/2)
Categorization of clustering methods:
- Partitioning methods: K-Means / K-Medoids / PAM / CLARA / CLARANS
- Hierarchical methods: CURE / CHAMELEON / BIRCH
- Density-based methods: DBSCAN / OPTICS
- Grid-based methods: STING / CLIQUE / WaveCluster
- Model-based methods: EM / COBWEB / Bayesian / Neural
Model-based clustering is also called probability-based or statistical clustering.
K-Means (1) Algorithm
Step 0: Select K objects as initial centroids.
Step 1 (Assignment): For each object, compute the distances to the k centroids and assign the object to the cluster whose centroid is closest.
Step 2 (New Centroids): Compute a new centroid for each cluster.
Step 3 (Convergence): Stop if the change in the centroids is less than the selected convergence criterion; otherwise repeat from Step 1.
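A minimal K-Means sketch in Python following the four steps above; it is an illustration under assumed defaults (random initialization, Euclidean distance, tolerance tol), not the slide author's implementation.

    import numpy as np

    def kmeans(points, k, tol=1e-4, max_iter=100, seed=0):
        X = np.asarray(points, dtype=float)
        rng = np.random.default_rng(seed)
        # Step 0: select k objects as the initial centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Step 1 (Assignment): distance from every object to every centroid.
            dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dist.argmin(axis=1)
            # Step 2 (New Centroids): mean of the objects assigned to each cluster.
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
            # Step 3 (Convergence): stop once the centroids barely move.
            if np.linalg.norm(new - centroids) < tol:
                break
            centroids = new
        return centroids, labels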
K-Means (2) Simple Example
(Figure: input data; random initial centroids; then alternating assignment and centroid-update steps until the assignments stop changing.)
K-Means (3) Weakness on Outliers (Noise)
K-Means (4) Calculation
Data: (0,2), (1,1), (1,0), (3,4), (4,2), (4,4); one run adds the outlier (100,0). Both runs start from the initial clusters {(4,4), (3,4)} and {(4,2), (0,2), (1,1), (1,0), ...}.

With the outlier:
Iteration 1: centroids <3.5, 4> and <21.2, 1>. Assignment: <3.5, 4> gets (0,2), (1,1), (1,0), (3,4), (4,4), (4,2); <21.2, 1> gets only (100,0).
Iteration 2: centroids <2.17, 2.17> and <100, 0>; the assignments no longer change. The outlier has captured a centroid of its own and forced the six natural points into a single cluster.

Without the outlier:
Iteration 1: centroids <3.5, 4> and <1.5, 1.25>. Assignment: <3.5, 4> gets (3,4), (4,4), (4,2); <1.5, 1.25> gets (0,2), (1,1), (1,0).
Iteration 2: centroids <3.67, 3.33> and <0.67, 1>; the assignments no longer change, giving the two natural clusters.
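Feeding the slide's points to the kmeans() sketch above reproduces the effect (the exact clusters still depend on which objects the random start picks):

    data = [(0, 2), (1, 1), (1, 0), (3, 4), (4, 4), (4, 2)]
    print(kmeans(data, k=2))               # two natural clusters, e.g. <3.67, 3.33> and <0.67, 1>
    print(kmeans(data + [(100, 0)], k=2))  # the outlier tends to capture a centroid of its own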
K-Means (5) Comparison with EM
K-Means:
- Hard clustering: an instance belongs to exactly one cluster.
- Based on Euclidean distance.
- Not robust to outliers or differing value ranges.
EM:
- Soft clustering: an instance belongs to several clusters, each with a membership probability.
- Based on probability density.
- Can handle both numeric and nominal attributes.
(Figure: an instance I belongs to cluster C1 with probability 0.7 and to cluster C2 with probability 0.3.)
Mixture Model (1)
A mixture is a set of k probability distributions representing k clusters. Each distribution has a mean and a variance; the mixture model combines several normal distributions.
Mixture Model (2)
With only one numeric attribute and two clusters A and B, the model has five parameters: the means and standard deviations μ_A, σ_A, μ_B, σ_B, and the mixing probability p_A (with p_B = 1 − p_A).
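A sketch of this five-parameter model in Python; the numeric values in the final line are made-up illustrations, not from the slides.

    import math

    def normal_pdf(x, mu, sigma):
        # Gaussian density f(x; mu, sigma)
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    def mixture_pdf(x, mu_a, sigma_a, mu_b, sigma_b, p_a):
        # Five parameters: mu_a, sigma_a, mu_b, sigma_b, and p_a (p_b = 1 - p_a).
        return p_a * normal_pdf(x, mu_a, sigma_a) + (1 - p_a) * normal_pdf(x, mu_b, sigma_b)

    print(mixture_pdf(60.0, mu_a=50.0, sigma_a=5.0, mu_b=65.0, sigma_b=2.0, p_a=0.6))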
Mixture Model (3) Simple Example
Probability that an instance x belongs to cluster A, by Bayes' rule:
Pr[A|x] = Pr[x|A]·Pr[A] / Pr[x] = f(x; μ_A, σ_A)·p_A / Pr[x],
where f(x; μ, σ) is the normal probability density function.
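The same rule as a sketch, reusing normal_pdf() from the previous block (the parameter values are again made up):

    def pr_a_given_x(x, mu_a, sigma_a, mu_b, sigma_b, p_a):
        fa = p_a * normal_pdf(x, mu_a, sigma_a)        # Pr[x|A] * Pr[A]
        fb = (1 - p_a) * normal_pdf(x, mu_b, sigma_b)  # Pr[x|B] * Pr[B]
        return fa / (fa + fb)                          # divide by Pr[x]

    print(pr_a_given_x(60.0, mu_a=50.0, sigma_a=5.0, mu_b=65.0, sigma_b=2.0, p_a=0.6))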
Mixture Model (4) Probability Density Functions
Normal (Gaussian) density: f(x; μ, σ) = 1/(√(2π)·σ) · exp(−(x−μ)²/(2σ²))
Poisson distribution: P(k; λ) = λᵏ·e^(−λ) / k!
Mixture Model (5) Probability Density Function
(Figure: the fitted density functions plotted over successive iterations.)
EM Algorithm (1)
Step 1 (Initialization): assign each record random cluster probabilities (weights).
Step 2 (Maximization step, parameter adjustment): re-create the cluster model by re-computing the parameters Θ (mean, variance) of each normal distribution from the weighted records.
Step 3 (Expectation step, weight adjustment): update each record's weights.
Step 4: Calculate the log-likelihood. If the value has saturated, exit; if not, go to Step 2.
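A minimal one-attribute, two-cluster EM loop in Python following Steps 1-4; the function names, tolerance, and variance floor are assumptions for the sketch, not taken from the slides.

    import numpy as np

    def norm_pdf(x, mu, sigma):
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

    def em_two_clusters(x, max_iter=100, tol=1e-6, seed=0):
        x = np.asarray(x, dtype=float)
        rng = np.random.default_rng(seed)
        w = rng.uniform(0.2, 0.8, size=len(x))   # Step 1: random weights for cluster A
        prev_ll = -np.inf
        for _ in range(max_iter):
            # Step 2 (Maximization): re-compute p_A and each cluster's mean/variance.
            p_a = w.mean()
            mu_a = np.sum(w * x) / np.sum(w)
            mu_b = np.sum((1 - w) * x) / np.sum(1 - w)
            var_a = np.sum(w * (x - mu_a) ** 2) / np.sum(w) + 1e-9        # floor avoids sigma = 0
            var_b = np.sum((1 - w) * (x - mu_b) ** 2) / np.sum(1 - w) + 1e-9
            # Step 3 (Expectation): update each record's weight by Bayes' rule.
            fa = p_a * norm_pdf(x, mu_a, np.sqrt(var_a))
            fb = (1 - p_a) * norm_pdf(x, mu_b, np.sqrt(var_b))
            w = fa / (fa + fb)
            # Step 4: log-likelihood; exit once it saturates.
            ll = np.sum(np.log(fa + fb))
            if ll - prev_ll < tol:
                break
            prev_ll = ll
        return (mu_a, var_a), (mu_b, var_b), p_a, ll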
EM Algorithm (2) Initialization
Random initial membership probabilities (M-Step example):

Num  Math  English  Cluster1  Cluster2
 1    80     90       0.25      0.75
 2    50     75       0.8       0.2
 3    85    100       0.43      0.57
 4    30     70       0.7       0.3
 5    95              0.15      0.85
 6    60              0.6       0.40
Sum                   2.93      3.07
EM Algorithm (3) M-Step: Parameters (Mean, Dev)
Estimating the parameters from the weighted instances; the parameters are the means and standard deviations. For cluster A with membership weights wᵢ:
μ_A = Σᵢ wᵢxᵢ / Σᵢ wᵢ
σ_A² = Σᵢ wᵢ(xᵢ − μ_A)² / Σᵢ wᵢ
EM Algorithm (3) M-Step: Parameters (Mean, Dev)
Applying these formulas to the weighted table from the Initialization slide above; a small numeric instance follows.
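For instance, the Cluster1 parameters of the Math attribute (a sketch; it simply reads the Math column and Cluster1 weights from the table above):

    import numpy as np

    math_scores = np.array([80, 50, 85, 30, 95, 60], dtype=float)  # Math column
    w = np.array([0.25, 0.8, 0.43, 0.7, 0.15, 0.6])                # Cluster1 weights (sum 2.93)

    mu_a = np.sum(w * math_scores) / np.sum(w)
    sigma_a = np.sqrt(np.sum(w * (math_scores - mu_a) ** 2) / np.sum(w))
    print(mu_a, sigma_a)   # weighted mean and standard deviation for Cluster1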
EM Algorithm (4) E-Step: Weights
Each record's weight is recomputed from the current parameters by Bayes' rule:
wᵢ = Pr[A|xᵢ] = p_A·f(xᵢ; μ_A, σ_A) / (p_A·f(xᵢ; μ_A, σ_A) + p_B·f(xᵢ; μ_B, σ_B))
EM Algorithm (5) E-Step: Weights
Example: for record 1 (Math 80, English 90), the formula above gives its updated cluster weights; a numeric sketch follows.
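A numeric sketch for record 1; the model parameters below are illustrative stand-ins, not the actual fitted values from the example.

    from math import exp, pi, sqrt

    def normal_pdf(x, mu, sigma):
        return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

    p1, mu1, s1 = 0.49, 62.0, 23.0   # Cluster1 prior, mean, std dev (assumed)
    p2, mu2, s2 = 0.51, 68.0, 20.0   # Cluster2 prior, mean, std dev (assumed)

    x = 80                           # record 1's Math score
    f1 = p1 * normal_pdf(x, mu1, s1)
    f2 = p2 * normal_pdf(x, mu2, s2)
    print(f1 / (f1 + f2))            # record 1's updated Cluster1 weight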
EM Algorithm (6) Objective Function (Check)
Log-likelihood function: sum, over all instances, the log of the instance's probability under the mixture; the log turns the product of probabilities into a sum that is easier to analyze.
1-dimensional data, 2 clusters A and B:
log L = Σᵢ log( p_A·f(xᵢ; μ_A, σ_A) + p_B·f(xᵢ; μ_B, σ_B) )
N-dimensional data, k clusters: each cluster j is described by a mean vector μⱼ and a covariance matrix Σⱼ.
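As a sketch for the 1-dimensional, two-cluster case, reusing normal_pdf() from the previous block:

    import numpy as np

    def log_likelihood(xs, p_a, mu_a, s_a, mu_b, s_b):
        # Sum over all instances of the log of the mixture density.
        return float(sum(np.log(p_a * normal_pdf(x, mu_a, s_a) +
                                (1 - p_a) * normal_pdf(x, mu_b, s_b)) for x in xs))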
EM Algorithm (7) Objective Function (Check)
In the N-dimensional case each cluster's density uses a mean vector μ and a covariance matrix Σ:
f(x; μ, Σ) = (2π)^(−d/2) · |Σ|^(−1/2) · exp(−(x−μ)ᵀ Σ⁻¹ (x−μ) / 2)
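The same density as a plain NumPy sketch (Weka's own implementation is not shown on the slides):

    import numpy as np

    def mvn_pdf(x, mu, cov):
        # N-dimensional normal density with mean vector mu and covariance matrix cov.
        x, mu, cov = np.asarray(x, float), np.asarray(mu, float), np.asarray(cov, float)
        d = len(mu)
        diff = x - mu
        norm_const = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
        return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm_const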
EM Algorithm (8) Termination
The procedure stops when the log-likelihood saturates.
(Figure: log-likelihood values Q0 ≤ Q1 ≤ Q2 ≤ Q3 ≤ Q4 rising with the number of iterations and leveling off.)
EM Algorithm (1) Simple Data
EM example: 6 data points, 3 samples per class, 2 classes (circle, rectangle).
EM Algorithm (2)
Likelihood function of the two component means Θ1 and Θ2.
EM Algorithm (3)
EM Example (1)
Example dataset: 2 columns (Math, English), 6 records. Several cells were lost from the slide text:

Num  Math  English
 1    80     90
 2    60     75
 3
 4    30
 5   100
 6    15
EM Example (2)
Distribution of Math: mean 56.67. Distribution of English: mean 82.5.
(Figure: the two fitted normal density curves, plotted over the score range 0 to 100.)
EM Example (3) Random Cluster Weights
The six records start with the random membership weights shown in the Initialization table above (the Cluster1 weights sum to 2.93, the Cluster2 weights to 3.07).
EM Example (4) Iteration 1: Maximization Step (parameter adjustment)
EM Example (4)
EM Example (5) Iteration 2: Expectation Step (weight adjustment), then Maximization Step (parameter adjustment)
EM Example (6) Iteration 3: Expectation Step (weight adjustment), then Maximization Step (parameter adjustment)
EM Application (1) Weka
Weka: developed at the University of Waikato in New Zealand; an open-source mining tool.
Experiment data: Iris data; real data (department customer data, modified customer data).
EM Application (2) IRIS Data
Data info. Attributes: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm).
Class: Iris Setosa / Iris Versicolour / Iris Virginica.
EM Application (3) IRIS Data
EM Application (4) Weka Usage
Weka clustering package: weka.clusterers
Command-line execution:
    java weka.clusterers.EM -t iris.arff -N 2
    java weka.clusterers.EM -t iris.arff -N 2 -V
GUI execution:
    java -jar weka.jar
EM Application (4) Weka Usage
Options for clustering in Weka:
-t <training file>       Specify training file
-T <test file>           Specify test file
-x <number of folds>     Specify number of folds for cross-validation
-s <random number seed>  Specify random number seed
-l <input file>          Specify input file for model
-d <output file>         Specify output file for model
-p                       Only output predictions for test instances
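For example, combining the options above with the -N cluster-count flag from the previous slide (option spellings can differ across Weka versions, so treat this as illustrative):

    java weka.clusterers.EM -t iris.arff -N 3 -s 42 -d em.model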
EM Application (5) Weka usage
EM Application (5) Weka usage – input file format

% Summary Statistics:
%                 Min  Max  Mean  SD    Class Correlation
%  sepal length:  4.3  7.9  5.84  0.83   0.7826
%  sepal width:   2.0  4.4  3.05  0.43  -0.4194
%  petal length:  1.0  6.9  3.76  1.76   0.9490  (high!)
%  petal width:   0.1  2.5  1.20  0.76   0.9565  (high!)

@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
EM Application (6) Weka usage – output format
The numeric values were lost from the slide text; the layout is:

Number of clusters: 3

Cluster: 0  Prior probability: ...
  Attribute: sepallength  Normal Distribution. Mean = ...  StdDev = ...
  Attribute: sepalwidth   Normal Distribution. Mean = ...  StdDev = ...
  Attribute: petallength  Normal Distribution. Mean = ...  StdDev = ...
  Attribute: petalwidth   Normal Distribution. Mean = ...  StdDev = ...
  Attribute: class        Discrete Estimator. Counts = ...  (Total = 53)

Clustered instances: ... ( 33%) / ... ( 32%) / ... ( 35%)
Log likelihood: ...
EM Application (6) Result Visualization
References
Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, pp. 218-255.
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, Chapter 8.
Frank Dellaert. The Expectation Maximization Algorithm. February 2002.
Jeff A. Bilmes. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models.