Density-Based Clustering of Uncertain Data (KDD2005)

HKU Department of Computer Science, Database Research Seminar, 18th May 2006. Density-Based Clustering of Uncertain Data (KDD 2005). Authors: Hans-Peter Kriegel and Martin Pfeifle. Presenter: Chui Chun Kit (MPhil). Supervisor: Dr. Benjamin C.M. Kao.

Presentation Outline
Introduction: What is clustering? Density-based similarity measurement. DBSCAN.
Issues from mining certain data to uncertain data: Why does data exhibit uncertainty? How to represent / model data uncertainty? How to represent the distance between two uncertain objects? Theoretical foundation of changing DBSCAN to FDBSCAN.
FDBSCAN: From DBSCAN to FDBSCAN. Computational issues.
Experimental results.
Conclusions.

Introduction

What is Clustering?
Problem description: given a set of objects and a similarity measurement, discover groups of similar objects. More precisely, find sets of objects whose intra-cluster similarity is high while the inter-cluster similarity is relatively low.

Different Clusters Discovered by Different Similarity Measurements
Distance-based, density-based, pattern-based, etc.

Density-based clustering
The main reason we recognize clusters is that within each cluster there is a typical density of objects that is considerably higher than outside the cluster. Clusters are separated by regions of low object density (noise). Density-based clustering can detect clusters of arbitrary shape.

Key idea of density-based clustering
Density constraint for objects to form clusters: intuitively, for each object of a cluster, the neighborhood of a given radius has to contain at least a minimum number of objects, i.e. the density in the neighborhood has to exceed some threshold. Objects that do not belong to any cluster are regarded as noise.

Previous Work on Density-Based Clustering
DBSCAN: a density-based clustering algorithm that works on data with no uncertainty. The uncertain-data version of DBSCAN will be presented later.

DBSCAN
Two important definitions of DBSCAN: core objects and directly density-reachable. Two further definitions, density-reachable and density-connected, are skipped for the sake of discussion.

DBSCAN Definition 1: Core Object
Given the density constraint (µ and ε), an object o is defined as a core object iff there are µ or more objects within the ε-range of o. In practice, we conduct a range search on o with radius ε; if µ or more objects are returned, then o is a core object.

DBSCAN Definition 1: Core Object
Example (µ=5). Is o1 a core object? Since there are 5 objects within the ε-range of o1, o1 is a core object. Since there are 5 objects within the ε-range of o2, o2 is a core object too.

DBSCAN Definition 2: Directly density-reachable
An object p is directly density-reachable from o if the following conditions are satisfied. 1st condition: o is a core object. 2nd condition: d(p,o) ≤ ε.

DBSCAN Definition 2: Directly density-reachable
Example (µ=5). Question: is o2 directly density-reachable from o1? 1st condition: is o1 a core object? Since there are 5 objects within the ε-range of o1, o1 is a core object. 2nd condition: is d(o2,o1) ≤ ε? Yes, o2 is within the ε-range of o1. Thus, o2 is directly density-reachable from o1.

DBSCAN: How does it work? Brief idea…
Search for clusters by checking the ε-neighborhood of each object in the database. If a core object o is found, a new cluster containing o and its directly density-reachable objects is created. DBSCAN then iteratively collects the objects that are directly density-reachable from the objects in the cluster.

DBSCAN Example (µ=5)
Arbitrarily pick a point, e.g. o1, and check whether it is a core object. o1 is a core object, so a cluster is created with o1 and all objects that are directly density-reachable from o1. DBSCAN continues to “expand” the cluster by adding objects which are directly density-reachable from cluster objects. Since a1 is not a core object, a2 is NOT directly density-reachable from a1, so a2 is NOT added to the cluster. If the current cluster does not expand, pick another point for the next iteration. Eventually, clusters are formed. Objects that are not assigned to any cluster are regarded as noise.
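The expansion procedure walked through above can be sketched in a few lines of Python. This is a minimal illustration for certain (point) data, not the presenter's code: the function names `region_query` and `dbscan`, the Euclidean distance, and the convention that a point counts as its own ε-neighbor are assumptions made for the example.

```python
from math import dist

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, mu):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)          # None = not visited yet
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < mu:           # not a core object
            labels[i] = -1                 # noise for now; may join a cluster later
            continue
        cluster += 1                       # core object found: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:            # former noise, directly density-reachable
                labels[j] = cluster
            if labels[j] is not None:      # already assigned: nothing to expand
                continue
            labels[j] = cluster
            reach = region_query(points, j, eps)
            if len(reach) >= mu:           # only core objects expand the cluster
                queue.extend(reach)
    return labels
```

Calling `dbscan(points, eps, mu)` on a list of coordinate tuples returns a label list in which -1 marks the noise objects.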

From Certain Data to Uncertain Data

From certain to uncertain data: five major issues…
1. Why does data exhibit uncertainty?
2. How to represent / model data uncertainty?
3. How to represent the distance between two uncertain objects?
4. What is a core object in uncertain data?
5. What is directly density-reachable in uncertain data?

Why does data exhibit uncertainty?
In many modern application areas, e.g. the clustering of moving objects or sensor databases, only uncertain data is available. For instance, in the area of mobile services, the objects continuously change their positions, so exact positional information is often not available.

Why does data exhibit uncertainty?
In application areas such as the clustering of distributed feature vectors, only approximated information is transmitted to a central server site, for security reasons or because of limited bandwidth.

Uncertain Data (Example)
Somewhere in a tropical rain forest… Location tracking of a group of about 300 chimpanzees. An implanted device reports the location of each chimpanzee regularly. However, the reported location is not precise: it only returns the area in which the chimpanzee is located. This area is called an uncertainty region. Assume that the chimpanzee is equally likely to be at any location inside the uncertainty region.

Chimpanzee society is complicated: some young chimpanzees may gather to fight against the leader. Zoologists are interested in studying the factors that affect the formation of different groups (clusters) inside the chimpanzee society.

One observation is that chimpanzees of the same group usually stay close together. Assume that each chimpanzee belongs to one group only. Density-based clustering can help to discover the chimpanzee groups (clusters).

[Figure: uncertainty regions of 15 chimpanzees reported by the location tracking devices, plotted on x and y axes, with the clusters marked. Somewhere in the tropical rain forest…]

Representing Uncertain Objects
Uncertain objects are represented by probability density functions (pdfs). [Figure: pdf of a 1-D object over its value (e.g. temperature); pdf of a 2-D object over the x and y axes.]

Representing Uncertain Objects
The probability that an object o has a value between a and b is given by the area under its probability density function between a and b. [Figure: pdf of a 1-D object over its value (e.g. temperature), with the area between a and b marked.] Question: what is the distance between two uncertain objects o and o’?

How to represent the distance between uncertain objects?
Three options. Distance density function pd(o,o’). Distance distribution function Pd(o,o’)(b). Distance expectation value Ed(o,o’): an aggregated value, with information loss.

Distance Density Function pd(o,o’)
Express the distance between two objects by means of a probability density function. Let d be a distance function. Let P(a≤d(o,o’)≤b) denote the probability that d(o,o’) is between a and b. A probability density function pd(o,o’) is called a distance density function if the following condition holds: P(a≤d(o,o’)≤b) = ∫_a^b pd(o,o’)(x) dx.

The probability density functions (pdfs) of the uncertain data items are considered independent of each other. The distance density function pd(o,o’) expresses the distance between two uncertain objects by means of a pdf. [Figure: the pdfs of o and o’ over their value (e.g. temperature), and the distance density function pd(o,o’) over the distance between o and o’.]

From the distance density function, the probability that the distance between two uncertain objects is between a and b is given by the area under pd(o,o’) between a and b: Area = P(a≤d(o,o’)≤b). The total area under the curve, which extends from the minimum to the maximum possible distance between o and o’, is 1.

Distance Distribution Function
Captures the probability that the distance between two uncertain objects is smaller than or equal to a value b. Useful in density-based clustering for expressing the probability that d(o’,o) ≤ b, which is the 2nd condition for directly density-reachable in DBSCAN.

In density-based clustering, when evaluating whether an object o’ is directly density-reachable from o, we may want to ask: what is the probability that o and o’ are close to each other, i.e. that the distance between o and o’ is smaller than or equal to b? The distance distribution function Pd(o,o’)(b) is the answer.

The distance distribution function Pd(o,o’)(b) is equal to the integral of the distance density function pd(o,o’) from negative infinity to b. [Figure: the area under pd(o,o’) to the left of b; x-axis: distance between o and o’.]

Distance Expectation Value Ed(o,o’)
Represents the distance between two uncertain objects by one numerical value: the average distance between the two objects, aggregated from the distance density function. Advantage: since the distance is represented by a single value, traditional clustering algorithms such as DBSCAN work. Disadvantage: information loss.
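To make the information loss concrete, here is a small sketch that estimates both the distance expectation value and the distance distribution function from sample points. It assumes that each uncertain object is given as a list of sample points and that the objects are independent, so every pair of samples is equally likely; the function names and the 1-D toy data are made up for illustration.

```python
from math import dist

def sample_pairs(samples_o, samples_p):
    """All combinations of one sample of o with one sample of p."""
    return [(x, y) for x in samples_o for y in samples_p]

def expected_distance(samples_o, samples_p):
    """Estimate of the distance expectation value: mean distance over all pairs."""
    pairs = sample_pairs(samples_o, samples_p)
    return sum(dist(x, y) for x, y in pairs) / len(pairs)

def distance_distribution(samples_o, samples_p, b):
    """Estimate of P(d(o, p) <= b): fraction of sample pairs within distance b."""
    pairs = sample_pairs(samples_o, samples_p)
    return sum(dist(x, y) <= b for x, y in pairs) / len(pairs)

# Hypothetical 1-D objects: p is always at distance 2 from o,
# q is at distance 1 half of the time and at distance 3 otherwise.
o = [(0.0,)]
p = [(2.0,), (2.0,)]
q = [(1.0,), (3.0,)]
```

With this toy data, `expected_distance(o, p)` and `expected_distance(o, q)` are both 2.0, yet for ε = 1.5 the estimate `distance_distribution(o, p, 1.5)` is 0.0 while `distance_distribution(o, q, 1.5)` is 0.5. The single aggregated value cannot tell the two situations apart.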

Theoretical Foundations I Core Object Probability
The core object probability of an object o is the probability that o is a core object. We derive its formula starting from the core object definition of DBSCAN…

In DBSCAN, an object o is a core object if the density constraint (µ and ε) is satisfied, i.e. there are µ or more objects p within the ε-range of o (d(p,o) ≤ ε). The probability that o is a core object is therefore the probability that the density constraint is satisfied: the probability that there are µ or more objects p with d(p,o) ≤ ε.

Example (µ=5): what is the core object probability of o? Because the objects are uncertain, sometimes d(p,o) ≤ ε and sometimes d(p,o) > ε. If ε is large enough, the core object probability of o is obviously 1. If ε is small, what is the core object probability of o? [Figure: the pdfs of o and a neighbouring object p, with ε-ranges of different sizes around o.]

For each subset A of the database D with cardinality greater than or equal to µ, determine the probability that exactly the objects p of A have d(p,o) ≤ ε and no other object in D\A does. The core object probability is the sum of these probabilities over all such subsets A.

Recall that Pd(p,o)(b) is the probability that the distance between two uncertain objects is smaller than or equal to a value b. First part: the probability that ALL objects p in A have d(p,o) ≤ ε. Second part: the probability that NO object p in D\A has d(p,o) ≤ ε. Their product is the probability that exactly the objects of A, and no others, have d(p,o) ≤ ε.
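The two parts described above can be combined into a single formula. The following LaTeX is a sketch under stated assumptions: $P^{core}(o)$ is an assumed symbol for the core object probability, $P_d(p,o)(\varepsilon)$ denotes the distance distribution function Pd(p,o) evaluated at ε, and the uncertain objects are taken to be independent, as stated earlier in the talk.

```latex
P^{core}(o) \;=\; \sum_{\substack{A \subseteq D \\ |A| \ge \mu}}
  \Bigl( \prod_{p \in A} P_d(p,o)(\varepsilon) \Bigr)
  \Bigl( \prod_{p \in D \setminus A} \bigl( 1 - P_d(p,o)(\varepsilon) \bigr) \Bigr)
```

The first product is the "first part" (all objects of A are within ε of o), the second product is the "second part" (no object of D\A is), and the sum runs over all subsets A with at least µ elements.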

Theoretical Foundations II Reachability Probability
The reachability probability is the probability that p is directly density-reachable from o. In DBSCAN, an object p is directly density-reachable from o if two conditions hold. 1st condition: o is a core object. 2nd condition: d(p,o) ≤ ε. Simply multiplying the probabilities of these two conditions is incorrect. Why? The two events depend on each other; the two conditions are NOT independent.

Example (µ=3): in this case, the probability that o is a core object depends on the probability that d(p,o) ≤ ε, i.e. the 1st and 2nd conditions are NOT independent. [Figure: the pdfs of o, p and q, with the ε-range of o.]

Two independent conditions. 1st condition: consider the core object probability of o in D\p, with the density constraint µ relaxed by 1. 2nd condition: consider the probability that d(p,o) ≤ ε. Their product corresponds to the probability that at least µ objects o’ from D have d(o’,o) ≤ ε and that p is one of them, which corresponds to the definition of directly density-reachable in DBSCAN. [Figure: objects o, p and q.]

The probability that at least µ-1 objects from D\p are located within the ε-range of o is the core object probability of o computed on D\p with the threshold µ-1. The probability that the distance between p and o is smaller than or equal to ε is Pd(p,o)(ε).

The two conditions are independent. Their product corresponds to the probability that at least µ objects from D are located in the ε-range of o and that p is one of them.
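Written as a formula, the product described above might look as follows. The notation is an assumption made for this sketch: $P^{reach}(p,o)$ names the reachability probability, and $P^{core}_{D \setminus \{p\},\, \mu-1}(o)$ is the core object probability of o evaluated on the database without p and with the threshold lowered to µ-1.

```latex
P^{reach}(p, o) \;=\;
  \underbrace{P^{core}_{D \setminus \{p\},\, \mu - 1}(o)}_{\text{1st condition}}
  \;\cdot\;
  \underbrace{P_d(p,o)(\varepsilon)}_{\text{2nd condition}}
```

Removing p from the first factor is what makes the two factors independent: the first no longer involves the distance between p and o at all.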

How does FDBSCAN work?
The traditional DBSCAN algorithm clusters a data set by always adding to the current cluster the objects that are directly density-reachable from the current query object o. FDBSCAN works very similarly to the traditional approach.

How does FDBSCAN work?
For each uncertain object o, check whether it is a core object. If yes, for each other object p, check the reachability of p from o. If the reachability probability is ≥ 0.5, p and o form a cluster. There are O(|DB|^2) reachability probability computations.

Computational Aspect I
Computing the reachability probability

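Computational Aspect I: Computing the reachability probability
[Diagram: the reachability probability is built on the core object probability, which in turn is built on the distance density function; integration is required at both levels.]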

Direction 1: Avoid calculating the integration by sampling (Monte Carlo sampling). Each uncertain object o is represented by a sequence of s sample points, i.e. <o1,o2,…,os>. The probabilities are then computed based on the sample sequences. How can this be done? (If time allows.)

Direction 2: Reduce the number of reachability probability computations. Some objects may be located very far away from o and obviously have no chance of being directly density-reachable from o. Use MBRs (minimum bounding rectangles) to bound the object samples: for every object o, compute MBR(o), which bounds the sample points <o1,o2,…,os>. If MBR(p) lies outside the ε-range of o, p CANNOT be directly density-reachable from o.
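The pruning test in Direction 2 can be sketched as follows. This is one possible reading, not the paper's implementation: it assumes axis-aligned rectangles, Euclidean distance, and that "outside the ε-range of o" is checked conservatively as the minimum distance between MBR(o) and MBR(p) exceeding ε. The function names are hypothetical.

```python
def mbr(samples):
    """Minimum bounding rectangle of sample points: (lower corner, upper corner)."""
    dims = range(len(samples[0]))
    lo = tuple(min(s[k] for s in samples) for k in dims)
    hi = tuple(max(s[k] for s in samples) for k in dims)
    return lo, hi

def min_dist(box_a, box_b):
    """Smallest Euclidean distance between any point of box_a and any point of box_b."""
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    # Per dimension, the gap between the two intervals (0 if they overlap).
    gaps = [max(lo_b[k] - hi_a[k], lo_a[k] - hi_b[k], 0.0) for k in range(len(lo_a))]
    return sum(g * g for g in gaps) ** 0.5

def can_prune(samples_o, samples_p, eps):
    """True if no sample of p can lie within eps of any sample of o."""
    return min_dist(mbr(samples_o), mbr(samples_p)) > eps
```

When `can_prune` returns True, the reachability probability of p from o is 0 and the sample sequences of p never need to be fetched.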

Computational Aspect II (if time allows)
Computing the core object probability. Interesting, but complicated.

Computational Aspect II: Computing the core object probability
Two issues. 1st issue: there are many core object probability computations. 2nd issue: in each core object probability computation, we have to consider exponentially many (in |DB|) subsets A of DB.

1st Issue: Many core object probability computations
For each uncertain object o, check the probability that o is a core object. If the core object probability is ≥ 0.5, then for each other object p, check the reachability of p from o. If the reachability probability is ≥ 0.5, p and o form a cluster. The 1st condition of each reachability probability is itself a core object probability, and it has to be computed for all p in D.

2nd Issue: Exponentially many subsets to consider for each core-object value
Furthermore, the computation of each core-object value has to consider exponentially many (in |DB|) subsets A of DB: all subsets A of D with cardinality greater than or equal to µ.
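A brute-force version of this computation shows where the exponential cost comes from. The sketch below assumes that the probabilities P(d(p,o) ≤ ε) are already known for every object p in D and that the objects are independent; the function name and the input format are choices made for the example.

```python
from itertools import combinations
from math import prod

def core_object_probability(probs, mu):
    """Probability that at least mu objects lie within eps of o.

    probs[k] is P(d(p_k, o) <= eps) for the k-th object of D (assumed independent).
    Enumerates every subset A of D with |A| >= mu, so the cost is exponential in |D|.
    """
    n = len(probs)
    total = 0.0
    for size in range(mu, n + 1):
        for subset in combinations(range(n), size):
            inside = set(subset)
            # Objects in A are within eps, objects in D\A are not.
            total += prod(probs[k] if k in inside else 1.0 - probs[k]
                          for k in range(n))
    return total
```

For three objects that are each within ε with probability 0.5, `core_object_probability([0.5, 0.5, 0.5], 2)` sums four subsets (three of size 2, one of size 3) and gives 0.5. With n objects the loop visits up to 2^n subsets, which is why the talk turns to sampling.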

2nd Issue: Exponentially many subsets to consider for each core-object value
Solution: sampling (Monte Carlo sampling). Each uncertain object o is represented by a sequence of s sample points, i.e. <o1,o2,…,os>. The core-object value is computed based on the sample sequences. How can this be done?

Computing based on the sample sequences
s is the sample rate; each object o is represented by <o1,o2,…,os>. The core object probability is determined from s^2 meaningful samples. oj is called the j-th instance of o. Dj is the collection of the j-th instances of all objects in D. E.g. s=5: a1, a2, a3, a4, a5; b1, b2, b3, b4, b5; c1, c2, c3, c4, c5; d1, d2, d3, d4, d5. D1 = {a1,b1,c1,d1,e1}, D2 = {a2,b2,c2,d2,e2}, …

Computing based on the sample sequences
To compute the core object probability of o, create an s×s sample matrix M(o). M(o) keeps track of the information needed to deduce the core object probability. With some modification, it can also be used to deduce the reachability probability. Each cell mi,j of M(o) indicates the number of ε-neighbors of oi in Dj.

Create sample matrix M(o) (skip)
Each cell mi,j contains the number of ε-neighbors of object sample oi in database instance Dj. Dj consists of the j-th samples of all other objects (excluding oj).
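A direct way to build such a matrix is sketched below. It is an assumption-laden illustration rather than the authors' code: every object is given as a list of s sample points, Euclidean distance is used, no MBR pruning is applied, and each cell starts at 1 so that the sample oi counts itself as a neighbor (the convention used in the worked example of the talk).

```python
from math import dist

def sample_matrix(o_samples, others, eps):
    """Build the s x s matrix M(o).

    o_samples: the s sample points of the query object o.
    others:    one sample sequence of length s per other object in the database.
    Cell m[i][j] counts the eps-neighbours of sample o_i in database instance D_j,
    i.e. among the j-th samples of all other objects, plus 1 for o_i itself.
    """
    s = len(o_samples)
    m = [[1] * s for _ in range(s)]
    for i, o_i in enumerate(o_samples):
        for j in range(s):
            m[i][j] += sum(dist(o_i, p[j]) <= eps for p in others)
    return m
```

Note that a sample p_k only contributes to column k: a sample that is close to oi but belongs to a different database instance is not counted in that cell.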

Create sample matrix M(o) (Example: sample rate = 3, µ = 5)
o is the query object, with samples o1, o2 and o3. All object samples are bounded by MBRs. [Figure: the query object o and the neighbouring objects a, b, c and d with their MBRs.]

Build M(o)
We first fill m1,1: how many ε-neighbors of o1 are there in D1? Since o1 itself is also counted, m1,1 is initialized to 1. By MBR pruning (min-max distance), we are sure that three objects contain ε-neighbors of o1 in D1, which raises the count to 4. MBR(a) and MBR(b) cannot be pruned, so their sample sequences are retrieved. a1 and b1 are ε-neighbors of o1. Although b2 is also an ε-neighbor of o1, it is not counted because it is NOT in database instance 1. The final value is 6: there are 6 ε-neighbors of object sample o1 in database instance D1.

Build M(o)
The remaining cells are filled in the same way. Now we have the sample matrix M(o) (rows: instances of o; columns: database instances):
      D1  D2  D3
o1     6   5   5
o2     6   4   5
o3     4   4   5

Computing based on the sample matrix M(o) (µ = 5)
For each uncertain object o, check the probability that o is a core object. If the core object probability is ≥ 0.5, then for each other object p, check the reachability of p from o. If the reachability probability is ≥ 0.5, p and o form a cluster.

Core object probability
1st step: count the number of elements in the sample matrix M(o) whose values are greater than or equal to µ.
2nd step: normalizing the count by s^2 yields the core object probability.
Sample matrix M(o) (rows: instances of o; columns: database instances):
      D1  D2  D3
o1     6   5   5
o2     6   4   5
o3     4   4   5
1st step: count = 6. 2nd step: core object probability of o = 6 / 9. Since the core object probability is > 0.5, o is treated as a core object.
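The two steps amount to a count and a division. A minimal sketch, using the 3×3 matrix of the example (the function name is a choice made here):

```python
def core_object_probability(m, mu):
    """Fraction of cells of the sample matrix M(o) with at least mu eps-neighbours."""
    s = len(m)
    hits = sum(cell >= mu for row in m for cell in row)   # 1st step: count
    return hits / (s * s)                                 # 2nd step: normalize by s^2

# Sample matrix M(o) from the example (rows: instances of o, columns: D1..D3).
M = [[6, 5, 5],
     [6, 4, 5],
     [4, 4, 5]]
```

`core_object_probability(M, 5)` counts six cells with a value of at least 5 and returns 6/9, which is above the 0.5 threshold used in the talk.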

For each uncertain object o, check the probability that o is a core object. If the core object probability is ≥ 0.5, then for each other object p, check the reachability of p from o. If the reachability probability is ≥ 0.5, p and o form a cluster. The reachability probability has two parts: the first part can be derived from M(o); for the second part, some pruning can be done using the MBRs of the object samples.

Compute the first part
1st step: decrease by 1 the values mi,j for which d(oi,pj) ≤ ε holds. 2nd step: count the number of elements in the sample matrix M(o) whose values are greater than or equal to µ-1. 3rd step: normalizing this count by s^2 yields the probability.

Computing the first part
1st step: decrease by 1 the values mi,j for which d(oi,pj) ≤ ε holds (here p = a). Conceptually, M(o) contains the ε-neighbor information in D, and we want it to contain the information in D\a. Decrease m1,1 and m1,3 by 1; decrease m2,1 and m2,3 by 1; decrease m3,3 by 1. The matrix becomes (rows: instances of o; columns: database instances):
      D1  D2  D3
o1     5   5   4
o2     5   4   4
o3     4   4   4
2nd step: count the number of elements in the sample matrix M(o) whose values are greater than or equal to µ-1.
3rd step: since all the cells are greater than or equal to 5-1 = 4, the first-part probability is equal to 9/9 = 1.

Compute the second part
Count the number of events d(oi,pj) ≤ ε and normalize the count by s×s. The MBRs of the object samples can be used for pruning.

Computing the second part
1st step: count the number of events d(oi,pj) ≤ ε (here p = a). Count = 2 + 2 + 1 = 5.
2nd step: normalize the count by s^2. The probability that d(a,o) ≤ ε is 5/9.
[Figure: samples o1, o2, o3 of o and samples a1, a2, a3 of a, among the objects b, c and d.]

Reachability of a from o
Reachability probability = first part × second part = 1 × 5/9 = 5/9. Since 5/9 ≥ 0.5, a is directly density-reachable from o; a and o form a cluster.
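The whole reachability computation for the example can be reproduced with a short sketch. The inputs are assumptions derived from the slides: the sample matrix M(o) with µ = 5, and a boolean matrix `within` marking the five sample pairs (oi, aj) with d(oi,aj) ≤ ε, namely the cells (1,1), (1,3), (2,1), (2,3) and (3,3). The function name and return format are hypothetical.

```python
def reachability_probability(m, within, mu):
    """Return (first part, second part, reachability probability) of p from o.

    m:      s x s sample matrix M(o).
    within: s x s booleans, within[i][j] is True when d(o_i, p_j) <= eps.
    """
    s = len(m)
    # First part: remove p from the counts, then test against the relaxed threshold.
    reduced = [[m[i][j] - within[i][j] for j in range(s)] for i in range(s)]
    first = sum(cell >= mu - 1 for row in reduced for cell in row) / (s * s)
    # Second part: fraction of sample pairs (o_i, p_j) within eps.
    second = sum(cell for row in within for cell in row) / (s * s)
    return first, second, first * second

M = [[6, 5, 5],
     [6, 4, 5],
     [4, 4, 5]]
within_a = [[True, False, True],
            [True, False, True],
            [False, False, True]]
first, second, reach = reachability_probability(M, within_a, 5)
```

Here `first` is 1.0 (all nine reduced cells are at least 4), `second` is 5/9, and `reach` is their product, which passes the 0.5 threshold.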

Reachability of other objects from o
[Figure: objects a, b, c and d around o, shown with the original sample matrix M(o), rows o1 to o3 and columns D1 to D3: (6 5 5), (6 4 5), (4 4 5).]

Experimental Evaluation

Datasets
Artificial data set (ART): objects which are normally distributed in [0,1]. Each object is randomly surrounded by a box with a side length of p<1 in each dimension (the data fuzziness). A uniform probability distribution is assumed within the box.
Engineering data set (PLANE): normalized objects.

Implementation
FDBSCAN: the approach presented in this talk.
EXPDBSCAN: represents the distance between two uncertain objects by a single distance expectation value Ed(o,o’) and uses the traditional DBSCAN algorithm to mine the data.

Implementation
Java 1.4, Windows platform, 730 MHz processor, 512 MB main memory, sample rate s = 5.

Experiment 1 Efficiency of FDBSCAN
Measure the runtimes (in seconds) of FDBSCAN and EXPDBSCAN on the ART dataset with p=0.01, i.e. little fuzziness in the dataset. Open question: does EXPDBSCAN apply the same MBR pruning strategies as FDBSCAN?

Experiment 2 Effectiveness of FDBSCAN
Measures the relation between the quality of the clustering results and the data fuzziness for FDBSCAN and EXPDBSCAN. How is the quality of clusters measured? Treat it as a black box for the time being: a good clustering has a quality value close to 1, and a poor one a lower value.

Experiment 2 Effectiveness of FDBSCAN
FDBSCAN returns clusters of better quality than EXPDBSCAN for all levels of data fuzziness and all numbers of dimensions, i.e. it is more effective. On ART, EXPDBSCAN performs quite well, but for high-dimensional data its quality is much worse than that of FDBSCAN. The quality of both EXPDBSCAN and FDBSCAN falls at high data fuzziness; however, FDBSCAN degrades less than EXPDBSCAN.

Experiment 3 Accuracy of the core object classification
How accurately do FDBSCAN and EXPDBSCAN classify core objects? This is measured by the precision and recall of the reported core objects. Precision shows how precise the reported set of core objects is: (# of reported real core objects) / (# of core objects reported). Recall shows the percentage of real core objects that are reported: (# of reported real core objects) / (total # of real core objects in D).
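The two measures can be computed from the reported and the real core object sets as follows. This is a small sketch; the function name and the choice to return 0.0 for an empty set are assumptions made here, not part of the experiment description.

```python
def precision_recall(reported, real):
    """Precision and recall of a reported core object set against the real one."""
    reported, real = set(reported), set(real)
    hits = len(reported & real)                 # reported real core objects
    precision = hits / len(reported) if reported else 0.0
    recall = hits / len(real) if real else 0.0
    return precision, recall
```

For example, if an algorithm reports {a, b, c} as core objects and the real core objects are {b, c, d, e}, two of the three reported objects are real (precision 2/3) and two of the four real ones are found (recall 0.5).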

Experiment 3 Accuracy of the core object classification
The precision and recall of FDBSCAN are not 100% because FDBSCAN uses a sampling approach to calculate the core object probability. EXPDBSCAN finds very few of the real core objects; however, nearly all of the core objects it returns are real core objects. FDBSCAN has a higher precision and recall on the 2D ART dataset. The precision and recall of FDBSCAN increase in high dimensions. Why? EXPDBSCAN has a lower recall than FDBSCAN. Why?

Why does EXPDBSCAN suffer from a low recall rate? (Example µ=5)
[Figure: ten uncertain objects, numbered 1 to 10, each with a Gaussian probability density function, and two core point candidates A and B with their ε-ranges.] Number of ε-neighbors of A = 5, so A is a core object. Number of ε-neighbors of B = 4, so B is NOT a core object.

Conclusion
Demonstrated how density-based clustering can be carried out on uncertain information. Presented the theoretical foundations for density-based clustering of uncertain data. FDBSCAN works on the fuzzy distance function directly instead of working on lossy aggregated information.

My comments: we also want to know…
The relationship between the sample rate and the execution time. A higher sample rate should give a more accurate result, but generally at the cost of execution time. What is the relationship between these two parameters? Also: sample rate vs. cluster quality; sample rate vs. data dimensionality, which would be a reference for choosing the sample rate based on the data characteristics; sample rate vs. fuzziness of data. Since each uncertain object is represented by an MBR, MBR(o) bounds only the samples of o. This means that MBR(o) may not bound the whole uncertainty region of o. With high data fuzziness, MBR(o) may not precisely indicate the uncertainty region of the real object o.

Something confusing… we also want to know…
The reason for using a probability of 0.5 to determine core objects is questionable. Why not treat it as a parameter? A higher value should produce more false negative core objects; a lower value should produce more false positive core objects.

The End Thank you 
