Clustering Sequential Data: Research Paper Review
Presented by Glynis Hawley
April 28, 2003
On the Optimal Clustering of Sequential Data
by Cheng-Ru Lin and Ming-Syan Chen
Electrical Engineering Department, National Taiwan University, Taipei, Taiwan
Second SIAM International Conference on Data Mining
April 11-13,
Agenda
• Introduction: What is sequential clustering?
• Problem definition for algorithm design
• Optimal Algorithm: SC OPT
• Greedy Algorithm: SC GD
• Conclusion
Sequential Clustering Problem
• Attributes and sequence of objects are both important.
• Objects within a cluster form a continuous region.
• An object within one cluster may be closer to the centroid of a different cluster than it is to its own centroid.
Conventional Clustering vs. Sequential Clustering
Application Areas
• Analysis of motion patterns of objects.
  – Cellular phones.
• Analysis of status logs of running machines.
Problem Definition
• Partitioning problem
  – n sequential objects into k clusters
• Dissimilarity measurement
  – Squared Euclidean distance
• Cluster quality
  – Cost measurement: penalizes clusters for the amount of dissimilarity among their objects
• Best solution minimizes the sum of the costs of all clusters
Cost Definition
• Cost of a cluster: the sum, over all m objects in the cluster, of the squared Euclidean distance between the object and the cluster centroid.
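The cost definition above can be written out as a short Python sketch. The function name `cluster_cost` and the representation of objects as lists of numeric attributes are assumptions made for illustration; they do not come from the paper.

```python
from typing import Sequence


def cluster_cost(points: Sequence[Sequence[float]]) -> float:
    """Sum of squared Euclidean distances from each object to the centroid.

    `points` holds the m objects of one cluster, each a list of attributes.
    (Hypothetical helper, named here only for illustration.)
    """
    m = len(points)
    if m == 0:
        return 0.0
    dim = len(points[0])
    # Centroid: the per-attribute mean of the m objects.
    centroid = [sum(p[d] for p in points) / m for d in range(dim)]
    # Cost: squared distance of every object from that centroid, summed.
    return sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dim))
```

A cluster of identical objects has cost 0, and the cost grows as the objects spread out around their centroid.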
Sequential Clustering Algorithms
• Optimal Sequential Clustering Algorithm
  – SC OPT
• Greedy Sequential Clustering Algorithm
  – SC GD
Algorithm SC OPT
• Determines the optimal k-partition of a set of sequential objects.
• Uses the property of optimal substructure.
  – Systematically solves all possible subproblems.
  – Stores results to be used in later steps.
Complexity of Algorithm SC OPT
Time: O(kn^2)
Space: O(kn)
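One way to realize the optimal-substructure idea is a dynamic program over prefixes of the sequence. The sketch below is an illustration under stated assumptions, not the paper's own pseudocode: objects are taken to be single numbers (one attribute), the name `sc_opt_sketch` is hypothetical, and prefix sums are used so that the cost of any contiguous cluster is available in constant time. Under those assumptions the table has O(kn) entries and is filled in O(kn^2) time.

```python
from typing import List, Sequence, Tuple


def sc_opt_sketch(xs: Sequence[float], k: int) -> Tuple[float, List[int]]:
    """Split the 1-D sequence xs into k contiguous clusters of minimum total cost.

    Returns (total cost, start index of each cluster).
    Hypothetical illustration; 1-D objects are an assumption.
    """
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")

    # Prefix sums of x and x^2 give the cost of xs[i:j] in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def cost(i: int, j: int) -> float:
        # Sum of squared distances to the mean for xs[i:j], with j > i.
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    inf = float("inf")
    # best[c][j]: minimum cost of splitting the first j objects into c clusters.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    # cut[c][j]: start index of the last cluster in that best split.
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = best[c - 1][i] + cost(i, j)
                if cand < best[c][j]:
                    best[c][j] = cand
                    cut[c][j] = i

    # Walk the stored cuts backwards to recover the cluster boundaries.
    starts: List[int] = []
    j = n
    for c in range(k, 0, -1):
        j = cut[c][j]
        starts.append(j)
    starts.reverse()
    return best[k][n], starts
```

For example, `sc_opt_sketch([1, 1, 1, 10, 10, 10], 2)` splits the sequence between the third and fourth objects, giving clusters that start at indices 0 and 3 with total cost 0.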
Algorithm SC GD
• Initially, arbitrarily insert separators to divide the n objects into k clusters.
Algorithm SC GD (Cont.)
• Reposition the separators by “moves” and “jumps” to reduce the cost of the clusters.
• The best possible move or jump is determined by calculating the cost reductions of all possible moves and jumps.
[Figure labels: move, jump]
Algorithm SC GD (Cont.)
• Continue repositioning separators until no further cost reductions are possible.
• Complexity
  – Time: O(nl/k + n), linear
  – Space: O(k)
Quality of clusters increases with n and with average cluster size.
Conclusion
• Sequential clustering requires that the sequence of data points be considered as well as the similarity of attributes.
• Algorithms:
  – SC OPT and SC GD
  – SC GD approaches SC OPT in terms of quality of clusters when average cluster sizes are large.
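The greedy loop described on the SC GD slides can be sketched roughly as follows. Several things here are assumptions rather than details from the paper: objects are single numbers, the initial separators are evenly spaced, a "move" is read as shifting a separator to an adjacent position and a "jump" as relocating it to any other free position, and the name `sc_gd_sketch` is hypothetical. The sketch recomputes the total cost for every candidate, so it does not attempt to reach the time or space bounds quoted above.

```python
from typing import List, Sequence, Tuple


def sc_gd_sketch(xs: Sequence[float], k: int) -> Tuple[float, List[int]]:
    """Greedy separator repositioning on a 1-D sequence (hypothetical sketch).

    Returns (total cost, sorted cut positions); a cut at p separates
    xs[p - 1] from xs[p].
    """
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")

    # Prefix sums of x and x^2 give the cost of xs[i:j] in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def cost(i: int, j: int) -> float:
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    def total(cuts: List[int]) -> float:
        bounds = [0] + cuts + [n]
        return sum(cost(a, b) for a, b in zip(bounds, bounds[1:]))

    # Arbitrary initial separators (assumption: evenly spaced).
    cuts = [n * t // k for t in range(1, k)]
    current = total(cuts)

    while True:
        best_cost, best_cuts = current, None
        occupied = set(cuts)
        for idx in range(len(cuts)):
            others = cuts[:idx] + cuts[idx + 1:]
            # Try every free position for this separator: adjacent positions
            # play the role of moves, all other positions the role of jumps.
            for pos in range(1, n):
                if pos in occupied:
                    continue
                cand = sorted(others + [pos])
                c = total(cand)
                if c < best_cost - 1e-12:
                    best_cost, best_cuts = c, cand
        if best_cuts is None:
            # No single repositioning reduces the cost any further.
            return current, cuts
        cuts, current = best_cuts, best_cost
```

Because each step only accepts a strict cost reduction, the loop terminates, but it may stop at a partition that is not the optimal one.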