Published by Maura Gaye. Modified over 9 years ago.
1 Very Large-Scale Incremental Clustering
Berk Berker, Mumin Cebe, Ismet Zeki Yalniz
27 March 2007
2 Table of Contents
Why Clustering?
Why Incremental Clustering?
Related Work
Incremental C3M (C2ICM)
A Former Implementation of C2ICM for Very Large Datasets
Conclusion
3 Why clustering?
It is an effective tool to manage information overload
To browse large document collections quickly
To easily grasp the distinct topics and subtopics (concept hierarchies)
To allow search engines to efficiently query large document collections
4 Types of Clustering
Hierarchical vs. Non-hierarchical
Partitional vs. Agglomerative
Deterministic vs. Probabilistic algorithms
Incremental vs. Batch algorithms
5 Why Incremental Clustering?
The current information explosion
Popular sources of informational text documents such as Newswire and Blogs
Delay would be unacceptable in several important areas
6 Related Work
The cluster-splitting approach
Adaptive clustering based on user queries
Cobweb algorithm
Hierarchical clustering in an incremental manner
7 C2ICM Algorithm
C3M is known as an efficient, effective and robust algorithm for clustering documents
C3M is well developed for initial clustering, but the clusters also need maintenance as the collection changes
8 C2ICM Algorithm
Like C3M, the C2ICM algorithm is based on the cover coefficient concept.
C2ICM is suitable for dynamic environments where documents are added and deleted.
With C2ICM, reclustering the whole collection for each update is avoided.
9 C2ICM Algorithm Details
First, we compute the number of clusters and the cluster seed powers in the updated database
Then we determine the newly added documents and the falsified documents (the members of clusters that are no longer valid)
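The first step above can be illustrated with a minimal sketch. It assumes a binary document-by-term matrix `D` in which every document has at least one term, and it follows the cover coefficient definitions as given in Can and Ozkarahan (1990): the decoupling coefficient delta_i of document i, the estimated number of clusters n_c as the sum of all delta_i, and the binary-matrix seed power delta_i * (1 - delta_i) * (number of terms in document i). The function name and data layout are hypothetical, not taken from the slides.

```python
def cover_coefficients(D):
    """Return (delta, n_c, seed_power) for a binary document-term matrix D.

    D is a list of rows, one row per document, one column per term.
    Assumption: every document (row) contains at least one term.
    """
    m, n = len(D), len(D[0])
    row_sums = [sum(row) for row in D]                        # 1 / alpha_i
    col_sums = [sum(D[i][k] for i in range(m)) for k in range(n)]  # 1 / beta_k

    delta = []
    for i in range(m):
        # delta_i = alpha_i * sum_k d_ik * beta_k * d_ik
        total = sum(D[i][k] * D[i][k] / col_sums[k]
                    for k in range(n) if D[i][k])
        delta.append(total / row_sums[i])

    n_c = sum(delta)  # estimated number of clusters
    # Seed power (binary case): delta_i * psi_i * number of terms in doc i,
    # where psi_i = 1 - delta_i is the coupling coefficient.
    seed_power = [delta[i] * (1.0 - delta[i]) * row_sums[i] for i in range(m)]
    return delta, n_c, seed_power


# Two documents share the same two terms; a third uses a term of its own.
D = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]
delta, n_c, seed_power = cover_coefficients(D)
```

In this toy matrix the estimate is two clusters, which matches the intuition that the first two documents belong together and the third stands alone. The documents with the highest seed power would then be chosen as seeds.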
10 C2ICM Algorithm Details
How do the clusters become false?
When a seed document becomes non-seed or is deleted
When one or more non-seed documents of that cluster become seeds
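The two falsification conditions above can be sketched as a small function. This is an illustration only: the function name, the dictionary layout (old seed mapped to its cluster members) and the `deleted` parameter are assumptions, and the data is taken from the worked example in this deck.

```python
def documents_to_recluster(clusters, new_seeds, new_docs, deleted=()):
    """Split old clusters into kept clusters and documents to recluster.

    clusters : dict mapping each old seed to the list of its members
               (the seed itself included).
    A cluster is treated as falsified when its seed was deleted or is no
    longer a seed, or when one of its non-seed members became a seed.
    """
    new_seeds, deleted = set(new_seeds), set(deleted)
    kept, pending = {}, []
    for seed, members in clusters.items():
        seed_lost = seed in deleted or seed not in new_seeds
        member_promoted = any(d in new_seeds for d in members if d != seed)
        if seed_lost or member_promoted:
            pending.extend(d for d in members if d not in deleted)
        else:
            kept[seed] = [d for d in members if d not in deleted]
    pending.extend(new_docs)  # newly added documents are always clustered
    return kept, pending


clusters = {
    "d1": ["d1", "d2", "d3", "d4", "d5", "d7"],
    "d6": ["d6", "d8", "d9", "d10", "d11", "d15"],
    "d12": ["d12", "d13", "d14", "d16", "d17", "d18"],
}
new_docs = ["d19", "d20", "d21", "d22"]

# Case 1: seed d12 becomes non-seed (new seeds d1, d6, d13, d19).
kept1, pending1 = documents_to_recluster(
    clusters, ["d1", "d6", "d13", "d19"], new_docs)

# Case 2: non-seed d14 becomes a seed (new seeds d1, d6, d12, d14).
kept2, pending2 = documents_to_recluster(
    clusters, ["d1", "d6", "d12", "d14"], new_docs)
```

In both cases the clusters of d1 and d6 are kept as they are, and the members of the d12 cluster join the new documents in the set to be clustered.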
11 C2ICM Algorithm Details
We cluster these documents by assigning each one to the cluster of the seed that covers it most
Documents that do not belong to any cluster are grouped into the ragbag cluster
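A minimal sketch of this assignment step follows. It assumes documents are represented as sets of terms and measures how much a seed covers a document with the binary cover coefficient (shared terms weighted by inverse document frequency, normalized by the document's length). A document that shares no term with any seed goes to the ragbag cluster. The function name and data layout are hypothetical.

```python
def assign_to_seeds(docs, seeds, to_cluster):
    """Assign each document to the seed that covers it most.

    docs       : dict mapping document id to its set of terms
    seeds      : list of seed document ids
    to_cluster : ids of the documents that need a cluster
    Returns (clusters, ragbag).
    """
    # Document frequency of each term over the whole collection.
    df = {}
    for terms in docs.values():
        for t in terms:
            df[t] = df.get(t, 0) + 1

    clusters = {s: [s] for s in seeds}
    ragbag = []
    for d in to_cluster:
        if d in clusters:          # a seed already heads its own cluster
            continue
        best, best_cover = None, 0.0
        for s in seeds:
            shared = docs[d] & docs[s]
            # Cover of d by s; zero when d has no terms or shares none.
            cover = sum(1.0 / df[t] for t in shared) / len(docs[d]) if docs[d] else 0.0
            if cover > best_cover:
                best, best_cover = s, cover
        if best is None:
            ragbag.append(d)       # not covered by any seed
        else:
            clusters[best].append(d)
    return clusters, ragbag


docs = {
    "s1": {"a", "b"},
    "s2": {"c", "d"},
    "x": {"a", "c", "d"},
    "y": {"e"},
}
clusters, ragbag = assign_to_seeds(docs, ["s1", "s2"], ["x", "y"])
```

Here document x shares more weighted terms with s2 than with s1, so it joins the cluster of s2, while y shares no term with either seed and ends up in the ragbag.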
12 C2ICM: An example
Current state of the clusters: {d5 d4 d3 d1 d7 d2} {d8 d9 d15 d6 d10 d11} {d18 d16 d17 d12 d13 d14}
Ragbag cluster
Seed List: d1 d6 d12 d19
13 C2ICM: CASE 1
When a seed document becomes nonseed
Old Seed List: d1 d6 d12
New Seed List: d1 d6 d13 d19
New documents arrived: d19 d20 d21 d22
Clusters: {d5 d4 d3 d1 d7 d2} {d18 d16 d17 d12 d13 d14} {d8 d9 d15 d6 d10 d11}
The set of documents to be clustered
14 C2ICM: CASE 1
Seed document d12 becomes nonseed
Clusters: {d5 d4 d3 d1 d7 d2} {d8 d9 d15 d6 d10 d11}
The set of documents to be clustered: d22 d13 d14 d12 d16 d17 d18 d19 d20 d21
New Seed List: d1 d6 d13 d19
15 C2ICM: CASE 1
New Seed List: d1 d6 d13 d19
Final clusters: {d5 d4 d3 d1 d7 d2} {d8 d9 d15 d6 d10 d11}, plus the reclustered documents d20 d16 d12 d13 d18 d21 d14 d17 d19 d22 (cluster boundaries not recoverable from the diagram)
No elements remaining in the ragbag cluster
16 C2ICM: CASE 2
When a nonseed document in a cluster becomes seed
Old Seed List: d1 d6 d12
New Seed List: d1 d6 d12 d14
New documents arrived: d19 d20 d21 d22
Clusters: {d5 d4 d3 d1 d7 d2} {d18 d16 d17 d12 d13 d14} {d8 d9 d15 d6 d10 d11}
The set of documents to be clustered
17 C2ICM: CASE 2
Nonseed document d14 becomes seed.
Clusters: {d5 d4 d3 d1 d7 d2} {d8 d9 d15 d6 d10 d11}
The set of documents to be clustered: d12 d13 d14 d16 d17 d18 d19 d20 d21 d22
New Seed List: d1 d6 d12 d14 (d14 becomes new seed)
18 C2ICM: CASE 2
New Seed List: d1 d6 d12 d14 (d14 becomes new seed)
Final clusters: {d5 d4 d3 d1 d7 d2} {d8 d9 d15 d6 d10 d11}, plus the reclustered documents d20 d16 d13 d12 d22 d18 d21 d19 d17 d14 (cluster boundaries not recoverable from the diagram)
No elements remaining in the ragbag cluster
19 A Former Implementation of C2ICM for Very Large Datasets
C2ICM was implemented as two programs (VS Pascal)
Program I selects the seeds
Program II clusters the documents using the C2ICM algorithm
The two programs communicate by exchanging files
Diagram: documents -> Program I (Seed Selector) -> text files -> Program II (C2ICM) -> clusters
20 Former Experiments
C2ICM was tested on a subset of the MARIAN database (~43K documents) in 1995.
Six experiments were done.
Each incremental update added ~6K documents to an initially clustered collection of a different size
21 Results for the Former Experiments
C2ICM provides time savings
Clusters formed with C2ICM were very similar to the clusters formed with C3M
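The slides do not say how the similarity between the C2ICM and C3M clusterings was measured. As an assumption for illustration only, the sketch below uses the Rand index, a standard measure that counts the fraction of document pairs on which two clusterings agree (both put the pair together, or both keep it apart). The function name and label dictionaries are hypothetical, and at least two documents are required.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of document pairs treated the same way by two clusterings.

    labels_a, labels_b : dicts mapping each document id to a cluster label.
    Both dicts must cover the same documents (at least two of them).
    """
    agree = total = 0
    for x, y in combinations(sorted(labels_a), 2):
        same_a = labels_a[x] == labels_a[y]
        same_b = labels_b[x] == labels_b[y]
        agree += same_a == same_b   # True counts as 1
        total += 1
    return agree / total


incremental = {"d1": 0, "d2": 0, "d3": 1, "d4": 1}
batch = {"d1": "x", "d2": "x", "d3": "y", "d4": "x"}
score = rand_index(incremental, batch)
```

A value of 1.0 means the two clusterings are identical up to relabeling. In the toy data above the clusterings agree on three of the six pairs.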
22 Conclusion
The cluster maintenance problem is challenging
Our aim is to conduct experiments for C2ICM with a very large number of documents (i.e., millions)
The HARD dataset will be used for evaluation
Information retrieval performance will be measured
The implementation of C2ICM must be time- and memory-efficient
23 References
Can, F., Ozkarahan, E. A. "Concepts and effectiveness of the cover coefficient-based clustering methodology for text databases." ACM Transactions on Database Systems. Vol. 15, No. 4 (December 1990), pp. 483-517.
Can, F. "Incremental clustering for dynamic information processing." ACM Transactions on Information Systems. Vol. 11, No. 2 (April 1993), pp. 143-164.
Can, F., Fox, E. A., Snavely, C. D., France, R. K. "Incremental clustering for very large document databases: initial MARIAN experience." Information Sciences. Vol. 84 (1995), pp. 101-114.
Jain, A. K., Murty, M. N., Flynn, P. J. "Data clustering: a review." ACM Computing Surveys. Vol. 31, No. 3 (September 1999), pp. 264-323.
24 Questions?