
1 Clustering Algorithms
Information Retrieval: Data Structures and Algorithms, W.B. Frakes and R. Baeza-Yates (Eds.). Englewood Cliffs, NJ: Prentice Hall, 1992. (Chapter 16)

2 Application of Clustering
• Term clustering (column viewpoint of the term-document matrix): thesaurus construction
• Document clustering (row viewpoint): searching, browsing

3 Automatic Document Classification
• Searching vs. Browsing
• Disadvantages of using inverted index files
  » information pertaining to a document is scattered among many different inverted-term lists
  » information relating to different documents with similar term assignments is not in close proximity in the file system
• Approaches
  » inverted-index files (for searching) + clustered document collection (for browsing)
  » clustered file organization (for searching and browsing)

4 Typical Clustered File Organization
[Figure: a cluster tree with the highest-level centroid at the root, supercentroids below it, then centroids, then documents; a typical search path descends from the root through the centroids to the documents.]

5 Cluster Generation vs. Cluster Search
• Cluster generation
  » The cluster structure is generated only once.
  » Cluster maintenance can be carried out at relatively infrequent intervals.
  » The cluster-generation process may therefore be slower and more expensive.
• Cluster search
  » Cluster search operations may have to be performed continually.
  » Cluster search operations must therefore be carried out efficiently.

6 Hierarchical Cluster Generation
• Two strategies
  » pairwise item similarities
  » heuristic methods
• Models
  » Divisive clustering (top down)
    – The complete collection is assumed to represent one complete cluster.
    – The collection is then successively broken down into smaller pieces.
  » Hierarchical agglomerative clustering (bottom up)
    – Individual item similarities are used as a starting point.
    – A gluing operation collects similar items, or groups, into larger groups.

7 Hierarchical Agglomerative Clustering
Basic procedure:
1. Place each of the N documents into a class of its own.
2. Compute all pairwise document-document similarity coefficients (N(N-1)/2 coefficients).
3. Form a new cluster by combining the most similar pair of current clusters i and j; update the similarity matrix by deleting the rows and columns corresponding to i and j; calculate the entries in the row corresponding to the new cluster i+j.
4. Repeat step 3 if the number of clusters left is greater than 1.
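
This loop can be sketched directly in code. The following is a minimal illustration, not the book's implementation: sim is any document-document similarity function, and linkage decides how item similarities combine into a cluster-cluster similarity (max and min anticipate the single-link and complete-link criteria of the next slides).

def hac(items, sim, linkage=max):
    clusters = [frozenset([i]) for i in items]            # step 1
    merges = []                                           # (similarity, merged cluster)
    while len(clusters) > 1:                              # step 4
        # steps 2-3: most similar pair of current clusters
        s, a, b = max(((linkage(sim(i, j) for i in x for j in y), x, y)
                       for x in clusters for y in clusters if x != y),
                      key=lambda t: t[0])
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((s, a | b))
    return merges

Recomputing the linkage over all item pairs each round stands in for the slide's row-and-column matrix update; for the criteria discussed here the two are equivalent, just slower.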

8 How to Combine Clusters?
• Intercluster similarity
  » single link
  » complete link
  » group-average link
• Single-link clustering
  » Each document must have a similarity exceeding a stated threshold value with at least one other document in the same class.
  » The similarity between a pair of clusters is taken to be the similarity between the most similar pair of items, one from each cluster.
  » Each cluster member will be more similar to at least one member of its own cluster than to any member of another cluster.

9 How to Combine Clusters? (Continued)
• Complete-link clustering
  » Each document has a similarity exceeding the threshold value to all other documents in the same class.
  » The similarity between the least similar pair of items, one from each cluster, is used as the cluster similarity.
  » Each cluster member is more similar to the most dissimilar member of its own cluster than to the most dissimilar member of any other cluster.

10 How to Combine Clusters? (Continued)
• Group-average link clustering
  » uses the average value of the pairwise links within a cluster to determine similarity
  » all objects contribute to the intercluster similarity
  » results in a structure intermediate between the loosely bound single-link clusters and the tightly bound complete-link clusters
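
The three criteria differ only in how the similarity row for a merged cluster is computed. A small illustrative sketch of the update rules, in the row-update formulation the slides use (the hac() sketch above instead recomputes the linkage over raw item pairs, which yields the same result); s_ix and s_jx are the current entries sim(i,x) and sim(j,x), and ni, nj are the sizes of the merging clusters:

def single_link(s_ix, s_jx, ni, nj):
    # new row entry = similarity of the most similar pair
    return max(s_ix, s_jx)

def complete_link(s_ix, s_jx, ni, nj):
    # new row entry = similarity of the least similar pair
    return min(s_ix, s_jx)

def group_average_link(s_ix, s_jx, ni, nj):
    # size-weighted mean keeps the entry equal to the average pairwise
    # similarity between the merged cluster and x (a UPGMA-style update)
    return (ni * s_ix + nj * s_jx) / (ni + nj)

Each function returns sim(i+j, x) when clusters i and j merge.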

11 Example for Agglomerative Clustering
A-F (6 items): 6(6-1)/2 = 15 pairwise similarities, processed in decreasing order.

12 Single-Link Clustering

Step 1: merge A and F (similarity 0.9). Initial similarity matrix:

     A   B   C   D   E   F
A    .  .3  .5  .6  .8  .9
B   .3   .  .4  .5  .7  .8
C   .5  .4   .  .3  .5  .2
D   .6  .5  .3   .  .4  .1
E   .8  .7  .5  .4   .  .3
F   .9  .8  .2  .1  .3   .

Step 2: merge AF and E (pair AE, similarity 0.8), using sim(AF,X) = max(sim(A,X), sim(F,X)):

     AF   B   C   D   E
AF    .  .8  .5  .6  .8
B    .8   .  .4  .5  .7
C    .5  .4   .  .3  .5
D    .6  .5  .3   .  .4
E    .8  .7  .5  .4   .

Dendrogram so far: A and F join at 0.9; E joins them at 0.8, with sim(AEF,X) = max(sim(AF,X), sim(E,X)).

13 Single-Link Clustering (Cont.)

Step 3: merge AEF and B (pair BF, similarity 0.8). Matrix before the merge:

      AEF   B   C   D
AEF     .  .8  .5  .6
B      .8   .  .4  .5
C      .5  .4   .  .3
D      .6  .5  .3   .

Step 4: the next pair, BE (similarity 0.7), falls within an existing cluster, so nothing changes; note that E and B are on the same level (both join at 0.8). With sim(ABEF,X) = max(sim(AEF,X), sim(B,X)):

       ABEF   C   D
ABEF      .  .5  .6
C        .5   .  .3
D        .6  .3   .

(The next merge will use sim(ABDEF,X) = max(sim(ABEF,X), sim(D,X)).)

14 Single-Link Clustering (Cont.)

Step 5: merge ABEF and D (pair AD, similarity 0.6):

       ABDEF   C
ABDEF      .  .5
C         .5   .

Step 6: merge ABDEF and C (pair AC, similarity 0.5). All six items are now in a single cluster; the dendrogram joins at levels 0.9, 0.8, 0.8, 0.6, and 0.5.

15 Single-Link Clusters
• Similarity level 0.7 (i.e., similarity threshold): ABEF, C, D
• Similarity level 0.5 (i.e., similarity threshold): ABCDEF
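
As a check, the hac() sketch from slide 7 reproduces this example (illustrative; ties at equal similarity, such as B and E both joining at 0.8, may merge in a different order than the slide's pair-by-pair walkthrough, but the clusters at each threshold agree):

SIM = {  # pairwise similarities from slide 12
    ('A','B'): .3, ('A','C'): .5, ('A','D'): .6, ('A','E'): .8, ('A','F'): .9,
    ('B','C'): .4, ('B','D'): .5, ('B','E'): .7, ('B','F'): .8,
    ('C','D'): .3, ('C','E'): .5, ('C','F'): .2,
    ('D','E'): .4, ('D','F'): .1, ('E','F'): .3,
}
sim = lambda i, j: SIM[min(i, j), max(i, j)]         # symmetric lookup

for s, cluster in hac("ABCDEF", sim, linkage=max):   # max = single link
    print(round(s, 1), "".join(sorted(cluster)))
# Merges occur at 0.9, 0.8, 0.8, 0.6, 0.5: cutting at threshold 0.7
# leaves ABEF, C, D; cutting at 0.5 leaves the single cluster ABCDEF.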

16 Complete-Link Cluster Generation

Pairs are taken up in decreasing similarity order; two clusters merge only after check operations confirm that every cross pair between them has already been covered. The merged row is computed as sim(AF,X) = min(sim(A,X), sim(F,X)). Initial similarity matrix:

     A   B   C   D   E   F
A    .  .3  .5  .6  .8  .9
B   .3   .  .4  .5  .7  .8
C   .5  .4   .  .3  .5  .2
D   .6  .5  .3   .  .4  .1
E   .8  .7  .5  .4   .  .3
F   .9  .8  .2  .1  .3   .

Step  Pair  Sim  Check operations and result
1     AF    0.9  (none needed)           new cluster: A-F joined at 0.9
2     AE    0.8  check EF: fails         pairs covered: (A,E) (A,F)
3     BF    0.8  check AB: fails         pairs covered: (A,E) (A,F) (B,F)

17 Complete-Link Cluster Generation (Cont.)

Matrix after step 1 (AF merged), with sim(AF,X) = min(sim(A,X), sim(F,X)):

     AF   B   C   D   E
AF    .  .3  .2  .1  .3
B    .3   .  .4  .5  .7
C    .2  .4   .  .3  .5
D    .1  .5  .3   .  .4
E    .3  .7  .5  .4   .

Step  Pair  Sim  Check operations and result
4     BE    0.7  (none needed)           new cluster: B-E joined at 0.7
5     AD    0.6  check DF: fails         pairs covered: (A,D) (A,E) (A,F) (B,E) (B,F)
6     AC    0.5  check CF: fails         pairs covered: (A,C) (A,D) (A,E) (A,F) (B,E) (B,F)
7     BD    0.5  check DE: fails         pairs covered: (A,C) (A,D) (A,E) (A,F) (B,D) (B,E) (B,F)

18 Complete-Link Cluster Generation (Cont.)

Matrix after steps 1 and 4 (AF and BE merged), with sim(BE,X) = min(sim(B,X), sim(E,X)):

     AF  BE   C   D
AF    .  .3  .2  .1
BE   .3   .  .4  .4
C    .2  .4   .  .3
D    .1  .4  .3   .

Step  Pair  Sim  Check operations and result
8     CE    0.5  check BC: fails; (C,E) added to the checklist
9     BC    0.4  check CE: 0.5, covered; C joins B-E at level 0.4
10    DE    0.4  check BD: 0.5, covered, but CD fails; (D,E) recorded
11    AB    0.3  check AC: 0.5, AE: 0.8, BF: 0.8, all covered, but CF and EF fail; (A,B) recorded

Pairs covered so far: (A,B) (A,C) (A,D) (A,E) (A,F) (B,C) (B,D) (B,E) (B,F) (C,E) (D,E)

19 Complete-Link Cluster Generation (Cont.)

Matrix after step 9 (C joined B-E):

      AF  BCE   D
AF     .   .2  .1
BCE   .2    .  .3
D     .1   .3   .

Step  Pair  Sim  Check operations and result
12    CD    0.3  check BD: 0.5 and DE: 0.4, both covered; D joins B-C-E at level 0.3
13    EF    0.3  check BF: 0.8, covered, but CF and DF fail; (E,F) recorded
14    CF    0.2  check BF: 0.8 and EF: 0.3, covered, but DF fails; (C,F) recorded

20 Complete-Link Cluster Generation (Cont.)

Matrix after step 12 (D joined):

       AF  BCDE
AF      .    .1
BCDE   .1     .

Step 15: DF (0.1) is the last pair; A-F and B-C-D-E join at 0.1. The final dendrogram joins at levels 0.9 (A,F), 0.7 (B,E), 0.4 (C), 0.3 (D), and 0.1 (everything).

21 Complete-Link Clusters
• Similarity level 0.7: AF (joined at 0.9), BE (joined at 0.7), C, D
• Similarity level 0.4: AF, BCE (C joins B-E at 0.4), D
• Similarity level 0.3: AF, BCDE (D joins at 0.3)
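
Continuing the example above, the same hac() sketch reproduces this structure with linkage=min (illustrative; note that C and D both reach the B-E cluster at similarity 0.4, so a different tie-breaking order could merge D first):

for s, cluster in hac("ABCDEF", sim, linkage=min):   # min = complete link
    print(round(s, 1), "".join(sorted(cluster)))
# Merges at 0.9 (AF), 0.7 (BE), 0.4 (BCE), 0.3 (BCDE), 0.1 (ABCDEF):
# far fewer high-level merges than single link, and the full collection
# joins only at 0.1.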

22 Group-Average Link Clustering
• Group-average link clustering
  » uses the average values of the pairwise links within a cluster to determine similarity
  » all objects contribute to the intercluster similarity
  » results in a structure intermediate between the loosely bound single-link clusters and the tightly bound complete-link clusters

23 Comparison
• The behavior of single-link clustering
  » The single-link process tends to produce a small number of large clusters that are characterized by a chaining effect.
  » Each element is usually attached to only one other member of the same cluster at each similarity level.
  » It is sufficient to remember the list of previously clustered single items.

24 Comparison
• The behavior of complete-link clustering
  » The complete-link process produces a much larger number of small, tightly linked groupings.
  » Each item in a complete-link cluster is guaranteed to resemble all other items in that cluster at the stated similarity level.
  » It is necessary to remember the list of all item pairs previously considered in the clustering process.
• Comparison
  » The complete-link clustering system may be better adapted to retrieval than the single-link clusters.
  » Complete-link cluster generation is more expensive to perform than a comparable single-link process.

25 How to Generate Similarity
Notation:
  D_i = (w_{1,i}, w_{2,i}, ..., w_{t,i})     document vector for D_i
  L_j = (l_{j,1}, l_{j,2}, ..., l_{j,n_j})   inverted list for term T_j
  l_{j,i} denotes the document identifier of the ith document listed under term T_j
  n_j denotes the number of postings for term T_j

for j = 1 to t   (for each of t possible terms)
    for i = 1 to n_j   (for all n_j entries on the jth list)
        compute sim(D_{l_{j,i}}, D_{l_{j,k}}), i+1 <= k <= n_j
    end for
end for
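
A direct Python rendering of this double loop (a sketch, not the book's code; the inverted index and the document-level sim function are assumed given):

def pairwise_sims(inverted, sim):
    # inverted: term -> list of document identifiers containing it
    out = {}
    for postings in inverted.values():            # for j = 1 to t
        for i, d1 in enumerate(postings):         # for i = 1 to n_j
            for d2 in postings[i + 1:]:           # i+1 <= k <= n_j
                out[(d1, d2)] = sim(d1, d2)       # recomputed whenever a
    return out                                    # pair shares several terms

The redundant recomputation flagged in the last comment is exactly what the next slide removes.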

26 Similarity Without Recomputation

set S_{j,i} = 0, 1 <= i, j <= N
for j = 1 to N   (for each document in the collection)
    for each term k in document D_j
        take up inverted list L_k
        for i = 1 to n_k   (for each document identifier on list L_k)
            if j < l_{k,i} or S_{j,l_{k,i}} = 1
            then take up the next document
            else compute sim(D_j, D_{l_{k,i}}) and set S_{j,l_{k,i}} = 1
        end for
end for
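
A sketch of the same idea in Python (illustrative; document identifiers are assumed comparable, and a per-document set plays the role of the S flags):

def pairwise_sims_once(docs, inverted, sim):
    # docs: doc id -> iterable of its terms; inverted: term -> doc ids
    out = {}
    for j in docs:                        # for each document D_j
        seen = set()                      # the S_{j,i} flags
        for k in docs[j]:                 # each term k in D_j
            for i in inverted[k]:         # each identifier on list L_k
                if i >= j or i in seen:   # keep only earlier ids, once
                    continue              # (i == j, the self pair, is skipped too)
                out[(i, j)] = sim(i, j)
                seen.add(i)
    return out

Each unordered pair is now computed exactly once, however many term lists it shares.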

27 Heuristic Clustering Methods
• Hierarchical clustering strategies
  » use all pairwise similarities between items
  » the cluster-generation process is relatively expensive
  » produce a unique set of well-formed clusters for each set of data, regardless of the order in which the similarity pairs are introduced into the clustering process
• Heuristic clustering methods
  » produce rough cluster arrangements at relatively little expense
  » e.g., single-pass clustering

28 Single-Pass Clustering Heuristic Methods
• Item 1 is first taken and placed into a cluster of its own.
• Each subsequent item is then compared against all existing clusters.
  » Compute the similarities between all existing centroids and the new incoming item.
• The item is placed in a previously existing cluster whenever it is sufficiently similar to that cluster.
  » When an item is added to an existing cluster, the corresponding centroid must then be appropriately updated.
• If a new item is not sufficiently similar to any existing cluster, the new item forms a cluster of its own.
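
A minimal sketch of the single pass (illustrative: documents are equal-length numeric vectors, threshold is the minimum centroid similarity for joining, and the centroid is maintained as the running mean of its members; this sketch joins the single most similar cluster, while variants allow overlapping placement):

def single_pass(docs, similarity, threshold):
    clusters = []                                    # each: centroid + members
    for doc in docs:
        best, best_sim = None, threshold
        for c in clusters:                           # compare to every centroid
            s = similarity(c["centroid"], doc)
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:                             # no cluster similar enough:
            clusters.append({"centroid": list(doc), "members": [doc]})
        else:                                        # join, then update centroid
            best["members"].append(doc)
            n = len(best["members"])
            best["centroid"] = [(c * (n - 1) + x) / n
                                for c, x in zip(best["centroid"], doc)]
    return clusters

Because each document is seen once and compared only against centroids, the cost grows roughly with the number of documents times the number of clusters, rather than quadratically in the documents.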

29 Single-Pass Clustering Heuristic Methods (Continued)
• Characteristics
  » Produces uneven cluster structures.
  » Produces cluster arrangements that vary according to the order in which the individual items are processed.
• Solutions
  » cluster splitting: controls cluster sizes
  » variable similarity thresholds: control the number of clusters and the overlap among clusters

30 Cluster Splitting
[Figure: the addition of one more item to cluster A causes cluster A to split into two pieces A' and A'', which in turn splits the supercluster S into two pieces S' and S''.]

31 Cluster Searching
• Cluster centroid: the average vector of all the documents in a given cluster
• Strategies
  » top down: the query is first compared with the highest-level centroids
  » bottom up: only the lowest-level centroids are stored; the higher-level cluster structure is disregarded

32 Top-Down Entire-Clustering Search
1. Initialize by adding the top item to the active node list.
2. Take the centroid with the highest query similarity from the active node list; if the number of singleton items in the subtree headed by that centroid is not larger than the number of items still wanted, then retrieve these singleton items and eliminate the centroid from the active node list; else eliminate the centroid from the active node list and add its sons to the active node list.
3. If the number of retrieved items >= the number wanted, then stop; else repeat step 2.
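
A sketch of this search in Python (illustrative node shape: each node carries a centroid vector, size = the number of singleton items in its subtree, children for inner nodes, and items() yielding the singleton documents under it):

def top_down_search(root, query, sim, wanted):
    active = [root]                                   # step 1
    retrieved = []
    while active and len(retrieved) < wanted:         # step 3
        # step 2: centroid with the highest query similarity
        node = max(active, key=lambda n: sim(n.centroid, query))
        active.remove(node)
        if node.size <= wanted - len(retrieved):      # small enough subtree
            retrieved.extend(node.items())
        else:                                         # too big: expand sons
            active.extend(node.children)
    return retrieved

This matches the trace on the next slide: a node is "too big" whenever its subtree holds more singletons than are still wanted.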

33 Example: tracing a top-down search (4 items wanted; node numbers and similarities refer to a cluster tree not shown here)

Active node list                          Singletons in subtree   Retrieved items
(1, 0.2)                                  14 (too big)
(2, 0.5), (4, 0.7), (3, 0)                6 (too big)
(2, 0.5), (8, 0.8), (9, 0.3), (3, 0)      2                       I, J
(2, 0.5), (9, 0.3), (3, 0)                4 (too big)
(5, 0.6), (6, 0.5), (9, 0.3), (3, 0)      2                       A, B

Each pair is (node, query similarity); at every step the node with the highest similarity is taken up, and the subtree count refers to that node.

34 Bottom-Up Individual-Cluster Search
Take a specified number of low-level centroids;
if there are enough singleton items in those clusters to equal the number of items wanted,
then retrieve the number of items wanted in ranked order;
else add additional low-level centroids to the list and repeat the test.
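
A sketch in the same illustrative style (low-level centroids carry their cluster's documents via items(); pooled documents are ranked against the query once enough have been gathered):

def bottom_up_search(centroids, query, sim, wanted, take=3):
    # best low-level centroids first
    ranked = sorted(centroids, key=lambda c: sim(c.centroid, query),
                    reverse=True)
    used, pool = 0, []
    while len(pool) < wanted and used < len(ranked):
        used = max(take, used + 1)                 # add centroids as needed
        pool = [d for c in ranked[:used] for d in c.items()]
    pool.sort(key=lambda d: sim(d.vector, query), reverse=True)
    return pool[:wanted]                           # items in ranked order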

35 Example: bottom-up search (3 items wanted)
Active centroid list: (8, 0.8), (4, 0.7), (5, 0.6)
Ranked documents from those clusters: (I, 0.9), (L, 0.8), (A, 0.8), (K, 0.6), (B, 0.5), (J, 0.4), (N, 0.4), (M, 0.2)
Retrieved items: I, L, A

