1
Clustering
C. Watters, CS6403
2
Clustering: What, Why, How, Results
3
Clustering
Assign items to groups based on some calculation of the degree of likeness between items
Groups are not known beforehand
Uses multivariate analysis techniques
Feature set determination is critical
4
Example: news data
Sports, world news, entertainment, etc.
Short items, items with photos, items with names
5
Why
Improve efficiency of retrieval
Improve effectiveness of retrieval
Ranking of retrieved results
Visualization of results: Kohonen maps and SOM (self-organizing maps)
Discovery of content
Discovery of relationships
6
How
Put items into groups so that members have a high degree of association within the group AND a low degree of association with items in other groups
What is association for IR documents?
What is the feature set?
7
Feature Sets for IR Clustering
Term occurrences
Citations
Names
Structure (tags)
Co-occurrences (thesaurus construction)
8
Problems
Choosing the best feature set
Choosing the similarity measure
Evaluation of results
Updates
Searching clusters
9
Measures of Similarity
Need to quantify the degree of association of an item with others
Generally want a measure that is normalized by document vector length
Not clear that weighted document terms are better than binary ones in clustering
10
General Measures
Dice coefficient
Jaccard coefficient
Cosine coefficient
11
Dice Coefficient (binary weights)
Dice(i, j) = 2C / (A + B)
where C = number of terms in common, A = number of terms in document i, B = number of terms in document j
12
Jaccard Coefficient (binary weights)
Jaccard(i, j) = C / (A + B - C)
where C = number of terms in common, A = number of terms in document i, B = number of terms in document j
13
Cosine Coefficient (binary weights)
Cosine(i, j) = C / sqrt(A * B)
where C = number of terms in common, A = number of terms in document i, B = number of terms in document j (a small code sketch of all three coefficients follows)
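A minimal sketch of the three binary-weight coefficients, assuming documents are represented as sets of terms; the function names and example documents are illustrative, not from the course:

```python
import math

def dice(doc_i, doc_j):
    """Dice coefficient: 2C / (A + B) for binary term sets."""
    c = len(doc_i & doc_j)                 # C: terms in common
    return 2 * c / (len(doc_i) + len(doc_j))

def jaccard(doc_i, doc_j):
    """Jaccard coefficient: C / (A + B - C)."""
    c = len(doc_i & doc_j)
    return c / (len(doc_i) + len(doc_j) - c)

def cosine(doc_i, doc_j):
    """Cosine coefficient for binary weights: C / sqrt(A * B)."""
    c = len(doc_i & doc_j)
    return c / math.sqrt(len(doc_i) * len(doc_j))

# Toy documents (assumes non-empty term sets).
doc_i = {"news", "sports", "hockey", "score"}
doc_j = {"news", "sports", "film", "review"}
print(dice(doc_i, doc_j), jaccard(doc_i, doc_j), cosine(doc_i, doc_j))
# 0.5, 0.333..., 0.5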
14
Now what?
Need to be able to compare any doc to any other doc
Need: a doc-doc similarity matrix, e.g. for five documents the 5 x 5 matrix of entries s_ij
s11 s12 s13 s14 s15
s21 s22 s23 s24 s25
s31 s32 s33 s34 s35
s41 s42 s43 s44 s45
s51 s52 s53 s54 s55
15
Generating Similarity Matrix
Use the inverted file
Documents with no terms in common do not need a similarity calculation
Generally generate only one row at a time, as needed (see the sketch below)
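A rough sketch of the row-at-a-time idea, assuming a toy in-memory inverted file (term mapped to the set of doc ids containing it) and the binary cosine coefficient from the previous sketch; all names and the toy collection are illustrative:

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Binary cosine coefficient over term sets (as in the earlier sketch)."""
    c = len(a & b)
    return c / math.sqrt(len(a) * len(b)) if a and b else 0.0

docs = {
    1: {"news", "sports", "hockey"},
    2: {"news", "film", "review"},
    3: {"weather", "forecast"},
}

# Build the inverted file: term -> set of doc ids containing it.
inverted = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        inverted[term].add(doc_id)

def similarity_row(doc_id):
    """One row of the doc-doc similarity matrix, computed on demand.
    Only documents sharing at least one term with doc_id are scored;
    every other entry is implicitly zero."""
    candidates = set()
    for term in docs[doc_id]:
        candidates |= inverted[term]
    candidates.discard(doc_id)
    return {other: cosine(docs[doc_id], docs[other]) for other in candidates}

print(similarity_row(1))   # {2: 0.333...}; doc 3 is never compared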
16
Algorithms
Problem: sort N things into M groups, where M is in [1, N]
Choice of algorithm determines M and cluster membership
17
General Classes of Algorithms
Hierarchical: nested groups, pairwise connections made
Non-hierarchical: no overlap, centroid-based
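As an illustration only (not a specific algorithm prescribed in these slides), a naive single-link merge over a precomputed doc-doc similarity matrix at one threshold; repeating the merge with progressively lower thresholds yields the nested groups of a hierarchy. The function name, toy matrix, and threshold are assumptions:

```python
def single_link(sim, threshold):
    """Naive single-link agglomerative merge.
    sim[(i, j)] holds the similarity of docs i and j (with i < j);
    two clusters are merged while their best pairwise link >= threshold."""
    clusters = [{d} for d in {d for pair in sim for d in pair}]
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = max(
                    sim.get((min(i, j), max(i, j)), 0.0)
                    for i in clusters[a] for j in clusters[b]
                )
                if link >= threshold:
                    clusters[a] |= clusters.pop(b)
                    merged = True
                    break
            if merged:
                break
    return clusters

# Toy similarity matrix for four documents.
sim = {(1, 2): 0.8, (1, 3): 0.1, (1, 4): 0.0,
       (2, 3): 0.2, (2, 4): 0.0, (3, 4): 0.7}
print(single_link(sim, 0.5))   # e.g. [{1, 2}, {3, 4}]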
18
Evaluation of results
Was the method appropriate for the data set?
Do the clusters represent the data well?
Are the docs in the right cluster?
19
How to test?
Overlap test: run a known query set and evaluate against known results
Randomly select docs and judge relevance to group members
Examine the distribution of docs in groups
Density test = term occurrences / (docs x unique terms)
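A small worked version of the density test, assuming a binary doc-term representation; the toy collection and variable names are made up for illustration:

```python
# Toy collection: doc id -> set of terms.
collection = {
    1: {"news", "sports", "hockey"},
    2: {"news", "film"},
    3: {"weather"},
}

term_occurrences = sum(len(terms) for terms in collection.values())   # 6
unique_terms = len(set().union(*collection.values()))                 # 5
density = term_occurrences / (len(collection) * unique_terms)         # 6 / (3 * 5) = 0.4
print(density)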
20
Concepts to keep in mind
Cluster hypothesis
Nearest neighbour
Centroid
21
Cluster Hypothesis
Closely associated documents tend to be relevant to the same requests (van Rijsbergen, 1979)
22
Nearest Neighbour
Find the document most similar to the given one
That document is most likely closely related
Works with terms, citations, and clusters
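A minimal nearest-neighbour sketch over binary term sets, using the same cosine coefficient as in the earlier sketches; the helper and toy collection are illustrative:

```python
import math

def cosine(a, b):
    """Binary cosine coefficient over term sets (as in the earlier sketch)."""
    c = len(a & b)
    return c / math.sqrt(len(a) * len(b)) if a and b else 0.0

def nearest_neighbour(doc_id, docs):
    """Return the id of the document most similar to doc_id."""
    others = (d for d in docs if d != doc_id)
    return max(others, key=lambda d: cosine(docs[doc_id], docs[d]))

docs = {1: {"news", "sports", "hockey"},
        2: {"news", "sports"},
        3: {"weather", "forecast"}}
print(nearest_neighbour(1, docs))   # 2: shares the most terms with doc 1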
23
Centroids
Representative of a cluster
May be a document from that cluster
May be a composite of doc features from that cluster
Why: query-centroid calculations, higher-level representations of the data set, building ontologies and thesauri
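One possible sketch of a composite centroid and a query-centroid calculation, using term-frequency vectors and weighted cosine; the weighting scheme and all names are assumptions rather than the course's definition:

```python
import math
from collections import Counter

def centroid(cluster_docs):
    """Composite centroid: average term weight across the cluster's docs
    (each doc here is a Counter of term frequencies)."""
    total = Counter()
    for doc in cluster_docs:
        total.update(doc)
    return {term: count / len(cluster_docs) for term, count in total.items()}

def cosine_weighted(u, v):
    """Cosine similarity for weighted vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

cluster = [Counter({"news": 2, "sports": 1}), Counter({"news": 1, "hockey": 3})]
query = Counter({"news": 1, "hockey": 1})
# Score the query against the cluster's centroid instead of every member doc.
print(cosine_weighted(query, centroid(cluster)))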
24
Visualization of Clusters
Kohonen maps
Star maps
SOM (self-organizing maps)
Etc.
25
Samples
26
Cluster Map
27
Starfield