Clustering C.Watters CS6403.


1 Clustering C.Watters CS6403

2 Clustering
- What
- Why
- How
- Results

3 Clustering
- Assign items to groups based on some calculation of the degree of likeness between items
- Groups are not known beforehand
- Uses multivariate analysis techniques
- Feature set determination is critical

4 Example
- News data
- Sports, world news, entertainment, etc.
- Short items, items with photos, items with names

5 Why
- Improve efficiency of retrieval
- Improve effectiveness of retrieval
- Ranking of retrieved results
- Visualization of results: Kohonen maps and SOM (self-organizing maps)
- Discovery of content
- Discovery of relationships

6 How
- Put items into groups so that members have a high degree of association within the group AND a low degree of association with items in other groups
- Association for IR documents?
- Feature set?

7 Feature Sets for IR Clustering
- Term occurrences
- Citations
- Names
- Structure (tags)
- Co-occurrences (thesaurus construction)

8 Problems
- Choosing the best feature set
- Choosing the similarity measure
- Evaluation of results
- Updates
- Searching clusters

9 Measures of Similarity
- Need to quantify the degree of association of an item with others
- Generally want a measure that is normalized by document vector length
- Not clear that weighted document terms are better than binary ones in clustering

10 General Measures
- Dice coefficient
- Jaccard coefficient
- Cosine coefficient

11 Dice Coefficient
- Binary weights
- Dice(i, j) = 2C / (A + B)
- C = number of terms in common, A = number of terms in i, B = number of terms in j

12 Jaccard Coefficient
- Binary weights
- Jaccard(i, j) = C / (A + B - C)
- C = number of terms in common, A = number of terms in i, B = number of terms in j

13 Cosine Coefficient
- Binary weights
- Cosine(i, j) = C / sqrt(A x B)
- C = number of terms in common, A = number of terms in i, B = number of terms in j
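With binary weights each document reduces to a set of terms, so all three coefficients can be computed from the counts C, A, and B. The following is a minimal sketch assuming the standard binary forms 2C/(A+B), C/(A+B-C), and C/sqrt(A x B); the function names and the two sample documents are illustrative and do not come from the slides.

```python
import math


def dice(terms_i, terms_j):
    """Dice coefficient for two term sets: 2C / (A + B)."""
    a, b = len(terms_i), len(terms_j)
    c = len(terms_i & terms_j)  # C: terms in common
    return 2 * c / (a + b) if a + b else 0.0


def jaccard(terms_i, terms_j):
    """Jaccard coefficient for two term sets: C / (A + B - C)."""
    a, b = len(terms_i), len(terms_j)
    c = len(terms_i & terms_j)
    union = a + b - c
    return c / union if union else 0.0


def cosine(terms_i, terms_j):
    """Cosine coefficient for two term sets: C / sqrt(A * B)."""
    a, b = len(terms_i), len(terms_j)
    c = len(terms_i & terms_j)
    return c / math.sqrt(a * b) if a and b else 0.0


# Hypothetical documents: A = 3 terms, B = 4 terms, C = 2 terms in common.
doc_i = {"cluster", "document", "term"}
doc_j = {"cluster", "term", "query", "rank"}

print(dice(doc_i, doc_j), jaccard(doc_i, doc_j), cosine(doc_i, doc_j))
```

All three measures are 0 when the documents share no terms and 1 when the term sets are identical; they differ in how strongly they penalize unshared terms.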

14 Now what?
- Need to be able to compare any doc to any other doc
- Need?
Doc-Doc Similarity Matrix
11 12 13 14 15
21 22 23 24 25
31 32 33 34 35
41 42 43 44 45
51 52 53 54 55
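A doc-doc similarity matrix holds one similarity value for every pair of documents. The sketch below builds the full matrix for a small collection; it assumes binary term sets and uses the Jaccard coefficient as the measure, and the function names and sample documents are illustrative only.

```python
def jaccard(terms_i, terms_j):
    """Jaccard coefficient for two term sets."""
    union = len(terms_i | terms_j)
    return len(terms_i & terms_j) / union if union else 0.0


def similarity_matrix(docs, sim=jaccard):
    """Return an N x N list of lists where entry [i][j] is sim(doc i, doc j)."""
    n = len(docs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = sim(docs[i], docs[i])
        # The measure is symmetric, so compute each pair once and mirror it.
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = sim(docs[i], docs[j])
    return matrix


# Hypothetical three-document collection.
docs = [
    {"hockey", "score", "team"},
    {"hockey", "team", "trade"},
    {"election", "vote"},
]
matrix = similarity_matrix(docs)
for row in matrix:
    print(row)
```

Because the matrix is symmetric, only N(N-1)/2 pairwise calculations are needed, but storage still grows with N squared.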

15 Generating Similarity Matrix
- Use inverted file
- Documents with no terms in common do not need a similarity calculation
- Generally generate only one row at a time, as needed
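The idea of generating one row at a time from an inverted file can be sketched as follows. This is an illustration under stated assumptions, not the method from the course: documents are binary term sets, the measure is the Dice coefficient, and `build_inverted_file` and `similarity_row` are hypothetical helper names.

```python
from collections import Counter, defaultdict


def build_inverted_file(docs):
    """Map each term to the list of document ids that contain it."""
    inverted = defaultdict(list)
    for doc_id, terms in enumerate(docs):
        for term in terms:
            inverted[term].append(doc_id)
    return inverted


def similarity_row(doc_id, docs, inverted):
    """Return one matrix row as {other_doc_id: Dice similarity}.

    Only documents that share at least one term with doc_id are visited;
    every document missing from the result has similarity 0.
    """
    common = Counter()  # other_doc_id -> number of shared terms (C)
    for term in docs[doc_id]:
        for other in inverted[term]:
            if other != doc_id:
                common[other] += 1
    a = len(docs[doc_id])
    return {other: 2 * c / (a + len(docs[other])) for other, c in common.items()}


# Hypothetical three-document collection.
docs = [
    {"hockey", "score", "team"},
    {"hockey", "team", "trade"},
    {"election", "vote"},
]
inverted = build_inverted_file(docs)
print(similarity_row(0, docs, inverted))
```

The third document never appears in the row for document 0 because no posting list connects them, which is exactly the saving the inverted file provides.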

16 Algorithms
- Problem: sort N things into M groups, where 1 <= M <= N
- Choice of algorithm determines M membership

17 General Classes of Algorithms
- Hierarchical
  - Nested groups
  - Pairwise connections made
- Non-hierarchical
  - No overlap
  - Centroid
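To make the hierarchical class concrete, here is a minimal sketch of single-link agglomerative clustering, one common hierarchical method. The slides do not name a specific algorithm, so the choice of single link, the Jaccard measure, and the sample documents are all assumptions made for illustration.

```python
def jaccard(terms_i, terms_j):
    """Jaccard coefficient for two term sets."""
    union = len(terms_i | terms_j)
    return len(terms_i & terms_j) / union if union else 0.0


def single_link(docs, sim=jaccard):
    """Merge the two most similar clusters until one remains.

    Returns the merges in order as (sorted member ids, similarity at merge),
    which describes the nested groups from the bottom up.
    """
    clusters = [frozenset([i]) for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single link: cluster similarity is that of the closest pair.
                s = max(sim(docs[i], docs[j])
                        for i in clusters[x] for j in clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        s, x, y = best
        merged = clusters[x] | clusters[y]
        merges.append((sorted(merged), s))
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges


# Hypothetical four-document collection with two obvious topics.
docs = [
    {"hockey", "score", "team"},
    {"hockey", "team", "trade"},
    {"election", "vote"},
    {"election", "vote", "poll"},
]
merges = single_link(docs)
for members, s in merges:
    print(members, round(s, 3))
```

Cutting the merge list at a similarity threshold yields a flat set of groups; a non-hierarchical method would instead assign each document directly to one of M groups, typically by comparing it with cluster centroids.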

18 Evaluation of results
- Was the method appropriate for the data set?
- Do the clusters represent the data well?
- Are the docs in the right cluster?

19 How to test?
- Overlap test
- Run a known query set and evaluate against known results
- Randomly select docs and judge relevance to group members
- Examine distribution of docs in groups
- Density test = term occurrences / (docs x unique terms)
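The density test can be sketched in a few lines. This assumes the ratio is term occurrences divided by (docs x unique terms) and that, with binary weights, a term counts once per document that contains it; the function name and sample data are illustrative.

```python
def density(docs):
    """Term occurrences divided by (number of docs x number of unique terms).

    Assumes each doc is a set of terms, so a term counts once per document.
    """
    occurrences = sum(len(terms) for terms in docs)
    vocabulary = set().union(*docs) if docs else set()
    cells = len(docs) * len(vocabulary)
    return occurrences / cells if cells else 0.0


# Hypothetical collection: 8 occurrences, 3 docs, 6 unique terms.
docs = [
    {"hockey", "score", "team"},
    {"hockey", "team", "trade"},
    {"election", "vote"},
]
print(density(docs))
```

A low value means the document-term matrix is sparse, so most document pairs share few or no terms.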

20 Concepts to keep in mind
- Cluster hypothesis
- Nearest neighbour
- Centroid

21 Cluster Hypothesis
- Associations between documents are related to the relevance of documents to queries (Van Rijsbergen, 1979)

22 Nearest Neighbour
- Find the document most similar to the given one
- This one is most likely closely related
- Works with terms, citations, & clusters
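Finding a nearest neighbour amounts to scanning the other documents and keeping the one with the highest similarity. A minimal sketch, assuming binary term sets and the Jaccard coefficient; the function name and sample documents are illustrative.

```python
def jaccard(terms_i, terms_j):
    """Jaccard coefficient for two term sets."""
    union = len(terms_i | terms_j)
    return len(terms_i & terms_j) / union if union else 0.0


def nearest_neighbour(doc_id, docs, sim=jaccard):
    """Return (id, similarity) of the document most similar to docs[doc_id]."""
    best_id, best_sim = None, -1.0
    for other, terms in enumerate(docs):
        if other == doc_id:
            continue  # a document is not its own neighbour
        s = sim(docs[doc_id], terms)
        if s > best_sim:
            best_id, best_sim = other, s
    return best_id, best_sim


# Hypothetical four-document collection.
docs = [
    {"hockey", "score", "team"},
    {"hockey", "team", "trade"},
    {"election", "vote"},
    {"election", "vote", "poll"},
]
print(nearest_neighbour(2, docs))
```

The same scan works for other feature sets, such as citations, by swapping the sets being compared.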

23 Centroids
- Representative of a cluster
- May be a document from that cluster
- May be a composite of doc features from that cluster
- Why:
  - query-centroid calculations
  - higher-level representations of the data set
  - build ontologies and thesauri
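Both kinds of centroid can be sketched briefly. The composite version below averages binary term vectors, and the representative version picks the member with the highest total similarity to the rest of the cluster; both definitions, the function names, and the sample cluster are assumptions for illustration rather than the definitions used in the course.

```python
from collections import Counter


def jaccard(terms_i, terms_j):
    """Jaccard coefficient for two term sets."""
    union = len(terms_i | terms_j)
    return len(terms_i & terms_j) / union if union else 0.0


def composite_centroid(cluster):
    """Composite centroid: fraction of cluster docs that contain each term."""
    counts = Counter(term for terms in cluster for term in terms)
    return {term: n / len(cluster) for term, n in counts.items()}


def representative_doc(cluster, sim=jaccard):
    """Index of the member with the highest total similarity to the others."""
    def total_similarity(i):
        return sum(sim(cluster[i], cluster[j])
                   for j in range(len(cluster)) if j != i)
    return max(range(len(cluster)), key=total_similarity)


# Hypothetical cluster of three sports documents.
cluster = [
    {"hockey", "score", "team"},
    {"hockey", "team", "trade"},
    {"hockey", "team"},
]
centroid = composite_centroid(cluster)
print(centroid)
print(representative_doc(cluster))
```

A query can then be compared with one centroid per cluster instead of with every document, which is the query-centroid calculation listed above.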

24 Visualization of Clusters
- Kohonen maps
- Star maps
- SOM (self-organizing maps)
- Etc.

25 Samples

26 Cluster Map

27 Starfield
