1
Clustering
C. Watters, CS6403
2
Clustering: What, Why, How, Results
3
Clustering
Assign items to groups based on some calculation of the degree of likeness between items
Groups are not known beforehand
Uses multivariate analysis techniques
Feature set determination is critical
4
Example: news data
Sports, world news, entertainment, etc.
Short items, items with photos, items with names
5
Why
Improve efficiency of retrieval
Improve effectiveness of retrieval
Ranking of retrieved results
Visualization of results: Kohonen maps and SOM (self-organizing maps)
Discovery of content
Discovery of relationships
6
How
Put items into groups so that members have a high degree of association within the group AND a low degree of association with items in other groups
What is association for IR documents?
What is the feature set?
7
Feature Sets for IR Clustering
Term occurrences
Citations
Names
Structure (tags)
Co-occurrences (thesaurus construction)
8
Problems
Choosing the best feature set
Choosing the similarity measure
Evaluation of results
Updates
Searching clusters
9
Measures of Similarity
Need to quantify the degree of association of an item with others
Generally want a measure that is normalized by document vector length
Not clear that weighted document terms are better than binary ones in clustering
10
General Measures
Dice coefficient
Jaccard coefficient
Cosine coefficient
11
Dice Coefficient (binary weights)
Dice(i, j) = 2C / (A + B)
where C = number of terms in common, A = number of terms in document i, B = number of terms in document j
12
Jaccard Coefficient (binary weights)
Jaccard(i, j) = C / (A + B - C)
where C = number of terms in common, A = number of terms in document i, B = number of terms in document j
13
Cosine Coefficient (binary weights)
Cosine(i, j) = C / sqrt(A * B)
where C = number of terms in common, A = number of terms in document i, B = number of terms in document j (a small code sketch of all three coefficients follows)
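A minimal sketch of the three binary-weight coefficients, assuming documents are represented as sets of terms; the function names and example documents are illustrative, not from the course:

```python
import math

def dice(doc_i, doc_j):
    """Dice coefficient: 2C / (A + B) for binary term sets."""
    c = len(doc_i & doc_j)                 # C: terms in common
    return 2 * c / (len(doc_i) + len(doc_j))

def jaccard(doc_i, doc_j):
    """Jaccard coefficient: C / (A + B - C)."""
    c = len(doc_i & doc_j)
    return c / (len(doc_i) + len(doc_j) - c)

def cosine(doc_i, doc_j):
    """Cosine coefficient for binary weights: C / sqrt(A * B)."""
    c = len(doc_i & doc_j)
    return c / math.sqrt(len(doc_i) * len(doc_j))

# Toy documents (assumes non-empty term sets).
doc_i = {"news", "sports", "hockey", "score"}
doc_j = {"news", "sports", "film", "review"}
print(dice(doc_i, doc_j), jaccard(doc_i, doc_j), cosine(doc_i, doc_j))
# 0.5, 0.333..., 0.5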
14
Now what?
Need to be able to compare any doc to any other doc
Need: a doc-doc similarity matrix, e.g. for five documents the 5 x 5 matrix of entries s_ij
s11 s12 s13 s14 s15
s21 s22 s23 s24 s25
s31 s32 s33 s34 s35
s41 s42 s43 s44 s45
s51 s52 s53 s54 s55
15
Generating Similarity Matrix
Use the inverted file
Documents with no terms in common do not need a similarity calculation
Generally generate only one row at a time, as needed (see the sketch below)
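A rough sketch of the row-at-a-time idea, assuming a toy in-memory inverted file (term mapped to the set of doc ids containing it) and the binary cosine coefficient from the previous sketch; all names and the toy collection are illustrative:

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Binary cosine coefficient over term sets (as in the earlier sketch)."""
    c = len(a & b)
    return c / math.sqrt(len(a) * len(b)) if a and b else 0.0

docs = {
    1: {"news", "sports", "hockey"},
    2: {"news", "film", "review"},
    3: {"weather", "forecast"},
}

# Build the inverted file: term -> set of doc ids containing it.
inverted = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        inverted[term].add(doc_id)

def similarity_row(doc_id):
    """One row of the doc-doc similarity matrix, computed on demand.
    Only documents sharing at least one term with doc_id are scored;
    every other entry is implicitly zero."""
    candidates = set()
    for term in docs[doc_id]:
        candidates |= inverted[term]
    candidates.discard(doc_id)
    return {other: cosine(docs[doc_id], docs[other]) for other in candidates}

print(similarity_row(1))   # {2: 0.333...}; doc 3 is never compared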
16
Algorithms
Problem: sort N things into M groups, where M is in [1, N]
Choice of algorithm determines M and cluster membership
17
General Classes of Algorithms
Hierarchical: nested groups, pairwise connections made
Non-hierarchical: no overlap, centroid-based
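As an illustration only (not a specific algorithm prescribed in these slides), a naive single-link merge over a precomputed doc-doc similarity matrix at one threshold; repeating the merge with progressively lower thresholds yields the nested groups of a hierarchy. The function name, toy matrix, and threshold are assumptions:

```python
def single_link(sim, threshold):
    """Naive single-link agglomerative merge.
    sim[(i, j)] holds the similarity of docs i and j (with i < j);
    two clusters are merged while their best pairwise link >= threshold."""
    clusters = [{d} for d in {d for pair in sim for d in pair}]
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = max(
                    sim.get((min(i, j), max(i, j)), 0.0)
                    for i in clusters[a] for j in clusters[b]
                )
                if link >= threshold:
                    clusters[a] |= clusters.pop(b)
                    merged = True
                    break
            if merged:
                break
    return clusters

# Toy similarity matrix for four documents.
sim = {(1, 2): 0.8, (1, 3): 0.1, (1, 4): 0.0,
       (2, 3): 0.2, (2, 4): 0.0, (3, 4): 0.7}
print(single_link(sim, 0.5))   # e.g. [{1, 2}, {3, 4}]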
18
Evaluation of results
Was the method appropriate for the data set?
Do the clusters represent the data well?
Are the docs in the right cluster?
19
How to test?
Overlap test: run a known query set and evaluate against known results
Randomly select docs and judge relevance to group members
Examine the distribution of docs in groups
Density test = term occurrences / (docs x unique terms)
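A small worked version of the density test, assuming a binary doc-term representation; the toy collection and variable names are made up for illustration:

```python
# Toy collection: doc id -> set of terms.
collection = {
    1: {"news", "sports", "hockey"},
    2: {"news", "film"},
    3: {"weather"},
}

term_occurrences = sum(len(terms) for terms in collection.values())   # 6
unique_terms = len(set().union(*collection.values()))                 # 5
density = term_occurrences / (len(collection) * unique_terms)         # 6 / (3 * 5) = 0.4
print(density)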
20
Concepts to keep in mind
Cluster hypothesis
Nearest neighbour
Centroid
21
Cluster Hypothesis
Closely associated documents tend to be relevant to the same requests (van Rijsbergen, 1979)
22
Nearest Neighbour
Find the document most similar to the given one
That document is most likely closely related
Works with terms, citations, and clusters
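A minimal nearest-neighbour sketch over binary term sets, using the same cosine coefficient as in the earlier sketches; the helper and toy collection are illustrative:

```python
import math

def cosine(a, b):
    """Binary cosine coefficient over term sets (as in the earlier sketch)."""
    c = len(a & b)
    return c / math.sqrt(len(a) * len(b)) if a and b else 0.0

def nearest_neighbour(doc_id, docs):
    """Return the id of the document most similar to doc_id."""
    others = (d for d in docs if d != doc_id)
    return max(others, key=lambda d: cosine(docs[doc_id], docs[d]))

docs = {1: {"news", "sports", "hockey"},
        2: {"news", "sports"},
        3: {"weather", "forecast"}}
print(nearest_neighbour(1, docs))   # 2: shares the most terms with doc 1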
23
Centroids
Representative of a cluster
May be a document from that cluster
May be a composite of doc features from that cluster
Why: query-centroid calculations, higher-level representations of the data set, building ontologies and thesauri
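One possible sketch of a composite centroid and a query-centroid calculation, using term-frequency vectors and weighted cosine; the weighting scheme and all names are assumptions rather than the course's definition:

```python
import math
from collections import Counter

def centroid(cluster_docs):
    """Composite centroid: average term weight across the cluster's docs
    (each doc here is a Counter of term frequencies)."""
    total = Counter()
    for doc in cluster_docs:
        total.update(doc)
    return {term: count / len(cluster_docs) for term, count in total.items()}

def cosine_weighted(u, v):
    """Cosine similarity for weighted vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

cluster = [Counter({"news": 2, "sports": 1}), Counter({"news": 1, "hockey": 3})]
query = Counter({"news": 1, "hockey": 1})
# Score the query against the cluster's centroid instead of every member doc.
print(cosine_weighted(query, centroid(cluster)))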
24
Visualization of Clusters
Kohonen maps
Star maps
SOM (self-organizing maps)
Etc.
25
Samples
26
Cluster Map
27
Starfield