Published by Phebe Carson. Modified over 8 years ago.
1
More on Document Similarity and Clustering
How similar are these two documents (again)?
Are these two documents about the same topic?
2
Reprise
• Vector Model of IR
• Mapping onto a space
• Distance between documents
3
Objectives for this lecture
• The cluster hypothesis
• Clustering Methods
• Non Text Retrieval
4
Vector Model
[Figure: axes of the vector space, each labelled with a word, including "Sunderland", "Football" and "Club"]
5
Vector Model Implementation
[Figure: words such as "Sunderland", "Football" and "Club" mapped onto a vector with positions 1 to 12]
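The mapping from words to vector positions can be sketched in a few lines of Python. This is only an illustration: the five-word vocabulary and the `term_vector` helper are hypothetical, and the slide's own vector has 12 positions whose words are not all recoverable.

```python
from collections import Counter

# Hypothetical vocabulary: each word owns one fixed position in the vector.
VOCABULARY = ["sunderland", "football", "club", "match", "stadium"]

def term_vector(text, vocabulary=VOCABULARY):
    """Return a list of term counts, one entry per vocabulary position."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

doc = "Sunderland football club is a football club"
print(term_vector(doc))
```

Words outside the vocabulary ("is", "a") are simply ignored, which is one possible design choice among several.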
6
Query/Document Match
[Figure: a Query vector and a Document vector, each with positions 1 to 12, compared position by position]
7
Two sorts of vector model
• Full Model
  – use counts of terms in documents rather than just whether they appear once or not
  – in fact it uses weights of terms to reflect their importance
  – Inverse Document Frequency and across-collection Term Frequency
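One common way to turn counts into weights is TF-IDF. The sketch below assumes the plain form tf × log(N / df); the slide does not say which weighting variant is meant, and the `tf_idf_vectors` function name is made up for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of weight vectors)."""
    vocabulary = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed IDF form: log(N / df). Terms found in every document get weight 0.
    idf = {term: math.log(n_docs / df[term]) for term in vocabulary}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[term] * idf[term] for term in vocabulary])
    return vocabulary, vectors

docs = [["football", "club"], ["football", "match"]]
vocabulary, vectors = tf_idf_vectors(docs)
print(vocabulary)
print(vectors)
```

Here "football" appears in both documents, so its weight drops to zero, while the rarer terms "club" and "match" keep a positive weight.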
8
Similarity
• Documents and Queries are similar if documents have entries in query word positions
• Very good documents will have high counts in query positions, especially of infrequently occurring query terms.
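A minimal sketch of this matching, assuming query and document are equal-length vectors of counts or weights. The dot product rewards entries in the query positions; cosine similarity is a commonly used length-normalised variant and is an assumption here, not something the slide names. The example vectors are invented.

```python
import math

def dot(a, b):
    """Sum of products over matching positions."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product normalised by vector lengths; 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

query = [1, 1, 0, 0]   # query uses the first two vocabulary positions
doc_a = [3, 2, 0, 1]   # high counts in the query positions
doc_b = [0, 0, 4, 1]   # no entries in the query positions

print(dot(query, doc_a), dot(query, doc_b))
print(cosine_similarity(query, doc_a), cosine_similarity(query, doc_b))
```

Using IDF-weighted entries instead of raw counts would give extra credit to matches on infrequent query terms.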
9
Cluster Hypothesis
• Closely associated documents tend to be relevant to the same request
10
Clustering Methods
• Inside Out (Bottom Up)
• Outside In (Top Down)
• Both these methods are hierarchical
• Non-hierarchical clustering is also possible
11
Inside Out (Bottom Up)
• Minimum spanning tree
• Cluster each document with its nearest neighbour
• Merge with the nearest cluster
• Repeat until “good enough” or sufficiently few clusters
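The bottom-up steps can be sketched as single-link agglomerative clustering, which merges the two clusters whose closest members are nearest and is closely related to building a minimum spanning tree. Euclidean distance, the `target_clusters` stopping rule and the sample points are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, points):
    """Single-link: distance between the closest pair of members."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def bottom_up(points, target_clusters):
    """Merge the nearest clusters until only target_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per document
    while len(clusters) > max(target_clusters, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(bottom_up(points, 3))
```

The returned lists hold indices into `points`. Stopping at a similarity threshold instead of a cluster count would be an equally reasonable reading of "good enough".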
12
Bottom up Clustering - dendrogram
[Figure: dendrogram over documents d1 to d12, with a similarity axis marked 0.9, 0.5 and 0.1. After Chakrabarti 2003]
13
Outside In (Top Down)
• Divide the whole space in two
• Divide each subpart in two
• Repeat
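A rough sketch of the top-down procedure follows. The slide does not say how a space is divided, so the splitting rule here (seed on the two farthest-apart points and assign every point to the nearer seed) and the `max_size` stopping rule are assumptions chosen to keep the example deterministic.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def split_in_two(points):
    """Split around the two points that are farthest apart (an assumed rule)."""
    seed_a, seed_b = max(
        ((p, q) for i, p in enumerate(points) for q in points[i + 1:]),
        key=lambda pair: euclidean(*pair),
    )
    left = [p for p in points if euclidean(p, seed_a) <= euclidean(p, seed_b)]
    right = [p for p in points if euclidean(p, seed_a) > euclidean(p, seed_b)]
    return left, right

def top_down(points, max_size):
    """Recursively divide until every cluster has at most max_size points."""
    if len(points) <= max(max_size, 1):
        return [points]
    left, right = split_in_two(points)
    if not left or not right:  # all points identical, so no further split exists
        return [points]
    return top_down(left, max_size) + top_down(right, max_size)

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(top_down(points, 2))
```

Other splitting rules, such as running 2-means on each part, would fit the same recursive structure.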
14
Outside in Clustering
15
Outside In (2)
16
Outside In (3)
17
Term Vectors as Feature Vectors
• Documents don’t have to be text
• Vectors don’t have to be term vectors
• Term Vectors are a sort of feature vector
• Features might be:
  – Colour
  – Melody
  – Shape
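As an illustration of a non-text feature vector, the sketch below builds a coarse colour histogram from an image's pixels and compares two images with cosine similarity. The binning scheme, the function names and the tiny single-colour "images" are all hypothetical.

```python
import math

def colour_histogram(pixels, bins_per_channel=2):
    """Count pixels per coarse RGB bin; pixels are (r, g, b) with values 0-255."""
    step = 256 // bins_per_channel

    def bin_of(value):
        # Clamp so that 255 never falls outside the last bin.
        return min(value // step, bins_per_channel - 1)

    hist = [0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        index = (bin_of(r) * bins_per_channel + bin_of(g)) * bins_per_channel + bin_of(b)
        hist[index] += 1
    return hist

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

red = colour_histogram([(255, 0, 0)] * 4)
dark_red = colour_histogram([(200, 10, 10)] * 4)
blue = colour_histogram([(0, 0, 255)] * 4)

print(cosine_similarity(red, dark_red))
print(cosine_similarity(red, blue))
```

Once images are reduced to vectors like this, the same similarity and clustering machinery used for term vectors applies unchanged.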
18
Conclusions
• What a cluster is and why it might be useful
• How clusters could be formed
• How the vector model might be used in non-textual domains
19
Reading
• Soumen Chakrabarti
  – Mining the Web
  – Morgan Kaufmann Publishers
  – 2003
    » p. 84 onwards