More on Document Similarity and Clustering
How similar are these two documents (again)? Are these two documents about the same topic?

1 More on Document Similarity and Clustering
How similar are these two documents (again)? Are these two documents about the same topic?

2 Reprise
• Vector Model of IR
• Mapping onto a space
• Distance between documents

3 Objectives for this lecture
• The cluster hypothesis
• Clustering Methods
• Non Text Retrieval

4 Vector Model
[Figure: a term vector with one position per word, for example Sunderland, Football and Club]

5 Vector Model Implementation
[Figure: the same term vector with its positions numbered 1 to 12; words such as Sunderland, Football and Club each map to one position]

6 Query/Document Match
[Figure: a query vector and a document vector aligned over positions 1 to 12]
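The position-by-position match on slides 5 and 6 can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own implementation: the 12-word vocabulary, the names `VOCAB`, `binary_vector` and `match_score`, and the sample texts are all assumptions made for the example.

```python
# Hypothetical 12-word vocabulary; each word owns one position in the vector.
VOCAB = ["sunderland", "football", "club", "match", "goal", "league",
         "stadium", "river", "bridge", "university", "ship", "coal"]
POSITION = {word: i for i, word in enumerate(VOCAB)}

def binary_vector(text):
    """Return a 0/1 vector marking which vocabulary words occur in text."""
    vec = [0] * len(VOCAB)
    for token in text.lower().split():
        if token in POSITION:
            vec[POSITION[token]] = 1
    return vec

def match_score(query_vec, doc_vec):
    """Count the positions where both query and document have an entry."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))

query = binary_vector("Sunderland football club")
doc_a = binary_vector("Sunderland football club won the league match")
doc_b = binary_vector("The university stands near the river bridge")
print(match_score(query, doc_a))  # 3
print(match_score(query, doc_b))  # 0
```

Words outside the vocabulary are simply ignored here; a real system would build the vocabulary from the collection itself.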

7 Two sorts of vector model
• Full Model
– uses counts of terms in documents rather than just whether they appear or not
– in fact it uses weights of terms to reflect their importance
– weights combine Term Frequency (within a document) and Inverse Document Frequency (across the collection)
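One way to turn term counts into weights is sketched below. It assumes the common variant weight = tf × log(N / df), which the slide does not specify; the function name `tf_idf_vectors` and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        # Assumed weighting: tf * log(N / df); other variants exist.
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

docs = [
    "sunderland football club football".split(),
    "football league results".split(),
    "sunderland university".split(),
]
weights = tf_idf_vectors(docs)
for term, weight in sorted(weights[0].items()):
    print(term, round(weight, 3))
```

In the first document "club" appears in only one document of the collection, so it gets a higher weight than "sunderland", which appears in two.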

8 Similarity
• Documents and Queries are similar if documents have entries in query word positions
• Very good documents will have high counts in query positions, especially for infrequently occurring query terms
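A weighted match of this kind can be sketched as a dot product over shared terms. The example below also normalises by vector length (cosine similarity), which is a common choice but an assumption here, since the slide does not name a specific measure. The vectors and their weights are made up for illustration.

```python
import math

def dot(u, v):
    """Sum of products over the terms the two sparse vectors share."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def cosine(u, v):
    """Dot product divided by both vector lengths; 0.0 if either is empty."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

# Hypothetical weighted vectors (term -> weight).
query = {"sunderland": 1.0, "football": 1.0}
doc_a = {"sunderland": 2.0, "football": 1.0, "club": 1.0}
doc_b = {"football": 1.0, "league": 2.0}

print(round(cosine(query, doc_a), 3))
print(round(cosine(query, doc_b), 3))
```

Here `doc_a` scores higher because it has entries in both query positions, while `doc_b` matches only one of them.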

9 Cluster Hypothesis
• Closely associated documents tend to be relevant to the same request

10 Clustering Methods
• Inside Out (Bottom Up)
• Outside In (Top Down)
• Both these methods are hierarchical
• Non-hierarchical clustering is also possible

11 Inside Out (Bottom Up)
• Minimum spanning tree
• Cluster each document with its nearest neighbour
• Merge with the nearest cluster
• Repeat until the clustering is “good enough” or there are sufficiently few clusters
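The merge loop above can be sketched as follows. This is a minimal illustration under stated assumptions: documents are represented as 2-D points, distance is Euclidean, cluster distance is single-link (closest members), and the stopping rule is a target number of clusters. The names `single_link` and `bottom_up` are hypothetical.

```python
import math

def single_link(c1, c2):
    """Distance between clusters = distance between their closest members."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def bottom_up(points, target_clusters):
    """Merge the nearest pair of clusters until target_clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per document
    while len(clusters) > target_clusters:
        # Find the pair of clusters that are nearest to each other.
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda pair: single_link(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] += clusters[j]  # merge the nearest pair
        del clusters[j]             # j > i, so index i is unaffected
    return clusters

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.3, 5.0), (9.0, 0.0)]
result = bottom_up(points, 3)
print(result)
```

Recording the distance at which each merge happens would give the levels of a dendrogram. The pairwise search is quadratic per step, which is fine for a sketch but slow for large collections.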

12 Bottom up Clustering - dendrogram
[Figure: dendrogram over documents d1 to d12, with the similarity axis marked at 0.9, 0.5 and 0.1. After Chakrabarti 2003]

13 Outside In (Top Down)
• Divide the whole space in two
• Divide each subpart in two
• Repeat
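A rough sketch of repeated halving is given below. The slides do not say how the space is divided, so the splitting rule used here (cut the dimension with the widest spread at its midpoint) is an assumption, as are the names `split_in_two` and `top_down` and the fixed recursion depth.

```python
def split_in_two(points):
    """Split points at the midpoint of the dimension with the widest spread."""
    dims = range(len(points[0]))
    widest = max(
        dims,
        key=lambda d: max(p[d] for p in points) - min(p[d] for p in points),
    )
    lo = min(p[widest] for p in points)
    hi = max(p[widest] for p in points)
    mid = (lo + hi) / 2
    left = [p for p in points if p[widest] <= mid]
    right = [p for p in points if p[widest] > mid]
    return left, right

def top_down(points, depth):
    """Recursively halve the space; stop at depth 0 or when a part cannot split."""
    if depth == 0 or len(points) < 2:
        return [points]
    left, right = split_in_two(points)
    if not left or not right:  # all points coincide, nothing to divide
        return [points]
    return top_down(left, depth - 1) + top_down(right, depth - 1)

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.3, 5.0), (9.0, 0.0)]
parts = top_down(points, 2)
print(parts)
```

Because this rule cuts by position rather than by density, it can separate points that lie close together, as it does with the first two points here. Other splitting rules, such as a two-way partition around two centres, are equally compatible with the slide.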

14 Outside in Clustering

15 Outside In (2)

16 Outside In (3)

17 Term Vectors as Feature Vectors
• Documents don’t have to be text
• Vectors don’t have to be term vectors
• Term Vectors are a sort of feature vector
• Features might be:
– Colour
– Melody
– Shape
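As an illustration of a non-text feature vector, the sketch below turns an image's pixels into a coarse colour histogram and compares two histograms by their overlap (histogram intersection). Both choices are assumptions made for the example rather than anything the slides prescribe, and the tiny "images" are made-up pixel lists.

```python
def colour_histogram(pixels, bins_per_channel=2):
    """Map (r, g, b) pixels with 0-255 channels to a normalised histogram."""
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    for r, g, b in pixels:
        # Each channel value falls into one of n equal-width bins.
        index = ((r * n // 256) * n + g * n // 256) * n + b * n // 256
        hist[index] += 1.0
    total = len(pixels)
    return [count / total for count in hist] if total else hist

def overlap(u, v):
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint."""
    return sum(min(a, b) for a, b in zip(u, v))

# Hypothetical four-pixel images.
red_image = [(250, 10, 10)] * 3 + [(240, 240, 240)]
reddish_image = [(200, 30, 20)] * 4
blue_image = [(10, 10, 250)] * 4

h_red = colour_histogram(red_image)
h_reddish = colour_histogram(reddish_image)
h_blue = colour_histogram(blue_image)
print(overlap(h_red, h_reddish))  # 0.75
print(overlap(h_red, h_blue))     # 0.0
```

Once an object is reduced to a vector like this, the same matching and clustering ideas used for term vectors apply unchanged.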

18 Conclusions
• What a cluster is and why it might be useful
• How clusters could be formed
• How the vector model might be used in non-textual domains

19 Reading
• Soumen Chakrabarti, Mining the Web, Morgan Kaufmann Publishers, 2003, p. 84 onwards

