More on Document Similarity and Clustering
How similar are these two documents (again)? Are these two documents about the same topic?

Reprise
• Vector Model of IR
• Mapping onto a space
• Distance between documents

Objectives for this lecture
• The cluster hypothesis
• Clustering Methods
• Non Text Retrieval

Vector Model
[Figure: a vector with one position per word, e.g. Sunderland, Football, Club]

Vector Model Implementation
[Figure: words such as Sunderland, Football and Club assigned to numbered vector positions 1 to 12]

Query/Document Match
[Figure: a query vector and a document vector aligned over positions 1 to 12]

Two sorts of vector model
• Full Model
  – use counts of terms in documents rather than just whether they appear once or not
  – in fact it uses weights of terms to reflect their importance
  – Inverse Document Frequency and across collection Term Frequency
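The weighting idea on this slide can be sketched in a few lines. This is only an illustration under an assumed formulation (raw term count multiplied by log(N / document frequency)); the slide does not say which variant is meant, and the function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, weighted vectors) for a list of tokenised docs.

    Assumed weighting (not stated on the slide):
    weight = count of term in doc * log(N / number of docs containing term)
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # missing terms count as 0
        vectors.append([counts[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

# Toy collection (illustrative only).
docs = [
    "sunderland football club".split(),
    "football club football".split(),
    "sunderland city".split(),
]
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(w, 3) for w in vec])
```

A term that occurs in only one document ("city" here) receives the largest per-occurrence weight, while a term that occurred in every document would receive weight zero.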

Similarity
• Documents and Queries are similar if documents have entries in query word positions
• Very good documents will have high counts in query positions, especially of infrequently occurring query terms
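One way to turn this notion of similarity into a number is the dot product of the query and document vectors, which grows with the counts a document has in the query positions. The cosine form below, which also divides by the vector lengths, is a common variant and is an assumption here rather than something the slide specifies; the vectors are made-up examples.

```python
import math

def dot(u, v):
    """Sum of products over matching positions."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product normalised by vector lengths; 0.0 if either is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

# Positions stand for four vocabulary terms (illustrative values).
query = [1, 0, 1, 0]
doc_a = [3, 0, 2, 1]   # high counts in both query positions
doc_b = [0, 4, 0, 2]   # no entries in the query positions

print(dot(query, doc_a), dot(query, doc_b))
print(cosine(query, doc_a), cosine(query, doc_b))
```

If the vectors hold term weights instead of raw counts, rare query terms contribute more to the score, which matches the second bullet.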

Cluster Hypothesis
• Closely associated documents tend to be relevant to the same request

Clustering Methods
• Inside Out (Bottom Up)
• Outside In (Top Down)
• Both these methods are hierarchical
• Non-hierarchical clustering is also possible

Inside Out (Bottom Up)
• Minimum spanning tree
• Cluster each document with its nearest neighbour
• Merge with the nearest cluster
• Repeat until “good enough” or sufficiently few clusters
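The steps above can be sketched as a naive merge loop. The sketch assumes single-link distance (distance between the closest members of two clusters, the variant usually associated with minimum spanning trees), Euclidean distance between vectors, and a fixed target number of clusters as the stopping rule; the slide leaves all three open, and the names `bottom_up` and `single_link` are hypothetical.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def single_link(c1, c2, points):
    """Cluster distance = distance between the two closest members."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def bottom_up(points, target_clusters):
    """Start with one cluster per point; repeatedly merge the closest pair."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > max(target_clusters, 1):
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_link(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Five toy document vectors; clusters hold indices into this list.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
clusters = bottom_up(points, 3)
print(clusters)
```

Recording the distance at which each merge happens would give the heights needed to draw a dendrogram. The pairwise search is written for clarity, not speed.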

Bottom up Clustering - dendrogram
[Figure: dendrogram of documents d1 to d12 against a similarity axis marked 0.9, 0.5 and 0.1. After Chakrabarti 2003]

Outside In (Top Down)
• Divide the whole space in two
• Divide each subpart in two
• Repeat
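A rough sketch of the repeated halving described above. The slide does not say how the space is divided, so this assumes one simple rule: cut at the midpoint of the dimension along which the points are most spread out, and stop after a fixed number of levels. The names `split_in_two` and `top_down` are hypothetical.

```python
def split_in_two(points):
    """Cut at the midpoint of the dimension with the widest spread."""
    def spread(d):
        values = [p[d] for p in points]
        return max(values) - min(values)

    widest = max(range(len(points[0])), key=spread)
    values = [p[widest] for p in points]
    mid = (max(values) + min(values)) / 2
    left = [p for p in points if p[widest] <= mid]
    right = [p for p in points if p[widest] > mid]
    return left, right

def top_down(points, depth):
    """Recursively halve the space; returns a flat list of clusters."""
    if depth == 0 or len(points) < 2:
        return [points]
    left, right = split_in_two(points)
    if not right:  # all points coincide, nothing left to divide
        return [points]
    return top_down(left, depth - 1) + top_down(right, depth - 1)

# Four toy document vectors (illustrative values).
points = [(0.0, 0.0), (1.0, 0.0), (8.0, 1.0), (9.0, 1.0)]
print(top_down(points, 1))
print(top_down(points, 2))
```

Other splitting rules, such as running a two-cluster partitioning step at each level, fit the same recursive shape.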

Outside in Clustering

Outside In (2)

Outside In (3)

Term Vectors as Feature Vectors
• Documents don’t have to be text
• Vectors don’t have to be term vectors
• Term Vectors are a sort of feature vector
• Features might be:
  – Colour
  – Melody
  – Shape
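To illustrate the colour case, the sketch below turns an image into a three-entry feature vector and compares images the same way term vectors are compared. The feature definition (share of pixels whose strongest channel is red, green or blue) and the tiny pixel lists are invented for illustration; real systems use richer features. It assumes each image has at least one pixel.

```python
import math

def colour_features(pixels):
    """Share of pixels whose strongest channel is R, G or B (ties: first wins)."""
    counts = [0, 0, 0]
    for pixel in pixels:
        counts[pixel.index(max(pixel))] += 1
    return [c / len(pixels) for c in counts]

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Each "image" is a short list of (r, g, b) pixels.
sunset = [(250, 80, 20), (240, 100, 30), (200, 60, 90)]
forest = [(20, 180, 40), (30, 200, 60), (90, 120, 60)]
dusk = [(230, 90, 40), (60, 50, 120), (220, 70, 30)]

print(cosine(colour_features(sunset), colour_features(dusk)))
print(cosine(colour_features(sunset), colour_features(forest)))
```

Once every item is a vector, the same similarity and clustering machinery applies regardless of whether the features came from words, colours or melodies.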

Conclusions
• What a cluster is and why it might be useful
• How clusters could be formed
• How the vector model might be used in non-textual domains

Reading
• Soumen Chakrabarti, Mining the Web, Morgan Kaufmann Publishers, 2003, p. 84 onwards
