Information Retrieval


1 Information Retrieval
For the MSc Computer Science Programme
Lecture 5
Introduction to Information Retrieval (Manning et al. 2007), Chapter 14
Dell Zhang, Birkbeck, University of London

2 Recall: Vector Space Model
Docs → Vectors: each doc can now be viewed as a vector of TFxIDF values, one component for each term.
So we have a vector space:
terms are axes
docs live in this space
may have 20,000+ dimensions even with stemming
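As a concrete illustration (not part of the original slides), here is a minimal Python sketch of building such TFxIDF vectors; the toy corpus and the helper name tfidf_vectors are made up for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized docs into TFxIDF vectors (term -> weight dicts)."""
    N = len(docs)
    # document frequency: number of docs containing each term
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # weight each term by tf x idf, with idf = log(N / df)
        vectors.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vectors

docs = [["rain", "rain", "umbrella"], ["sun", "beach"], ["rain", "sun"]]
print(tfidf_vectors(docs))
```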

3 k Nearest Neighbors (kNN)
Given a test doc d and the training data:
identify the set Sk of the k nearest neighbors of d, i.e., the k training docs most similar to d
for each class cj ∈ C, compute N(Sk, cj), the number of Sk members that belong to class cj
estimate Pr[cj|d] as N(Sk, cj) / k
classify d to the majority class of Sk members
A code sketch of these steps follows below.
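The steps above translate almost directly into code. The sketch below is an illustrative implementation (knn_classify and its parameter names are my own); sim is any similarity function, such as the cosine similarity discussed on a later slide.

```python
from collections import Counter

def knn_classify(d, training_docs, labels, k, sim):
    """Classify vector d by majority vote among its k nearest training docs.

    sim(a, b) is a similarity function (higher = more similar),
    e.g. cosine similarity of TFxIDF vectors.
    """
    # rank training docs by similarity to d and keep the top k (the set Sk)
    ranked = sorted(range(len(training_docs)),
                    key=lambda i: sim(d, training_docs[i]),
                    reverse=True)
    Sk = ranked[:k]
    # N(Sk, cj): count how many of the k neighbors carry each class label
    votes = Counter(labels[i] for i in Sk)
    # Pr[cj|d] is estimated as N(Sk, cj) / k; return the majority class
    return votes.most_common(1)[0][0]
```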

4 kNN – Example
[Figure: a 5NN (k=5) example — classify the test point c( ) = ? among the classes Government, Science, and Arts]

5 kNN – Example
[Figure: kNN decision boundaries between the Government, Science, and Arts regions]

6 kNN – Online Demo

7 Similarity Metric
kNN depends on a similarity/distance metric.
For text, cosine similarity of TFxIDF weighted vectors is usually most effective.
[Figure: a 3NN example — c(d) = ?]
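A minimal sketch of cosine similarity over sparse term-weight dictionaries like those from the TFxIDF example above (the function name is illustrative):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two sparse vectors given as term -> weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```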

8 Parameter k
k = 1: using only the nearest neighbor to determine classification is often error-prone due to:
atypical training documents
noise (i.e., errors) in the class labels

9 Parameter k
k > 1: more robust with a larger k.
The value of k is typically odd to avoid ties; 3 and 5 are the most common choices.
k = N (N: the total number of training docs): every test doc would be classified into the largest class regardless of its content.
This degenerates to classification using the prior probabilities Pr[cj].

10 Efficiency kNN with Inverted Index
Naively finding the kNN of a test doc d requires a linear search through all training docs. But this is actually the same as finding the top k retrieval results using d as a (long) query against the database of training docs. Therefore the standard inverted index method for vector space retrieval can be used to accelerate this process.
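To make the idea concrete, here is a hypothetical sketch of that acceleration (build_index and top_k are my own names): an inverted index lets us score only the training docs that share at least one term with d. For brevity it ranks by dot product; dividing by precomputed document norms would give cosine similarity.

```python
from collections import Counter, defaultdict

def build_index(vectors):
    """Inverted index: term -> list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index

def top_k(d, index, k):
    """Score only training docs sharing a term with query vector d."""
    scores = Counter()
    for term, w in d.items():
        for doc_id, w2 in index.get(term, []):
            scores[doc_id] += w * w2  # accumulate dot-product contributions
    return [doc_id for doc_id, _ in scores.most_common(k)]
```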

11 Discussion
Lazy learning: no learning is needed until a test doc has to be classified.
Asymptotically close to optimal: more training documents usually lead to better accuracy, though lower speed.
Scales well with the number of classes.

12 Take Home Messages
kNN Classification

