K nearest neighbor and Rocchio algorithm
LING 572 Fei Xia 1/11/2007
Announcements
Hw2 is online now; it is due on Jan 20. Hw1 is due at 11pm on Jan 13 (Sat).
Lab session after class. Read the DT tutorial before next Tuesday's class.
K-Nearest Neighbor (kNN)
Instance-based (IB) learning
No training: just store all the training instances ("lazy learning").
Examples: kNN, locally weighted regression, radial basis functions, case-based reasoning, …
The most well-known IB method: kNN.
kNN
For a new document d:
- find the k training documents that are closest to d;
- perform majority voting or weighted voting among them.
Properties: a "lazy" classifier with no training. Feature selection and the distance measure are crucial.
The algorithm
1. Determine the parameter K.
2. Calculate the distance between the query instance and all the training instances.
3. Sort the distances and determine the K nearest neighbors.
4. Gather the labels of the K nearest neighbors.
5. Use simple majority voting or weighted voting (a code sketch follows).
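A minimal sketch of these steps in Python (the names knn_classify, train_X, and train_y are illustrative, not from the slides):

```python
import numpy as np

def knn_classify(query, train_X, train_y, k=5):
    """Classify `query` by majority vote among its k nearest training instances."""
    # Steps 2-3: distance from the query to every training instance, then sort.
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Steps 4-5: gather neighbor labels and take a simple majority vote.
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```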
Picking K
Use N-fold cross-validation: pick the K that minimizes the cross-validation error.
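As a sketch, K could be tuned with scikit-learn's cross-validation utilities; the helper pick_k and the candidate list are assumptions for illustration:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def pick_k(X, y, candidates=(1, 3, 5, 7, 9), folds=10):
    """Return the K with the highest mean N-fold cross-validation accuracy."""
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=folds).mean()
              for k in candidates}
    return max(scores, key=scores.get)
```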
Normalizing attribute values
The distance can be dominated by attributes with large values. Ex: the features are age and income. Original data: x1 = (35, 76K), x2 = (36, 80K), x3 = (70, 79K). Assume age ∈ [0, 100] and income ∈ [0, 200K]. After normalization: x1 = (0.35, 0.38), x2 = (0.36, 0.40), x3 = (0.70, 0.395).
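A quick sketch of this min-max scaling, assuming the attribute ranges are known in advance, as on the slide:

```python
import numpy as np

# Attribute ranges from the slide: age in [0, 100], income in [0, 200K].
lo = np.array([0.0, 0.0])
hi = np.array([100.0, 200_000.0])

def normalize(x):
    """Rescale each attribute to [0, 1] so no single attribute dominates the distance."""
    return (x - lo) / (hi - lo)

print(normalize(np.array([35.0, 76_000.0])))  # -> [0.35 0.38]
```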
The Choice of Features
Imagine there are 100 features and only 2 of them are relevant to the target label: kNN is easily misled in high-dimensional space. Remedies: feature weighting or feature selection.
Feature weighting
Stretch the j-th axis by a weight wj. Use cross-validation to automatically choose the weights w1, …, wn. Setting wj to zero eliminates that dimension altogether.
Similarity measures
Euclidean distance: $d(x, y) = \sqrt{\sum_j (x_j - y_j)^2}$
Weighted Euclidean distance: $d_w(x, y) = \sqrt{\sum_j w_j (x_j - y_j)^2}$
Cosine similarity: $\cos(x, y) = \frac{\sum_j x_j y_j}{\|x\|\,\|y\|}$
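The three measures as a minimal Python sketch (function names are illustrative):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def weighted_euclidean(x, y, w):
    # Per-dimension weights w_j stretch or shrink each axis;
    # w_j = 0 eliminates dimension j altogether.
    return np.sqrt(np.sum(w * (x - y) ** 2))

def cosine_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
```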
Voting
Majority voting: $c^* = \arg\max_c \sum_i \delta(c, f_i(x))$, where $f_i(x)$ is the label of the i-th nearest neighbor of x, and $\delta(a, b) = 1$ if $a = b$ and 0 otherwise.
Weighted voting (the weight is on each neighbor): $c^* = \arg\max_c \sum_i w_i\, \delta(c, f_i(x))$ with $w_i = 1/\mathrm{dist}(x, x_i)$.
With weighted voting we can use all the training examples.
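A sketch of weighted voting with w_i = 1/dist(x, x_i); passing k equal to the training-set size uses all the training examples, as the slide notes (weighted_vote is an illustrative name):

```python
from collections import defaultdict
import numpy as np

def weighted_vote(query, train_X, train_y, k=5, eps=1e-12):
    """Each of the k nearest neighbors votes for its label with weight 1/distance."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[train_y[i]] += 1.0 / (dists[i] + eps)  # eps guards against zero distance
    return max(votes, key=votes.get)
```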
Summary of kNN
Strengths:
- Simplicity (conceptual).
- Efficiency at training: no training at all.
- Handles multi-class directly.
- Stability and robustness: averaging over k neighbors.
- Prediction accuracy: good when the training data is large.
Weaknesses:
- Efficiency at testing time: need to calculate the distance to every training instance.
- Theoretical validity.
- It is not clear which distance measure and which features to use.
Rocchio Algorithm
Relevance Feedback for IR
The issue: vocabulary mismatch, e.g., "plane" vs. "aircraft". Take advantage of user feedback on the relevance of documents to improve IR results:
- The user issues a short, simple query.
- The user marks returned documents as relevant or non-relevant.
- The system computes a better representation of the information need based on the feedback.
- Relevance feedback can go through one or more iterations.
Idea: it may be difficult to formulate a good query when you don't know the collection well, so iterate.
Rocchio Algorithm
The Rocchio algorithm incorporates relevance feedback information into the vector space model. We want to maximize sim(Q, Cr) − sim(Q, Cnr). Under cosine similarity, the optimal query vector for separating the relevant and non-relevant documents is
$\vec{Q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j \;-\; \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j$
where Qopt = optimal query, Cr = set of relevant doc vectors, and N = collection size.
Rocchio 1971 Algorithm (SMART)
$\vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$
qm = modified query vector; q0 = original query vector; α, β, γ: weights; Dr = set of known relevant doc vectors; Dnr = set of known non-relevant doc vectors.
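A sketch of this update for dense query and document vectors; the default weights (alpha=1.0, beta=0.75, gamma=0.15) are common textbook choices, not values from the slides:

```python
import numpy as np

def rocchio_update(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q0 + beta*centroid(D_r) - gamma*centroid(D_nr)."""
    qm = alpha * np.asarray(q0, dtype=float)
    if len(rel_docs) > 0:
        qm += beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs) > 0:
        qm -= gamma * np.mean(nonrel_docs, axis=0)
    # Negative term weights are usually clipped to zero in practice.
    return np.maximum(qm, 0.0)
```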
Relevance feedback assumptions
Relevance prototypes are "well-behaved":
- Term distributions in relevant documents are similar to one another.
- Term distributions in non-relevant documents are different from those in relevant documents.
That is, either all relevant documents are tightly clustered around a single prototype, or there are different prototypes with significant vocabulary overlap; in both cases the similarities between relevant and non-relevant documents are small.
Rocchio Algorithm for text classification
Training time: construct a set of prototype vectors, one vector per class. Testing time: for a new document, find the most similar prototype vector.
Training time
Cj: the set of positive examples for class j; D: the set of positive and negative examples; β, γ: weights. The prototype vector for class j is
$\vec{c}_j = \beta \frac{1}{|C_j|} \sum_{\vec{d} \in C_j} \frac{\vec{d}}{\|\vec{d}\|} \;-\; \gamma \frac{1}{|D - C_j|} \sum_{\vec{d} \in D - C_j} \frac{\vec{d}}{\|\vec{d}\|}$
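A sketch of this training step, assuming row-vector documents and at least one negative example per class (train_prototypes is an illustrative name):

```python
import numpy as np

def train_prototypes(X, y, beta=1.0, gamma=1.0):
    """One prototype per class: beta-weighted centroid of the (length-normalized)
    positive documents minus the gamma-weighted centroid of the negatives."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize document lengths
    return {c: beta * X[y == c].mean(axis=0) - gamma * X[y != c].mean(axis=0)
            for c in np.unique(y)}
```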
Why this formula? Rocchio showed that when β = γ = 1, each prototype vector maximizes the mean similarity to the positive examples minus the mean similarity to the negative examples:
$\frac{1}{|C_j|} \sum_{\vec{d} \in C_j} \mathrm{sim}(\vec{c}_j, \vec{d}) \;-\; \frac{1}{|D - C_j|} \sum_{\vec{d} \in D - C_j} \mathrm{sim}(\vec{c}_j, \vec{d})$
But how does maximizing this quantity connect to classification accuracy? That remains unclear.
Testing time
Given a new document d, compute the similarity between d and each prototype vector, and assign the class of the most similar prototype.
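Continuing the sketch above, testing reduces to a nearest-prototype lookup under cosine similarity:

```python
import numpy as np

def classify(d, prototypes):
    """Assign the class whose prototype vector is most similar to document d."""
    d = d / np.linalg.norm(d)
    return max(prototypes,
               key=lambda c: np.dot(prototypes[c], d) / np.linalg.norm(prototypes[c]))
```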
kNN vs. Rocchio
kNN:
- Lazy learning: no training.
- Uses all the training instances at testing time.
Rocchio algorithm:
- At training time, calculates one prototype vector per class.
- At testing time, uses only the prototype vectors instead of all the training instances.
- A linear classifier: not as expressive as kNN.
Summary of Rocchio
Strengths:
- Simplicity (conceptual).
- Efficiency at training.
- Efficiency at testing time.
- Handles multi-class directly.
Weaknesses:
- Theoretical validity.
- Stability and robustness.
- Prediction accuracy: it does not work well when the categories are not linearly separable.
Additional slides
Three major design choices
- The weight of feature fk in document di: e.g., tf-idf.
- Document length normalization.
- The similarity measure: e.g., cosine.
Extending Rocchio? Generalized instance set (GIS) algorithm (Lam and Ho, 1998). …