K nearest neighbor and Rocchio algorithm
LING 572 Fei Xia 1/11/2007
Announcements
Hw2 is online now; it is due on Jan 20. Hw1 is due at 11pm on Jan 13 (Sat).
Lab session after class. Read the DT tutorial before next Tuesday's class.
K-Nearest Neighbor (kNN)
Instance-based (IB) learning
No training: just store all the training instances ("lazy learning").
Examples: kNN, locally weighted regression, radial basis functions, case-based reasoning, …
The most well-known IB method: kNN.
kNN
For a new document d:
- find the k training documents that are closest to d;
- perform majority voting or weighted voting among them.
Properties: a "lazy" classifier with no training. Feature selection and the distance measure are crucial.
The algorithm
1. Determine the parameter K.
2. Calculate the distance between the query instance and all the training instances.
3. Sort the distances and determine the K nearest neighbors.
4. Gather the labels of the K nearest neighbors.
5. Use simple majority voting or weighted voting (a code sketch follows).
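A minimal sketch of these steps in Python (the names knn_classify, train_X, and train_y are illustrative, not from the slides):

```python
import numpy as np

def knn_classify(query, train_X, train_y, k=5):
    """Classify `query` by majority vote among its k nearest training instances."""
    # Steps 2-3: distance from the query to every training instance, then sort.
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Steps 4-5: gather neighbor labels and take a simple majority vote.
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```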
Picking K
Use N-fold cross-validation: pick the K that minimizes the cross-validation error.
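As a sketch, K could be tuned with scikit-learn's cross-validation utilities; the helper pick_k and the candidate list are assumptions for illustration:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def pick_k(X, y, candidates=(1, 3, 5, 7, 9), folds=10):
    """Return the K with the highest mean N-fold cross-validation accuracy."""
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=folds).mean()
              for k in candidates}
    return max(scores, key=scores.get)
```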
Normalizing attribute values
The distance can be dominated by attributes with large values. Ex: the features are age and income. Original data: x1 = (35, 76K), x2 = (36, 80K), x3 = (70, 79K). Assume age ∈ [0, 100] and income ∈ [0, 200K]. After normalization: x1 = (0.35, 0.38), x2 = (0.36, 0.40), x3 = (0.70, 0.395).
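A quick sketch of this min-max scaling, assuming the attribute ranges are known in advance, as on the slide:

```python
import numpy as np

# Attribute ranges from the slide: age in [0, 100], income in [0, 200K].
lo = np.array([0.0, 0.0])
hi = np.array([100.0, 200_000.0])

def normalize(x):
    """Rescale each attribute to [0, 1] so no single attribute dominates the distance."""
    return (x - lo) / (hi - lo)

print(normalize(np.array([35.0, 76_000.0])))  # -> [0.35 0.38]
```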
The Choice of Features
Imagine there are 100 features and only 2 of them are relevant to the target label: kNN is easily misled in high-dimensional space. Remedies: feature weighting or feature selection.
Feature weighting
Stretch the j-th axis by a weight wj. Use cross-validation to automatically choose the weights w1, …, wn. Setting wj to zero eliminates that dimension altogether.
Similarity measures
Euclidean distance: $d(x, y) = \sqrt{\sum_j (x_j - y_j)^2}$
Weighted Euclidean distance: $d_w(x, y) = \sqrt{\sum_j w_j (x_j - y_j)^2}$
Cosine similarity: $\cos(x, y) = \frac{\sum_j x_j y_j}{\|x\|\,\|y\|}$
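The three measures as a minimal Python sketch (function names are illustrative):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def weighted_euclidean(x, y, w):
    # Per-dimension weights w_j stretch or shrink each axis;
    # w_j = 0 eliminates dimension j altogether.
    return np.sqrt(np.sum(w * (x - y) ** 2))

def cosine_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
```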
Voting
Majority voting: $c^* = \arg\max_c \sum_i \delta(c, f_i(x))$, where $f_i(x)$ is the label of the i-th nearest neighbor of x, and $\delta(a, b) = 1$ if $a = b$ and 0 otherwise.
Weighted voting (the weight is on each neighbor): $c^* = \arg\max_c \sum_i w_i\, \delta(c, f_i(x))$ with $w_i = 1/\mathrm{dist}(x, x_i)$.
With weighted voting we can use all the training examples.
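A sketch of weighted voting with w_i = 1/dist(x, x_i); passing k equal to the training-set size uses all the training examples, as the slide notes (weighted_vote is an illustrative name):

```python
from collections import defaultdict
import numpy as np

def weighted_vote(query, train_X, train_y, k=5, eps=1e-12):
    """Each of the k nearest neighbors votes for its label with weight 1/distance."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[train_y[i]] += 1.0 / (dists[i] + eps)  # eps guards against zero distance
    return max(votes, key=votes.get)
```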
Summary of kNN
Strengths:
- Simplicity (conceptual).
- Efficiency at training: no training at all.
- Handles multi-class directly.
- Stability and robustness: averaging over k neighbors.
- Prediction accuracy: good when the training data is large.
Weaknesses:
- Efficiency at testing time: need to calculate the distance to every training instance.
- Theoretical validity.
- It is not clear which distance measure and which features to use.
Rocchio Algorithm
Relevance Feedback for IR
The issue: vocabulary mismatch, e.g., "plane" vs. "aircraft". Take advantage of user feedback on the relevance of documents to improve IR results:
- The user issues a short, simple query.
- The user marks returned documents as relevant or non-relevant.
- The system computes a better representation of the information need based on the feedback.
- Relevance feedback can go through one or more iterations.
Idea: it may be difficult to formulate a good query when you don't know the collection well, so iterate.
Rocchio Algorithm
The Rocchio algorithm incorporates relevance feedback information into the vector space model. We want to maximize sim(Q, Cr) − sim(Q, Cnr). Under cosine similarity, the optimal query vector for separating the relevant and non-relevant documents is
$\vec{Q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j \;-\; \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j$
where Qopt = optimal query, Cr = set of relevant doc vectors, and N = collection size.
Rocchio 1971 Algorithm (SMART)
$\vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$
qm = modified query vector; q0 = original query vector; α, β, γ: weights; Dr = set of known relevant doc vectors; Dnr = set of known non-relevant doc vectors.
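A sketch of this update for dense query and document vectors; the default weights (alpha=1.0, beta=0.75, gamma=0.15) are common textbook choices, not values from the slides:

```python
import numpy as np

def rocchio_update(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q0 + beta*centroid(D_r) - gamma*centroid(D_nr)."""
    qm = alpha * np.asarray(q0, dtype=float)
    if len(rel_docs) > 0:
        qm += beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs) > 0:
        qm -= gamma * np.mean(nonrel_docs, axis=0)
    # Negative term weights are usually clipped to zero in practice.
    return np.maximum(qm, 0.0)
```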
Relevance feedback assumptions
Relevance prototypes are "well-behaved":
- Term distributions in relevant documents are similar to one another.
- Term distributions in non-relevant documents are different from those in relevant documents.
That is, either all relevant documents are tightly clustered around a single prototype, or there are different prototypes with significant vocabulary overlap; in both cases the similarities between relevant and non-relevant documents are small.
Rocchio Algorithm for text classification
Training time: construct a set of prototype vectors, one vector per class. Testing time: for a new document, find the most similar prototype vector.
Training time
Cj: the set of positive examples for class j; D: the set of positive and negative examples; β, γ: weights. The prototype vector for class j is
$\vec{c}_j = \beta \frac{1}{|C_j|} \sum_{\vec{d} \in C_j} \frac{\vec{d}}{\|\vec{d}\|} \;-\; \gamma \frac{1}{|D - C_j|} \sum_{\vec{d} \in D - C_j} \frac{\vec{d}}{\|\vec{d}\|}$
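A sketch of this training step, assuming row-vector documents and at least one negative example per class (train_prototypes is an illustrative name):

```python
import numpy as np

def train_prototypes(X, y, beta=1.0, gamma=1.0):
    """One prototype per class: beta-weighted centroid of the (length-normalized)
    positive documents minus the gamma-weighted centroid of the negatives."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize document lengths
    return {c: beta * X[y == c].mean(axis=0) - gamma * X[y != c].mean(axis=0)
            for c in np.unique(y)}
```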
Why this formula? Rocchio showed that when β = γ = 1, each prototype vector maximizes the mean similarity to the positive examples minus the mean similarity to the negative examples:
$\frac{1}{|C_j|} \sum_{\vec{d} \in C_j} \mathrm{sim}(\vec{c}_j, \vec{d}) \;-\; \frac{1}{|D - C_j|} \sum_{\vec{d} \in D - C_j} \mathrm{sim}(\vec{c}_j, \vec{d})$
But how does maximizing this quantity connect to classification accuracy? That remains unclear.
Testing time
Given a new document d, compute the similarity between d and each prototype vector, and assign the class of the most similar prototype.
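Continuing the sketch above, testing reduces to a nearest-prototype lookup under cosine similarity:

```python
import numpy as np

def classify(d, prototypes):
    """Assign the class whose prototype vector is most similar to document d."""
    d = d / np.linalg.norm(d)
    return max(prototypes,
               key=lambda c: np.dot(prototypes[c], d) / np.linalg.norm(prototypes[c]))
```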
kNN vs. Rocchio
kNN:
- Lazy learning: no training.
- Uses all the training instances at testing time.
Rocchio algorithm:
- At training time, calculates one prototype vector per class.
- At testing time, uses only the prototype vectors instead of all the training instances.
- A linear classifier: not as expressive as kNN.
Summary of Rocchio
Strengths:
- Simplicity (conceptual).
- Efficiency at training.
- Efficiency at testing time.
- Handles multi-class directly.
Weaknesses:
- Theoretical validity.
- Stability and robustness.
- Prediction accuracy: it does not work well when the categories are not linearly separable.
Additional slides
Three major design choices
- The weight of feature fk in document di: e.g., tf-idf.
- Document length normalization.
- The similarity measure: e.g., cosine.
Extending Rocchio? Generalized instance set (GIS) algorithm (Lam and Ho, 1998). …