INFORMATION RETRIEVAL TECHNIQUES
by Dr. Adnan Abid
Lecture #43: Classification
ACKNOWLEDGEMENTS
The material in this lecture has been taken from the following sources:
- "Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze
- "Managing Gigabytes" by Ian H. Witten, Alistair Moffat, and Timothy C. Bell
- "Modern Information Retrieval" by Ricardo Baeza-Yates
- "Web Information Retrieval" by Stefano Ceri, Alessandro Bozzon, and Marco Brambilla
Outline
- Rocchio classification
- k Nearest Neighbors
- Nearest-Neighbor Learning
- kNN decision boundaries
- Bias vs. variance
Rocchio classification
Classification Using Vector Spaces
In vector space classification, the training set corresponds to a labeled set of points (equivalently, vectors).
- Premise 1: Documents in the same class form a contiguous region of space
- Premise 2: Documents from different classes don't overlap (much)
Learning a classifier: build surfaces to delineate classes in the space (a minimal setup is sketched below)
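As a concrete illustration of documents as labeled points, here is a minimal sketch using scikit-learn's TfidfVectorizer; the tool choice and the toy corpus are assumptions, since the slides only specify tf-idf weighted vectors.

```python
# A minimal sketch of the vector space setup: each labeled training
# document becomes a tf-idf point in term space. The use of scikit-learn
# and the toy corpus are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "the parliament passed a new budget law",       # Government
    "the experiment measured particle collisions",  # Science
    "the gallery opened a new sculpture exhibit",   # Arts
]
train_labels = ["Government", "Science", "Arts"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # one row (point) per document
print(X_train.shape)                            # (3, vocabulary_size)
```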
Documents in a Vector Space
[Figure: documents plotted in the vector space, with class regions labeled Government, Science, and Arts]
Test Document of what class?
[Figure: the same space with an unlabeled test document plotted among the Government, Science, and Arts regions]
Test Document = Government
Is this similarity hypothesis true in general?
[Figure: the test document falls in the Government region; the lines separating the classes are the decision boundaries]
Our focus: how to find good separators (decision boundaries)
Rocchio Algorithm
- Relevance feedback methods can be adapted for text categorization
- As noted before, relevance feedback can be viewed as 2-class classification: relevant vs. non-relevant documents
- Use standard tf-idf weighted vectors to represent text documents
- For the training documents in each category, compute a prototype vector by summing the vectors of the training documents in that category (the prototype is the centroid of the members of the class)
- Assign test documents to the category with the closest prototype vector, based on cosine similarity (see the sketch below)
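A minimal sketch of this procedure, assuming dense tf-idf-like vectors as input; the function names and toy data are illustrative, not from the slides. One centroid is built per class, and a test vector is assigned to the class whose centroid is most cosine-similar.

```python
# Rocchio classification sketch: one centroid (prototype) per class,
# assign a test vector to the class with the most similar centroid.
# Toy data and function names are illustrative assumptions.
import numpy as np

def train_rocchio(X, labels):
    """Compute the centroid of the training vectors of each class."""
    return {c: X[[i for i, l in enumerate(labels) if l == c]].mean(axis=0)
            for c in set(labels)}

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def classify_rocchio(centroids, x):
    """Assign x to the category with the closest prototype vector."""
    return max(centroids, key=lambda c: cosine(centroids[c], x))

# toy tf-idf-like document vectors (rows) with class labels
X = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1],
              [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]])
labels = ["Government", "Government", "Science", "Science"]
print(classify_rocchio(train_rocchio(X, labels),
                       np.array([0.7, 0.3, 0.0])))  # -> Government
```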
Rocchio Algorithm
A weakness: there may be many more red vectors along the y-axis, and they will drift the red centroid towards the y-axis. An atypical red document may then lie nearer to the blue centroid than to its own, and be misclassified.
Rocchio classification
- Little used outside text classification, though it has been used quite effectively for text classification
- In general, however, worse than Naïve Bayes
- Again, cheap to train and to test documents
Rocchio classification
- Rocchio forms a simple representative for each class: the centroid/prototype
- Classification: nearest prototype/centroid
- It does not guarantee that classifications are consistent with the given training data
k Nearest Neighbors
k Nearest Neighbor Classification
kNN = k Nearest Neighbor
To classify a document d:
- Define the k-neighborhood as the k nearest neighbors of d
- Pick the majority class label in the k-neighborhood (a sketch follows below)
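A sketch of this rule under the same toy-vector assumptions as before: rank the training documents by cosine similarity to the test document, take the k most similar, and return the majority label.

```python
# kNN classification sketch: majority vote among the k training
# documents most cosine-similar to the test vector x.
from collections import Counter
import numpy as np

def knn_classify(X, labels, x, k=3):
    # cosine similarity of x to every row of X
    sims = (X @ x) / (np.linalg.norm(X, axis=1) * np.linalg.norm(x) + 1e-12)
    nearest = np.argsort(-sims)[:k]              # indices of the k-neighborhood
    votes = Counter(labels[i] for i in nearest)  # majority class label wins
    return votes.most_common(1)[0][0]
```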
Example: k = 6 (6NN)
[Figure: a test document with its 6 nearest neighbors among the Government, Science, and Arts points; what is P(science | test document)?]
Nearest-Neighbor Learning
Learning: just store the labeled training examples D
Testing instance x (under 1NN):
- Compute the similarity between x and all examples in D
- Assign x the category of the most similar example in D
kNN does not compute anything beyond storing the examples. It is also called:
- Case-based learning (remembering every single example of each class)
- Memory-based learning (memorizing every instance of the training set)
- Lazy learning
Rationale of kNN: the contiguity hypothesis (documents near a given input document decide its class)
Nearest-Neighbor
k Nearest Neighbor
Using only the closest example (1NN) is subject to errors due to:
- A single atypical example
- Noise (i.e., an error) in the category label of a single training example
More robust: find the k nearest examples and return the majority category of these k.
k is typically odd to avoid ties; 3 and 5 are the most common choices.
Neighbors can also be assigned weights (by relevance) to decide the class, as in the sketch below.
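The weighting mentioned in the last point could look like this: each of the k neighbors votes with its cosine similarity rather than a flat count, so closer documents count more. This is a common choice; the slide does not fix a particular scheme.

```python
# Weighted-vote kNN sketch: neighbors vote with their cosine similarity
# instead of a flat count, so nearer documents carry more weight.
from collections import defaultdict
import numpy as np

def knn_classify_weighted(X, labels, x, k=3):
    sims = (X @ x) / (np.linalg.norm(X, axis=1) * np.linalg.norm(x) + 1e-12)
    scores = defaultdict(float)
    for i in np.argsort(-sims)[:k]:
        scores[labels[i]] += sims[i]  # vote weight = similarity to x
    return max(scores, key=scores.get)
```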
kNN decision boundaries
Boundaries are in principle arbitrary surfaces, but usually polyhedra.
kNN gives locally defined decision boundaries between classes: far away points do not influence each classification decision (unlike in Naïve Bayes, Rocchio, etc.)
[Figure: locally defined kNN decision boundaries between the Government, Science, and Arts regions]
3 Nearest Neighbor vs. Rocchio
Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB.
kNN: Discussion
- No feature selection necessary
- No training necessary
- Scales well with a large number of classes: no need to train n classifiers for n classes
- May be expensive at test time
- In most cases more accurate than NB or Rocchio
Evaluating Categorization
- Evaluation must be done on test data that are independent of the training data
- Sometimes cross-validation is used (averaging results over multiple training and test splits of the overall data); a sketch follows below
- It is easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set)
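A minimal cross-validation sketch using scikit-learn; the slide names the idea, not a tool, and the tiny corpus and parameter values are illustrative assumptions. Accuracy is averaged over splits so the classifier is never scored on documents it saw during training.

```python
# Cross-validation sketch: average kNN accuracy over train/test splits,
# so evaluation never reuses training documents. scikit-learn and the
# toy corpus are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

docs = ["budget law passed", "tax bill approved",
        "particle collider results", "quantum experiment measured",
        "sculpture exhibit opened", "painting auction record"]
labels = ["Gov", "Gov", "Sci", "Sci", "Arts", "Arts"]

X = TfidfVectorizer().fit_transform(docs)
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
scores = cross_val_score(knn, X, labels, cv=2)  # 2 splits for this tiny set
print(scores.mean())
```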
Bias vs. variance: Choosing the correct model capacity