Information Retrieval

Information Retrieval, Lecture 5. For the MSc Computer Science Programme.
Based on Introduction to Information Retrieval (Manning et al. 2007), Chapter 14.
Dell Zhang, Birkbeck, University of London

Recall: Vector Space Model
Docs → Vectors: each doc can now be viewed as a vector of TF×IDF values, one component for each term.
So we have a vector space:
- terms are the axes
- docs live in this space
- it may have 20,000+ dimensions, even with stemming
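The TF×IDF weighting above can be sketched in a few lines (a minimal illustration: the tokenized toy documents and the unsmoothed IDF formula log(N/df) are assumptions, not from the slides):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map tokenized docs to TF x IDF vectors, stored sparsely as term -> weight dicts."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

# hypothetical toy corpus for illustration
docs = [["tax", "budget", "vote"],
        ["galaxy", "star", "telescope"],
        ["star", "galaxy", "orbit", "star"]]
vecs = tfidf_vectors(docs)
# "tax" occurs in only 1 of 3 docs, so it gets a comparatively high IDF weight in doc 0
```

Each dict holds only a document's non-zero components, which is how such high-dimensional vectors are stored in practice.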

k Nearest Neighbors (kNN)
Given a test doc d and the training data:
- identify the set Sk of the k nearest neighbors of d, i.e., the k training docs most similar to d
- for each class cj ∈ C, compute N(Sk, cj), the number of Sk members that belong to class cj
- estimate Pr[cj|d] as N(Sk, cj) / k
- classify d into the majority class of Sk members
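The steps above can be sketched as follows (a hedged sketch: the sparse dict vectors, the cosine helper, and the toy training set are illustrative assumptions):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse vectors (term -> weight dicts)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(d, training, k):
    """training: list of (vector, class) pairs.
    Identify S_k, count class members, return the majority class."""
    s_k = sorted(training, key=lambda vc: cosine(d, vc[0]), reverse=True)[:k]
    counts = Counter(c for _, c in s_k)       # N(S_k, c_j) for each class c_j
    return counts.most_common(1)[0][0]        # majority class of S_k

training = [({"tax": 1.0, "vote": 0.5}, "Government"),
            ({"budget": 1.0, "tax": 0.8}, "Government"),
            ({"star": 1.0, "orbit": 0.7}, "Science")]
label = knn_classify({"tax": 0.9, "budget": 0.2}, training, k=3)
```

Note that `counts[c] / k` is exactly the estimate of Pr[cj|d] from the slide; taking the most common class implements the majority vote.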

kNN – Example: 5NN (k = 5), c(?) = ?
[Figure: training docs from three classes, Government, Science, and Arts, with an unlabeled test doc to be classified by its 5 nearest neighbors.]

kNN – Example: Decision Boundaries
[Figure: the decision boundaries induced by kNN between the Government, Science, and Arts classes.]

kNN – Online Demo http://www.comp.lancs.ac.uk/~kristof/research/notes/nearb/cluster.html

Similarity Metric
kNN depends on a similarity/distance metric.
For text, cosine similarity of TF×IDF weighted vectors is usually most effective.
[Figure: 3NN example, c(d) = ?]
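One reason cosine similarity suits text is that it normalizes away document length, which Euclidean distance does not. A quick check (the dense toy vectors are an assumption for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

short_doc = [1.0, 2.0, 0.0]
long_doc = [2.0, 4.0, 0.0]    # same term distribution, but twice the length
sim = cosine(short_doc, long_doc)       # 1.0: identical direction in the space
dist = math.dist(short_doc, long_doc)   # nonzero: Euclidean distance penalizes length
```

Two docs about the same topic but of different lengths point in the same direction, so cosine treats them as near-identical while Euclidean distance keeps them apart.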

Parameter k
k = 1: using only the nearest neighbor to determine the classification is often error-prone, due to:
- atypical training documents
- noise (i.e., errors) in the class labels

Parameter k
k > 1: more robust with a larger k. The value of k is typically odd to avoid ties; 3 and 5 are the most common choices.
k = N (N: the total number of training docs): every test doc would be classified into the largest class regardless of its content, degenerating to classification using the prior probabilities Pr[cj].

Efficiency: kNN with an Inverted Index
Naively, finding the kNN of a test doc d requires a linear search through all training docs.
But this is actually the same as finding the top-k retrieval results using d as a (long) query against the database of training docs.
Therefore the standard inverted index method for vector space retrieval can be used to accelerate this process.
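The inverted-index shortcut can be sketched as follows (a minimal in-memory index; real systems add normalization and pruning, and the toy vectors here are assumptions):

```python
import heapq
from collections import defaultdict

def build_index(training_vectors):
    """Postings lists: term -> [(doc_id, weight), ...]."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(training_vectors):
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index

def top_k(index, query_vec, k):
    """Score only the training docs that share at least one term with the
    query doc, instead of scanning all training docs linearly."""
    scores = defaultdict(float)
    for term, qw in query_vec.items():
        for doc_id, dw in index.get(term, []):
            scores[doc_id] += qw * dw           # accumulate dot-product contributions
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

index = build_index([{"tax": 1.0}, {"star": 1.0, "orbit": 0.5}, {"tax": 0.5, "vote": 1.0}])
neighbours = top_k(index, {"tax": 1.0}, k=2)    # only docs 0 and 2 are ever scored
```

Because only the postings lists of the query's terms are traversed, docs sharing no terms with d (here doc 1) are never touched, which is where the speedup over a linear scan comes from.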

Discussion
Lazy learning: we don't need to learn anything until a test doc needs to be classified.
Asymptotically close to optimal.
More training documents usually lead to better accuracy, though at lower speed.
Scales well with the number of classes.

Take Home Messages kNN Classification