Introduction to Information Retrieval


Introduction to Information Retrieval Slides by me, CIS 8590 – Fall 2008 NLP

The Inverted Index

Indexing
Indexing is a technique borrowed from databases. An index is a data structure that supports efficient lookups in a large data set, e.g., hash indexes, R-trees, B-trees, etc.

Document Retrieval
In search engines, the lookups have to find all documents that contain the query terms.
What's the problem with using a tree-based index? A hash index?

Inverted Index
An inverted index stores an entry for every word, and a pointer to every document where that word is seen.
Vocabulary → Postings List
Word1 → Document17, Document45123, …
WordN → Document991, Document123001
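The structure above can be sketched as a dictionary from words to sets of document IDs. This is a minimal in-memory illustration, not the implementation behind the slides; real postings lists are compressed and stored on disk.

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: word -> set of IDs of documents containing it.

    docs: dict mapping doc ID -> text.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    "D1": "yes we got no bananas",
    "D2": "what you got",
    "D3": "yes I like what you got",
}
index = build_index(docs)
print(sorted(index["got"]))  # → ['D1', 'D2', 'D3']
```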

Example
Document D1: “yes we got no bananas”
Document D2: “what you got”
Document D3: “yes I like what you got”

Vocabulary → Postings List
yes → D1, D3
we → D1
got → D1, D2, D3
no → D1
bananas → D1
what → D2, D3
you → D2, D3
I → D3
like → D3

Query “you got”:
“you” → {D2, D3}
“got” → {D1, D2, D3}
The whole query gives the intersection: {D2, D3} ∩ {D1, D2, D3} = {D2, D3}
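The intersection step in the example can be written directly with set operations. A small sketch, with the postings lists from the example written out by hand:

```python
# Postings lists from the "yes we got no bananas" example.
postings = {
    "yes": {"D1", "D3"}, "we": {"D1"}, "got": {"D1", "D2", "D3"},
    "no": {"D1"}, "bananas": {"D1"}, "what": {"D2", "D3"},
    "you": {"D2", "D3"}, "i": {"D3"}, "like": {"D3"},
}

def query_and(terms):
    """Answer a conjunctive query by intersecting the terms' postings."""
    sets = [postings.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(sorted(query_and(["you", "got"])))  # → ['D2', 'D3']
```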

Variations
A record-level index stores just document identifiers in the postings list.
A word-level index stores document IDs plus offsets for the positions of the words in each document. This supports phrase-based searches (why?).
Real search engines add all kinds of other information to their postings lists (see below).
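A word-level index supports phrase search because adjacent query words must appear at consecutive positions. A sketch of this idea (illustrative only; production systems merge positional postings far more efficiently):

```python
from collections import defaultdict

def build_positional_index(docs):
    """Word-level index: term -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return IDs of documents containing the words of `phrase` consecutively."""
    words = phrase.lower().split()
    hits = set()
    for doc_id, positions in index.get(words[0], {}).items():
        for p in positions:
            # every following word must occur at the next position over
            if all(p + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words[1:], start=1)):
                hits.add(doc_id)
    return hits

docs = {"D2": "what you got", "D3": "yes I like what you got"}
idx = build_positional_index(docs)
print(sorted(phrase_search(idx, "you got")))  # → ['D2', 'D3']
```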

Index Construction
Algorithm:
1. Scan through each document, word by word. Write a (term, docID) pair for each word to a TempIndex file.
2. Sort TempIndex by term.
3. Iterate through the sorted TempIndex, merging all entries for the same term into one postings list.
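The three steps above can be sketched as follows. For brevity the "TempIndex" lives in memory here; in the algorithm as described it would be a file on disk:

```python
from itertools import groupby

def construct_index(docs):
    # 1. Emit a (term, doc_id) pair for every word occurrence.
    pairs = [(word, doc_id)
             for doc_id, text in docs.items()
             for word in text.lower().split()]
    # 2. Sort the pairs by term (then doc ID).
    pairs.sort()
    # 3. Merge runs of the same term into one postings list.
    return {term: sorted({d for _, d in group})
            for term, group in groupby(pairs, key=lambda p: p[0])}

docs = {"D1": "yes we got no bananas", "D2": "what you got"}
print(construct_index(docs)["got"])  # → ['D1', 'D2']
```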

Efficient Index Construction
Problem: indexes can be huge. How can we build them efficiently?
Blocked Sort-Based Indexing (BSBI)
Single-Pass In-Memory Indexing (SPIMI)
What's the difference?

Ranking Results

Problem: too many matching results for every query
Using an inverted index is all fine and good, but if your document collection has 10^12 documents and someone searches for “banana”, they will get 90 million results. We need to return the “most relevant” results first; that is, we need to rank the results.

Documents as Vectors
Document D1: “yes we got no bananas”
Document D2: “what you got”
Document D3: “yes I like what you got”

Vocabulary: yes, we, got, no, bananas, what, you, I, like
Vector V1: 1 1 1 1 1 0 0 0 0
Vector V2: 0 0 1 0 0 1 1 0 0
Vector V3: 1 0 1 0 0 1 1 1 1
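Mapping a document to its vector over a fixed vocabulary can be sketched in a few lines (binary term presence, matching the example above; real systems use term counts or weights):

```python
# Vocabulary from the example, in a fixed order.
vocab = ["yes", "we", "got", "no", "bananas", "what", "you", "i", "like"]

def to_vector(text):
    """Binary term-presence vector for `text` over `vocab`."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocab]

print(to_vector("yes we got no bananas"))  # → [1, 1, 1, 1, 1, 0, 0, 0, 0]
```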

What about queries?
In the vector space model, queries are treated as (very short) documents.
Example query: “bananas”
Vocabulary: yes, we, got, no, bananas, what, you, I, like
Query Q1: 0 0 0 0 1 0 0 0 0

Measuring Similarity
Similarity metric: the size of the angle between document vectors (the smaller the angle, the more similar the documents).
“Cosine Similarity”: sim(q, d) = (q · d) / (|q| |d|)
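Cosine similarity is the dot product of the two vectors divided by the product of their lengths. A direct sketch of the standard formula, using the query and document vectors from the running example:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q  = [0, 0, 0, 0, 1, 0, 0, 0, 0]   # query "bananas"
d1 = [1, 1, 1, 1, 1, 0, 0, 0, 0]   # "yes we got no bananas"
print(round(cosine(q, d1), 3))  # → 0.447
```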

Ranking documents
Query Q1 (“bananas”): 0 0 0 0 1 0 0 0 0
Vector V1: 1 1 1 1 1 0 0 0 0
Vector V2: 0 0 1 0 0 1 1 0 0
Vector V3: 1 0 1 0 0 1 1 1 1
Rank the documents by their cosine similarity to the query vector.

All words are equal?
The TF-IDF measure weights different words more or less, depending on how informative they are.
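One common tf-idf variant (raw term frequency times log inverse document frequency) can be sketched as follows. This is the standard textbook formulation, not necessarily the exact variant the slides use:

```python
import math

def tf_idf(term, doc_words, all_docs):
    """tf-idf weight of `term` in `doc_words` (a list of words),
    given `all_docs` as a list of word lists."""
    tf = doc_words.count(term)
    df = sum(1 for d in all_docs if term in d)       # document frequency
    idf = math.log(len(all_docs) / df) if df else 0.0
    return tf * idf

docs = [d.lower().split() for d in
        ["yes we got no bananas", "what you got", "yes I like what you got"]]
# "got" appears in every document, so its idf (hence tf-idf) is 0.
print(tf_idf("got", docs[0], docs))                # → 0.0
# "bananas" appears in only one document, so it gets a high weight.
print(round(tf_idf("bananas", docs[0], docs), 3))  # → 1.099
```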

Compare Document Classification and Document Retrieval/Ranking
Similarities:
Differences:

Synonymy

Handling Synonymy in Retrieval
Problem: a straightforward search for a term may miss the most relevant results, because those documents use a synonym of the term.
Examples:
A search for “Burma” will miss documents containing only “Myanmar”.
A search for “document classification” will miss results for “text classification”.
A search for “scientists” will miss results for “physicists”, “chemists”, etc.

Two approaches
1. Convert retrieval into a classification or clustering problem:
Relevance Feedback (classification)
Pseudo-Relevance Feedback (clustering)
2. Expand the query to include synonyms or other relevant terms:
Thesaurus-based expansion
Automatic query expansion

Relevance Feedback
Algorithm:
1. User issues a query q.
2. System returns an initial result set D1.
3. User labels some results (relevant or not).
4. System learns a classifier/ranker for relevance.
5. System returns a new result set D2.

Relevance Feedback as Text Classification
The system gets a set of labeled documents (+ = relevant, − = not relevant).
This is exactly the input to a standard text classification problem.
Solution: convert the labeled documents into vectors, then apply standard learning: Rocchio, Naïve Bayes, k-NN, SVM, …
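Of the learners listed, Rocchio is the simplest: move the query vector toward the centroid of the relevant documents and away from the centroid of the non-relevant ones. A sketch of the standard formulation (the weights alpha, beta, gamma are conventional tuning parameters, not values from the slides):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio feedback: q' = alpha*q + beta*centroid(rel) - gamma*centroid(nonrel)."""
    dims = len(query)

    def centroid(vectors):
        if not vectors:
            return [0.0] * dims
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

    rel_c, nonrel_c = centroid(relevant), centroid(nonrelevant)
    # Negative component weights are conventionally clipped to zero.
    return [max(0.0, alpha * query[i] + beta * rel_c[i] - gamma * nonrel_c[i])
            for i in range(dims)]

q   = [0, 0, 0, 0, 1, 0, 0, 0, 0]        # query "bananas"
rel = [[1, 1, 1, 1, 1, 0, 0, 0, 0]]      # one document marked relevant
new_q = rocchio(q, rel, [])
print([round(x, 2) for x in new_q])  # → [0.75, 0.75, 0.75, 0.75, 1.75, 0.0, 0.0, 0.0, 0.0]
```

The expanded query now also matches the other terms of the relevant document, which is how feedback can pull in documents that use synonyms the original query lacked.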

Details
In relevance feedback, there are few labeled examples.
Efficiency is a concern: the user is waiting online during training and testing.
The output is a ranking, not a binary classification. But most classifiers can be converted into rankers; e.g., Naïve Bayes can rank by probability score, and an SVM can rank by w^T x + b. CIS 8590 – Spring 2010 NLP

Pseudo-Relevance Feedback
IDEA: instead of waiting for the user to provide relevance judgments, just use the top-K documents to represent the + (relevant) class.
It's a somewhat mind-bending thought, but this actually works in practice. Essentially, this is like one iteration of K-means clustering!
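A minimal sketch of the idea: rank once, treat the top-k results as "relevant", and add their most frequent new terms to the query. Plain term counting stands in here for a real weighting scheme such as Rocchio over tf-idf vectors:

```python
from collections import Counter

def pseudo_feedback_terms(query_terms, ranked_docs, k=2, n_expand=2):
    """Suggest expansion terms from the top-k documents of an initial ranking."""
    top_k = ranked_docs[:k]
    counts = Counter(w for doc in top_k for w in doc.lower().split())
    for t in query_terms:
        del counts[t]          # don't re-suggest the original query terms
    return [t for t, _ in counts.most_common(n_expand)]

# Documents in (assumed) ranked order for the query "got".
ranked = ["yes we got no bananas", "yes I like what you got", "what you got"]
print(pseudo_feedback_terms(["got"], ranked, k=2, n_expand=2))
```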

Clickstream Mining (aka “indirect relevance feedback”)
IDEA: use the clicks that users make as proxies for relevance judgments.
For example, if the search engine returns 10 documents for “bananas”, and users consistently click on the third link first, then increase the rank of that document and similar ones.

Query Expansion
IDEA: help users formulate “better” queries. “Better” can mean:
More precise, to exclude more unrelated material
More inclusive, to increase recall of documents that wouldn't match a basic query

Query Term Suggestion
Problem: given a base query q, suggest a list of terms T = {t1, …, tK} that could help the user refine the query.
One common technique is to suggest terms that frequently “co-occur” with terms already in the base query.

Co-occurrence
Terms t1 and t2 “co-occur” if they occur near each other in the same document.
There are many measures of co-occurrence, including PMI, MI, LSI-based scores, and others.

Computing Co-occurrence Example
A[t, d]: the term-document count matrix, with one row per term (t1…t4) and one column per document (d1…d4); entry A[t, d] is the number of times term t occurs in document d.

Computing Co-occurrence Example
With A indexed as A[t, d], the term-term co-occurrence matrix is C = A Aᵀ. Entry C[t, t′] sums, over documents, the product of the counts of t and t′, so pairs of terms that occur in the same documents get high scores.
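The matrix product can be sketched directly. The counts below are made up for illustration, since the slide's original matrix values did not survive transcription:

```python
def cooccurrence(A):
    """C = A Aᵀ for a term-document count matrix A (rows = terms)."""
    n, m = len(A), len(A[0])
    return [[sum(A[i][k] * A[j][k] for k in range(m))
             for j in range(n)] for i in range(n)]

# Rows t1..t3, columns d1..d4 (hypothetical counts).
A = [
    [2, 1, 0, 0],   # t1
    [0, 1, 1, 0],   # t2
    [0, 0, 1, 2],   # t3
]
C = cooccurrence(A)
print(C[0][1])  # co-occurrence score of t1 and t2 → 1
```

Given a query containing t1, the terms t′ with the largest C[t1, t′] are natural suggestions.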

Query Log Mining
IDEA: use other people's queries as suggestions for refinements of this query.
Example: if I type “google” into the search bar, the search engine can suggest follow-up words that other people used, like “maps”, “earth”, “translate”, “wave”, …