1
Introduction to Information Retrieval
CIS 8590 – Fall 2008 NLP
2
The Inverted Index
3
Indexing
Indexing is a technique borrowed from databases. An index is a data structure that supports efficient lookups in a large data set, e.g., hash indexes, R-trees, B-trees.
4
Document Retrieval
In search engines, the lookups have to find all documents that contain the query terms. What’s the problem with using a tree-based index? A hash index?
5
Inverted Index
An inverted index stores an entry for every word, along with a pointer to every document in which that word appears.

Vocabulary   Postings List
Word1        Document17, Document45123
…            …
WordN        Document991, Document123001
6
Example
Document D1: “yes we got no bananas”
Document D2: “what you got”
Document D3: “yes I like what you got”

Vocabulary   Postings List
yes          D1, D3
we           D1
got          D1, D2, D3
no           D1
bananas      D1
what         D2, D3
you          D2, D3
I            D3
like         D3

Query “you got”:
“you” → {D2, D3}
“got” → {D1, D2, D3}
The whole query gives the intersection: {D2, D3} ∩ {D1, D2, D3} = {D2, D3}
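To make the lookup concrete, here is a minimal Python sketch of a record-level inverted index over the three example documents; the names build_index and search are illustrative, not from the slides.

```python
from functools import reduce

docs = {
    "D1": "yes we got no bananas",
    "D2": "what you got",
    "D3": "yes I like what you got",
}

def build_index(docs):
    """Map each term to the set of documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """AND semantics: intersect the postings of every query term."""
    postings = [index.get(term, set()) for term in query.split()]
    return reduce(set.intersection, postings) if postings else set()

index = build_index(docs)
print(sorted(search(index, "you got")))  # ['D2', 'D3']
```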
7
Variations
A record-level index stores just document identifiers in the postings list. A word-level index also stores offsets for the positions of the words in each document, which supports phrase-based searches (why? see the positional sketch below). Real search engines add all kinds of other information to their postings lists (see below).
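A sketch of why positions enable phrase queries, reusing the example documents; build_positional_index and phrase_search are illustrative names.

```python
docs = {
    "D1": "yes we got no bananas",
    "D2": "what you got",
    "D3": "yes I like what you got",
}

def build_positional_index(docs):
    """Word-level index: term -> {doc_id: [word offsets]}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """A document matches if the terms occur at consecutive positions."""
    terms = phrase.split()
    postings = [index.get(t, {}) for t in terms]
    common = set.intersection(*(set(p.keys()) for p in postings))
    hits = set()
    for doc in common:
        for start in postings[0][doc]:
            if all(start + i in postings[i][doc] for i in range(1, len(terms))):
                hits.add(doc)
                break
    return hits

index = build_positional_index(docs)
print(sorted(phrase_search(index, "you got")))  # ['D2', 'D3']
```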
8
Index Construction
Algorithm:
1. Scan through each document, word by word; write a (term, docID) pair for each word to a TempIndex file.
2. Sort TempIndex by term.
3. Iterate through the sorted TempIndex, merging all entries for the same term into one postings list.
(A small in-memory sketch follows.)
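A minimal in-memory sketch of the sort-then-merge construction; a real implementation writes the (term, docID) pairs to a TempIndex file on disk rather than a Python list.

```python
from itertools import groupby

docs = {
    "D1": "yes we got no bananas",
    "D2": "what you got",
    "D3": "yes I like what you got",
}

# Step 1: emit a (term, docID) pair for every word occurrence.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]
# Step 2: sort the pairs by term (and docID).
pairs.sort()
# Step 3: merge runs of the same term into a single postings list.
index = {term: sorted({doc for _, doc in group})
         for term, group in groupby(pairs, key=lambda p: p[0])}
print(index["got"])  # ['D1', 'D2', 'D3']
```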
9
Efficient Index Construction
Problem: indexes can be huge. How can we build them efficiently?
Blocked Sort-Based Indexing (BSBI)
Single-Pass In-Memory Indexing (SPIMI)
What’s the difference? (A SPIMI sketch follows.)
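One way to see the difference: BSBI sorts (term, docID) pairs within each block, while SPIMI builds a per-block dictionary directly and never sorts the raw pairs. A compact sketch of the SPIMI idea, with block size and on-disk merging simplified away:

```python
def spimi_blocks(doc_stream, max_terms=1_000_000):
    """Yield sorted per-block indexes; a real system writes each block
    to disk and merges the block files at the end."""
    block = {}
    for doc_id, text in doc_stream:
        for term in text.split():
            block.setdefault(term, []).append(doc_id)  # no global pair sort
        if len(block) >= max_terms:                    # memory full: flush
            yield dict(sorted(block.items()))
            block = {}
    if block:
        yield dict(sorted(block.items()))
```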
10
Ranking Results
11
Problem: too many matching results for every query
Using an inverted index is all well and good, but if your document collection has 10^12 documents and someone searches for “banana”, they’ll get 90 million results. We need to return the “most relevant” results first; that is, we need to rank the results.
12
Documents as Vectors
Example:
Document D1: “yes we got no bananas”
Document D2: “what you got”
Document D3: “yes I like what you got”

Dimensions:  yes we got no bananas what you I like
Vector V1:   [1, 1, 1, 1, 1, 0, 0, 0, 0]
Vector V2:   [0, 0, 1, 0, 0, 1, 1, 0, 0]
Vector V3:   [1, 0, 1, 0, 0, 1, 1, 1, 1]
13
What about queries?
In the vector space model, queries are treated as (very short) documents. Example query: “bananas”

Dimensions:  yes we got no bananas what you I like
Query Q1:    [0, 0, 0, 0, 1, 0, 0, 0, 0]
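A small sketch that reproduces the vectors on slides 12–13 by counting terms along a fixed vocabulary order (the to_vector helper is illustrative):

```python
docs = {
    "D1": "yes we got no bananas",
    "D2": "what you got",
    "D3": "yes I like what you got",
}
vocab = ["yes", "we", "got", "no", "bananas", "what", "you", "I", "like"]

def to_vector(text, vocab):
    """Term-count vector along the fixed vocabulary order."""
    words = text.split()
    return [words.count(term) for term in vocab]

vectors = {d: to_vector(t, vocab) for d, t in docs.items()}
print(vectors["D1"])                # [1, 1, 1, 1, 1, 0, 0, 0, 0]
print(to_vector("bananas", vocab))  # the query Q1: [0, 0, 0, 0, 1, 0, 0, 0, 0]
```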
14
Measuring Similarity
Similarity metric: the size of the angle between document vectors. “Cosine similarity”: sim(q, d) = cos θ = (q · d) / (‖q‖ ‖d‖)
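A direct implementation of the formula above:

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q1 = [0, 0, 0, 0, 1, 0, 0, 0, 0]   # the query "bananas"
v1 = [1, 1, 1, 1, 1, 0, 0, 0, 0]   # D1
print(cosine(q1, v1))  # 1/sqrt(5), approximately 0.447
```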
15
Ranking documents
Scoring each document vector against Q1 (“bananas”):
sim(Q1, V1) = 1/√5 ≈ 0.45
sim(Q1, V2) = 0
sim(Q1, V3) = 0
So D1 is ranked first; D2 and D3 do not match the query at all.
16
All words are equal?
No: the TF-IDF measure weights words more or less depending on how informative they are. A common form is tf-idf(t, d) = tf(t, d) · log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t.
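A one-line version of that weighting, applied to the example collection (N = 3 documents):

```python
import math

def tf_idf(tf, df, n_docs):
    """tf: term count in the document; df: number of documents containing it."""
    return tf * math.log(n_docs / df)

print(tf_idf(1, 3, 3))  # 0.0  -- "got" is in every document: uninformative
print(tf_idf(1, 1, 3))  # ~1.10 -- "bananas" is in one document: informative
```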
17
Compare Document Classification and Document Retrieval/Ranking
Similarities:
Differences:
18
Synonymy
19
Handling Synonymy in Retrieval
Problem: a straightforward search for a term may miss the most relevant results, because those documents use a synonym of the term. Examples:
A search for “Burma” will miss documents containing only “Myanmar”
A search for “document classification” will miss results for “text classification”
A search for “scientists” will miss results for “physicists”, “chemists”, etc.
20
Two approaches
1. Convert retrieval into a classification or clustering problem
   Relevance Feedback (classification)
   Pseudo-relevance Feedback (clustering)
2. Expand the query to include synonyms or other relevant terms
   Thesaurus-based
   Automatic query expansion
21
Relevance Feedback
Algorithm:
1. User issues a query q
2. System returns initial results D1
3. User labels some results (relevant or not)
4. System learns a classifier/ranker for relevance
5. System returns new result set D2
22
Relevance Feedback as Text Classification
The system gets a set of labeled documents (+ = relevant, − = not relevant). This is exactly the input to a standard text classification problem. Solution: convert the labeled documents into vectors, then apply a standard learner: Rocchio, Naïve Bayes, k-NN, SVM, … (a Rocchio sketch follows).
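A sketch of the Rocchio update, one of the learners listed above: it moves the query vector toward the centroid of the relevant documents and away from the nonrelevant ones. The alpha/beta/gamma defaults below are conventional values, not from the slides.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """New query = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant)."""
    def centroid(vecs):
        if not vecs:
            return [0.0] * len(query)
        return [sum(col) / len(vecs) for col in zip(*vecs)]
    rel, nonrel = centroid(relevant), centroid(nonrelevant)
    # Negative term weights are usually clipped to zero.
    return [max(0.0, alpha * q + beta * r - gamma * n)
            for q, r, n in zip(query, rel, nonrel)]
```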
23
Details
In relevance feedback, there are few labeled examples. Efficiency is a concern: the user is waiting online during training and testing. The output is a ranking, not a binary classification, but most classifiers can be converted into rankers: e.g., Naïve Bayes can rank by probability score, and an SVM can rank by wᵀx + b.
24
Pseudo Relevance Feedback
IDEA: instead of waiting for the user to provide relevance judgments, just use the top-K documents to represent the + (relevant) class. It’s a somewhat mind-bending thought, but this actually works in practice. Essentially, this is like one iteration of K-means clustering! (A sketch follows.)
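A sketch under that assumption, reusing the cosine and rocchio helpers sketched earlier: rank once, pretend the top-k results are relevant, update the query, and rank again.

```python
def pseudo_feedback(query, doc_vectors, k=10):
    """doc_vectors: dict mapping doc_id -> term vector."""
    def rank(q):
        return sorted(doc_vectors,
                      key=lambda d: cosine(q, doc_vectors[d]),
                      reverse=True)
    top_k = [doc_vectors[d] for d in rank(query)[:k]]  # assumed relevant
    new_query = rocchio(query, relevant=top_k, nonrelevant=[])
    return rank(new_query)
```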
25
Clickstream Mining (a.k.a. “indirect relevance feedback”)
IDEA: use the clicks that users make as proxies for relevance judgments. For example, if the search engine returns 10 documents for “bananas” and users consistently click on the third link first, then increase the rank of that document and similar ones.
26
Query Expansion
IDEA: help users formulate “better” queries. “Better” can mean:
More precise, to exclude more unrelated results
More inclusive, to increase recall of documents that wouldn’t match a basic query
27
Query Term Suggestion
Problem: given a base query q, suggest a list of terms T = {t1, …, tK} that could help the user refine the query. One common technique is to suggest terms that frequently “co-occur” with terms already in the base query.
28
Co-occurrence
Terms t1 and t2 “co-occur” if they occur near each other in the same document. There are many measures of co-occurrence, including pointwise mutual information (PMI), mutual information (MI), LSI-based scores, and others.
29
Computing Co-occurrence Example
[Term-document count matrix A: rows are terms t1–t4, columns are documents d1–d4; the entry A[t, d] is the count of term t in document d. The slide’s example shows A[t1, d1] = 2 and A[t1, d2] = 1.]
30
Computing Co-occurrence Example
With A the term-document matrix from the previous slide, the term-term co-occurrence matrix is C = A Aᵀ, where C[t, t′] = Σd A[t, d] · A[t′, d] counts how often t and t′ appear in the same documents.
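A numpy sketch of that product. Only A[t1, d1] = 2 and A[t1, d2] = 1 are given on the slide; the other entries below are made-up values for illustration.

```python
import numpy as np

A = np.array([[2, 1, 0, 0],   # t1 (first two entries from the slide)
              [0, 1, 1, 0],   # t2 (illustrative)
              [1, 0, 0, 1],   # t3 (illustrative)
              [0, 0, 1, 1]])  # t4 (illustrative)

C = A @ A.T  # C[t, t'] = sum over d of A[t, d] * A[t', d]
print(C)     # e.g., C[0, 1] = 1: t1 and t2 co-occur once (in d2)
```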
31
Query Log Mining
IDEA: use other people’s queries as suggestions for refinements of this query. Example: if I type “google” into the search bar, the search engine can suggest follow-up words that other people used, like “maps”, “earth”, “translate”, “wave”, …