CS246 Basic Information Retrieval

Today's Topics
- Basic Information Retrieval (IR)
- Bag-of-words assumption
- Boolean model
- Inverted index
- Vector-space model
- Document-term matrix
- TF-IDF vectors and cosine similarity
- Phrase queries
- Spelling correction

Information-Retrieval System
- Information source: existing text documents
- Keyword-based / natural-language queries
- The system returns the best-matching documents for a given query
- Challenge
  - Both queries and data are "fuzzy": unstructured text and natural-language queries
  - What documents are good matches for a query?
  - Computers do not "understand" the documents or the queries
  - Developing a model that a computer can execute is essential to implementing this approach

Bag of Words: A Major Simplification
- Treat each document as a "bag of words"
  - "bag" vs. "set": ignore word ordering, but keep word counts
- Treat queries as bags of words as well
- A great oversimplification, but it works adequately in many cases
  - "John loves only Jane" vs. "Only John loves Jane" (see the sketch below)
  - The limitation still shows up in current search engines
- Still, how do we match documents and queries?
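
To make the bag-of-words representation concrete, here is a minimal Python sketch (the whitespace tokenizer is a deliberately crude illustration): the two example sentences from the slide produce identical bags, so the model cannot tell them apart.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase, split on whitespace, and count word occurrences."""
    return Counter(text.lower().split())

# Word order is lost: both sentences map to the same bag.
print(bag_of_words("John loves only Jane") == bag_of_words("Only John loves Jane"))  # True
```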

Boolean Model
- Return all documents that contain the words in the query
- The simplest model for information retrieval
- No notion of "ranking": a document is either a match or a non-match
- Q: How do we find and return the matching documents?
  - Basic algorithm? Useful data structure?

Inverted Index
- Allows quick lookup of the document IDs containing a particular word
- Q: How can we use this to answer "UCLA Physics"? (A sketch follows below.)
- [Figure: a lexicon/dictionary (DIC) with entries Stanford, UCLA, MIT, ..., each pointing to its postings list: PL(Stanford), PL(UCLA), PL(MIT)]
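
A minimal sketch of how the index answers a conjunctive query such as "UCLA Physics": look up each term's postings list and intersect them. The merge-style intersection below assumes postings lists are kept sorted by document ID; the toy index is invented for illustration.

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Merge-intersect two sorted postings lists in O(len(p1) + len(p2))."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

index = {"ucla": [1, 4, 7, 9], "physics": [2, 4, 9, 12]}  # toy postings lists
print(intersect(index["ucla"], index["physics"]))  # [4, 9]
```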

Size of Inverted Index (1)
- Assume: 100M docs, 10KB/doc, 1,000 unique words/doc, 10B/word, 4B/docid
- Q: Document collection size?
- Q: Inverted index size? (A back-of-the-envelope calculation follows below.)
- Heaps' law: vocabulary size V = k·n^b, where n is the number of tokens in the collection, with 30 < k < 100 and 0.4 < b < 1
  - k = 50 and b = 0.5 are a good rule of thumb
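
A back-of-the-envelope answer to both size questions, using the numbers above (a sketch; the per-item byte counts are the slide's assumptions):

```python
DOCS = 100_000_000            # 100M documents
DOC_SIZE = 10_000             # 10 KB per document
UNIQUE_WORDS_PER_DOC = 1_000
BYTES_PER_WORD = 10
BYTES_PER_DOCID = 4

# Collection size: 100M docs x 10 KB each
collection_bytes = DOCS * DOC_SIZE                              # 1e12 B = 1 TB

# Postings: one docid per (document, unique word) pair
postings_bytes = DOCS * UNIQUE_WORDS_PER_DOC * BYTES_PER_DOCID  # 4e11 B = 400 GB

# Dictionary via Heaps' law, V = k * n**b with k = 50, b = 0.5
tokens = DOCS * (DOC_SIZE // BYTES_PER_WORD)                    # ~1e11 tokens
vocab = 50 * tokens ** 0.5                                      # ~16M terms
dictionary_bytes = vocab * BYTES_PER_WORD                       # ~160 MB

print(f"collection ~ {collection_bytes / 1e12:.1f} TB")
print(f"postings   ~ {postings_bytes / 1e9:.0f} GB")
print(f"dictionary ~ {dictionary_bytes / 1e6:.0f} MB")
```

The postings lists (hundreds of gigabytes) dwarf the dictionary (hundreds of megabytes), which also answers the first question on the next slide.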

Size of Inverted Index (2)
- Q: Between the dictionary and the postings lists, which one is larger?
- Q: How long are the postings lists?
  - Zipf's law: a term's collection frequency is inversely proportional to its frequency rank (cf ∝ 1/rank)
- Q: How do we construct an inverted index?

Inverted Index Construction

C: set of all documents (corpus)
DIC: dictionary of the inverted index
PL(w): postings list of word w

For each document d ∈ C:
    Extract all words in content(d) into W
    For each w ∈ W:
        If w ∉ DIC, then add w to DIC
        Append id(d) to PL(w)

Q: What if the index is larger than main memory? (See the next slide; a runnable in-memory version follows below.)
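
A runnable in-memory version of the pseudocode above (a sketch that assumes the corpus fits in main memory; the tiny corpus is invented for illustration):

```python
from collections import defaultdict

def build_index(corpus: dict[int, str]) -> dict[str, list[int]]:
    """Map each word to the sorted list of ids of documents containing it."""
    index: dict[str, list[int]] = defaultdict(list)
    for doc_id in sorted(corpus):
        for word in set(corpus[doc_id].lower().split()):  # unique words of d
            index[word].append(doc_id)  # ids stay sorted: docs visited in order
    return dict(index)

corpus = {1: "UCLA physics department", 2: "Stanford physics", 3: "UCLA campus"}
print(build_index(corpus)["physics"])  # [1, 2]
```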

Inverted-Index Construction
- For a large text corpus:
  - Block sort-based construction
  - Partition and merge (see the sketch below)
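
One way to realize "partition and merge" (a simplified sketch of block sort-based construction: index one memory-sized block at a time as sorted (term, docid) pairs, then k-way-merge the sorted runs; disk I/O and compression are omitted):

```python
import heapq

def build_block(docs: list[tuple[int, str]]) -> list[tuple[str, int]]:
    """Index one memory-sized block as sorted (term, doc_id) pairs."""
    pairs = {(w, doc_id) for doc_id, text in docs for w in text.lower().split()}
    return sorted(pairs)

def merge_blocks(blocks: list[list[tuple[str, int]]]) -> dict[str, list[int]]:
    """k-way merge of the sorted runs into the final postings lists."""
    index: dict[str, list[int]] = {}
    for term, doc_id in heapq.merge(*blocks):
        index.setdefault(term, []).append(doc_id)
    return index

blocks = [build_block([(1, "UCLA physics"), (2, "Stanford physics")]),
          build_block([(3, "UCLA campus")])]
print(merge_blocks(blocks)["ucla"])  # [1, 3]
```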

Evaluation: Precision and Recall
- Q: Are all matching documents what users want?
- Basic idea: a model is good if it returns a document if and only if the document is "relevant"
- R: the set of "relevant" documents; D: the set of documents returned by the model
- Precision = |R ∩ D| / |D|; Recall = |R ∩ D| / |R|
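
A minimal sketch of the two measures over sets of document IDs (the example sets are invented):

```python
def precision(relevant: set[int], returned: set[int]) -> float:
    """Fraction of returned documents that are relevant: |R & D| / |D|."""
    return len(relevant & returned) / len(returned)

def recall(relevant: set[int], returned: set[int]) -> float:
    """Fraction of relevant documents that were returned: |R & D| / |R|."""
    return len(relevant & returned) / len(relevant)

R = {1, 2, 3, 4}   # relevant documents
D = {2, 4, 5}      # documents the model returned
print(precision(R, D))  # 0.667: two of the three returned docs are relevant
print(recall(R, D))     # 0.5:   two of the four relevant docs were returned
```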

Vector-Space Model
- Main problem of the Boolean model: too many matching documents when the corpus is large
  - Any way to "rank" documents?
- Matrix interpretation of the Boolean model
  - Document-term matrix with a Boolean 0/1 value in each entry (see the small example below)
- Basic idea: assign a real-valued weight to each matrix entry according to the importance of the term
  - "the" vs. "UCLA"
- Q: How should we assign the weights?
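
A tiny Boolean document-term matrix for illustration (the documents and terms are invented):

```python
terms = ["ucla", "physics", "stanford", "campus"]
doc_term = {            # one Boolean row per document
    1: [1, 1, 0, 0],    # "UCLA physics"
    2: [0, 1, 1, 0],    # "Stanford physics"
    3: [1, 0, 0, 1],    # "UCLA campus"
}
# Boolean query "UCLA AND physics": documents whose row has 1 in both columns
i, j = terms.index("ucla"), terms.index("physics")
print([d for d, row in doc_term.items() if row[i] and row[j]])  # [1]
```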

TF-IDF Vector
- A term t is important for document d
  - if t appears many times in d, or
  - if t is a "rare" term
- TF (term frequency): the number of occurrences of t in d
- DF (document frequency): the number of documents containing t
- TF-IDF weight: tf × log(N/df), where N is the total number of documents; log(N/df) is the inverse document frequency (IDF)
- Q: How do we use it to compute query-document relevance?
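
A minimal sketch of TF-IDF weighting over a toy corpus (whitespace tokenization and the corpus itself are illustrative choices; it assumes every term of the document occurs somewhere in the corpus, so df > 0):

```python
import math
from collections import Counter

def tfidf_vector(doc: str, corpus: list[str]) -> dict[str, float]:
    """Weight each term t of doc by tf(t, doc) * log(N / df(t))."""
    n = len(corpus)
    tf = Counter(doc.lower().split())
    return {t: count * math.log(n / sum(t in d.lower().split() for d in corpus))
            for t, count in tf.items()}

corpus = ["UCLA physics", "Stanford physics", "UCLA campus", "good campus"]
print(tfidf_vector("ucla physics physics", corpus))
# {'ucla': 0.69..., 'physics': 1.38...}: 'physics' has tf = 2, both have df = 2
```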

Cosine Similarity
- Represent both the query Q and the document D as TF-IDF vectors
- Take the inner product of the two normalized vectors to compute their similarity:
  sim(Q, D) = (Q · D) / (|Q| |D|)
- Note: dividing by |Q| does not affect document ranking; dividing by |D| penalizes longer documents
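
A sketch of cosine similarity over sparse term-weight vectors (representing vectors as term → weight dicts is an implementation choice):

```python
import math

def cosine(q: dict[str, float], d: dict[str, float]) -> float:
    """Inner product of q and d, divided by both vector lengths."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0
```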

Cosine Similarity: Example
- Let idf(UCLA) = 10, idf(good) = 0.1, idf(university) = idf(car) = idf(racing) = 1
- Case 1: Q = (UCLA, university), D = (car, racing)
- Case 2: Q = (UCLA, university), D = (UCLA, good)
- Case 3: Q = (UCLA, university), D = (university, good)
- (The three similarities are computed in the sketch below.)
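
Working the three cases with the cosine() sketch above (each word occurs once per bag, so each vector component is simply the word's idf):

```python
# Assumes cosine() from the previous sketch is in scope.
idf = {"UCLA": 10, "good": 0.1, "university": 1, "car": 1, "racing": 1}
vec = lambda words: {w: idf[w] for w in words}

q = vec(["UCLA", "university"])
print(cosine(q, vec(["car", "racing"])))       # 0.0    -- no shared terms
print(cosine(q, vec(["UCLA", "good"])))        # ~0.995 -- shares the rare term UCLA
print(cosine(q, vec(["university", "good"]))) # ~0.099 -- shares only a common term
```

Sharing the rare, high-IDF term dominates the ranking, which is exactly the behavior TF-IDF weighting is designed to produce.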

Finding High Cosine-Similarity Documents
- Q: Under the vector-space model, do precision and recall make sense?
- Q: How do we find the documents with the highest cosine similarity in the corpus?
- Q: Is there any way to avoid a complete scan of the corpus?

Inverted Index for TF-IDF
- Q · d_i = 0 if d_i contains no query words
  - So consider only the documents that contain at least one query word
- Inverted index: word → documents
- [Figure: a lexicon of (word, IDF) entries, e.g. Stanford 1/3530, UCLA 1/9860, MIT 1/937, each pointing to a postings list of (docid, TF) pairs, e.g. D1, D14, ...; TF may be normalized by document size]
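
A sketch of term-at-a-time scoring over such an index: only documents that appear in some query term's postings list ever accumulate a score, so the rest of the corpus is never touched (the index layout mirrors the figure; all values are invented, and query weighting/normalization is omitted):

```python
from collections import defaultdict

lexicon = {"stanford": 1 / 3530, "ucla": 1 / 9860, "physics": 1 / 937}  # term -> idf
postings = {"ucla": [(1, 3), (14, 1)], "physics": [(1, 2), (7, 5)]}     # term -> (docid, tf)

def score(query_terms: list[str]) -> dict[int, float]:
    """Accumulate tf*idf contributions; unseen documents stay unscored."""
    scores: dict[int, float] = defaultdict(float)
    for t in query_terms:
        for doc_id, tf in postings.get(t, []):
            scores[doc_id] += tf * lexicon[t]
    return dict(scores)

print(sorted(score(["ucla", "physics"]).items(), key=lambda kv: -kv[1]))
```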

Phrase Queries
- "Harvard University Boston" exactly as a phrase
- Q: How can we support this query?
- Two approaches:
  - Biword index
  - Positional index (see the sketch below)
- Q: Pros and cons of each approach?
- Rule of thumb: a positional index is 2x-4x larger than a docid-only index
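
A sketch of phrase matching with a positional index, which maps each term to {doc_id: sorted positions}; a phrase matches wherever the terms' positions are consecutive (the toy index is invented):

```python
pindex = {  # term -> {doc_id: positions at which the term occurs}
    "harvard":    {1: [0, 17], 2: [4]},
    "university": {1: [1],     2: [10]},
    "boston":     {1: [2, 9],  2: [11]},
}

def phrase_match(terms: list[str]) -> list[int]:
    """Doc ids in which the terms occur at consecutive positions."""
    docs = set.intersection(*(set(pindex[t]) for t in terms))
    hits = []
    for d in sorted(docs):
        starts = pindex[terms[0]][d]
        if any(all(p + i in pindex[t][d] for i, t in enumerate(terms))
               for p in starts):
            hits.append(d)
    return hits

print(phrase_match(["harvard", "university", "boston"]))  # [1]
```

A biword index would instead store pairs like "harvard university" as single dictionary terms, trading a larger dictionary for simpler two-word lookups.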

Spelling Correction
- Q: What is the user's intention for the query "Britnie Spears"? How can we find the correct spelling?
- Given a user-typed word w, find its correct spelling c
- Probabilistic approach: find the c with the highest probability P(c|w)
  - Q: How do we estimate it?
- Bayes' rule: P(c|w) = P(w|c)P(c)/P(w)
  - Q: What are these probabilities, and how can we estimate them?
- Rule of thumb: 75% of misspellings are within edit distance 1; 98% are within edit distance 2
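
A minimal Norvig-style sketch of this approach: generate all edit-distance-1 candidates, keep those that appear in a known vocabulary, and pick the most frequent one. Corpus frequency stands in for P(c), and P(w|c) is treated as uniform over candidates, which is a simplifying assumption; the word counts are invented.

```python
from collections import Counter

WORD_COUNTS = Counter({"britney": 900, "spears": 800, "britain": 400})  # toy counts

def edits1(w: str) -> set[str]:
    """All strings within one delete/transpose/replace/insert of w."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(w[:i], w[i:]) for i in range(len(w) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in letters}
    inserts = {l + c + r for l, r in splits for c in letters}
    return deletes | transposes | replaces | inserts

def correct(w: str) -> str:
    """Most frequent known word among w itself and its distance-1 edits."""
    candidates = ({w} & set(WORD_COUNTS)) or (edits1(w) & set(WORD_COUNTS)) or {w}
    return max(candidates, key=lambda c: WORD_COUNTS[c])

print(correct("britny"))  # britney
```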

Summary
- Boolean model
- Vector-space model: TF-IDF weights, cosine similarity
- Inverted index: for both the Boolean and TF-IDF models
- Phrase queries
- Spelling correction