Chapter 23: Probabilistic Language Models (April 13, 2004)


Corpus-Based Learning
–Information Retrieval
–Information Extraction
–Machine Translation

23.1 Probabilistic Language Models
Probabilistic language models have several advantages:
–Can be trained from data
–Robust (accept any sentence)
–Reflect the fact that not all speakers agree on which sentences are part of a language
–Can be used for disambiguation

Unigram model: P(w_i)
Bigram model: P(w_i | w_{i-1})
Trigram model: P(w_i | w_{i-2}, w_{i-1})
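As a concrete illustration (not on the original slides), here is a minimal sketch of maximum-likelihood unigram and bigram estimates from a toy corpus; the corpus and function names are invented for the example.

from collections import Counter

def train_ngrams(tokens):
    # Count unigrams and bigrams in a token list (toy example).
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams):
    # Maximum-likelihood estimate P(w | w_prev) = count(w_prev, w) / count(w_prev).
    if unigrams[w_prev] == 0:
        return 0.0
    return bigrams[(w_prev, w)] / unigrams[w_prev]

tokens = "the cat sat on the mat the cat ran".split()
unigrams, bigrams = train_ngrams(tokens)
print(bigram_prob("the", "cat", unigrams, bigrams))  # 2/3 on this toy corpus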

Smoothing
Problem: many pairs (triples, etc.) of words never occur in the training text.
N: number of words in the corpus
B: number of possible bigrams
c: actual count of a bigram
Add-One Smoothing: P = (c + 1) / (N + B)
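A one-line sketch of the slide's add-one (Laplace) estimate; the numeric values below are invented, and serve only to show that an unseen bigram (c = 0) still gets a small nonzero probability.

def add_one_bigram_prob(c, N, B):
    # Add-one (Laplace) smoothed estimate (c + 1) / (N + B), as on the slide:
    # c = observed count of the bigram, N = words in corpus, B = possible bigrams.
    return (c + 1) / (N + B)

print(add_one_bigram_prob(0, N=10_000, B=1_000_000))  # unseen bigram, still > 0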

Smoothing
Linear Interpolation Smoothing: the smoothed trigram estimate is a weighted mixture of the trigram, bigram, and unigram estimates:
P(w_i | w_{i-2}, w_{i-1}) = c_3 P(w_i | w_{i-2}, w_{i-1}) + c_2 P(w_i | w_{i-1}) + c_1 P(w_i), with c_1 + c_2 + c_3 = 1
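A minimal sketch of this mixture; the particular weights below are arbitrary illustrative values (in practice they would be tuned on held-out data), and the three input probabilities are assumed to come from separately trained n-gram models.

def interpolated_prob(p_tri, p_bi, p_uni, c3=0.6, c2=0.3, c1=0.1):
    # Linear interpolation: mix trigram, bigram and unigram estimates; weights sum to 1.
    assert abs(c1 + c2 + c3 - 1.0) < 1e-9
    return c3 * p_tri + c2 * p_bi + c1 * p_uni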

Segmentation
The task is to find the word boundaries in a text with no spaces.
Example (unigram model): P("with") = .2 and P("out") = .1, so P("with out") = .2 × .1 = .02, while P("without") = .05, so the single word "without" is preferred.
Figure 23.1: Viterbi-based segmentation algorithm
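Figure 23.1 itself is not reproduced in this transcript; the following is only a sketch in the spirit of a Viterbi-style dynamic program over split points, using the toy unigram probabilities above (the default probability for unknown single characters is an assumption of the example).

def segment(text, unigram_p, default=1e-6):
    # best[i] = probability of the best segmentation of text[:i].
    n = len(text)
    best = [0.0] * (n + 1)
    best[0] = 1.0
    split = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            word = text[j:i]
            p_word = unigram_p.get(word, default if len(word) == 1 else 0.0)
            if best[j] * p_word > best[i]:
                best[i], split[i] = best[j] * p_word, j
    # Recover the words by walking the split points backwards.
    words, i = [], n
    while i > 0:
        words.append(text[split[i]:i])
        i = split[i]
    return list(reversed(words))

print(segment("without", {"with": 0.2, "out": 0.1, "without": 0.05}))  # ['without']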

Probabilistic CFG (PCFG)
N-gram models have no notion of grammar at distances greater than n.
Figure 23.2: PCFG example
Figure 23.3: PCFG parse
Problem: the model is context-free, so the probability of a phrase ignores the context it appears in
Problem: preference for short sentences

Learning PCFG Probabilities
Parsed data: straightforward (count how often each rule is used and normalize).
Unparsed data: two challenges
–Learning the structure of the grammar rules. A Chomsky Normal Form bias can be used (X → Y Z, X → t). Something similar to SEQUITUR can be used.
–Learning the probabilities associated with each rule (inside-outside algorithm, based on dynamic programming).
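For the parsed-data case, a minimal sketch of the relative-frequency estimate P(X → α) = count(X → α) / count(X); the (LHS, RHS) rule list and its format are invented for the example.

from collections import Counter, defaultdict

def rule_probabilities(rule_occurrences):
    # MLE for PCFG rules: P(X -> alpha) = count(X -> alpha) / count(X).
    rule_counts = Counter(rule_occurrences)
    lhs_counts = defaultdict(int)
    for (lhs, rhs), c in rule_counts.items():
        lhs_counts[lhs] += c
    return {(lhs, rhs): c / lhs_counts[lhs] for (lhs, rhs), c in rule_counts.items()}

rules = [("NP", ("Det", "N")), ("NP", ("Det", "N")), ("NP", ("Pronoun",))]
print(rule_probabilities(rules))  # NP -> Det N : 2/3, NP -> Pronoun : 1/3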

23.2 Information Retrieval
Components of an IR system:
–Document collection
–Query posed in a query language
–Result set
–Presentation of the result set

Boolean Keyword Model
Boolean queries: each word in a document is treated as a Boolean feature.
Drawbacks
–Each word contributes only a single bit of relevance (present or absent)
–Boolean logic can be difficult for the average user to use correctly
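A toy sketch of Boolean AND retrieval over an inverted index; the index contents are made up, and OR and NOT would simply be set union and difference.

def boolean_and(query_terms, index):
    # Boolean AND: intersect the posting sets of every query term.
    postings = [index.get(t, set()) for t in query_terms]
    return set.intersection(*postings) if postings else set()

index = {"farming": {1, 3}, "pigs": {1, 2}}
print(boolean_and(["farming", "pigs"], index))  # {1}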

General Framework
r: Boolean random variable indicating relevance (r = true means the document is relevant)
D: document
Q: query
Score each document by P(r | D, Q) and order the results by decreasing probability.

Language Modeling
P(r | D, Q) = P(D, Q | r) P(r) / P(D, Q)            (Bayes' rule)
            = P(Q | D, r) P(D | r) P(r) / P(D, Q)   (chain rule)
            = P(Q | D, r) P(r | D) P(D) / P(D, Q)   (Bayes' rule again, with D fixed)
Rather than maximizing P(r | D, Q) directly, maximize the odds P(r | D, Q) / P(¬r | D, Q).

Language Modeling (continued)
P(r | D, Q) / P(¬r | D, Q) = [P(Q | D, r) P(r | D)] / [P(Q | D, ¬r) P(¬r | D)]
Eliminate P(Q | D, ¬r): if a document is irrelevant to a query, then knowing the document won't help determine the query, so this factor does not depend on D and can be dropped for ranking.
∝ P(Q | D, r) P(r | D) / P(¬r | D)

Language Modeling (continued)
P(r | D) / P(¬r | D) is a query-independent measure of document quality. It can be estimated from references (links) to the document, the recency of the document, etc.
P(Q | D, r) = ∏_j P(Q_j | D, r), where each Q_j is a word in the query.
Figure 23.4
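A hedged sketch of scoring a document by the product ∏_j P(Q_j | D, r), computed in log space; mixing the document model with collection statistics (the lambda weight below) is one common way to avoid zero probabilities, not something specified on the slide.

import math

def query_log_likelihood(query_words, doc_words, collection_counts, collection_size, lam=0.5):
    # log P(Q | D, r) = sum over query words of log P(Q_j | D, r), unigram document model.
    if not doc_words:
        return float("-inf")
    doc_counts = {}
    for w in doc_words:
        doc_counts[w] = doc_counts.get(w, 0) + 1
    score = 0.0
    for q in query_words:
        p_doc = doc_counts.get(q, 0) / len(doc_words)
        p_coll = collection_counts.get(q, 0) / collection_size
        score += math.log(lam * p_doc + (1 - lam) * p_coll + 1e-12)
    return score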

Evaluating IR Systems
Precision: the proportion of documents in the result set that are actually relevant.
Recall: the proportion of relevant documents in the collection that appear in the result set.
Average reciprocal rank: the average, over queries, of 1 / (rank of the first relevant document in the result list).
Time to answer: the length of time it takes the user to find the desired answer.
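A small sketch of these measures over made-up document id sets:

def precision_recall(result_set, relevant_set):
    # Precision: fraction of retrieved documents that are relevant.
    # Recall: fraction of relevant documents that were retrieved.
    hits = len(result_set & relevant_set)
    precision = hits / len(result_set) if result_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

def reciprocal_rank(ranked_ids, relevant_set):
    # 1 / rank of the first relevant document (0 if none is retrieved).
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_set:
            return 1.0 / rank
    return 0.0

print(precision_recall({1, 2, 3, 4}, {2, 4, 5}))  # (0.5, 0.666...)
print(reciprocal_rank([7, 2, 4], {2, 4, 5}))      # 0.5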

IR Refinements
Stemming: can help recall, can hurt precision.
Case folding.
Synonyms: use a bigram model.
Spelling corrections.
Metadata.

Result Sets
Relevance feedback from the user.
Document classification.
Document clustering.
–K-means clustering (see the sketch below):
1. Pick k documents at random as category seeds.
2. Assign every document to the closest category.
3. Compute the mean of each cluster and use these means as the new seeds.
4. Go to step 2 until convergence occurs.
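A minimal sketch of the four steps above over document vectors; Euclidean distance and a fixed iteration cap are assumptions of the example, not details from the slide.

import random

def k_means(doc_vectors, k, iters=20):
    # Step 1: pick k documents at random as the initial seeds.
    seeds = [list(v) for v in random.sample(doc_vectors, k)]
    assignment = []
    for _ in range(iters):
        # Step 2: assign every document to the closest seed (squared Euclidean distance).
        assignment = [min(range(k),
                          key=lambda c: sum((x - y) ** 2 for x, y in zip(v, seeds[c])))
                      for v in doc_vectors]
        # Step 3: compute the mean of each cluster and use these means as the new seeds.
        for c in range(k):
            members = [v for v, a in zip(doc_vectors, assignment) if a == c]
            if members:
                seeds[c] = [sum(dim) / len(members) for dim in zip(*members)]
        # Step 4: repeat (a convergence test could compare the old and new seeds).
    return assignment, seeds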

Implementing IR Systems
Lexicon: given a word, return its location in the inverted index. Stop words are often omitted.
Inverted index: might be, for each word, a list of (document, count) pairs.
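A toy sketch of building such an index; the stop-word list and documents are invented for the example.

from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "of", "to", "in", "on"}  # illustrative stop list

def build_inverted_index(docs):
    # docs maps document id -> text; the index maps word -> list of (doc id, count) pairs.
    index = defaultdict(list)
    for doc_id, text in docs.items():
        counts = Counter(w for w in text.lower().split() if w not in STOP_WORDS)
        for word, count in counts.items():
            index[word].append((doc_id, count))
    return index

index = build_inverted_index({1: "the cat sat on the mat", 2: "the cat ran"})
print(index["cat"])  # [(1, 1), (2, 1)]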

Vector Space Model
Used more often in practice than the probabilistic model.
Documents are represented as vectors of unigram word frequencies; a query is represented as a vector of 0s and 1s over the query terms.
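Ranking in vector-space systems is commonly done by the cosine of the angle between the query and document vectors; below is a minimal sketch of that scoring using raw term frequencies only (real systems usually add IDF weighting and length normalization, details that are not from the slide).

import math
from collections import Counter

def cosine_score(query_terms, doc_words):
    # Document vector: unigram word frequencies; query vector: 0/1 over its terms.
    doc_vec = Counter(doc_words)
    query_vec = {t: 1 for t in set(query_terms)}
    dot = sum(doc_vec[t] * query_vec[t] for t in query_vec)
    doc_norm = math.sqrt(sum(c * c for c in doc_vec.values()))
    query_norm = math.sqrt(len(query_vec))
    return dot / (doc_norm * query_norm) if doc_norm and query_norm else 0.0

print(cosine_score(["cat", "mat"], "the cat sat on the mat the cat".split()))  # ≈ 0.53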