Slide 1: COMP791A: Statistical Language Processing
Information Retrieval [M&S] [J&M] 17.3
Slide 2: The problem
The standard information retrieval (IR) scenario:
- the user has an information need
- the user types a query that describes the information need
- the IR system retrieves a set of documents from a document collection that it believes to be relevant
- the documents are ranked according to their likelihood of being relevant
Input: a (large) set/collection of documents and a user query
Output: a (ranked) list of relevant documents
Slide 3: Example of IR
Slide 5: IR within NLP
IR needs to process large volumes of online text, and (traditionally) NLP methods were not robust enough to work on thousands of real-world texts. So IR:
- is not based on NLP tools (e.g. syntactic/semantic analysis)
- uses (mostly) simple, shallow techniques, based mostly on word frequencies
In IR, the meaning of a document is the composition of the meanings of its individual words; the ordering and constituency of words are not taken into account. This is the bag-of-words approach:
"I see what I eat." and "I eat what I see." have the same meaning under this model, as the sketch below shows.
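To make the bag-of-words point concrete, here is a minimal sketch (not from the original slides) showing that the two sentences get identical representations once word order is discarded:

from collections import Counter

def bag_of_words(text):
    # lowercase and split on whitespace; a real system would also strip
    # punctuation, remove stopwords, and possibly stem
    return Counter(text.lower().split())

# the two sentences from the slide collapse to the same bag of words,
# so this model cannot tell them apart
print(bag_of_words("I see what I eat") == bag_of_words("I eat what I see"))  # True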
Slide 6: 2 major topics
- Indexing: representing the document collection using words/terms, for fast access to documents
- Retrieval methods: matching a user query to indexed documents; 3 major models:
  - boolean model
  - vector-space model
  - probabilistic model
Slide 7: Indexing
Most IR systems use an inverted file to represent the texts in the collection.
Inverted file = a table of terms, each with a list of the texts that contain that term:
- assassination → {d1, d4, d95, d5, d90, …}
- murder → {d3, d7, d95, …}
- Kennedy → {d24, d7, d44, …}
- conspiracy → {d3, d55, d90, d98, …}
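A minimal sketch of such an inverted file in Python; the doc ids and texts below are made up for illustration:

def build_inverted_index(docs):
    # docs: {doc_id: text}; returns {term: set of doc_ids containing it}
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {"d1": "the kennedy assassination conspiracy",
        "d3": "a murder conspiracy",
        "d24": "kennedy gave a speech"}
index = build_inverted_index(docs)
print(index["kennedy"])  # {'d1', 'd24'}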
Slide 8: Example of an inverted file
For each term:
- DocCnt: how many documents the term occurs in (used to compute IDF)
- FreqCnt: how many times the term occurs across all documents
For each document:
- Freq: how many times the term occurs in this document
- WordPosition: the offsets where these occurrences are found in the document
Word positions are useful:
- to search for terms within n words of each other, to approximate phrases (e.g. "car insurance"); but this is a primitive notion of phrase, just word/byte positions in the document, so "car insurance" also matches "insurance for car"
- to generate word-in-context snippets
- to highlight terms in the retrieved document
A sketch of such a positional index follows.
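This is a hedged sketch under the slide's assumptions (positions are word offsets), not the deck's actual implementation:

from collections import defaultdict

def build_positional_index(docs):
    # term -> {doc_id: [word offsets]}; Freq is len(offsets), DocCnt is
    # the number of doc_ids, and FreqCnt is the sum of all Freqs
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def within_n_words(index, term1, term2, n, doc_id):
    # primitive phrase matching: True if any occurrences of the two terms
    # are at most n words apart; note "insurance for car" also matches
    positions1 = index.get(term1, {}).get(doc_id, [])
    positions2 = index.get(term2, {}).get(doc_id, [])
    return any(abs(p1 - p2) <= n for p1 in positions1 for p2 in positions2)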
Slide 9: Basic Concept of a Retrieval Model
Documents and queries are represented by vectors of <term, value> pairs:
- term: all possible terms that occur in the query/document
- value: presence or absence of the term in the query/document
The value can be:
- binary (0 if the term is absent; 1 if the term is present)
- some weight (term frequency, tf.idf, or other)
Slide 10: Vector-Space Model
Binary values do not tell whether a term is more important than others, so we should weight the terms by importance. The weight of a term (for both document and query) can be its raw frequency or some other measure.
Slide 11: Term-by-document matrix
The collection of documents is represented by a matrix of weights called a term-by-document matrix:
- 1 column = the representation of one document
- 1 row = the representation of one term across all documents
- cell w_ij = the weight of term i in document j
Note: the matrix is sparse! (most terms do not occur in most documents, so most cells are zero); a sparse-storage sketch follows.
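Because of the sparsity, the matrix is rarely stored densely. One common option, sketched here with the weights from the running example (slide 14), is a dict of nonzero cells per term row:

# one row per term; absent cells are the implicit zeros of the sparse matrix
matrix = {
    "speech":     {"d1": 1, "d2": 6},
    "language":   {"d1": 2, "d3": 5},
    "processing": {"d1": 1, "d2": 1, "d3": 1},
}

def weight(matrix, term, doc_id):
    # w_ij: weight of term i in document j
    return matrix.get(term, {}).get(doc_id, 0)

print(weight(matrix, "language", "d2"))  # 0, stored implicitly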
Slide 12: An example
The collection:
- d1 = {introduction knowledge in speech and language processing ambiguity models and algorithms language thought and understanding the state of the art and the near-term future some brief history summary}
- d2 = {hmms and speech recognition speech recognition architecture overview of the hidden markov models the viterbi algorithm revisited advanced methods in decoding acoustic processing of speech computing acoustic probabilities training a speech recognizer waveform generation for speech synthesis human speech recognition summary}
- d3 = {language and complexity the chomsky hierarchy how to tell if a language isn't regular the pumping lemma are English and other languages regular languages ? is natural language context-free complexity and human processing summary}
The query:
- Q = {speech language processing}
Slide 14: An example (con't), using raw term frequencies
The vectors for the documents and the query can be seen as points in a multi-dimensional space, where each dimension is a term from the query:
- dimensions: Term 1 (speech), Term 2 (language), Term 3 (processing)
- d1 = (1, 2, 1)
- d2 = (6, 0, 1)
- d3 = (0, 5, 1)
- q = (1, 1, 1)
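These counts can be reproduced in a few lines of Python; a sketch, assuming tokens are already normalized (e.g. plural "languages" reduced to "language", which is what makes d3's language count 5):

def tf_vector(text, terms):
    # raw term frequency of each query term in the text
    tokens = text.lower().split()
    return [tokens.count(term) for term in terms]

terms = ["speech", "language", "processing"]
# applied to the d1/d2/d3 texts of slide 12 this yields
# d1 -> [1, 2, 1], d2 -> [6, 0, 1], d3 -> [0, 5, 1], q -> [1, 1, 1]
print(tf_vector("speech language processing", terms))  # [1, 1, 1]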
Slide 15: Document similarity
The longer the document, the more likely it is to be retrieved. This makes some sense, because it may contain many of the query's terms; but then again, it may also contain lots of non-pertinent terms. We want to treat these as equivalent:
- vector (1, 2, 1)
- vector (2, 4, 2) (same distribution of words)
We can normalize raw term frequencies to convert all vectors to a standard length (e.g. 1).
Slide 16: Example
Query = {speech language}; original representation (dimensions: speech, language):
- d1 = (1, 2)
- d2 = (6, 0)
- d3 = (0, 5)
- q = (1, 1)
Normalization: the length of a vector does not matter, the angle does.
Slide 17: The cosine measure
The similarity between two documents (or between a document and a query) is the cosine of the angle (in N dimensions) between the two vectors:
- if 2 document vectors are identical, they will have a cosine of 1
- if 2 document vectors are orthogonal (i.e. share no common term), they will have a cosine of 0
[figure: angles between document (D) and query (Q) vectors for these cases]
Slide 18: The cosine measure (con't)
The cosine of 2 vectors (in N dimensions), also known as the normalized inner product:

    cos(D, Q) = (D · Q) / (|D| · |Q|) = Σ_i (w_di · w_qi) / ( √(Σ_i w_di²) · √(Σ_i w_qi²) )

where D · Q is the inner product and |D|, |Q| are the lengths of the vectors.
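In code the measure is a few lines; a sketch matching the formula above, not the course's reference implementation:

import math

def cosine(d, q):
    # normalized inner product of two term-weight vectors
    inner = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return inner / (norm_d * norm_q)

print(round(cosine([1, 2, 1], [1, 1, 1]), 3))  # 0.943: d1 vs q from the example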
Slide 19: If you want proof… in 2-D space (can be skipped)
To have vectors of length 1 (normalized vectors), divide all components by the length of the vector. In 2-dimensional space, for v = (x, y):

    |v| = √(x² + y²)        v' = (x / |v|, y / |v|)
Slide 20: Normalized vectors (can be skipped)
Query = {speech language}; normalizing the 2-D vectors of slide 16:
- Q = (1, 1) → normalized Q' = (0.71, 0.71)
- d1 = (1, 2) → normalized d1' = (0.45, 0.89)
- d2 = (6, 0) → normalized d2' = (1, 0)
- d3 = (0, 5) → normalized d3' = (0, 1)
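A quick check of these numbers (a sketch, not from the slides):

import math

def normalize(v):
    # divide every component by the vector's Euclidean length
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

print([round(x, 2) for x in normalize([1, 2])])  # [0.45, 0.89] = d1'
print([round(x, 2) for x in normalize([1, 1])])  # [0.71, 0.71] = Q'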
Slide 21: Similarity between 2 vectors (2-D) (can be skipped)
In 2-D (i.e. N = 2; the number of terms is 2), with the original vectors Q = (x_q, y_q) and D = (x_d, y_d):

    cos(D, Q) = (x_d·x_q + y_d·y_q) / ( √(x_d² + y_d²) · √(x_q² + y_q²) )

With the normalized vectors D' = (x_d', y_d') and Q' = (x_q', y_q'):

    cos(D, Q) = x_d'·x_q' + y_d'·y_q'
Slide 22: Similarity in the general case (N-D) (can be skipped)
In the general case of N dimensions (N terms):

    sim(D, Q) = Σ_{i=1..N} (w_di · w_qi) / ( √(Σ_{i=1..N} w_di²) · √(Σ_{i=1..N} w_qi²) )

which is the cosine of the angle between vector D and vector Q in N dimensions. For normalized vectors this reduces to:

    sim(D, Q) = Σ_{i=1..N} w_di · w_qi
Slide 23: The example again
Q = {speech language processing}, query = (1, 1, 1)
- d1 = (1, 2, 1)
- d2 = (6, 0, 1)
- d3 = (0, 5, 1)
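The extracted slide omits the computed similarities; working them out with the cosine formula:

    cos(q, d1) = (1·1 + 1·2 + 1·1) / (√3 · √6)  = 4 / 4.24  ≈ 0.94
    cos(q, d2) = (1·6 + 1·0 + 1·1) / (√3 · √37) = 7 / 10.54 ≈ 0.66
    cos(q, d3) = (1·0 + 1·5 + 1·1) / (√3 · √26) = 6 / 8.83  ≈ 0.68

so d1 is ranked first, then d3, then d2.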
Slide 24: Term weights
So far, we have used raw term frequency as the weights. At the core of most weighting functions are:
- tf_ij, term frequency: the frequency of term i in document j
  - if a term appears often in a document, then it describes the document's contents well
  - an intra-document characterization
- df_i, document frequency: the number of documents in the collection containing term i
  - if a term appears in many documents, then it is not useful for distinguishing a document
  - an inter-document characterization; used to compute idf
Slide 25: tf.idf weighting functions
The most widely used family of weighting functions. Let M = the number of documents in the collection. The Inverse Document Frequency of term i, which measures the weight of term i for the query, is:

    idf_i = log₁₀(M / df_i)

Intuitively, if M = 1000:
- if df_i = 1000 → idf_i = log(1) = 0 → term i is ignored! (it appears in all docs)
- if df_i = 10 → idf_i = log(100) = 2 → term i has a weight of 2 in the query
- if df_i = 1 → idf_i = log(1000) = 3 → term i has a weight of 3 in the query
The weight of term i in document d is then:

    w_id = tf_id × idf_i

This gives a family of tf.idf functions; a common variant normalizes tf_id by the frequency of the most frequent term j in document d:

    tf_id = freq_id / max_j freq_jd
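A sketch combining these pieces, following the slide's definitions (base-10 log, tf normalized by the document's most frequent term); the function and variable names are mine:

import math
from collections import Counter

def tfidf_weights(docs):
    # docs: {doc_id: text}; returns {doc_id: {term: w_id}}
    M = len(docs)
    df = Counter()
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    for tokens in tokenized.values():
        df.update(set(tokens))              # df_i: number of docs containing term i
    weights = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        max_freq = max(counts.values())     # frequency of the most frequent term
        weights[doc_id] = {
            term: (c / max_freq) * math.log10(M / df[term])  # w_id = tf_id * idf_i
            for term, c in counts.items()
        }
    return weights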
Slide 26: Evaluation: Precision & Recall
Recall and precision measure how good a set of retrieved documents is compared with an ideal set of relevant documents.
- Recall: what proportion of the relevant documents are actually retrieved?
- Precision: what proportion of the retrieved documents are really relevant?
Let A = the pertinent docs that were retrieved, A + B = all pertinent docs (those that should have been retrieved), and A + C = all docs that were retrieved. Then:

    Recall = A / (A + B)        Precision = A / (A + C)
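As a sketch, with retrieved and relevant given as collections of doc ids (the example data is from the next slide):

def precision_recall(retrieved, relevant):
    # A = pertinent docs that were retrieved
    a = len(set(retrieved) & set(relevant))
    precision = a / len(retrieved) if retrieved else 0.0
    recall = a / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d123", "d389"}
print(precision_recall(["d123", "d84", "d56"], relevant))  # (0.666..., 0.2)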
Slide 27: Evaluation: Example of P&R
Relevant: d3 d5 d9 d25 d39 d44 d56 d71 d123 d389
- system1 retrieves: d123 d84 d56 → Precision: ?? Recall: ??
- system2 retrieves: d123 d84 d56 d6 d8 d9 → Precision: ?? Recall: ??
Slide 28: Evaluation: Example of P&R (answers)
Relevant: d3 d5 d9 d25 d39 d44 d56 d71 d123 d389
- system1: d123 d84 d56 → Precision: 66% (2/3), Recall: 20% (2/10)
- system2: d123 d84 d56 d6 d8 d9 → Precision: 50% (3/6), Recall: 30% (3/10)
Slide 29: Evaluation: Problems with P&R
P&R do not evaluate the ranking: the ranked lists (d123, d84) and (d84, d123) receive the same scores. So other measures are often used:
- document cutoff levels
- P&R curves
- ...
Slide 30: Evaluation: Document cutoff levels
Fix the number of documents retrieved at several levels (e.g. top 5, top 10, top 20, top 100, top 500, …) and measure precision at each of these levels.
Slide 31: Evaluation: P&R curve
Measure precision at different levels of recall; usually, precision at 11 recall levels (0%, 10%, 20%, …, 100%).
[figure: P&R curve, precision (y-axis, 0-100%) plotted against recall (x-axis, 0-100%)]
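The slides do not spell out how precision is read off at each level; the usual choice is interpolated precision (the best precision at any recall at or above the level), sketched here:

def eleven_point_curve(ranked, relevant):
    # precision/recall after each position in the ranked list
    hits, points = 0, []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))  # (recall, precision)
    # interpolated precision at recall levels 0%, 10%, ..., 100%
    return [max((p for r, p in points if r >= level / 10), default=0.0)
            for level in range(11)]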
Slide 32: Which system performs better?
[figure: two P&R curves compared on the same precision (y-axis) vs recall (x-axis) plot]
Slide 33: Evaluation: A Single Value Measure
We cannot take the arithmetic mean of P&R:
- if R = 50% and P = 50%, the mean M = 50%
- if R = 100% and P = 10%, the mean M = 55% (not fair)
Instead, take the harmonic mean, which is high only when both P and R are high:

    HM = 2PR / (P + R)

- if R = 50% and P = 50%, HM = 50%
- if R = 100% and P = 10%, HM = 18.2%
More generally, take a weighted harmonic mean, with w_r the weight of R and w_p the weight of P. Letting a = 1/w_r, b = 1/w_p, and β² = a/b, this works out to:

    F = (β² + 1) P R / (β² R + P)

… which is called the F-measure.
Slide 34: Evaluation: the F-measure
A weighted combination of precision and recall, where β represents the relative importance of precision and recall:
- when β = 1, precision and recall have the same importance
- when β > 1, precision is favored
- when β < 1, recall is favored
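A sketch using the slide's convention (β² = w_p / w_r, so larger β favors precision; note that some textbooks place β² on the other term so that larger β favors recall instead):

def f_measure(precision, recall, beta=1.0):
    # weighted harmonic mean of P and R; beta follows the slide's convention
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * recall + precision)

print(round(f_measure(0.10, 1.00), 3))  # 0.182, the 18.2% from slide 33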
Slide 35: Evaluation: How to evaluate
We need a test collection:
- a document collection (a few thousand to a few million documents)
- a set of queries
- a set of relevance judgements
But must humans check all documents? That is infeasible, so pooling is used (as at TREC):
- take the top 100 results from every submission/system
- remove duplicates
- manually assess only these
Slide 36: Evaluation: TREC (Text Retrieval Conference/Competition)
- run by NIST (National Institute of Standards and Technology); 13th edition in 2004
- collection: about 3 gigabytes, > 1 million documents; newswire & news text (AP, WSJ, …)
- queries + relevance judgments: queries devised and judged by annotators
- participants: various research and commercial groups compete
- tracks: cross-lingual, filtering, genomics, video, web, question answering (QA), …