Presentation transcript: "Information Retrieval"

1 Information Retrieval

2 Information Retrieval Process
(Pipeline diagram.) An information need is formulated as a query and parsed; the text of the document collection is pre-processed and indexed; the query is then matched against the index and the results are ranked. Key questions: How is the query constructed? How is the text processed?

3 Example: Information Needs
Sometimes very specific:
<title> Falkland petroleum exploration
<desc> Description: What information is available on petroleum exploration in the South Atlantic near the Falkland Islands?
<narr> Narrative: Any document discussing petroleum exploration in the South Atlantic near the Falkland Islands is considered relevant. Documents discussing petroleum exploration in continental South America are not relevant.
Sometimes very vague:
I am going to Kyoto, Japan for a conference in two months. What should I know?

4 Relevance In what ways can a document be relevant to a query?
Answer a precise question precisely. Partially answer the question. Suggest a source for more information. Give background information. Remind the user of other knowledge. Others ...

5 Relevance
How relevant is the document for this user, for this information need?
Subjective, but measurable to some extent:
How often do people agree that a document is relevant to a query?
How well does it answer the question? Complete answer? Partial? Background information? Hints for further exploration?

6 Document Representation
Information needs and documents are usually represented as sets/bags of terms.
Bag: allows multiple instances of the same element.
Terms: words or phrases. To stem or not to stem?
Annotation with location information: title, heading.

7 Bag of Words Example (courtesy of Phillip Resnik)
Document 1: The quick brown fox jumped over the lazy dog's back.
Document 2: Now is the time for all good men to come to the aid of their party.
Stop word list: for, is, of, 's, the, to

Indexed term   Doc 1   Doc 2
aid                    1
all                    1
back            1
brown           1
come                   1
dog             1
fox             1
good                   1
jump            1
lazy            1
men                    1
now                    1
over            1
party                  1
quick           1
their                  1
time                   1

Notes: Why is a consistent term order needed? How are the terms chosen? Stopword lists eliminate terms that only convey meaning through word order. No need to keep a position for terms that never occur.
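A minimal sketch of building these bags of words in Python (illustrative only: the stopword list is the one shown above, no stemming is applied, so "jumped" stays unstemmed rather than becoming the "jump" entry in the table):

```python
# Build bag-of-words representations for the two example documents above.
import re
from collections import Counter

STOP_WORDS = {"for", "is", "of", "the", "to"}

def bag_of_words(text):
    # Lowercase, tokenize, strip the possessive "'s", drop stop words.
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [t[:-2] if t.endswith("'s") else t for t in tokens]
    return Counter(t for t in tokens if t and t not in STOP_WORDS)

doc1 = "The quick brown fox jumped over the lazy dog's back."
doc2 = "Now is the time for all good men to come to the aid of their party."

bags = [bag_of_words(doc1), bag_of_words(doc2)]
vocabulary = sorted(set().union(*bags))   # a consistent term order across documents

# Term-by-document count table (0 = term absent).
for term in vocabulary:
    print(f"{term:8s} {bags[0][term]:3d} {bags[1][term]:3d}")
```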

8 Types of Queries Boolean Query Vector Query Probabilistic Query
Boolean Query: Does the document satisfy the Boolean expression? e.g., “java” AND “compilers” AND (“unix” OR “linux”)
Vector Query: How similar is the document to the query? e.g., [(java 3) (compiler 2) (unix 1) (linux 1)]
Probabilistic Query: What is the probability that the document is generated by the query?

9 Boolean Model of Retrieval
Pros: Easy to understand, clear semantics (AND means ‘all’, OR means ‘any’); usually computationally efficient.
Cons: Difficult to rank results; rigid: you either get too much or too little; when the information need is complex, it is hard to formulate it as a Boolean query.

10 Vector Space Model
A collection of n documents with t distinct terms can be represented by a (sparse) term-by-document matrix, where wij is the weight of term Ti in document Dj:

        T1   T2   ...  Tt
D1      w11  w21  ...  wt1
D2      w12  w22  ...  wt2
...
Dn      w1n  w2n  ...  wtn

A query can also be represented as a vector, just like a document.

11 Docs as Vectors
(Scatter plot of three documents in a two-dimensional term space with axes “Star” and “Diet”: a document about movie stars, a document about astronomy, and a document about mammal behavior.)

12 Geometric Interpretation
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
Is D1 or D2 more similar to Q? How do we measure the degree of similarity: distance, angle, or projection?
Assumption: documents that are “close together” in space are similar in meaning.
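A small worked version of this example (a sketch, not from the original slides), comparing the cosine (angle) measure with Euclidean distance:

```python
import math

D1 = [2, 3, 5]   # 2*T1 + 3*T2 + 5*T3
D2 = [3, 7, 1]   # 3*T1 + 7*T2 + 1*T3
Q  = [0, 0, 2]   # 0*T1 + 0*T2 + 2*T3

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(cosine(D1, Q), cosine(D2, Q))      # ~0.811 vs ~0.130 -> D1 is more similar to Q
print(distance(D1, Q), distance(D2, Q))  # ~4.69  vs ~7.68  -> D1 is also closer to Q
```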

13 Term Weights
The weight wij reflects the importance of the term Ti in document Dj.
Intuitions: A term that appears in many documents is not important (e.g., the, going, come, …). If a term is frequent in a document, it is probably important in that document.

14 Assigning Weights to Terms
Binary weights
Raw term frequency
tf x idf: recall the Zipf distribution; we want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole
Pointwise Mutual Information

15 Binary Weights Only the presence (1) or absence (0) of a term is included in the vector

16 Raw Term Weights The frequency of occurrence for the term in each document is included in the vector

17 Inverse Document Frequency
IDF provides high values for rare words and low values for common words in a given collection of documents.
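The slide's own formula is not captured in the transcript; the standard formulation (an assumption, with the log base varying between textbooks) is:

```latex
% Standard inverse document frequency (assumed formulation).
\[
  \mathrm{idf}_i \;=\; \log\!\left(\frac{N}{n_i}\right),
  \qquad N = \text{number of documents in the collection}, \quad
  n_i = \text{number of documents containing term } T_i
\]
```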

18 Term Weights: tf x idf
Term frequency (tf): the frequency count of a term in a document.
Inverse document frequency (idf): the amount of information contained in the statement “Document X contains the term Ti”.
Assign a tf * idf weight to each term in each document.
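A minimal tf*idf sketch in Python (illustrative only; the toy documents, the log base, and the lack of normalisation are assumptions, not the exact scheme used in the slides):

```python
import math
from collections import Counter

docs = [
    "the quick brown fox jumped over the lazy dog".split(),
    "now is the time for all good men to come to the aid of their party".split(),
    "the fox and the dog are friends".split(),
]

N = len(docs)
tfs = [Counter(d) for d in docs]             # raw term frequency per document
df = Counter(t for tf in tfs for t in tf)    # document frequency of each term

def tfidf(term, doc_index):
    # weight = tf * log(N / df); 0 if the term does not occur in the document
    tf = tfs[doc_index][term]
    return tf * math.log(N / df[term]) if tf else 0.0

print(tfidf("fox", 0))   # appears in 2 of 3 documents -> modest positive weight
print(tfidf("the", 0))   # appears in every document   -> idf = log(1) = 0
```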

19 tf x idf

20 Term Weights: Pointwise Mutual Information
Pointwise Mutual Information measures the strength of association between two elements (a document and a term): observed frequency vs. expected frequency.
The MI weight is insensitive to stemming and to the use of a stop word list [Pantel and Lin 02].
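The formula itself is not captured in the transcript; the standard pointwise mutual information between a document d and a term t (an assumption), comparing the observed joint frequency with the frequency expected under independence, is:

```latex
% Standard PMI (assumed formulation).
\[
  \mathrm{pmi}(d, t) \;=\; \log\frac{P(d, t)}{P(d)\,P(t)}
  \;=\; \log\frac{\text{observed frequency}}{\text{expected frequency}}
\]
```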


22 What Else can be Terms? Letter n-grams Phrases Relations
Semantic categories

23 Similarity Measure Define a similarity measure between a query and a document Cosine Dice Return the documents that are the most similar to the query 2018/11/14

24 Similarity Measures Simple matching (coordination level match)
Dice’s Coefficient, Jaccard’s Coefficient, Cosine Coefficient, Overlap Coefficient
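The slide's formulas are not in the transcript; the standard set-based forms of these coefficients (an assumption: the slides may use weighted/vector variants), for query term set Q and document term set D, are:

```latex
% Standard set-based similarity coefficients (assumed forms).
\[
  \text{simple matching: } |Q \cap D| \qquad
  \text{Dice: } \frac{2\,|Q \cap D|}{|Q| + |D|} \qquad
  \text{Jaccard: } \frac{|Q \cap D|}{|Q \cup D|}
\]
\[
  \text{cosine: } \frac{|Q \cap D|}{\sqrt{|Q|}\,\sqrt{|D|}} \qquad
  \text{overlap: } \frac{|Q \cap D|}{\min(|Q|,\, |D|)}
\]
```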

25 Implementation of VS Model
Does the query vector need to be compared with EVERY document vector? Or only with the vectors of documents that contain at least one of the query terms?
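A minimal sketch of the usual answer (not from the slides; the index layout and toy documents are assumptions): an inverted index maps each term to the documents containing it, so only documents sharing at least one term with the query need to be scored.

```python
from collections import defaultdict

def build_index(docs):
    # Inverted index: term -> {doc_id: term frequency}
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for t in tokens:
            index[t][doc_id] = index[t].get(doc_id, 0) + 1
    return index

def candidate_docs(index, query_terms):
    # Only documents containing at least one query term are scored.
    docs = set()
    for t in query_terms:
        docs.update(index.get(t, {}))
    return docs

docs = {
    1: "java compilers unix".split(),
    2: "cooking with java coffee".split(),
    3: "linux kernel internals".split(),
    4: "gardening tips".split(),
}
index = build_index(docs)
print(candidate_docs(index, ["java", "linux"]))   # {1, 2, 3}; document 4 shares no term and is skipped
```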

26 What to Evaluate? What can be measured that reflects users’ ability to use system? (Cleverdon 66) Coverage of Information Form of Presentation Effort required/Ease of Use Time and Space Efficiency Recall proportion of relevant material actually retrieved Precision proportion of retrieved material actually relevant effectiveness 2018/11/14

27 Relevant vs. Retrieved
(Venn diagram: within the set of all documents, the retrieved set and the relevant set overlap only partially.)

28 Precision vs. Recall
(Same Venn diagram, annotated with the definitions: precision = |relevant ∩ retrieved| / |retrieved|, recall = |relevant ∩ retrieved| / |relevant|.)

29 Why Precision and Recall?
Get as much of the good stuff as possible while at the same time getting as little junk as possible.

30 Retrieved vs. Relevant Documents
Very high precision, very low recall.

31 Retrieved vs. Relevant Documents
High recall, but low precision.

32 Retrieved vs. Relevant Documents
High precision, high recall (at last!).

33 Precision/Recall Curves
There is a tradeoff between precision and recall, so measure precision at different levels of recall. Note: this is an AVERAGE over MANY queries.
(Plot: precision on the y-axis against recall on the x-axis, with one measured point per recall level.)

34 Precision/Recall Curves
It is difficult to determine which of these two hypothetical results is better.
(Plot: two hypothetical precision/recall curves on the same axes.)

35 Average Precision
IR systems typically output a ranked list of documents. For each relevant document, compute the precision of the list up to that point. Average over all precision values computed this way.
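A minimal sketch of this computation (illustrative; it assumes binary relevance judgments and divides by the total number of relevant documents, the common convention):

```python
def average_precision(ranked, relevant):
    # ranked: documents in the order the system returned them
    # relevant: the set of documents judged relevant for the query
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at the rank of each relevant document
    return sum(precisions) / len(relevant) if relevant else 0.0

# Relevant documents found at ranks 1, 3 and 5 of a 5-document ranking:
print(average_precision(["d1", "d9", "d2", "d8", "d3"], {"d1", "d2", "d3"}))
# (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
```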

36 Interpolated Average Precision
Precision may go up as we go down the ranked list; intuitively, it should only go down.
Interpolated Average Precision: for each recall level in 0%, 10%, 20%, …, compute the highest precision achieved at or after the point where recall reaches that level, then take the average of these maximum precision scores.
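A minimal sketch of the 11-point interpolation described above (illustrative; the input is the (recall, precision) pair observed at each rank, here taken from the ranking used in the previous sketch):

```python
def interpolated_average_precision(recall_precision_pairs):
    # For each recall level r in 0.0, 0.1, ..., 1.0, take the maximum precision
    # observed at any point whose recall is >= r, then average the 11 values.
    levels = [i / 10 for i in range(11)]
    interpolated = []
    for r in levels:
        candidates = [p for (rec, p) in recall_precision_pairs if rec >= r]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / len(levels)

# (recall, precision) after each retrieved document:
pairs = [(1/3, 1.0), (1/3, 0.5), (2/3, 2/3), (2/3, 0.5), (1.0, 0.6)]
print(interpolated_average_precision(pairs))   # ≈ 0.76
```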

37 F-Measure
Sometimes only one precision/recall pair is available, e.g., in a filtering task.
The F-Measure combines precision and recall into a single score, with a weighting parameter controlling their relative importance: above 1, precision is more important; below 1, recall is more important; usually it is set to 1.
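In the usual balanced case (the weighting parameter set to 1), the measure reduces to the harmonic mean of precision and recall:

```latex
% Balanced F-measure (F1), the "usually = 1" case referred to above.
\[
  F_1 \;=\; \frac{2 \, P \, R}{P + R}
\]
```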

38 Text Categorization
Goal: classify documents into predefined categories. Approaches: Naïve Bayes, Nearest Neighbor, SVM.

39 Naïve Bayes Method
Knowledge base contains: a set of hypotheses, a set of evidences, and the probability of each evidence given a hypothesis.
Given: a subset of the evidences known to be present in a situation.
Find: the hypothesis with the highest posterior probability P(H|E1, E2, …, Ek). The probability value itself does not matter so much.

40 Naïve Bayes Method Assumptions
Hypotheses are exhaustive and mutually exclusive: H1 ∨ H2 ∨ … ∨ Hk; ¬(Hi ∧ Hj) for any i ≠ j.
Evidences are conditionally independent given a hypothesis: P(E1, E2, …, Ek|H) = P(E1|H)…P(Ek|H).
By Bayes' rule: P(H|E1, E2, …, Ek) = P(E1, E2, …, Ek, H)/P(E1, E2, …, Ek) = P(E1, E2, …, Ek|H)P(H)/P(E1, E2, …, Ek).

41 Naïve Bayes Method
The goal is to find the H that maximizes P(H|E1, E2, …, Ek).
Since P(H|E1, E2, …, Ek) = P(E1, E2, …, Ek|H)P(H)/P(E1, E2, …, Ek), and P(E1, E2, …, Ek) is the same for all hypotheses, maximizing P(H|E1, E2, …, Ek) is equivalent to maximizing P(E1, E2, …, Ek|H)P(H) = P(E1|H)…P(Ek|H)P(H).
Naïve Bayes method: find the hypothesis that maximizes P(E1|H)…P(Ek|H)P(H).
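A minimal sketch of this decision rule (illustrative; the priors and likelihoods are assumed to be given, and log-probabilities are used to avoid numerical underflow):

```python
import math

def naive_bayes(priors, likelihoods, evidences):
    # priors: {hypothesis: P(H)}
    # likelihoods: {hypothesis: {evidence: P(E|H)}}
    # evidences: the evidences observed in the current situation
    def score(h):
        return math.log(priors[h]) + sum(math.log(likelihoods[h][e]) for e in evidences)
    return max(priors, key=score)   # hypothesis maximizing P(E1|H)...P(Ek|H)P(H)
```

The play-tennis and spam examples on the following slides both follow this pattern.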

42 Example: Play Tennis?
Predict whether tennis will be played when the conditions are <sunny, cool, high, strong>.
What probability should be used to make the prediction? How do we compute it?

43 Probabilities of Individual Attributes
Given the training set, we can compute the prior probabilities P(+) = 9/14 and P(−) = 5/14, as well as the conditional probability of each attribute value given each class.

44 Example: Play Tennis
Compare P(+|sunny, cool, high, strong) vs. P(−|sunny, cool, high, strong), which by the Naïve Bayes method amounts to comparing
P(sunny|+)P(cool|+)P(high|+)P(strong|+)P(+) vs. P(sunny|−)P(cool|−)P(high|−)P(strong|−)P(−).
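A sketch of this comparison. The probability table from the previous slide is not in the transcript, so the conditional probabilities below are an assumption, taken from the classic 14-example play-tennis training set that matches P(+) = 9/14 and P(−) = 5/14:

```python
# Assumed conditional probabilities from the standard 14-example play-tennis data.
p_plus, p_minus = 9/14, 5/14
cond_plus  = {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9}
cond_minus = {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5}

evidence = ["sunny", "cool", "high", "strong"]

score_plus, score_minus = p_plus, p_minus
for e in evidence:
    score_plus *= cond_plus[e]
    score_minus *= cond_minus[e]

print(score_plus, score_minus)   # ~0.0053 vs ~0.0206 -> predict "do not play" under these assumed counts
```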

45 Application: Spam Detection
Dear sir, We want to transfer to overseas ($ 126, USD) One hundred and Twenty six million United States Dollars) from a Bank in Africa, I want to ask you to quietly look for a reliable and honest person who will be capable and fit to provide either an existing ……
Legitimate email is called “ham”, for lack of a better name.

46 Hypotheses: {Spam, Ham}
Evidence: a document, treated as a set (or bag) of words.
Knowledge:
P(Spam): the prior probability of a message being spam. How do we estimate this probability?
P(w|Spam): the probability that a word is w, given that it is drawn from a spam message.
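A minimal sketch of one common way to estimate these quantities (an assumption, not the method prescribed by the slides): relative frequencies for the prior, and add-one (Laplace) smoothing for the word probabilities so that unseen words get a small non-zero probability.

```python
from collections import Counter

def estimate(spam_docs, ham_docs):
    # spam_docs, ham_docs: lists of training documents, each a list of words
    n_spam, n_ham = len(spam_docs), len(ham_docs)
    p_spam = n_spam / (n_spam + n_ham)     # prior: fraction of training messages that are spam

    spam_words = Counter(w for d in spam_docs for w in d)
    vocab = set(spam_words) | {w for d in ham_docs for w in d}
    total = sum(spam_words.values())

    # P(w|Spam) with add-one smoothing over the whole vocabulary
    p_word_given_spam = {w: (spam_words[w] + 1) / (total + len(vocab)) for w in vocab}
    return p_spam, p_word_given_spam

spam = [["transfer", "million", "dollars", "bank"], ["urgent", "transfer", "funds"]]
ham  = [["meeting", "agenda", "attached"], ["conference", "in", "kyoto"]]
p_spam, p_w_spam = estimate(spam, ham)
print(p_spam, p_w_spam["transfer"])   # 0.5 and 3/19 on this toy training data
```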

47 Other Text Categorization Algorithms
Support Vector Machine often has the best performance. K-Nearest Neighbor

