1 Information Retrieval Model
Aj. Khuanlux Mitsophonsiri
CS.426 Information Retrieval
2 Components of a Retrieval Model
User
- Search expert (e.g., librarian) vs. non-expert
- Background of the user (knowledge of the topic)
Documents
- Different languages
- Semi-structured (e.g., HTML or XML) vs. plain text
3 Retrieval Models
- A retrieval model is an idealization or abstraction of an actual retrieval process.
- Conclusions derived from a model depend on whether the model is a good approximation of the retrieval situation.
- A retrieval model is not the same as a retrieval implementation.
4 IR Models
A retrieval model specifies the details of:
- Document representation
- Query representation
- Retrieval function
It also determines a notion of relevance, which can be binary (0 or 1) or continuous (i.e., ranked retrieval). This three-part contract is sketched below.
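A minimal sketch of that contract in Python; the class and method names here are illustrative, not from the lecture:

    # Sketch: the three things a retrieval model must specify.
    from abc import ABC, abstractmethod

    class RetrievalModel(ABC):
        @abstractmethod
        def represent_document(self, text: str):
            """Document representation: map raw text to the model's internal form."""

        @abstractmethod
        def represent_query(self, text: str):
            """Query representation: map the user's query to the model's internal form."""

        @abstractmethod
        def score(self, query_rep, doc_rep) -> float:
            """Retrieval function: a relevance score, binary (0/1) or continuous."""

A Boolean model would return 0 or 1 from score; a ranked-retrieval model returns a real-valued similarity.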
5 Ranking
- A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query.
- A ranking is based on fundamental premises regarding the notion of relevance, such as:
  - common sets of index terms
  - sharing of weighted terms
  - likelihood of relevance
- Each set of premises leads to a distinct IR model.
6 IR Models
User task:
- Retrieval: ad hoc, filtering
- Browsing
Classic models: Boolean, vector, probabilistic
Set-theoretic: fuzzy, extended Boolean
Algebraic: generalized vector, latent semantic indexing, neural networks
Probabilistic: inference network, belief network
Structured models: non-overlapping lists, proximal nodes
Browsing: flat, structure guided, hypertext
7 Document Representation
Meta-descriptions:
- Field information (author, title, date)
- Keywords: predefined, or manually extracted by the author/editor
Content: automatically identifying what the document is about
11 Manual vs. Automatic Indexing
Pros of manual indexing:
- Human judgments are most reliable
- Searching controlled vocabularies is more efficient
Cons of manual indexing:
- Time consuming
- The person using the retrieval system has to be familiar with the classification system
- Classification systems are sometimes incoherent
12 Automatic Content Representation
Using natural language understanding?
- Computationally too expensive in real-world settings
- Language dependent
- The resulting representations may be too explicit to deal with the vagueness of a user's information need
Alternative: treat a document simply as the unordered set of words appearing in it: the bag of words.
13 Basic Approach to IR
Most successful approaches are statistical: directly, or in an effort to capture and use probabilities.
Why not natural language understanding (the computer understands documents and the query and matches them)?
- It can be highly successful in predictable settings, e.g., medical or legal settings with a restricted vocabulary.
Why not manually assigned headings, e.g., Library of Congress subject headings?
- Human agreement is not good
- Hard to predict which headings are interesting
- Expensive
14 Relevance
Much of IR depends on the idea that similar vocabulary -> relevant to the same queries; we usually look for documents matching query words. "Similar" can be measured in many ways (the first is sketched below):
- String matching/comparison
- Same vocabulary used
- Probability that documents arise from the same model
- Same meaning of text
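As a sketch of the simplest of these measures, vocabulary overlap between two texts can be scored with the Jaccard coefficient; the function and example strings below are illustrative:

    def jaccard(a: str, b: str) -> float:
        # Vocabulary overlap: |A ∩ B| / |A ∪ B| over the two word sets.
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    print(jaccard("red fish blue fish", "blue fish"))  # 2 shared / 3 total ≈ 0.67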
15 Bag of Words
An effective and popular approach: compare words without regard to order. Consider reordering the words in a headline; as the sketch after this list shows, every ordering yields the same representation:
- Random: easy to this study course IR is
- Alphabetical: course easy IR is study this to
- Interesting: IR course study easy is this to
- Actual: this IR course is easy to study
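A minimal demonstration that the bag-of-words representation ignores order, using Python's collections.Counter on the headline above:

    from collections import Counter

    actual = "this IR course is easy to study"
    shuffled = "easy to this study course IR is"

    # A bag of words is a multiset of tokens, so any reordering of the
    # same words produces an identical representation.
    print(Counter(actual.split()) == Counter(shuffled.split()))  # True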
16 Bag of Words Approach
- A document is an unordered list of words (grammatical information is lost).
- Tokenization: what is a word? (Is "White House" one word or two?)
- Stemming or lemmatization: morphological information is thrown away; "agreements" becomes "agreement" (lemmatization) or even "agree" (stemming).
Both steps are sketched below.
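A rough sketch of these two steps; the whitespace tokenizer and the suffix-stripping rules below are deliberately naive stand-ins for real stemmers such as Porter's:

    def tokenize(text: str) -> list[str]:
        # Naive tokenizer: lowercase and split on whitespace.
        # A real tokenizer must also decide cases like "White House".
        return text.lower().split()

    def crude_stem(word: str) -> str:
        # Toy suffix stripping: maps "agreements" straight to "agree".
        # Real stemmers apply many ordered rules with conditions.
        for suffix in ("ments", "ment", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    print([crude_stem(w) for w in tokenize("Agreements were signed")])
    # ['agree', 'were', 'signed']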
17 Simple Model of IR
[Figure: simple flow of the retrieval process]
18 Common Preprocessing Steps
- Strip unwanted characters/markup (e.g., HTML tags, punctuation, numbers)
- Break into tokens (keywords) on whitespace
- Stem tokens to "root" words: computational -> compute
- Remove common stopwords (e.g., a, the, it)
- Detect common phrases (possibly using a domain-specific dictionary)
- Build an inverted index (keyword -> list of docs containing it)
This pipeline is sketched below.
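A minimal sketch of such a pipeline, assuming whitespace tokenization and a tiny stopword list; the names and sample documents are illustrative, and stemming and phrase detection are omitted for brevity:

    import re
    from collections import defaultdict

    STOPWORDS = {"a", "the", "it", "of", "to"}  # tiny illustrative list

    def preprocess(text: str) -> list[str]:
        text = re.sub(r"<[^>]+>", " ", text)           # strip HTML tags
        text = re.sub(r"[^a-z\s]", " ", text.lower())  # strip punctuation/numbers
        return [t for t in text.split() if t not in STOPWORDS]

    def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
        # Inverted index: keyword -> set of ids of documents containing it.
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for token in preprocess(text):
                index[token].add(doc_id)
        return dict(index)

    docs = {"d1": "<p>The red fish</p>", "d2": "A blue fish"}
    print(build_inverted_index(docs))
    # {'red': {'d1'}, 'fish': {'d1', 'd2'}, 'blue': {'d2'}}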
19 Statistical Language Model
- A document comes from a topic
- The topic describes how words appear in documents on that topic
- Use the document to guess what the topic looks like:
  - words common in the document are common in the topic
  - words not in the document are much less likely
- Index the estimated topics
20 Statistical Retrieval
- Retrieval is based on similarity between query and documents
- Output documents are ranked according to similarity to the query
- Similarity is based on occurrence frequencies of keywords in query and document
- Automatic relevance feedback can be supported:
  - relevant documents are "added" to the query
  - irrelevant documents are "subtracted" from the query
The feedback step is sketched below.
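One standard way to realize this add/subtract idea is Rocchio-style feedback over term-weight vectors. A hedged sketch, with vectors as plain dicts and the weights alpha, beta, gamma chosen arbitrarily here:

    from collections import defaultdict

    def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.25):
        # Move the query toward relevant documents and away from
        # irrelevant ones; all vectors map term -> weight.
        new_q = defaultdict(float)
        for term, w in query.items():
            new_q[term] += alpha * w
        for doc in relevant:
            for term, w in doc.items():
                new_q[term] += beta * w / len(relevant)
        for doc in irrelevant:
            for term, w in doc.items():
                new_q[term] -= gamma * w / len(irrelevant)
        # Negative weights are usually clipped to zero.
        return {t: w for t, w in new_q.items() if w > 0}

    q = {"fish": 1.0}
    print(rocchio(q, [{"fish": 1.0, "blue": 1.0}], [{"fish": 1.0, "fried": 1.0}]))
    # {'fish': 1.5, 'blue': 0.75} -- 'fried' was subtracted away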
21 Example: Small Document
D = one fish, two fish, red fish, blue fish, black fish, blue fish, old fish, new fish
Len(D) = 16
P(fish|D) = 8/16 = 0.5
P(blue|D) = 2/16 = 0.125
P(one|D) = 1/16 = 0.0625
P(eggs|D) = 0/16 = 0
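These are maximum-likelihood estimates, P(w|D) = count(w, D) / |D|; a quick sketch that reproduces the numbers above:

    from collections import Counter

    doc = ("one fish two fish red fish blue fish "
           "black fish blue fish old fish new fish").split()
    counts = Counter(doc)

    def p(word: str) -> float:
        # Maximum-likelihood estimate: count(word, D) / |D|.
        return counts[word] / len(doc)

    print(len(doc))                                   # 16
    print(p("fish"), p("blue"), p("one"), p("eggs"))  # 0.5 0.125 0.0625 0.0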
22 Classes of Retrieval Models
- Boolean models (set-theoretic)
  - Extended Boolean
- Vector space models (statistical/algebraic)
  - Generalized VS
  - Latent Semantic Indexing
- Probabilistic models
23 Boolean Model
- A document is represented as a set of keywords
- Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope:
  ((Rio AND Brazil) OR (Hilo AND Hawaii)) AND hotel AND NOT Hilton
- Output: a document is relevant or not; no partial matches or ranking
Evaluation of the example query is sketched below.
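A minimal sketch of evaluating the example query, with each document represented as a set of keywords and the Boolean connectives mapped directly onto Python's and/or/not:

    def matches(doc: set[str]) -> bool:
        # ((Rio AND Brazil) OR (Hilo AND Hawaii)) AND hotel AND NOT Hilton
        return ((("rio" in doc and "brazil" in doc)
                 or ("hilo" in doc and "hawaii" in doc))
                and "hotel" in doc and "hilton" not in doc)

    print(matches({"rio", "brazil", "hotel"}))             # True
    print(matches({"hilo", "hawaii", "hotel", "hilton"}))  # False: NOT Hilton fails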
24 Boolean Retrieval Model
Popular retrieval model because:
- Easy to understand for simple queries
- Clean formalism
- Boolean models can be extended to include ranking
- Reasonably efficient implementations are possible for normal queries
25 Boolean Model Problems
- Very rigid: AND means all; OR means any
- Difficult to express complex user requests
- Difficult to control the number of documents retrieved: all matched documents will be returned
- Difficult to rank output: all matched documents logically satisfy the query
- Difficult to perform relevance feedback: if a document is identified by the user as relevant or irrelevant, how should the query be modified?