Information Retrieval Model. Aj. Khuanlux Mitsophonsiri, CS.426 INFORMATION RETRIEVAL.


2 Components of a Retrieval Model
- User
  - Search expert (e.g., librarian) vs. non-expert
  - Background of the user (knowledge of the topic)
- Documents
  - Different languages
  - Semi-structured (e.g., HTML or XML) vs. plain text

3 Retrieval Models
- A retrieval model is an idealization or abstraction of an actual retrieval process.
- Conclusions derived from a model depend on whether the model is a good approximation of the retrieval situation.
- A retrieval model is not the same as a retrieval implementation.

4 IR Models
- A retrieval model specifies the details of:
  - Document representation
  - Query representation
  - Retrieval function
- It determines a notion of relevance.
- The notion of relevance can be binary (0 or 1) or continuous (i.e., ranked retrieval).

5 Ranking
- A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query.
- A ranking is based on fundamental premises regarding the notion of relevance, such as:
  - common sets of index terms
  - sharing of weighted terms
  - likelihood of relevance
- Each set of premises leads to a distinct IR model.

6 IR Models
- User task
  - Retrieval: ad hoc, filtering
  - Browsing
- Classic models: Boolean, vector, probabilistic
- Set theoretic: fuzzy, extended Boolean
- Algebraic: generalized vector, latent semantic indexing, neural networks
- Probabilistic: inference network, belief network
- Structured models: non-overlapping lists, proximal nodes
- Browsing: flat, structure guided, hypertext

7 Document Representation
- Meta-descriptions
  - Field information (author, title, date)
  - Keywords
    - Predefined
    - Manually extracted by author/editor
- Content: automatically identifying what the document is about

11 Manual vs. Automatic Indexing
- Pros of manual indexing
  - Human judgments are most reliable
  - Searching controlled vocabularies is more efficient
- Cons of manual indexing
  - Time consuming
  - The person using the retrieval system has to be familiar with the classification system
  - Classification systems are sometimes incoherent

12 Automatic Content Representation
- Using natural language understanding?
  - Computationally too expensive in real-world settings
  - Language dependence
  - The resulting representations may be too explicit to deal with the vagueness of a user’s information need
- Alternative: a document is simply an unstructured set of the words appearing in it (bag of words)

13 Basic Approach to IR
- Most successful approaches are statistical
  - Directly, or through an effort to capture and use probabilities
- Why not natural language understanding?
  - The computer understands documents and query and matches them
  - Can be highly successful in predictable settings
    - E.g., medical or legal settings with restricted vocabulary
- Could use manually assigned headings
  - E.g., Library of Congress headings
  - Human agreement is not good
  - Hard to predict which headings are interesting
  - Expensive

14 Relevance
- Much of IR depends upon the idea that similar vocabulary -> relevant to the same queries
  - Usually look for documents matching query words
- “Similar” can be measured in many ways
  - String matching/comparison
  - Same vocabulary used
  - Probability that documents arise from the same model
  - Same meaning of text

15 Bag of Words
- An effective and popular approach
- Compares words without regard to order
- Consider reordering the words in a headline (Thai example):
  - Random: นะ ง่าย IR ดี วิชา นี่ เรียน
  - Alphabetical: IR ง่าย ดี นะ นี่ เรียน วิชา
  - Interesting: IR วิชา เรียน ง่าย ดี นี่ นะ
  - Actual: วิชา IR นี่ เรียน ง่าย ดี นะ (roughly, "This IR course is easy and good to study")
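To make the order-independence concrete, here is a minimal sketch that builds a bag of words for each of the four orderings on the slide and checks that they are identical. The helper name `bag_of_words` is hypothetical, and whitespace splitting is assumed to be sufficient because the slide already separates the tokens with spaces.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count whitespace-separated tokens; word order is discarded."""
    return Counter(text.split())


# The four orderings from the slide, tokens separated by spaces.
random_order = "นะ ง่าย IR ดี วิชา นี่ เรียน"
alphabetical = "IR ง่าย ดี นะ นี่ เรียน วิชา"
interesting = "IR วิชา เรียน ง่าย ดี นี่ นะ"
actual = "วิชา IR นี่ เรียน ง่าย ดี นะ"

bags = [bag_of_words(s) for s in (random_order, alphabetical, interesting, actual)]

# Every ordering produces the same multiset of words.
all_equal = all(bag == bags[0] for bag in bags)
print(all_equal)
```

Under this representation a retrieval system cannot tell the four headlines apart, which is exactly the information the bag-of-words approach gives up.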

16 Bag of Words Approach
- A document is an unordered list of words (grammatical information is lost)
- Tokenization: what is a word? (Is “White House” one or two words?)
- Stemming or lemmatization: morphological information is thrown away; “agreements” becomes “agreement” (lemmatization) or even “agree” (stemming)

17 Simple Model of IR
[Figure: simple flow of the retrieval process]

18 Common Preprocessing Steps
- Strip unwanted characters/markup (e.g., HTML tags, punctuation, numbers)
- Break into tokens (keywords) on white space
- Stem tokens to “root” words (computational → compute)
- Remove common stopwords (e.g., a, the, it)
- Detect common phrases (possibly using a domain-specific dictionary)
- Build inverted index (keyword → list of docs containing it)
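The preprocessing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the course's implementation: `STOPWORDS` is a tiny made-up list, `crude_stem` is a toy suffix stripper (not a real stemmer such as Porter's, so it will not map "computational" to "compute"), phrase detection is omitted, and the two sample documents are invented.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "the", "it", "and", "on"}  # illustrative list only


def strip_markup(text):
    """Remove HTML-like tags, then anything that is not a letter or space."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"[^A-Za-z\s]", " ", text)


def crude_stem(token):
    """Toy stemmer: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Strip markup, tokenize on white space, stem, drop stopwords."""
    tokens = strip_markup(text).lower().split()
    stems = [crude_stem(t) for t in tokens]
    return [s for s in stems if s not in STOPWORDS]


def build_inverted_index(docs):
    """Map each keyword to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents.
docs = {
    1: "<p>The cat sat on the mat.</p>",
    2: "<p>Cats and dogs: 2 dogs chased the cat!</p>",
}
index = build_inverted_index(docs)
print(index["cat"])  # [1, 2]
print(index["dog"])  # [2]
```

Storing sets while building and sorted lists at the end keeps each posting list free of duplicates, which is what later set-based query evaluation relies on.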

19 Statistical Language Model
- A document comes from a topic
- The topic describes how words appear in documents on the topic
- Use the document to guess what the topic looks like
  - Words common in the document are common in the topic
  - Words not in the document are much less likely
- Index the estimated topics
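One way to read "words not in the document are much less likely" is that unseen words get a small but non-zero probability. The sketch below uses additive smoothing for this, which is an assumption of the example and is not named on the slide; `alpha`, the function names, and the two toy documents are likewise hypothetical.

```python
import math
from collections import Counter


def topic_model(doc_tokens, vocabulary, alpha=0.1):
    """Estimate a topic's word distribution from one document.

    Additive smoothing (assumed here) gives unseen words a small probability.
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def query_log_likelihood(query_tokens, model):
    """Log-probability that the estimated topic generates the query words."""
    return sum(math.log(model[w]) for w in query_tokens)


# Hypothetical toy collection.
docs = {
    "d1": "fish swim in the sea".split(),
    "d2": "birds fly in the sky".split(),
}
vocabulary = {w for tokens in docs.values() for w in tokens}

# "Index estimated topics": one estimated model per document.
models = {doc_id: topic_model(tokens, vocabulary) for doc_id, tokens in docs.items()}

query = ["fish", "sea"]
ranked = sorted(models, key=lambda d: query_log_likelihood(query, models[d]), reverse=True)
print(ranked)
```

Without smoothing, any document missing a single query word would score a probability of zero, so the choice of `alpha` controls how harshly missing words are penalized.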

20 Statistical Retrieval
- Retrieval is based on similarity between query and documents
- Output documents are ranked according to similarity to the query
- Similarity is based on occurrence frequencies of keywords in query and document
- Automatic relevance feedback can be supported:
  - Relevant documents “added” to query
  - Irrelevant documents “subtracted” from query
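As a rough illustration of frequency-based ranking, the sketch below scores documents by cosine similarity between raw term-frequency vectors. Cosine is one common choice and is an assumption here, since the slide does not name a specific similarity measure. The `apply_feedback` helper is a deliberately unweighted simplification of "adding" and "subtracting" documents; all names and sample texts are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_vec, doc_vecs):
    """Return ids of documents with non-zero similarity, best first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]


def apply_feedback(query_vec, relevant, irrelevant):
    """Add relevant documents to the query and subtract irrelevant ones."""
    updated = Counter(query_vec)
    for vec in relevant:
        updated.update(vec)
    for vec in irrelevant:
        updated.subtract(vec)
    return +updated  # unary plus drops terms whose weight is zero or negative


# Hypothetical toy collection.
texts = {
    "d1": "red fish blue fish",
    "d2": "old fish new fish",
    "d3": "green eggs and ham",
}
doc_vecs = {doc_id: Counter(text.split()) for doc_id, text in texts.items()}

query_vec = Counter("blue fish".split())
print(rank(query_vec, doc_vecs))  # ['d1', 'd2']
```

Practical feedback schemes usually weight the added and subtracted documents rather than applying them at full strength as this sketch does.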

21 Example: Small Document
D = one fish, two fish, red fish, blue fish, black fish, blue fish, old fish, new fish
Len(D) = 16
P(fish|D) = 8/16 = 0.5
P(blue|D) = 2/16 = 0.125
P(one|D) = 1/16 = 0.0625
P(eggs|D) = 0/16 = 0
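The figures on this slide are plain relative frequencies, and a short sketch can reproduce them. The function name `p` and the letter-only tokenization are assumptions made for the example.

```python
import re
from collections import Counter

D = "one fish, two fish, red fish, blue fish,black fish, blue fish, old fish, new fish"

# Tokenize on runs of letters so the commas are ignored.
tokens = re.findall(r"[a-z]+", D.lower())
counts = Counter(tokens)


def p(word):
    """Relative frequency of a word in D (maximum-likelihood estimate)."""
    return counts[word] / len(tokens)


print(len(tokens))  # 16
print(p("fish"))    # 0.5
print(p("blue"))    # 0.125
print(p("one"))     # 0.0625
print(p("eggs"))    # 0.0
```

Note that a word absent from D, such as "eggs", receives probability exactly zero under this estimate.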

22 Classes of Retrieval Models
- Boolean models (set theoretic)
  - Extended Boolean
- Vector space models (statistical/algebraic)
  - Generalized VS
  - Latent Semantic Indexing
- Probabilistic models

23 Boolean Model
- A document is represented as a set of keywords
- Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope
  - ((Rio AND Brazil) OR (Hilo AND Hawaii)) AND hotel AND NOT Hilton
- Output: a document is relevant or not; no partial matches or ranking
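Because documents are keyword sets, a Boolean query maps directly onto set operations over posting lists: AND is intersection, OR is union, and NOT is a difference from the full collection. The sketch below evaluates the slide's hotel query against four invented documents; the collection and the helper name `postings` are hypothetical.

```python
# Hypothetical collection: each document is a set of keywords.
docs = {
    1: {"rio", "brazil", "hotel"},
    2: {"hilo", "hawaii", "hotel", "hilton"},
    3: {"hilo", "hawaii", "hotel"},
    4: {"rio", "brazil", "beach"},
}

# Inverted index: keyword -> set of ids of the documents containing it.
index = {}
for doc_id, keywords in docs.items():
    for keyword in keywords:
        index.setdefault(keyword, set()).add(doc_id)

all_docs = set(docs)


def postings(term):
    """Ids of documents containing the term (empty set if unseen)."""
    return index.get(term, set())


# ((Rio AND Brazil) OR (Hilo AND Hawaii)) AND hotel AND NOT Hilton
result = (
    ((postings("rio") & postings("brazil")) | (postings("hilo") & postings("hawaii")))
    & postings("hotel")
    & (all_docs - postings("hilton"))
)
print(sorted(result))  # [1, 3]
```

The result is an unordered set: documents 1 and 3 both satisfy the query and nothing in the model says which should come first.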

24 Boolean Retrieval Model
- Popular retrieval model because:
  - Easy to understand for simple queries
  - Clean formalism
- Boolean models can be extended to include ranking
- Reasonably efficient implementations possible for normal queries

25 Boolean Models: Problems
- Very rigid: AND means all; OR means any
- Difficult to express complex user requests
- Difficult to control the number of documents retrieved
  - All matched documents will be returned
- Difficult to rank output
  - All matched documents logically satisfy the query
- Difficult to perform relevance feedback
  - If a document is identified by the user as relevant or irrelevant, how should the query be modified?