INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID


Lecture # 2
- Introduction
- Information Retrieval Models
- Boolean Retrieval Model

ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources:
- "Introduction to Information Retrieval" by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze
- "Managing Gigabytes" by Ian H. Witten, Alistair Moffat, and Timothy C. Bell
- "Modern Information Retrieval" by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
- "Web Information Retrieval" by Stefano Ceri, Alessandro Bozzon, and Marco Brambilla

Outline
- What is Information Retrieval?
- IR Models
- The Boolean Model
- Considerations on the Boolean Model

Introduction

What is Information Retrieval?
- Information retrieval (IR) deals with the representation, storage, organization of, and access to information items. (Baeza-Yates, Ribeiro-Neto, 1999)
- Information retrieval (IR) is devoted to finding relevant documents, not finding simple matches to patterns. (Grossman, Frieder, 2004)
- Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). (Manning et al., 2007)

Information Retrieval
[Figure: a query string and a document corpus are the inputs to the IR system, which outputs ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]

IR Models

IR Models
Modeling in IR is a complex process aimed at producing a ranking function.
Ranking function: a function that assigns scores to documents with regard to a given query.
This process consists of two main tasks:
- The conception of a logical framework for representing documents and queries
- The definition of a ranking function that allows quantifying the similarities among documents and queries

IR Models
An IR model is a quadruple [D, Q, F, R(qi, dj)] where
1. D is a set of logical views for the documents in the collection
2. Q is a set of logical views for the user queries
3. F is a framework for modeling documents and queries
4. R(qi, dj) is a ranking function
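
As a rough illustration of the quadruple, the Python sketch below bundles the four components in one object. The names `IRModel`, `score_all`, and `boolean_and_rank`, the set-of-terms logical view, and the toy documents are all assumptions made for this sketch and do not come from the lecture.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet

# Assumption: the logical view of a document or a query is a set of index terms.
LogicalView = FrozenSet[str]


@dataclass
class IRModel:
    """Illustrative container for the quadruple [D, Q, F, R(qi, dj)]."""
    documents: Dict[str, LogicalView]   # D: logical views of the documents
    queries: Dict[str, LogicalView]     # Q: logical views of the user queries
    framework: str                      # F: name of the modeling framework
    rank: Callable[[LogicalView, LogicalView], float]  # R(qi, dj)

    def score_all(self, query_id: str) -> Dict[str, float]:
        """Apply R(qi, dj) to every document for one query."""
        q = self.queries[query_id]
        return {doc_id: self.rank(q, d) for doc_id, d in self.documents.items()}


def boolean_and_rank(query: LogicalView, doc: LogicalView) -> float:
    """Hypothetical ranking function: 1.0 if the document has every query term."""
    return 1.0 if query <= doc else 0.0


# Toy data, made up for illustration.
model = IRModel(
    documents={"d1": frozenset({"brutus", "caesar"}), "d2": frozenset({"caesar"})},
    queries={"q1": frozenset({"brutus", "caesar"})},
    framework="set theory and Boolean algebra",
    rank=boolean_and_rank,
)
scores = model.score_all("q1")
```

Swapping in a different `rank` callable (for example, one that returns graded scores) changes the model without touching D or Q, which is the point of separating the four components.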

Retrieval: Ad Hoc vs. Filtering
Ad Hoc Retrieval: [figure not included in the transcript]

Retrieval: Ad Hoc vs. Filtering

A Taxonomy of IR Models
- User task (retrieval): Ad-Hoc Retrieval, Filtering
- Classic Models: Boolean, Vector, Probabilistic
- Set Theoretic: Fuzzy, Extended Boolean
- Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
- Probabilistic: Inference Network, Belief Network
- Structured Models: Non-Overlapping Lists, Proximal Nodes

The Boolean Model

The Boolean Model
- The Boolean retrieval model is a simple retrieval model based on set theory and Boolean algebra
- The significance of an index term is represented by binary weights: Wkj ∈ {0,1} is associated with the tuple (tk, dj)
- Rdj: the set of index terms for a document
- Rti: the set of documents for an index term
- Queries are defined as Boolean expressions over index terms (using the Boolean operators AND, OR, and NOT)
- Example: Brutus AND Caesar but NOT Calpurnia?
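
The binary weights and the two sets Rdj and Rti can be sketched in a few lines of Python. The two toy documents, the whitespace tokenizer, and the names `R_d`, `R_t`, `weight`, and `incidence` are assumptions made for illustration only.

```python
from typing import Dict, List, Set

# Hypothetical toy collection; the texts are made up for illustration.
texts: Dict[str, str] = {
    "d1": "brutus killed caesar",
    "d2": "caesar married calpurnia",
}

# Rdj: the set of index terms of each document (here: lowercase whitespace tokens).
R_d: Dict[str, Set[str]] = {
    doc_id: set(text.lower().split()) for doc_id, text in texts.items()
}

# Rti: the set of documents containing each index term.
R_t: Dict[str, Set[str]] = {}
for doc_id, terms in R_d.items():
    for term in terms:
        R_t.setdefault(term, set()).add(doc_id)


def weight(term: str, doc_id: str) -> int:
    """Binary weight Wkj for the tuple (tk, dj): 1 if tk occurs in dj, else 0."""
    return 1 if term in R_d[doc_id] else 0


# One row of the term-document incidence matrix per index term.
doc_ids: List[str] = sorted(texts)
incidence = {term: [weight(term, d) for d in doc_ids] for term in sorted(R_t)}
```

Each row of `incidence` is the binary vector of one term over the documents, and each column is the binary vector of one document over the terms.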

The Boolean Model (Cont.)
- Relevance is modeled as a binary property of the documents (SC = 0 or SC = 1)
- Closed world assumption: the absence of a term t in the representation of a document d is equivalent to the presence of (not t) in the same representation
[Figure: terms ta, tb, tc with the binary vectors (1,1,1), (1,0,0), (1,1,0)]

The Boolean Model
Documents (weights for t1, t2, t3):
d1 = [1, 1, 1]T
d2 = [1, 0, 0]T
d3 = [0, 1, 0]T (implied by the sets below)

Document sets per term:
Rt1 = {d1, d2}
Rt2 = {d1, d3}
Rt3 = {d1}

Query            Result
q = t1           Rt1 = {d1, d2}
q = t1 AND t2    Rt1 ∩ Rt2 = {d1}
q = t1 OR t2     Rt1 ∪ Rt2 = {d1, d2, d3}
q = NOT t1       ¬Rt1 = {d3}
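
The four queries on this slide can be checked with Python sets. In the sketch below, the vector for d3 is an assumption inferred from Rt2 = {d1, d3} and Rt3 = {d1}, and the helper name `R` is made up for illustration. NOT is taken as the complement with respect to the whole collection, in line with the closed world assumption.

```python
# Document vectors over the terms (t1, t2, t3); d3 is inferred from the Rt sets.
vectors = {
    "d1": (1, 1, 1),
    "d2": (1, 0, 0),
    "d3": (0, 1, 0),
}
all_docs = set(vectors)


def R(term_index: int) -> set:
    """Return the set of documents whose weight for term t<term_index> is 1."""
    return {d for d, v in vectors.items() if v[term_index - 1] == 1}


results = {
    "t1": R(1),
    "t1 AND t2": R(1) & R(2),      # intersection
    "t1 OR t2": R(1) | R(2),       # union
    "NOT t1": all_docs - R(1),     # complement within the collection
}
```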

The Boolean Model
Consider processing the query: q = ta AND tb
- Locate ta in the dictionary; retrieve its postings.
- Locate tb in the dictionary; retrieve its postings.
- "Merge" the two postings:
ta: 2 4 8 16 32 64 128
tb: 1 3 5 13 21 34

Posting list intersection
- The intersection operation is the crucial one: we need to efficiently intersect postings lists so as to be able to quickly find documents that contain both terms.
- If the list lengths are x and y, the merge takes O(x+y) operations.
- Crucial: postings sorted by docID.
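
A minimal sketch of the linear merge, assuming both postings lists are Python lists of integer docIDs in increasing order. The function name `intersect` and the example docIDs are illustrative and are not taken from the slides.

```python
from typing import List


def intersect(p1: List[int], p2: List[int]) -> List[int]:
    """Intersect two postings lists sorted by docID in O(x + y) steps."""
    answer: List[int] = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID occurs in both lists: keep it and advance both pointers
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # p1[i] cannot match anything further right in p2
        else:
            j += 1   # p2[j] cannot match anything further right in p1
    return answer


# Illustrative postings lists (docIDs made up for this sketch).
postings_a = [2, 4, 8, 16, 32, 64, 128]
postings_b = [1, 2, 3, 5, 8, 13, 21, 34]
common = intersect(postings_a, postings_b)
```

Each loop iteration advances at least one pointer, so the loop runs at most x + y times. The pointer-skipping argument only holds because both lists are sorted by docID.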

Evaluation of Boolean queries
- keyword: retrieve all the documents containing the keyword from the inverted list and build a document list
- OR: create a list associated with the node, which is the union of the lists associated with the left and right sub-trees
- AND: create a list associated with the node, which is the intersection of the lists associated with the left and right sub-trees
- BUT = AND NOT: create a list associated with the node, which is the difference between the lists associated with the left and right sub-trees
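
The four rules can be sketched as a recursive walk over a query tree. In this sketch a query is either a keyword string or a tuple `(operator, left, right)`. That encoding, the function name `evaluate`, and the toy inverted index are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple, Union

# A query is either a keyword (str) or a node (operator, left subtree, right subtree).
Query = Union[str, Tuple[str, "Query", "Query"]]


def evaluate(query: Query, index: Dict[str, Set[str]]) -> Set[str]:
    """Evaluate a Boolean query tree bottom-up against an inverted index."""
    if isinstance(query, str):
        # keyword: look up its document list in the inverted index
        return set(index.get(query, set()))
    op, left, right = query
    left_docs = evaluate(left, index)
    right_docs = evaluate(right, index)
    if op == "OR":
        return left_docs | right_docs    # union of the subtree lists
    if op == "AND":
        return left_docs & right_docs    # intersection of the subtree lists
    if op == "BUT":
        return left_docs - right_docs    # AND NOT: difference of the subtree lists
    raise ValueError(f"unknown operator: {op}")


# Hypothetical inverted index, made up for illustration.
index = {
    "brutus": {"d1", "d2", "d4"},
    "caesar": {"d1", "d2", "d3"},
    "calpurnia": {"d2"},
}
result = evaluate(("BUT", ("AND", "brutus", "caesar"), "calpurnia"), index)
```

Here `result` holds the documents matching "Brutus AND Caesar but NOT Calpurnia" for the toy index.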

Query optimization
- Query optimization is the process of selecting how to organize the work of answering a query.
- Goal: minimize the total amount of work performed by the system.

Order of evaluation
t1 = {d1, d3, d5, d7}
t2 = {d2, d3, d4, d5, d6}
t3 = {d4, d6, d8}
q = t1 AND t2 OR t3
- From the left: (t1 AND t2) OR t3 = {d3, d5} ∪ {d4, d6, d8} = {d3, d4, d5, d6, d8}
- From the right: t1 AND (t2 OR t3) = {d1, d3, d5, d7} ∩ {d2, d3, d4, d5, d6, d8} = {d3, d5}
- Standard evaluation priority: and, not, or
- A and B or C and B is evaluated as (A and B) or (C and B)
[Figure: query tree with operator nodes AND and OR over the leaves t1, t2, t3]
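
The two groupings can be reproduced with Python sets using the t1, t2, and t3 from this slide. The variable names are illustrative. Python's `&` (intersection) binds tighter than `|` (union), which happens to mirror the rule that AND is evaluated before OR.

```python
t1 = {"d1", "d3", "d5", "d7"}
t2 = {"d2", "d3", "d4", "d5", "d6"}
t3 = {"d4", "d6", "d8"}

from_left = (t1 & t2) | t3     # (t1 AND t2) OR t3
from_right = t1 & (t2 | t3)    # t1 AND (t2 OR t3)

# Without parentheses, & is applied before |, so this groups as (t1 AND t2) OR t3.
default_grouping = t1 & t2 | t3
```

The two groupings return different document sets, so a query processor has to fix one priority order and apply it consistently.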

Considerations on the Boolean Model

Considerations on the Boolean Model
- The strategy is based on a binary decision criterion (i.e., a document is predicted to be either relevant or non-relevant) without any notion of a grading scale
- The Boolean model is in reality much more a data (instead of information) retrieval model
Pros:
- Boolean expressions have precise semantics
- Structured queries
- Intuitive for expert users
- Simple and neat formalism, which is why it received great attention in past years and was adopted by many of the early commercial bibliographic systems

Drawbacks of the Boolean Model
- Retrieval is based on a binary decision criterion with no notion of partial matching
- No ranking of the documents is provided (absence of a grading scale)
- The information need has to be translated into a Boolean expression, which most users find awkward
- The Boolean queries formulated by the users are most often too simplistic
- The model frequently returns either too few or too many documents in response to a user query

Resources
- Modern Information Retrieval
- Chapter 1 of IIR
- Resources at http://ifnlp.org/ir
- Boolean Retrieval