Published by Johnathan Richardson. Modified over 6 years ago.
1
Lecture 1: Introduction and the Boolean Model Information Retrieval
By Dr. Huda Abdulaali Introduction to Information Retrieval. Manning et al.,
2
INTRODUCTION
Information retrieval (IR) is finding material of an unstructured nature that satisfies an information need from within large collections.
3
Information retrieval (IR) is finding material (usually documents) of an unstructured nature that satisfies an information need from within large collections (usually stored on computers). Document collection: the text units over which we have built an IR system. These are usually documents, but they could also be book chapters, paragraphs, scenes of a movie, turns in a conversation, and so on.
4
Structured vs Unstructured Data
Structured data is information with a high degree of organization, such that it fits seamlessly into a relational database and is readily searchable by simple, straightforward search algorithms or other search operations. Unstructured data is essentially the opposite.
5
Information Needs and Relevance
An information need is the topic about which the user desires to know more. A query is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if the user perceives that it contains information of value with respect to their personal information need.
6
The field of IR also covers supporting users in browsing or filtering document collections, or in further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents.
7
In web search, the system has to provide search over billions of documents stored on millions of computers. Distinctive issues are the need to gather documents for indexing and the need to build systems that work efficiently at this scale.
In enterprise and institutional search (e.g., a company's documentation, patents, research articles), the collection is often domain-specific and held in centralized storage, with dedicated machines for search.
In personal information retrieval, operating systems have integrated information retrieval (such as Apple's Mac OS X Spotlight or Windows Vista's Instant Search). Email programs usually provide not only search but also text classification: they at least provide a spam (junk mail) filter, and commonly also provide either manual or automatic means for classifying mail so that it can be placed directly into particular folders.
8
A short history of IR
9
Boolean Retrieval
The Boolean model of information retrieval (BIR) is a classical information retrieval (IR) model and, at the same time, the first and most widely adopted one. It is used by many IR systems to this day. In the Boolean retrieval model we can pose any query that takes the form of a Boolean expression of terms, i.e., one in which terms are combined with the operators AND, OR, and NOT. The running example uses Shakespeare's plays.
10
Basic Assumptions of the Boolean Model
An index term is either present (1) or absent (0) in the document. All index terms provide equal evidence with respect to information needs. Queries are Boolean combinations of index terms.
X AND Y: represents the documents that contain both X and Y.
X OR Y: represents the documents that contain X or Y (or both).
NOT X: represents the documents that do not contain X.
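The three operators can be read as set operations on the sets of documents that contain each term. The sketch below is only an illustration under that reading: the docIDs, the terms "x" and "y", and the helper names are hypothetical and not part of the lecture.

```python
# Hypothetical toy collection: four documents identified by docIDs 1..4.
all_docs = {1, 2, 3, 4}

# Hypothetical term presence: the set of docIDs in which each index term occurs.
docs_with = {
    "x": {1, 2},
    "y": {2, 3},
}


def boolean_and(a, b):
    """Documents that contain both terms (set intersection)."""
    return a & b


def boolean_or(a, b):
    """Documents that contain at least one of the terms (set union)."""
    return a | b


def boolean_not(a, universe):
    """Documents in the collection that do not contain the term."""
    return universe - a


x_and_y = boolean_and(docs_with["x"], docs_with["y"])
x_or_y = boolean_or(docs_with["x"], docs_with["y"])
not_x = boolean_not(docs_with["x"], all_docs)
print(x_and_y, x_or_y, not_x)
```

Note that NOT needs the whole collection as a reference set, since "documents that do not contain X" is only defined relative to it.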
11
An example information retrieval problem
Query: Brutus AND Caesar AND NOT Calpurnia. A fat book that many people own is Shakespeare's Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words Brutus and Caesar and not Calpurnia. One way to do that is to start at the beginning and read through all the text, noting for each play whether it contains Brutus and Caesar and excluding it if it contains Calpurnia. The simplest form of document retrieval is for a computer to do this sort of linear scan through documents. This process is commonly referred to as grepping through text.
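A linear scan of this kind might look like the following sketch. The three "plays" are made-up one-line snippets (not Shakespeare's text), and the function names are hypothetical; the point is only that every document is read in full for every query.

```python
import re

# Hypothetical miniature "plays": invented snippets, not the real texts.
plays = {
    "Play A": "Brutus spoke to Caesar at dawn",
    "Play B": "Caesar and Brutus met while Calpurnia waited",
    "Play C": "Caesar marched alone",
}


def words_in(text):
    """Lower-cased set of alphabetic words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def linear_scan(plays, required, excluded):
    """Read every play from start to finish and keep those that contain
    all required words and none of the excluded words."""
    hits = []
    for title, text in plays.items():
        words = words_in(text)
        has_required = all(w in words for w in required)
        has_excluded = any(w in words for w in excluded)
        if has_required and not has_excluded:
            hits.append(title)
    return hits


result = linear_scan(plays, ["brutus", "caesar"], ["calpurnia"])
print(result)
```

The cost of each query grows with the total size of the collection, which is what motivates building an index in advance.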
12
An example information retrieval problem
grep is a command-line utility for searching plain-text data sets for lines that match a regular expression. Its name comes from the ed command g/re/p (globally search for a regular expression and print), which has the same effect: doing a global search with the regular expression and printing all matching lines. grep was originally developed for the Unix operating system and later became available for all Unix-like systems.
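The "search globally and print matching lines" behaviour can be imitated in a few lines of Python. This is a rough sketch of the idea, not a reimplementation of the real utility; the function name and the sample lines are chosen here for illustration.

```python
import re


def grep(pattern, lines):
    """Return every line in which the regular expression matches somewhere."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# Illustrative sample lines.
sample_lines = [
    "Brutus is an honourable man",
    "Friends, Romans, countrymen",
    "Caesar was ambitious",
]

matches = grep(r"Brutus|Caesar", sample_lines)
for line in matches:
    print(line)
```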
13
An example information retrieval problem
For many purposes, a linear scan is not enough. We need more:
1. To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words.
2. To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as "within 5 words" or "within the same sentence."
3. To allow ranked retrieval. In many cases, you want the best answer to an information need among many documents that contain certain words.
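To make the NEAR operator in point 2 concrete, here is a small sketch that checks whether two terms occur within k word positions of each other. Interpreting "within 5 words" as "token positions differ by at most 5" is an assumption made for this example, and the function name and test sentences are hypothetical.

```python
import re


def near(text, term_a, term_b, k=5):
    """True if some occurrence of term_a lies within k token positions
    of some occurrence of term_b (assumed reading of 'within k words')."""
    tokens = re.findall(r"[a-z]+", text.lower())
    positions_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    return any(abs(i - j) <= k for i in positions_a for j in positions_b)


close = near("Friends, Romans, countrymen, lend me your ears",
             "romans", "countrymen")
far = near("Romans came and went for many long years before any countrymen",
           "romans", "countrymen")
print(close, far)
```

A check like this needs word positions, which a line-oriented regular-expression search does not provide directly.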
14
An example information retrieval problem
The way to avoid linearly scanning the texts for each query is to index the documents in advance. The result is a binary term-document incidence matrix, which gives a vector for each term. (The information retrieval literature normally speaks of terms, not words.)
Retrieval models can be categorized as: the Boolean retrieval model, the vector space model, the probabilistic model, and models based on belief networks.
15
An example information retrieval problem: the binary term-document incidence matrix
Main idea: record for each document whether it contains each word out of all the different words Shakespeare used. Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise.
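As a sketch of how such a matrix could be built, the snippet below uses three invented one-line documents in place of the plays. The document names, texts, and the simple tokenizer are assumptions for illustration only.

```python
import re

# Hypothetical toy documents standing in for the plays.
docs = {
    "d1": "Brutus and Caesar",
    "d2": "Caesar and Calpurnia",
    "d3": "Brutus alone",
}


def tokenize(text):
    """Very simple tokenizer: lower-cased alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


doc_ids = list(docs)  # column order of the matrix
doc_terms = {d: set(tokenize(docs[d])) for d in doc_ids}
terms = sorted(set().union(*doc_terms.values()))  # row order of the matrix

# matrix[t] is the row (vector) for term t: one 0/1 entry per document.
matrix = {
    t: [1 if t in doc_terms[d] else 0 for d in doc_ids]
    for t in terms
}

for t in terms:
    print(f"{t:10s}", matrix[t])
```

Each row is the vector for one term, and each column describes one document.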
16
An example information retrieval problem
To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND: 110100 AND 110111 AND 101111 = 100100. The answers for this query are thus Antony and Cleopatra and Hamlet.
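The same computation can be sketched with Python integers used as bit vectors. The bit patterns and the column order of the plays follow the textbook version of this example (leftmost bit = first play); treat them as assumed inputs here.

```python
# Column order of the plays, as in the textbook figure (assumed here).
plays = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]
n_plays = len(plays)
mask = (1 << n_plays) - 1  # keeps only the lowest n_plays bits

# Term vectors, one bit per play (assumed inputs from the textbook example).
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000

# Python integers are unbounded, so the complement must be masked
# back to n_plays bits before it is used.
not_calpurnia = ~calpurnia & mask

result = brutus & caesar & not_calpurnia
bits = format(result, f"0{n_plays}b")

# Map the 1-bits back to play titles.
answers = [title for title, bit in zip(plays, bits) if bit == "1"]
print(bits, answers)
```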
17
A first take at building an inverted index
The inverted index consists of a dictionary of terms (also: lexicon, vocabulary) and a postings list for each term, i.e., a list that records which documents the term occurs in. The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. To gain the speed benefits of indexing at retrieval time, we have to build the index in advance.
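In its simplest form, the dictionary-plus-postings structure can be pictured as a mapping from each term to a sorted list of docIDs. The postings below are invented for illustration, and the helper name is hypothetical.

```python
# Hypothetical inverted index: dictionary term -> postings list (sorted docIDs).
index = {
    "brutus": [1, 2, 4],
    "caesar": [1, 2, 5],
    "calpurnia": [2],
}


def postings(term):
    """Look up the postings list for a term; unknown terms have no postings."""
    return index.get(term.lower(), [])


print(postings("Brutus"))
print(postings("cleopatra"))
```

Answering "find the documents where word X occurs" then becomes a single dictionary lookup instead of a scan over the whole collection.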
20
A first take at building an inverted index
Within a document collection, we assume that each document has a unique serial number, known as the document identifier (docID). First, collect the documents to be indexed and turn each one into a list of (term, docID) pairs. The core indexing step is then sorting this list so that the terms are in alphabetical order.
21
A first take at building an inverted index
Multiple occurrences of the same term from the same document are then merged. Instances of the same term are then grouped, and the result is split into a dictionary and postings. The dictionary records some statistics, such as the number of documents which contain each term. The postings are secondarily sorted by docID. This provides the basis for efficient query processing.
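The construction steps can be sketched as follows, assuming two invented documents and a very simple tokenizer. The function names and the dictionary layout ("df" for the number of documents containing the term, "postings" for the docID list) are choices made for this example. The intersect function shows one common way in which docID-sorted postings support efficient AND queries.

```python
import re
from itertools import groupby

# Hypothetical toy collection: docID -> text.
docs = {
    1: "Caesar praised Brutus and Brutus praised Caesar",
    2: "Caesar was ambitious",
}


def build_index(docs):
    # Step 1: collect (term, docID) pairs from every document.
    pairs = [
        (token, doc_id)
        for doc_id, text in docs.items()
        for token in re.findall(r"[a-z]+", text.lower())
    ]
    # Step 2: sort by term, and secondarily by docID.
    pairs.sort()
    # Step 3: merge multiple occurrences of a term within the same document.
    unique_pairs = [pair for pair, _ in groupby(pairs)]
    # Step 4: group by term and split into dictionary entries and postings.
    index = {}
    for term, group in groupby(unique_pairs, key=lambda pair: pair[0]):
        postings = [doc_id for _, doc_id in group]
        index[term] = {"df": len(postings), "postings": postings}
    return index


def intersect(p1, p2):
    """Merge-style intersection of two postings lists sorted by docID."""
    i, j, answer = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


index = build_index(docs)
both = intersect(index["caesar"]["postings"], index["brutus"]["postings"])
print(index["caesar"], index["brutus"], both)
```

Because both lists are sorted by docID, the intersection walks each list once instead of comparing every pair of entries.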