
INFORMATION RETRIEVAL Introduction

Search and Information Retrieval
- Search on the Web is a daily activity for many people throughout the world.
- Search and communication are among the most popular uses of the computer.
- Applications involving search are everywhere.
- The field of computer science most involved with R&D for search is information retrieval (IR).
Adrienn Skrop

Motto of IR
Finding relevant information in a store of information.

IR does not mean
- finding any information which we happen to come across,
- or information we are fortunate enough to discover by chance, without having anything particular in mind.

IR means
We already have an information need that we are able to formulate, and relevant items are then found in a store (collection) of items.

Definition of IR
IR is concerned with the
- organisation,
- storage,
- retrieval, and
- evaluation
of information relevant to a user's information need.

Information need
The user has an information need, e.g.:
- articles published on a certain subject,
- travel agencies with last-minute offers,
- etc.

Retrieval
- The information need is expressed in the form of a query, i.e., in a form which is required by a computer program.
- The program then retrieves information (journal articles, Web pages, etc.) in response to the query.

Formally formulated IR
IR = (U, IN, Q, O) → R, where
- U = user,
- IN = information need,
- Q = query,
- O = collection of objects to be searched,
- R = collection of retrieved objects in response to Q.

User-specific information
- The IN is more than its expression in a query Q.
- IN comprises the query Q plus additional information about the user U.
- The additional information is specific to the user.

Implicit information
- The additional information is obvious to the user but not to the computerized retrieval system.
- Thus the additional information is implicit (i.e., not expressed in Q) information I specific to the user U, and we may write IN = (Q, I).

Strict re-formulation of IR
IR is concerned with finding a relevance relationship ℜ between objects O and the information need IN. Formally: IR = ℜ(O, IN) = ℜ(O, (Q, I)).

Find a relationship ℜ
- It should be possible to take into account the implicit information I as well,
- and ideally the information which can be inferred from I, to obtain as complete a picture of user U as possible.
- Finding an appropriate relationship ℜ would mean obtaining (deriving, inferring) those objects O:
  - which match the query Q and
  - satisfy the implicit information I.

Formal relationship ℜ
IR = ℜ(O, (Q, ⟨I, ⊢⟩)),
- where ⟨I, ⊢⟩ means I together with information inferred (e.g., in some formal language or logic) from I.
- The relationship ℜ is established with some (un)certainty m, and thus we may write: IR = m[ℜ(O, (Q, ⟨I, ⊢⟩))].

IR is a kind of measurement
- Measuring the relevance of an item stored in computer memory to a user's information request (and then returning the items sorted in descending order of their measure of relevance).
- All IR frameworks, methods, and algorithms aim at as good a measurement as possible.

IR and Search Engines
- A search engine is the practical application of information retrieval techniques to large-scale text collections.
- Web search engines are the best-known examples, but there are many others.

Examples of Web search engines: Google, Bing, YAHOO!

INFORMATION RETRIEVAL TECHNOLOGY

Entities
Let E1, …, Ej, …, Em denote entities in general. E.g.:
- texts (books, journal articles, newspaper articles, papers, lecture notes, abstracts, titles),
- images (photographs, pictures, drawings),
- sounds (musical pieces, songs, speeches),
- multimedia (a collection of texts, images and sounds),
- a collection of Web pages,
- etc.

Documents
- For retrieval purposes each entity Ej is described by a piece of text Dj.
- Obviously, Dj may coincide with Ej itself (for example, when Ej is itself a piece of text).
- Dj is traditionally called a document.

Lexical units
- From a computational point of view, documents consist of words as automatically identifiable lexical units.
- lexical unit = word = string of characters preceded and followed by a space (or some other character, e.g., !, ., ?).
- Thus, words can be recognised automatically using a computer program.

Power law
The number f of occurrences of words in an English text (corpus) obeys a power law, i.e., f(r) = C·r^(−α), where
- C is a corpus-dependent constant,
- r is the rank of a word,
- α is referred to as the exponent of the power law.
The power law with α = 1, f(r) = C·r^(−1), is called Zipf's Law.

Power Law
For visualisation purposes, the power law is represented in a log-log plot, i.e., as a straight line obtained by taking logarithms: log f(r) = log C − α·log r, where
- log r is represented on the abscissa,
- log f(r) on the ordinate,
- −α is the slope of the line,
- and log C is the intercept of the line.
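As a small numerical illustration of the two slides above (C = 1000 and α = 1 are made-up values, not corpus measurements): under an ideal Zipf Law the second-ranked word occurs half as often as the first, and every pair of consecutive points in the log-log plot lies on a line of slope −α.

```python
import math

# Ideal Zipf frequencies f(r) = C * r**(-alpha); C and the rank
# range are illustration values, not corpus measurements.
C, alpha = 1000.0, 1.0
ranks = list(range(1, 11))
freqs = [C * r ** (-alpha) for r in ranks]

# Slope between consecutive points of the log-log plot; for exact
# power-law data every pairwise slope equals -alpha.
slopes = [
    (math.log(freqs[i + 1]) - math.log(freqs[i]))
    / (math.log(ranks[i + 1]) - math.log(ranks[i]))
    for i in range(len(ranks) - 1)
]
```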

Power Law Fitting Using the Regression Method
In practice, the following regression method can be applied to fit a power law to data:
1. Given a sequence of values X = (x1, …, xi, …, xn) on the horizontal axis, and another sequence of corresponding values Y = (y1, …, yi, …, yn) on the vertical axis (yi corresponds to xi).

2. If the correlation coefficient suggests a fairly strong correlation (i.e., it is close to +1 or −1) between X and Y on a log scale, then a regression line can be drawn to exhibit a relationship between the data X and Y.

3. Using the slope (= −α) and the intercept (= log C) of the regression line, the corresponding power law can be written.

Power Law Fitting Using the Least Squares Method
1. Given a sequence of values X = (x1, …, xi, …, xn) on the horizontal axis, and another sequence of corresponding values Y = (y1, …, yi, …, yn) on the vertical axis (yi corresponds to xi).
2. The parameters α and C should be computed so as to minimize the squared error E(C, α) = Σi (yi − C·xi^(−α))², i.e., the partial derivatives with respect to C and α should vanish.
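A minimal sketch of the regression method above, with made-up data: ordinary least squares is applied to the pairs (log x, log y), and the slope and intercept give α and C. (The direct least-squares fit on the raw values, as in step 2 above, would need a numerical optimiser and is not shown.)

```python
import math

def fit_power_law_loglog(xs, ys):
    """Fit f(x) = C * x**(-alpha) by ordinary least squares on the
    log-log transformed data (the regression method)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    sx, sy = sum(lx), sum(ly)
    sxx = sum(v * v for v in lx)
    sxy = sum(a * b for a, b in zip(lx, ly))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return math.exp(intercept), -slope  # (C, alpha)

# Data that follows an exact power law is recovered exactly
# (up to floating-point rounding):
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [100.0 * x ** (-2.0) for x in xs]
C, alpha = fit_power_law_loglog(xs, ys)
```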

Example
Let us assume that the data we want to approximate by a power law are X and Y, with n = 150.

Example
- The correlation coefficient is corr(X, Y) = …
- Using the regression method, the power law f(x) = …·x^(−3) is obtained.
- Using the least squares method, the power law f(x) = … is obtained.
- The approximation error is approximately 2.8 × 10^8 for the regression method and 3.6 × 10^6 for the least-squares (curve-fitting) method.
- Thus, we should accept the power law obtained by the curve-fitting method.


Word occurrence
- In a document there are words which occur many times, and
- there are words which occur only once or just a few times.

Disregarded words
- Frequently occurring words (i.e., words whose frequency f exceeds some threshold value), on the grounds that they are almost always insignificant.
- Infrequent words (i.e., words whose frequency f is below some threshold value), on the grounds that they are not much on the writer's mind (or else they would occur more frequently).

Stoplist
- A list of frequent and infrequent words.
- These words do not carry meaning in natural language and can therefore be disregarded.
- For the English language, a widely accepted and used stoplist is the so-called TIME stoplist, with words such as: a, the, by.
- The construction of a stoplist can be automated.

Stemming
- There are many morphological variations of words:
  - inflectional (plurals, tenses),
  - derivational (making verbs into nouns, etc.).
- In most cases, these have the same or very similar meanings.
- Stemmers attempt to reduce morphological variations of words to a common stem, usually by removing suffixes.
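A toy sketch of stoplisting and stemming combined. The stoplist and the suffix list are made-up fragments for illustration; a real system would use a full stoplist such as TIME and a proper stemmer such as Porter's.

```python
# Made-up mini stoplist; real stoplists contain hundreds of words.
STOPLIST = {"a", "an", "and", "in", "is", "of", "the", "to"}

def stem(word):
    """Naive suffix stripping -- illustration only, not a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, drop stoplisted words, and stem the rest."""
    words = [w.lower() for w in text.split()]
    return [stem(w) for w in words if w not in STOPLIST]

index_terms("the cats and the dogs are running")
# -> ['cat', 'dog', 'are', 'runn']  (note the over-stemmed 'runn')
```

The over-stemming of "running" to "runn" shows why production systems use rule sets like Porter's algorithm rather than a bare suffix list.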

Terms
- Let E = {E1, …, Ej, …, Em} denote a set of entities to be searched in a future retrieval system, and let
- D = {D1, …, Dj, …, Dm} denote the documents corresponding to E.
- After word identification, stoplisting and stemming, the following set of terms is identified: T = {t1, …, ti, …, tn}.

Inverted File Structure
The set T can be used to construct an inverted file structure as follows:
1. Sort the terms t1, …, ti, …, tn alphabetically. For this purpose, some appropriate (fast) sorting algorithm should be used.
2. Create an index table I in which every row ri contains exactly one term ti together with the codes (identifiers) of the documents Dj in which that term ti occurs.

Index table

Inverted file construction
- Every document Dj uniquely identifies its corresponding entity Ej.
- Thus a structure IF (Inverted File), consisting of the index table I and of the entities (master file) of the set E, can be constructed (usually on disk).
- The codes in the index table I can also contain the disk addresses of the corresponding entities in the master file.

Inverted file structure example

Usage of IF
The inverted file structure IF is used in the following way:
1. Let t denote a query term. Using an appropriate search algorithm, t is located in the table I, i.e., the result of the search is the row [t | Dt1, …, Dtu].
2. Using the codes Dt1, …, Dtu, the corresponding entities Et1, …, Etu can be read from the master file for further processing.
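The construction steps (sort, build the index table I) and the two usage steps above can be sketched as follows. The three documents are made-up examples, and in-memory document identifiers stand in for the disk addresses of the master file:

```python
# Made-up document collection: identifier -> index terms (master file).
docs = {
    "D1": ["information", "retrieval", "system"],
    "D2": ["web", "search", "system"],
    "D3": ["information", "need"],
}

# Index table I: each row holds one term and the identifiers of the
# documents in which it occurs; rows are kept in alphabetical order.
index = {}
for doc_id, terms in docs.items():
    for t in terms:
        index.setdefault(t, set()).add(doc_id)
index = {t: sorted(ids) for t, ids in sorted(index.items())}

# Usage: locate the query term's row, then fetch the documents.
query = "information"
row = index.get(query, [])        # step 1: row [Dt1, ..., Dtu]
hits = [docs[d] for d in row]     # step 2: read from the master file
```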

IF options
In an inverted file structure, other data can also be stored, such as:
- the number of occurrences of term ti in document Dj,
- the total number of occurrences of term ti in all documents,
- etc.

Term-Document Matrix
Construction of the term-document matrix TD (i = 1, …, n; j = 1, …, m):
1. Establish fij: the number of times term ti occurs in document Dj.
2. Construct the term-document matrix TD = (wij)n×m, where the entry wij is referred to as the weight of term ti in document Dj.
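A minimal sketch of the construction above on two made-up documents, using the raw occurrence counts fij as entries:

```python
# Made-up documents D1, D2 as term lists.
docs = [
    ["information", "retrieval", "retrieval"],
    ["web", "information"],
]

# Vocabulary t1, ..., tn in alphabetical order.
terms = sorted({t for d in docs for t in d})

# TD[i][j] = f_ij, the number of times term i occurs in document j.
TD = [[d.count(t) for d in docs] for t in terms]
```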

Term weights
- The weight is a numerical measure of the extent to which a term reflects the content of a document.
- There are several methods to compute the weights.

Binary weighting method
wij = 1 if fij > 0, and wij = 0 otherwise.

Frequency weighting method
wij = fij.

max-tf (max-normalised) method
wij = fij / maxk fkj, where the maximum is taken over all terms tk of document Dj.

norm-tf (length-normalised) method
wij = fij / √(Σk fkj²), i.e., the frequency divided by the length of the document's frequency vector.

tf-idf (term frequency × inverse document frequency) method
wij = fij · log(m / ni), where m is the total number of documents and ni is the number of documents in which term ti occurs.

norm-tf-idf (length-normalised tf-idf) method
wij = fij · log(m / ni) / √(Σk (fkj · log(m / nk))²).
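A sketch of the tf-idf method under the common formulation wij = fij · log(m / ni). The three documents are made-up; a term that is frequent in one document but rare in the collection gets a high weight, while a term absent from a document gets weight 0.

```python
import math

# Made-up collection of m = 3 documents as term lists.
docs = [
    ["information", "retrieval", "retrieval"],
    ["web", "search"],
    ["information", "need"],
]
m = len(docs)

# n_i: number of documents containing term t_i.
vocab = sorted({t for d in docs for t in d})
n = {t: sum(1 for d in docs if t in d) for t in vocab}

def tf_idf(t, d):
    """w_ij = f_ij * log(m / n_i): frequent in the document and
    rare in the collection -> high weight."""
    return d.count(t) * math.log(m / n[t])

w_rare = tf_idf("retrieval", docs[0])    # 2 * log(3/1)
w_absent = tf_idf("retrieval", docs[1])  # f_ij = 0 -> weight 0
```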

References
- Baeza-Yates, R. and Ribeiro-Neto, B. (2011). Modern Information Retrieval: The Concepts and Technology behind Search. Second edition. Addison-Wesley.
- Croft, W. B., Metzler, D. and Strohman, T. (2010). Search Engines: Information Retrieval in Practice. Addison-Wesley.
Adrienn Skrop