Modern Information Retrieval Chapter 1: Introduction

Modern Information Retrieval Chapter 1: Introduction Ricardo Baeza-Yates Berthier Ribeiro-Neto

Motivation: Example of a user information need. Topic: NCAA college tennis team. Description: Find all the pages (documents) containing information on college tennis teams which (1) are maintained by a university in the USA and (2) participate in the NCAA tennis tournament. Narrative: To be relevant, the page must include information on the national ranking of the team in the last three years and the email or phone number of the team coach.

IR Research. Information retrieval vs. data retrieval. Research areas: information search; information filtering (routing); document classification and categorization; user interfaces and data visualization; cross-language retrieval.

IR History 1970 1990, WWW

The User Task: Retrieval (Searching) and Browsing. Retrieval: the classic information search process, where clear objectives are defined. Browsing: a process where one's main objectives are not clearly defined and might change during the interaction with the system.

Logical View of the Documents. Text operations reduce the complexity of the document representation: a full text -> a set of index terms. Steps: 1. Stopword removal 2. Stemming 3. Noun groups 4. ...
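The first two text operations listed above can be sketched as follows. This is a minimal illustration, not the method used in the book: the stopword list and the suffix-stripping rule are assumptions made up for this example, and a real system would use a full stopword list and a proper stemmer.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption for illustration).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "for"}


def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters of the word after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Reduce a full text to a list of index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercase word tokens
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


terms = index_terms("The ranking of the teams")
print(terms)
```

The point of the sketch is only the shape of the pipeline: tokenize, drop stopwords, then conflate word variants, so that the document is represented by a much smaller set of terms than its full text.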

Past, Present, and Future. Early development: index. Library: author name, title, subject headings, keywords. The Web and digital libraries: hyperlinks.

Conventional Text-Retrieval Systems Automatic Text Processing G. Salton, Addison-Wesley, 1989. (Chapter 9)

Data Retrieval. A specified set of attributes is used to characterize each record: EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO). Exact match between the attributes used in query formulations and those attached to the document. Example: SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = 'John Smith'

Text-Retrieval Systems. Content identifiers (keywords, index terms, descriptors) characterize the stored texts. Retrieval is based on degrees of coincidence between the sets of identifiers attached to queries and documents. Issues: content analysis; query formulation.

Possible Representation. Document representation (text operation): unweighted index terms (term vectors); weighted index terms; … Query (query operation): unweighted or weighted index terms; Boolean combinations (or, and, not). Search operation must be effective (indexing).

File Structures. Main requirements: fast access for various kinds of searches; large number of indices. Alternatives: Inverted Files; Signature Files; PAT trees.

Inverted Files File is represented as an array of indexed documents.

Inverted-file process The document-term array is inverted (transposed).

Inverted-file process (Continued) Take two or more rows of an inverted term-document array, and produce a single combined list of document identifiers. Ex: Query = (term2 and term3)
          D1  D2  D3  D4
term2      1   1   0   0
term3      0   1   1   1
------------------------
result     0   1   0   0   <-- D2
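The inversion and the row combination described above can be sketched in a few lines. The term2 and term3 rows are taken from the example; the term1 column and the document names D1 to D4 are assumptions added so that the array is complete.

```python
terms = ["term1", "term2", "term3"]
doc_ids = ["D1", "D2", "D3", "D4"]

# Document-term array: one row per document, one column per term.
# The term1 column is made up; term2 and term3 follow the slide.
doc_term = [
    [1, 1, 0],  # D1
    [0, 1, 1],  # D2
    [1, 0, 1],  # D3
    [0, 0, 1],  # D4
]

# Invert (transpose) the array: one row per term, one column per document.
term_doc = {t: [row[j] for row in doc_term] for j, t in enumerate(terms)}


def boolean_and(*rows):
    """Combine term rows position by position with a logical AND."""
    return [int(all(bits)) for bits in zip(*rows)]


combined = boolean_and(term_doc["term2"], term_doc["term3"])
answer = [d for d, bit in zip(doc_ids, combined) if bit]
print(combined, answer)
```

A dense 0/1 array is used here only because it mirrors the slide; practical inverted files usually store, for each term, just the list of documents that contain it.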

List-merging for two ordered lists. The inverted-index operations to obtain answers are based on a list-merging process. Example: T1: {D1, D3}; T2: {D1, D2}; Merged(T1, T2): {D1, D1, D2, D3}
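A minimal sketch of the merge on the example lists follows. How the merged list is turned into an answer is an assumption here: an identifier that appears twice is read as matching both terms (AND), and the list with duplicates removed is read as matching either term (OR). The function names are hypothetical.

```python
def merge_ordered(a, b):
    """Merge two ordered posting lists into one ordered list, keeping duplicates."""
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])  # whatever is left of either list
    merged.extend(b[j:])
    return merged


def and_from_merged(merged):
    """Identifiers that occur twice in a row were present in both lists."""
    return [x for x, y in zip(merged, merged[1:]) if x == y]


def or_from_merged(merged):
    """Drop adjacent duplicates to get identifiers present in either list."""
    out = []
    for x in merged:
        if not out or out[-1] != x:
            out.append(x)
    return out


# String ids compare lexicographically, which matches document order
# only while all ids have the same length (true for this small example).
t1 = ["D1", "D3"]
t2 = ["D1", "D2"]
merged = merge_ordered(t1, t2)
print(merged, and_from_merged(merged), or_from_merged(merged))
```

Because both inputs are already ordered, the merge needs a single pass over each list.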

Extensions of Inverted Index Operations (Distance Constraints). (A within sentence B): terms A and B must co-occur in a common sentence. (A adjacent B): terms A and B must occur adjacently in the text.

Extensions of Inverted Index Operations (Distance Constraints): Implementation. Include term-location in the inverted indexes: information: {P345, P348, P350, …}; retrieval: {P123, P128, P345, …}. Include sentence-location in the indexes: information: {P345, 25; P345, 37; P348, 10; P350, 8; …}; retrieval: {P123, 5; P128, 25; P345, 37; P345, 40; …}

Extensions of Inverted Index Operations (Distance Constraints). Include paragraph numbers in the indexes, sentence numbers within paragraphs, and word numbers within sentences: information: {P345, 2, 3, 5; …}; retrieval: {P345, 2, 3, 6; …}. Query examples: (information adjacent retrieval); (information within five words retrieval). Cost: the size of indexes.
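The two query examples can be evaluated against such an index roughly as follows. This is a sketch under assumptions: each posting is a (document, paragraph, sentence, word) tuple, "adjacent" is taken to mean a word distance of exactly one in either order, and the P350 postings are made up to show a non-match.

```python
# term -> list of (document, paragraph, sentence, word) postings.
# The P345 entries follow the slide; the P350 entries are hypothetical.
index = {
    "information": [("P345", 2, 3, 5), ("P350", 1, 1, 2)],
    "retrieval": [("P345", 2, 3, 6), ("P350", 1, 1, 9)],
}


def within_words(index, a, b, max_gap):
    """Documents where a and b occur in the same sentence, at most max_gap words apart."""
    hits = set()
    for doc_a, par_a, sen_a, word_a in index.get(a, []):
        for doc_b, par_b, sen_b, word_b in index.get(b, []):
            same_sentence = (doc_a, par_a, sen_a) == (doc_b, par_b, sen_b)
            if same_sentence and 0 < abs(word_a - word_b) <= max_gap:
                hits.add(doc_a)
    return sorted(hits)


def adjacent(index, a, b):
    """Documents where a and b are next to each other (either order)."""
    return within_words(index, a, b, 1)


print(adjacent(index, "information", "retrieval"))
print(within_words(index, "information", "retrieval", 5))
```

The nested loop keeps the sketch short; with postings sorted by position, the same check can be done with a single merge pass. Storing three extra numbers per occurrence is what drives the index-size cost noted above.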

Retrieval models. Classic Models: Boolean, Vector, Probabilistic. Set Theoretic: Fuzzy, Extended Boolean. Algebraic: Generalized Vector, Latent Semantic Index, Neural Networks. Probabilistic: Inference Network, Belief Network.

Classic IR Model. Basic concepts: Each document is described by a set of representative keywords called index terms. Assign numerical weights to index terms to distinguish their relevance.

Boolean model. Binary decision criterion. Data retrieval model. Advantage: clean formalism, simplicity. Disadvantages: it is not simple to translate an information need into a Boolean expression; exact matching may lead to retrieval of too few or too many documents.

Vector model. Assign non-binary weights to index terms in queries and in documents => TFxIDF. Compute the similarity between documents and query => Sim(Dj, Q). More precise than Boolean model.
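A small sketch of the two steps above, under stated assumptions: the weight of a term is its raw frequency times log(N / df), and Sim(Dj, Q) is taken to be the cosine of the angle between the document and query vectors. The slide does not fix either formula, and the toy documents are made up.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(docs, query):
    """Return (similarity, document index) pairs, best match first."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))    # document frequency
    idf = {t: math.log(n / df[t]) for t in df}       # inverse document frequency

    def vec(tokens):
        # TF x IDF weight per term; terms unseen in the collection are dropped.
        return {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}

    q = vec(query)
    return sorted(((cosine(vec(d), q), i) for i, d in enumerate(docs)), reverse=True)


# Hypothetical, already tokenized documents.
docs = [
    ["college", "tennis", "team"],
    ["tennis", "ranking"],
    ["college", "football"],
]
for score, i in rank(docs, ["tennis", "team"]):
    print(i, round(score, 3))
```

Unlike the Boolean model, every document gets a graded score, so a document that matches only part of the query is still ranked rather than rejected.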

Term Weights: Issues. Term weights: Di={Ti1, 0.2; Ti2, 0.5; Ti3, 0.6}. How to generate the term weights? How to apply the term weights? Sum the weights of all document terms that match the given query. Rank the output documents in descending order of term weight.
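Applying the weights as described (sum the weights of matching terms, then rank) might look like this. The two document vectors and the query are assumptions chosen for illustration.

```python
def score(doc_weights, query_terms):
    """Sum the weights of the document terms that appear in the query."""
    return sum(w for t, w in doc_weights.items() if t in query_terms)


# Hypothetical weighted documents and an unweighted query.
docs = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},
}
query = {"T2", "T3"}

# Rank documents in descending order of their summed weight.
ranked = sorted(docs, key=lambda d: score(docs[d], query), reverse=True)
print(ranked)
```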

Boolean Query with Term Weights. Transform a Boolean expression into disjunctive normal form: T1 and (T2 or T3) = (T1 and T2) or (T1 and T3). For each conjunct, compute the minimum term weight of any document term in that conjunct. The document weight is the maximum of all the conjunct weights.

Boolean Query with Term Weights. Example: Q = (T1 and T2) or T3
Document vectors              Conjunct weights       Query weight
                              (T1 and T2)   (T3)     (T1 and T2) or T3
D1=(T1,0.2;T2,0.5;T3,0.6)     0.2           0.6      0.6
D2=(T1,0.7;T2,0.2;T3,0.1)     0.2           0.1      0.2
D1 is preferred.
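The min/max rule can be checked on this example with a short sketch. Representing the disjunctive normal form as a list of conjuncts, each a tuple of terms, is an assumption of this illustration, as is giving a missing term a weight of 0.

```python
def document_weight(doc, dnf):
    """Maximum over conjuncts of the minimum term weight within each conjunct."""
    return max(min(doc.get(term, 0.0) for term in conjunct) for conjunct in dnf)


# Q = (T1 and T2) or T3, already in disjunctive normal form.
query = [("T1", "T2"), ("T3",)]

d1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
d2 = {"T1": 0.7, "T2": 0.2, "T3": 0.1}

print(document_weight(d1, query), document_weight(d2, query))
```

D1 scores higher than D2, which agrees with the table.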

Summary: Conventional IR systems. Evaluation; text operations (term selection); query operations (pattern matching, relevance feedback); indexing (file structure); modeling.

Resources. Journals: Journal of the American Society for Information Science; ACM Transactions on Information Systems; Information Processing and Management; Information Systems (Elsevier); Knowledge and Information Systems (Springer). Conferences: ACM SIGIR, DL, CIKM, CHI, etc.; Text Retrieval Conference (TREC).