Modern Information Retrieval Chapter 1: Introduction

Presentation transcript:

Modern Information Retrieval Chapter 1: Introduction Ricardo Baeza-Yates Berthier Ribeiro-Neto

Motivation
Example of a user information need
Topic: NCAA college tennis team
Description: Find all the pages (documents) containing information on college tennis teams which (1) are maintained by a university in the USA and (2) participate in the NCAA tennis tournament.
Narrative: To be relevant, the page must include information on the national ranking of the team in the last three years and the email or phone number of the team coach.

IR Research
Information retrieval vs. data retrieval
Research:
  information search
  information filtering (routing)
  document classification and categorization
  user interfaces and data visualization
  cross-language retrieval

IR History 1970 1990, WWW

The User Task
Retrieval (searching): the classic information search process, where clear objectives are defined.
Browsing: a process where one's main objectives are not clearly defined and might change during the interaction with the system.

Logical View of the Documents
Text operations reduce the complexity of the document representation: a full text -> a set of index terms.
Steps:
  1. Stopword removal
  2. Stemming
  3. Noun groups
  4. ...
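The first two text operations can be sketched in a few lines of Python. This is only an illustration: the stopword list and the suffix rules below are made-up placeholders, not the ones used in the book, and the crude suffix stripping stands in for a real stemmer.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on"}

# Crude suffix stripping; a stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Reduce a full text to a sorted set of index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted({stem(t) for t in tokens if t not in STOPWORDS})


print(index_terms("The ranking of the teams"))
```

The point of the sketch is the shape of the pipeline (tokenize, drop stopwords, stem, deduplicate), not the quality of the individual steps.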

Past, Present, and Future
Early Development
  Index
  Library
  Author name, title, subject headings, keywords
The Web and Digital Libraries
  Hyperlinks

Resources
Journals
  Journal of the American Society for Information Science
  ACM Transactions on Information Systems
  Information Processing and Management
  Information Systems (Elsevier)
  Knowledge and Information Systems (Springer)
Conferences
  ACM SIGIR, DL, CIKM, CHI, etc.
  Text Retrieval Conference (TREC)

Conventional Text-Retrieval Systems Automatic Text Processing G. Salton, Addison-Wesley, 1989. (Chapter 9)

Data Retrieval
A specified set of attributes is used to characterize each record.
  EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO)
Retrieval requires an exact match between the attributes used in query formulations and those attached to the document.
  SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = 'John Smith'

Text-Retrieval Systems
Content identifiers (keywords, index terms, descriptors) characterize the stored texts.
Retrieval depends on the degree of coincidence between the sets of identifiers attached to queries and documents.
  content analysis
  query formulation

Possible Representation
Document representation
  unweighted index terms (term vectors)
  weighted index terms
  …
Query
  unweighted or weighted index terms
  Boolean combinations (or, and, not)
Search operation must be effective

File Structures
Main requirements
  fast access for various kinds of searches
  large number of indices
Alternatives
  Inverted Files
  Signature Files
  PAT trees

Inverted Files File is represented as an array of indexed documents.

Inverted-file process The document-term array is inverted (transposed).
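As a minimal sketch of that inversion, assume the document-term array is a list of 0/1 rows, one row per document. The rows for term2 and term3 are chosen to agree with the query example in this transcript; the term1 column is invented for illustration.

```python
# Rows are documents D1..D4, columns are term1..term3.
# The term1 column is hypothetical; term2 and term3 follow the transcript's example.
doc_term = [
    [1, 1, 0],  # D1
    [0, 1, 1],  # D2
    [1, 0, 1],  # D3
    [0, 0, 1],  # D4
]


def invert(array):
    """Transpose a document-term array into a term-document array."""
    return [list(column) for column in zip(*array)]


term_doc = invert(doc_term)
for name, row in zip(["term1", "term2", "term3"], term_doc):
    print(name, row)
```

After the transpose, each row answers the question "which documents contain this term?", which is the lookup a query needs.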

Inverted-file process (Continued)
Take two or more rows of an inverted term-document array, and produce a single combined list of document identifiers.
Ex: Query = (term2 and term3)
                   D1 D2 D3 D4
  term2             1  1  0  0
  term3             0  1  1  1
  ----------------------------
  term2 and term3   0  1  0  0   <-- D2
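The row combination for this query can be sketched as a position-wise AND over the two rows. The helper names are illustrative, and the assumption that column i corresponds to document Di is read off the "D2" result in the example.

```python
def boolean_and(*rows):
    """Combine equal-length 0/1 rows with a position-wise AND."""
    return [int(all(bits)) for bits in zip(*rows)]


def matching_docs(row):
    """Turn a 0/1 row into document identifiers, assuming column i is document Di."""
    return [f"D{i}" for i, bit in enumerate(row, start=1) if bit]


term2 = [1, 1, 0, 0]
term3 = [0, 1, 1, 1]

combined = boolean_and(term2, term3)
print(combined, matching_docs(combined))
```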

List-merging for two ordered lists
The inverted-index operations to obtain answers are based on a list-merging process.
Example
  T1: {D1, D3}
  T2: {D1, D2}
  Merged(T1, T2): {D1, D1, D2, D3}
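A minimal two-pointer merge reproduces the example. Document identifiers are written as integers here (1 for D1, and so on) so that ordering is unambiguous. Reading AND and OR answers off the merged list, as the two helper functions do, is one common interpretation and an assumption of this sketch rather than something the slide states.

```python
def merge_ordered(a, b):
    """Merge two ascending lists into one ascending list, keeping duplicates."""
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


def and_from_merged(merged):
    """IDs that appear twice in a row were in both lists (inputs have no internal duplicates)."""
    return [x for x, y in zip(merged, merged[1:]) if x == y]


def or_from_merged(merged):
    """Collapse adjacent duplicates to get every ID that was in either list."""
    result = []
    for x in merged:
        if not result or result[-1] != x:
            result.append(x)
    return result


t1 = [1, 3]  # T1: {D1, D3}
t2 = [1, 2]  # T2: {D1, D2}
merged = merge_ordered(t1, t2)
print(merged, and_from_merged(merged), or_from_merged(merged))
```

Because both inputs are already sorted, the merge makes a single pass over each list.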

Extensions of Inverted Index Operations (Distance Constraints)
(A within sentence B): terms A and B must co-occur in a common sentence
(A adjacent B): terms A and B must occur adjacently in the text

Extensions of Inverted Index Operations (Distance Constraints)
Implementation
  include term-location in the inverted indexes
    information: {P345, P348, P350, …}
    retrieval: {P123, P128, P345, …}
  include sentence-location in the indexes
    information: {P345, 25; P345, 37; P348, 10; P350, 8; …}
    retrieval: {P123, 5; P128, 25; P345, 37; P345, 40; …}
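With sentence locations in the index, a "within sentence" query reduces to intersecting (document, sentence) pairs. The sketch below uses only the entries listed above (the elided "…" entries are left out), and the function name is illustrative.

```python
# (document, sentence number) pairs taken from the listed postings.
information = [("P345", 25), ("P345", 37), ("P348", 10), ("P350", 8)]
retrieval = [("P123", 5), ("P128", 25), ("P345", 37), ("P345", 40)]


def within_sentence(postings_a, postings_b):
    """Return the (document, sentence) pairs where both terms occur."""
    return sorted(set(postings_a) & set(postings_b))


print(within_sentence(information, retrieval))
```

Note that P345 appears in both lists several times, but only sentence 37 is shared, so a plain document-level intersection would be too coarse for this query.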

Extensions of Inverted Index Operations (Distance Constraints)
Include in the indexes:
  paragraph numbers
  sentence numbers within paragraphs
  word numbers within sentences
    information: {P345, 2, 3, 5; …}
    retrieval: {P345, 2, 3, 6; …}
Query examples
  (information adjacent retrieval)
  (information within five words retrieval)
Cost: the size of indexes
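The two query examples can be sketched over postings of the form (document, paragraph, sentence, word). Several choices here are assumptions: "adjacent" is taken to mean that the second term immediately follows the first, "within five words" is restricted to the same sentence, and the P348 postings are invented to show a case that is near but not adjacent.

```python
# (document, paragraph, sentence, word) postings.
# The P345 entries come from the transcript; the P348 entries are hypothetical.
information = [("P345", 2, 3, 5), ("P348", 1, 1, 4)]
retrieval = [("P345", 2, 3, 6), ("P348", 1, 1, 8)]


def adjacent(postings_a, postings_b):
    """Documents where term B immediately follows term A in the same sentence."""
    b_positions = set(postings_b)
    return sorted(
        {doc for (doc, para, sent, word) in postings_a
         if (doc, para, sent, word + 1) in b_positions}
    )


def within_words(postings_a, postings_b, k):
    """Documents where A and B share a sentence and lie at most k words apart."""
    hits = set()
    for doc_a, para_a, sent_a, word_a in postings_a:
        for doc_b, para_b, sent_b, word_b in postings_b:
            same_sentence = (doc_a, para_a, sent_a) == (doc_b, para_b, sent_b)
            if same_sentence and 0 < abs(word_a - word_b) <= k:
                hits.add(doc_a)
    return sorted(hits)


print(adjacent(information, retrieval))
print(within_words(information, retrieval, 5))
```

The nested loop in `within_words` is quadratic in the posting counts; it is kept simple here because the slide's stated cost concern is index size, not query time.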

Term Weights
Di = {Ti1, 0.2; Ti2, 0.5; Ti3, 0.6}
Issues
  How to generate the term weights?
  How to apply the term weights?
    Sum the weights of all document terms that match the given query.
    Rank the output documents in descending order of that summed weight.
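The "sum and rank" rule can be sketched as follows. The two document vectors are the ones used in the Boolean example in this transcript; the query {T2, T3} and the choice to drop documents with a zero score are assumptions made for illustration.

```python
docs = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},
}


def score(doc_weights, query_terms):
    """Sum the weights of the document terms that also appear in the query."""
    return sum(w for term, w in doc_weights.items() if term in query_terms)


def rank(collection, query_terms):
    """Return (document, score) pairs in descending score order, skipping zero scores."""
    scored = [(name, score(weights, query_terms)) for name, weights in collection.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for name, value in rank(docs, {"T2", "T3"}):
    print(name, round(value, 2))
```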

Boolean Query with Term Weights
Transform a Boolean expression into disjunctive normal form.
  T1 and (T2 or T3) = (T1 and T2) or (T1 and T3)
For each conjunct, compute the minimum term weight of any document term in that conjunct.
The document weight is the maximum of all the conjunct weights.

Boolean Query with Term Weights
Example: Q = (T1 and T2) or T3
  Document vectors                 Conjunct weights       Query weight
                                   (T1 and T2)   (T3)     (T1 and T2) or T3
  D1 = (T1,0.2; T2,0.5; T3,0.6)    0.2           0.6      0.6
  D2 = (T1,0.7; T2,0.2; T3,0.1)    0.2           0.1      0.2
D1 is preferred.
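The min/max rule can be checked against this example with a short sketch. The query is assumed to be already in disjunctive normal form and is written as a list of conjuncts; treating a term that is missing from a document as weight 0.0 is an assumption of the sketch.

```python
def conjunct_weights(doc_weights, dnf_query):
    """Minimum term weight within each conjunct (missing terms count as 0.0)."""
    return [min(doc_weights.get(term, 0.0) for term in conjunct) for conjunct in dnf_query]


def document_weight(doc_weights, dnf_query):
    """Maximum over all conjunct weights."""
    return max(conjunct_weights(doc_weights, dnf_query))


# Q = (T1 and T2) or T3, written as a list of conjuncts.
query = [["T1", "T2"], ["T3"]]

d1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
d2 = {"T1": 0.7, "T2": 0.2, "T3": 0.1}

for name, doc in (("D1", d1), ("D2", d2)):
    print(name, conjunct_weights(doc, query), document_weight(doc, query))
```

Using the minimum for "and" and the maximum for "or" means a conjunct is only as strong as its weakest term, while a disjunction is as strong as its best conjunct.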

Stemming
Term Truncation
Remove suffixes and/or prefixes from context terms.
Example
  PSYCH*: psychiatrist, psychiatry, psychiatric, psychology, psychological, …
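Truncated query terms such as PSYCH* can be expanded against a vocabulary with simple prefix or suffix matching. In this sketch the "*" convention for left truncation and the extra vocabulary entries ("physics", "biology") are assumptions; only the five psych- words come from the example.

```python
def expand_truncation(pattern, vocabulary):
    """Expand a pattern with one leading or trailing '*' into matching terms."""
    pattern = pattern.lower()
    if pattern.endswith("*"):
        # Right truncation: match everything that starts with the stem.
        stem = pattern[:-1]
        return sorted(t for t in vocabulary if t.lower().startswith(stem))
    if pattern.startswith("*"):
        # Left truncation: match everything that ends with the stem.
        stem = pattern[1:]
        return sorted(t for t in vocabulary if t.lower().endswith(stem))
    # No truncation marker: exact match only.
    return sorted(t for t in vocabulary if t.lower() == pattern)


vocabulary = [
    "psychiatrist", "psychiatry", "psychiatric", "psychology", "psychological",
    "physics", "biology",  # hypothetical extra terms
]

print(expand_truncation("PSYCH*", vocabulary))
print(expand_truncation("*ology", vocabulary))
```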

Summary