1
Modern Information Retrieval Chapter 1: Introduction
Ricardo Baeza-Yates Berthier Ribeiro-Neto
2
Motivation Example of the user information need
Topic: NCAA college tennis team
Description: Find all the pages (documents) containing information on college tennis teams which (1) are maintained by a university in the USA and (2) participate in the NCAA tennis tournament.
Narrative: To be relevant, the page must include information on the national ranking of the team in the last three years and the email or phone number of the team coach.
3
IR Research
Information retrieval vs. data retrieval
Research: information search; information filtering (routing); document classification and categorization; user interfaces and data visualization; cross-language retrieval
4
IR History 1970 1990, WWW
5
The User Task
Retrieval (searching): the classic information search process, where clear objectives are defined
Browsing: a process where one's main objectives are not clearly defined and might change during the interaction with the system
6
Logical View of the Documents
Text operations reduce the complexity of the document representation, from the full text to a set of index terms.
Steps
1. Stopword removal
2. Stemming
3. Noun groups
4. ...
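The first two steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure from the slides: the stopword list `STOPWORDS`, the suffix list `SUFFIXES`, and the functions `naive_stem` and `index_terms` are hypothetical names invented here, and the suffix stripping is deliberately naive (a real system would more likely use an established stemmer).

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}

# Hypothetical suffix list for naive stemming, tried in this order.
SUFFIXES = ("ing", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Reduce a full text to a sorted list of distinct index terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({naive_stem(w) for w in words if w not in STOPWORDS})


print(index_terms("Ranking the teams in college tournaments"))
# ['college', 'rank', 'team', 'tournament']
```

The full text shrinks to four index terms: the stopwords "the" and "in" are dropped, and the remaining words are reduced to crude stems.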
7
Past, Present, and Future
Early development: index
Library: author name, title, subject headings, keywords
The Web and digital libraries: hyperlinks
8
Resources
Journals
Journal of the American Society for Information Science
ACM Transactions on Information Systems
Information Processing and Management
Information Systems (Elsevier)
Knowledge and Information Systems (Springer)
Conferences
ACM SIGIR, DL, CIKM, CHI, etc.
Text Retrieval Conference (TREC)
9
Conventional Text-Retrieval Systems Automatic Text Processing
G. Salton, Addison-Wesley, 1989. (Chapter 9)
10
Data Retrieval
A specified set of attributes is used to characterize each record.
EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO)
Exact match between the attributes used in query formulations and those attached to the document.
SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = 'John Smith'
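The exact-match behaviour can be tried out with Python's built-in `sqlite3` module. The sketch below is only an illustration: the column types and both sample records are hypothetical, invented purely to show that the query returns the record whose NAME matches exactly and nothing else.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE "
    "(NAME TEXT, SSN TEXT, BDATE TEXT, ADDR TEXT, SEX TEXT, SALARY REAL, DNO INTEGER)"
)

# Hypothetical sample records, invented for illustration only.
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("John Smith", "000-00-0001", "1970-01-01", "1 Example St", "M", 1000.0, 1),
        ("John Smithson", "000-00-0002", "1980-02-02", "2 Example St", "M", 2000.0, 2),
    ],
)

# Exact match on NAME: 'John Smithson' does not qualify.
rows = conn.execute(
    "SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = ?", ("John Smith",)
).fetchall()
print(rows)  # [('1970-01-01', '1 Example St')]
```

A record either satisfies the condition or it does not. There is no notion of a partial match or of ranking, which is the contrast with text retrieval.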
11
Text-Retrieval Systems
Content identifiers (keywords, index terms, descriptors) characterize the stored texts.
Degrees of coincidence between the sets of identifiers attached to queries and documents
content analysis
query formulation
12
Possible Representation
Document representation: unweighted index terms (term vectors), weighted index terms, …
Query: unweighted or weighted index terms, Boolean combinations (or, and, not)
Search operation: must be effective
13
File Structures
Main requirements: fast access for various kinds of searches; large number of indices
Alternatives: inverted files, signature files, PAT trees
14
Inverted Files
The file is represented as an array of indexed documents.
15
Inverted-file process
The document-term array is inverted (transposed).
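A minimal Python sketch of the inversion step follows. The three documents, three terms, and the 0/1 entries are hypothetical values chosen for illustration, and the names `doc_term`, `term_doc`, and `inverted` are assumptions of this sketch.

```python
# Hypothetical document-term array: one row per document, one column per term.
# A 1 means the term is assigned to the document.
terms = ["term1", "term2", "term3"]
doc_ids = ["D1", "D2", "D3"]
doc_term = [
    [1, 1, 0],  # D1
    [0, 1, 1],  # D2
    [1, 0, 1],  # D3
]

# Inverting (transposing) the array gives one row per term.
term_doc = [list(column) for column in zip(*doc_term)]

# Each term row can then be stored as a list of document identifiers.
inverted = {
    term: [doc_ids[j] for j, flag in enumerate(row) if flag]
    for term, row in zip(terms, term_doc)
}

print(term_doc)  # [[1, 0, 1], [1, 1, 0], [0, 1, 1]]
print(inverted)
# {'term1': ['D1', 'D3'], 'term2': ['D1', 'D2'], 'term3': ['D2', 'D3']}
```

After inversion, a lookup by term reaches its documents directly instead of scanning every document row.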
16
Inverted-file process (Continued)
Take two or more rows of an inverted term-document array, and produce a single combined list of document identifiers.
Example: Query = (term2 and term3), for which the combined list on the slide is D2.
17
List-merging for two ordered lists
The inverted-index operations that produce answers are based on a list-merging process.
Example
T1: {D1, D3}
T2: {D1, D2}
Merged(T1, T2): {D1, D1, D2, D3}
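The merge in this example can be sketched in Python as below. The function names are hypothetical. Reading the answers to "and" and "or" queries off the merged list (duplicates for "and", distinct entries for "or") is an assumption added here rather than something the slide states, and it relies on each input list being sorted and free of duplicates.

```python
def merge_ordered(list1, list2):
    """Merge two ascending lists into one ascending list, keeping duplicates."""
    merged, i, j = [], 0, 0
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            merged.append(list1[i])
            i += 1
        else:
            merged.append(list2[j])
            j += 1
    merged.extend(list1[i:])
    merged.extend(list2[j:])
    return merged


def and_from_merged(merged):
    """Identifiers that appear twice in a row occur in both input lists."""
    return [a for a, b in zip(merged, merged[1:]) if a == b]


def or_from_merged(merged):
    """Distinct identifiers occur in at least one input list."""
    result = []
    for doc in merged:
        if not result or result[-1] != doc:
            result.append(doc)
    return result


t1 = ["D1", "D3"]
t2 = ["D1", "D2"]
merged = merge_ordered(t1, t2)
print(merged)                   # ['D1', 'D1', 'D2', 'D3']
print(and_from_merged(merged))  # ['D1']
print(or_from_merged(merged))   # ['D1', 'D2', 'D3']
```

Because both lists are already ordered, the merge needs a single pass over each list.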
18
Extensions of Inverted Index Operations (Distance Constraints)
(A within sentence B): terms A and B must co-occur in a common sentence
(A adjacent B): terms A and B must occur adjacently in the text
19
Extensions of Inverted Index Operations (Distance Constraints)
Implementation
Include term location in the inverted indexes
information: {P345, P348, P350, …}
retrieval: {P123, P128, P345, …}
Include sentence location in the indexes
information: {P345, 25; P345, 37; P348, 10; P350, 8; …}
retrieval: {P123, 5; P128, 25; P345, 37; P345, 40; …}
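With sentence locations in the index, a (A within sentence B) query reduces to finding postings shared by both terms. The sketch below reads each entry as a (document, sentence number) pair, which is an interpretation of the example above. The entries elided with "…" are left out, and the function name `within_sentence` is hypothetical.

```python
# Postings as (document, sentence number) pairs, copied from the example;
# the entries elided with "…" are omitted.
information = [("P345", 25), ("P345", 37), ("P348", 10), ("P350", 8)]
retrieval = [("P123", 5), ("P128", 25), ("P345", 37), ("P345", 40)]


def within_sentence(postings_a, postings_b):
    """Return the (document, sentence) pairs in which both terms occur."""
    return sorted(set(postings_a) & set(postings_b))


print(within_sentence(information, retrieval))  # [('P345', 37)]
```

Both terms occur in P345 more than once, but only sentence 37 contains the two together. A plain document-level index could not make that distinction.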
20
Extensions of Inverted Index Operations (Distance Constraints)
Include in the indexes: paragraph numbers, sentence numbers within paragraphs, word numbers within sentences
information: {P345, 2, 3, 5; …}
retrieval: {P345, 2, 3, 6; …}
Query examples
(information adjacent retrieval)
(information within five words retrieval)
Cost: the size of indexes
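The two query examples can be sketched in Python over such location-extended postings. Several things here are assumptions of the sketch, not statements from the slides: each entry is read as (document, paragraph, sentence, word), adjacency is treated as a word distance of one in either order, and distance is only measured within one sentence. The function names are hypothetical.

```python
# Postings as (document, paragraph, sentence, word) tuples, copied from the
# example; the entries elided with "…" are omitted.
information = [("P345", 2, 3, 5)]
retrieval = [("P345", 2, 3, 6)]


def within_words(postings_a, postings_b, max_distance):
    """Documents where A and B share a sentence, at most max_distance words apart."""
    hits = set()
    for doc_a, par_a, sent_a, word_a in postings_a:
        for doc_b, par_b, sent_b, word_b in postings_b:
            same_sentence = (doc_a, par_a, sent_a) == (doc_b, par_b, sent_b)
            if same_sentence and abs(word_a - word_b) <= max_distance:
                hits.add(doc_a)
    return sorted(hits)


def adjacent(postings_a, postings_b):
    """Adjacency treated as a word distance of one (assumption: either order)."""
    return within_words(postings_a, postings_b, 1)


print(adjacent(information, retrieval))         # ['P345']
print(within_words(information, retrieval, 5))  # ['P345']
```

The nested loops keep the sketch short. Each extra location level stored per posting is what drives the index-size cost noted above.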
21
Term Weights
Di = {Ti1, 0.2; Ti2, 0.5; Ti3, 0.6}
Issues
How to generate the term weights?
How to apply the term weights?
Sum the weights of all document terms that match the given query.
Rank the output documents in descending order of this summed weight.
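The sum-and-rank rule can be sketched in Python as follows. The function names `score` and `rank` are hypothetical. The two document vectors are taken from the Boolean example later in these slides, and the query {T2, T3} is invented for illustration.

```python
def score(doc_weights, query_terms):
    """Sum the weights of the document terms that match the query."""
    return sum(w for term, w in doc_weights.items() if term in query_terms)


def rank(collection, query_terms):
    """Return (document, score) pairs in descending order of score."""
    scored = [(doc_id, score(weights, query_terms))
              for doc_id, weights in collection.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


collection = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},
}

# Hypothetical query containing terms T2 and T3.
ranking = rank(collection, {"T2", "T3"})
for doc_id, doc_score in ranking:
    print(doc_id, round(doc_score, 2))
```

For this query D1 scores 0.5 + 0.6 and D2 scores 0.2 + 0.1, so D1 is ranked first.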
22
Boolean Query with Term Weights
Transform the Boolean expression into disjunctive normal form.
T1 and (T2 or T3) = (T1 and T2) or (T1 and T3)
For each conjunct, compute the minimum term weight of any document term in that conjunct.
The document weight is the maximum of all the conjunct weights.
23
Boolean Query with Term Weights
Example: Q = (T1 and T2) or T3

Document vectors                  Conjunct weights       Query weight
                                  (T1 and T2)   (T3)     (T1 and T2) or T3
D1 = (T1, 0.2; T2, 0.5; T3, 0.6)  0.2           0.6      0.6
D2 = (T1, 0.7; T2, 0.2; T3, 0.1)  0.2           0.1      0.2

D1 is preferred.
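The min/max rule can be checked against this example with a short Python sketch. The function name `dnf_weight` and the list-of-lists encoding of the query are assumptions of the sketch, as is treating a term that is absent from a document as having weight 0.

```python
def dnf_weight(doc_weights, dnf_query):
    """Weight of a document for a query in disjunctive normal form.

    dnf_query is a list of conjuncts; each conjunct is a list of terms.
    A term missing from the document counts as weight 0.0 (assumption).
    """
    conjunct_weights = [
        min(doc_weights.get(term, 0.0) for term in conjunct)
        for conjunct in dnf_query
    ]
    return max(conjunct_weights)


query = [["T1", "T2"], ["T3"]]  # (T1 and T2) or T3

d1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
d2 = {"T1": 0.7, "T2": 0.2, "T3": 0.1}

print(dnf_weight(d1, query))  # 0.6
print(dnf_weight(d2, query))  # 0.2
```

D1 receives the higher weight, so it is ranked ahead of D2.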
24
Stemming Term Truncation
Remove suffixes and/or prefixes from context terms.
Example
PSYCH*: psychiatrist, psychiatry, psychiatric, psychology, psychological, …
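Right truncation such as PSYCH* can be sketched as a prefix match against the vocabulary. The function name `expand_truncation` is hypothetical, only a trailing `*` is handled, and the term "physics" is added to the slide's list as an invented non-matching example.

```python
def expand_truncation(pattern, vocabulary):
    """Return the vocabulary terms matched by a right-truncated pattern.

    Only a trailing '*' is supported in this sketch; without it the
    pattern must match a term exactly (ignoring case).
    """
    if pattern.endswith("*"):
        prefix = pattern[:-1].lower()
        return [term for term in vocabulary if term.lower().startswith(prefix)]
    return [term for term in vocabulary if term.lower() == pattern.lower()]


vocabulary = [
    "psychiatrist", "psychiatry", "psychiatric",
    "psychology", "psychological",
    "physics",  # hypothetical non-matching term
]

print(expand_truncation("PSYCH*", vocabulary))
# ['psychiatrist', 'psychiatry', 'psychiatric', 'psychology', 'psychological']
```

Every term sharing the prefix is retrieved, which raises recall but can also pull in terms with unrelated meanings.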
25
Summary