www.monash.edu.au CSE3201/CSE4500 Information Retrieval Systems: Introduction to Information Retrieval


Database Types
Collection types, ordered from highly structured to ill-structured:
– Relational DB
– XML collections
– Text collections
– Multimedia collections

Ill-structured data
Characteristics:
– Variable-length records and fields
– Repeated fields (non-normalised)
– Mixed media
– Often large
Often accessed by “novice” users.
Need for both currency and completeness.

Information Retrieval
Information retrieval is the term applied to areas such as:
– text retrieval systems, library systems, citation retrieval systems, records management and archives, photo library applications, etc.
These systems are typical of variable-length record systems.
Text retrieval is a subset of information retrieval.
– Research articles may use the term IR to mean text retrieval, especially those from the 1970s, 1980s and 1990s.

Text Retrieval - Overview
Information retrieval:
– is a branch of database theory
– specialises in managing the retrieval of unstructured data
– deals with large amounts of free-format text.
Key problem:
– How to retrieve the appropriate pieces of unstructured data (e.g. documents) in response to a more or less structured query.
Response to a query:
– Does not answer the query directly.
– Identifies relevant information.

Text Retrieval Characteristics
The document space is large.
The document space may or may not be structured.
The query may not be structured.
Exact matching, as used in relational databases, will not work effectively.
The objects to be retrieved are usually represented by surrogate records.

Surrogate Records
Most text retrieval systems rely on surrogate records rather than directly accessing the objects themselves.
The quality of the surrogate records often decides how well the system retrieves.
The structure of the surrogate records will affect how well they can be indexed or otherwise accessed.

Text Retrieval Processes
Representation
Storage
Organization
Retrieval
Presentation

[Figure: Text Retrieval Processes Model]

[Figure: Retrieval Process]

Indexing (Document Analysis)
Document (natural language text) -> ANALYSE -> Keywords -> STORE
Analysis techniques listed:
- Stemming
- Thesauri replacement
- (Weight assignment)

[Figure: Query Formulation]

CSE3201/CSE4500 Information Retrieval Systems: Indexing in Information Retrieval

Indexed Files in Traditional Databases
An index is a lookup table that establishes a correspondence between a particular attribute (or attributes) and the address of the record in the file.
One named (physical) file holds two logical files:
– Data file: contains the full data records.
– Index file: its “records” consist of two fields, key value and address.
The index file is small, so it is quick to search.
Addresses obtained from the index enable direct access to the data file.
Logically sequential access is also available via the index.
The main purpose of indices in a relational database is to improve the efficiency of retrieval.
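The two logical files described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names and the helper functions (build_index, lookup, scan_in_key_order) are hypothetical, and a list position stands in for a disk address.

```python
import bisect

# Data file: full records; the list position plays the role of the address.
# The records themselves are made-up illustration data.
data_file = [
    {"id": 1042, "name": "Lee"},
    {"id": 1007, "name": "Nguyen"},
    {"id": 1100, "name": "Smith"},
]


def build_index(records, key):
    """Return index "records" of (key value, address), sorted by key value."""
    return sorted((rec[key], addr) for addr, rec in enumerate(records))


def lookup(index, records, key_value):
    """Binary-search the small index, then access the data file by address."""
    keys = [k for k, _ in index]
    pos = bisect.bisect_left(keys, key_value)
    if pos < len(index) and index[pos][0] == key_value:
        return records[index[pos][1]]
    return None


def scan_in_key_order(index, records):
    """Logically sequential access: follow the index in key order."""
    return [records[addr] for _, addr in index]


index = build_index(data_file, "id")
```

Here lookup(index, data_file, 1007) goes through the index to the second stored record, and scan_in_key_order returns the records in key order even though they are stored in arrival order.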

[Figure: Indexed Sequential File, showing the index and the data records]

Indexing in Information Retrieval
Indexing creates a representation of the document (a surrogate record).
The purpose of the indices is to improve both efficiency and effectiveness.
Effectiveness is measured by whether the system can retrieve the “relevant” documents in response to a user’s query.
A document is a collection of words, each of which can be considered an index entry.
– An index record can be created using doc-id and word-id as its fields.

Indexing in Text Retrieval Systems
Index records for two data records, Doc-1 and Doc-2. Columns: Doc-id, Word-id, weight, Word. The numeric columns are run together in this transcript.
14510 comput
1755 Tech
12003 Intro
1126 Program
2161 Policy
2382 Econom
22125 Share
...
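As a rough illustration of index records with the fields doc-id, word-id, weight and word, the sketch below builds such records from two tiny documents. The documents are invented, word-ids are simply assigned in order of first appearance, and the weight is taken to be the raw count of the word in the document; all three are assumptions, since the slides do not say how the ids or weights are produced.

```python
from collections import Counter


def build_index_records(docs):
    """Return (doc_id, word_id, weight, word) tuples for every word in every document.

    Assumptions: word ids are assigned in order of first appearance, and the
    weight is the number of times the word occurs in the document.
    """
    word_ids = {}
    records = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        for word, count in counts.items():
            word_id = word_ids.setdefault(word, len(word_ids) + 1)
            records.append((doc_id, word_id, count, word))
    return records


# Hypothetical documents, not the ones shown on the slide.
docs = {1: "computer program computer", 2: "economic policy program"}
records = build_index_records(docs)
```

Note that "program" receives the same word-id in both documents, which is what lets a query term be matched across the whole collection.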

Objectives of Indexing Process
The indexing process aims to produce:
– a sufficiently general description of a document, so that it can be retrieved by queries that concern the same subject as the document;
– a sufficiently specific description, so that the document will not be returned for queries that are unrelated to it.

Vocabulary in the System
Controlled vocabulary vs uncontrolled vocabulary.

Controlled Vocabulary
A controlled vocabulary predetermines the terms that will be used in a specific domain, so that:
– indexers select from a limited set of terms
– searchers can use terms knowing that they have been applied in an objective manner
– index sets are reduced in size
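One way to picture a controlled vocabulary is as a fixed set that every proposed index term is checked against. The sketch below assumes a three-term vocabulary and a helper named assign_terms; both are hypothetical and only show the idea of restricting indexers to predetermined terms.

```python
# Hypothetical controlled vocabulary for a small computing collection.
CONTROLLED_VOCABULARY = frozenset({"databases", "information retrieval", "indexing"})


def assign_terms(candidate_terms, vocabulary=CONTROLLED_VOCABULARY):
    """Split proposed terms into those in the vocabulary and those rejected."""
    normalised = {term.lower() for term in candidate_terms}
    accepted = sorted(t for t in normalised if t in vocabulary)
    rejected = sorted(t for t in normalised if t not in vocabulary)
    return accepted, rejected
```

For example, assign_terms(["Indexing", "search engines"]) accepts "indexing" and rejects "search engines", so the index never grows beyond the predetermined set.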

Indexing
Manual indexing
Automatic indexing

Manual Indexing Methods
1. Give the document a single code from a predefined list, e.g.:
– the first letter of the first author’s family name
– a Dewey Decimal number
2. Assign several codes from a predefined list to a document, e.g.:
– assign the Computing Reviews classification to articles.
3. Assign to each document a set of descriptors that are not predefined. The descriptors may be words from the text of the document and/or from a thesaurus.

Manual Indexing – is it good?
Single code from a predefined list:
– simple, with a low indexing cost
– may not be effective for retrieval.
All other techniques:
– require a more complex index to be maintained
– may be more effective than the single-code approach.

Manual Indexing – is it good?
Advantage: terms that never appear in the text but are extremely descriptive may be assigned to the document.
Disadvantages:
– Inconsistency due to human judgment
> Two people may interpret a document’s content differently.
– Inflexible view of documents
> The user’s view of a document is dictated by the system’s view of it.
– No control over the number of satisfying documents
> A document may be assigned to different groups depending on the understanding of the person who performs the indexing.

Automatic Indexing - A Basic Method
Assume that a document consists of just text and that we will derive our indexing terms from this text.
Steps:
– Break the text up into words,
– casefold,
– and index on every word.
This technique is very simple and performs reasonably well.
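The three steps above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the rule that a word is a run of ASCII letters or digits, the sample documents, and the choice to store the result as a mapping from word to the set of doc-ids are all assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Break the text into words and casefold each one.

    Assumption: a word is a maximal run of ASCII letters or digits.
    """
    return [word.casefold() for word in re.findall(r"[A-Za-z0-9]+", text)]


def build_inverted_index(docs):
    """Index on every word: map each word to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


# Hypothetical two-document collection.
docs = {1: "Introduction to Computer Programming", 2: "Computer policy"}
index = build_inverted_index(docs)
```

Because of casefolding, a query for "computer" finds both documents regardless of how the word was capitalised in the text.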

Automatic Indexing - Refinement
Refinement is language dependent.
– Refinement for English will be different from refinement for Chinese.
Techniques for refining the indexing process:
– Stop list
– Stemming
– Term weighting

Indexing Refinement – Stop List
A stop list contains a list of common words.
A common word does not help the system to infer the content of a document.
– E.g. the, a, and, she, them
A stop list generally contains words that are NOT nouns, verbs, adjectives or adverbs.
A stop list might consist of a, the, an, is, be, ...
Common stop lists run from 10 to hundreds of words.
– The exact choice of stop list matters little; typically around 300 common words will do well.
The indexing process ignores the words listed in the stop list.

Stop Lists
Fox indicates that the first 20 stop words account for 31.19% of the English corpus.
> Fox C. (1992). Lexical Analysis and Stoplists. In Frakes W.B. and Baeza-Yates R. (Eds.), Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice-Hall.
The first 20 stop words:
– the, of, and, to, a, in, that, is, was, he, for, it, with, as, not, his, on, be, at, by.
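A stop list is straightforward to apply once the text has been split into words. The sketch below uses the 20 words listed above as the stop list; the function name remove_stop_words and the sample sentence are illustrative assumptions.

```python
# The 20 stop words listed above, used here as a deliberately small stop list.
STOP_LIST = frozenset(
    "the of and to a in that is was he for it with as not his on be at by".split()
)


def remove_stop_words(words, stop_list=STOP_LIST):
    """Drop every word that appears in the stop list (comparison is casefolded)."""
    return [word for word in words if word.casefold() not in stop_list]


# Hypothetical input sentence, already split into words.
kept = remove_stop_words("the history of the index is on the shelf".split())
```

Only "history", "index" and "shelf" survive, so the index stores three entries for this sentence instead of nine.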

Refinement - Stemming
Stemming attempts to accommodate the many variations of a word that express a single concept.
This avoids exceedingly long “or” query statements.
Example: inquiry or inquired or inquiries
Stemming is performed after the stop list process.
Porter stemming algorithm:
– Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3): 130-137.

Stemming - Suffix
Most English meaning shifts for grammatical purposes are handled by suffixes.
Most retrieval systems allow for “trailing” (suffix) truncation.
Example:
– “inquir$” will retrieve documents containing the words “inquire”, “inquired”, “inquires”, “inquiring”, “inquiry”, etc.
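Trailing truncation amounts to prefix matching against the indexed vocabulary. The sketch below assumes that "$" is only meaningful as the last character of a query term; the function names and the small vocabulary are hypothetical.

```python
def matches(pattern, word):
    """Return True if word satisfies pattern.

    A pattern ending in "$" matches any word that starts with the text before
    the "$"; any other pattern must match the whole word. Case is ignored.
    """
    word = word.casefold()
    pattern = pattern.casefold()
    if pattern.endswith("$"):
        return word.startswith(pattern[:-1])
    return word == pattern


def expand(pattern, vocabulary):
    """List, in sorted order, the vocabulary words that the pattern retrieves."""
    return sorted(word for word in vocabulary if matches(pattern, word))


# Hypothetical index vocabulary.
vocabulary = ["inquire", "inquired", "inquires", "inquiring", "inquiry", "index", "require"]
expanded = expand("inquir$", vocabulary)
```

The single term "inquir$" expands to the five "inquir" words, while "index" and "require" are left out.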

Stemming - Prefix
Prefix stemming is usually not used in English text retrieval systems.
A prefix is a substantial modifier, and may even be a negation.
Example:
– responsible, irresponsible
– patient, impatient
Prefix stemming may be useful in chemical databases.

Stemming – Exception List
Irregularities in the language need to be implemented as a “lookup list”.
Example:
– Irregular plurals
> woman => women
> child => children
– Irregular past tense
> choose => chose
> find => found
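An exception list can be sketched as a dictionary that is consulted before any general suffix rule. The four entries below are the examples from the slide, stored from the irregular form back to the base form. The fallback rule, which strips a single trailing "s", is a deliberately naive stand-in and is not the Porter algorithm; the function name normalise is also an assumption.

```python
# Lookup list built from the examples above: irregular form -> base form.
EXCEPTIONS = {
    "women": "woman",
    "children": "child",
    "chose": "choose",
    "found": "find",
}


def normalise(word, exceptions=EXCEPTIONS):
    """Return the base form of a word, checking the exception list first."""
    word = word.casefold()
    if word in exceptions:
        return exceptions[word]
    # Naive fallback (an assumption, not Porter): strip one plural "s",
    # but leave short words and words ending in "ss" alone.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word
```

With this ordering, "children" maps to "child" through the lookup list, while a regular plural such as "documents" falls through to the suffix rule.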

Summary
Text retrieval systems:
– motivation
– model
Indexing refinements:
– Stop list
– Stemming
– Term weight (next week)
