www.monash.edu.au CSE3201/CSE4500 Information Retrieval Systems: Introduction to Information Retrieval


CSE3201/CSE4500 Information Retrieval Systems Introduction to Information Retrieval

2 Database Types
A spectrum from highly-structured to ill-structured:
–Relational DB
–XML collections
–Text Collections
–Multimedia Collections

3 Ill-structured data
Characteristics:
–Variable length records, fields
–Repeated fields [non-normalised]
–Mixed media
–Often large
Often accessed by “novice” users
Need for both currency and completeness

4 Information Retrieval
Information retrieval has been the term applied to such areas as:
–text retrieval systems, library systems, citation retrieval systems, records management and archives, photo library applications etc.
These systems are typical of variable-length record systems.
Text retrieval is a subset of Information Retrieval.
–Research articles may use the term IR to mean text retrieval, especially in the 70s, 80s and 90s.

5 Text Retrieval - Overview
Information retrieval – a branch of database theory
–specialises in managing retrieval of unstructured data
–deals with large amounts of free-format text.
Key problem:
–How to retrieve the appropriate pieces of unstructured data (e.g. documents) in response to a more or less structured query.
Response to a query:
–Does not answer the query directly
–Identifies relevant information.

6 Text Retrieval Characteristics
The document space is large.
The document space may or may not be structured.
The query may not be structured.
Exact matching, as used in relational databases, will not work effectively.
The objects to be retrieved are usually represented by surrogate records.

7 Surrogate Records
Most text retrieval systems rely on surrogate records rather than directly accessing the objects themselves.
The quality of the surrogate records often decides how well the system retrieves.
The structure of the surrogate records will affect how well they can be indexed or otherwise accessed.

8 Text Retrieval Processes
Representation
Storage
Organization
Retrieval
Presentation

9 Text Retrieval Processes Model

10 Retrieval Process

11 Indexing (Document Analysis)
[Figure: Document (natural language text) -> ANALYSE -> Keywords -> STORE. The analysis step covers stemming, thesauri replacement and (weight assignment).]

12 Query Formulation

CSE3201/CSE4500 Information Retrieval Systems Indexing in Information Retrieval

14 Indexed Files in Traditional Databases
An index is a lookup table which establishes a correspondence between a particular attribute (or attributes) and the address of the record in the file.
One named (physical) file - two logical files:
–Data file - contains full data records
–Index file - “records” consist of two fields: key value and address
Index file small - quick to search
Addresses obtained from the index enable direct access to the data file
Logically sequential access also via index
The main purpose of indices in a relational database is to improve the efficiency of retrieval.
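The two logical files described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the `key` and `dept` fields, and the helper names `lookup` and `scan_in_key_order` are hypothetical, and a list position stands in for a disk address.

```python
from bisect import bisect_left

# Data file: full records in arbitrary physical order (hypothetical sample data).
# The list position plays the role of the record's address.
data_file = [
    {"key": "smith", "dept": "History"},
    {"key": "adams", "dept": "Physics"},
    {"key": "jones", "dept": "Law"},
]

# Index file: small "records" of two fields, key value and address, sorted by key.
index_file = sorted((rec["key"], addr) for addr, rec in enumerate(data_file))
index_keys = [key for key, _ in index_file]


def lookup(key):
    """Binary-search the index, then access the data file directly by address."""
    pos = bisect_left(index_keys, key)
    if pos < len(index_keys) and index_keys[pos] == key:
        _, addr = index_file[pos]
        return data_file[addr]
    return None


def scan_in_key_order():
    """Logically sequential access: follow the index, not the physical order."""
    return [data_file[addr] for _, addr in index_file]
```

Calling `lookup("jones")` searches only the small index before touching the data file, and `scan_in_key_order()` returns the records in key order even though they are stored unsorted.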

15 Indexed Sequential File
[Figure: an index pointing to data records.]

16 Indexing in Information Retrieval
The aim is to create a representation of the document (surrogate record).
The purpose of the indices is to improve both efficiency and effectiveness.
Effectiveness is measured by whether the system can retrieve the “relevant” documents in response to a user’s query.
A document is a collection of words; each of these words can be considered as an index entry.
–An index record can be created using doc-id and word-id as its fields.
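A minimal sketch of such index records, assuming whitespace-separated words and word-ids assigned in order of first appearance. The sample documents and the names `vocabulary`, `index_records` and `docs_containing` are hypothetical, chosen only for illustration.

```python
# Hypothetical documents keyed by doc-id.
docs = {
    1: "computer technology introduction",
    2: "economic policy",
}

vocabulary = {}     # word -> word-id, assigned in order of first appearance
index_records = []  # index records with two fields: (doc-id, word-id)

for doc_id, text in docs.items():
    for word in text.split():
        word_id = vocabulary.setdefault(word, len(vocabulary) + 1)
        record = (doc_id, word_id)
        if record not in index_records:  # one record per distinct word in a document
            index_records.append(record)


def docs_containing(word):
    """Return the doc-ids whose index records mention the given word."""
    word_id = vocabulary.get(word)
    return sorted({d for d, w in index_records if w == word_id})
```

A query term is first translated to its word-id and then matched against the index records, so the documents themselves are never scanned at query time.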

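17 Indexing in Text Retrieval Systems
[Figure: an index table with columns Doc-id, Word-id, weight and Word (row values 1755, 1126, 2161, 2382, ...; words comput, Tech, Intro, Program, Policy, Econom, Share, ...) whose entries point to the data records Doc-1 and Doc-2.]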

18 Objectives of Indexing Process
The indexing process should produce:
–a sufficiently general description of a document, so that it can be retrieved with queries that concern the same subject as the document;
–a sufficiently specific description, so that the document will not be returned for queries which are not related to the document.

19 Vocabulary in the System Controlled Vocabulary vs Uncontrolled Vocabulary.

20 Controlled Vocabulary
Controlled vocabulary is a method of predetermining the terms which will be used in a specific domain so that
–indexers will select from a limited set of terms
–searchers can use terms knowing that they have been applied in an objective manner
–index sets are reduced in size
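One way to picture a controlled vocabulary is as a fixed set of permitted descriptors plus a table that maps free-text variants onto them. The sketch below is an assumption made for illustration; the terms, the `use_for` table and the `to_descriptor` helper are hypothetical and do not come from any particular thesaurus.

```python
# Hypothetical controlled vocabulary for a small domain.
controlled_vocabulary = {"computers", "economics", "law"}

# Hypothetical mapping from free-text variants to the permitted descriptor.
use_for = {
    "pc": "computers",
    "computing": "computers",
    "economy": "economics",
}


def to_descriptor(term):
    """Map a term onto the controlled vocabulary, or None if it is not covered."""
    term = term.casefold()
    if term in controlled_vocabulary:
        return term
    return use_for.get(term)
```

Because indexers and searchers both pass their terms through the same mapping, they end up with the same limited set of descriptors, which is what keeps the index sets small.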

21 Indexing
Manual indexing
Automatic indexing

22 Manual Indexing Methods
1. Give the document a single code from a predefined list. e.g.:
–the first letter of the first author’s family name
–a Dewey Decimal number
2. Assign several codes from a predefined list to a document. e.g.:
–assign the Computing Reviews classification to articles.
3. Assign to each document a set of descriptors that are not predefined. The descriptors may be words from the text of the document and/or from a thesaurus.

23 Manual Indexing – is it good?
Single code from a predefined list.
–simple and low index cost.
–may not be effective for retrieval.
All other techniques
–require a more complex index to be maintained.
–effectiveness may be better compared to the single code approach.

24 Manual Indexing – is it good?
Advantage: terms that never appear in the text but are extremely descriptive may be assigned to the document.
Disadvantages:
–Inconsistency due to human judgment
>Two people may have different interpretations of a document’s content.
–Inflexible view of documents
>The user’s view of a document will be dictated by the system’s view of a document.
–No control over the number of satisfying documents.
>A document may be assigned to different groups depending on the understanding of the person who performs the indexing.

25 Automatic Indexing - A Basic Method
Assume that a document consists of just text and that we will derive our indexing terms from this text.
Steps:
–Break the text up into words,
–casefold,
–and index on every word.
This technique is very simple and performs reasonably well.
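The three steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes that a "word" is any run of ASCII letters or digits, and the `build_index` name and sample documents are hypothetical.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map every casefolded word to the set of doc-ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Step 1: break the text up into words (assumed: runs of letters/digits).
        for word in re.findall(r"[A-Za-z0-9]+", text):
            # Steps 2 and 3: casefold, then index on every word.
            index[word.casefold()].add(doc_id)
    return index


# Hypothetical sample documents.
docs = {
    1: "Text Retrieval systems",
    2: "Retrieval of TEXT and images",
}
index = build_index(docs)
```

After casefolding, "Text" and "TEXT" fall under the same entry, so `index["text"]` covers both documents.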

26 Automatic Indexing - Refinement
Language dependent.
–refinements for English will differ from those for Chinese
Techniques for refining the indexing process:
–Stop List
–Stemming
–Term Weighting

27 Indexing Refinement – Stop List
A stop list contains a list of common words.
A common word does not help the system to infer the content of a document.
–e.g. the, a, and, she, them
Generally contains words that are NOT nouns, verbs, adjectives and adverbs.
A stop list might consist of a, the, an, is, be, ....
Common stop lists run from 10 to hundreds of words.
–The exact choice of stop list matters little; typically around 300 common words will do well.
The indexing process will ignore the words listed in the stop list.
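A minimal sketch of stop-list filtering during indexing. The stop list here contains only the example words from the slide, far fewer than the roughly 300 words suggested for practical use, and the `index_terms` helper is a hypothetical name.

```python
# Tiny illustrative stop list built from the examples on the slide.
STOP_LIST = {"a", "an", "the", "and", "is", "be", "she", "them"}


def index_terms(text):
    """Casefold the text, split on whitespace and drop words in the stop list."""
    return [word for word in text.casefold().split() if word not in STOP_LIST]
```

For instance, `index_terms("She is the author and a librarian")` keeps only the two content-bearing words, `author` and `librarian`.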

28 Stop Lists
Fox indicates that the first 20 stop words account for 31.19% of the English corpus.
>Fox C. (1992). Lexical Analysis and Stoplists. In Frakes W.B. and Baeza-Yates R. (Eds.), Information Retrieval: Data Structures and Algorithms, Englewood Cliffs, NJ: Prentice-Hall.
The first 20 stop words:
–the, of, and, to, a, in, that, is, was, he, for, it, with, as, not, his, on, be, at, by.
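The kind of measurement behind such a figure can be sketched as the share of word occurrences in a text that belong to the 20 listed words. The `stop_word_share` helper is hypothetical, and running it on a small sample will not reproduce the 31.19% reported for a large corpus.

```python
# The first 20 stop words as listed on the slide.
FIRST_20 = {
    "the", "of", "and", "to", "a", "in", "that", "is", "was", "he",
    "for", "it", "with", "as", "not", "his", "on", "be", "at", "by",
}


def stop_word_share(tokens):
    """Fraction of word occurrences that are one of the first 20 stop words."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.casefold() in FIRST_20)
    return hits / len(tokens)
```

On the six-word sample "The cat sat on the mat", three occurrences ("The", "on", "the") are stop words, so the share is 0.5.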

29 Refinement - Stemming
Stemming attempts to accommodate the many word variations that express a single concept.
This avoids exceedingly long “or” query statements.
Example: inquiry or inquired or inquiries
Stemming is performed after the “stop list” process.
Porter stemming algorithm
–Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3).

30 Stemming - Suffix
Most English meaning shifts for grammatical purposes are handled by suffixes.
Most retrieval systems allow for “trailing” (suffix) truncation.
Example:
–“inquir$” will retrieve documents containing the words “inquire”, “inquired”, “inquires”, “inquiring”, “inquiry” etc.
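Trailing truncation can be sketched as a prefix match against the indexed vocabulary. The sketch assumes that `$` is only meaningful as the last character of the query term; the `truncation_match` helper and the sample vocabulary are hypothetical.

```python
def truncation_match(pattern, vocabulary):
    """Return the vocabulary words matched by a term with optional trailing '$'."""
    if pattern.endswith("$"):
        stem = pattern[:-1]
        return sorted(word for word in vocabulary if word.startswith(stem))
    # Without the truncation symbol, only an exact match counts.
    return sorted(word for word in vocabulary if word == pattern)


# Hypothetical indexed vocabulary.
vocabulary = {
    "inquire", "inquired", "inquires", "inquiring", "inquiry",
    "inquest", "require",
}
```

Here `truncation_match("inquir$", vocabulary)` returns the five "inquir..." words and leaves out "inquest" and "require", which replaces the long "or" query the user would otherwise have to write.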

31 Stemming - Prefix
Prefix stemming is not usually used in English text retrieval systems.
A prefix is a substantial modifier of meaning, and may even negate the word.
Example:
–Responsible, irresponsible
–Patient, impatient
Prefix stemming may be useful in chemical databases.

32 Stemming – Exception List
Irregularities in the language need to be handled with a “lookup list”.
Example:
–Irregular plurals
>woman => women
>child => children
–Past tense
>choose => chose
>find => found
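A minimal sketch of how an exception list can sit in front of suffix stripping. The suffix rules below are a deliberately naive assumption made for illustration and are not the Porter algorithm; the `normalise` helper, the suffix tuple and the minimum stem length of three are all hypothetical.

```python
# Lookup list for irregular forms, using the examples from the slide.
EXCEPTIONS = {
    "women": "woman",
    "children": "child",
    "chose": "choose",
    "found": "find",
}

# Naive suffix rules, checked in this order (assumption, NOT the Porter stemmer).
SUFFIXES = ("ies", "ing", "ed", "es", "s")


def normalise(word):
    """Consult the exception list first, then strip one common suffix."""
    word = word.casefold()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

Checking the lookup list first matters: no suffix rule could turn "children" into "child" or "found" into "find", while regular forms such as "inquired" and "inquiries" both reduce to "inquir".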

33 Summary
Text Retrieval Systems:
–motivation
–model
Indexing
Refinements:
–Stop List
–Stemming
–Term Weight (next week)