WMES3103 : INFORMATION RETRIEVAL


WMES3103 : INFORMATION RETRIEVAL TEXT OPERATIONS

INTRODUCTION
- Not all words in a document are equally significant for representing its content or meaning.
- Some words carry more meaning than others.
- Nouns and noun groups are usually the most representative of a document's content.
- Therefore, the text of the documents in a collection needs to be preprocessed to select the words to be used as index terms.

- Using the set of all words in a collection to index documents introduces too much noise for the retrieval task.
- Reducing noise means reducing the number of words that can be used to refer to a document.
- Preprocessing = the process of controlling the size of the vocabulary, i.e. the number of distinct words used as index terms.
- Preprocessing is expected to improve information retrieval performance.

- However, some search engines on the Web omit preprocessing: every word in the document is an index term.
- This is supposed to make the retrieval task simpler and easier for the user.

DOCUMENT PREPROCESSING
Text operations = text transformations. 5 main operations:
a. Lexical analysis of the text: handling of digits, hyphens, punctuation marks, and the case of letters
b. Elimination of stopwords: filter out words which are not useful in the retrieval process
c. Stemming of the remaining words: remove affixes (prefixes and suffixes)

d. Selection of index terms: choose the words/stems (or groups of words) to be used as indexing terms
e. Construction of term categorization structures, such as a thesaurus, or extraction of structure directly represented in the text, to allow the expansion of the original query with related terms
a - d = production of a set of good index terms
e = building of categorization hierarchies to capture term relationships

LEXICAL ANALYSIS OF TEXT
- Converts the text of the documents into words to be adopted as index terms. Objective: identify the words in the text.
- Four cases need care: digits, hyphens, punctuation marks, and the case of letters.
- Digits: numbers on their own are usually not good index terms (eg. 1910, 1999), but some are significant in context (eg. 510 B.C.).
- Hyphens: usually break up the words (eg. state-of-the-art = state of the art), but some words require their hyphens (eg. gilt-edged, B-49).
- Punctuation marks: remove entirely unless significant (eg. in program code, x.id and xid are different).
- Case of letters: usually not important; all text can be converted to either upper or lower case.
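The lexical analysis rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tokenize`, the `KEEP_HYPHENATED` whitelist and the `drop_numbers` flag are hypothetical and do not come from the slides. The sketch also ignores context-dependent cases, so "510 B.C." and "x.id" are not treated specially.

```python
import re

# Hypothetical whitelist of hyphenated forms to keep whole (examples from the slide).
KEEP_HYPHENATED = {"gilt-edged", "b-49"}

def tokenize(text, drop_numbers=True):
    """Split text into lowercase candidate index terms."""
    tokens = []
    # A raw token is a run of letters/digits, optionally joined by hyphens.
    # All other punctuation acts as a separator and is discarded.
    for raw in re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text):
        word = raw.lower()  # case of letters is not significant
        if "-" in word and word not in KEEP_HYPHENATED:
            parts = word.split("-")  # break up ordinary hyphenated words
        else:
            parts = [word]
        for part in parts:
            if drop_numbers and part.isdigit():
                continue  # plain numbers are assumed to be poor index terms
            tokens.append(part)
    return tokens

print(tokenize("State-of-the-art gilt-edged bonds, 1999."))
```

A real system would decide which numbers and punctuation marks to keep from the collection and its domain, not from a fixed whitelist.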

ELIMINATION OF STOPWORDS
- A word which occurs in 80% of the documents in a collection is useless for retrieval. Such words are called stopwords and are filtered out as potential index terms (eg. articles, prepositions, conjunctions).
- Reduces the size of the indexing structure, which can be compressed by about 40%.
- Some verbs, adverbs and adjectives can also be treated as stopwords.
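The 80% rule above can be read as a document-frequency threshold. The following sketch shows one way to apply it. The function name `find_stopwords`, the whitespace tokenization and the toy collection are assumptions made for illustration.

```python
def find_stopwords(documents, threshold=0.8):
    """Return the words that occur in at least `threshold` of the documents."""
    doc_freq = {}
    for doc in documents:
        # Count each word once per document (document frequency, not term frequency).
        for word in set(doc.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    cutoff = threshold * len(documents)
    return {word for word, df in doc_freq.items() if df >= cutoff}

# Toy collection, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang in the tree",
    "a fish swam in the pond",
    "the sun rose over the hill",
]
print(find_stopwords(docs))
```

In this toy collection only "the" reaches the threshold. In practice, stoplists are usually precompiled and not derived from each collection.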

- A list of 425 stopwords is given in W. B. Frakes and R. Baeza-Yates. Information retrieval : data structures & algorithms. Englewood Cliffs : Prentice Hall, 1992. Programs in C for lexical analysis are also provided there.
- Elimination of stopwords might reduce recall (eg. for “To be or not to be”, everything is eliminated except “be”, so nothing or only irrelevant documents are retrieved).
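The recall problem mentioned above is easy to reproduce. The sketch below uses a tiny stoplist invented for illustration (it is not the 425-word list) and a hypothetical helper `remove_stopwords`.

```python
# A tiny illustrative stoplist; it is NOT the 425-word list cited above.
STOPWORDS = {"a", "an", "and", "in", "is", "not", "of", "or", "the", "to"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stoplist."""
    return [t for t in tokens if t not in stopwords]

query = "to be or not to be".split()
print(remove_stopwords(query))  # only the two occurrences of "be" survive
```

With only "be" left, the phrase can no longer be matched as such, which is why stopword elimination can hurt recall for queries like this one.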

STEMMING
- Stem = the portion of a word which is left after the removal of its affixes (i.e. prefixes and suffixes).
- Reduces variants of the same root to a common concept (eg. connect is the stem of connected, connecting, connection and connections).
- Reduces the size of the indexing structure because the number of distinct index terms is reduced.
- Many Web search engines do not use stemming.
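The effect of stemming on the number of distinct index terms can be illustrated with a deliberately naive suffix stripper. This is a sketch only: the suffix list and the `min_stem` guard are assumptions, prefixes are not handled, and this is not a published stemming algorithm.

```python
# Hypothetical suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ions", "ion", "ing", "ed", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

variants = ["connect", "connected", "connecting", "connection", "connections"]
stems = {naive_stem(w) for w in variants}
print(stems)  # five variants collapse to one index term
```

A rule set this small makes many mistakes on real text (for example, it would turn "bus" into "bu" if `min_stem` were lowered to 2), so it only shows the idea of reducing vocabulary size.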

INDEX TERM SELECTION
- If a full text representation is adopted, all words in the text are used as index terms (full text indexing).
- Otherwise, the words to be used as index terms need to be selected; not all words will be chosen.
- In the bibliographic sciences, selection is done by a specialist.
- The alternative is automatic selection.
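The slides do not say how automatic selection is done. As one simple possibility, and purely as an assumption for illustration, the sketch below keeps the k most frequent non-stopword terms of a document. The function name `select_index_terms` and the sample data are hypothetical.

```python
from collections import Counter

def select_index_terms(tokens, stopwords, k=3):
    """Pick the k most frequent non-stopword tokens (ties broken alphabetically)."""
    counts = Counter(t for t in tokens if t not in stopwords)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

tokens = "retrieval of text and indexing of text for text retrieval".split()
print(select_index_terms(tokens, stopwords={"of", "and", "for"}, k=2))
```

Other automatic approaches, such as selecting noun groups, need part-of-speech information and are outside the scope of this sketch.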

THESAURI
Consists of (words and concepts):
- a precompiled list of important words in a given discipline
- for each word, a set of related words
Aim:
- to provide a standard vocabulary for indexing and searching
- to assist users with locating terms for proper query formulation
- to provide classified hierarchies that allow the broadening and narrowing of the current request according to user needs

- Main components of a thesaurus: index terms, relationships among terms (BT = broader term, NT = narrower term, RT = related term), and a layout design for the term relationships; sometimes also a definition or explanation to tell apart terms such as seal (animal) and seal (document).
- A controlled vocabulary for indexing and searching is useful for an established body of knowledge with established terms.
- For the Web, it is an open question whether a thesaurus or free-text searching is more appropriate.

eg. Yahoo presents the user with a term classification hierarchy that reduces the space to be searched.
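The thesaurus components described above (index terms linked by BT, NT and RT relationships) map naturally onto a dictionary, and broadening or narrowing a request becomes a lookup. The entries below are invented for illustration, and `expand_query` is a hypothetical helper.

```python
# Toy thesaurus; all entries are made up for this example.
THESAURUS = {
    "seal (animal)": {"BT": ["marine mammal"], "NT": ["harbor seal"], "RT": ["sea lion"]},
    "seal (document)": {"BT": ["authentication mark"], "NT": [], "RT": ["stamp"]},
}

def expand_query(term, relations=("NT", "RT")):
    """Return the term followed by its related terms for the given relations."""
    entry = THESAURUS.get(term, {})
    expanded = [term]
    for rel in relations:
        expanded.extend(entry.get(rel, []))
    return expanded

print(expand_query("seal (animal)"))                     # narrower and related terms
print(expand_query("seal (animal)", relations=("BT",)))  # broaden the request
```

Following BT links broadens the request and following NT links narrows it. A term that is missing from the thesaurus is returned unchanged.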

OTHERS
- Document clustering: groups similar or related documents into classes. It is an operation on all the documents in the collection, not an operation on the text of a single document.
- Text compression: represents the text in fewer bits and bytes, which greatly reduces the space needed to store it and the time needed to transmit it. The original text can be reconstructed from the compressed text.
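The two properties of text compression noted above (less space, original text recoverable) can be checked with Python's standard `zlib` module. The sample text is an assumption chosen to be highly repetitive, so the saving shown is larger than on typical prose.

```python
import zlib

# Repetitive sample text, invented for this example.
text = ("information retrieval " * 50).encode("utf-8")

compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

print(len(text), "bytes before,", len(compressed), "bytes after")
print("lossless:", restored == text)
```

`zlib` is a general-purpose compressor; the slides do not name a specific method, so it stands in here only to demonstrate the round trip.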