Text Operations: Preprocessing

Introduction
Document preprocessing
– to improve the precision of documents retrieved
– lexical analysis, stopword elimination, stemming, index term selection, thesauri
– build a thesaurus

Document Preprocessing
Lexical analysis of the text
– digits, hyphens, punctuation marks, the case of letters
Elimination of stopwords
– filtering out words that are useless for retrieval purposes
Stemming
– dealing with the syntactic variations of query terms
Index term selection
– determining the terms to be used as index terms
Thesauri
– the expansion of the original query with related terms

The Process of Preprocessing
[Figure: documents (Docs) pass through structure, lexical analysis, stopwords, noun groups, stemming, and manual indexing; the resulting views are labeled structure, full text, and index terms.]

Lexical Analysis of the Text
Four particular cases
Numbers
– usually not good index terms because of their vagueness
– need some advanced lexical analysis procedure
– ex) 510B.C., 12/2/2000, …
Hyphens
– breaking up hyphenated words might be useful
– ex) state-of-the-art → state of the art (Good!)
– but, B-1 → B 1 (???)
– need to adopt a general rule and to specify exceptions on a case-by-case basis

Lexical Analysis of the Text
Punctuation marks
– removed entirely
  ex) 510B.C → 510BC
  if the query contains ‘510B.C’, removing the dot from both the query term and the documents will not affect retrieval performance
– require the preparation of a list of exceptions
  ex) val.id → valid (???)
The case of letters
– converts all the text to either lower or upper case
– part of the semantics might be lost
  ex) Northwestern University → northwestern university (???)
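The general-rule-plus-exceptions idea from the two lexical analysis slides can be sketched in Python. This is only an illustration under stated assumptions: the function name `lexical_analysis`, the `EXCEPTIONS` set, and the exact order of the rules are hypothetical and do not come from the slides.

```python
import re

# Hypothetical exception list: tokens whose hyphen or dot carries meaning.
EXCEPTIONS = {"B-1", "val.id"}

def lexical_analysis(text, exceptions=EXCEPTIONS):
    """Turn raw text into tokens using general rules plus an exception list."""
    tokens = []
    for raw in text.split():
        # Drop punctuation that only surrounds a word (commas, quotes, ...).
        word = raw.strip(",;:!?\"'()")
        if not word:
            continue
        if word in exceptions:
            tokens.append(word)  # exceptions are kept untouched
            continue
        # General rule 1: break hyphenated words into their parts.
        for part in word.split("-"):
            # General rule 2: remove remaining punctuation (510B.C. -> 510BC).
            part = re.sub(r"[^\w]", "", part)
            if part:
                # General rule 3: fold everything to lower case.
                tokens.append(part.lower())
    return tokens

print(lexical_analysis("A state-of-the-art B-1 bomber, 510B.C."))
print(lexical_analysis("Northwestern University"))
```

The second call shows the cost noted on the slide: after case folding, the proper name can no longer be told apart from the ordinary words.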

Elimination of Stopwords
Basic concept
– filtering out words with very low discrimination values
  ex) a, the, this, that, where, when, …
Advantage
– reduces the size of the indexing structure considerably
Disadvantage
– might reduce recall as well
  ex) “to be or not to be”
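A minimal sketch of stopword elimination, and of the recall problem it can cause, might look like this. The `STOPWORDS` set is an assumption: it extends the slide's examples with "to", "be", "or", and "not" so that the phrase from the slide is affected.

```python
# Assumed stopword list: the slide's examples plus a few extra words.
STOPWORDS = {"a", "the", "this", "that", "where", "when", "to", "be", "or", "not"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword list."""
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords(["this", "index", "structure"]))
print(remove_stopwords("to be or not to be".split()))
```

With this list, the query "to be or not to be" is reduced to nothing, so no document can be matched on it. That is the recall loss the slide warns about.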

Stemming
What is the “stem”?
– the portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes)
– ex) ‘connect’ is the stem for the variants ‘connected’, ‘connecting’, ‘connection’, ‘connections’
Effect of stemming
– reduces variants of the same root to a common concept
– reduces the size of the indexing structure
– controversy about the benefits of stemming
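The ‘connect’ example can be reproduced with a deliberately naive suffix stripper. This is not a real stemming algorithm such as Porter's; the `SUFFIXES` tuple, the `min_stem` threshold, and the function name `naive_stem` are assumptions chosen only to cover the slide's example.

```python
# Assumed suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

variants = ["connect", "connected", "connecting", "connection", "connections"]
print({naive_stem(w) for w in variants})
print(naive_stem("united"), naive_stem("units"))
```

All five variants collapse to the single stem ‘connect’, which shrinks the index. The last line hints at why the benefits are debated: this sketch also maps ‘united’ and ‘units’ to the same stem, although a searcher may not consider them the same concept.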

Index Term Selection
Index term selection
– not all words are equally significant for representing the semantics of a document
Manual selection
– selection of index terms is usually done by specialists
Automatic selection of index terms
– most of the semantics is carried by the nouns
– clustering nouns which appear near each other in the text into a single indexing component (or concept)
– ex) computer science
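Grouping nearby nouns into a single indexing component could be sketched as follows. A real system would use a part-of-speech tagger to find the nouns; here the noun set is supplied by hand, and the names `noun_groups` and `max_gap` (the largest allowed distance, in tokens, between two nouns of the same group) are hypothetical.

```python
def noun_groups(tokens, nouns, max_gap=1):
    """Merge nouns that lie at most max_gap tokens apart into one component."""
    groups, current, last_pos = [], [], None
    for pos, tok in enumerate(tokens):
        if tok not in nouns:
            continue
        # Too far from the previous noun: close the current group.
        if last_pos is not None and pos - last_pos > max_gap:
            groups.append(" ".join(current))
            current = []
        current.append(tok)
        last_pos = pos
    if current:
        groups.append(" ".join(current))
    return groups

text = "the department of computer science offers a course on information retrieval"
# Hand-labelled nouns, standing in for the output of a part-of-speech tagger.
nouns = {"department", "computer", "science", "course", "information", "retrieval"}
print(noun_groups(text.split(), nouns))
```

With `max_gap=1` only adjacent nouns are merged, so ‘computer science’ becomes one index term while ‘department’ and ‘course’ stay on their own.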

Thesauri
What is the “thesaurus”?
– a list of important words in a given domain of knowledge
– a set of related words derived from a synonymity relationship
– a controlled vocabulary for indexing and searching
Main purposes
– provide a standard vocabulary for indexing and searching
– assist users with locating terms for proper query formulation
– provide classified hierarchies that allow the broadening and narrowing of the current query request

Thesauri
Thesaurus index terms
– denote a concept which is the basic semantic unit
– can be individual words, groups of words, or phrases
  ex) building, teaching, ballistic missiles, body temperature
– frequently, it is necessary to complement a thesaurus entry with a definition or an explanation
  ex) seal (marine animals), seal (documents)
Thesaurus term relationships
– mostly composed of synonyms and near-synonyms
– BT (Broader Term), NT (Narrower Term), RT (Related Term)
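One way to picture BT/NT/RT relationships and their use for broadening or narrowing a query is a small dictionary-based sketch. The entries in `THESAURUS` are invented for illustration (only ‘ballistic missiles’ appears on the slides), and `expand_query` is a hypothetical helper, not part of any particular system.

```python
# Hypothetical thesaurus entries: each term lists its broader (BT),
# narrower (NT), and related (RT) terms.
THESAURUS = {
    "missiles": {
        "BT": ["weapons"],
        "NT": ["ballistic missiles"],
        "RT": ["rockets"],
    },
    "ballistic missiles": {"BT": ["missiles"], "NT": [], "RT": []},
}

def expand_query(terms, thesaurus, relations=("NT",)):
    """Append thesaurus terms reachable through the chosen relations."""
    expanded = list(terms)
    for term in terms:
        entry = thesaurus.get(term, {})
        for rel in relations:
            for related in entry.get(rel, []):
                if related not in expanded:
                    expanded.append(related)
    return expanded

print(expand_query(["missiles"], THESAURUS, relations=("BT",)))  # broaden
print(expand_query(["missiles"], THESAURUS, relations=("NT",)))  # narrow
```

Following BT links broadens the query toward more general concepts, while following NT links narrows it; terms missing from the thesaurus are left unchanged.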