1 Chapter 7 Text Operations. 2 Logical View of a Document document structure recognition text+ structure accents, spacing, etc. stopwords noun groups.

Slides:



Advertisements
Similar presentations
Chapter 5: Introduction to Information Retrieval
Advertisements

Introduction to Information Retrieval
Properties of Text CS336 Lecture 3:. 2 Generating Document Representations Want to automatically generate with little human intervention Use significant.
Stemming Algorithms 資訊擷取與推薦技術:期中報告 指導教授:黃三益 老師 學生: 黃哲修 張家豪.
Intelligent Information Retrieval CS 336 –Lecture 3: Text Operations Xiaoyan Li Spring 2006.
CSE3201/CSE4500 Information Retrieval Systems Introduction to Information Retrieval.
CS 430 / INFO 430 Information Retrieval
1 Discussion Class 3 The Porter Stemmer. 2 Course Administration No class on Thursday.
CS 430 / INFO 430 Information Retrieval
Text Operations: Preprocessing. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis, stopwords elimination,
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
Association Clusters Definition The frequency of a stem in a document,, is referred to as. Let be an association matrix with rows and columns, where. Let.
Search Strategies Online Search Techniques. Universal Search Techniques Precision- getting results that are relevant, “on topic.” Recall- getting all.
1 CS 430 / INFO 430 Information Retrieval Lecture 5 Searching Full Text 5.
The College of Saint Rose CIS 460 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice, 1st.
WMES3103 : INFORMATION RETRIEVAL
Chapter 5: Query Operations Baeza-Yates, 1999 Modern Information Retrieval.
Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
1 CS 502: Computing Methods for Digital Libraries Lecture 12 Information Retrieval II.
1 Query Language Baeza-Yates and Navarro Modern Information Retrieval, 1999 Chapter 4.
Properties of Text CS336 Lecture 3:. 2 Information Retrieval Searching unstructured documents Typically text –Newspaper articles –Web pages Other documents.
What is a document? Information need: From where did the metaphor, doing X is like “herding cats”, arise? quotation? “Managing senior programmers is like.
Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang National Central University
1 Basic Text Processing and Indexing. 2 Document Processing Steps Lexical analysis (tokenizing) Stopwords removal Stemming Selection of indexing terms.
Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,
CSE 730 Information Retrieval of Biomedical Data The use of medical lexicon in biomedical IR.
Modern Information Retrieval Chapter 7: Text Operations Ricardo Baeza-Yates Berthier Ribeiro-Neto.
Prepared By : Loay Alayadhi Supervised by: Dr. Mourad Ykhlef
CS 430 / INFO 430 Information Retrieval
1 Terms and Query Operations Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice.
Query Operations: Automatic Global Analysis. Motivation Methods of local analysis extract information from local set of documents retrieved to expand.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Modern Information Retrieval Chapter 7: Text Processing.
CIG Conference Norwich September 2006 AUTINDEX 1 AUTINDEX: Automatic Indexing and Classification of Texts Catherine Pease & Paul Schmidt IAI, Saarbrücken.
1 CS 430 / INFO 430 Information Retrieval Lecture 8 Query Refinement and Relevance Feedback.
Query Operations J. H. Wang Mar. 26, The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text.
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester
Data Structure. Two segments of data structure –Storage –Retrieval.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
LIS618 lecture 3 Thomas Krichel Structure of talk Document Preprocessing Basic ingredients of query languages Retrieval performance evaluation.
1 CS 430: Information Discovery Lecture 25 Cluster Analysis 2 Thesaurus Construction.
Computational linguistics A brief overview. Computational Linguistics might be considered as a synonym of automatic processing of natural language, since.
1 CS 430: Information Discovery Lecture 4 Files Structures for Inverted Files.
Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.
Basic Implementation and Evaluations Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
Web- and Multimedia-based Information Systems Lecture 2.
Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
1 Masters Thesis Presentation By Debotosh Dey AUTOMATIC CONSTRUCTION OF HASHTAGS HIERARCHIES UNIVERSITAT ROVIRA I VIRGILI Tarragona, June 2015 Supervised.
1 Language Specific Crawler for Myanmar Web Pages Pann Yu Mon Management and Information System Engineering Department Nagaoka University of Technology,
Information Retrieval
Text Operations J. H. Wang Feb. 21, The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text.
A Patent Document Retrieval System Addressing Both Semantic and Syntactic Properties Liang Chen*,Naoyuki Tokuda+, Hisahiro Adachi+ *University of Northern.
1 CS 430: Information Discovery Lecture 8 Automatic Term Extraction and Weighting.
Word classes and part of speech tagging. Slide 1 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches.
1 Discussion Class 3 Stemming Algorithms. 2 Discussion Classes Format: Question Ask a member of the class to answer Provide opportunity for others to.
Feature Assignment LBSC 878 February 22, 1999 Douglas W. Oard and Dagobert Soergel.
Selecting Relevant Documents Assume: –we already have a corpus of documents defined. –goal is to return a subset of those documents. –Individual documents.
Modern Information Retrieval Chapter 7: Text Operations Ricardo Baeza-Yates Berthier Ribeiro-Neto.
General Architecture of Retrieval Systems 1Adrienn Skrop.
Terms and Query Operations Hsin-Hsi Chen. Lexical Analysis and Stoplists.
SPAG Parent Workshop April Agenda English and the new SPaG curriculum How to help your children at home How we teach SPaG Sample questions from.
Information Retrieval in Practice
CS 430: Information Discovery
Chapter 7 Lexical Analysis and Stoplists
Discussion Class 3 Stemming Algorithms.
Information Retrieval and Web Design
Presentation transcript:

1 Chapter 7 Text Operations

2 Logical View of a Document document structure recognition text+ structure accents, spacing, etc. stopwords noun groups stemming automatic or manual indexing structure text full text index terms

3 Text Operations Lexical analysis of the text Elimination of stopwords Stemming Selection of index terms Construction of term categorization structures

4 Lexical Analysis of the Text Word separators  space  digits  hyphens  punctuation marks  the case of the letters

5 Lexical Analysis for Automatic Indexing Lexical Analysis Convert an input stream of characters into stream words or token. What is a word or a token? Tokens consist of letters.  digits: Most numbers are not good index terms. counterexamples: case numbers in a legal database, “ B6 ” and “ B12 ” in vitamin database.

6 Lexical Analysis for Automatic Indexing (Continued)  hyphens break hyphenated words: state-of-the- art, state of the art keep hyphenated words as a token: “ Jean-Claude ”, “ F-16 ”  punctuation marks: often used as parts of terms, e.g., OS/2, 510B.C.  case: usually not significant in index terms

7 Elimination of Stopwords A list of stopwords  words that are too frequent among the documents  articles, prepositions, conjunctions, etc. Can reduce the size of the indexing structure considerably Problem  Search for “ to be or not to be ” ?

8 Stopword Avoid retrieving almost very item in a database regardless of its relevance. Examples  conservative approach (ORBIT Search Service): and, an, by, from, of, the, with  (derived from Brown corpus): 425 words a, about, above, across, after, again, against, all, almost, alone, along, already, also, although, always, among, an, and, another, any, anybody, anyone, anything, anywhere, are, area, areas, around, as, ask, asked, asking, asks, at, away, b, back, backed, backing, backs, be, because, became,... Articles, prepositions, conjunctions, …

9 Stemming Example  connect, connected, connecting, connection, connections  effectiveness --> effective --> effect  picnicking --> picnic Removing strategies  affix removal: intuitive, simple  table lookup  successor variety  n-gram

10 Stemmers programs that relate morphologically similar indexing and search terms stem at indexing time  advantage: efficiency and index file compression  disadvantage: information about the full terms is lost example (CATALOG system), stem at search time Look for: system users Search Term: users TermOccurrences 1. user15 2. users1 3. used3 4. using2 The user selects the terms he wants by numbers

11 Conflation Methods manual automatic (stemmers)  affix removal longest match vs. simple removal  successor variety  table lookup  n-gram evaluation  correctness  retrieval effectiveness  compression performance TermStem engineeringengineer engineeredengineerengineer

12 Successor Variety Definition the number of different characters that follow it in words in some body of text Example a body of text: able, axle, accident, ape, about successor variety of apple 1st: 4 (b, x, c, p) 2nd: 1 (e)

13 Successor Variety (Continued) Idea The successor variety of substrings of a term will decrease as more characters are added until a segment boundary is reached, i.e., the successor variety will sharply increase. Example Test word: READABLE Corpus: ABLE, BEATABLE, FIXABLE, READS, READABLE, READING, RED, ROPE, RIPE PrefixSuccessor VarietyLetters R3E, O, I RE2A, D REA 1D READ3A, I, S READA1B READAB1L READABL1E READABLE 1blank

14 n-gram stemmers diagram a pair of consecutive letters shared diagram method association measures are calculated between pairs of terms where A: the number of unique diagrams in the first word, B: the number of unique diagrams in the second, C: the number of unique diagrams shared by A and B.

15 n-gram stemmers (Continued) Example statistics => st ta at ti is st ti ic cs unique diagrams => at cs ic is st ta ti statistical => st ta at ti is st ti ic ca al unique diagrams => al at ca ic is st ta ti

16 n-gram stemmers (Continued) similarity matrix determine the semantic measures for all pairs of terms in the database word 1 word 2 word 3...word n-1 word 1 wrod 2 S 21 word 3 S 31 S 32.. word n S n1 S n2 S n3 … S n(n-1) terms are clustered using a single link clustering method more a term clustering procedure than a stemming one

17 Q-gram stemmers Example Define Q=3 (L+Q-1) statistics => =12 { ##s,#st,sta,tat,ati,tis,ist,sti,tic,ics,cs#,s## } statistical => =13 {##s,#st,sta,tat,ati,tis,ist,sti,tic,ica,cal,al#,l##} Q-gram Distance = 4

18 Affix Removal Stemmers procedure Remove suffixes and/or prefixes from terms leaving a stem, and transform the resultant stem. example : plural forms If a word ends in “ ies ” but not “ eies ” or “ aies ” then “ ies ” --> “ y ” If a word ends in “ es ” but not “ aes ”, “ ees ”, or “ oes ” then “ es ” --> “ e ” If a word ends in “ s ”, but not “ us ” or “ ss ” then “ s ” --> NULL

19 Index Terms Selection Motivation  A sentence is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.  Most of the semantics is carried by the noun words. Identification of noun groups  A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold

20 Index Terms Selection Indexing by single words  single words are often ambiguous and not specific enough for accurate discrimination of documents  bank terminology vs. terminology bank Indexing by phrases  Syntactic phrases are almost always more specific than single words Indexing by single words and phrases

21 Thesauri Peter Roget, 1988 Example cowardly adj. Ignobly lacking in courage: cowardly turncoats Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang). A controlled vocabulary for the indexing and searching

22 The Purpose of a Thesaurus To provide a standard vocabulary for indexing and searching To assist users with locating terms for proper query formulation To provide classified hierarchies that allow the broadening and narrowing of the current query request

23 Functions of thesauri Provide a standard vocabulary for indexing and searching Assist users with locating terms for proper query formulation Provide classified hierarchies that allow the broadening and narrowing of the current query request

24 Usage Indexing Select the most appropriate thesaurus entries for representing the document. Searching Design the most appropriate search strategy.  If the search does not retrieve enough documents, the thesaurus can be used to expand the query.  If the search retrieves too many items, the thesaurus can suggest more specific search vocabulary.

25 Document Clustering Global clustering  The grouping of documents accordingly to their occurrence in the whole collection Local clustering:  The grouping of the local set of retrieved documents by a query

26 Hypercentroid Supercentroids Centroids Documents Typical Clustered File Organization complete space superclusters clusters

27 Search Strategy for Clustered Documents CentroidsDocuments Typical Search path Highest-level centroid Supercentroids Centroids Documents

28 Text Compression Finding ways to represent the text in fewer bits or bytes Statistical Methods  Huffman coding Dictionary-based  Ziv-Lempel Inverted File compression