1 Terms and Query Operations Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall.


1 Terms and Query Operations Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, Chapters 7-9

2 Lexical Analysis and Stoplists Chapter 7

3 Lexical Analysis for Automatic Indexing l Lexical Analysis: convert an input stream of characters into a stream of words or tokens. l What is a word or a token? Tokens consist of letters. »Digits: most numbers are not good index terms; counterexamples: case numbers in a legal database, “B6” and “B12” in a vitamin database. »Hyphens –break hyphenated words: state-of-the-art → state of the art –keep hyphenated words as a single token: “Jean-Claude”, “F-16” »Other punctuation: often used as part of terms, e.g., OS/2 »Case: usually not significant in index terms
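A minimal lexical analyzer along these lines might look as follows (a sketch in Python; the keep-list of hyphenated tokens and the exact character classes are illustrative assumptions, not a prescribed design):

```python
import re

# Hypothetical keep-list: hyphenated tokens preserved whole rather than split.
KEEP_HYPHENATED = {"jean-claude", "f-16"}

def tokenize(text):
    """Convert a character stream into a stream of tokens: fold case,
    keep '/' inside terms like OS/2, and break hyphenated words unless
    they appear in the keep-list."""
    tokens = []
    for raw in re.findall(r"[A-Za-z0-9][A-Za-z0-9/-]*", text):
        word = raw.lower()
        if "-" in word and word not in KEEP_HYPHENATED:
            tokens.extend(p for p in word.split("-") if p)
        else:
            tokens.append(word)
    return tokens
```

For example, tokenize("State-of-the-art OS/2") yields ["state", "of", "the", "art", "os/2"], while "F-16" survives intact because it is on the keep-list.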

4 Lexical Analysis for Automatic Indexing ( Continued ) l Issues: recall and precision »breaking up hyphenated terms increases recall but decreases precision »preserving case distinctions enhances precision but decreases recall »commercial information systems usually take a conservative (recall-enhancing) approach

5 Lexical Analysis for Query Processing l Tasks »depend on the design strategies of the lexical analyzer for automatic indexing (search terms must match index terms) »distinguish operators like Boolean operators »distinguish grouping indicators like parentheses and brackets »flag illegal characters as unrecognized tokens

6 STOPLISTS (negative dictionary) l Purpose: avoid retrieving almost every item in a database regardless of its relevance. l Example (derived from the Brown corpus): 425 words a, about, above, across, after, again, against, all, almost, alone, along, already, also, although, always, among, an, and, another, any, anybody, anyone, anything, anywhere, are, area, areas, around, as, ask, asked, asking, asks, at, away, b, back, backed, backing, backs, be, because, became, … l Commercial information systems tend to take a conservative approach, with few stopwords

7 Implementing Stoplists l Approaches »examine lexical analyzer output and remove any stopwords »remove stopwords as part of lexical analysis
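The second approach (removing stopwords as part of lexical analysis) can be sketched like this; the stoplist below is a small excerpt of the 425-word Brown-corpus list shown above, and the whitespace tokenizer is a deliberate simplification:

```python
# Excerpt of the Brown-corpus-derived stoplist from the previous slide.
STOPLIST = {"a", "about", "above", "across", "after", "again", "against",
            "all", "almost", "alone", "along", "already", "also", "an",
            "and", "any", "are", "around", "as", "at", "be", "because"}

def analyze(text):
    """One-pass lexical analysis: tokenize and drop stopwords together,
    so no separate filtering pass over the analyzer output is needed."""
    return [tok for tok in text.lower().split() if tok not in STOPLIST]
```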

8 Stemming Algorithms Chapter 8

9 Stemmers l Programs that relate morphologically similar indexing and search terms l Stem at indexing time »advantage: efficiency and index file compression »disadvantage: information about the full terms is lost l Example (CATALOG system), stem at search time Look for: system users Search Term: users

Term       Occurrences
1. user             15
2. users             1
3. used              3
4. using             2

10 Conflation Methods l Manual l Automatic (stemmers) »table lookup »successor variety »n-gram »affix removal longest match vs. simple removal l Evaluation »correctness »retrieval effectiveness »compression performance

11 Successor Variety l Definition (successor variety of a string): the number of different characters that follow it in words in some body of text l Example a body of text: able, axle, accident, ape, about successor variety of “a”: 4 (b, x, c, p) successor variety of “ap”: 1 (e)

12 Successor Variety ( Continued ) l Idea The successor variety of substrings of a term will decrease as more characters are added until a segment boundary is reached, at which point the successor variety sharply increases. l Example Test word: READABLE Corpus: ABLE, BEATABLE, FIXABLE, READ, READABLE, READING, RED, ROPE, RIPE

Prefix     Successor Variety   Letters
R          3                   E, O, I
RE         2                   A, D
REA        1                   D
READ       3                   A, I, blank
READA      1                   B
READAB     1                   L
READABL    1                   E
READABLE   1                   blank

13 The successor variety stemming process l Determine the successor variety for a word. l Use this information to segment the word. »cutoff method: a boundary is identified whenever a preset cutoff value is reached »peak and plateau method: a boundary is placed after a character whose successor variety exceeds that of both the character immediately preceding it and the character immediately following it »complete word method: a segment is a complete word »entropy method l Select one of the segments as the stem.
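The peak-and-plateau variant of this process can be sketched as follows (Python; end-of-word is counted as a distinct "blank" successor, as in the READABLE example above):

```python
def successor_variety(prefix, corpus):
    """Number of distinct characters that follow prefix in the corpus;
    end-of-word counts as a distinct 'blank' successor."""
    successors = set()
    for word in corpus:
        if word.startswith(prefix):
            successors.add(word[len(prefix)] if len(word) > len(prefix) else "")
    return len(successors)

def segment(word, corpus):
    """Peak-and-plateau segmentation: cut after prefix position i when
    the successor variety there exceeds that of both neighbours."""
    sv = [successor_variety(word[:i], corpus) for i in range(1, len(word) + 1)]
    cuts = [i + 1 for i in range(1, len(sv) - 1)
            if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]]
    segments, start = [], 0
    for cut in cuts:
        segments.append(word[start:cut])
        start = cut
    segments.append(word[start:])
    return segments
```

With the corpus from the slide, segment("READABLE", corpus) yields ["READ", "ABLE"]: the peak in successor variety at READ marks the boundary.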

14 n-gram stemmers l Digram: a pair of consecutive letters l Shared digram method (Adamson and Boreham, 1974) association measures are calculated between pairs of terms using the Dice coefficient S = 2C / (A + B) where A: the number of unique digrams in the first word, B: the number of unique digrams in the second, C: the number of unique digrams shared by the two words

15 n-gram stemmers ( Continued ) l Example statistics => st ta at ti is st ti ic cs unique digrams => at cs ic is st ta ti statistical => st ta at ti is st ti ic ca al unique digrams => al at ca ic is st ta ti shared unique digrams => at ic is st ta ti (6 of them), so S = 2*6 / (7+8) = 0.80
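The digram measure above translates into a few lines of code (a sketch):

```python
def digrams(word):
    """Set of unique digrams (pairs of consecutive letters) in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def digram_similarity(w1, w2):
    """Dice coefficient S = 2C / (A + B) over unique digrams."""
    a, b = digrams(w1), digrams(w2)
    return 2 * len(a & b) / (len(a) + len(b))
```

digram_similarity("statistics", "statistical") reproduces the 0.80 computed above.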

16 n-gram stemmers ( Continued ) l similarity matrix determine the similarity measures for all pairs of terms in the database

          word1   word2   word3  ...  wordn-1
word1
word2     S21
word3     S31     S32
...
wordn     Sn1     Sn2     Sn3    ...  Sn(n-1)

l terms are clustered using a single-link clustering method »most pairwise similarity measures were 0 »using a cutoff similarity value of 0.6
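A minimal single-link clustering over such a similarity matrix might look like this (a sketch using union-find; the 0.6 cutoff follows the slide):

```python
def single_link_clusters(terms, similarity, cutoff=0.6):
    """Single-link clustering: connect any two terms whose similarity
    meets the cutoff, then return the connected components."""
    parent = {t: t for t in terms}

    def find(t):
        # Find the cluster representative, with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if similarity(a, b) >= cutoff:
                parent[find(a)] = find(b)

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), set()).add(t)
    return sorted(sorted(c) for c in clusters.values())
```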

17 Affix Removal Stemmers l Procedure Remove suffixes and/or prefixes from terms, leaving a stem, and transform the resultant stem. l Example: plural forms If a word ends in “ies” but not “eies” or “aies” then “ies” --> “y” If a word ends in “es” but not “aes”, “ees”, or “oes” then “es” --> “e” If a word ends in “s”, but not “us” or “ss” then “s” --> NULL l Ambiguity: the rules can conflict, so the order in which they are applied matters
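The three plural rules translate directly into code (first match wins, which is one place the ambiguity shows up):

```python
def stem_plural(word):
    """Apply the three plural rules above in order; first match wins."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"          # queries -> query
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"          # miles -> mile
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]                # dogs -> dog
    return word
```

Note the ambiguity in action: "glasses" matches the second rule before the third can decline it, and becomes "glasse".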

18 Affix Removal Stemmers ( Continued ) l Iterative longest match stemmer remove the longest possible string of characters from a word according to a set of rules »recoding: AxC --> AyC, e.g., ki --> ky »partial matching: only n initial characters of stems are used in comparing l Different versions Lovins, Salton, Dawson, Porter, … Students can refer to the rules listed in the textbook.

19 Thesaurus Construction Chapter 9

20 Thesaurus Construction l IR thesaurus a list of terms (words or phrases) along with relationships among them, e.g., physics, EE, electronics, computer and control l INSPEC thesaurus (1979) cesium (Cs) USE caesium (USE: the preferred form) computer-aided instruction see also education (cross-referenced terms) UF teaching machines (UF: a set of alternatives) BT educational computing (BT: broader terms, cf. NT) TT computer applications (TT: root node/top term) RT education (RT: related terms) teaching CC C7810C (CC: subject area) FC C7810Cf (FC: subject area code)

21 Usage l Indexing Select the most appropriate thesaurus entries for representing the document. l Searching Design the most appropriate search strategy. »If the search does not retrieve enough documents, the thesaurus can be used to expand the query. »If the search retrieves too many items, the thesaurus can suggest more specific search vocabulary.

22 Features of Thesauri (1/5) l Coordination Level »the construction of phrases from individual terms »pre-coordination: contains phrases –phrases are available for indexing and retrieval –advantage: reducing ambiguity in indexing and searching –disadvantage: the searcher has to know the phrase formulation rules –lower recall »post-coordination: does not allow phrases –phrases are constructed while searching –advantage: users do not have to worry about the exact word ordering –disadvantage: search precision may fall, e.g., library school vs. school library –lower precision

23 Features of Thesauri (2/5) »intermediate level: allows both phrases and single words –the higher the level of coordination, the greater the precision of the vocabulary but the larger the vocabulary size –it also implies an increase in the number of relationships to be encoded l Pre-coordination is more common in manually constructed thesauri. l Automatic phrase construction is still quite difficult, and therefore automatic thesaurus construction usually implies post-coordination

24 Features of Thesauri (3/5) l Term Relationships »Aitchison and Gilchrist (1972) –equivalence relationships: synonymy or quasi-synonymy –hierarchical relationships, e.g., genus-species –nonhierarchical relationships, l e.g., thing-part, bus and seat l e.g., thing-attribute, rose and fragrance »Wang, Vandendorpe, and Evens (1985) –parts-wholes, e.g., set-element, count-mass –collocation relations: words that frequently co-occur in the same phrase or sentence –paradigmatic relations (morphological variants): e.g., “moon” and “lunar” –taxonomy and synonymy –antonymy relations

25 Features of Thesauri (4/5) l Number of entries for each term »homographs: words with multiple meanings »each homograph entry is associated with its own set of relations »problem: how to select between alternative meanings »typically the user has to select between alternative meanings l Specificity of vocabulary »is a function of the precision associated with the component terms »disadvantage: the size of the vocabulary grows, since a large number of terms are required to cover the concepts in the domain »high specificity implies a high coordination level »a highly specific vocabulary promotes precision in retrieval

26 Features of Thesauri (5/5) l Control on term frequency of class members »for statistical thesaurus construction methods »terms included in the same thesaurus class have roughly equal frequencies »the total frequency in each class should also be roughly similar l Normalization of vocabulary »normalization of vocabulary terms is given considerable emphasis in manual thesauri »terms should be in noun form »noun phrases should avoid prepositions unless they are commonly known »a limited number of adjectives should be used »...

27 Thesaurus Construction l Manual thesaurus construction »define the boundaries of the subject area »collect the terms for each subarea sources: indexes, encyclopedias, handbooks, textbooks, journal titles and abstracts, catalogues,... »organize the terms and their relationships into structures »review (and refine) the entire thesaurus for consistency l Automatic thesaurus construction »from a collection of documents »by merging existing thesauri

28 Thesaurus Construction from Texts 1. Construction of vocabulary normalization and selection of terms phrase construction depending on the coordination level desired 2. Similarity computations between terms identify the significant statistical associations between terms 3. Organization of vocabulary organize the selected vocabulary into a hierarchy on the basis of the associations computed in step 2.

29 Construction of Vocabulary l Objective identify the most informative terms (words and phrases) l Procedure (1) Identify an appropriate document collection. The document collection should be sizable and representative of the subject area. (2) Determine the required specificity for the thesaurus. (3) Normalize the vocabulary terms. (a) Eliminate very trivial words such as prepositions and conjunctions. (b) Stem the vocabulary. (4) Select the most interesting stems, and create interesting phrases for a higher coordination level.

30 Stem evaluation and selection l Selection by frequency of occurrence »each term may belong to a category of high, medium, or low frequency »terms in the mid-frequency range are the best for indexing and searching
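A frequency-band filter of this kind is short to write; the low/high cutoffs below are illustrative and would be tuned per collection:

```python
from collections import Counter

def midfrequency_stems(stems, low=2, high=50):
    """Keep only stems whose collection frequency falls in the
    mid-frequency band [low, high]; cutoffs are illustrative."""
    counts = Counter(stems)
    return sorted(s for s, c in counts.items() if low <= c <= high)
```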

31 Stem evaluation and selection ( Continued ) l selection by discrimination value (DV) »the more discriminating a term, the higher its value as an index term »procedure –compute the average inter-document similarity in the collection –remove the term K from the indexing vocabulary, and recompute the average similarity –DV(K) = (average similarity without K) - (average similarity with K) –the DV for good discriminators is positive.
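The DV procedure can be sketched as follows (documents represented as term-frequency dicts and cosine used for inter-document similarity; both are assumptions for illustration):

```python
from itertools import combinations

def avg_pairwise_similarity(docs):
    """Average cosine similarity over all pairs of documents,
    where each document is a term-frequency dict."""
    def cos(d1, d2):
        num = sum(f * d2.get(t, 0) for t, f in d1.items())
        den = (sum(v * v for v in d1.values()) ** 0.5 *
               sum(v * v for v in d2.values()) ** 0.5)
        return num / den if den else 0.0
    pairs = list(combinations(docs, 2))
    return sum(cos(a, b) for a, b in pairs) / len(pairs)

def discrimination_value(term, docs):
    """DV(K) = (avg similarity without K) - (avg similarity with K);
    positive for good discriminators."""
    without = [{t: f for t, f in d.items() if t != term} for d in docs]
    return avg_pairwise_similarity(without) - avg_pairwise_similarity(docs)
```

A term that occurs everywhere gets a negative DV (removing it makes documents less similar on average), while a term confined to few documents gets a positive one.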

32 Phrase Construction l Salton and McGill procedure 1. Compute pairwise co-occurrence for high-frequency words. 2. If this co-occurrence is lower than a threshold, then do not consider the pair any further. 3. For pairs that qualify, compute the cohesion value, using either COHESION(ti, tj) = co-occurrence-frequency / sqrt(frequency(ti) * frequency(tj)) or COHESION(ti, tj) = size-factor * co-occurrence-frequency / (frequency(ti) * frequency(tj)) where size-factor is the size of the thesaurus vocabulary. 4. If cohesion is above a second threshold, retain the phrase.
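Steps 1-4 might be sketched as follows (document-level co-occurrence and illustrative thresholds; the full procedure restricts itself to high-frequency words, which is omitted here for brevity):

```python
from collections import Counter
from math import sqrt

def candidate_phrases(docs, min_cooccur=2, min_cohesion=0.5):
    """Count pairwise co-occurrence of words within each document, drop
    pairs below the co-occurrence threshold, then keep pairs whose
    cohesion (first formula above) meets a second threshold.
    Both thresholds are illustrative choices."""
    freq, cooccur = Counter(), Counter()
    for doc in docs:
        words = doc.split()
        freq.update(words)
        for i, w1 in enumerate(words):
            for w2 in words[i + 1:]:
                cooccur[tuple(sorted((w1, w2)))] += 1
    phrases = {}
    for (w1, w2), c in cooccur.items():
        if c < min_cooccur:
            continue
        cohesion = c / sqrt(freq[w1] * freq[w2])
        if cohesion >= min_cohesion:
            phrases[(w1, w2)] = cohesion
    return phrases
```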

33 Phrase Construction ( Continued ) l Choueka Procedure 1. Select the range of lengths allowed for each collocational expression, e.g., 2-6 words. 2. Build a list of all potential expressions from the collection with the prescribed length that have a minimum frequency. 3. Delete sequences that begin or end with a trivial word (e.g., prepositions, pronouns, articles, conjunctions, etc.) 4. Delete expressions that contain high-frequency nontrivial words. 5. Given an expression, evaluate any potential sub-expressions for relevance. Discard any that are not sufficiently relevant. 6. Try to merge smaller expressions into larger and more meaningful ones.

34 Term-Phrase Formation l Term Phrase a sequence of related text words that carries a more specific meaning than the single terms, e.g., “computer science” vs. computer [Figure: distribution of terms by document frequency (0 to N), annotated with a thesaurus transformation and a phrase transformation]

35 Similarity Computation l Cosine: compute the number of documents associated with both terms divided by the square root of the product of the number of documents associated with the first term and the number of documents associated with the second term. l Dice: compute twice the number of documents associated with both terms divided by the sum of the number of documents associated with one term and the number associated with the other.
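Treating each term as the set of documents it occurs in, the two measures read as follows (note the standard Dice coefficient carries a factor of 2 in the numerator):

```python
from math import sqrt

def cosine(docs_i, docs_j):
    """Cosine: |Di ∩ Dj| / sqrt(|Di| * |Dj|) over document sets."""
    return len(docs_i & docs_j) / sqrt(len(docs_i) * len(docs_j))

def dice(docs_i, docs_j):
    """Dice: 2 * |Di ∩ Dj| / (|Di| + |Dj|) over document sets."""
    return 2 * len(docs_i & docs_j) / (len(docs_i) + len(docs_j))
```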

36 Vocabulary Organization l Clustering l Forsyth and Rada (1986) »Assumptions: »(1) high-frequency words have broad meaning, while low-frequency words have narrow meaning. »(2) if the density functions of two terms have the same shape, then the two words have similar meaning. 1. Identify a set of frequency ranges. 2. Group the vocabulary terms into different classes based on their frequencies and the ranges selected in step 1. 3. The highest frequency class is assigned level 0, the next, level 1, and so on.

37 Forsyth and Rada (cont.) 4. Parent-child links are determined between adjacent levels as follows. For each term t in level i, compute the similarity between t and every term in level i-1. Term t becomes the child of the most similar term in level i-1. If more than one term in level i-1 qualifies for this, then each becomes a parent of t. In other words, a term is allowed to have multiple parents. 5. After all terms in level i have been linked to level i-1 terms, check level i-1 terms and identify those that have no children. Propagate such terms to level i by creating an identical “dummy” term as its child. 6. Perform steps 4 and 5 for each level, starting with level 1.
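Step 4, with the multiple-parent tie rule, might be sketched as (a simplified sketch that takes the levels and a similarity function as given and omits the dummy-term propagation of step 5):

```python
def link_levels(levels, similarity):
    """Attach each term at level i to its most similar term(s) at
    level i-1; ties yield multiple parents, as the procedure allows."""
    parents = {}
    for i in range(1, len(levels)):
        for term in levels[i]:
            sims = {p: similarity(term, p) for p in levels[i - 1]}
            best = max(sims.values())
            parents[term] = sorted(p for p, s in sims.items() if s == best)
    return parents
```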

38 Merging Existing Thesauri l simple merge link hierarchies wherever they have terms in common l complex merge »link terms from different hierarchies if they are similar enough. »similarity is a function of the number of parent and child terms in common