Properties of Text CS336 Lecture 4:
2 Stop list Typically most frequently occurring words –a, about, at, and, etc, it, is, the, or, … Among the top 200 are words such as “time” “war” “home” etc. –May be collection specific “computer, machine, program, source, language” in a computer science collection Removal can be problematic (e.g. “Mr. The”, “and-or gates”)
3 Stop lists Commercial systems use only a few stop words ORBIT uses only 8, “and, an, by, from, of, the, with” –patents, scientific and technical (sci-tech) information, trademarks and Internet domain names
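A minimal sketch of stop-word removal, assuming whitespace tokenization and lowercasing. The stop list is the set of ORBIT words quoted on the slide; the function name `remove_stop_words` is hypothetical.

```python
# Hypothetical stop list: the ORBIT words quoted on the slide.
STOP_WORDS = frozenset({"and", "an", "by", "from", "of", "the", "with"})


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop tokens found in the stop list."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


print(remove_stop_words("The design of and-or gates"))
print(remove_stop_words("Mr. The"))
```

Because tokens are split on whitespace only, "and-or" survives as one token. A tokenizer that also splits on hyphens would discard it entirely, which is the kind of problematic removal the previous slide warns about. "Mr. The" still loses the surname.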
4 Special Cases? Name Recognition –People’s names - “Bill Clinton” –Company names - IBM & big blue –Places New York City, NYC, the big apple
5 Text Goal: –Identify what can be inferred about text based on structural features and statistical features of language Statistical Language Characteristics –convert text to a form more easily manipulated by computer –reduce storage space and processing time –store and process in encrypted form –text compression
6 Zipf’s Law $p_r$ = (freq of word of rank r)/N –Probability that a word chosen randomly will be the word of rank r –N = total word occurrences –for D distinct words, $\sum_{r=1}^{D} p_r = 1$ –$r \cdot p_r = A$, with $A \approx 0.1$ –i.e. the frequency of a word is inversely proportional to its rank The probability of occurrence of words or other items starts high and tapers off. Thus, a few occur very often while many others occur rarely.
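One way to check the law on a collection is to tabulate rank times probability for the most frequent words and see whether the product stays roughly constant. This is a sketch only: the helper name `zipf_table` and the toy token list are assumptions, and a toy sample this small will not reproduce A ≈ 0.1.

```python
from collections import Counter


def zipf_table(tokens, top=5):
    """Return (rank, word, frequency, rank * p_r) for the `top` most frequent words."""
    counts = Counter(tokens)
    total = sum(counts.values())  # N: total word occurrences
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        rows.append((rank, word, freq, rank * freq / total))
    return rows


# Toy input; a real check needs a large corpus.
for row in zipf_table("a a a a b b c".split()):
    print(row)
```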
7 Employing Zipf’s Law Identify significant words and ineffectual words –A few words occur very often: the 2 most frequent words can account for 10% of occurrences, the top 6 words for 20%, the top 50 words for 50% –Many words are infrequent
8 Most frequent words (N ~ 1,000,000)
r    Word   f(r)    r*f(r)/N
1    the    69,…    …
2    of     36,…    …
3    and    28,…    …
4    to     26,…    …
5    a      23,…    …
6    in     21,…    …
7    that   10,…    …
8    is     10,…    …
9    was    9,…     …
10   he     9,…     …
9 Employing Zipf’s Law Estimate technical needs –Estimating storage space saved by excluding stop words from index: the 10 most frequently occurring words in English make up about 25%-30% of text; deleting very low frequency words from index yields a large saving –Estimate number of words n(1) that occur once, n(2) that occur twice, etc.: words that occur at most twice comprise about 2/3 of the distinct words in a text –Estimating the size of a term’s inverted index list Zipf is quite accurate except at very high and very low rank
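The 2/3 figure follows from Zipf’s Law under the usual simplifying assumptions. A word occurring k times has rank about AN/k. The words occurring exactly k times then fill the ranks between AN/(k+1) and AN/k, so n(k) ≈ AN/(k(k+1)). The vocabulary size is D ≈ AN, the rank of the words occurring once. The fraction of distinct words occurring exactly k times is therefore 1/(k(k+1)). A small sketch (the function name is hypothetical):

```python
from fractions import Fraction


def fraction_occurring(k):
    """Fraction of distinct words that occur exactly k times, under Zipf's Law."""
    return Fraction(1, k * (k + 1))


once = fraction_occurring(1)    # 1/2 of the vocabulary
twice = fraction_occurring(2)   # 1/6 of the vocabulary
print(once + twice)             # 2/3
```

This is a fraction of the vocabulary (distinct words), not of the running text.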
10 Modeling Natural Language Length of the words –defines total space needed for vocabulary (each character requires 1 byte) Heaps’ Law: length increases logarithmically with text size.
11 Vocabulary Growth New words occur less frequently as collection grows Empirically $t = kN^{\beta}$, where –t is the number of unique words –N is the total number of word occurrences –k and $\beta$ are constants, with $0 < \beta < 1$ As the total text size grows, the predictions of Heaps’ Law become more accurate Sublinear growth rate
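A sketch of vocabulary growth, assuming the usual form of Heaps’ Law, t = k·N^β. The default values k = 30 and β = 0.5 are illustrative assumptions, not values from the slide. In practice both constants are fitted to a collection, for example from the observed curve that `vocabulary_growth` returns.

```python
def heaps_vocabulary(n_tokens, k=30.0, beta=0.5):
    """Predicted number of unique words after n_tokens word occurrences."""
    return k * n_tokens ** beta


def vocabulary_growth(tokens):
    """Observed (N, t) pairs: tokens seen so far and unique words seen so far."""
    seen = set()
    growth = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        growth.append((n, len(seen)))
    return growth


print(heaps_vocabulary(1_000_000))
print(vocabulary_growth("a b a c a b".split()))
```

With β = 0.5, doubling the text size multiplies the predicted vocabulary by about 1.41 rather than 2, which is the sublinear growth the slide describes.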
12 Information Theory Shannon studied theoretical limits for data compression and transmission rate –“…problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.” Compression limits given by Entropy (H) Transmission limits given by Channel Capacity (C) Many language tasks have been formulated as a “noisy channel” problem –determine most likely input given noisy output: OCR, speech recognition, machine translation, etc.
13 Shannon Game How should we complete the following? –The president of the United States is George W. … –The winner of the $10K prize is … –Mary had a little … –The horse raced past the barn … Period? etc
14 Information Theory Information content of a message is dependent on both –the receiver’s prior knowledge –the message itself How much of the receiver’s uncertainty (entropy) is reduced How predictable is the message
15 Information Theory Think of information content, H, as a measurement of our ability to guess the rest of a message, given only a portion of the message –if we can predict with probability 1, information content is zero –if we predict with probability 0, information content is infinite –$H(p) = -\log p$ Logs are in base 2, so the unit of information content (entropy) is 1 bit If a message is a priori predictable with p = 0.5, information content = $-\log_2(1/2)$ = 1 bit
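The information content of a single outcome with probability p can be computed directly. This is a minimal sketch. The function name is hypothetical, and p = 0 is rejected because the information content is unbounded there.

```python
import math


def information_content(p):
    """Information content in bits, -log2(p), of an outcome with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1 / p)


print(information_content(0.5))   # 1.0
print(information_content(1.0))   # 0.0
```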
16 Information Theory Given n messages, the average or expected information content to be gained from receiving one of the messages is: $H = -\sum_{i=1}^{n} p_i \log_2 p_i$ where n: # of symbols in an alphabet, $p_i$: probability of a symbol’s appearance ($\text{freq}_i$ / all occurrences) –Amount of information in a message is related to the distribution of symbols in the message.
17 Entropy Average entropy is a maximum when messages are equally probable –e.g. average entropy associated with characters assuming equal probabilities For the 26-letter alphabet, H = $-\log_2(1/26) = \log_2 26 \approx 4.7$ bits With actual probabilities, H = 4.14 bits With bigram probabilities, H reduces to 3.56 bits People predict next letter with ~ 40% accuracy, H = 1.3 bits Better models reduce the relative entropy In text compression, entropy (H) specifies the limit on how much the text can be compressed –the more regularity (i.e. the less uncertainty) in a data sequence, the more it can be compressed
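The average entropy H = -Σ p_i log2 p_i can be sketched as below. The uniform 26-letter case reproduces the 4.7-bit figure on the slide. `char_entropy` estimates the probabilities from character frequencies in a sample string. Both function names are hypothetical, and the 4.14-bit figure for English would need a large corpus rather than the toy strings shown here.

```python
import math
from collections import Counter


def entropy(probabilities):
    """Average information content in bits: the sum of p * log2(1/p)."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


def char_entropy(text):
    """Entropy of the empirical character distribution of `text`."""
    counts = Counter(text)
    total = sum(counts.values())
    return entropy(c / total for c in counts.values())


print(round(entropy([1 / 26] * 26), 2))  # 4.7
print(char_entropy("abab"))              # two equally likely symbols
print(char_entropy("aaaa"))              # fully predictable
```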
18 Information Theory Let t = number of unique words in a vocabulary –For t = 10,000, H = … bits Information theory has been used for –Compression –Term weighting –Evaluation measures
Stemming Commonly used to conflate morphological variants –combine non identical words referring to same concept compute, computation, computer, … Stemming is used to: –Enhance query formulation (and improve recall) by providing term variants –Reduce size of index files by combining term variants into single index term
21 Stemmer correctness Two ways to be incorrect –Under-stemming: prevents related terms from being conflated; e.g. stemming “consideration” to “considerat” prevents conflating it with “consider”; under-stemming affects recall –Over-stemming: terms with different meanings are conflated; e.g. “considerate”, “consider” and “consideration” should not be stemmed to “con”, along with “contra”, “contact”, etc.; over-stemming can reduce precision
22 The Concept of Relevance Relevant => does the document fulfill the query? Relevance of a document D to a query Q is subjective –Different users will have different judgments –Same users may judge differently at different times –Degree of relevance of different documents will vary In IR system evaluation it is assumed: –A subset of database documents (DB) are relevant –A document is either relevant or not relevant
23 Recall and precision Most common measures for evaluating IR systems Recall: % of relevant documents retrieved. – Measures ability to get ALL of the good documents. Precision: % of retrieved documents that are in fact relevant. – Measures amount of junk that is included in the results. Ideal Retrieval Results –100% recall (All good documents are retrieved ) –100% precision (No bad document is retrieved)
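Both measures reduce to set arithmetic over document identifiers. This is a minimal sketch assuming binary relevance judgments, as the previous slide does. The function name and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 4 documents retrieved, 3 relevant in the collection, 2 in common.
print(precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"]))
```

Returning 0.0 when nothing is retrieved (or nothing is relevant) is a convention chosen for this sketch; the ratio is undefined in those cases.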
24 Evaluating stemmers In information retrieval stemmers are evaluated by their: –effect on retrieval (improvements in recall or precision) –compression rate –Not linguistic correctness
Stemmers 4 basic types –Affix removing stemmers –Dictionary lookup stemmers –n-gram stemmers –Corpus analysis Studies have shown that stemming has a positive effect on retrieval. Performance of different algorithms comparable Results vary between test collections
26 Affix removal stemmers Remove suffixes and/or prefixes leaving a stem –In English remove suffixes What might you remove if you were designing a stemmer? –In other languages, e.g. Hebrew, remove both prefix and suffix Keshehalachnu --> halach Nelechna --> halach –some languages are more difficult, e.g. Arabic –iterative: consideration => considerat => consider –longest match: use a set of stemming rules arranged on a ‘longest match’ principle (Lovins)
27 A simple stemmer (Harman) if word ends in “ies” but not “eies” or “aies” then “ies” -> “y”; else if word ends in “es” but not “aes”, “ees” or “oes” then “es” -> “e”; else if word ends in “s” but not “us” or “ss” then “s” -> NULL; endif Algorithm changes: –“skies” to “sky” –“retrieves” to “retrieve” –“doors” to “door” –but not “corpus” or “wellness” –“dies” to “dy”?
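The three rules translate almost line for line into code. This sketch reads the slide’s if/else literally: a word that fails a rule’s exception test falls through to the next rule. That reading is an assumption. A stricter variant would stop at the first rule whose suffix matches, whether or not the exceptions block it. The function name `s_stem` is hypothetical.

```python
def s_stem(word):
    """Apply the three plural-removal rules in order; the first rule that fires wins."""
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"   # "ies" -> "y"
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-1]         # "es" -> "e"
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]         # "s" -> NULL
    return w


for w in ["skies", "retrieves", "doors", "corpus", "wellness", "dies"]:
    print(w, "->", s_stem(w))
```

The last case shows the weakness flagged on the slide: "dies" becomes "dy".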
Stemming w/ Dictionaries Avoid collapsing words with different meaning to same root Word is looked up and replaced by the best stem Typical stemmers consist of rules and/or dictionaries –simplest stemmer is “suffix s” –Porter stemmer is a collection of rules –KSTEM uses lists of words plus rules for inflectional and derivational morphology
29 Stemming Examples Original text: Document will describe marketing strategies carried out by U.S. companies for their agricultural chemicals, report predictions for market share of such chemicals, or report market statistics for agrochemicals, pesticide, herbicide, fungicide, insecticide, fertilizer, predicted sales, market share, stimulate demand, price cut, volume of sales Porter Stemmer: market strateg carr compan agricultur chemic report predict market share chemic report market statist agrochem pesticid herbicid fungicid insecticid fertil predict sale stimul demand price cut volum sale KSTEM: marketing strategy carry company agriculture chemical report prediction market share chemical report market statistic agrochemic pesticide herbicide fungicide insecticide fertilizer predict sale stimulate demand price cut volume sale
30 n-grams Fixed length consecutive series of “n” characters –Bigrams: Sea colony -> (se ea co ol lo on ny) –Trigrams: Sea colony -> (sea col olo lon ony), or -> (#se sea ea# #co col olo lon ony ny#) Conflate words based on overlapping series of characters
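The n-gram lists on the slide can be generated per word, with optional padding. For conflation, this sketch scores two words by Dice’s coefficient over their sets of distinct n-grams. Dice is one common choice and is an assumption here, since the slide does not name a similarity measure. The function names are hypothetical.

```python
def char_ngrams(text, n, pad=None):
    """Character n-grams of each whitespace-separated word, optionally padded."""
    grams = []
    for word in text.lower().split():
        if pad:
            word = pad + word + pad
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def dice(a, b, n=2):
    """Dice coefficient over the distinct n-grams of two words."""
    x, y = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


print(char_ngrams("Sea colony", 2))
print(char_ngrams("Sea colony", 3, pad="#"))
print(dice("computer", "computation"))
```

Words whose score exceeds a chosen threshold would be placed in the same class. The threshold is a tuning parameter.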
31 Problems with Stemming Lack of domain-specificity and context can lead to occasional serious retrieval failures Stemmers are often difficult to understand and modify Sometimes too aggressive in conflation (over-stem) –e.g. “execute”/“executive”, “university”/“universe”, “policy”/“police”, “organization”/“organ” conflated by Porter Miss good conflations (under-stem) –e.g. “European”/“Europe”, “matrices”/“matrix”, “machine”/“machinery” are not conflated by Porter Stems that are not words are often difficult to interpret – e.g. with Porter, “iteration” produces “iter” and “general” produces “gener”
Corpus-Based Stemming Corpus analysis can improve/replace a stemmer Hypothesis: Word variants that should be conflated will co-occur in context Modify stem classes generated by a stemmer or other “aggressive” techniques such as initial n-grams –more aggressive classes mean fewer conflations missed Prune class by removing words that don’t co-occur sufficiently often Language independent
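A rough sketch of the pruning step. It makes several assumptions that are not on the slide: co-occurrence is counted at the document level rather than in a text window, the score is n_ab / (n_a + n_b), and the threshold is arbitrary. All names and the toy documents are hypothetical.

```python
from itertools import combinations


def cooccurrence_score(a, b, docs):
    """n_ab / (n_a + n_b), where counts are numbers of documents containing the word(s)."""
    n_a = sum(1 for d in docs if a in d)
    n_b = sum(1 for d in docs if b in d)
    n_ab = sum(1 for d in docs if a in d and b in d)
    return n_ab / (n_a + n_b) if n_a + n_b else 0.0


def prune_class(stem_class, docs, threshold):
    """Keep only the word pairs of a stem class that co-occur often enough."""
    doc_sets = [set(d) for d in docs]
    return [
        (a, b)
        for a, b in combinations(sorted(stem_class), 2)
        if cooccurrence_score(a, b, doc_sets) >= threshold
    ]


docs = [
    ["policy", "policies"],
    ["policy", "policies", "police"],
    ["police"],
]
print(prune_class({"policy", "policies", "police"}, docs, threshold=0.4))
```

In this toy collection "policy" and "policies" stay together while "police" is separated from both. That is the kind of refinement the hypothesis predicts.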
33 Equivalence Class Examples Some Porter Classes for a WSJ Database Classes refined through corpus analysis
Corpus-Based Stemming Results Both Porter and KSTEM stemmers are improved slightly by this technique n-gram stemmer gives same performance as “linguistic” stemmers for –English –Spanish –Not shown to be the case for Arabic
35 Stemmer Summary All automatic stemmers are sometimes incorrect –over-stemming –under-stemming In general, stemming improves effectiveness May use varying levels of language specific information –morphological stemmers use dictionaries –affix removal stemmers use information about prefixes, suffixes, etc. n-gram and corpus analysis methods can be used for different languages
36 Generating Document Representations Use significant terms to build representations of documents –referred to as indexing Manual indexing: professional indexers –Assign terms from a controlled vocabulary –Typically phrases Automatic indexing: machine selects –Terms can be single words, phrases, or other features from the text of documents
37 Index Languages Language used to describe docs and queries Exhaustivity - # of different topics indexed, completeness or breadth –increased exhaustivity => higher recall/lower precision –retrieved output size increases because documents are indexed by any remotely connected content information Specificity - accuracy of indexing, detail –increased specificity => higher precision/lower recall When a doc is represented by fewer terms, content may be lost. A query that refers to the lost content will fail to retrieve the document
38 Index Languages Pre-coordinate indexing – combinations of terms (e.g. phrases) used as an indexing label Post-coordinate indexing - combinations generated at search time Faceted classification - group terms into facets that describe basic structure of a domain, less rigid than predefined hierarchy Enumerative classification - an alphabetic listing, underlying order less clear –e.g. Library of Congress class for “socialism, communism and anarchism” at end of schedule for social sciences, after social pathology and criminology