7CCSMWAL Algorithmic Issues in the WWW Lecture 6
Text Processing
See chapter 2 of Introduction to Information Retrieval (Manning, Raghavan and Schütze; available online).
Text processing
Given a document or a collection of documents, what do we identify as index terms?
Process the documents:
- Tokenization
- Token normalization
- Elimination of stop words
- Stemming
Tokenization
- Convert a stream of characters (the text of the documents) into a stream of words (the candidate words to be adopted as index terms)
- Identify the words (called tokens)
- The simplest way is to recognise spaces as word separators and remove punctuation characters
- E.g., Input: “Friends, Romans, Countrymen, lend me your ears;”
  Output: Friends Romans Countrymen lend me your ears
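The simplest approach above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function name `tokenize` is hypothetical, and it assumes punctuation only needs stripping from the ends of each whitespace-separated piece.

```python
import string

def tokenize(text: str) -> list[str]:
    """Split on whitespace, then strip punctuation from both ends of each piece."""
    # Assumption: treat the curly quotes used on the slide as punctuation too.
    strip_chars = string.punctuation + "“”‘’"
    tokens = []
    for piece in text.split():
        token = piece.strip(strip_chars)
        if token:  # drop pieces that consisted of punctuation only
            tokens.append(token)
    return tokens

print(tokenize("“Friends, Romans, Countrymen, lend me your ears;”"))
# ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']
```

Because only the ends of each piece are stripped, internal apostrophes and hyphens survive, which is exactly where the tricky cases on the following slides begin.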
Problems with tokenization?
- Tokenization is relatively easy in languages where word boundaries are marked by spaces
- What about, e.g., East Asian languages?
Chinese and Japanese have no spaces between words (Sec. 2.2.1)
- 莎拉波娃现在居住在美国东南部的佛罗里达。
- A unique tokenization is not always guaranteed
- Further complicated in Japanese, with multiple alphabets intermingled (katakana, hiragana, kanji, romaji)
  - フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
  - Words can be written interchangeably in 2 alphabets
- The end user can express a query entirely in hiragana!
Tokenization
The major question in tokenization is which tokens are the correct ones to use
- Language specific
- Tricky cases:
  - Apostrophes
  - Hyphens (and white space)
  - Case of the letters (lower and upper case)
  - Domain-specific tokens
  - HTML
Apostrophes
Consider the text: Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
- For “O’Neill”, which of the following is the desired tokenization? neill, oneill, o’neill, o’ neill, o neill
- For “aren’t”: aren’t, arent, are n’t, aren t
Hyphens and white space
Hyphenation is used for various purposes:
- Splitting words over lines
- Splitting up vowels in words, e.g., co-education
- Joining nouns as names, e.g., Hewlett-Packard
- As a copyediting device to show word grouping, e.g., the hold-him-back-and-drag-him-away maneuver
Handling hyphens automatically is complex
- The first two cases should be regarded as one token, the last should be separated into words, and the third (joined names) is unclear
Splitting on white space can also split what should be regarded as a single token
- This occurs most commonly with names, e.g., San Francisco, Los Angeles
- Phone numbers and dates, e.g., (800) 234-2333, 11 Mar 1983
- Bad retrieval results may occur, e.g., if a search for York University mainly returns documents containing New York University
Case of the letters
- The case of letters is usually not important for the identification of index terms
- Normally, all the text is converted to either lower or upper case
- Particular scenarios might require the distinction to be made, e.g.,
  - Windows (an operating system)
  - General Motors (a company)
  - Bush (a person)
Domain-specific tokens
For example:
- Programming languages: C++ and C#
- Aircraft names: B-52
- Television show names: M*A*S*H
- Email addresses: jblack@mail.yahoo.com
- Web URLs: http://stuff.big.com/new/
- Numeric IP addresses: 142.32.48.231
- Package tracking numbers: 1Z9999W99845399981
Tokenizing HTML
- Should text in HTML commands that is not typically seen by the user be included as tokens?
  - Words appearing in URLs
  - Words appearing in the “meta text” of images
- The simplest approach is to exclude all HTML tag information (between “<” and “>”) from tokenization
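The "exclude everything between < and >" approach can be sketched with a regular expression. This is a deliberately naive illustration: the names `strip_tags` and `tokenize_html` are hypothetical, and the sketch assumes well-formed tags and ignores script and style contents, comments and character entities.

```python
import re

# Matches one tag: "<", then any run of characters other than ">", then ">"
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(html: str) -> str:
    # Replace each tag with a space so the words on either side do not run together
    return TAG_RE.sub(" ", html)

def tokenize_html(html: str) -> list[str]:
    return strip_tags(html).split()

page = '<p>Visit <a href="http://stuff.big.com/new/">our new page</a></p>'
print(tokenize_html(page))
# ['Visit', 'our', 'new', 'page']
```

Note that the words inside the URL ("stuff", "big", "new") are discarded along with the tag, which is the trade-off the slide asks about.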
Token Normalization Elimination of stop words
Token Normalization
- Canonicalize tokens, so that matches occur despite superficial differences in the character sequences of the tokens
- Create equivalence classes of terms
  - E.g., anti-discriminatory and antidiscriminatory are both mapped onto the term antidiscriminatory
- This method can be extended to hand-constructed lists of synonyms such as car and automobile
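A minimal sketch of equivalence classing follows. The function `normalize` and the one-entry `SYNONYMS` table are hypothetical, and combining hyphen removal with lower-casing is an assumption made for illustration, not a rule stated on the slide.

```python
# Hypothetical hand-constructed synonym list: each variant maps to a class representative
SYNONYMS = {"automobile": "car"}

def normalize(token: str) -> str:
    # Assumption: fold case and drop hyphens before looking up synonyms
    term = token.lower().replace("-", "")
    return SYNONYMS.get(term, term)

print(normalize("anti-discriminatory"))  # antidiscriminatory
print(normalize("antidiscriminatory"))   # antidiscriminatory
print(normalize("Automobile"))           # car
```

Blind hyphen removal would also turn a domain-specific token such as B-52 into b52, so a real system would likely need exceptions.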
Elimination of stop words
- Stop words: words that are too frequent among the documents in the collection
  - Not good discriminators
  - Normally filtered out as potential index terms
- Articles, prepositions, and conjunctions are natural candidates for a list of stop words
  - E.g., a, the, and, in, but, who, that, or
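Filtering against a stop list can be sketched as below. The stop list is just the example words from the slide, and the function name `remove_stop_words` is hypothetical; real stop lists are longer and collection dependent.

```python
# Stop list taken from the examples on the slide
STOP_WORDS = {"a", "the", "and", "in", "but", "who", "that", "or"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    # Compare in lower case so "The" and "the" are treated alike
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the cat sat in the hat and that was that".split()))
# ['cat', 'sat', 'hat', 'was']
```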
Elimination of stop words
- Luhn (1958) suggested that both extremely common and extremely uncommon words are not very useful for indexing
Elimination of stop words
- The list of stop words (stop list) might be extended to include words other than articles, prepositions, and conjunctions
  - E.g., some verbs, adverbs, and adjectives could be treated as stop words
- Typically a few hundred terms account for 50% of the text
- Eliminating stop words can reduce the size of the indexing structure
Elimination of stop words
- But again care is required
  - E.g., queries like “To be or not to be” and “Let it be” consist entirely of stop words
- The general trend over time has been to move from standard use of quite large stop lists (200-300 terms) to very small stop lists (7-12 terms) to no stop list whatsoever
- Web search engines generally do not use stop lists (they keep a mixture of words and phrases)
Stemming
- Different forms of a word are used in documents
  - E.g., organize, organizes, and organizing
- Families of derivationally related words with similar meanings
  - E.g., democracy, democratic, and democratization
- It would be useful for a search for one of these words to return documents that contain another word in the set
  - E.g., am, are, is → be
  - car, cars, car’s, cars’ → car
- E.g., the text “the boy’s cars are different colors” becomes “the boy car be differ color”
Stemming
- Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes
- The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective, is Porter’s algorithm
- Porter’s algorithm is long, so we cover the general principle only
- You can make up your own stemming rules
Porter’s algorithm
- Consists of five phases of word reduction
- Within each phase, there are various conventions for selecting rules, e.g., select the rule from each rule group that applies to the longest suffix
- In the first phase, this convention is used with the following rule group:

  Rule           Example
  SSES → SS      caresses → caress
  IES  → I       ponies   → poni
  SS   → SS      caress   → caress
  S    →         cats     → cat
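The longest-suffix convention for this one rule group can be sketched as follows. This covers only the four rules shown on the slide, not the full Porter algorithm, and the names `STEP_1A_RULES` and `step_1a` are hypothetical.

```python
# (suffix, replacement) pairs for the rule group shown on the slide
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step_1a(word: str) -> str:
    # Collect every rule whose suffix matches the end of the word
    matching = [(suf, rep) for suf, rep in STEP_1A_RULES if word.endswith(suf)]
    if not matching:
        return word
    # Apply only the rule with the longest matching suffix
    suffix, replacement = max(matching, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + replacement

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

The SS → SS rule looks redundant but matters: without it, the shorter S rule would fire on "caress" and produce "cares".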
Porter’s algorithm
- Many of the later rules use the concept of the measure of a word. This loosely checks the number of syllables, to see whether a word is long enough that the matching portion of a rule can reasonably be regarded as a suffix rather than as part of the stem
- Example: (m>1) EMENT →
  - replacement → replac, but not cement → c
- See www.tartarus.org/~martin/PorterStemmer/ for the entire Porter’s algorithm
- Try it for yourself (see next page)
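The measure can be sketched as a count of vowel-then-consonant transitions in the stem, following Porter's description of a word as [C](VC)^m[V]. This is an illustrative sketch of that one idea, with hypothetical function names, and it applies only the single EMENT rule from the slide.

```python
def is_consonant(word: str, i: int) -> bool:
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel,
        # and a vowel when it follows a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem: str) -> int:
    """Return m, the number of VC sequences in the form [C](VC)^m[V]."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_was_vowel:
            m += 1  # a run of vowels has just been closed by a consonant
        prev_was_vowel = not cons
    return m

def strip_ement(word: str) -> str:
    # Rule (m>1) EMENT -> (empty): strip only if the remaining stem has m > 1
    if word.endswith("ement"):
        stem = word[: -len("ement")]
        if measure(stem) > 1:
            return stem
    return word

print(strip_ement("replacement"))  # replac  (stem "replac" has m = 2)
print(strip_ement("cement"))       # cement  (stem "c" has m = 0)
```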
Stemming
Example
Input text: Within each phase, there are various conventions to select rules, such as selecting the rule from each rule group that applies to the longest suffix

  Word           Stemmed
  phase          phase
  various        variou
  conventions    convent
  select         select
  rules          rule
  selecting      select
  rule           rule
  rule           rule
  group          group
  applies        appli
  longest        longest
  suffix         suffix
Google uses stemming
- http://www.googleguide.com/interpreting_queries.html (older version of page):
  “Google now uses stemming technology. Thus, when appropriate, it will search not only for your search terms, but also for words that are similar to some or all of those terms. If you search for pet lemur dietary needs, Google will also search for pet lemur diet needs, and other related variations of your terms. Any variants of your terms that were searched for will be highlighted in the snippet of text accompanying each result.”
- http://en.wikipedia.org/wiki/Stemming: an extensive list of stemming approaches
Text Properties
Text properties
- Text is composed of symbols from a finite alphabet
- Symbols can be divided into two disjoint subsets: symbols that separate words and symbols that belong to words
- Issues:
  - How are the different words distributed inside each document?
  - How many distinct words are there in a document?
  - What is the average length of a word?
Vocabulary distribution
- A few words are very common
  - The 2 most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences
- Most words are very rare
- A “heavy tailed” distribution: most of the words are in the “tail” (power law on rank)
  - Rank: largest, next largest and so on (en.wikipedia.org/wiki/Rank-size_distribution)
- Zipf distribution (en.wikipedia.org/wiki/Zipf's_law)
Example: top 10 frequent words
TIME magazine collection: 423 short articles, 245412 words

  Word    Number of occurrences    Percentage of total
  the     15861                    6.46
  of       7239                    2.95
  to       6331                    2.58
  a        5878                    2.40
  and      5614                    2.29
  in       5294                    2.16
  that     2507                    1.02
  for      2228                    0.91
  was      2149                    0.88
  with     1839                    0.75
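A frequency table of this kind can be produced with a simple counter. The sketch below assumes whitespace tokenization and lower-casing; the function name `top_words` and the sample sentence are hypothetical, not the TIME data.

```python
from collections import Counter

def top_words(text: str, k: int) -> list[tuple[str, int, float]]:
    """Return (word, count, percentage of all word occurrences) for the k most frequent words."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return [(w, c, 100.0 * c / total) for w, c in counts.most_common(k)]

sample = "the cat and the dog and the bird"
for word, count, pct in top_words(sample, 2):
    print(word, count, pct)
# the 3 37.5
# and 2 25.0
```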
Distribution of top 50 frequent words
[Figure: frequency of the top 50 words in the TIME magazine collection (423 short articles, 245412 words)]
Zipf’s law
- The frequency of occurrence of some event, F(r), is a function of its rank r:
  F(r) = C / r^a
  where C and a are constants
- Let P(r) be the probability of the word of rank r:
  P(r) = F(r) / N
  where N is the total number of word occurrences. Then
  P(r) = (C / N) / r^a
How to find the constant C
- The frequency of occurrence of some event, F(r), is a function of its rank r: F(r) = C / r^a, where C and a are constants
- Given an estimate of a, the probabilities P(r) = (C / N) / r^a must sum to 1 over all ranks, so C / N = 1 / Zeta(a), where
  Zeta(a) = Sum over r = 1, 2, … of 1 / r^a
- Zeta functions are tabulated
- E.g., if a = 2, Zeta(2) = 1.645, so setting C / N = 1/1.645 gives P(r) = 1 / (1.645 r^2)
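The a = 2 example can be checked numerically by approximating the zeta sum with a finite number of terms. This is only a sketch: the function names and the cut-off of 100,000 terms are assumptions chosen for illustration.

```python
def zeta_partial(a: float, terms: int = 100_000) -> float:
    # Finite approximation of Zeta(a) = sum over r >= 1 of 1 / r^a
    return sum(1.0 / r**a for r in range(1, terms + 1))

def zipf_probability(r: int, a: float) -> float:
    # P(r) = (1 / Zeta(a)) / r^a, so that the probabilities sum to (approximately) 1
    return 1.0 / (zeta_partial(a) * r**a)

print(round(zeta_partial(2.0), 3))         # 1.645
print(round(zipf_probability(1, 2.0), 3))  # 0.608
print(round(zipf_probability(2, 2.0), 3))  # 0.152
```

For a = 1 the infinite sum diverges, so this normalization only makes sense there if the sum is cut off at a finite vocabulary size.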
Vocabulary (a = 1)
- For the vocabulary distribution, a ≈ 1 and C/N ≈ 0.1
- So P(r) ≈ 0.1 / r
  - Inversely proportional to rank
  - The values are content independent
- This means r * P(r) = r * F(r) / N ≈ 0.1
  - See examples overleaf
- Why might this matter? Storage requirements of the index, generating artificial test documents, spotting exceptional occurrences of rare words, general curiosity
TIME magazine collection
P(r) = F(r) / N, where N = 245412

 r  Word   F(r)  P(r)*r  |   r  Word   F(r)  P(r)*r  |   r  Word     F(r)  P(r)*r
 1  the   15861  0.065   |  18  it     1290  0.095   |  35  week      793  0.113
 2  of     7239  0.059   |  19  from   1228  -       |  36  they      697  0.102
 3  to     6331  0.077   |  20  but    1138  0.093   |  37  govern    687  0.104
 4  a      5878  0.096   |  21  u       955  0.082   |  38  all       672  -
 5  and    5614  0.114   |  22  had     940  0.084   |  39  year        -  0.107
 6  in     5294  0.129   |  23  last    930  0.087   |  40  its       620  0.101
 7  that   2507  0.072   |  24  be      915  0.089   |  41  britain   589  0.098
 8  for    2228  0.073   |  25  have    914  -       |  42  when      579  0.099
 9  was    2149  0.079   |  26  who     894  -       |  43  out       577  -
10  with   1839  0.075   |  27  not     882  0.097   |  44  would       -  0.103
11  his    1815  0.081   |  28  has     880  0.100   |  45  new       572  0.105
12  is     1810  -       |  29  an      873  -       |  46  up        559  -
13  he     1700  0.090   |  30  s       865  0.106   |  47  been      554  -
14  as     1581  -       |  31  were    848  -       |  48  more      540  -
15  on     1551  -       |  32  their   815  -       |  49  which     539  0.108
16  by     1467  -       |  33  are     812  0.109   |  50  into      518  -
17  at     1333  0.092   |  34  one     811  0.112   |
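The P(r)*r column can be recomputed from F(r) and N as a quick check of the Zipf relation r * P(r) ≈ 0.1. The sketch below uses four rows of the TIME data above; the function name `rank_times_probability` is hypothetical.

```python
N = 245412  # total word occurrences in the TIME collection

# (rank, word, F(r)) for a few rows of the TIME table
ROWS = [(1, "the", 15861), (2, "of", 7239), (10, "with", 1839), (17, "at", 1333)]

def rank_times_probability(rank: int, freq: int, total: int = N) -> float:
    # r * P(r) = r * F(r) / N
    return rank * freq / total

for rank, word, freq in ROWS:
    print(rank, word, round(rank_times_probability(rank, freq), 3))
# 1 the 0.065
# 2 of 0.059
# 10 with 0.075
# 17 at 0.092
```

The product stays within roughly a factor of two of 0.1 even though F(r) itself falls by more than a factor of ten over these ranks.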
Wall Street Journal (WSJ87) collection
46,449 newspaper articles, N = 19 million

 r  Word     F(r)  P(r)*r  |   r  Word      F(r)  P(r)*r  |   r  Word     F(r)  P(r)*r
 1  the   1130021  0.059   |  18  from     96900  0.092   |  35  or      54958  0.101
 2  of     547311  0.058   |  19  he       94585  0.095   |  36  about   53713  0.102
 3  to     516635  0.082   |  20  million   3515  0.098   |  37  market  52110  -
 4  a      464736  -       |  21  year     90104  0.100   |  38  they    51359  0.103
 5  in     390819  -       |  22  its      86774  -       |  39  this    50933  0.105
 6  and    387703  0.122   |  23  be       85588  0.104   |  40  would   50828  0.107
 7  that   204351  0.075   |  24  was      83398  -       |  41  u       49281  0.106
 8  for    199340  0.084   |  25  company   3070  0.109   |  42  which   48273  -
 9  is     152483  0.072   |  26  an       76974  -       |  43  bank    47940  -
10  said   148302  0.078   |  27  has      74405  -       |  44  stock   47401  0.110
11  it     134323  -       |  28  are      74097  -       |  45  trade   47310  0.112
12  on     121173  0.077   |  29  have     73132  -       |  46  his     47116  0.114
13  by     118863  0.081   |  30  but      71887  -       |  47  more    46244  -
14  as     109135  0.080   |  31  will     71494  0.117   |  48  who     42142  -
15  at     101779  -       |  32  say      66807  0.113   |  49  one     41635  -
16  mr     101679  0.086   |  33  new      64456  -       |  50  their   40910  0.108
17  with   101210  0.091   |  34  share    63925  -       |
Average word length
- TREC (Text REtrieval Conference) has collected documents since 1992 as a test bed for participants
- In the TREC (English language) collection, the average word length is
  - 5 (range 4.8-5.3 in sub-collections)
  - 6-7 if stop words are eliminated
  - 8-9 over all vocabulary entries (long words occur less frequently)
Vocabulary growth
- How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
- This determines how the size of the inverted index will scale with the size of the corpus
- The vocabulary is not really upper-bounded, due to proper names, typos, etc.
Heaps’ law
- If V is the size of the vocabulary and n is the length of the corpus in words:
  V = K n^β, with constants K and 0 < β < 1
- Typical constants:
  - K ≈ 10-100
  - β ≈ 0.4-0.6 (approx. square-root)
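Heaps' law can be turned into a quick vocabulary-size estimate. In this sketch the defaults K = 50 and β = 0.5 are hypothetical values picked from within the typical ranges above, not constants fitted to any collection, and the function name `heaps_vocabulary` is an assumption.

```python
def heaps_vocabulary(n: int, K: float = 50.0, beta: float = 0.5) -> float:
    # V = K * n^beta; K and beta are illustrative defaults, not fitted constants
    return K * n**beta

print(heaps_vocabulary(1_000_000))  # 50000.0
print(heaps_vocabulary(4_000_000))  # 100000.0
```

With β = 0.5, quadrupling the corpus only doubles the estimated vocabulary, which is why the vocabulary grows much more slowly than the corpus.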
Heaps’ law data (from James Allan, UMass)
[Figure: words in vocabulary, V (thousands), plotted against words in collection, n (millions)]