
1 Vocabulary size and term distribution: tokenization, text normalization and stemming Lecture 2

2 Overview
Getting started:
– tokenization, stemming, compounds
– end of sentence detection
Collection vocabulary:
– terms, tokens, types
– vocabulary size
– term distribution
Stop words
Vector representation of text and term weighting

3 Tokenization
Friends, Romans, Countrymen, lend me your ears;
Friends | Romans | Countrymen | lend | me | your | ears
Token: an instance of a sequence of characters that are grouped together as a useful semantic unit for processing
Type: the class of all tokens containing the same character sequence
Term: a type that is included in the system dictionary (normalized)

4 The cat slept peacefully in the living room. It’s a very old cat.
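A quick way to see the token/type distinction on this sentence is to count both. The crude tokenizer below is an illustrative sketch, not the lecture's rule:

```python
# Count tokens and types in the slide's example sentence.
# Keeping the apostrophe in the character class holds "It's" together.
import re

text = "The cat slept peacefully in the living room. It's a very old cat."
tokens = [t for t in re.split(r"[^A-Za-z']+", text) if t]  # every instance
types = {t.lower() for t in tokens}                        # distinct (case-folded) sequences

print(tokens)
print(f"{len(tokens)} tokens, {len(types)} types")
# "the" and "cat" each occur twice, so there are fewer types than tokens.
```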

5 Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
How should we handle special cases involving apostrophes, hyphens, etc.?
– C++, C#, URLs, emails, phone numbers, dates
– San Francisco, Los Angeles
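As a sketch of how such cases are typically attacked, a rule-based tokenizer can try special-purpose patterns before a generic word rule. The regular expressions below are illustrative assumptions, not rules from the lecture:

```python
# Alternatives are tried left to right, so specific patterns win over the
# generic word rule.
import re

TOKEN_RE = re.compile(r"""
      https?://\S+              # URLs
    | [\w.+-]+@[\w-]+\.[\w.]+   # email addresses
    | C\+\+ | C\#               # programming-language names
    | \d{1,2}/\d{1,2}/\d{2,4}   # simple dates such as 3/20/91
    | \w+(?:[-']\w+)*           # words, keeping internal hyphens/apostrophes
""", re.VERBOSE)

text = "Mr. O'Neill thinks the boys' stories about C++ and http://example.com aren't amusing."
print(TOKEN_RE.findall(text))
# Keeps O'Neill, aren't, C++ and the URL intact, but still drops the
# trailing apostrophe of boys' -- which is exactly why these cases are hard.
```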

6 Issues of tokenization are language-specific
– Requires the language to be known
– Language identification based on classifiers that use short character subsequences as features is highly effective
– Most languages have distinctive signature patterns
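A toy illustration of the classifier idea, using character trigram overlap; the training snippets and the two languages are stand-ins:

```python
# Score a text against per-language character-trigram profiles and pick
# the language whose profile shares the most trigram mass with it.
from collections import Counter

def trigrams(text):
    text = " " + text.lower() + " "   # pad so word boundaries become features
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

profiles = {
    "en": trigrams("the cat sat on the mat and the dog slept"),
    "de": trigrams("der hund schlief und die katze sass auf der matte"),
}

def identify(text):
    query = trigrams(text)
    scores = {lang: sum(min(c, prof[g]) for g, c in query.items())
              for lang, prof in profiles.items()}
    return max(scores, key=scores.get)

print(identify("the old cat"))      # en
print(identify("die alte katze"))   # de
```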

7 Very important for information retrieval
Splitting tokens on spaces can cause bad retrieval results
– A search for York University mainly returns pages about New York University
German: compound nouns
– Retrieval systems for German benefit greatly from a compound-splitter module
– Checks whether a word can be subdivided into words that appear in the vocabulary
East Asian languages (Chinese, Japanese, Korean, Thai)
– Text is written without any spaces between words
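A minimal sketch of the compound-splitter idea just described, recursively checking whether a word can be subdivided into vocabulary words; the vocabulary and the minimum part length are assumptions:

```python
# Return a list of vocabulary words that concatenate to `word`, or None.
def split_compound(word, vocab, min_len=3):
    if word in vocab:
        return [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in vocab:
            rest = split_compound(tail, vocab, min_len)
            if rest:
                return [head] + rest
    return None

vocab = {"computer", "linguistik", "donau", "dampf", "schiff"}
print(split_compound("computerlinguistik", vocab))  # ['computer', 'linguistik']
print(split_compound("donaudampfschiff", vocab))    # ['donau', 'dampf', 'schiff']
```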


9 Stop words: very common words that have no discriminatory power

10 Building a stop word list by collection frequency
Sort terms by collection frequency and take the most frequent (see the sketch below)
– In a collection about insurance practices, “insurance” would be a stop word
Why do we need stop lists?
– Smaller indices for information retrieval
– Better approximation of term importance for summarization, etc.
Using a stop list is problematic for phrase searches
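A minimal sketch of the frequency-based construction; the three-document "collection" is a placeholder for a real corpus:

```python
# Collection frequency = total occurrences across all documents.
from collections import Counter

docs = [
    "the insurance policy covers the house",
    "the insurance claim was filed by the owner",
    "insurance rates depend on the policy",
]
cf = Counter(tok for doc in docs for tok in doc.split())

stop_words = [term for term, _ in cf.most_common(3)]
print(stop_words)   # ['the', 'insurance', 'policy']
# Note how the domain term "insurance" lands on the stop list, exactly
# as the slide warns for an insurance collection.
```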

11 Trend in IR systems over time
– Large stop lists (200–300 terms)
– Very small stop lists (7–12 terms)
– No stop list whatsoever
The 30 most common words account for about 30% of the tokens in written text, but:
– Good compression techniques keep indices small
– Term weighting leads to very common words having little impact on document representation
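Why weighting neutralizes common words: a standard inverse-document-frequency computation (idf is not defined on this slide, so treat the formula and the document frequencies below as an illustration):

```python
# idf(t) = log10(N / df_t): the more documents a term appears in,
# the lower its weight.
import math

N = 1_000_000                                   # documents in the collection (assumed)
df = {"the": 990_000, "insurance": 40_000, "capricious": 100}

for term, d in df.items():
    print(f"{term:>10}  idf = {math.log10(N / d):.2f}")
# the        idf = 0.00   -- appears almost everywhere, near-zero weight
# insurance  idf = 1.40
# capricious idf = 4.00
```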

12 Normalization
Token normalization: canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
– U.S.A. vs. USA
– anti-discriminatory vs. antidiscriminatory
– car vs. automobile?
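A sketch of one possible canonicalization covering the two mechanical cases above; the exact rules are assumptions, and the car/automobile case is synonymy, which no character-level rule can handle:

```python
# Map superficially different forms to one canonical form.
def normalize(token):
    token = token.replace(".", "")   # U.S.A. -> USA
    token = token.replace("-", "")   # anti-discriminatory -> antidiscriminatory
    return token.lower()

for t in ["U.S.A.", "USA", "Anti-discriminatory", "antidiscriminatory"]:
    print(t, "->", normalize(t))
# All four collapse to two equivalence classes: usa, antidiscriminatory.
```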

13 Normalization is sensitive to the query

Query term | Terms that should match
Windows    | Windows
windows    | Windows, windows, window
window     | window, windows

14 Capitalization / case folding
Good for:
– Allowing instances of Automobile at the beginning of a sentence to match a query for automobile
– Helping a search engine when most users type ferrari while interested in a Ferrari car
Bad for:
– Proper names vs. common nouns: General Motors, Associated Press, Black
Heuristic solution: lowercase only words at the beginning of a sentence; true-casing via machine learning
In IR, lowercasing everything is often most practical because of the way users type their queries
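The sentence-initial heuristic, sketched naively (the sentence splitter here is deliberately simple):

```python
# Lowercase only the first word of each sentence, leaving mid-sentence
# capitals (likely proper names) untouched.
import re

def heuristic_case_fold(text):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    out = []
    for s in sentences:
        words = s.split()
        if words:
            words[0] = words[0].lower()
        out.append(" ".join(words))
    return " ".join(out)

print(heuristic_case_fold("Automobile sales rose. Meanwhile General Motors gained."))
# automobile sales rose. meanwhile General Motors gained.
# Sentence-initial words are folded; "General Motors" survives intact.
```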

15 Other languages
About 60% of webpages are in English
– Less than one third of Internet users speak English
– Less than 10% of the world’s population primarily speaks English
Only about one third of blog posts are in English

16 Stemming and lemmatization
organize, organizes, organizing
democracy, democratic, democratization
am, are, is → be
car, cars, car’s, cars’ → car

17 Stemming
– A crude heuristic process that chops off the ends of words
– democratic → democrat
Lemmatization
– Uses a vocabulary and morphological analysis to return the base form of a word (the lemma)
– democratic → democracy
– sang → sing
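For a hands-on comparison, assuming NLTK is installed and its WordNet data has been fetched via nltk.download("wordnet"):

```python
# A stemmer chops suffixes heuristically; a lemmatizer consults a vocabulary.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stem = PorterStemmer().stem
lemma = WordNetLemmatizer().lemmatize

for w in ["organizing", "democratization", "sang"]:
    print(f"{w:>16} | stem: {stem(w):<10} | lemma: {lemma(w, pos='v')}")
# Only the lemmatizer, with its vocabulary, can map sang -> sing;
# no suffix-chopping rule gets there.
```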

18 Porter stemmer
The most common algorithm for stemming English; 5 phases of word reduction
– SSES → SS: caresses → caress
– IES → I: ponies → poni
– SS → SS: caress → caress
– S → (drop): cats → cat
– EMENT → (drop): replacement → replac, but cement → cement
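The listed rewrites from the stemmer's first phase, coded directly as a toy. The real algorithm has five phases with measure conditions on the remaining stem, which is why replacement loses -ement but cement, whose remaining stem would be just "c", does not:

```python
# Porter step 1a, as listed on the slide: rules are tried
# longest-suffix-first, and only the first matching rule fires.
def step1a(word):
    if word.endswith("sses"):
        return word[:-2]    # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]    # ponies -> poni
    if word.endswith("ss"):
        return word         # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]    # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step1a(w))
```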


21 Vocabulary size
Dictionaries contain 600,000+ words
But they do not include names of people, locations, products, etc.

22 Heaps’ law: estimating the number of terms
M = kT^b
– M: vocabulary size (number of terms)
– T: number of tokens
– typically 30 ≤ k ≤ 100 and b ≈ 0.5
Linear relationship between vocabulary size and number of tokens in log-log space: log M = log k + b log T
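Plugging numbers into M = kT^b; k = 44 and b = 0.49 are the values reported for the Reuters RCV1 collection in the IR literature, used here as one plausible parameter choice:

```python
# Heaps' law: predicted vocabulary size M for a collection of T tokens.
def heaps(T, k=44, b=0.49):
    return k * T ** b

for T in [10_000, 1_000_000, 100_000_000]:
    print(f"{T:>12,} tokens -> ~{heaps(T):>10,.0f} terms")
# Each 100x growth in tokens roughly multiplies the vocabulary by 10:
# the straight line in log-log space.
```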

23 [Plot: vocabulary size M vs. number of tokens T on log-log axes; Heaps’ law appears as a straight line]

24 Zipf’s law: modeling the distribution of terms
The collection frequency cf_i of the i-th most common term is proportional to 1/i: cf_i ∝ 1/i
If the most frequent term occurs cf_1 times, then the second most frequent term occurs cf_1/2 times, the third most frequent term cf_1/3 times, etc.
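The prediction made concrete: given an assumed frequency for the top-ranked term, Zipf's law fixes every other frequency:

```python
# cf_i = cf_1 / i for the i-th most common term.
cf1 = 1_000_000   # assumed collection frequency of the most frequent term

for i in range(1, 6):
    print(f"rank {i}: ~{cf1 // i:>9,} occurrences")
# rank 1: ~1,000,000   rank 2: ~500,000   rank 3: ~333,333 ...
```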

25 [Plot: term rank vs. collection frequency on log-log axes, illustrating Zipf’s law]

26 Problems with normalization
– A change in the stop word list can dramatically alter term weightings
– A document may contain an outlier term

