CH.4 PROBABILITY AND TEXT SAMPLING Data mining LAB 이아람
4.5 THE BAG-OF-WORDS MODEL Only word frequencies are analyzed. Word order is irrelevant.
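A minimal sketch of the bag-of-words idea, assuming a word is a run of letters or apostrophes and that case is ignored (both are choices made for this example, not something the slides specify). The helper name `bag_of_words` is hypothetical. Two sentences with the same words in a different order give the same bag:

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    # Assumption: a word is a run of letters or apostrophes; case is ignored.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


a = bag_of_words("The cat ate the bird.")
b = bag_of_words("The bird ate the cat.")

print(a["the"])  # 2
print(a == b)    # True: the word order differs, the bag does not
```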
4.6 THE EFFECT OF SAMPLE SIZE How the number of types relates to the number of tokens as the sample size increases.
4.6.1 TOKENS vs TYPES Tokens: every word is counted, including repetitions. Types: repetitions are ignored. Example: "The cat ate the bird." has 5 tokens and 4 types, counting "The" and "the" as the same word.
Notation
N = the size of the text sample (the number of tokens)
V(N) = the number of types in a sample of N tokens
w_i = the word labeled i
f(w_i, N) = the frequency of the word w_i in a text of size N
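The notation can be checked on the example sentence "The cat ate the bird." The sketch below assumes lowercasing and a simple regular-expression tokenizer; the function name `tokenize` is hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase the text; a word is a run of letters or apostrophes.
    return re.findall(r"[a-z']+", text.lower())


tokens = tokenize("The cat ate the bird.")
freq = Counter(tokens)  # freq[w] plays the role of f(w_i, N)

N = len(tokens)  # number of tokens
V = len(freq)    # number of types, V(N)

print(N, V, freq["the"])  # 5 4 2
```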
Figure 4.5: Tokens vs Types
Figure 4.6: Tokens vs Tokens/Types
Figure 4.7: Tokens vs Tokens/Types (2) The Black Cat: 3.17 tokens per type. The Unparalleled Adventure of One Hans Pfaall: 5.61 tokens per type. N from 1,000 to 4,000
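The quantities plotted in these figures, V(N) and the tokens-per-type ratio N / V(N), can be computed for growing N in a single pass over the tokens. This is a sketch only: the function name `type_growth` and the toy sentence are made up for illustration and are not taken from the texts named above.

```python
def type_growth(tokens, step):
    """Return (N, V(N), N / V(N)) for N = step, 2*step, ... up to len(tokens)."""
    seen = set()  # types encountered so far
    rows = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if n % step == 0:
            rows.append((n, len(seen), n / len(seen)))
    return rows


# Hypothetical toy text, already lowercased and split on whitespace.
toy_tokens = "the cat ate the bird and the bird ate the seed".split()
rows = type_growth(toy_tokens, step=2)

for n, v, ratio in rows:
    print(n, v, round(ratio, 2))
```

For a real text, `step` would be much larger (for example 1,000) and the tokens would come from a tokenizer rather than `str.split`.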
Size of sample In corpus linguistics, samples of equal size are taken. Each sample is smaller than the text it comes from, so all texts can be analyzed in a similar fashion. Corpora use this approach, e.g. the Brown Corpus.
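A sketch of comparing texts on samples of equal size. It assumes the sample is simply the first n tokens of each text, which is one possible choice and not necessarily how any particular corpus draws its samples. The names `first_n_tokens` and `tokens_per_type` and the toy texts are hypothetical.

```python
def first_n_tokens(tokens, n):
    """Return the first n tokens; fail if the text is too short for the sample."""
    if len(tokens) < n:
        raise ValueError("text is shorter than the requested sample size")
    return tokens[:n]


def tokens_per_type(tokens):
    """Average number of tokens per type, N / V(N)."""
    return len(tokens) / len(set(tokens))


# Hypothetical toy texts of different lengths.
texts = {
    "short": "the cat ate the bird".split(),
    "long": "the dog the dog saw the cat run".split(),
}

n = 4  # assumed sample size, smaller than every text
ratios = {
    name: tokens_per_type(first_n_tokens(toks, n))
    for name, toks in texts.items()
}
print(ratios)
```

Because every ratio is computed on the same number of tokens, differences between texts are not just an effect of text length.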