Published by Madeline Cain. Modified over 8 years ago.
1
CH. 4 PROBABILITY AND TEXT SAMPLING
2011.10.19.
Data mining LAB, 이아람
2
4.5 THE BAG-OF-WORDS MODEL
Only word frequencies are analyzed.
Word order is irrelevant.
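A minimal sketch of the bag-of-words idea, assuming a simple tokenizer that lowercases the text and keeps runs of letters and apostrophes. The function name `bag_of_words` and the second sentence are illustrative, not from the slides. Two sentences with the same words in a different order give the same bag.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map each lowercased word to its frequency; word order is discarded."""
    # Assumption: a word is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

a = bag_of_words("The cat ate the bird.")
b = bag_of_words("The bird ate the cat.")  # same words, different order

print(a["the"])  # frequency of "the" in the first sentence
print(a == b)    # True: the model cannot tell the two sentences apart
```

The comparison shows what the model gives up: who ate whom is lost once only frequencies are kept.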
3
4.6 THE EFFECT OF SAMPLE SIZE
How is the number of types related to the number of tokens as the sample size increases?
4
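4.6.1 TOKENS vs TYPES
Tokens: every word is counted, including repetitions.
Types: repetitions are ignored, so each distinct word is counted once.
Example: "The cat ate the bird." has 5 tokens and 4 types ("the" occurs twice, ignoring case).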
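The distinction can be checked on the example sentence with a short sketch. It assumes case is ignored and punctuation is dropped. The helper name `tokenize` is hypothetical.

```python
import re

def tokenize(text):
    """Split text into lowercased word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

tokens = tokenize("The cat ate the bird.")
types = set(tokens)  # a set keeps each distinct word once

print(len(tokens))  # number of tokens: 5
print(len(types))   # number of types: 4
```

If case were kept, "The" and "the" would count as separate types, so the tokenization rule affects the type count.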
5
Notation
N = the size of the text sample (the number of tokens)
V(N) = the number of types in a sample of size N
w_i = the word labeled i
f(w_i, N) = the frequency of the word w_i in a text of size N
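One consequence of this notation, assuming the labels i run over the V(N) distinct types: every token belongs to exactly one type, so the type frequencies add up to the sample size, and the mean number of tokens per type is N / V(N).

```latex
\[
  \sum_{i=1}^{V(N)} f(w_i, N) = N,
  \qquad
  \frac{N}{V(N)} = \frac{1}{V(N)} \sum_{i=1}^{V(N)} f(w_i, N).
\]
```

For "The cat ate the bird." with case ignored, the frequencies are 2 + 1 + 1 + 1 = 5 = N with V(N) = 4, giving 5/4 = 1.25 tokens per type.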
6
TOKENS vs TYPES
7
Tokens vs Types Figure 4.5
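A plot of types against tokens, as this slide title suggests, needs V(N) for each prefix of the text. The sketch below computes that curve in one pass. The function name `type_growth` and the toy sentence are illustrative assumptions, not data from the figure.

```python
def type_growth(tokens):
    """Return [V(1), V(2), ..., V(N)]: the number of types after each token."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)            # a repeated word leaves the set unchanged
        curve.append(len(seen))
    return curve

# Toy input (hypothetical), already tokenized and lowercased.
tokens = "the cat ate the bird and the bird ate the seed".split()
curve = type_growth(tokens)

for n, v in enumerate(curve, start=1):
    print(n, v)  # N and V(N)
```

The curve never decreases and stays flat whenever a word repeats, which is why types grow more slowly than tokens as the sample gets larger.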
8
Tokens vs Tokens/Types Figure 4.6
9
Tokens vs Tokens/Types (2)
Figure 4.7
The Black Cat: 3.17 tokens per type
The Unparalleled Adventure of One Hans Pfaall: 5.61 tokens per type
N: 1,000 to 4,000
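The tokens-per-type ratio is N / V(N) computed over the first N tokens of a text. The sketch below is a hedged illustration on a toy token list. It does not reproduce the 3.17 and 5.61 values, which depend on the full texts and on the tokenization used in the source. The function name `tokens_per_type` is hypothetical.

```python
def tokens_per_type(tokens, n):
    """Return N / V(N) for the first n tokens of a tokenized text."""
    sample = tokens[:n]
    if not sample:
        raise ValueError("sample must contain at least one token")
    return len(sample) / len(set(sample))

# Toy input (hypothetical), already tokenized and lowercased.
tokens = "the cat ate the bird and the bird ate the seed".split()

for n in (4, 8, 11):
    print(n, round(tokens_per_type(tokens, n), 2))
```

Because the ratio changes with n, two texts are only comparable when it is measured at the same sample size.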
10
Size of sample
In corpus linguistics, samples of equal size are taken from each text.
Each sample is smaller than the text it comes from, so all texts can be analyzed in a similar fashion.
Some corpora use this approach, e.g. the Brown Corpus.
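A minimal sketch of equal-size sampling, assuming each sample is simply the first `sample_size` tokens of a text. The slides do not say how a sample is chosen within a text, and the function name and toy texts below are hypothetical.

```python
def equal_size_samples(texts, sample_size):
    """Cut every tokenized text down to its first sample_size tokens.

    Texts shorter than sample_size are left out, since they cannot
    supply a full-size sample.
    """
    return {
        name: toks[:sample_size]
        for name, toks in texts.items()
        if len(toks) >= sample_size
    }

# Toy tokenized texts of different lengths (hypothetical data).
texts = {
    "a": "the cat ate the bird and the bird ate the seed".split(),
    "b": "one two three four five six seven".split(),
    "c": "too short".split(),
}

samples = equal_size_samples(texts, 6)
for name, sample in samples.items():
    print(name, len(sample), len(set(sample)))  # name, N, V(N)
```

With N fixed across samples, differences in V(N) reflect the texts themselves and not their lengths.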