1
N-gram frequency smoothing
David Ling
2
Problem
A corpus cannot cover all possible n-grams.
- E.g. "by bus yesterday" has a count of 0 in the Google Ngram Corpus
- Result: too many false positives when detecting errors
4 possible ideas (need time to perform and evaluate)
- Linear interpolation (attempting)
- Neural network (attempting)
- Regression (if annotated data is available)
- Other hand-crafted generative models
3
Linear interpolation
Linear interpolation using tri-gram and bi-gram frequencies. Easy and common.
$$p(w_1, w_2, w_3) = \lambda_1\, p'(w_1, w_2, w_3) + \lambda_2\, p''(w_1, w_2, w_3)$$
Probability approximated by counting 3-grams:
$$p' = \frac{\mathrm{count}(w_1, w_2, w_3)}{\text{total trigram count}}$$
Probability approximated by counting 2-grams:
$$p'' = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2) \approx p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2)$$
Example: "by bus yesterday":
$$p'' = \frac{\text{count("by")}}{N} \cdot \frac{\text{count("by bus")}}{\text{count("by ∗")}} \cdot \frac{\text{count("bus yesterday")}}{\text{count("bus ∗")}}$$
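The interpolation above can be sketched in a few lines of Python. This is a minimal illustration on a made-up toy token list, not the Google Ngram Corpus; the function names (`build_counts`, `p_trigram`, `p_bigram_chain`, `interpolated`) are hypothetical, and count("w ∗") is taken here to be the number of bigram occurrences that start with w.

```python
from collections import Counter

def build_counts(tokens):
    """Count unigrams, bigrams, trigrams, and bigrams starting with each word."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    starts = Counter()  # starts[w] plays the role of count("w *")
    for (first, _second), c in bigrams.items():
        starts[first] += c
    return unigrams, bigrams, trigrams, starts

def p_trigram(tri, trigrams):
    """p': trigram count divided by the total trigram count."""
    total = sum(trigrams.values())
    return trigrams[tri] / total if total else 0.0

def p_bigram_chain(tri, unigrams, bigrams, starts):
    """p'': p(w1) * p(w2|w1) * p(w3|w2), estimated from unigram and bigram counts."""
    w1, w2, w3 = tri
    n = sum(unigrams.values())
    if n == 0 or starts[w1] == 0 or starts[w2] == 0:
        return 0.0
    return (unigrams[w1] / n) * (bigrams[(w1, w2)] / starts[w1]) * (bigrams[(w2, w3)] / starts[w2])

def interpolated(tri, counts, lam1=0.8, lam2=0.2):
    """lam1 * p' + lam2 * p''; the default weights are the untuned values from the slides."""
    unigrams, bigrams, trigrams, starts = counts
    return lam1 * p_trigram(tri, trigrams) + lam2 * p_bigram_chain(tri, unigrams, bigrams, starts)

# Toy data (an assumption for illustration only): "by bus yesterday" never
# occurs as a trigram, but "by bus" and "bus yesterday" both occur as bigrams.
toy_tokens = "we go by bus today the bus yesterday was late".split()
toy_counts = build_counts(toy_tokens)
unseen = ("by", "bus", "yesterday")
print(p_trigram(unseen, toy_counts[2]), interpolated(unseen, toy_counts))
```

In this toy setting the trigram term is zero, yet the interpolated value stays positive because the bigram term backs it off, which is the effect the smoothing is meant to provide.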
4
Example: Ling and I go to school by bus yesterday.
Columns: 3-gram; frequency; score, not normalized; score, normalized (×10^-22)
Scores use λ1, λ2 = 0.8, 0.2.
Ling and I; 600; 3218; 0.856
And I go; 194508; 3.676
I go to; 6.601
Go to school; 714593; 604034; 23.70
To school by; 21730; 39138; 0.122
School by bus; 1887; 1575; 4.681
By bus yesterday; 63; 2.486
We may use the score directly:
$$\text{score} = \lambda_1\, p'(w_1, w_2, w_3) + \lambda_2\, p''(w_1, w_2, w_3)$$
Or normalize it by unigram frequency to eliminate the effects due to word popularity:
$$\text{score(normalized)} = \frac{\text{score}}{\mathrm{count}(w_1) \times \mathrm{count}(w_2) \times \mathrm{count}(w_3)}$$
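As a rough sketch of how a sentence could be scored trigram by trigram and then normalized by unigram counts, consider the following. The helper names (`trigrams_of`, `normalize`, `score_sentence`), the placeholder scoring function, and all counts are assumptions made for illustration; none of the numbers come from the Google Ngram Corpus or from the table above.

```python
def trigrams_of(tokens):
    """Slide a window of three words over the sentence."""
    return list(zip(tokens, tokens[1:], tokens[2:]))

def normalize(score, trigram, unigram_counts):
    """score / (count(w1) * count(w2) * count(w3)); None if any count is zero or missing."""
    denom = 1
    for w in trigram:
        denom *= unigram_counts.get(w, 0)
    return score / denom if denom else None

def score_sentence(tokens, score_fn, unigram_counts):
    """Return (trigram, score, normalized score) for every trigram in the sentence."""
    rows = []
    for tri in trigrams_of(tokens):
        s = score_fn(tri)
        rows.append((tri, s, normalize(s, tri, unigram_counts)))
    return rows

# Lowercasing the sentence is an assumption of this sketch.
sentence = "ling and i go to school by bus yesterday".split()
# Placeholder inputs: a constant score and a count of 2 for every word.
fake_unigram_counts = {w: 2 for w in sentence}
rows = score_sentence(sentence, lambda tri: 2.0, fake_unigram_counts)
for tri, s, s_norm in rows:
    print(" ".join(tri), s, s_norm)
```

A nine-word sentence yields seven trigrams, matching the seven rows of the example. In practice `score_fn` would be the interpolated score, and a threshold on the normalized value would decide which trigrams to flag.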
5
Solely tri-gram frequency (40 times)
6
Solely tri-gram frequency (40 times)
7
Linear interpolation
Google 3-gram corpus
Problems
- Slow (online API)
- Parameters are not tuned (missing evaluation on recall and precision)
Next steps
- Download the Google n-gram corpus and host it on our server (attempting)
  - Many subcategories and therefore huge: 3-grams with initial letter 'a' require ~150 GB
  - Entries come with year, term frequency, and volume frequency
  - Some are tagged with POS
- Test on various scripts (UNCLE or marked HSMC scripts)
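Once the corpus files are stored locally, the per-year entries need to be collapsed into one count per n-gram. The sketch below assumes the tab-separated layout of the downloadable Google Books Ngram files (n-gram, year, match count, volume count per line) and treats any token containing an underscore as POS-tagged, which is a crude heuristic. The function name `aggregate_counts` and the sample lines are made up for illustration.

```python
from collections import Counter

def aggregate_counts(lines, min_year=0, skip_pos_tagged=True):
    """Sum match counts over years for each n-gram.

    Assumed line layout: ngram <TAB> year <TAB> match_count <TAB> volume_count
    """
    totals = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue  # skip lines that do not fit the assumed layout
        ngram, year, match_count, _volume_count = parts
        if int(year) < min_year:
            continue
        # Crude heuristic: POS-tagged tokens look like "bus_NOUN".
        if skip_pos_tagged and any("_" in tok for tok in ngram.split(" ")):
            continue
        totals[ngram] += int(match_count)
    return totals

# Made-up sample lines in the assumed format.
sample_lines = [
    "by bus to\t1999\t10\t5\n",
    "by bus to\t2000\t12\t6\n",
    "by_ADP bus_NOUN to_PRT\t2000\t12\t6\n",
    "malformed line\n",
]
totals = aggregate_counts(sample_lines)
print(totals["by bus to"])
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a large shard can be streamed rather than loaded into memory at once.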