N-gram frequency smoothing
11-12-2017
David Ling
Problem
- No corpus can cover all possible n-grams, e.g. "by bus yesterday" gives a count of 0 in the Google Ngram Corpus.
- Result: too many false positives when detecting errors.

Four possible ideas (need time to implement and evaluate):
- Linear interpolation (attempting)
- Neural network (attempting)
- Regression (if annotated data are available)
- Other hand-crafted generative models
Linear interpolation
- Linear interpolation using tri-gram and bi-gram frequencies; easy and common.

$$p(w_1, w_2, w_3) = \lambda_1\, p'(w_1, w_2, w_3) + \lambda_2\, p''(w_1, w_2, w_3)$$

- Probability approximated by counting 3-grams:

$$p' = \frac{\mathrm{count}(w_1, w_2, w_3)}{\text{total trigram count}}$$

- Probability approximated by counting 2-grams:

$$p'' = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \approx p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2)$$

- Example, "by bus yesterday":

$$p'' = \frac{\mathrm{count}(\text{"by"})}{N} \cdot \frac{\mathrm{count}(\text{"by bus"})}{\mathrm{count}(\text{"by *"})} \cdot \frac{\mathrm{count}(\text{"bus yesterday"})}{\mathrm{count}(\text{"bus *"})}$$
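A minimal sketch of how this interpolated score could be computed, assuming the tri-, bi-, and uni-gram counts have already been loaded into in-memory dictionaries; all names below (interpolated_prob, trigram_count, etc.) are hypothetical, and the unigram count of w1 is used as a stand-in for count(w1 *):

```python
# Sketch only: count dictionaries and totals are assumed to be loaded elsewhere.
def interpolated_prob(w1, w2, w3,
                      trigram_count, bigram_count, unigram_count,
                      total_trigrams, total_unigrams,
                      lambda1=0.8, lambda2=0.2):
    """score = lambda1 * p'(w1,w2,w3) + lambda2 * p''(w1,w2,w3)."""
    # p': relative frequency of the exact tri-gram
    p_tri = trigram_count.get((w1, w2, w3), 0) / total_trigrams

    # p'': p(w1) * p(w2|w1) * p(w3|w2), with unigram counts as the denominators
    c_w1 = unigram_count.get(w1, 0)
    c_w2 = unigram_count.get(w2, 0)
    p_w1 = c_w1 / total_unigrams
    p_w2_given_w1 = bigram_count.get((w1, w2), 0) / c_w1 if c_w1 else 0.0
    p_w3_given_w2 = bigram_count.get((w2, w3), 0) / c_w2 if c_w2 else 0.0
    p_bi = p_w1 * p_w2_given_w1 * p_w3_given_w2

    return lambda1 * p_tri + lambda2 * p_bi
```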
Example: "Ling and I go to school by bus yesterday."

3-gram             Frequency   Score (not normalized)   Score (normalized, x10^-22)
Ling and I         600         3218                     0.856
and I go           194508      1893922                  3.676
I go to            1061721     2872124                  6.601
go to school       714593      604034                   23.70
to school by       21730       39138                    0.122
school by bus      1887        1575                     4.681
by bus yesterday   0           63                       2.486

(Scores computed with λ1 = 0.8, λ2 = 0.2.)

We may use the score directly:

$$\text{score} = \lambda_1\, p'(w_1, w_2, w_3) + \lambda_2\, p''(w_1, w_2, w_3),$$

or normalize it by the unigram frequencies to eliminate the effect of word popularity:

$$\text{score}_{\text{normalized}} = \frac{\text{score}}{\mathrm{count}(w_1) \times \mathrm{count}(w_2) \times \mathrm{count}(w_3)}$$
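As a usage sketch (hypothetical, building on the helpers assumed above), the normalized score for each tri-gram of a sentence could be computed like this:

```python
def normalized_scores(tokens, trigram_count, bigram_count, unigram_count,
                      total_trigrams, total_unigrams):
    scores = []
    # slide a tri-gram window over the tokenized sentence
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        score = interpolated_prob(w1, w2, w3,
                                  trigram_count, bigram_count, unigram_count,
                                  total_trigrams, total_unigrams)
        # divide by the unigram counts so popular words do not dominate
        # (default of 1 avoids division by zero for unseen words)
        denom = (unigram_count.get(w1, 1) * unigram_count.get(w2, 1)
                 * unigram_count.get(w3, 1))
        scores.append(score / denom)
    return scores

# e.g. normalized_scores("ling and i go to school by bus yesterday".split(), ...)
```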
[Figure: scores compared against solely tri-gram frequency (40 times)]
Linear interpolation: problems and next steps

Problems
- Slow (online API to the Google 3-gram corpus)
- Parameters are not tuned (missing evaluation of recall and precision)

Next steps
- Download the Google n-gram corpus and set it up on our server (attempting); a loading sketch follows below.
  - Many subcategories and therefore huge: the 3-grams with initial letter 'a' alone require ~150 GB
  - Entries come with year, term frequency, and volume frequency
  - Some are tagged with POS
- Test on various scripts (UNCLE or marked HSMC scripts)
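A rough sketch of how one downloaded 3-gram file could be aggregated into total counts on our server, assuming the standard tab-separated Google Books Ngram export format (ngram, year, match_count, volume_count); the file name in the usage line is only an example:

```python
import gzip
from collections import defaultdict

def aggregate_counts(path):
    """Sum match_count over all years for each n-gram in one export file."""
    totals = defaultdict(int)  # ngram string -> total match_count
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            # each line: ngram TAB year TAB match_count TAB volume_count
            ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
            totals[ngram] += int(match_count)
    return totals

# counts = aggregate_counts("googlebooks-eng-all-3gram-20120701-aa.gz")
```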