N-gram frequency smoothing


1 N-gram frequency smoothing
David Ling

2 Problem
A corpus cannot cover all possible n-grams.
E.g. "by bus yesterday" has a count of 0 in the Google Ngram Corpus.
Result: too many false positives when detecting errors.
4 possible ideas (each needs time to implement and evaluate):
- Linear interpolation (attempting)
- Neural network (attempting)
- Regression (if annotated data is available)
- Other hand-crafted generative models

3 Linear interpolation
Linear interpolation using tri-gram and bi-gram frequencies (easy and common):

$$p(w_1, w_2, w_3) = \lambda_1\, p'(w_1, w_2, w_3) + \lambda_2\, p''(w_1, w_2, w_3)$$

Probability approximated by counting 3-grams:

$$p' = \frac{\mathrm{count}(w_1, w_2, w_3)}{\text{total trigram count}}$$

Probability approximated by counting 2-grams:

$$p'' = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2) \approx p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2)$$

Example, "by bus yesterday":

$$p'' = \frac{\mathrm{count}(\text{"by"})}{N} \cdot \frac{\mathrm{count}(\text{"by bus"})}{\mathrm{count}(\text{"by } *\text{"})} \cdot \frac{\mathrm{count}(\text{"bus yesterday"})}{\mathrm{count}(\text{"bus } *\text{"})}$$
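The interpolation on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-sentence toy corpus, the whitespace tokenizer, and the helper names `count_ngrams` and `interpolated_prob` are hypothetical, and N is taken to be the total number of unigram tokens, which the slide does not define. The weights default to the 0.8 / 0.2 used in the example slide.

```python
from collections import Counter


def count_ngrams(sentences):
    """Count unigrams, bigrams and trigrams in whitespace-tokenized sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for sentence in sentences:
        toks = sentence.lower().split()
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
        tri.update(zip(toks, toks[1:], toks[2:]))
    return uni, bi, tri


def interpolated_prob(w1, w2, w3, uni, bi, tri, lam1=0.8, lam2=0.2):
    """Return lam1 * p' + lam2 * p'' for the trigram (w1, w2, w3)."""
    # p': relative frequency of the trigram itself.
    total_tri = sum(tri.values())
    p_tri = tri[(w1, w2, w3)] / total_tri if total_tri else 0.0

    # count("w *"): number of bigram occurrences whose first word is w.
    starts = Counter()
    for (first, _second), c in bi.items():
        starts[first] += c

    def cond(a, b):
        # p(b | a) estimated as count("a b") / count("a *").
        return bi[(a, b)] / starts[a] if starts[a] else 0.0

    # p'': bigram chain p(w1) * p(w2 | w1) * p(w3 | w2).
    # Assumption: N is the total number of unigram tokens.
    n = sum(uni.values())
    p_w1 = uni[w1] / n if n else 0.0
    p_bi = p_w1 * cond(w1, w2) * cond(w2, w3)

    return lam1 * p_tri + lam2 * p_bi


# Toy corpus (hypothetical): "by bus yesterday" never occurs as a trigram,
# but "by bus" and "bus yesterday" both occur as bigrams.
corpus = ["we go to school by bus", "the bus yesterday was late"]
uni, bi, tri = count_ngrams(corpus)
p = interpolated_prob("by", "bus", "yesterday", uni, bi, tri)
print(p)
```

In this toy setting the trigram count is zero, yet the interpolated value stays positive because the bigram term contributes, which is the effect the smoothing is meant to achieve.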

4 Example: Ling and I go to school by bus yesterday.
Scores use (lambda1, lambda2) = (0.8, 0.2). "?" marks a value that is not recoverable from this transcript.

3-gram              Frequency    Score (not normalized)    Score (normalized, x10^-22)
Ling and I          600          3218                      0.856
And I go            194508 (frequency or score; the other value is ?)    3.676
I go to             ?            ?                         6.601
Go to school        714593       604034                    23.70
To school by        21730        39138                     0.122
School by bus       1887         1575                      4.681
By bus yesterday    ?            63                        2.486

We may use the score directly:

$$\text{score} = \lambda_1\, p'(w_1, w_2, w_3) + \lambda_2\, p''(w_1, w_2, w_3)$$

Or normalize it by unigram frequency to eliminate the effects of word popularity:

$$\text{score(normalized)} = \frac{\text{score}}{\mathrm{count}(w_1) \times \mathrm{count}(w_2) \times \mathrm{count}(w_3)}$$
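As a rough illustration of the normalization step, the sketch below divides a score by the product of the three unigram counts and then picks the lowest-scoring trigram of a sentence. The function names `normalized_score` and `lowest_scoring` and all numbers are hypothetical toy values, not figures from the table.

```python
def normalized_score(score, c1, c2, c3):
    """score / (count(w1) * count(w2) * count(w3)); 0.0 if any count is zero."""
    denom = c1 * c2 * c3
    return score / denom if denom else 0.0


def lowest_scoring(trigram_scores):
    """Return the trigram with the smallest score (a candidate error site)."""
    return min(trigram_scores, key=trigram_scores.get)


# Toy values (hypothetical): raw scores and unigram counts per trigram.
raw = {
    ("go", "to", "school"): (900.0, (100, 300, 30)),
    ("to", "school", "by"): (40.0, (300, 30, 200)),
    ("school", "by", "bus"): (10.0, (30, 200, 10)),
}
normalized = {
    trigram: normalized_score(score, *counts)
    for trigram, (score, counts) in raw.items()
}
suspect = lowest_scoring(normalized)
print(suspect, normalized[suspect])
```

Dividing by the unigram counts means a trigram made of very common words needs a proportionally higher raw score to look normal, so the ranking can differ from the ranking by raw score.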

5 Solely tri-gram frequency (40 times)

6 Solely tri-gram frequency (40 times)

7 Linear interpolation
Data source: Google 3-gram corpus
Problems:
- Slow (online API)
- Parameters are not tuned (recall and precision have not been evaluated)
Next steps:
- Download the Google n-gram corpus and host it on our server (attempting)
  - It has many subcategories and is therefore huge: 3-grams with initial letter 'a' alone require ~150 GB
  - Entries come with year, term frequency, and volume frequency
  - Some entries are tagged with POS
- Test on various scripts (UNCLE or marked HSMC scripts)
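Once the corpus files are stored locally, the per-year entries would need to be collapsed into one count per n-gram. The sketch below assumes the tab-separated line layout `ngram, year, match_count, volume_count` of the 2012 release of the Google Books Ngram files; other releases use different layouts, so the parsing would need checking against the files actually downloaded. The function name `aggregate_counts`, the `min_year` parameter, and the sample lines are hypothetical.

```python
from collections import defaultdict


def aggregate_counts(lines, min_year=None):
    """Sum match_count over years for each n-gram.

    Assumes each line is: ngram <TAB> year <TAB> match_count <TAB> volume_count.
    Lines that do not have exactly four fields are skipped.
    """
    totals = defaultdict(int)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue
        ngram, year, match_count, _volume_count = parts
        if min_year is not None and int(year) < min_year:
            continue
        totals[ngram] += int(match_count)
    return dict(totals)


# Made-up sample lines in the assumed layout (not real corpus values).
sample = [
    "go to school\t1990\t120\t80\n",
    "go to school\t2000\t200\t150\n",
    "school by bus\t2000\t7\t6\n",
    "malformed line without tabs\n",
]
print(aggregate_counts(sample))
print(aggregate_counts(sample, min_year=1995))
```

Because the function accepts any iterable of lines, it can be fed an open file handle and process a large file one line at a time instead of loading it into memory.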

