web1T and deep learning methods 2018-03-29 David Ling
Contents
- Newly added corpus (web1T)
  - Revised rule
  - Performance
- Deep-learning method
  - Classifier
  - Translation
  - Word prediction (a trial on TensorFlow)
Newly added web1T corpus
- Corpus sizes: web1T ~37 GB, Google ngram ~20 GB, Wiki2007 ~10 GB
- Example sentence: "Youtude to show the trend on internet about making video."
- Web1T has more trigrams
- Typically, a trigram with a score of order 1e-26 is rare

trigram | freq(w1) | freq(w2) | freq(w3) | total count | score | wiki07 | Google ngram | web1T
youtude to show | | 2.18E+10 | 3.37E+08 | | 1.36E-19 | | |
to show the | | | 4.97E+10 | 5813781 | 1.59E-23 | 2107 | 3483882 | 2327792
show the trend | | | 26937990 | 13994 | 3.10E-23 | | 10445 | 3549
the trend on | | | 6.05E+09 | 8947 | 1.10E-24 | 4 | 3226 | 5717
trend on internet | | | 2.77E+08 | | 2.22E-26 | | |
on internet about | | | 1.76E+09 | 1320 | 4.48E-25 | | |
internet about making | | | 2.26E+08 | 59 | 5.44E-25 | | |
about making video | | | 3.75E+08 | 318 | 2.13E-24 | | |
making video . | | | 4.13E+10 | 963 | 2.75E-25 | 1 | 41 | 921
(each unigram frequency is shown only the first time the word appears; blank cells were empty on the slide)
Exception on the score
- Some trigrams have a low score even though they have many counts: "his and the", "email , he"
- They contain multiple very common tokens: and, the, ",", _start_, in, etc.
- These tokens have many legitimate combinations (a broad distribution), and thus a low score
- Filter them out by adding a highlighting exception:
  1. score < 0.9e-25 (original) AND
  2. total_count < 50 (additional)
- score = freq(w1, w2, w3) / (freq(w1) × freq(w2) × freq(w3))

trigram | freq(w1) | freq(w2) | freq(w3) | total count | score | wiki07 | Google ngram | web1T
his and the | 2.47E+09 | 2.43E+10 | 4.97E+10 | 99422 | 3.33E-26 | 95 | 65015 | 34312
kept connect with | 78939166 | 51001544 | 5.78E+09 | | 4.30E-26 | | |
spammer message , | 921650 | 4E+08 | 5.86E+10 | | 4.63E-26 | | |
the spammer email | | | 4.46E+08 | | 4.89E-26 | | |
the spammer message | | | | | 5.46E-26 | | |
these funny video | 1.19E+09 | 39292367 | 3.75E+08 | | 5.68E-26 | | |
talk `` this | 1.4E+08 | 21715521 | 5.14E+09 | | 6.40E-26 | | |
of unloaded video | 2.87E+10 | 1426742 | | | 6.52E-26 | | |
's attention successfully | 4.6E+09 | 1.09E+08 | 30227572 | | 6.59E-26 | | |
can became famous | 2.08E+09 | 1.34E+08 | 53195226 | | 6.76E-26 | | |
video draw most | | 45710262 | 8.47E+08 | | 6.88E-26 | | |
common asker , | 2.24E+08 | 987843 | | | 7.71E-26 | | |
he kept connect | 2.9E+09 | | | | 8.56E-26 | | |
first speaker present | 1.04E+09 | 43143869 | 2.56E+08 | | 8.68E-26 | | |
_start_ in kevin | 1.17E+11 | 1.69E+10 | 26733733 | 4898 | 9.28E-26 | 15 | 2098 | 2785
email , he | | | | 7387 | 9.75E-26 | 2 | 713 | 6672
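As a minimal sketch, the revised rule could be applied as below. The count-lookup dictionaries (unigram_count, trigram_count) are assumptions about how the corpus counts are stored; only the score formula and the two thresholds come from the slides.

```python
def trigram_score(w1, w2, w3, unigram_count, trigram_count):
    """score = freq(w1, w2, w3) / (freq(w1) * freq(w2) * freq(w3))"""
    denom = (unigram_count.get(w1, 0)
             * unigram_count.get(w2, 0)
             * unigram_count.get(w3, 0))
    if denom == 0:
        return 0.0
    return trigram_count.get((w1, w2, w3), 0) / denom


def highlight_trigram(w1, w2, w3, unigram_count, trigram_count):
    """Revised highlighting rule: low score AND genuinely rare trigram."""
    score = trigram_score(w1, w2, w3, unigram_count, trigram_count)
    total = trigram_count.get((w1, w2, w3), 0)
    # The extra total-count condition filters out trigrams built from very
    # common tokens ("his and the"), which get a low score despite many counts.
    return score < 0.9e-25 and total < 50
```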
Performance (newer result vs. older result)
- Green color: highlighted by trigram detection
- [Side-by-side example sentences comparing the newer and the older result]
- It seems that the newer result is better
- [Further example sentences comparing the newer and the older result]
- However, the newer result is not often better
Deep-learning method
- Classifier
- Translation
- Word prediction (simple trial using TensorFlow)
Classifier
- Reference: The Illinois-Columbia System in the CoNLL-2014 Shared Task (ranked 2nd)
- For article correction: the UI system in the HOO 2012 shared task on error correction
Classifier
- Different features for different grammatical errors
- Confident of achieving some results (prepositions, confusing words, verbs, etc.); a sketch follows below
- However, it may not be able to correct semantic errors
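As an illustration of the per-error-type classifier idea (not the actual Illinois-Columbia system), a minimal sketch for preposition selection might look like the following; the feature set and the scikit-learn classifier are assumptions.

```python
# Illustrative per-error-type classifier: predict the preposition that should
# appear in a slot, then flag a mismatch with the writer's choice as an error.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def context_features(tokens, pos_tags, i):
    """Simple context features around position i (the preposition slot)."""
    return {
        "w-1": tokens[i - 1] if i > 0 else "_start_",
        "w+1": tokens[i + 1] if i + 1 < len(tokens) else "_end_",
        "pos-1": pos_tags[i - 1] if i > 0 else "_start_",
        "pos+1": pos_tags[i + 1] if i + 1 < len(tokens) else "_end_",
    }


# X_train: list of feature dicts from well-formed text,
# y_train: the preposition actually used there.
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
# model.fit(X_train, y_train)
# At correction time: if the predicted preposition differs from the written
# one with high confidence, flag it as a likely preposition error.
```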
Translation
- Seq2seq (planning to follow "Attention Is All You Need")
- Input: unedited sentences (with or without mistakes)
- Output: edited sentences (without mistakes)
- Training data fed to the seq2seq model: Wikipedia articles, published books, Lang8, NUCLE (CoNLL-2014 shared task), and artificially generated sentences; the model outputs edited sentences
Translation
- Problem: not enough data
- Artificially generated data:
  - Randomly replace prepositions, confusion words, and articles (see the sketch below)
  - Chinglish-style sentences from a translator applied to a parallel corpus
  - Output from Google Translate
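A minimal sketch of the random-replacement idea for generating artificial errors; the confusion sets, the replacement probability, and the helper names are illustrative assumptions.

```python
import random

PREPOSITIONS = ["in", "on", "at", "for", "to", "of", "with"]
ARTICLES = ["a", "an", "the", ""]          # "" simulates a dropped article
CONFUSION_SETS = [{"affect", "effect"}, {"their", "there", "they're"}]


def corrupt(tokens, p=0.1):
    """Return a noisy copy of an edited (correct) sentence for seq2seq training."""
    noisy = []
    for tok in tokens:
        low = tok.lower()
        if random.random() < p:
            if low in PREPOSITIONS:
                tok = random.choice([w for w in PREPOSITIONS if w != low])
            elif low in ARTICLES:
                tok = random.choice([w for w in ARTICLES if w != low])
            else:
                for cs in CONFUSION_SETS:
                    if low in cs:
                        tok = random.choice([w for w in cs if w != low])
        if tok:                              # skip tokens replaced by ""
            noisy.append(tok)
    return noisy

# Example: corrupt("He is interested in the result .".split())
# might give ['He', 'is', 'interested', 'on', 'result', '.']
```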
Problem of neural-network translation: not enough data
- Generate Chinese-style writing by using a parallel corpus
- This can teach the model to rewrite a sentence in a better way
- The edited sentence (target sentence) is the given human translation; the unedited sentence (input sentence) is the Google Translate output (see the sketch below)
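A minimal sketch of building (input, target) pairs from a parallel corpus under this scheme; machine_translate is a hypothetical stand-in for whatever Chinese-to-English translator (e.g. Google Translate) is used.

```python
def build_pairs(parallel_corpus, machine_translate):
    """parallel_corpus: iterable of (chinese_sentence, human_english_sentence)."""
    pairs = []
    for zh, human_en in parallel_corpus:
        mt_en = machine_translate(zh)   # Chinglish-style, possibly awkward English
        # seq2seq input = machine output (unedited), target = human English (edited)
        pairs.append((mt_en, human_en))
    return pairs
```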
Word prediction (a trial on TensorFlow)
- Judge whether the target word is problematic by estimating the target word's probability distribution (similar to skip-gram)
- For example: "As Hong Kong students are not native speakers."
  - Input: word_vector(As), word_vector(Kong), Preposition, Noun phrase
  - Output: "hong"
- Network:
  - Input vector: W1.wvect (100) + W1.POS (56) + W3.wvect (100) + W3.POS (56), i.e. 156 per context word, total dimension = 100*2 + 56*2 = 312
  - Hidden layer: dimension = 400
  - Output layer: dimension = k (word distribution)
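Below is a minimal sketch of this network in tf.keras. The layer sizes (312-dimensional input, 400-unit hidden layer, softmax over the vocabulary) come from the slide; the vocabulary size, activation function, and optimizer are assumptions, and this is not necessarily how the original trial was coded.

```python
import tensorflow as tf

VOCAB_SIZE = 50000   # assumed output vocabulary size (k on the slide)
WVEC_DIM = 100       # word-vector dimension per context word
POS_DIM = 56         # one-hot POS-tag dimension per context word

# Input: concatenation of [w1 vector, w1 POS, w3 vector, w3 POS] = 312 dims
inputs = tf.keras.Input(shape=(2 * WVEC_DIM + 2 * POS_DIM,))
hidden = tf.keras.layers.Dense(400, activation="relu")(inputs)
# Output: probability distribution over the vocabulary for the middle word (w2)
outputs = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```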
Statistics
- 814400000 steps, batch size = 32
- Data ~ 2.2e9 examples
- Epochs = 814400000 × 32 / 2.2e9 ≈ 12

Example sentence: "As Hong Kong students are not native English speakers."

1 ['As', 'Kong'] ['IN', 'NNP'] -> Hong
  top 10: ['hong', 'a', 'the', '"', 'unk', 'new', 'united', '“', 'in', 'being']
  rank: 0  eprob: -1.756897  prob: 0.270145  percentile: 0.000000
2 ['Hong', 'students'] ['NNP', 'NNS'] -> Kong
  top 10: ["'s", 'and', '.', 'kong', 'unk', 'university', 'for', ',', 'of', '’s']
  rank: 3  eprob: -2.834891  prob: 0.065298  percentile: 0.261457
3 ['Kong', 'are'] ['NNP', 'VBP'] -> students
  top 10: ['unk', 'kong', ')', 'studios', ',', 'offices', 'and', 'who', 'games', 'members']
  rank: 22  eprob: -5.721673  prob: 0.003666  percentile: 0.397681
Idea: a target word with a high percentile is problematic
Example sentence: "As Hong Kong students are not native English speakers."

4 ['students', 'not'] ['NNS', 'RB'] -> are
  top 10: ['were', 'are', 'did', 'can', 'do', 'would', ',', 'could', '.', 'and']
  rank: 1  eprob: -0.780017  prob: 0.199650  percentile: 0.291891
7 ['native', 'speakers'] ['JJ', 'NNS'] -> English
  top 10: ['unk', 'european', 'english', 'asian', 'indian', '.', 'indigenous', ',', 'native', 'korean']
  rank: 2  eprob: -3.813728  prob: 0.025270  percentile: 0.132014

Why percentile (area)?
- Some distributions are narrow, some are wide
- A wide distribution occurs when there are many legitimate target words
- A wide distribution explains one of the weaknesses of using frequency counts to judge trigrams or dependencies
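The percentile in the outputs above appears to be the total probability mass of all words the model ranks above the target (so rank-0 words get percentile 0). Below is a minimal sketch of that computation under this assumption; the model, features, and vocab names in the usage comment are hypothetical.

```python
import numpy as np


def percentile_of_target(probs, target_id):
    """Cumulative probability mass of all words more probable than the target.

    `probs` is the softmax output over the vocabulary; a percentile near 1
    means almost every other word is more likely than the target, i.e. the
    target word looks problematic in this context.
    """
    target_prob = probs[target_id]
    return float(np.sum(probs[probs > target_prob]))

# Hypothetical usage with the model sketched earlier:
# probs = model.predict(features)[0]
# if percentile_of_target(probs, vocab["english"]) > 0.88:
#     print("highlight this word")
```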
Highlight when percentile > 0.88
Example sentence: "Math lessons use English."

1 ['Math', 'use'] ['NNP', 'VB'] -> lessons
  top 10: ['to', 'and', 'can', ',', '.', 'will', 'would', 'may', 'could', '-']
  rank: 193  eprob: -10.189608  prob: 0.000072  percentile: 0.967132
2 ['lessons', 'English'] ['NNS', 'NNP'] -> use
  top 10: ['in', 'from', '.', 'of', 'at', ',', 'include', '(', 'for', 'to']
  rank: 110  eprob: -8.563295  prob: 0.000137  percentile: 0.979467
3 ['use', '.'] ['VB', '.'] -> English
  top 10: ['unk', 'it', 'them', 'this', 'him', 'use', '"', 'applications', 'law', 'there']
  rank: 55  eprob: -6.149250  prob: 0.001047  percentile: 0.242663
Highlight when percentile > 0.88
Example sentence: "James Veitch shows what would happens when you reply to spam email."

1 ['James', 'shows'] ['NNP', 'VBZ'] -> Veitch
  top 10: ['unk', '"', ',', 'and', 'also', 'who', 'that', 'then', 'first', 'often']
  rank: 1951  eprob: -9.807776  prob: 0.000046  percentile: 0.811201
2 ['Veitch', 'what'] ['NNP', 'WP'] -> shows
  top 10: [',', 'of', '.', 'in', 'and', 'on', 'from', 'to', 'for', 'at']
  rank: 243  eprob: -8.311202  prob: 0.000319  percentile: 0.878116
3 ['shows', 'would'] ['VBZ', 'MD'] -> what
  top 10: ['it', 'he', 'that', 'who', 'they', 'unk', 'and', ',', 'she', '"']
  rank: 12  eprob: -3.928356  prob: 0.015693  percentile: 0.597846
4 ['what', 'happens'] ['WP', 'VBZ'] -> would
  top 10: ['it', 'he', 'she', 'unk', 'this', '"', 'really', 'nothing', 'what', 'and']
  rank: 347  eprob: -9.204218  prob: 0.000083  percentile: 0.922861
5 ['would', 'when'] ['MD', 'WRB'] -> happens
  top 10: ['be', 'occur', 'continue', 'unk', 'have', ',', 'happen', 'survive', 'play', 'do']
  rank: 893  eprob: -9.550125  prob: 0.000105  percentile: 0.890853
6 ['happens', 'you'] ['VBZ', 'PRP'] -> when
  top 10: ['to', '.', 'for', 'what', ',', 'if', 'that', 'and', 'as', '"']
  rank: 10  eprob: -3.795723  prob: 0.023509  percentile: 0.565027
7 ['when', 'reply'] ['WRB', 'VBP'] -> you
  top 10: ['unk', 'they', 'i', 'we', 'you', 'often', 'he', 'many', 'females', 'others']
  rank: 4  eprob: -3.141033  prob: 0.019829  percentile: 0.804588
8 ['you', 'to'] ['PRP', 'IN'] -> reply
  top 10: ['moved', 'belong', 'listen', 'go', 'went', 'come', 'back', 'refer', 'due', '-']
  rank: 393  eprob: -8.268242  prob: 0.000270  percentile: 0.847230
9 ['reply', 'spam'] ['VBP', 'NN'] -> to
  top 10: ['the', 'unk', 'a', 'on', 'to', 'for', '"', 'out', '.', 'from']
  rank: 4  eprob: -3.349006  prob: 0.038771  percentile: 0.338736
10 ['to', 'email'] ['IN', 'NN'] -> spam
  top 10: ['the', 'an', 'a', 'unk', 'his', 'this', 'their', 'its', '-', 'her']
  rank: 1152  eprob: -10.655535  prob: 0.000019  percentile: 0.962199
11 ['spam', '.'] ['NN', '.'] -> email
  top 10: ['unk', '"', ')', 'system', 'content', 'technology', 'market', 'letters', 'systems', 'website']
  rank: 1879  eprob: -9.714923  prob: 0.000062  percentile: 0.833495
Highlight when percentile > 0.88
Example sentence: "I had a causal chat with Tim yesterday."

1 ['I', 'a'] ['PRP', 'DT'] -> had
  top 10: ["'m", '’m', 'was', 'had', 'am', 'have', 'got', 'has', 'is', ',']
  rank: 3  eprob: -2.941615  prob: 0.041764  percentile: 0.450077
2 ['had', 'causal'] ['VBD', 'NN'] -> a
  top 10: ['a', 'no', 'the', 'any', 'its', 'an', 'that', 'in', 'significant', 'been']
  rank: 0  eprob: -0.051633  prob: 0.615139  percentile: 0.000000
3 ['a', 'chat'] ['DT', 'NN'] -> causal
  top 10: ['unk', '"', 'live', 'free', 'single', 'regular', 'new', 'long', 'news', 'separate']
  rank: 5186  eprob: -11.522699  prob: 0.000010  percentile: 0.942488
4 ['causal', 'with'] ['NN', 'IN'] -> chat
  top 10: ['unk', ',', 'relationship', 'associated', 'relationships', 'junctions', 'problems', 'interaction', '.', 'function']
  rank: 3189  eprob: -10.738490  prob: 0.000014  percentile: 0.960157
5 ['chat', 'Tim'] ['NN', 'NNP'] -> with
  top 10: ['with', '.', ',', 'to', 'between', 'and', 'by', 'writer', 'on', 'for']
  rank: 0  eprob: -1.452636  prob: 0.231977  percentile: 0.000000
6 ['with', 'yesterday'] ['IN', 'NN'] -> Tim
  top 10: ['a', 'the', 'unk', 'an', 'his', '"', 'this', '-', 'that', 'their']
  rank: 1663  eprob: -11.673390  prob: 0.000008  percentile: 0.962557
7 ['Tim', '.'] ['NNP', '.'] -> yesterday
  top 10: ['unk', '"', 'brady', 'jones', 'miller', 'hortons', 'taylor', 'brown', 'russert', 'redman']
  rank: 25695  eprob: -12.312710  prob: 0.000001  percentile: 0.967669
Word prediction
- May be useful for detection and correction
- Some noise remains
- More features can be added (e.g., dependency relations)
My plan
- Translation first, then the classifier approach
- Study the system in "Attention Is All You Need"
- Test training on NUCLE and Lang8
- Look for a Chinese-to-English translator
- Generate problematic sentences from a parallel corpus