
1 web1T and deep learning methods
David Ling

2 Contents
Newly added corpus (web1T): revised rule, performance
Deep-learning method: classifier, translation, word prediction (a trial on TensorFlow)

3 Newly added web1T corpus
web1T ~37 GB; Google ngram ~20 GB; Wiki2007 ~10 GB.
Example sentence: "Youtude to show the trend on internet about making video."
Web1T has more trigrams. Typically, a trigram with a score on the order of e-26 is a rare trigram.
Trigram counts and scores (columns: w1, w2, w3 counts, total count, score, wiki07, Google ngram, web1T):
youtude to show: 2.18E+10 3.37E+08 1.36E-19
to show the: 4.97E+10 1.59E-23 2107
show the trend: 13994 3.10E-23 10445 3549
the trend on: 6.05E+09 8947 1.10E-24 4 3226 5717
trend on internet: 2.77E+08 2.22E-26
on internet about: 1.76E+09 1320 4.48E-25
internet about making: 2.26E+08 59 5.44E-25
about making video: 3.75E+08 318 2.13E-24
making video .: 4.13E+10 963 2.75E-25 1 41 921

4 Exception on the score
Some trigrams have a low score even though they have many counts, e.g. "his and the", ", he".
They contain multiple very common tokens: and, the, ",", _start_, in, etc.
Such tokens have many legitimate combinations (a broad distribution), and thus a low score.
Filter them out by adding a highlighting exception: highlight only when
1. score < 0.9e-25 (original rule), AND
2. total_count < 50 (additional rule).
score = freq(w1, w2, w3) / (freq(w1) × freq(w2) × freq(w3))
A code sketch of this rule follows the table below.
Trigram counts and scores (columns: w1, w2, w3 counts, total count, score, wiki07, Google ngram, web1T):
his and the: 2.47E+09 2.43E+10 4.97E+10 99422 3.33E-26 95 65015 34312
kept connect with: 5.78E+09 4.30E-26
spammer message ,: 921650 4E+08 5.86E+10 4.63E-26
the spammer: 4.46E+08 4.89E-26
the spammer message: 5.46E-26
these funny video: 1.19E+09 3.75E+08 5.68E-26
talk `` this: 1.4E+08 5.14E+09 6.40E-26
of unloaded video: 2.87E+10 6.52E-26
's attention successfully: 4.6E+09 1.09E+08 6.59E-26
can became famous: 2.08E+09 1.34E+08 6.76E-26
video draw most: 8.47E+08 6.88E-26
common asker ,: 2.24E+08 987843 7.71E-26
he kept connect: 2.9E+09 8.56E-26
first speaker present: 1.04E+09 2.56E+08 8.68E-26
_start_ in kevin: 1.17E+11 1.69E+10 4898 9.28E-26 15 2098 2785
, he: 7387 9.75E-26 2 713 6672
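A minimal sketch of the score and the revised highlighting rule, assuming unigram and trigram counts are available as plain dictionaries (function and variable names are illustrative, not from the actual system):

def trigram_score(unigram_freq, trigram_freq, w1, w2, w3):
    # score = freq(w1, w2, w3) / (freq(w1) * freq(w2) * freq(w3))
    denom = unigram_freq.get(w1, 0) * unigram_freq.get(w2, 0) * unigram_freq.get(w3, 0)
    if denom == 0:
        return 0.0
    return trigram_freq.get((w1, w2, w3), 0) / denom

def should_highlight(score, total_count, score_threshold=0.9e-25, count_threshold=50):
    # Original rule: low score only. Revised rule: also require a low total count,
    # so trigrams built from very common tokens ("his and the", ", he") are not flagged.
    return score < score_threshold and total_count < count_threshold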

5 Green color: by trigram detection
Comparison of the newer result against the older result (per-case O/X marks from the slide omitted).
It seems that the newer result is better.

6 Newer result vs. older result (per-case O/X marks from the slide omitted).
However, the newer result is not always better.

7 Deep-learning method
Classifier, Translation, Word prediction (a simple trial using TensorFlow)

8 Classifier
Reference: The Illinois-Columbia System in the CoNLL-2014 Shared Task (ranked 2nd).
For article correction: see "The UI system in the HOO 2012 shared task on error correction."

9 Classifier
Uses different features for different grammatical errors.
Confident that it can achieve some results (prepositions, confusing words, verbs, etc.).
However, it may not be able to correct semantic errors.

10 Translation
Seq2seq (planning to follow "Attention Is All You Need").
Input: unedited sentences (with or without mistakes). Output: edited sentences (without mistakes).
Data sources: Wikipedia articles and published books; Lang8 and NUCLE (CoNLL-2014 shared task); artificially generated sentences. These are fed through the seq2seq model to produce edited sentences.

11 Translation
Problem: not enough data. Artificially generated data (a sketch of the random-replacement idea follows below):
Randomly replace prepositions, confusion words, and articles.
Chinglish (style) sentences from a translator on a parallel corpus.
Results from Google Translate.
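A rough sketch of the random-replacement idea above; the confusion sets here are small illustrative examples, not the actual lists:

import random

PREPOSITIONS = ["in", "on", "at", "of", "to", "for", "with"]
ARTICLES = ["a", "an", "the"]
CONFUSION_WORDS = {"accept": "except", "affect": "effect", "their": "there"}

def corrupt(tokens, p=0.1):
    # Returns a noisy copy of a correct sentence, used as the seq2seq input;
    # the original sentence is kept as the target.
    noisy = []
    for tok in tokens:
        low = tok.lower()
        if low in PREPOSITIONS and random.random() < p:
            noisy.append(random.choice(PREPOSITIONS))        # preposition error
        elif low in ARTICLES and random.random() < p:
            if random.random() < 0.5:
                noisy.append(random.choice(ARTICLES))         # wrong article
            # else: drop the article entirely
        elif low in CONFUSION_WORDS and random.random() < p:
            noisy.append(CONFUSION_WORDS[low])                 # confusing-word error
        else:
            noisy.append(tok)
    return noisy

sentence = "I went to the station in the morning".split()
pair = (corrupt(sentence), sentence)     # (unedited input, edited target)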

12 Problem of neural network translation: not enough data
Generate Chinese-style writing by using a parallel corpus; this can rewrite sentences in a better way.
Given an edited sentence (target sentence), Google Translate produces a corresponding unedited sentence (input sentence); a sketch of this round trip follows below.
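A minimal sketch of the round trip described above; machine_translate() is a placeholder for whatever translation service is used (it is not a real Google Translate API call):

def machine_translate(text, src, dst):
    # Placeholder: plug in the actual translator client here.
    raise NotImplementedError

def make_training_pair(edited_sentence):
    # English -> Chinese -> English; the round trip tends to produce
    # Chinese-style ("Chinglish") phrasing, which serves as the unedited input.
    chinese = machine_translate(edited_sentence, src="en", dst="zh")
    unedited = machine_translate(chinese, src="zh", dst="en")
    return unedited, edited_sentence     # (input sentence, target sentence)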

13 Word prediction (a trial on TensorFlow)
Judge whether the target word is problematic by predicting the target word's probability distribution from its neighbours (similar to skip-gram).
For example: "As Hong Kong students are not native speakers."
Input: word_vector(As), word_vector(Kong), and their POS features (preposition, noun phrase). Output: "hong".
Input vector: W1.wvect (100), W3.wvect (100), W1.POS (56), W3.POS (56); dimension = 100*2 + 56*2 = 312.
Hidden layer and output layer; the diagram also shows dimensions 156 and 400, and an output of size k (the word distribution). A Keras sketch follows below.
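A minimal TensorFlow/Keras sketch of this network, assuming the 156 in the diagram is the per-word feature size (100-dim word vector + 56-dim POS one-hot), 400 is the hidden-layer size, and k is the output vocabulary size; the activation, optimizer, and the value of K below are assumptions:

import tensorflow as tf

WORD_DIM, POS_DIM = 100, 56              # per-word features: 100 + 56 = 156
K = 50000                                # output vocabulary size (assumed)

inputs = tf.keras.Input(shape=(2 * (WORD_DIM + POS_DIM),))         # 312-dim context vector
hidden = tf.keras.layers.Dense(400, activation="relu")(inputs)      # hidden layer
outputs = tf.keras.layers.Dense(K, activation="softmax")(hidden)    # word distribution

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")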

14 Statistics
814,400,000 steps, batch size = 32, data ~ 2.2e9.
Epochs = steps × 32 / 2.2e9 ≈ 12 epochs (a quick check follows the examples below).
Example: "As Hong Kong students are not native English speakers."
1 ['As', 'Kong'] ['IN', 'NNP'] -> Hong; top 10: ['hong', 'a', 'the', '"', 'unk', 'new', 'united', '“', 'in', 'being']; rank: 0; eprob: prob: percentile:
2 ['Hong', 'students'] ['NNP', 'NNS'] -> Kong; top 10: ["'s", 'and', '.', 'kong', 'unk', 'university', 'for', ',', 'of', '’s']; rank: 3; eprob: prob: percentile:
3 ['Kong', 'are'] ['NNP', 'VBP'] -> students; top 10: ['unk', 'kong', ')', 'studios', ',', 'offices', 'and', 'who', 'games', 'members']; rank: 22; eprob: prob: percentile:
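As a quick check, the epoch count follows from the statistics at the top of this slide:

steps, batch_size, data_size = 814_400_000, 32, 2.2e9
print(steps * batch_size / data_size)    # ~11.8, i.e. about 12 epochs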

15 Idea: target word with high percentile is problematic
Example (continued): "As Hong Kong students are not native English speakers."
4 ['students', 'not'] ['NNS', 'RB'] -> are; top 10: ['were', 'are', 'did', 'can', 'do', 'would', ',', 'could', '.', 'and']; rank: 1; eprob: prob: percentile:
7 ['native', 'speakers'] ['JJ', 'NNS'] -> English; top 10: ['unk', 'european', 'english', 'asian', 'indian', '.', 'indigenous', ',', 'native', 'korean']; rank: 2; eprob: prob: percentile:
Why percentile (area)? Some distributions are narrow, some are wide.
A wide distribution occurs when there are many legitimate target words.
The wide-distribution case explains one of the weaknesses of using frequency counts to judge the trigram or dependency.
A sketch of the percentile computation follows below.
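One way to compute the "percentile (area)" of the target word from the predicted distribution is to sum the probability mass of all words the model ranks above the target. A small numpy sketch (the distribution below is illustrative):

import numpy as np

def target_percentile(probs, target_id):
    # Fraction of probability mass assigned to words more likely than the target word.
    # A wide (flat) distribution keeps the percentile moderate even for lower-ranked
    # words; a genuinely unexpected word gets a percentile close to 1.
    return float(np.sum(probs[probs > probs[target_id]]))

probs = np.array([0.4, 0.3, 0.1, 0.1, 0.05, 0.05])
print(target_percentile(probs, target_id=4) > 0.88)   # 0.9 > 0.88 -> highlight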

16 Highlight when percentile > 0.88
Example: "Math lessons use English."
1 ['Math', 'use'] ['NNP', 'VB'] -> lessons; top 10: ['to', 'and', 'can', ',', '.', 'will', 'would', 'may', 'could', '-']; rank: 193; eprob: prob: percentile:
2 ['lessons', 'English'] ['NNS', 'NNP'] -> use; top 10: ['in', 'from', '.', 'of', 'at', ',', 'include', '(', 'for', 'to']; rank: 110; eprob: prob: percentile:
3 ['use', '.'] ['VB', '.'] -> English; top 10: ['unk', 'it', 'them', 'this', 'him', 'use', '"', 'applications', 'law', 'there']; rank: 55; eprob: prob: percentile:

17 James Veitch shows what would happens when you reply to spam email.
Highlight when percentile > 0.88.
Example: "James Veitch shows what would happens when you reply to spam ."
1 ['James', 'shows'] ['NNP', 'VBZ'] -> Veitch; top 10: ['unk', '"', ',', 'and', 'also', 'who', 'that', 'then', 'first', 'often']; rank: 1951; eprob: prob: percentile:
2 ['Veitch', 'what'] ['NNP', 'WP'] -> shows; top 10: [',', 'of', '.', 'in', 'and', 'on', 'from', 'to', 'for', 'at']; rank: 243; eprob: prob: percentile:
3 ['shows', 'would'] ['VBZ', 'MD'] -> what; top 10: ['it', 'he', 'that', 'who', 'they', 'unk', 'and', ',', 'she', '"']; rank: 12; eprob: prob: percentile:
4 ['what', 'happens'] ['WP', 'VBZ'] -> would; top 10: ['it', 'he', 'she', 'unk', 'this', '"', 'really', 'nothing', 'what', 'and']; rank: 347; eprob: prob: percentile:
5 ['would', 'when'] ['MD', 'WRB'] -> happens; top 10: ['be', 'occur', 'continue', 'unk', 'have', ',', 'happen', 'survive', 'play', 'do']; rank: 893; eprob: prob: percentile:
6 ['happens', 'you'] ['VBZ', 'PRP'] -> when; top 10: ['to', '.', 'for', 'what', ',', 'if', 'that', 'and', 'as', '"']; rank: 10; eprob: prob: percentile:
7 ['when', 'reply'] ['WRB', 'VBP'] -> you; top 10: ['unk', 'they', 'i', 'we', 'you', 'often', 'he', 'many', 'females', 'others']; rank: 4; eprob: prob: percentile:
8 ['you', 'to'] ['PRP', 'IN'] -> reply; top 10: ['moved', 'belong', 'listen', 'go', 'went', 'come', 'back', 'refer', 'due', '-']; rank: 393; eprob: prob: percentile:
9 ['reply', 'spam'] ['VBP', 'NN'] -> to; top 10: ['the', 'unk', 'a', 'on', 'to', 'for', '"', 'out', '.', 'from']; rank: 4; eprob: prob: percentile:
10 ['to', ' '] ['IN', 'NN'] -> spam; top 10: ['the', 'an', 'a', 'unk', 'his', 'this', 'their', 'its', '-', 'her']; rank: 1152; eprob: prob: percentile:
11 ['spam', '.'] ['NN', '.']; top 10: ['unk', '"', ')', 'system', 'content', 'technology', 'market', 'letters', 'systems', 'website']; rank: 1879; eprob: prob: percentile:

18 Highlight when percentile > 0.88
Example: "I had a causal chat with Tim yesterday."
1 ['I', 'a'] ['PRP', 'DT'] -> had; top 10: ["'m", '’m', 'was', 'had', 'am', 'have', 'got', 'has', 'is', ',']; rank: 3; eprob: prob: percentile:
2 ['had', 'causal'] ['VBD', 'NN'] -> a; top 10: ['a', 'no', 'the', 'any', 'its', 'an', 'that', 'in', 'significant', 'been']; rank: 0; eprob: prob: percentile:
3 ['a', 'chat'] ['DT', 'NN'] -> causal; top 10: ['unk', '"', 'live', 'free', 'single', 'regular', 'new', 'long', 'news', 'separate']; rank: 5186; eprob: prob: percentile:
4 ['causal', 'with'] ['NN', 'IN'] -> chat; top 10: ['unk', ',', 'relationship', 'associated', 'relationships', 'junctions', 'problems', 'interaction', '.', 'function']; rank: 3189; eprob: prob: percentile:
5 ['chat', 'Tim'] ['NN', 'NNP'] -> with; top 10: ['with', '.', ',', 'to', 'between', 'and', 'by', 'writer', 'on', 'for']; rank: 0; eprob: prob: percentile:
6 ['with', 'yesterday'] ['IN', 'NN'] -> Tim; top 10: ['a', 'the', 'unk', 'an', 'his', '"', 'this', '-', 'that', 'their']; rank: 1663; eprob: prob: percentile:
7 ['Tim', '.'] ['NNP', '.'] -> yesterday; top 10: ['unk', '"', 'brady', 'jones', 'miller', 'hortons', 'taylor', 'brown', 'russert', 'redman']; rank: ; eprob: prob: percentile:

19 Word prediction
May be useful for detection and correction, though with some noise.
More features can be added (e.g. dependency features).

20 My plan
Translation first, then the classifier approach.
Study the system in "Attention Is All You Need".
Test training on NUCLE and Lang8.
Look for a Chinese -> English translator.
Generate problematic sentences from a parallel corpus.

