web1T and deep learning methods


web1T and deep learning methods
David Ling, 2018-03-29

Contents
- Newly added corpus (web1T)
  - Revised rule
  - Performance
- Deep-learning methods
  - Classifier
  - Translation
  - Word prediction (a trial on TensorFlow)

Newly added web1T corpus

Corpus sizes: web1T ~37 GB; Google ngram ~20 GB; Wiki2007 ~10 GB. Web1T has more trigrams.

Example sentence: "Youtude to show the trend on internet about making video."

Typically, a trigram with a score on the order of 1e-26 is rare.

Trigram table (each row lists the trigram followed by its available figures: w1/w2/w3 unigram counts, total count, score, and counts in wiki07, Google ngram, and web1T; cells missing from the transcript are left out):
- youtude to show: 2.18E+10 3.37E+08 1.36E-19
- to show the: 4.97E+10 5813781 1.59E-23 2107 3483882 2327792
- show the trend: 26937990 13994 3.10E-23 10445 3549
- the trend on: 6.05E+09 8947 1.10E-24 4 3226 5717
- trend on internet: 2.77E+08 2.22E-26
- on internet about: 1.76E+09 1320 4.48E-25
- internet about making: 2.26E+08 59 5.44E-25
- about making video: 3.75E+08 318 2.13E-24
- making video .: 4.13E+10 963 2.75E-25 1 41 921

Exception on the score

Some trigrams have low scores even though they have many counts, e.g. "his and the" and "email , he". These trigrams contain multiple very common tokens (and, the, ",", _start_, in, etc.), which allow a large number of legitimate combinations (a broad distribution) and therefore give a low score.

Filter them out by adding a highlighting exception: highlight only when
1. score < 0.9e-25 (original condition), AND
2. total_count < 50 (additional condition).

Score = freq(w1, w2, w3) / (freq(w1) × freq(w2) × freq(w3))

Trigram table (trigram: w1/w2/w3 unigram counts, total count, score, counts in wiki07, Google ngram, web1T; missing cells left out):
- his and the: 2.47E+09 2.43E+10 4.97E+10 99422 3.33E-26 95 65015 34312
- kept connect with: 78939166 51001544 5.78E+09 4.30E-26
- spammer message ,: 921650 4E+08 5.86E+10 4.63E-26
- the spammer email: 4.46E+08 4.89E-26
- the spammer message: 5.46E-26
- these funny video: 1.19E+09 39292367 3.75E+08 5.68E-26
- talk `` this: 1.4E+08 21715521 5.14E+09 6.40E-26
- of unloaded video: 2.87E+10 1426742 6.52E-26
- 's attention successfully: 4.6E+09 1.09E+08 30227572 6.59E-26
- can became famous: 2.08E+09 1.34E+08 53195226 6.76E-26
- video draw most: 45710262 8.47E+08 6.88E-26
- common asker ,: 2.24E+08 987843 7.71E-26
- he kept connect: 2.9E+09 8.56E-26
- first speaker present: 1.04E+09 43143869 2.56E+08 8.68E-26
- _start_ in kevin: 1.17E+11 1.69E+10 26733733 4898 9.28E-26 15 2098 2785
- email , he: 7387 9.75E-26 2 713 6672
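The score formula and the two-part exception above can be sketched in a few lines of Python. The function names and the sample counts for the rare-trigram case are illustrative; the "his and the" counts are taken from the table.

```python
# Sketch of the trigram score and the new highlighting exception.

def trigram_score(freq_w1w2w3, freq_w1, freq_w2, freq_w3):
    """Score = freq(w1,w2,w3) / (freq(w1) * freq(w2) * freq(w3))."""
    return freq_w1w2w3 / (freq_w1 * freq_w2 * freq_w3)

def should_highlight(score, total_count,
                     score_threshold=0.9e-25, count_threshold=50):
    """Highlight only if the score is low AND the trigram is genuinely rare.

    The extra total_count condition filters out trigrams like "his and the"
    that score low simply because they consist of very common tokens.
    """
    return score < score_threshold and total_count < count_threshold

# "his and the": low score but high total count -> NOT highlighted
score = trigram_score(99422, 2.47e9, 2.43e10, 4.97e10)
print(should_highlight(score, 99422))   # False
```

The computed score for "his and the" comes out at about 3.33e-26, matching the table, yet the trigram escapes highlighting because of the count condition.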

Newer result vs older result (green = highlighted by trigram detection)
[Figure: two versions of a highlighted passage, annotated with O for correct highlights and X for incorrect ones.]
It seems that the newer result is better.

Newer result vs older result
[Figure: a second comparison of highlighted passages, annotated with O and X marks.]
However, the newer result is not always better.

Deep-learning methods
- Classifier
- Translation
- Word prediction (a simple trial using TensorFlow)

Classifier

Reference systems: the Illinois-Columbia system in the CoNLL-2014 shared task (ranked 2nd); for article correction, the UI system in the HOO 2012 shared task on error correction.

Classifier

Uses different features for different grammatical error types. This approach is likely to achieve some results (prepositions, confusable words, verbs, etc.); however, it may not be able to correct semantic errors.

Translation

Seq2seq (planning to follow "Attention Is All You Need").
- Input: unedited sentences (with or without mistakes)
- Output: edited sentences (without mistakes)

Data flow: {Wikipedia articles and published books; Lang-8 and NUCLE (CoNLL-2014 shared task); artificially generated sentences} -> seq2seq -> edited sentences.

Translation

Problem: not enough data. Artificially generated data:
- Randomly replace prepositions, confusable words, and articles
- Chinglish-style sentences produced by a translator on a parallel corpus
- Results from Google Translate
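The first bullet, random replacement, can be sketched as below. The word lists and the confusable-word map are invented examples for illustration, not the actual lists used in the project.

```python
import random

# Illustrative sketch of generating artificial errors by random replacement
# of prepositions, articles, and confusable words; the clean sentence is
# kept as the seq2seq target and the corrupted copy becomes the input.
PREPOSITIONS = ["in", "on", "at", "for", "to", "of"]
ARTICLES = ["a", "an", "the"]
CONFUSABLE = {"casual": "causal", "then": "than", "affect": "effect"}

def corrupt(tokens, p=0.5, rng=random):
    """Return a noisy copy of tokens; each eligible token is replaced with probability p."""
    noisy = []
    for tok in tokens:
        low = tok.lower()
        if rng.random() < p:
            if low in PREPOSITIONS:
                tok = rng.choice([w for w in PREPOSITIONS if w != low])
            elif low in ARTICLES:
                tok = rng.choice([w for w in ARTICLES if w != low])
            elif low in CONFUSABLE:
                tok = CONFUSABLE[low]
        noisy.append(tok)
    return noisy

clean = "I had a casual chat at the office".split()
print(" ".join(corrupt(clean, p=1.0, rng=random.Random(0))))
```

Pairing each corrupted copy with its clean original yields unlimited (input, target) training pairs, at the cost of a narrower error distribution than real learner text.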

Problem of neural-network translation: not enough data.

Generate Chinese-style writing using a parallel corpus: given the edited sentence (the target), use Google Translate to produce an unedited, Chinese-style sentence (the input). This can also rewrite sentences in a better way.

Word prediction (a trial on TensorFlow)

Judge whether a target word is problematic by predicting a probability distribution for the word in its context (similar to skip-gram).

For example, in "As Hong Kong students are not native speakers.":
- Input: word_vector(As), word_vector(Kong), and their POS tags (preposition, noun phrase)
- Output: "Hong"

Network: the input vector concatenates W1.wvect (100), W3.wvect (100), W1.POS (56), and W3.POS (56), giving dimension 100×2 + 56×2 = 312; a hidden layer of dimension 156; an output layer of dimension 400k (the word distribution).
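A minimal numpy sketch of the forward pass of this network follows. The actual model was built in TensorFlow and is not shown in the slides; the input and hidden sizes follow the slide (312 and 156), while the vocabulary here is a small stand-in for the slide's ~400k words, and the POS one-hot indices are arbitrary.

```python
import numpy as np

# Forward pass matching the slide: concatenated 100-d word vectors and
# 56-d POS one-hots for the left/right context words (100*2 + 56*2 = 312),
# one hidden layer, then a softmax over the vocabulary.
WVECT, POS = 100, 56
IN_DIM = 2 * (WVECT + POS)        # 312, as on the slide
HIDDEN, VOCAB = 156, 5000         # slide: hidden 156, output ~400k words

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (IN_DIM, HIDDEN)), np.zeros(HIDDEN)
W2, b2 = rng.normal(0, 0.1, (HIDDEN, VOCAB)), np.zeros(VOCAB)

def predict_middle_word(w1_vec, w3_vec, w1_pos, w3_pos):
    """Probability distribution over the middle word given its neighbours."""
    x = np.concatenate([w1_vec, w3_vec, w1_pos, w3_pos])   # shape (312,)
    h = np.tanh(x @ W1 + b1)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())                      # stable softmax
    return e / e.sum()

# e.g. contexts "As" and "Kong" with POS one-hots for tag ids 3 and 10
p = predict_middle_word(rng.normal(size=WVECT), rng.normal(size=WVECT),
                        np.eye(POS)[3], np.eye(POS)[10])
print(p.shape)   # (5000,)
```

Training would fit W1, b1, W2, b2 with cross-entropy against the observed middle word, as in a skip-gram-style objective.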

Statistics

Training: 814,400,000 steps with batch size 32; data ~2.2e9 examples; epochs = 814400000 × 32 / 2.2e9 ≈ 12.

Sample predictions for "As Hong Kong students are not native English speakers.":
1. context ['As', 'Kong'] ['IN', 'NNP'] -> target "Hong"
   top 10: ['hong', 'a', 'the', '"', 'unk', 'new', 'united', '“', 'in', 'being']
   rank: 0, eprob: -1.756897, prob: 0.270145, percentile: 0.000000
2. context ['Hong', 'students'] ['NNP', 'NNS'] -> target "Kong"
   top 10: ["'s", 'and', '.', 'kong', 'unk', 'university', 'for', ',', 'of', '’s']
   rank: 3, eprob: -2.834891, prob: 0.065298, percentile: 0.261457
3. context ['Kong', 'are'] ['NNP', 'VBP'] -> target "students"
   top 10: ['unk', 'kong', ')', 'studios', ',', 'offices', 'and', 'who', 'games', 'members']
   rank: 22, eprob: -5.721673, prob: 0.003666, percentile: 0.397681

Idea: a target word with a high percentile is problematic.

More predictions for "As Hong Kong students are not native English speakers.":
4. context ['students', 'not'] ['NNS', 'RB'] -> target "are"
   top 10: ['were', 'are', 'did', 'can', 'do', 'would', ',', 'could', '.', 'and']
   rank: 1, eprob: -0.780017, prob: 0.199650, percentile: 0.291891
7. context ['native', 'speakers'] ['JJ', 'NNS'] -> target "English"
   top 10: ['unk', 'european', 'english', 'asian', 'indian', '.', 'indigenous', ',', 'native', 'korean']
   rank: 2, eprob: -3.813728, prob: 0.025270, percentile: 0.132014

Why percentile (area)? Some distributions are narrow and some are wide; a wide distribution occurs when there are many legitimate target words. Wide distributions also explain one of the weaknesses of using raw frequency counts to judge a trigram or dependency.
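The percentile (area) criterion can be sketched as follows: the percentile of the target word is the total probability mass of all words the model ranks above it. Under a wide distribution no single word dominates, so a legitimate but hard-to-predict word still gets a moderate percentile instead of being flagged just for having a low raw probability. The function name and the toy distribution are illustrative.

```python
import numpy as np

def rank_and_percentile(probs, target_idx):
    """Rank of the target word and the probability mass strictly above it."""
    order = np.argsort(probs)[::-1]                # indices, most likely first
    rank = int(np.nonzero(order == target_idx)[0][0])
    return rank, float(probs[order[:rank]].sum())

probs = np.array([0.5, 0.3, 0.15, 0.05])
rank, pct = rank_and_percentile(probs, 2)
print(rank, round(pct, 3))   # 2 0.8
```

In the slides below, a target word is highlighted when its percentile exceeds 0.88.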

Highlight when percentile > 0.88.

"Math lessons use English."
1. context ['Math', 'use'] ['NNP', 'VB'] -> target "lessons"
   top 10: ['to', 'and', 'can', ',', '.', 'will', 'would', 'may', 'could', '-']
   rank: 193, eprob: -10.189608, prob: 0.000072, percentile: 0.967132
2. context ['lessons', 'English'] ['NNS', 'NNP'] -> target "use"
   top 10: ['in', 'from', '.', 'of', 'at', ',', 'include', '(', 'for', 'to']
   rank: 110, eprob: -8.563295, prob: 0.000137, percentile: 0.979467
3. context ['use', '.'] ['VB', '.'] -> target "English"
   top 10: ['unk', 'it', 'them', 'this', 'him', 'use', '"', 'applications', 'law', 'there']
   rank: 55, eprob: -6.149250, prob: 0.001047, percentile: 0.242663

Highlight when percentile > 0.88.

"James Veitch shows what would happens when you reply to spam email."
1. context ['James', 'shows'] ['NNP', 'VBZ'] -> target "Veitch"
   top 10: ['unk', '"', ',', 'and', 'also', 'who', 'that', 'then', 'first', 'often']
   rank: 1951, eprob: -9.807776, prob: 0.000046, percentile: 0.811201
2. context ['Veitch', 'what'] ['NNP', 'WP'] -> target "shows"
   top 10: [',', 'of', '.', 'in', 'and', 'on', 'from', 'to', 'for', 'at']
   rank: 243, eprob: -8.311202, prob: 0.000319, percentile: 0.878116
3. context ['shows', 'would'] ['VBZ', 'MD'] -> target "what"
   top 10: ['it', 'he', 'that', 'who', 'they', 'unk', 'and', ',', 'she', '"']
   rank: 12, eprob: -3.928356, prob: 0.015693, percentile: 0.597846
4. context ['what', 'happens'] ['WP', 'VBZ'] -> target "would"
   top 10: ['it', 'he', 'she', 'unk', 'this', '"', 'really', 'nothing', 'what', 'and']
   rank: 347, eprob: -9.204218, prob: 0.000083, percentile: 0.922861
5. context ['would', 'when'] ['MD', 'WRB'] -> target "happens"
   top 10: ['be', 'occur', 'continue', 'unk', 'have', ',', 'happen', 'survive', 'play', 'do']
   rank: 893, eprob: -9.550125, prob: 0.000105, percentile: 0.890853
6. context ['happens', 'you'] ['VBZ', 'PRP'] -> target "when"
   top 10: ['to', '.', 'for', 'what', ',', 'if', 'that', 'and', 'as', '"']
   rank: 10, eprob: -3.795723, prob: 0.023509, percentile: 0.565027
7. context ['when', 'reply'] ['WRB', 'VBP'] -> target "you"
   top 10: ['unk', 'they', 'i', 'we', 'you', 'often', 'he', 'many', 'females', 'others']
   rank: 4, eprob: -3.141033, prob: 0.019829, percentile: 0.804588
8. context ['you', 'to'] ['PRP', 'IN'] -> target "reply"
   top 10: ['moved', 'belong', 'listen', 'go', 'went', 'come', 'back', 'refer', 'due', '-']
   rank: 393, eprob: -8.268242, prob: 0.000270, percentile: 0.847230
9. context ['reply', 'spam'] ['VBP', 'NN'] -> target "to"
   top 10: ['the', 'unk', 'a', 'on', 'to', 'for', '"', 'out', '.', 'from']
   rank: 4, eprob: -3.349006, prob: 0.038771, percentile: 0.338736
10. context ['to', 'email'] ['IN', 'NN'] -> target "spam"
    top 10: ['the', 'an', 'a', 'unk', 'his', 'this', 'their', 'its', '-', 'her']
    rank: 1152, eprob: -10.655535, prob: 0.000019, percentile: 0.962199
11. context ['spam', '.'] ['NN', '.'] -> target "email"
    top 10: ['unk', '"', ')', 'system', 'content', 'technology', 'market', 'letters', 'systems', 'website']
    rank: 1879, eprob: -9.714923, prob: 0.000062, percentile: 0.833495

Highlight when percentile > 0.88.

"I had a causal chat with Tim yesterday."
1. context ['I', 'a'] ['PRP', 'DT'] -> target "had"
   top 10: ["'m", '’m', 'was', 'had', 'am', 'have', 'got', 'has', 'is', ',']
   rank: 3, eprob: -2.941615, prob: 0.041764, percentile: 0.450077
2. context ['had', 'causal'] ['VBD', 'NN'] -> target "a"
   top 10: ['a', 'no', 'the', 'any', 'its', 'an', 'that', 'in', 'significant', 'been']
   rank: 0, eprob: -0.051633, prob: 0.615139, percentile: 0.000000
3. context ['a', 'chat'] ['DT', 'NN'] -> target "causal"
   top 10: ['unk', '"', 'live', 'free', 'single', 'regular', 'new', 'long', 'news', 'separate']
   rank: 5186, eprob: -11.522699, prob: 0.000010, percentile: 0.942488
4. context ['causal', 'with'] ['NN', 'IN'] -> target "chat"
   top 10: ['unk', ',', 'relationship', 'associated', 'relationships', 'junctions', 'problems', 'interaction', '.', 'function']
   rank: 3189, eprob: -10.738490, prob: 0.000014, percentile: 0.960157
5. context ['chat', 'Tim'] ['NN', 'NNP'] -> target "with"
   top 10: ['with', '.', ',', 'to', 'between', 'and', 'by', 'writer', 'on', 'for']
   rank: 0, eprob: -1.452636, prob: 0.231977, percentile: 0.000000
6. context ['with', 'yesterday'] ['IN', 'NN'] -> target "Tim"
   top 10: ['a', 'the', 'unk', 'an', 'his', '"', 'this', '-', 'that', 'their']
   rank: 1663, eprob: -11.673390, prob: 0.000008, percentile: 0.962557
7. context ['Tim', '.'] ['NNP', '.'] -> target "yesterday"
   top 10: ['unk', '"', 'brady', 'jones', 'miller', 'hortons', 'taylor', 'brown', 'russert', 'redman']
   rank: 25695, eprob: -12.312710, prob: 0.000001, percentile: 0.967669

Word prediction
- May be useful for both detection and correction
- Some noise remains
- More features can be added (e.g. dependency relations)

My plan

Translation first, then the classifier approach:
- Study the system in "Attention Is All You Need"
- Test training on NUCLE and Lang-8
- Look for a Chinese-to-English translator
- Generate problematic sentences from a parallel corpus