An Improved Hierarchical Word Sequence Language Model Using Word Association
Xiaoyi Wu, Yuji Matsumoto, Kevin Duh, Hiroyuki Shindo
Nara Institute of Science and Technology
Motivation
A continuous (n-gram) language model learns sequences such as "a selfish man", but those learned sequences give no help for an unseen sequence such as "a ... man": data sparsity.
Motivation
Even with smoothing techniques, the probability of a discontinuous sequence such as "a ... man" is hard to estimate from continuous training data: even with 30 years' worth of newswire text, 1/3 of trigrams are unseen (Allison et al., 2005).
HWS Language Model
For "as soon as possible": the n-gram model is continuous and utterance-oriented, while the HWS model is discontinuous and pattern-oriented.
Basic Idea of HWS
Patterns are discontinuous (sentences are divided into several sections by patterns)
– x is a y of z
Patterns are hierarchical
– x is a y of z → x is y of z → x is z
Words are generated from certain positions of patterns (words depend on patterns)
Basic Idea of HWS
[Figure: HWS tree over "Tom is a boy of nine" — discontinuous and hierarchical; each word depends on its pattern]
Proposed Approach (Frequency-based HWS Model)
Corpus sentence: "Mrs. Allen is a senior editor of Insight magazine"
The most frequent word, "of", splits the sentence into "Mrs. Allen is a senior editor" and "Insight magazine".
Proposed Approach (Frequency-based HWS Model)
Within each part, the most frequent word splits again: "is" for "Mrs. Allen is a senior editor", and "magazine" for "Insight magazine".
Proposed Approach (Original HWS Model)
The recursion continues until every word is a node, yielding a tree rooted at "of" over the whole sentence.
Proposed Approach (Original HWS Model)
From the tree, (parent, child) bigrams are extracted:
($, of), (of, a), (a, is), (is, Mrs.), (Mrs., Allen), (a, senior), (senior, editor), (of, magazine), (magazine, insight)
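The frequency-based construction and bigram extraction can be sketched as follows. This is a minimal illustration with made-up unigram counts; the actual model's counts, tie-breaking, and traversal order follow the paper.

```python
from collections import Counter

# Hypothetical unigram counts; in the real model these come from the corpus.
freq = Counter({"of": 5, "is": 4, "a": 3, "magazine": 2})

def build_hws_tree(words):
    """Recursively pick the most frequent word as the root;
    it splits the sentence into a left and a right part."""
    if not words:
        return None
    i = max(range(len(words)), key=lambda k: freq[words[k]])
    return (words[i], build_hws_tree(words[:i]), build_hws_tree(words[i + 1:]))

def extract_bigrams(tree, parent="$"):
    """Collect (parent, child) pairs from the tree, top-down."""
    if tree is None:
        return []
    word, left, right = tree
    return ([(parent, word)]
            + extract_bigrams(left, word)
            + extract_bigrams(right, word))

sentence = "mrs. allen is a senior editor of insight magazine".split()
bigrams = extract_bigrams(build_hws_tree(sentence))
```

With these toy counts the root is "of", which also dominates "magazine" and, through it, "insight", mirroring the split shown on the slide.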
Advantage of HWS: Discontinuity
In "as soon as possible", the n-gram model only captures contiguous contexts, while HWS captures the discontinuous pattern "as ... as".
Word Association Based HWS
For "too much to handle": the frequency-based HWS chooses splitting words by raw frequency; the word association based HWS chooses them by a word association score instead.
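As a sketch, the Dice coefficient used later as the word association score can be computed from unigram and bigram counts like this (the counts and variable names below are illustrative, not from the experiments):

```python
def dice(x, y, unigram, bigram):
    """Dice coefficient: 2 * c(x, y) / (c(x) + c(y))."""
    return 2.0 * bigram.get((x, y), 0) / (unigram[x] + unigram[y])

# Toy counts for illustration only.
unigram = {"too": 10, "much": 5, "handle": 2}
bigram = {("too", "much"): 4}

score = dice("too", "much", unigram, bigram)  # 2 * 4 / (10 + 5) = 8/15
```

In the word-association-based model, this score replaces raw frequency when selecting the word that splits each span.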
Extra Techniques 1/2: Directionalization
For "as soon as possible":
n-gram (one-side generation): ($, $, as), ($, as, soon), (as, soon, as), (soon, as, possible)
HWS (double-side generation): ($, $, as), ($, as, as), (as, as, soon), (as, as, possible)
Extra Techniques 1/2: Directionalization
HWS (directionalization): each history word is tagged with the branch (L or R) taken from it toward the generated word:
($, as, as), (as, as, soon), (as, as, possible) → ($-R, as-R, as), (as-R, as-L, soon), (as-R, as-R, possible)
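A minimal sketch of direction-tagged trigram extraction from a binary HWS tree. The tree below is hand-built for "as soon as possible"; the "$-R" padding convention for the start symbols is an assumption.

```python
def directional_trigrams(tree, hist=(("$", "R"), ("$", "R"))):
    """Emit trigrams whose two history words are tagged with the
    branch (-L or -R) taken from them toward the generated word."""
    if tree is None:
        return []
    word, left, right = tree
    context = tuple(f"{w}-{d}" for w, d in hist)
    return ([context + (word,)]
            + directional_trigrams(left, hist[1:] + ((word, "L"),))
            + directional_trigrams(right, hist[1:] + ((word, "R"),)))

# HWS tree for "as soon as possible": root "as", whose right child "as"
# generates "soon" on its left and "possible" on its right.
tree = ("as", None, ("as", ("soon", None, None), ("possible", None, None)))
trigrams = directional_trigrams(tree)
```

This reproduces the direction-tagged trigrams on the slide, so the model can distinguish left-side from right-side generation of the same history.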
Extra Techniques 2/2: Unification
When constructing an HWS structure, each word in a sentence is counted only once (e.g. repeated "the").
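Counting each word at most once per sentence can be sketched as follows (the function name is illustrative):

```python
from collections import Counter

def unified_counts(sentences):
    """Unification: each word contributes at most one count per sentence."""
    counts = Counter()
    for sentence in sentences:
        counts.update(set(sentence.split()))
    return counts

counts = unified_counts(["the cat chased the dog", "the bird"])
# "the" appears three times as a token but in only two sentences.
```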
Intrinsic Experiments
Training data
– British National Corpus (449,755 sentences, 10 million words)
Test data
– English Gigaword Corpus (44,702 sentences, 1 million words)
Preprocessing
– NLTK tokenizer
– Lowercasing
Word association score
– Dice coefficient
Smoothing methods
– MKN (Modified Kneser-Ney) (Chen & Goodman, 1999)
– GLM (Generalized Language Model) (Pickhardt et al., 2014)
Evaluation measures
– Perplexity
– Coverage (|TR∩TE| / |TE|)
– Usage (|TR∩TE| / |TR|)
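The coverage and usage measures above can be computed over the sets of training and test n-grams as in this small sketch:

```python
def coverage_and_usage(train_ngrams, test_ngrams):
    """Coverage = |TR ∩ TE| / |TE|; Usage = |TR ∩ TE| / |TR|."""
    tr, te = set(train_ngrams), set(test_ngrams)
    shared = len(tr & te)
    return shared / len(te), shared / len(tr)

# Toy n-gram sets: two of the three test n-grams were seen in training.
cov, use = coverage_and_usage({"a b", "b c", "c d", "d e"},
                              {"b c", "c d", "x y"})
```

High coverage with low usage suggests many training n-grams never fire at test time; the two measures together characterize how well the extracted sequences generalize.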
Evaluation
[Results table]
Extrinsic Experimental Settings
Training data
– TED Talks French-English parallel corpus ( sentence pairs)
Test data
– TED Talks French-English parallel corpus (1,617 sentence pairs)
Translation toolkit
– Moses
Evaluation measures
– BLEU (Papineni et al., 2002)
– METEOR (Banerjee & Lavie, 2005)
– TER (Snover et al., 2006)
Extrinsic Evaluation
[Results table]
Conclusions
We proposed an improved hierarchical word sequence language model using word association, together with two extra techniques (directionalization and unification).
The proposed model models natural language more precisely than the original FB-HWS.
The proposed model performs better in both intrinsic and extrinsic experiments.
Source code can be downloaded at –
Thank you for your attention!