Slide 1
Kyoto University Participation to the 3rd Workshop on Asian Translation
Fabien Cromieres, Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi
Slide 2
Overview of our submissions
2 systems:
- KyotoEBMT
  - Example-Based Machine Translation
  - Uses dependency analysis for both source and target side
  - Some small incremental improvements over our last year's participation
- KyotoNMT
  - Our new implementation of the Neural MT paradigm
  - Sequence-to-Sequence model with Attention Mechanism, as first introduced by (Bahdanau et al., 2015)
For the tasks: ASPEC Ja -> En, ASPEC En -> Ja, ASPEC Ja -> Zh, ASPEC Zh -> Ja
Slide 3
KyotoEBMT
Slide 4
KyotoEBMT Overview

Example-Based MT paradigm:
- Needs a parallel corpus
- Few language-specific assumptions (still a few language-specific rules)
Tree-to-Tree Machine Translation:
- Maybe the least commonly used variant of x-to-x
- Sensitive to parsing quality of both source and target languages
- Maximizes the chances of preserving information
Dependency trees:
- Less commonly used than constituent trees
- Most natural for Japanese
- Should contain all important semantic information
Slide 5
KyotoEBMT pipeline

A fairly classic pipeline:
1. Preprocessing of the parallel corpus
2. Processing of the input sentence
3. Decoding / Tuning / Reranking
   - Tuning and reranking done with kbMira (seems to work better than PRO for us)
Slide 6
KyotoNMT
Slide 7
KyotoNMT Overview

Uses the sequence-to-sequence with attention model:
- As proposed in (Bahdanau et al., 2015), with other subsequent improvements:
  - UNK-tag replacement (Luong et al., 2015)
  - ADAM training, sub-word units, …
- Hopefully we can add more original ideas in the future
Implemented in Python using the Chainer library
- A version is open-sourced under the GPL
Slide 8
Sequence-to-Sequence with Attention (Bahdanau+, 2015)
[Architecture diagram: source word embeddings <620> feed the encoder LSTMs <1000>; an attention model combines the encoding of the input with the previous decoder state into the current context; the decoder LSTM <1000>, a concatenation, a maxout layer <500>, and a softmax over the target vocabulary <30000> produce the new word, also conditioned on the previously generated word's target embedding <620>. Example: "I am a Student" -> "私 は 学生 です". A minimal sketch of the attention step is given below.]
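The attention model in the diagram computes, for each decoder step, a weighted sum of the encoder states. Below is a minimal NumPy sketch of this step under the additive-attention formulation of (Bahdanau et al., 2015); all names and dimensions are illustrative and not taken from the actual KyotoNMT implementation.

```python
# Minimal NumPy sketch of the additive attention step in the diagram above
# (Bahdanau et al., 2015). Names and dimensions are illustrative only.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(encoder_states, prev_decoder_state, W_h, W_s, v):
    """Compute the current context for one decoder step.

    encoder_states:     (n, 2*h) encoder states (e.g. forward/backward concatenation)
    prev_decoder_state: (h,)     previous decoder hidden state
    W_h, W_s, v:                 learned attention parameters
    """
    # score_i = v . tanh(W_h h_i + W_s s_prev)  -- additive ("MLP") attention
    scores = np.tanh(encoder_states @ W_h.T + prev_decoder_state @ W_s.T) @ v
    weights = softmax(scores)           # distribution over source positions
    context = weights @ encoder_states  # weighted sum of encoder states
    return context, weights

# Toy example: 4 source words, hidden size 1000 as in the diagram.
n, h = 4, 1000
H = np.random.randn(n, 2 * h)
s_prev = np.random.randn(h)
W_h = 0.01 * np.random.randn(h, 2 * h)
W_s = 0.01 * np.random.randn(h, h)
v = 0.01 * np.random.randn(h)
context, weights = attention_context(H, s_prev, W_h, W_s, v)
print(weights.sum())  # ~1.0: the attention weights form a distribution
```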
Slide 9
Depending on the experiment, the recurrent units were GRUs, LSTMs, or 2-layer LSTMs, and the source and target vocabulary sizes varied; all other values were the same for all experiments.
[Same architecture diagram as on the previous slide, annotated with the components that varied.]
Slide 10
Regularization

Weight Decay:
- Choosing a good value seemed quite important: 1e-6 worked noticeably better than 1e-5 or 1e-7
Early Stopping:
- Keep the parameters with the best loss on the dev set, or the parameters with the best BLEU on the dev set
- "Best BLEU" works better, but it is even better to ensemble "best BLEU" and "best loss"
Dropout:
- Only used between LSTM layers (when using a multi-layer LSTM)
- 20% dropout
Noise on target word embeddings (detailed on the next slide)
(A minimal sketch of these regularization choices follows below.)
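The following is a hedged Python sketch of how the choices above could fit together in a training loop: explicit weight decay in the parameter update, dropout between stacked LSTM layers, and keeping both the "best loss" and "best BLEU" checkpoints. Names such as update_step and maybe_checkpoint are hypothetical helpers, not part of KyotoNMT.

```python
# Sketch of the regularization choices: weight decay, inter-layer dropout,
# and keeping the best-loss / best-BLEU checkpoints for early stopping.
import copy
import numpy as np

WEIGHT_DECAY = 1e-6   # the value reported above to work noticeably better
DROPOUT_RATIO = 0.2   # 20% dropout between LSTM layers

def update_step(params, grads, lr=0.001):
    for name in params:
        # L2 weight decay: pull every parameter slightly towards zero.
        grads[name] = grads[name] + WEIGHT_DECAY * params[name]
        params[name] -= lr * grads[name]

def inter_layer_dropout(h, train=True):
    # Applied only between LSTM layers, and only at training time.
    if not train:
        return h
    mask = (np.random.rand(*h.shape) >= DROPOUT_RATIO) / (1.0 - DROPOUT_RATIO)
    return h * mask

best = {"loss": (np.inf, None), "bleu": (-np.inf, None)}

def maybe_checkpoint(params, dev_loss, dev_bleu):
    # Early stopping: remember both the best-loss and the best-BLEU parameters,
    # so that the two can later be ensembled as described above.
    if dev_loss < best["loss"][0]:
        best["loss"] = (dev_loss, copy.deepcopy(params))
    if dev_bleu > best["bleu"][0]:
        best["bleu"] = (dev_bleu, copy.deepcopy(params))
```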
Slide 11
Noise on target word embeddings

- The embedding of the previously generated word can be a source of cascading errors at translation time: at training time we always give the correct previous word, but not at translation time
- Idea: add random noise to this input at training time, to force the network not to rely too much on this information
- Seems to work (+1.5 BLEU)
- But is it actually because the network became less prone to cascading errors, or simply a regularization effect?
[Same architecture diagram as before, with the previously-generated-word input highlighted; a minimal sketch follows below.]
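A minimal sketch of the idea: at training time only, random noise is added to the embedding of the previously generated word before it enters the decoder. Gaussian noise with sigma=0.1 is an illustrative assumption; the slides do not specify the noise distribution or scale.

```python
# Noise on the previous-word target embedding, applied at training time only.
# The noise distribution and scale are illustrative assumptions.
import numpy as np

def noisy_previous_word_embedding(embedding, train=True, sigma=0.1):
    """embedding: (620,) target embedding of the previously generated word."""
    if not train:
        return embedding          # no noise at translation time
    noise = np.random.normal(0.0, sigma, size=embedding.shape)
    return embedding + noise      # discourage over-reliance on this input
```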
Slide 12
Translation

Translation with beam search:
- Large beam (maximum 100): although other authors mention issues with large beams, it worked for us
- Normalization of the score by the length of the sentence: the final n-best candidates are pruned by the average loss per word
- UNK words are replaced with a dictionary, using the attention values
  - Dictionary extracted from the aligned training corpus
  - The attention is not always very precise, but it does help
(A sketch of the length normalization and the UNK replacement is given below.)
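The two post-processing steps above can be sketched as follows: (1) rank the final n-best candidates by average loss per word instead of total loss, and (2) replace UNK tokens using the most-attended source word and a bilingual dictionary. The function names and data layout are illustrative assumptions, not the KyotoNMT API.

```python
# Length normalization of n-best candidates and attention-based UNK replacement.
import numpy as np

def rank_candidates(candidates):
    """candidates: list of (target_words, total_neg_log_prob, attention_matrix)."""
    # Length normalization: long hypotheses are not unfairly penalized when
    # using a large beam.
    return sorted(candidates, key=lambda c: c[1] / max(len(c[0]), 1))

def replace_unk(target_words, attention_matrix, source_words, dictionary):
    """attention_matrix[i, j]: attention on source word j when producing target word i."""
    output = []
    for i, word in enumerate(target_words):
        if word == "<UNK>":
            j = int(np.argmax(attention_matrix[i]))  # most-attended source word
            src = source_words[j]
            word = dictionary.get(src, src)          # fall back to copying the source word
        output.append(word)
    return output
```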
Slide 13
Ensembling

Ensembling is known to improve Neural MT results substantially. We could confirm this, using three types of ensembling:
- "Normal" ensembling: train different models and ensemble over them
- Self-ensembling: ensemble several parameter sets from different steps of the same training session
- Mixed ensembling: train several models, and use several parameter sets for each model

Observations:
- Ensembling does help a lot
- Mixed > Normal > Self
- Diminishing returns (typically +2-3 BLEU going from one to two models, less when going from three to four models)
- Geometric averaging of probabilities worked better than arithmetic averaging (sketched below)
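A minimal sketch contrasting arithmetic and geometric averaging of the next-word distributions produced by several models at one decoding step; model_probs is a hypothetical list of per-model probability vectors over the target vocabulary.

```python
# Arithmetic vs geometric averaging of per-model next-word distributions.
import numpy as np

def arithmetic_ensemble(model_probs):
    return np.mean(model_probs, axis=0)

def geometric_ensemble(model_probs, eps=1e-12):
    # Average in log space, then renormalize to obtain a proper distribution.
    log_avg = np.mean(np.log(np.asarray(model_probs) + eps), axis=0)
    p = np.exp(log_avg)
    return p / p.sum()

# Toy example with two "models" over a 4-word vocabulary: the geometric mean
# gives relatively less mass to words that any single model considers unlikely
# than the arithmetic mean does.
p1 = np.array([0.7, 0.1, 0.1, 0.1])
p2 = np.array([0.1, 0.7, 0.1, 0.1])
print(arithmetic_ensemble([p1, p2]))
print(geometric_ensemble([p1, p2]))
```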
Slide 14
The question of segmentation
Several options for segmentation:
- Natural (i.e. "words" for English)
- Subword units, using e.g. BPE (Sennrich et al., 2015)
- Automatic segmentation tools (JUMAN, SKP)
-> Trade-off between sentence size, generalization capacity, and computational efficiency

Segmentations used:
- English: word units; subword units with BPE
- Japanese: JUMAN segmentation
- Chinese: SKP segmentation; "short units" segmentation

(A minimal BPE sketch is given below.)
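As a reminder of how BPE subword units are learned (Sennrich et al., 2015), here is a minimal sketch that repeatedly merges the most frequent symbol pair in a toy word-frequency vocabulary; it is a simplified illustration of the subword-unit idea, not the actual tool or data used for the submissions.

```python
# Minimal BPE merge-learning sketch on a toy vocabulary.
import collections
import re

def get_pair_counts(vocab):
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    pattern = re.escape(" ".join(pair))
    merged = "".join(pair)
    return {re.sub(pattern, merged, word): freq for word, freq in vocab.items()}

# Words are space-separated symbols with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):  # the number of merges controls the subword vocabulary size
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best_pair = max(pairs, key=pairs.get)
    vocab = merge_pair(best_pair, vocab)
print(vocab)  # frequent words become single units, rare words stay split
```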
Slide 15
Results
Slide 16
Results for WAT 2016 Ja -> En

- In terms of BLEU, ensembling 4 simple models (NMT 2) beats the larger NMT system (NMT 1)
- In terms of human evaluation (Pairwise), the larger NMT model has a slightly better score
- The AM-FM metric actually ranks the EBMT system higher

System   BLEU    AM-FM   Pairwise      JPO Adequacy
EBMT     21.22   59.52   -
NMT 1    24.71   56.27   47.0 (3/9)    3.89 (1/3)
NMT 2    26.22   55.85   44.25 (4/9)

System   # layers   Source Vocabulary   Target Vocabulary   Ensembling
NMT 1    2          200k (JUMAN)        52k (BPE)           -
NMT 2    1          30k (JUMAN)         30k (words)         x4
Slide 17
Results for WAT 2016 En -> Ja

System   BLEU    AM-FM   Pairwise       JPO Adequacy
EBMT     31.03   74.75   -
NMT 1    36.19   73.87   55.25 (1/10)   4.02 (1/4)

System   # layers   Source Vocabulary   Target Vocabulary   Ensembling
NMT 1    2          52k (BPE)           52k (BPE)           -
Slide 18
Results for WAT 2016 Ja -> Zh

System   BLEU    AM-FM   Pairwise      JPO Adequacy
EBMT     30.27   76.42   30.75 (3/5)   -
NMT 1    31.98   76.33   58.75 (1/5)   3.88 (1/3)

System   # layers   Source Vocabulary   Target Vocabulary   Ensembling
NMT 1    2          30k (JUMAN)         30k (KyotoMorph)    -
Slide 19
Results for WAT 2016 Zh -> Ja

System   BLEU    AM-FM   Pairwise      JPO Adequacy
EBMT     36.63   76.71   -
NMT 1    46.04   78.59   63.75 (1/9)   3.94 (1/3)
NMT 2    44.29   78.44   56.00 (2/9)

System   # layers   Source Vocabulary   Target Vocabulary   Ensembling
NMT 1    2          30k (KyotoMorph)    30k (JUMAN)         x2
NMT 2               200k (KyotoMorph)   50k (JUMAN)         -
Slide 20
EBMT vs NMT

Src:  本フローセンサーの型式と基本構成,規格を図示, 紹介。
Ref:  Shown here are type and basic configuration and standards of this flow with some diagrams.
EBMT: This flow sensor type and the basic composition, standard is illustrated, and introduced.
NMT:  This paper introduces the type, basic configuration, and standards of this flow sensor.

- NMT vs EBMT: NMT seems more fluent
- NMT sometimes adds parts not in the source (over-translation)
- NMT sometimes forgets to translate some parts of the source (under-translation)
Slide 21
Conclusion

- Neural MT proved to be very effective, especially for Zh -> Ja (almost +10 BLEU compared with EBMT)
- NMT vs EBMT:
  - NMT output is more fluent and readable
  - NMT more often has issues of under- or over-translation
  - NMT takes longer to train but can be faster to translate
- Finding the optimal settings for NMT is very tricky:
  - Many hyper-parameters
  - Each training run takes a long time on a single GPU