Learning Cross-lingual Knowledge with Multilingual BLSTM for Emphasis Detection with Limited Training Data

Yishuang Ning 1,2, Zhiyong Wu 1,2,3, Runnan Li 1,2, Jia Jia 1,2,*, Mingxing Xu 1,2, Helen Meng 3, Lianhong Cai 1,2
1 Tsinghua-CUHK Joint Research Center for Media Sciences, Technologies and Systems, Graduate School at Shenzhen, Tsinghua University
2 Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University
3 Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong

1. Introduction

Motivation
- Automatic detection of emphasis plays an important role in human-computer interaction, e.g., emphatic speech synthesis, content spotting and user intention understanding
- Conventional classification models cannot incorporate the contextual information on which emphasis detection mainly relies
- LSTM can leverage contextual information for modeling, but it needs a moderate or large corpus to train a good model

Contribution
- Formulate the emphasis detection problem as a sequential learning task and introduce contextual dependencies through BLSTM modeling
- Leverage cross-lingual knowledge between different languages to improve detection performance
- Propose a multilingual BLSTM (MTL-BLSTM) model for emphasis detection

2. Problem Statement

Definition of Emphasis
- A word or part of a word perceived as standing out from its surrounding words in auditory perception

Definition of Emphasis Detection
- Perceive or recognize the emphasized speech segments in natural speech
- Label words or phonemes in the corpus as emphatic or non-emphatic

Problem Statement
- View emphasis detection as a binary classification problem
- Phonemes or syllables of emphatic words are labeled 1 (positive samples)
- Phonemes or syllables of non-emphatic words are labeled 0 (negative samples)

3. Acoustic Features

Segmental features, extracted at the syllable level for Mandarin and the phoneme level for English
- F0-related features (set to 0 for unvoiced segments)
  - meanlf0: the mean value of log F0
  - minlf0: the minimum value of log F0
  - maxlf0: the maximum value of log F0
  - lf0range: the range of log F0
- Energy-related features (extracted from MFCC)
  - meanenergy: the mean value of energy
  - minenergy: the minimum value of energy
  - maxenergy: the maximum value of energy
  - energyrange: the range of energy
- Duration
  - duration: duration of each syllable or phoneme
- Semitone: F0 converted to the semitone scale, which is more suitable for human auditory perception (f is the F0 value in the conversion)

4. Approaches

Emphasis Detection with Multilingual BLSTM (MTL-BLSTM)

Motivation 1: Emphasis is related to its past and future acoustic contexts
- Emphasis has the characteristic of local prominence
- Syllables whose acoustic features are higher than those of their neighbors are more likely to be perceived as emphasized

Motivation 2: Many intrinsic features can be shared across different languages
- F0 and duration vary with vowel height
- F0 and duration are constrained by the place of articulation
- Emphasis can be realized by F0 variations

Architecture
- Form a uniform representation of the input features across different languages
- Hidden layers are shared across different languages
- Softmax output layers are language-dependent

Training procedure (see the training-loop sketch below)
- A variation of multi-task learning (MTL): the tasks of both languages are trained simultaneously
- Each update: pick a task t, pick samples from task t, compute the loss, compute the gradient, update the model
- The mini-batch-based adaptive gradient (Adagrad) algorithm is used
- The model is updated according to the task-specific objective function
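The following minimal sketch illustrates the alternating multi-task training loop described above. It is an illustration under assumptions, not the authors' implementation: the toy linear layer stands in for the shared BLSTM layers, the batch sampler and dimensions are hypothetical placeholders, and the strictly alternating task choice is one way to realize "pick a task t"; only the Gaussian initialization and the learning rate of 0.01 are taken from the poster.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, hid_dim, n_classes = 10, 16, 2  # uniform input representation; emphatic / non-emphatic

# Shared parameters (stand-in for the shared BLSTM layers) and one softmax
# output layer per language, all initialized from a Gaussian distribution.
params = {
    "shared": rng.normal(0.0, 0.1, (feat_dim, hid_dim)),
    "MAN": rng.normal(0.0, 0.1, (hid_dim, n_classes)),
    "ENG": rng.normal(0.0, 0.1, (hid_dim, n_classes)),
}
grad_sq = {k: np.zeros_like(v) for k, v in params.items()}  # Adagrad accumulators
lr, eps = 0.01, 1e-8  # learning rate 0.01 as stated on the poster

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def adagrad_step(name, grad):
    grad_sq[name] += grad ** 2
    params[name] -= lr * grad / (np.sqrt(grad_sq[name]) + eps)

def sample_batch(task, batch_size=8):
    # Placeholder for real mini-batches of segmental acoustic features and emphasis labels.
    x = rng.normal(size=(batch_size, feat_dim))
    y = rng.integers(0, n_classes, size=batch_size)
    return x, y

for step in range(1000):
    task = "MAN" if step % 2 == 0 else "ENG"        # pick a task t (alternating here)
    x, y = sample_batch(task)                       # pick samples from task t
    h = np.tanh(x @ params["shared"])               # shared layers (BLSTM in the real model)
    p = softmax(h @ params[task])                   # language-dependent softmax output
    loss = -np.log(p[np.arange(len(y)), y]).mean()  # task-specific cross-entropy loss
    dz = p.copy()                                   # gradient of the loss w.r.t. the logits
    dz[np.arange(len(y)), y] -= 1.0
    dz /= len(y)
    d_task = h.T @ dz                               # gradient for the task-specific layer
    dh = (dz @ params[task].T) * (1.0 - h ** 2)
    d_shared = x.T @ dh                             # gradient for the shared layers
    adagrad_step(task, d_task)                      # update with mini-batch Adagrad
    adagrad_step("shared", d_shared)
    if step % 200 == 0:
        print(f"step {step}  task {task}  loss {loss:.3f}")
```

In the real system the shared layer would be the stacked BLSTM over feature sequences, and the batches would be drawn from the Mandarin and English corpora.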
5. Experiments and Results

Experimental Setup
- Data sets
  - Language 1: Mandarin (MAN) corpus; Language 2: English (ENG) corpus
  - 1942 MAN utterances from Sogou Voice Assistant; 339 ENG utterances from CUHK
  - 100 MAN utterances and 30 ENG utterances from the above sets are used as the test set
- Comparison methods
  - Support vector machine (SVM), Bayesian network (BN), conditional random field (CRF), monolingual LSTM (MNL-LSTM), monolingual BLSTM (MNL-BLSTM), mixed-lingual BLSTM (MXL-BLSTM)
  - Our method: multilingual BLSTM (MTL-BLSTM)

Experimental Results
[Figures: Performance on ENG corpus; Performance on MAN corpus]

Experiment 1: Influence of contextual dependencies on the ENG test set
- MNL-LSTM outperforms SVM, BN and CRF by 2-15.6% in terms of F1-measure
- Compared with CRF, LSTM can better leverage contextual dependencies for modeling
- When both past and future contexts are considered (MNL-BLSTM), the performance is further improved

Experiment 2: Influence of cross-lingual knowledge
- Both MXL-BLSTM and MTL-BLSTM outperform MNL-BLSTM
- The model with the uniform feature representation (MTL-BLSTM) outperforms simply mixing the samples of the different languages (MXL-BLSTM)
- The results demonstrate that a large amount of MAN training data helps improve performance when only a limited amount of ENG training data is available, and vice versa

Experiment 3: Influence of the complementary data
- The performance on the ENG data improves consistently with the scale of the MAN training data
- The results validate the usefulness of cross-lingual knowledge for emphasis detection

Experiment 4: Influence of model architectures
- The number of LSTM memory blocks per hidden layer affects the model performance
- The performance improves at first and then decreases gradually as the number grows (64 is the best)

6. Conclusions
- Proposes a multilingual BLSTM (MTL-BLSTM) to address the emphasis detection problem
- The cross-lingual knowledge can be learned to benefit both languages
- Experimental results demonstrate the effectiveness of the proposed method and show superior performance over the monolingual BLSTM (MNL-BLSTM)

Notation for the softmax output layer and training objective (see the formula sketch below)
- Model parameters: initialized with a Gaussian distribution
- ct: class category (0 or 1) of the tth language
- zj: linear prediction of the jth category
- m: number of class categories
- ε: learning rate (initialized to 0.01)
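The formula these symbols describe is not preserved in the extraction; the following is a sketch of a standard softmax output with a task-specific cross-entropy objective that is consistent with the notation above, not necessarily the poster's exact formulation.

```latex
% Language-dependent softmax output over the m class categories (sketch):
P(c_t = j \mid \mathbf{x}) = \frac{\exp(z_j)}{\sum_{k=1}^{m} \exp(z_k)}, \qquad j = 1, \dots, m
% Task-specific cross-entropy objective minimized for language t:
\mathcal{L}_t = -\log P(c_t \mid \mathbf{x})
% Gaussian-initialized parameters are updated with mini-batch Adagrad,
% learning rate \varepsilon = 0.01.
```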