Learning Cross-lingual Knowledge with Multilingual BLSTM for Emphasis Detection with Limited Training Data

Yishuang Ning 1,2, Zhiyong Wu 1,2,3, Runnan Li 1,2, Jia Jia 1,2,*, Mingxing Xu 1,2, Helen Meng 3, Lianhong Cai 1,2
1 Tsinghua-CUHK Joint Research Center for Media Sciences, Technologies and Systems, Graduate School at Shenzhen, Tsinghua University
2 Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University
3 Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong

1. Introduction

Motivation
- Automatic emphasis detection plays an important role in human-computer interaction, e.g., emphatic speech synthesis, content spotting and user-intention understanding.
- Conventional classification models cannot incorporate the contextual information on which emphasis detection mainly relies.
- LSTM can leverage contextual information for modeling, but it needs a moderate or large corpus to train a good model.

Contributions
- Introduce contextual dependencies into emphasis detection; formulate the problem as a sequential learning task and use BLSTM for modeling.
- Leverage cross-lingual knowledge between different languages to improve detection performance.
- Propose a multilingual BLSTM (MTL-BLSTM) for emphasis detection.

2. Problem Statement

Definition of emphasis
- A word, or part of a word, perceived as standing out from its surrounding words in auditory perception.

Definition of emphasis detection
- Perceive or recognize the emphasized segments in natural speech.
- Label words or phonemes in the corpus as emphatic or non-emphatic.

Problem statement
- View emphasis detection as a binary classification problem:
  - Phonemes or syllables of emphatic words are labeled 1 (positive samples).
  - Phonemes or syllables of non-emphatic words are labeled 0 (negative samples).

3. Acoustic Features

Segmental features are extracted at the syllable level for Mandarin and at the phoneme level for English.
- F0-related features (set to 0 for unvoiced segments):
  - meanlf0: mean value of log F0
  - minlf0: minimum value of log F0
  - maxlf0: maximum value of log F0
  - lf0range: range of log F0
- Energy-related features (extracted from MFCC):
  - meanenergy: mean value of energy
  - minenergy: minimum value of energy
  - maxenergy: maximum value of energy
  - energyrange: range of energy
- Duration:
  - duration: duration of each syllable or phoneme
- Semitone: F0 converted to the semitone scale, which is more suitable for human auditory perception (f is the F0 value).

4. Approaches

Emphasis detection with multilingual BLSTM (MTL-BLSTM)

Motivation 1: emphasis is related to its past and future acoustic contexts.
- Emphasis has the characteristic of local prominence.
- Syllables whose acoustic features are higher than those of their neighbors are more easily perceived as emphasized.

Motivation 2: many intrinsic features can be shared across different languages.
- F0 and duration vary with vowel height.
- F0 and duration are constrained by the place of articulation.
- Emphasis can be realized by F0 variations.

Architecture
- A uniform representation of the input features is formed across languages.
- Hidden layers are shared across the languages.
- Softmax output layers are language-dependent.

Training procedure
- A variation of multi-task learning (MTL): the tasks of both languages are trained simultaneously.
- The mini-batch-based adaptive gradient (Adagrad) algorithm is used.
- The model is updated according to the task-specific objective function.
- Each iteration: pick a task t, pick samples from task t, compute the loss, compute the gradient, update the model.

Notation
- ct: class category of the t-th language
- zj: linear prediction of the j-th category
- m: number of class categories
- ε: learning rate (initialized to 0.01)
- Weights are initialized from a Gaussian distribution.

5. Experiments and Results

Experimental setup
- Data sets:
  - Language 1: Mandarin (MAN) corpus; Language 2: English (ENG) corpus.
  - 1942 MAN utterances from Sogou Voice Assistant; 339 ENG utterances from CUHK.
  - 100 MAN utterances and 30 ENG utterances from the above sets are held out as the test set.
- Comparison methods: support vector machine (SVM), Bayesian network (BN), conditional random field (CRF), monolingual LSTM (MNL-LSTM), monolingual BLSTM (MNL-BLSTM), mix-lingual BLSTM (MXL-BLSTM).
- Our method: multilingual BLSTM (MTL-BLSTM).

[Figures: performance on the ENG and MAN corpora; influence of the amount of complementary MAN data (left); influence of model architecture (upper right).]

Experiment 1: influence of contextual dependencies (ENG test set)
- MNL-LSTM outperforms SVM, BN and CRF (by 2-15.6% in F1-measure).
- Compared with CRF, LSTM can better leverage contextual dependencies for modeling.
- When both past and future contexts are considered (MNL-BLSTM), performance improves further.

Experiment 2: influence of cross-lingual knowledge
- Both MXL-BLSTM and MTL-BLSTM outperform MNL-BLSTM.
- The model with a uniform feature representation (MTL-BLSTM) is better than simply mixing the samples of the two languages (MXL-BLSTM).
- The results demonstrate that a large amount of MAN training data helps improve performance under a limited amount of ENG training data, and vice versa.

Experiment 3: influence of the complementary data
- Performance on the ENG data improves consistently as the scale of the MAN training data grows.
- The results validate the usefulness of cross-lingual knowledge for emphasis detection.

Experiment 4: influence of model architectures
- The number of LSTM memory blocks per hidden layer affects model performance.
- Performance first improves and then decreases gradually; 64 blocks per layer is best.

6. Conclusions
- Proposes a multilingual BLSTM (MTL-BLSTM) to address the emphasis detection problem.
- Cross-lingual knowledge can be learned to benefit both languages.
- Experimental results demonstrate the effectiveness of the proposed method and show superior performance over the monolingual BLSTM (MNL-BLSTM).
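The architecture described above (shared hidden layers, language-dependent softmax outputs) can be sketched in NumPy. This is a minimal forward-pass illustration only; the dimensions, random initialization scale, and the single shared BLSTM layer are assumptions for the sketch, not the configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(X, W, b, H):
    """Run one LSTM direction over X (T x D); return hidden states (T x H)."""
    h, c, outs = np.zeros(H), np.zeros(H), []
    for t in range(X.shape[0]):
        z = np.concatenate([X[t], h]) @ W + b          # all four gates at once
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)                     # cell state
        h = o * np.tanh(c)                             # hidden state
        outs.append(h)
    return np.stack(outs)

# Assumed toy dimensions: feature dim D, hidden units H per direction, 2 classes.
D, H, C = 10, 8, 2
W_f = rng.normal(0, 0.1, (D + H, 4 * H)); b_f = np.zeros(4 * H)
W_b = rng.normal(0, 0.1, (D + H, 4 * H)); b_b = np.zeros(4 * H)
# Language-dependent softmax output layers on top of the shared BLSTM.
heads = {"MAN": rng.normal(0, 0.1, (2 * H, C)),
         "ENG": rng.normal(0, 0.1, (2 * H, C))}

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mtl_blstm(X, lang):
    fwd = lstm_forward(X, W_f, b_f, H)                 # past context
    bwd = lstm_forward(X[::-1], W_b, b_b, H)[::-1]     # future context
    shared = np.concatenate([fwd, bwd], axis=1)        # shared representation
    return softmax(shared @ heads[lang])               # per-segment emphatic prob.

X = rng.normal(size=(5, D))   # 5 syllables/phonemes of acoustic features
probs = mtl_blstm(X, "ENG")
print(probs.shape)  # (5, 2)
```

Both languages pass through the same recurrent weights; only the final projection differs, which is what lets gradients from the large MAN corpus shape the representation used by the small ENG corpus.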
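The training procedure (alternate between tasks, sample a mini-batch from the picked task, update shared and task-specific parameters with Adagrad) can be demonstrated end-to-end on synthetic data. Everything below is a hypothetical toy setup: a small shared tanh layer stands in for the shared BLSTM, the data and batch size are invented, and only the learning rate of 0.01 comes from the poster.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data for two "languages": a large task and a limited-data task that
# share the same underlying decision rule (the shareable knowledge).
data = {}
for lang, n in [("MAN", 200), ("ENG", 60)]:
    X = rng.normal(size=(n, 2))
    data[lang] = (X, (X[:, 0] + X[:, 1] > 0).astype(int))

# Gaussian-initialized weights: one shared layer, one softmax head per task.
W = {"shared": rng.normal(0, 0.1, (2, 4)),
     "MAN": rng.normal(0, 0.1, (4, 2)),
     "ENG": rng.normal(0, 0.1, (4, 2))}
G = {k: np.zeros_like(v) for k, v in W.items()}   # Adagrad accumulators
lr, eps = 0.01, 1e-8                              # learning rate init 0.01

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def step(lang, batch=16):
    X, y = data[lang]
    idx = rng.choice(len(X), batch)               # pick samples from task t
    Xb, yb = X[idx], y[idx]
    hid = np.tanh(Xb @ W["shared"])               # shared hidden layer
    p = softmax(hid @ W[lang])                    # task-specific softmax
    loss = -np.log(p[np.arange(batch), yb] + eps).mean()   # compute loss
    dz = p.copy(); dz[np.arange(batch), yb] -= 1; dz /= batch
    grads = {lang: hid.T @ dz,                    # compute gradient
             "shared": Xb.T @ ((dz @ W[lang].T) * (1 - hid ** 2))}
    for k, g in grads.items():                    # Adagrad update
        G[k] += g ** 2
        W[k] -= lr * g / (np.sqrt(G[k]) + eps)
    return loss

for it in range(2000):
    step("MAN" if it % 2 == 0 else "ENG")         # pick a task t (alternating)

X, y = data["ENG"]
acc = (softmax(np.tanh(X @ W["shared"]) @ W["ENG"]).argmax(1) == y).mean()
print("ENG accuracy:", round(acc, 2))
```

Because the shared layer receives updates from both tasks, the limited-data ENG head benefits from the structure learned on the larger MAN task, which is the mechanism behind Experiments 2 and 3.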
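The segmental F0 features of Section 3 are simple statistics over voiced frames, and a sketch makes the "0 for unvoiced segments" convention concrete. The semitone conversion 12·log2(f/f_ref) used here is the standard one, but the reference frequency (55 Hz below) is an assumption, since the poster does not state it.

```python
import numpy as np

def f0_features(f0):
    """Segmental log-F0 statistics for one syllable/phoneme.

    f0: per-frame F0 values in Hz; unvoiced frames are 0.
    Fully unvoiced segments get all-zero features, per the poster.
    """
    voiced = f0[f0 > 0]
    if voiced.size == 0:
        return dict.fromkeys(["meanlf0", "minlf0", "maxlf0", "lf0range"], 0.0)
    lf0 = np.log(voiced)
    return {"meanlf0": lf0.mean(),
            "minlf0": lf0.min(),
            "maxlf0": lf0.max(),
            "lf0range": lf0.max() - lf0.min()}

def semitone(f, f_ref=55.0):
    """Convert F0 in Hz to semitones relative to an assumed reference."""
    return 12.0 * np.log2(f / f_ref)

f0 = np.array([0.0, 110.0, 220.0, 0.0, 165.0])   # two unvoiced frames
feats = f0_features(f0)
print(round(feats["lf0range"], 3))   # log(220) - log(110) = log 2 ≈ 0.693
print(semitone(110.0))               # one octave above 55 Hz = 12.0 semitones
```

The energy statistics (meanenergy, minenergy, maxenergy, energyrange) follow the same min/max/mean/range pattern over the segment's per-frame energy values.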