1-P-30 Speech-to-Speech Translation using Dual Learning and Prosody Conversion. Zhaojie Luo, Yoichi Takashima, Tetsuya Takiguchi, and Yasuo Ariki (Kobe University).


Background and Problems
1. The traditional pipeline approach greatly increases the cumulative loss of information when the individual systems are combined.
2. The prosody of the source utterance is discarded by the ASR system and is therefore not accessible to the TTS system in the target language (e.g., "Hello" / "你好").

Goals
1. Reduce the error rate that arises when combining the Automatic Speech Recognition (ASR), Neural Machine Translation (NMT), and Text-to-Speech (TTS) systems.
2. Convert prosody across languages (speech prosody conversion), e.g., for emotional robots.
3. Apply voice conversion (VC) to the speech-to-speech translation system.

Proposed method
Traditional approaches to speech-to-speech machine translation (S2SMT) use a pipeline architecture: speech -> ASR -> text -> MT -> TTS -> speech (e.g., "Good morning" -> "早上好"). In this paper, we propose a speech-to-speech translation system combined with a voice conversion model: conversion features are extracted from the speaker's collected voice, so the VC module produces the translated speech in the speaker's own voice. To combine the ASR, NMT, TTS, and VC models while reducing the error rate, we apply dual learning (a minimal sketch of the dual-learning loop is given below).

Component models
- ASR: the pre-trained Deep Speech 2 model, trained on 9,400 hours of labeled speech containing 11 million Mandarin utterances.
- NMT: a standard NMT model for Chinese-to-English (Zh-En), trained on the IWSLT 2015 data.
- TTS: Tacotron 2, a neural network architecture that synthesizes speech directly from text.
- VC: StarGAN, which requires no parallel utterances, transcriptions, or time-alignment procedures to train the speech generator (an objective sketch is given below).

Experiments and Results
Mel-cepstral distortion (MCD) was measured for VC with and without dual learning in three settings:
- Native to Native: VC from a native speaker to a native speaker.
- TTS to Native: VC from the TTS voice to a native speaker.
- TTS to Non-Native: VC from the TTS voice to a non-native speaker.
BLEU scores are reported for translating Chinese to English with the standard NMT model. (Sketches of the MCD and BLEU computations appear below.)
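The poster's central idea, combining ASR, NMT, TTS, and VC through dual learning, can be illustrated with a round-trip training loop. The sketch below is a minimal illustration, not the authors' implementation: the model wrappers (`asr`, `nmt`, `tts` and their reverse counterparts) are hypothetical callables, and it assumes the text interfaces are relaxed to differentiable soft token distributions so the reconstruction loss can back-propagate through the whole chain.

```python
# Minimal dual-learning sketch (hypothetical wrappers, not the authors' code).
# Primal chain: source speech -> target speech; dual chain: target speech ->
# reconstructed source speech. The round-trip reconstruction loss provides a
# training signal without paired speech-to-speech translation data.
import torch
import torch.nn.functional as F

def dual_learning_step(src_speech, primal, dual, optimizer):
    """primal = (asr, nmt, tts) for source->target; dual = the reverse chain."""
    asr, nmt, tts = primal
    asr_rev, nmt_rev, tts_rev = dual

    # Primal direction: recognize, translate, synthesize.
    tgt_speech = tts(nmt(asr(src_speech)))

    # Dual direction: run the reverse chain on the primal output.
    src_speech_hat = tts_rev(nmt_rev(asr_rev(tgt_speech)))

    # Round-trip reconstruction loss couples all six models.
    loss = F.l1_loss(src_speech_hat, src_speech)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```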
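The VC component is StarGAN, which the poster notes needs no parallel utterances. A simplified view of why that works is the generator objective below: an adversarial term makes converted features realistic, a speaker classifier pushes them toward the target speaker, and a cycle-consistency term (convert to the target and back) replaces the need for aligned parallel data. The network definitions, loss weights, and the WGAN-style adversarial formulation here are assumptions, not the paper's configuration.

```python
# Simplified StarGAN-VC generator objective (a sketch; G, D, C and the
# loss weights are assumptions, not the paper's configuration).
import torch
import torch.nn.functional as F

def generator_loss(G, D, C, x_src, src_id, tgt_id, lam_cls=1.0, lam_cyc=10.0):
    """x_src: acoustic features; src_id/tgt_id: LongTensor speaker labels."""
    x_fake = G(x_src, tgt_id)                 # convert toward the target speaker
    adv = -D(x_fake).mean()                   # fool the discriminator (WGAN-style)
    cls = F.cross_entropy(C(x_fake), tgt_id)  # sound like the target speaker
    x_rec = G(x_fake, src_id)                 # convert back to the source speaker
    cyc = F.l1_loss(x_rec, x_src)             # cycle consistency: no parallel data needed
    return adv + lam_cls * cls + lam_cyc * cyc
```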
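The VC results are reported as mel-cepstral distortion (MCD). The standard definition, for time-aligned mel-cepstra with the 0th (energy) coefficient removed, is MCD_t = (10 / ln 10) * sqrt(2 * Σ_d (mc_{t,d} - mc'_{t,d})^2) dB; a direct NumPy implementation follows, assuming frame alignment (e.g., by DTW) has already been done:

```python
import numpy as np

def mel_cepstral_distortion(mc_ref: np.ndarray, mc_conv: np.ndarray) -> float:
    """Mean MCD in dB between two time-aligned mel-cepstral sequences.

    mc_ref, mc_conv: arrays of shape (frames, order), with the 0th (energy)
    coefficient already removed and frames already aligned (e.g., by DTW).
    """
    diff = mc_ref - mc_conv
    # MCD_t = (10 / ln 10) * sqrt(2 * sum_d diff_{t,d}^2)
    mcd_per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(mcd_per_frame))
```

Lower MCD indicates converted speech that is spectrally closer to the target.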
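The NMT results are reported in BLEU (the transcript's "BELU"). For reference, corpus-level BLEU can be computed with NLTK as below; the example sentences are illustrative placeholders, not IWSLT 2015 data.

```python
# Corpus-level BLEU with NLTK; smoothing avoids zero scores when higher-order
# n-grams are absent in short sentences. Example sentences are placeholders.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["good", "morning"]]]   # one list of reference translations per segment
hypotheses = [["good", "morning"]]     # tokenized system outputs
score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```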

