Speech Prosody Conversion using Sequence Generative Adversarial Nets

Presentation transcript:

Speech Prosody Conversion using Sequence Generative Adversarial Nets with Continuous Wavelet Transform F0 Features
Zhaojie Luo, Tetsuya Takiguchi, and Yasuo Ariki (Kobe University)

Overview
Background, problems, and goals.

Problems
1. The representation of the fundamental frequency (F0) is too simple for prosody conversion.
2. GANs can only give a score/loss for an entire sequence after it has been generated.

Goals
1. Apply the continuous wavelet transform (CWT) to systematically capture F0 features at different temporal scales (sketched in code after the transcript).
2. Use a sequence GAN (Seq-GAN) to model the continuity of each feature matrix within a complete sentence, so that both long-term and short-term dependencies can be trained (sketched in code after the transcript).

Background
Speech prosody conversion maps a sentence spoken with one emotion (neutral, sad, happy, or angry) to another emotion while keeping the linguistic information unchanged, e.g., for an emotional robot.

Framework
Input voice -> feature extraction -> feature processing -> matrix processing -> training with Seq-GANs -> conversion.

Feature extraction
The F0 contour is interpolated and log-normalized, then decomposed with a Mexican-hat continuous wavelet transform into features at five temporal scales (i = 30, 24, 18, 12, 6).

Training
[Figure: illustration of training the sentence features in the Seq-GANs.]

Results
Perceptual evaluation: confusion matrices of target emotion vs. perceived emotion (angry, sad, happy, neutral) for (a) recorded voice, (b) DBNs+LG, (c) DBNs+NNs, (d) GAN, and (e) Sequence-GAN. [Tables: perception confusion matrices for systems (a)-(e).]
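
The feature-extraction slide describes decomposing an interpolated, log-normalized F0 contour with a Mexican-hat CWT at five temporal scales. Below is a minimal numpy sketch of that step; the mapping from the slide's scale index i to a wavelet width, the 10*a window length, and the function names are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def mexican_hat(points, a):
    # Ricker ("Mexican hat") wavelet of width a, sampled at `points` frames.
    t = np.arange(points) - (points - 1) / 2.0
    amp = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return amp * (1.0 - (t / a) ** 2) * np.exp(-0.5 * (t / a) ** 2)

def extract_f0_features(f0, scales=(6, 12, 18, 24, 30)):
    """Interpolate unvoiced frames, log-normalize, then take the CWT
    at the five temporal scales listed on the feature-extraction slide."""
    idx = np.arange(len(f0))
    voiced = f0 > 0
    log_f0 = np.interp(idx, idx[voiced], np.log(f0[voiced]))  # fill unvoiced gaps
    log_f0 = (log_f0 - log_f0.mean()) / log_f0.std()          # z-normalize
    feats = np.empty((len(scales), len(f0)))
    for row, a in enumerate(scales):
        w = mexican_hat(min(10 * a, len(f0)), a)
        feats[row] = np.convolve(log_f0, w, mode="same")      # CWT at scale a
    return feats
```

The five coefficient tracks can then be stacked into the per-sentence feature matrix that the framework's matrix-processing and Seq-GAN training stages consume.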
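Problem 2 and goal 2 concern the fact that a GAN discriminator only judges a finished sequence. The standard Seq-GAN remedy (Yu et al., "SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient") estimates a reward for a partial sequence via Monte Carlo rollouts. The sketch below illustrates that idea only; `generator.complete` is a hypothetical helper, and no claim is made that this matches this paper's exact training procedure.

```python
import torch

def prefix_rewards(generator, discriminator, prefix, seq_len, n_rollouts=8):
    """Estimate a reward for a partially generated sequence.

    A vanilla GAN discriminator can only score a complete sequence, so the
    Seq-GAN trick is to finish the prefix n_rollouts times with the
    generator's own sampling policy and average the discriminator's scores
    of those completions.
    """
    with torch.no_grad():
        rewards = torch.zeros(prefix.size(0))
        for _ in range(n_rollouts):
            # `generator.complete` is a hypothetical helper that samples the
            # remaining steps until the sequence reaches length seq_len.
            full = generator.complete(prefix, seq_len)
            rewards += discriminator(full).squeeze(-1)  # score finished sequence
    return rewards / n_rollouts
```

Averaged rollout scores give the generator feedback at every time step rather than only at sentence end, which is how the Seq-GAN trains both short-term and long-term dependencies.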