Da-Rong Liu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee

Presentation transcript:

Completely Unsupervised Phoneme Recognition By Adversarially Learning Mapping Relationships From Audio Embeddings
Da-Rong Liu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee

Introduction In this paper, we propose a completely unsupervised phoneme recognition framework in which only unpaired, even unrelated, speech utterances and text sentences are needed for model training. The audio signals are automatically segmented into acoustic tokens and encoded into representative vectors. These vectors are then clustered, and each speech utterance is represented as a sequence of cluster indices.
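The snippet below is a minimal runnable sketch of this front end, just to make the data flow concrete. Fixed-length chunking and mean-pooled MFCC frames are stand-ins for the paper's automatically learned segmentation and segment embeddings; segment_audio, encode_segment, and all sizes are illustrative assumptions, not the authors' method.

import numpy as np

def segment_audio(frames, seg_len=10):
    # Stand-in for automatic acoustic-token segmentation: fixed-length chunks.
    return [frames[i:i + seg_len] for i in range(0, len(frames), seg_len)]

def encode_segment(segment):
    # Stand-in for a learned segment embedding: mean over the segment's frames.
    return segment.mean(axis=0)

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((123, 39))   # one utterance: 123 frames of 39-dim MFCCs
embeddings = np.stack([encode_segment(s) for s in segment_audio(mfcc)])
print(embeddings.shape)                 # (13, 39): one vector per acoustic token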

Introduction A mapping-relationship GAN is then developed: a generator transforms each cluster index sequence into a predicted phoneme sequence, and a discriminator is trained to distinguish the predicted phoneme sequences from real phoneme sequences collected from text sentences. A phoneme recognition accuracy of 36% was achieved on the TIMIT test set for a model trained with the TIMIT audio training set and an unrelated text corpus.

Unsupervised phoneme recognition framework

K-means clustering
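A minimal sketch of the clustering step, assuming each acoustic token has already been encoded as a fixed-dimensional vector. The cluster count and the scikit-learn API are illustrative choices here, not the paper's exact configuration.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: 3 utterances, each a list of token embeddings (dim 64).
utterances = [rng.standard_normal((n, 64)) for n in (7, 5, 9)]

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)  # K is a toy choice
kmeans.fit(np.vstack(utterances))        # fit on all token embeddings pooled together

# Each utterance becomes a sequence of cluster indices, as described above.
index_sequences = [kmeans.predict(u) for u in utterances]
print(index_sequences[0])                # e.g. [3 7 0 ...]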

WGAN
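Below is one plausible shape for the mapping GAN, sketched as a WGAN with gradient penalty; the architectures, sizes, and the gradient-penalty variant are assumptions, and the toy random batches stand in for real clustered speech and text-derived phoneme sequences. The generator embeds cluster indices and emits per-step phoneme distributions (soft outputs keep it differentiable); the discriminator scores phoneme sequences.

import torch
import torch.nn as nn

N_CLUSTERS, N_PHONEMES, SEQ_LEN = 300, 48, 20

class Generator(nn.Module):
    # Maps a cluster-index sequence to per-step phoneme distributions.
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(N_CLUSTERS, 64)
        self.proj = nn.Linear(64, N_PHONEMES)

    def forward(self, idx):             # idx: (B, T) long
        return torch.softmax(self.proj(self.emb(idx)), dim=-1)  # (B, T, P)

class Discriminator(nn.Module):
    # Scores a (soft or one-hot) phoneme sequence with one real value.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(SEQ_LEN * N_PHONEMES, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1))

    def forward(self, seq):             # seq: (B, T, P)
        return self.net(seq.flatten(1))

def gradient_penalty(D, real, fake):
    # WGAN-GP: push the critic's gradient norm toward 1 on interpolated samples.
    alpha = torch.rand(real.size(0), 1, 1)
    inter = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(inter).sum(), inter, create_graph=True)[0]
    return ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.9))

for step in range(1000):
    # Toy batches: random cluster sequences and random one-hot phoneme sequences.
    clusters = torch.randint(0, N_CLUSTERS, (32, SEQ_LEN))
    real = torch.eye(N_PHONEMES)[torch.randint(0, N_PHONEMES, (32, SEQ_LEN))]

    fake = G(clusters).detach()
    loss_d = (D(fake).mean() - D(real).mean()
              + 10.0 * gradient_penalty(D, real, fake))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    if step % 5 == 0:                   # update the generator less often
        loss_g = -D(G(clusters)).mean()
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()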

Experimental setup
Audio data: TIMIT, 39-dim MFCCs
Lexicon: TIMIT
Text sentences: English monolingual data of WMT'16 (27 million sentences)

Experimental results

Conclusions and future work In this work, we proposed a framework that achieves completely unsupervised phoneme recognition without any parallel data. We demonstrated that the proposed unsupervised framework achieves performance roughly comparable to supervised learning with 0.1% of the labeled training data. There is still a long way to go in this direction, but we have shown it is feasible to recognize phonemes without any labeled data.