
Language Transfer of Audio Word2Vec: Learning Audio Segment Representations without Target Language Data
Chia-Hao Shen, Janet Y. Sung, Hung-Yi Lee
Speaker: Hung-Yi Lee

Outline
- Introduction
- Training of Audio Word2Vec
- Language Transfer
- Application to Query-by-example Spoken Term Detection (STD)
- Concluding Remarks

Audio Word to Vector
A model maps each word-level audio segment to a vector. As its name implies, it learns from lots of audio without annotation.

Audio Word to Vector
The audio segments corresponding to words with similar pronunciations are close to each other; for example, segments of "dog" and "dogs" cluster together, as do "ever" and "never".

Language Transfer
The model is trained unsupervised on an audio collection without annotation, and language X is not included in the training audio. Can we train a universal model that can be applied even to unknown languages?

Language Transfer
Why consider a universal model for all languages? If you want to apply the model to language X, why not simply train a model on the audio of language X?
- Many audio files are code-switched across several different languages.
- We may want to apply the model to audio data on the Internet, in hundreds of languages.
- The audio collection for model training may not cover all the languages.
It would therefore be beneficial to have a universal model.
A Star Wars analogy: C-3PO knows six million galactic languages, but not the Ewok language. Because the Yuzzum language is closely associated with the Ewok language, C-3PO first communicated with the Ewoks using Yuzzum, and gradually pieced together enough through observation to become conversational in Ewok (https://scifi.stackexchange.com/questions/124394/how-did-c3po-know-the-ewok-language). The 3PO-series protocol droids are equipped with a TranLang III communications module, which comes with up to six million galactic languages at purchase and possesses phonetic pattern analysers that provide the capability to learn and translate new languages not in its existing database.

Outline
- Introduction
- Training of Audio Word2Vec
- Language Transfer
- Application to Query-by-example Spoken Term Detection
- Concluding Remarks

Audio Word to Vector
There are lots of approaches for segmenting audio into word-level segments. In the following discussion, we assume the segmentation has already been obtained.

Sequence-to-sequence Auto-encoder
We use a sequence-to-sequence auto-encoder here; the training is unsupervised. The RNN encoder reads the acoustic features x1, x2, x3, x4 of an audio segment, and its final state is the vector we want (similar to the model used in speech summarization).

Sequence-to-sequence Auto-encoder
The RNN decoder then reconstructs the input acoustic features x1, x2, x3, x4 as outputs y1, y2, y3, y4. The RNN encoder and decoder are jointly trained.
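To make the architecture concrete, here is a minimal sketch of such a sequence-to-sequence auto-encoder in PyTorch. This is not the authors' code: the 100-dim hidden size, the zero-input decoder, and the MSE reconstruction loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Seq2SeqAutoencoder(nn.Module):
    def __init__(self, feat_dim=39, hidden_dim=100):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def encode(self, x):            # x: (batch, time, feat_dim)
        _, h = self.encoder(x)      # h: (1, batch, hidden_dim)
        return h.squeeze(0)         # the audio word vector z

    def forward(self, x):
        z = self.encode(x)
        # Zero decoder inputs force all information needed for
        # reconstruction to flow through the single vector z.
        y, _ = self.decoder(torch.zeros_like(x), z.unsqueeze(0))
        return self.out(y)          # reconstructed acoustic features

model = Seq2SeqAutoencoder()
x = torch.randn(8, 50, 39)                   # 8 segments, 50 frames, 39-dim MFCC
loss = nn.functional.mse_loss(model(x), x)   # unsupervised reconstruction loss
```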

What does the machine learn?
Text word to vector: V(Rome) − V(Italy) + V(Germany) ≈ V(Berlin); V(king) − V(queen) + V(aunt) ≈ V(uncle)
Audio word to vector (phonetic information): V(GIRLS) − V(GIRL) + V(PEARL) ≈ V(PEARLS); V(CATS) − V(CAT) + V(IT) ≈ V(ITS) [Chung, Wu, Lee, Lee, Interspeech 16]

Outline
- Introduction
- Training of Audio Word2Vec
- Language Transfer
- Application to Query-by-example Spoken Term Detection
- Concluding Remarks

Language Transfer
Training: we train the sequence-to-sequence auto-encoder (RNN encoder and RNN decoder) on a source language with a large amount of data.
Testing: we apply the RNN encoder, trained on the source language, to a target language to obtain the vector representation z for the target language.
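Continuing the sketch above, transfer at test time amounts to running the source-language encoder unchanged on target-language features; the decoder is discarded.

```python
# Hypothetical target-language segment: 42 frames of 39-dim MFCCs.
french_segment = torch.randn(1, 42, 39)
z = model.encode(french_segment)   # vector representation z, no retraining
```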

Experimental Setup
- 1-layer GRU as encoder and decoder
- Training with SGD; the initial learning rate was 1 and decayed by a factor of 0.95 every 500 batches
- Acoustic features: 39-dim MFCC
- We used forced alignment with reference transcriptions to obtain word boundaries, so the results are somewhat oracle; we address this issue in another ICASSP paper.
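As a sketch of this recipe (with `model` from the earlier code and `loader` a hypothetical iterator over padded MFCC batches), the optimizer and learning-rate schedule could be written as:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
# Stepped once per batch, StepLR multiplies the learning rate
# by 0.95 every 500 batches, matching the stated schedule.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.95)

for batch in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(batch), batch)
    loss.backward()
    optimizer.step()
    scheduler.step()
```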

Experimental Setup - Corpus
English is our source language, while the other languages are target languages.
- English: LibriSpeech. Training data: 2.2M word-level audio segments; testing data: 250K audio segments.
- French, German, Czech and Spanish: GlobalPhone. Testing data: 20K audio segments.

Phonetic Information
For each word pair, we compare the edit distance between their phoneme sequences with the cosine similarity between their encoder outputs. For example, "ever" (EH V ER) vs. "never" (N EH V ER) has phoneme edit distance 1; the two segments are separately passed through the RNN encoder, and the cosine similarity of the resulting vectors is computed.
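A small self-contained sketch of the two quantities (the dynamic-programming edit distance reproduces the ever/never example; NumPy is used for the vector arithmetic):

```python
import numpy as np

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1,                           # deletion
                          d[i, j - 1] + 1,                           # insertion
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return int(d[len(a), len(b)])

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(edit_distance(["EH", "V", "ER"], ["N", "EH", "V", "ER"]))  # -> 1
```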

Phonetic Information
Model trained on English, tested on English. [Plot: average cosine similarity, with variance, versus phoneme sequence edit distance, ranging from the same pronunciation to very different pronunciations.] The larger the phoneme sequence edit distance, the smaller the cosine similarity.

Phonetic Information
Model trained on English, tested on other languages. [Plot: cosine similarity versus phoneme sequence edit distance.] Audio Word2Vec still captures phonetic information even though the model has never heard the language.

Visualization
To visualize the embedding vector of a word (e.g., "day"), we pass each audio segment of that word through the RNN encoder, average the resulting vectors, and project the average to 2-D.

Visualization
Learned on English, and applied to the other languages (French on the left, German on the right).

Outline
- Introduction
- Training of Audio Word2Vec
- Language Transfer
- Application to Query-by-example Spoken Term Detection (STD)
- Concluding Remarks

Query-by-example Spoken Term Detection
A user issues a spoken query, e.g., “ICASSP”; the system computes the similarity between the spoken query and the audio files in the spoken content at the acoustic level, and finds the occurrences of the query term. Also known as unsupervised spoken term detection, zero-resource spoken content retrieval, etc.

Query-by-example Spoken Term Detection
DTW for query-by-example: Segmental DTW [Zhang, ICASSP 10]; Subsequence DTW [Anguera, ICME 13][Calvo, MediaEval 14]; adding slope constraints [Chan & Lee, Interspeech 10]. [Figure: DTW alignment paths between a spoken query and an utterance; the blue path is better than the green one.]
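For reference, a minimal sketch of subsequence DTW (without the slope constraints mentioned above): because the query may start and end anywhere in the utterance, the first row of the accumulated-cost matrix is not cumulated along the utterance axis, and the best end point is taken over the last row.

```python
import numpy as np

def subsequence_dtw(query, utterance):
    """query: (n, d) frames; utterance: (m, d) frames.
    Returns the minimal accumulated frame distance of the query
    aligned against any subsequence of the utterance."""
    n, m = len(query), len(utterance)
    dist = np.linalg.norm(query[:, None, :] - utterance[None, :, :], axis=-1)
    acc = np.full((n, m), np.inf)
    acc[0, :] = dist[0, :]                      # the match may start anywhere
    for i in range(1, n):
        for j in range(m):
            best = acc[i - 1, j]                # vertical step
            if j > 0:
                best = min(best, acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = dist[i, j] + best
    return acc[-1, :].min()                     # the match may end anywhere

score = subsequence_dtw(np.random.randn(20, 39), np.random.randn(200, 39))
```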

Query-by-example Spoken Term Detection
Using Audio Word to Vector is much faster than DTW. Off-line: the audio archive is divided into variable-length audio segments, and each segment is mapped to a vector by Audio Word to Vector. On-line: the spoken query is mapped to a vector in the same way, and the similarity between the query vector and the segment vectors yields the search result.
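A sketch of this two-stage pipeline (`encode` here is a stand-in for the trained RNN encoder, and the random data are placeholders): the off-line pass encodes each archive segment once, so each on-line query costs only one encoding plus a single matrix-vector product.

```python
import numpy as np

def encode(segment):
    # Placeholder for the trained RNN encoder from the earlier sketch.
    return segment.mean(axis=0)

archive_segments = [np.random.randn(np.random.randint(20, 60), 39)
                    for _ in range(1000)]
query_segment = np.random.randn(30, 39)

# Off-line: encode and L2-normalize every archive segment once.
archive_vectors = np.stack([encode(s) for s in archive_segments])
archive_vectors /= np.linalg.norm(archive_vectors, axis=1, keepdims=True)

# On-line: encode the query and rank all segments by cosine similarity.
q = encode(query_segment)
scores = archive_vectors @ (q / np.linalg.norm(q))
ranked = np.argsort(-scores)      # indices of best-matching segments first
```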

Query-by-Example STD
Baseline: Naïve Encoder [Tu & Lee, ASRU 11] [I.-F. Chen, Interspeech 13]
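As an illustration of what such a fixed-length baseline can look like (an assumption for illustration; the exact constructions in the cited baselines may differ): split the segment into a fixed number of parts, average the MFCC frames within each part, and concatenate the part averages.

```python
import numpy as np

def naive_encode(mfcc, n_parts=4):
    """mfcc: (n_frames, 39) array -> (n_parts * 39,) fixed-length vector."""
    parts = np.array_split(mfcc, n_parts, axis=0)
    return np.concatenate([p.mean(axis=0) for p in parts])

print(naive_encode(np.random.randn(50, 39)).shape)  # (156,)
```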

Query-by-Example STD ─ English
1K queries to retrieve 250K audio segments; Audio Word2Vec is compared against the Naïve Encoder. DTW is not tractable here due to its computational cost. The evaluation measure is Mean Average Precision (MAP); the larger, the better.
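For completeness, a sketch of how MAP is computed (the standard definition, not anything specific to this paper):

```python
import numpy as np

def average_precision(ranked_relevance):
    """ranked_relevance: 0/1 relevance flags in ranked order for one query."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(per_query_relevance):
    return float(np.mean([average_precision(r) for r in per_query_relevance]))

print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 ≈ 0.833
```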

Query-by-Example STD ─ Language Transfer
Systems compared: Naïve Encoder; Audio Word2Vec trained on the target language (4K segments). 1K queries to retrieve 20K audio segments, on FRE, GER, CZE and ESP. The performance of Audio Word2Vec is poor with such limited target-language training data.

Query-by-Example STD ─ Language Transfer
Systems compared: Naïve Encoder; Audio Word2Vec trained on the target language (4K segments); Audio Word2Vec trained on English (2.2M segments). On FRE, GER, CZE and ESP, Audio Word2Vec learned from English can be directly applied to French and German.

Query-by-Example STD ─ Language Transfer
Systems compared: Naïve Encoder; Audio Word2Vec trained on the target language (4K segments); Audio Word2Vec trained on English (2.2M segments); Audio Word2Vec trained on English and fine-tuned with target-language data (2K segments). Fine-tuning the English model on the target language is helpful.
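Continuing the earlier training sketch, fine-tuning keeps the same reconstruction objective and simply resumes training on a small target-language set (the smaller learning rate and the `target_language_loader` are assumptions for illustration):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # assumed smaller lr
for batch in target_language_loader:   # hypothetical loader over 2K segments
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(batch), batch)
    loss.backward()
    optimizer.step()
```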

Outline
- Introduction
- Training of Audio Word2Vec
- Language Transfer
- Application to Query-by-example Spoken Term Detection
- Concluding Remarks

Concluding Remarks
We verified the language-transfer capability of Audio Word2Vec:
- Audio Word2Vec learned from English captures the phonetic information of other languages.
- In query-by-example STD, Audio Word2Vec learned from English outperformed the baselines on French and German.

SEGMENTAL AUDIO WORD2VEC
Session: Spoken Language Acquisition and Retrieval
Time: Wednesday, April 18, 16:00 - 18:00