Learning of Word Boundaries in Continuous Speech using Time Delay Neural Networks Colin Tan School of Computing, National University of Singapore.


Motivations Humans are able to segment words and sounds in continuous speech with little difficulty. The ability to automatically segment words and phonemes is also useful in training speech recognition engines.

Principle Time-Delay Neural Network – Input nodes have shift registers that allow the TDNN to generalize not only between discrete input-output pairs, but also over time. –Ability to learn true word boundaries given reasonably good initial estimates. –We make use of this property in our work.

Why TDNN? Representational simplicity –It is intuitively easy to understand what the TDNN's inputs and outputs represent. Ability to generalize over time. Hidden Markov Models have been left out of this work for now.

Time Delay Neural Networks The diagram shows a 2-input TDNN node. Constrained (tied) weights allow the node to generalize over time.
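As a rough illustration only (not the original implementation), a single time-delay node with tied weights over a fixed delay window can be sketched as below; the function name and array shapes are illustrative assumptions.

```python
import numpy as np

def tdnn_node(frames, weights, bias):
    """One TDNN node: the same weight matrix is applied at every
    time shift (tied weights), so the node generalizes over time.

    frames:  (T, n_features)        cepstral frames
    weights: (n_delays, n_features) shared weights over the delay window
    bias:    scalar
    returns: (T - n_delays + 1,)    one activation per window position
    """
    n_delays = weights.shape[0]
    T = frames.shape[0]
    out = np.empty(T - n_delays + 1)
    for t in range(T - n_delays + 1):
        window = frames[t:t + n_delays]   # shift-register contents at time t
        out[t] = np.tanh(np.sum(window * weights) + bias)
    return out

# Example: 5 input delays over 27-dimensional frames, as in the experiments.
activations = tdnn_node(np.random.randn(100, 27),
                        np.random.randn(5, 27) * 0.1,
                        bias=0.0)
```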

Boundary Shift Algorithm Initially: –The TDNN is trained on a small, manually segmented set of data. –Given the expected number of words in a new, unseen utterance, the cepstral frames in the utterance are distributed evenly over the words. For example, if there are 2,000 frames and 10 expected words, each word is allocated 200 frames. Convex-Hull and Spectral Variation Function methods may be used to estimate the number of words in an utterance. For our experiments we manually counted the number of words in each utterance.
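A minimal sketch of the initial even allocation of frames to words described above; the function name is illustrative, not from the original work.

```python
def initial_boundaries(n_frames, n_words):
    """Distribute frames evenly over the expected words and return the
    estimated word-end boundary frame indices."""
    frames_per_word = n_frames // n_words
    return [frames_per_word * (w + 1) for w in range(n_words)]

# 2,000 frames and 10 expected words -> a boundary every 200 frames.
print(initial_boundaries(2000, 10))
# [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
```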

Boundary Shift Algorithm 1.The minimally trained TDNN is retrained using both its original data and the new, unseen data. 2.After retraining, a variable-sized window is placed around each boundary. –The window is initially +/- 16 frames. 3.A search is made within the window for the highest-scoring frame, and the boundary is shifted to that frame. –This search is allowed to extend past the current boundaries into neighbouring words. 4.The TDNN is retrained using the new boundaries.

Boundary Shift Algorithm 5.The windows are adjusted by +/- 2 frames (i.e. reduced by a total of 4 frames), and steps 3 to 5 are repeated. 6.The algorithm ends when the boundary shifts become negligible, or the windows shrink to 0 frames.
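A hedged sketch of the window search in steps 2 to 6, with the retraining steps omitted; `boundary_score` is a hypothetical stand-in for the TDNN's boundary output at a given frame, and the parameter names are illustrative.

```python
def shift_boundaries(boundaries, boundary_score, n_frames,
                     init_half_window=16, shrink=2):
    """Boundary-shift search (TDNN retraining steps omitted).

    boundaries:     current boundary frame indices
    boundary_score: callable frame_index -> TDNN boundary score (hypothetical)
    """
    half_window = init_half_window
    while half_window > 0:
        total_shift = 0
        new_boundaries = []
        for b in boundaries:
            lo = max(0, b - half_window)
            hi = min(n_frames - 1, b + half_window)
            # Highest-scoring frame inside the window; the search may
            # cross into neighbouring words.
            best = max(range(lo, hi + 1), key=boundary_score)
            total_shift += abs(best - b)
            new_boundaries.append(best)
        boundaries = sorted(new_boundaries)
        if total_shift == 0:        # boundary shifts are negligible
            break
        half_window -= shrink       # window shrinks by +/- 2 frames
        # ... retrain the TDNN on the shifted boundaries here ...
    return boundaries
```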

Network Pruning Limited training data leads to the problem of over-fitting. Three parameters are used to decide which TDNN nodes to prune: –The Significance, which measures how much a particular node contributes to the final output. A node with a small Significance value contributes little to the final answer and can be pruned.

Network Pruning The remaining two pruning parameters: –The variance, which measures how much a particular node changes over all the inputs. A node that changes very little over all the inputs is not contributing to the learning, and can be removed. –The pairwise node distance, which measures how much one node changes with respect to another. A node that follows another node closely in value is redundant and can be removed.

Network Pruning Thresholds are set for each parameter; nodes whose parameters fall below these thresholds are pruned. The selection of thresholds is critical. Pruning is performed after the TDNN has been trained on the initial set for about 200 cycles.
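A rough sketch of how the variance and pairwise-distance criteria might be computed from hidden-node activations; the Significance criterion is not shown, the threshold values are illustrative assumptions, and this is not the original pruning code.

```python
import numpy as np

def prune_candidates(activations, var_threshold=1e-3, dist_threshold=1e-2):
    """Flag hidden nodes to prune, given their activations over a data set.

    activations: (n_examples, n_hidden) hidden-node outputs
    Returns the set of node indices whose variance falls below var_threshold,
    or that track another node more closely than dist_threshold.
    """
    n_hidden = activations.shape[1]
    to_prune = set()

    # Variance criterion: a node that barely changes over the inputs
    # contributes little to the learning and can be removed.
    variances = activations.var(axis=0)
    to_prune.update(int(j) for j in np.where(variances < var_threshold)[0])

    # Pairwise-distance criterion: a node that follows another node
    # closely in value is redundant.
    for j in range(n_hidden):
        for i in range(j + 1, n_hidden):
            dist = np.mean((activations[:, j] - activations[:, i]) ** 2)
            if dist < dist_threshold:
                to_prune.add(i)   # keep one of the pair, drop the other
    return to_prune
```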

Experiments TDNN Architecture –27 inputs: 13 dcep (delta-cepstral) coefficients, 13 ddcep (delta-delta-cepstral) coefficients, and power. –5 input delays. –96 hidden nodes, arbitrarily chosen and pruned later. –2 binary output nodes, representing word start and end boundaries.
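For illustration only, the slide's layer sizes can be restated as a 1-D convolutional network in modern Keras (assuming TensorFlow is available); this is an assumption about an equivalent formulation, not the original implementation.

```python
import tensorflow as tf

# Restating the slide's architecture: 27 inputs per frame
# (13 dcep + 13 ddcep + power), a 5-frame delay window, 96 hidden nodes,
# and 2 binary outputs marking word-start and word-end frames.
tdnn = tf.keras.Sequential([
    tf.keras.layers.Conv1D(96, kernel_size=5, activation='tanh',
                           input_shape=(None, 27)),   # time-delay hidden layer
    tf.keras.layers.Conv1D(2, kernel_size=1, activation='sigmoid'),
])
tdnn.compile(optimizer='sgd', loss='binary_crossentropy')
tdnn.summary()
```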

Experiments Data gathered from 6 speakers –3 male, 3 female. –Solving a task similar to the CISD Trains Scheduling Problem (Ferguson 96). –About minutes of speech used to train the TDNN. –20 previously unseen utterances were chosen to evaluate performance.

Experiment Results Performance Before Pruning Results are shown relative to hand-labeled samples.
Inside Test: Precision 66.88%, Recall 67.33%, F-Number 67.07%
Outside Test: Precision 56.22%, Recall 76.69%, F-Number 64.88%
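The "F-Number" appears to be the usual harmonic mean of precision and recall (F1); for example, the outside-test figures above give 2 x 0.5622 x 0.7669 / (0.5622 + 0.7669) ~ 0.6488, matching the reported 64.88%. A one-line helper, for reference:

```python
def f_number(precision, recall):
    """Harmonic mean of precision and recall (F1),
    e.g. f_number(0.5622, 0.7669) ~= 0.6488."""
    return 2 * precision * recall / (precision + recall)
```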

Experiment Results Performance After Pruning
Inside Test: Precision 66.03%, Recall 61.41%, F-Number 63.61%
Outside Test: Precision 57.10%, Recall 72.16%, F-Number 61.71%

Example Utterances Subject: CK Utterance: Ok thanks, now I need to find out how long does it need to travel from Elmira to Corning (okay) (th-) (-anks) (now) (i need) (to) (find) (out how) (long) (does it need) (to) (travel) (f-) (-om) (emira) (to c-) (orning)

Example Utterances Subject: CT Utterance: May I know how long it takes to travel from Elmira to Corning? (may i) (know how) (long) (does it) (take) (to tr-) (-avel) (from) (el-) (-mira) (to) (corn-) (-ning)

Deletion Errors Most prominent in places framed by plosives. The algorithm was able to detect boundaries at the ends of the phrase but not in the middle, due to the presence of 'd' plosives at the ends.

Insertion Errors Most prominent in places where a vowel is stretched.

Recommendations for Further Work The results presented here are early research results, but they are promising. Future work will combine the TDNN with other statistical methods such as Expectation Maximization.