1
Combining Speech Attributes for Speech Recognition
Jeremy Morris
November 9, 2006
2
Overview
► Problem Statement (Motivation)
► Conditional Random Fields
► Experiments & Results
► Future Work
3
Problem Statement
► Developed as part of the ASAT (Automatic Speech Attribute Transcription) Project, an effort to build tools to extract and parse speech attributes from a speech signal
► Goal: Develop a system for bottom-up speech recognition using 'speech attributes'
4
Speech Attributes?
► Any information that could be useful for recognizing the spoken language
► Phonetic attributes
  ► Consonants have manner, place of articulation, voicing
  ► Vowels have height, frontness, roundness, tenseness
► Speaker attributes (gender, age, etc.)
► Any other useful attributes that could be used for speech recognition
Examples:
  /d/ (manner: stop, place of articulation: dental, voicing: voiced)
  /t/ (manner: stop, place of articulation: dental, voicing: unvoiced)
  /iy/ (height: high, frontness: front, roundness: nonround, tenseness: tense)
  /ae/ (height: low, frontness: front, roundness: nonround, tenseness: tense)
6
Feature Combination
► Our piece of this project is to find ways to combine speech attributes and use them to recognize language
  ► Other groups are working on finding features to extract and methods of extracting them
► Note that there is no guarantee that attributes will be independent of each other
  ► In fact, many attributes will be strongly correlated with or dependent on other attributes, e.g. voicing for vowels
7
Evidence Combination
► Two basic ways to build hypotheses:
  ► Top Down: generate a hypothesis, then see if the data fits the hypothesis
  ► Bottom Up: examine the data, then search for a hypothesis that fits it
8
Top Down
► Traditional Automatic Speech Recognition (ASR) systems use a top-down approach
  ► The hypothesis is the phone we are predicting; the data X is some encoding of the acoustic speech signal
  ► A likelihood of the signal given the phone label, e.g. P(X|/iy/), is learned from data
  ► A prior probability for the phone label, e.g. P(/iy/), is learned from the data
  ► These are combined through Bayes' rule to give us the posterior probability P(label | data), as written out below
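Written out in the notation of this slide, the Bayes' rule combination is (a standard restatement of what the slide describes, not an addition to the model):

  P(/iy/ | X) = P(X|/iy/) * P(/iy/) / P(X)

Since P(X) is the same for every candidate phone label, the recognizer only needs the learned likelihood and prior to rank the labels.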
9
Bottom Up
► Bottom-up models have the same high-level goal – determine the label from the observation
  ► But instead of a likelihood, the posterior probability P(label | data) is learned directly from the data, e.g. P(/iy/|X)
► Neural networks can be used to learn probabilities in this manner (a small sketch follows)
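As a minimal sketch of that idea (assuming nothing about the actual network configuration used in this work), a feed-forward network with a softmax output layer produces one value per label that can be read as an estimate of P(label | X):

import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def posterior(x, W1, b1, W2, b2):
    """Forward pass: acoustic input vector x -> estimated P(label | x) for each label."""
    h = np.tanh(W1 @ x + b1)     # hidden layer
    return softmax(W2 @ h + b2)  # one softmax output per label

# Made-up sizes for illustration: 117-dim input, 100 hidden units, 61 phone labels
rng = np.random.default_rng(0)
W1, b1 = 0.01 * rng.normal(size=(100, 117)), np.zeros(100)
W2, b2 = 0.01 * rng.normal(size=(61, 100)), np.zeros(61)
p = posterior(rng.normal(size=117), W1, b1, W2, b2)
print(p.sum())                   # ~1.0: a proper posterior distribution over the labels

Training such a network with a cross-entropy loss against the true labels is what makes the softmax outputs approximate the posteriors.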
10
Speech is a Sequence
► Speech is not a single, independent event
  ► It is a combination of multiple events over time, e.g. the phone sequence /k/ /iy/
► A model to recognize spoken language should take into account dependencies across time
11
Speech is a Sequence
► A top-down model can be extended into a time sequence as a Hidden Markov Model (HMM)
  ► Now our likelihood of the data is over the entire observation sequence X instead of a single phone, with the phone sequence (e.g. /k/ /iy/) as the hidden states
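In the same notation, the standard HMM factorization over a phone/state sequence Q = q_1 ... q_T and observations X = x_1 ... x_T is:

  P(X, Q) = Π_t P(x_t | q_t) * P(q_t | q_{t-1})

i.e. a per-frame likelihood times a transition prior, the sequence-level analogue of the likelihood-times-prior combination on the previous slides.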
12
Conditional Random Fields
► A form of discriminative modelling
  ► Has been used successfully in various domains such as part-of-speech tagging and other Natural Language Processing tasks
► Processes evidence bottom-up
  ► Combines multiple features of the data
  ► Builds the probability P(sequence | data)
13
Conditional Random Fields
► Conceptual Overview
  ► Each attribute of the data we are trying to model fits into a feature function that associates the attribute with a possible label
    ► A positive value if the attribute appears in the data
    ► A zero value if the attribute is not in the data
  ► Each feature function carries a weight that gives the strength of that feature function for the proposed label
    ► High positive weights indicate a good association between the feature and the proposed label
    ► High negative weights indicate a negative association between the feature and the proposed label
    ► Weights close to zero indicate the feature has little or no impact on the identity of the label
14
Conditional Random Fields
► CRFs have transition feature functions and state feature functions
  ► Transition functions add associations between transitions from one label to another (e.g. /k/ to /iy/)
  ► State functions help determine the identity of the state
15
Conditional Random Fields
► State Feature Function: association of an attribute with a phone label, e.g. f(P(stop), /k/)
► State Feature Weight: indicates the strength of the association of this attribute with this label
► Transition Feature Function: association of an attribute with a phone-to-phone transition, e.g. g(attr, /iy/, /k/)
► Transition Feature Weight: indicates the strength of the association of this attribute with this transition
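Putting the pieces on this slide together, the model has the standard linear-chain CRF form (written here in the slides' notation, with s for state feature functions, g for transition feature functions, and λ, μ for their weights):

  P(Y | X) = (1 / Z(X)) * exp( Σ_t [ Σ_i λ_i * s_i(y_t, X, t) + Σ_j μ_j * g_j(y_{t-1}, y_t, X, t) ] )

where Z(X) sums the same exponential over all possible label sequences so that the probabilities normalize to one.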
16
Experiments
► Goal: Implement a Conditional Random Field model on speech attribute data
  ► Perform phone recognition
  ► Compare results to those obtained via a Tandem system
► Experimental Data
  ► TIMIT read speech corpus
  ► A moderate-sized corpus of clean, prompted speech, complete with phonetic-level transcriptions
17
Attribute Selection
► Attribute Detectors
  ► Built using the ICSI QuickNet neural network software
► Two different types of attributes
  ► Phonological feature detectors
    ► Place, Manner, Voicing, Vowel Height, Backness, etc.
    ► Features are grouped into eight classes, with each class having a variable number of possible values based on the IPA phonetic chart
  ► Phone detectors
    ► Neural network outputs based on the phone labels – one output per label
► Classifiers were trained on 2960 utterances from the TIMIT training set
  ► Uses extracted 12th-order PLP coefficients (i.e. frequency coefficients) in a 9-frame window as inputs to the neural networks (the windowing idea is sketched below)
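As an illustration of the 9-frame input windowing only (the actual detectors were trained with QuickNet, and the PLP dimensionality below is just a placeholder), each frame's input vector can be built by stacking the frame with 4 frames of context on each side:

import numpy as np

def stack_frames(plp, context=4):
    """plp: (num_frames, num_coeffs) PLP array -> (num_frames, 9 * num_coeffs) NN inputs."""
    T, _ = plp.shape
    padded = np.vstack([np.repeat(plp[:1], context, axis=0),   # pad the edges by
                        plp,                                    # repeating the first
                        np.repeat(plp[-1:], context, axis=0)])  # and last frames
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

utt = np.random.randn(200, 13)   # 200 frames of example 13-dim PLP features
X = stack_frames(utt)
print(X.shape)                   # (200, 117): one 9-frame stacked vector per frame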
19
Experimental Setup
► Code built on the Java CRF toolkit on SourceForge (http://crf.sourceforge.net)
  ► Performs training to maximize the log-likelihood of the training set with respect to the model
  ► Does this via gradient descent: find the point where the gradient of the log-likelihood function goes to zero (see below)
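In symbols, training maximizes the conditional log-likelihood of the training data, L(λ) = Σ_n log P(Y_n | X_n; λ). For a CRF this objective is concave, and its gradient with respect to each weight λ_i has the standard form

  ∂L/∂λ_i = (observed count of feature i in the training data) − (expected count of feature i under the current model)

so training stops where the two counts match, i.e. where the gradient goes to zero.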
20
Experimental Setup
► Outputs from the neural nets are themselves treated as feature functions for the observed sequence
  ► Each attribute/label combination gives us a value for one feature function
  ► We also use a bias feature for each label
► Currently, all combinations of features and labels are used as feature functions
  ► e.g. f(P(stop), /t/), f(P(stop), /ae/), etc.
  ► Phone class features are used in the same manner, e.g. f(P(/t/), /t/), f(P(/t/), /ae/), etc.
► Transition features use only a 0/1 bias feature
  ► 1 if the transition occurs at that timeframe in the training set
  ► 0 if the transition does not occur at that timeframe in the training set
► For comparison purposes, we compare to a baseline HMM-trained system that uses decorrelated features as inputs
A rough sketch of this feature layout follows.
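This is an illustrative layout only; the names and data structures here are invented for the sketch and are not taken from the Java CRF toolkit:

def state_features(nn_outputs, labels):
    """One feature value per (neural-net output, label) pair, plus a bias per label.

    nn_outputs: dict mapping an attribute/phone posterior name to its value for this frame
    labels: the phone label set, e.g. ['/t/', '/ae/', ...]
    """
    feats = {}
    for label in labels:
        feats[('bias', label)] = 1.0              # bias feature for each label
        for attr, value in nn_outputs.items():
            feats[(attr, label)] = value          # e.g. ('P(stop)', '/t/') -> 0.87
    return feats

def transition_features(prev_label, label):
    """0/1 bias feature on each label-to-label transition."""
    return {('trans', prev_label, label): 1.0}

frame = {'P(stop)': 0.87, 'P(voiced)': 0.12, 'P(/t/)': 0.63}          # made-up detector outputs
print(state_features(frame, ['/t/', '/ae/'])[('P(stop)', '/t/')])     # -> 0.87

Each of these (feature, label) entries then gets its own trained weight in the CRF.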
21
Initial Results

  Model                        Label Space   Phone Recognition Accuracy
  HMM (phones)                 triphones     67.32%
  CRF (phones)                 monophones    67.27%
  HMM (features)               triphones     66.69%
  CRF (features)               monophones    65.25%
  HMM (phones/feas) (top 39)   triphones     67.96%
  CRF (phones/feas)            monophones    68.00%
22
Experimental Setup
► Initial CRF experiments show results comparable to triphone HMM results with only monophone labelling
  ► No decorrelation of features needed
  ► No assumptions about feature independence
► The comparison to the HMM is handicapped in one way: HMM training allowed phone boundaries to shift during training, while CRF training used fixed phone boundaries for all training
► Another experiment: train the CRF, realign the training labels, then retrain on the realigned labels
23
Realignment Results

  Model                    Label Space   Phone Recognition Accuracy
  HMM (phones)             triphones     67.32%
  CRF (phones) base        monophones    67.27%
  CRF (phones) realign     monophones    69.63%
  HMM (features)           triphones     66.69%
  CRF (features) base      monophones    65.25%
  CRF (features) realign   monophones    67.52%
24
Experimental Setup
► CRFs can also make use of features on the transitions
  ► For the initial experiments, transition feature functions only used bias features (i.e. 1 or 0 based on the label in the training corpus)
► What if the phone classifications were used as the state features, and the feature classes were used as transition features? (One way to write this is noted below.)
  ► Linguistic observation: features spread from phone to phone
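One way to read this variant, in the notation of the earlier slides (an interpretation for illustration, not a detail taken from the experiments themselves): a transition feature function would take a form like g(P(attr), /iy/, /k/), i.e. the attribute detector's output at that frame paired with a specific phone-to-phone transition, each such pairing carrying its own trained weight, while the state feature functions keep the form f(P(/t/), /t/).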
25
Realignment Results

  Model                      Label Space   Phone Recognition Accuracy
  CRF (phones) base          monophones    67.27%
  CRF (phones) realign       monophones    69.63%
  CRF (features) base        monophones    65.25%
  CRF (features) realign     monophones    67.52%
  CRF (p+f) base             monophones    68.00%
  CRF (p + trans f) base     monophones    69.49%
  CRF (p + trans f) align    monophones    70.86%
26
Discussion & Future Work
► This seems to be a good model for the type of feature combination we want to perform
  ► Makes use of arbitrary, possibly correlated features
  ► Results on the phone recognition task are comparable or superior to the alternative sequence model (HMM)
► Future Work
  ► New features: what kinds of features can we add to improve our transitions? We hope to get more from the other research groups
  ► New training methods: faster algorithms than the gradient descent method exist and need to be tested
  ► Word recognition: we are thinking about how to model word recognition in this framework
  ► Larger corpora: TIMIT is a comparatively small corpus, so we are looking to move to something bigger