1
Combining Speech Attributes for Speech Recognition
Jeremy Morris
November 9, 2006
2
Overview
► Problem Statement (Motivation)
► Conditional Random Fields
► Experiments & Results
► Future Work
3
Problem Statement
► Developed as part of the ASAT (Automatic Speech Attribute Transcription) Project, an effort to build tools to extract and parse speech attributes from a speech signal
► Goal: Develop a system for bottom-up speech recognition using 'speech attributes'
4
Speech Attributes?
► Any information that could be useful for recognizing the spoken language
► Phonetic attributes
  ► Consonants have manner, place of articulation, voicing
  ► Vowels have height, frontness, roundness, tenseness
► Speaker attributes (gender, age, etc.)
► Any other useful attributes that could be used for speech recognition
Examples:
  /d/ (manner: stop, place of articulation: dental, voicing: voiced)
  /t/ (manner: stop, place of articulation: dental, voicing: unvoiced)
  /iy/ (height: high, frontness: front, roundness: nonround, tenseness: tense)
  /ae/ (height: low, frontness: front, roundness: nonround, tenseness: tense)
6
Feature Combination
► Our piece of this project is to find ways to combine speech attributes and use them to recognize language
  ► Other groups are working on finding features to extract and methods of extracting them
► Note that there is no guarantee that attributes will be independent of each other
  ► In fact, many attributes will be strongly correlated with or dependent on other attributes, e.g. voicing for vowels
7
Evidence Combination
► Two basic ways to build hypotheses:
  ► Top Down: generate a hypothesis, then see if the data fits the hypothesis
  ► Bottom Up: examine the data, then search for a hypothesis that fits it
8
Top Down
► Traditional Automatic Speech Recognition (ASR) systems use a top-down approach
  ► The hypothesis is the phone we are predicting; the data X is some encoding of the acoustic speech signal
  ► A likelihood of the signal given the phone label, e.g. P(X|/iy/), is learned from data
  ► A prior probability for the phone label, e.g. P(/iy/), is learned from the data
  ► These are combined through Bayes' rule to give us the posterior probability P(label | data), as written out below
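Written out in the notation of this slide, the Bayes' rule combination is (a standard restatement of what the slide describes, not an addition to the model):

  P(/iy/ | X) = P(X|/iy/) * P(/iy/) / P(X)

Since P(X) is the same for every candidate phone label, the recognizer only needs the learned likelihood and prior to rank the labels.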
9
Bottom Up
► Bottom-up models have the same high-level goal – determine the label from the observation
  ► But instead of a likelihood, the posterior probability P(label | data) is learned directly from the data, e.g. P(/iy/|X)
► Neural networks can be used to learn probabilities in this manner (a small sketch follows)
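As a minimal sketch of that idea (assuming nothing about the actual network configuration used in this work), a feed-forward network with a softmax output layer produces one value per label that can be read as an estimate of P(label | X):

import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def posterior(x, W1, b1, W2, b2):
    """Forward pass: acoustic input vector x -> estimated P(label | x) for each label."""
    h = np.tanh(W1 @ x + b1)     # hidden layer
    return softmax(W2 @ h + b2)  # one softmax output per label

# Made-up sizes for illustration: 117-dim input, 100 hidden units, 61 phone labels
rng = np.random.default_rng(0)
W1, b1 = 0.01 * rng.normal(size=(100, 117)), np.zeros(100)
W2, b2 = 0.01 * rng.normal(size=(61, 100)), np.zeros(61)
p = posterior(rng.normal(size=117), W1, b1, W2, b2)
print(p.sum())                   # ~1.0: a proper posterior distribution over the labels

Training such a network with a cross-entropy loss against the true labels is what makes the softmax outputs approximate the posteriors.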
10
Speech is a Sequence
► Speech is not a single, independent event
  ► It is a combination of multiple events over time, e.g. the phone sequence /k/ /iy/
► A model to recognize spoken language should take into account dependencies across time
11
Speech is a Sequence
► A top-down model can be extended into a time sequence as a Hidden Markov Model (HMM)
  ► Now our likelihood of the data is over the entire observation sequence X instead of a single phone, with the phone sequence (e.g. /k/ /iy/) as the hidden states
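In the same notation, the standard HMM factorization over a phone/state sequence Q = q_1 ... q_T and observations X = x_1 ... x_T is:

  P(X, Q) = Π_t P(x_t | q_t) * P(q_t | q_{t-1})

i.e. a per-frame likelihood times a transition prior, the sequence-level analogue of the likelihood-times-prior combination on the previous slides.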
12
Conditional Random Fields
► A form of discriminative modelling
  ► Has been used successfully in various domains such as part-of-speech tagging and other Natural Language Processing tasks
► Processes evidence bottom-up
  ► Combines multiple features of the data
  ► Builds the probability P(sequence | data)
13
Conditional Random Fields
► Conceptual Overview
  ► Each attribute of the data we are trying to model fits into a feature function that associates the attribute with a possible label
    ► A positive value if the attribute appears in the data
    ► A zero value if the attribute is not in the data
  ► Each feature function carries a weight that gives the strength of that feature function for the proposed label
    ► High positive weights indicate a good association between the feature and the proposed label
    ► High negative weights indicate a negative association between the feature and the proposed label
    ► Weights close to zero indicate the feature has little or no impact on the identity of the label
14
Conditional Random Fields
► CRFs have transition feature functions and state feature functions
  ► Transition functions add associations between transitions from one label to another (e.g. /k/ to /iy/)
  ► State functions help determine the identity of the state
15
Conditional Random Fields
► State Feature Function: association of an attribute with a phone label, e.g. f(P(stop), /k/)
► State Feature Weight: indicates the strength of the association of this attribute with this label
► Transition Feature Function: association of an attribute with a phone-to-phone transition, e.g. g(attr, /iy/, /k/)
► Transition Feature Weight: indicates the strength of the association of this attribute with this transition
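Putting the pieces on this slide together, the model has the standard linear-chain CRF form (written here in the slides' notation, with s for state feature functions, g for transition feature functions, and λ, μ for their weights):

  P(Y | X) = (1 / Z(X)) * exp( Σ_t [ Σ_i λ_i * s_i(y_t, X, t) + Σ_j μ_j * g_j(y_{t-1}, y_t, X, t) ] )

where Z(X) sums the same exponential over all possible label sequences so that the probabilities normalize to one.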
16
Experiments
► Goal: Implement a Conditional Random Field model on speech attribute data
  ► Perform phone recognition
  ► Compare results to those obtained via a Tandem system
► Experimental Data
  ► TIMIT read speech corpus
  ► A moderate-sized corpus of clean, prompted speech, complete with phonetic-level transcriptions
17
Attribute Selection
► Attribute Detectors
  ► Built using the ICSI QuickNet neural network software
► Two different types of attributes
  ► Phonological feature detectors
    ► Place, Manner, Voicing, Vowel Height, Backness, etc.
    ► Features are grouped into eight classes, with each class having a variable number of possible values based on the IPA phonetic chart
  ► Phone detectors
    ► Neural network outputs based on the phone labels – one output per label
► Classifiers were trained on 2960 utterances from the TIMIT training set
  ► Uses extracted 12th-order PLP coefficients (i.e. frequency coefficients) in a 9-frame window as inputs to the neural networks (the windowing idea is sketched below)
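As an illustration of the 9-frame input windowing only (the actual detectors were trained with QuickNet, and the PLP dimensionality below is just a placeholder), each frame's input vector can be built by stacking the frame with 4 frames of context on each side:

import numpy as np

def stack_frames(plp, context=4):
    """plp: (num_frames, num_coeffs) PLP array -> (num_frames, 9 * num_coeffs) NN inputs."""
    T, _ = plp.shape
    padded = np.vstack([np.repeat(plp[:1], context, axis=0),   # pad the edges by
                        plp,                                    # repeating the first
                        np.repeat(plp[-1:], context, axis=0)])  # and last frames
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

utt = np.random.randn(200, 13)   # 200 frames of example 13-dim PLP features
X = stack_frames(utt)
print(X.shape)                   # (200, 117): one 9-frame stacked vector per frame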
19
Experimental Setup
► Code built on the Java CRF toolkit on SourceForge (http://crf.sourceforge.net)
  ► Performs training to maximize the log-likelihood of the training set with respect to the model
  ► Does this via gradient descent: find the point where the gradient of the log-likelihood function goes to zero (see below)
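In symbols, training maximizes the conditional log-likelihood of the training data, L(λ) = Σ_n log P(Y_n | X_n; λ). For a CRF this objective is concave, and its gradient with respect to each weight λ_i has the standard form

  ∂L/∂λ_i = (observed count of feature i in the training data) − (expected count of feature i under the current model)

so training stops where the two counts match, i.e. where the gradient goes to zero.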
20
Experimental Setup
► Outputs from the neural nets are themselves treated as feature functions for the observed sequence
  ► Each attribute/label combination gives us a value for one feature function
  ► We also use a bias feature for each label
► Currently, all combinations of features and labels are used as feature functions
  ► e.g. f(P(stop), /t/), f(P(stop), /ae/), etc.
  ► Phone class features are used in the same manner, e.g. f(P(/t/), /t/), f(P(/t/), /ae/), etc.
► Transition features use only a 0/1 bias feature
  ► 1 if the transition occurs at that timeframe in the training set
  ► 0 if the transition does not occur at that timeframe in the training set
► For comparison purposes, we compare to a baseline HMM-trained system that uses decorrelated features as inputs
A rough sketch of this feature layout follows.
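This is an illustrative layout only; the names and data structures here are invented for the sketch and are not taken from the Java CRF toolkit:

def state_features(nn_outputs, labels):
    """One feature value per (neural-net output, label) pair, plus a bias per label.

    nn_outputs: dict mapping an attribute/phone posterior name to its value for this frame
    labels: the phone label set, e.g. ['/t/', '/ae/', ...]
    """
    feats = {}
    for label in labels:
        feats[('bias', label)] = 1.0              # bias feature for each label
        for attr, value in nn_outputs.items():
            feats[(attr, label)] = value          # e.g. ('P(stop)', '/t/') -> 0.87
    return feats

def transition_features(prev_label, label):
    """0/1 bias feature on each label-to-label transition."""
    return {('trans', prev_label, label): 1.0}

frame = {'P(stop)': 0.87, 'P(voiced)': 0.12, 'P(/t/)': 0.63}          # made-up detector outputs
print(state_features(frame, ['/t/', '/ae/'])[('P(stop)', '/t/')])     # -> 0.87

Each of these (feature, label) entries then gets its own trained weight in the CRF.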
21
Initial Results

  Model                        Label Space   Phone Recognition Accuracy
  HMM (phones)                 triphones     67.32%
  CRF (phones)                 monophones    67.27%
  HMM (features)               triphones     66.69%
  CRF (features)               monophones    65.25%
  HMM (phones/feas) (top 39)   triphones     67.96%
  CRF (phones/feas)            monophones    68.00%
22
Experimental Setup
► Initial CRF experiments show results comparable to triphone HMM results with only monophone labelling
  ► No decorrelation of features needed
  ► No assumptions about feature independence
► The comparison to the HMM is handicapped in one way: HMM training allowed phone boundaries to shift during training, while CRF training used fixed phone boundaries for all training
► Another experiment: train the CRF, realign the training labels, then retrain on the realigned labels
23
Realignment Results

  Model                    Label Space   Phone Recognition Accuracy
  HMM (phones)             triphones     67.32%
  CRF (phones) base        monophones    67.27%
  CRF (phones) realign     monophones    69.63%
  HMM (features)           triphones     66.69%
  CRF (features) base      monophones    65.25%
  CRF (features) realign   monophones    67.52%
24
Experimental Setup
► CRFs can also make use of features on the transitions
  ► For the initial experiments, transition feature functions only used bias features (i.e. 1 or 0 based on the label in the training corpus)
► What if the phone classifications were used as the state features, and the feature classes were used as transition features? (One way to write this is noted below.)
  ► Linguistic observation: features spread from phone to phone
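One way to read this variant, in the notation of the earlier slides (an interpretation for illustration, not a detail taken from the experiments themselves): a transition feature function would take a form like g(P(attr), /iy/, /k/), i.e. the attribute detector's output at that frame paired with a specific phone-to-phone transition, each such pairing carrying its own trained weight, while the state feature functions keep the form f(P(/t/), /t/).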
25
Realignment Results

  Model                      Label Space   Phone Recognition Accuracy
  CRF (phones) base          monophones    67.27%
  CRF (phones) realign       monophones    69.63%
  CRF (features) base        monophones    65.25%
  CRF (features) realign     monophones    67.52%
  CRF (p+f) base             monophones    68.00%
  CRF (p + trans f) base     monophones    69.49%
  CRF (p + trans f) align    monophones    70.86%
26
Discussion & Future Work
► This seems to be a good model for the type of feature combination we want to perform
  ► Makes use of arbitrary, possibly correlated features
  ► Results on the phone recognition task are comparable or superior to the alternative sequence model (HMM)
► Future Work
  ► New features: what kinds of features can we add to improve our transitions? We hope to get more from the other research groups
  ► New training methods: faster algorithms than the gradient descent method exist and need to be tested
  ► Word recognition: we are thinking about how to model word recognition in this framework
  ► Larger corpora: TIMIT is a comparatively small corpus, so we are looking to move to something bigger