1
St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences
Recurrent Neural Network-based Language Modeling for an Automatic Russian Speech Recognition System
Irina Kipyatkova, Alexey Karpov
11 November 2015, St. Petersburg, Russia
2
Introduction
Automatic recognition of continuous Russian speech is a very challenging task due to several features of the language. Russian is a morphologically rich, inflective language: word formation is performed by morphemes that carry grammatical meaning. This increases the vocabulary size and the perplexity of n-gram language models (LMs). In addition, word order in Russian is not strictly fixed, which complicates the creation of LMs and decreases their efficiency. The widely used n-gram LMs work well enough for languages with restricted word order (for example, English), but for Russian they are much less efficient. In our research we used a recurrent neural network language model (RNN LM) for N-best list rescoring.
3
Statistical Language Models
An n-gram is a sequence of n elements (for example, words). An n-gram language model predicts an element of a sequence from its n-1 predecessors; word n-grams are used to estimate the probability of a word sequence W = (w1, w2, …, wm) in a text. In practice, the probability of the i-th word is conditioned on a fixed number of preceding words:

Type of model | Probability of the i-th word
0-gram        | P(wi) = 1/|V|
1-gram        | P(wi)
2-gram        | P(wi | wi-1)
3-gram        | P(wi | wi-2, wi-1)
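As an illustration (not from the presentation), the sketch below estimates n-gram probabilities by simple maximum likelihood from counts; the models described later use Kneser-Ney smoothing instead, and all names here are illustrative.

```python
# Maximum-likelihood n-gram estimation: P(w_i | history) = c(history, w_i) / c(history).
from collections import defaultdict

def train_ngram_counts(sentences, n=3):
    """Count n-grams and their (n-1)-word histories over tokenized sentences."""
    ngram_counts = defaultdict(int)
    history_counts = defaultdict(int)
    for sent in sentences:
        tokens = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(n - 1, len(tokens)):
            history = tuple(tokens[i - n + 1:i])
            ngram_counts[history + (tokens[i],)] += 1
            history_counts[history] += 1
    return ngram_counts, history_counts

def ngram_prob(word, history, ngram_counts, history_counts):
    """MLE probability of `word` given its (n-1)-word history (0 if unseen)."""
    h = tuple(history)
    if history_counts[h] == 0:
        return 0.0
    return ngram_counts[h + (word,)] / history_counts[h]

# Usage: probability of "воздуха" after "чистота" in a toy bigram model.
counts, hist = train_ngram_counts([["чистота", "воздуха"]], n=2)
print(ngram_prob("воздуха", ("чистота",), counts, hist))
```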
4
Feedforward Neural Network Architecture
In a feedforward NN LM, the input layer encodes the history of the n-1 preceding words. Each word is represented by a vector of length |V| (the vocabulary size). Drawback: a feedforward NN uses a preceding context of fixed length for word prediction.
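A minimal sketch of the forward pass of such a feedforward NN LM (layer sizes and weight names are assumed, biases omitted): the n-1 context words are projected to embeddings, concatenated into a fixed-length input, and mapped through a hidden layer to a softmax over the vocabulary.

```python
import numpy as np

V, n, d_emb, d_hid = 10_000, 3, 100, 200      # illustrative sizes, not the paper's

rng = np.random.default_rng(0)
E = rng.normal(scale=0.01, size=(V, d_emb))                 # projection (embedding) matrix
W_h = rng.normal(scale=0.01, size=((n - 1) * d_emb, d_hid)) # input  -> hidden weights
W_out = rng.normal(scale=0.01, size=(d_hid, V))             # hidden -> output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_distribution(context_ids):
    """P(w_i | w_{i-n+1}, ..., w_{i-1}) for a fixed window of n-1 word indices."""
    x = np.concatenate([E[w] for w in context_ids])  # fixed-length input layer
    h = np.tanh(x @ W_h)                             # hidden layer
    return softmax(h @ W_out)                        # distribution over the vocabulary

p = next_word_distribution([42, 7])   # two-word history for a 3-gram-like NN LM
print(p.shape, p.sum())               # (10000,) ~1.0
```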
5
Recurrent Neural Network LM
input layer: x(t) = w(t) + s(t-1), where w(t) is the 1-of-V encoding of the current word and s(t-1) is the previous hidden state
hidden layer: s_j(t) = f(Σ_i x_i(t) u_ji)
output layer: y_k(t) = g(Σ_j s_j(t) v_kj)
f(z) is the sigmoid activation function: f(z) = 1 / (1 + e^(-z))
g(z) is the softmax function: g(z_m) = e^(z_m) / Σ_k e^(z_k)
The RNN LM was trained using the Recurrent Neural Network Language Modeling Toolkit (RNNLM toolkit) [Mikolov, T., Kombrink, S., Deoras, A., Burget, L., Černocký, J.: RNNLM - Recurrent Neural Network Language Modeling Toolkit. In: ASRU 2011 Demo Session (2011)]
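A minimal sketch of one time step of a simple (Elman) recurrent NN LM mirroring the equations above; sizes and variable names are illustrative, and this is not the RNNLM toolkit itself.

```python
import numpy as np

V, H = 10_000, 500                        # vocabulary and hidden-layer sizes (illustrative)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.01, size=(H, V))   # input  -> hidden weights
W = rng.normal(scale=0.01, size=(H, H))   # hidden -> hidden (recurrent) weights
Vo = rng.normal(scale=0.01, size=(V, H))  # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(word_id, s_prev):
    """Return (P(next word), new hidden state) for the current word and previous state."""
    w = np.zeros(V); w[word_id] = 1.0     # 1-of-V encoding of the current word
    s = sigmoid(U @ w + W @ s_prev)       # hidden layer carries the whole history
    y = softmax(Vo @ s)                   # distribution over the next word
    return y, s

s = np.zeros(H)
for wid in [42, 7, 123]:                  # score a toy word-id sequence
    y, s = rnn_step(wid, s)
print(y[99])                              # probability of word 99 as the next word
```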
6
Text Corpus and the Baseline LM
The corpus consists of several online newspapers. Its volume is over 350M words (after text normalization and deletion of duplicated and short (<5 words) sentences). The corpus contains over 1M unique word forms. The baseline model is a 3-gram LM built with the Kneser-Ney discounting method using the SRI Language Modeling Toolkit (SRILM). The vocabulary size is 150K words.
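For illustration only, here is a rough sketch of interpolated Kneser-Ney smoothing for a bigram model (the discounting scheme named above). The real baseline was built with SRILM; this simplified version uses a single fixed discount and backs off fully for unseen histories.

```python
from collections import defaultdict

def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Build P_KN(w | prev) from a dict {(prev, w): count}."""
    follow = defaultdict(set)     # distinct words seen after a given history
    precede = defaultdict(set)    # distinct left contexts of a word
    history_count = defaultdict(int)
    for (prev, w), c in bigram_counts.items():
        follow[prev].add(w)
        precede[w].add(prev)
        history_count[prev] += c
    total_bigram_types = sum(len(s) for s in precede.values())

    def prob(w, prev):
        p_cont = len(precede[w]) / total_bigram_types     # continuation probability
        if history_count[prev] == 0:
            return p_cont                                  # back off completely
        c = bigram_counts.get((prev, w), 0)
        lam = discount * len(follow[prev]) / history_count[prev]
        return max(c - discount, 0) / history_count[prev] + lam * p_cont

    return prob

p_kn = kneser_ney_bigram({("чистота", "воздуха"): 3, ("свежесть", "воздуха"): 1})
print(p_kn("воздуха", "чистота"))
```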
7
Perplexity of the Baseline, RNN, and Interpolated LMs
Perplexity of the obtained models was calculated on text data consisting of phrases (33M words in total) from an online newspaper that was not used for training. Perplexity of the baseline 3-gram model is 553.

Perplexity for different interpolation coefficients:
Language model                        | 1.0 | 0.6 | 0.5 | 0.4
RNN with 100 hidden units + 3-gram LM | 981 | 482 | 465 | 457
RNN with 300 hidden units + 3-gram LM | 997 | 484 | 467 |
RNN with 500 hidden units + 3-gram LM | 766 | 396 | 392 | 394
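A minimal sketch of how such an interpolated model can be evaluated: the callables p_rnn and p_ngram are hypothetical stand-ins for the two component models, and perplexity is computed on held-out text.

```python
import math

def interpolated_prob(word, history, p_rnn, p_ngram, lam=0.5):
    """P(word | history) = lam * P_RNN + (1 - lam) * P_ngram."""
    return lam * p_rnn(word, history) + (1.0 - lam) * p_ngram(word, history)

def perplexity(sentences, p_rnn, p_ngram, lam=0.5):
    """Perplexity = exp(-(1/N) * sum_i log P(w_i | history_i)) over held-out text."""
    log_prob_sum, n_words = 0.0, 0
    for sent in sentences:
        for i, word in enumerate(sent):
            p = interpolated_prob(word, sent[:i], p_rnn, p_ngram, lam)
            log_prob_sum += math.log(p)
            n_words += 1
    return math.exp(-log_prob_sum / n_words)
```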
8
Architecture of Russian ASR system with RNN LM
9
Continuous Russian Speech Corpus
Characteristics of the speech recordings:
22 hours of continuous speech and dialogues
The corpus was recorded in an acoustic studio at 44.1 kHz, 16 bit; SNR ≈ 35 dB
A stereo pair of Oktava MK-012 cardioid microphones and a Presonus Firepod sound board
Training corpus: 327 phonetically balanced and meaningful phrases, 50 native speakers from St. Petersburg (male and female voices fifty-fifty)
Test corpus: 100 continuously pronounced phrases from the online newspaper «Фонтанка.ru», 5 speakers
10
Evaluation of Performance of the ASR System
Evaluation of the performance of the ASR system was carried out in terms of the word error rate (WER):

WER = (S + D + I) / N * 100%,

where S is the number of substitution errors, I is the number of insertion errors, D is the number of deletion errors, and N is the total number of words in the recognized phrase.
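A minimal sketch (not from the presentation) that computes WER with a standard Levenshtein alignment between the reference and the recognized word sequences.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N * 100%, computed via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("чистота воздуха зависит и от ветра",
                      "чистота воздух зависит от ветра"))  # 1 sub + 1 del -> 33.3%
```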
11
WER Obtained After Rescoring of N-best Lists with RNN LM
WER obtained with the baseline 3-gram LM was 26.54%.

Number of units in hidden layer | Interpolation coefficient | 10-best list | 20-best list | 50-best list
100 | 1.0 | 26.33 | 26.65 | 26.72
100 | 0.6 | 25.13 | 25.06 | 24.98
100 | 0.5 | 24.89 | 24.91 |
100 | 0.4 | 24.72 | |
300 | 1.0 | 25.41 | 25.30 | 25.49
300 | 0.6 | 24.68 | 24.53 | 24.51
300 | 0.5 | 24.59 | 24.04 | 24.18
300 | 0.4 | 23.97 | 24.10 |
500 | 1.0 | 23.67 | 23.76 | 23.07
500 | 0.6 | 22.96 | 23.65 | 23.00
500 | 0.5 | 22.87 | 23.82 | 23.26
500 | 0.4 | 23.24 | |
12
An example of an N-best list of recognition hypotheses
An example of the 10-best list of ASR hypotheses for the Russian phrase "Чистота воздуха зависит и от ветра" ("Purity of the air also depends on the wind"): after rescoring this 10-best list using the RNN LM with 500 hidden units interpolated with the baseline 3-gram LM, hypothesis #4 was selected as the best one. After N-best list rescoring we obtained the correct hypothesis for this utterance.
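A rough sketch of the rescoring step under assumed score conventions (the LM weight and toy scores are hypothetical, not the authors' exact setup): each N-best hypothesis keeps its acoustic score, its LM score is recomputed with the interpolated RNN + 3-gram LM, and the list is re-ranked.

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=10.0):
    """nbest: list of (words, acoustic_logprob); returns hypotheses re-ranked
    by acoustic score + lm_weight * new LM log-probability."""
    rescored = []
    for words, am_logprob in nbest:
        total = am_logprob + lm_weight * lm_logprob(words)
        rescored.append((total, words))
    rescored.sort(reverse=True)          # best (highest) total score first
    return [words for _, words in rescored]

# Toy usage: the hypothesis with the worse acoustic score wins after LM rescoring.
nbest = [("чистота воздуха зависит от ветра".split(), -120.0),
         ("чистота воздуха зависит и от ветра".split(), -123.0)]
dummy_lm = {5: -8.0, 6: -7.0}            # pretend log-probabilities from the interpolated LM
best = rescore_nbest(nbest, lm_logprob=lambda ws: dummy_lm[len(ws)])
print(" ".join(best[0]))                 # the six-word (correct) hypothesis is selected
```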
13
Conclusion
Statistical n-gram LMs are not efficient enough for Russian ASR because of the almost free word order of Russian. RNN LMs are able to store an arbitrarily long history of a given word, which is their advantage over n-gram LMs. We tried RNNs with various numbers of units in the hidden layer, and we also performed linear interpolation of the RNN LM with the baseline 3-gram LM. We achieved a relative WER reduction of 14% using the RNN LM with respect to the baseline model.
14
Thank you for your attention!