Published by Lindsey Austin. Modified over 9 years ago.
1 Unsupervised Adaptation of a Stochastic Language Model Using a Japanese Raw Corpus
Gakuto KURATA, Shinsuke MORI, Masafumi NISHIMURA
IBM Research, Tokyo Research Laboratory, IBM Japan
Presenter: Hsuan-Sheng Chiu
2 References
S. Mori and D. Takuma, “Word N-gram Probability Estimation From A Japanese Raw Corpus,” in Proc. of ICSLP 2004, pp. 201–207.
H. Feng, K. Chen, X. Deng, and W. Zheng, “Accessor Variety Criteria for Chinese Word Extraction,” Computational Linguistics, vol. 30, no. 1, pp. 75–93, 2004.
M. Asahara and Y. Matsumoto, “Japanese Unknown Word Identification by Character-based Chunking,” in Proc. of COLING 2004, pp. 459–465.
3 Introduction
Domain-specific words tend to characterize their domain, so misrecognizing them severely degrades the quality of an LVCSR application
Japanese has no spaces between words, so the target domain’s corpus has had to be segmented into words
An ideal method:
–Experts manually segment a corpus of the target domain
–Domain-specific words are added to the lexicon
–The domain-specific LM is built from this correctly segmented corpus
This is not realistic because the target domain will change
–A fully automatic method is necessary
4 Proposed method
This method is fully automatic
1. Segment the raw corpus stochastically
2. Build a word n-gram model from the stochastically segmented corpus
3. Add probable words to the LVCSR lexicon
5 Stochastic Segmentation
Raw corpus: all of the words are concatenated and there is no word boundary information
Deterministically segmented corpus: a corpus in which every word boundary is fixed
Stochastically segmented corpus: a corpus annotated with word boundary probabilities
6 Word Boundary Probability
Estimate the probability from a relatively small segmented corpus
Introduce seven character classes, since the number of distinct characters in Japanese is large
–Kanji, symbols, Arabic digits, Hiragana, Katakana, Latin characters (Cyrillic and Greek characters)
Word Boundary Probability:
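The estimation step on this slide can be sketched as a relative-frequency count over a small segmented corpus. This is only an illustration under stated assumptions: the slide's equation is not preserved, so the estimator below (boundary count divided by total count for each pair of adjacent character classes) is an assumed form, and `char_class` is a hypothetical, simplified classifier with six coarse classes based on common Unicode ranges rather than the seven classes mentioned above.

```python
from collections import defaultdict


def char_class(ch):
    """Map a character to a coarse class (hypothetical, simplified ranges)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "symbol"


def estimate_boundary_probs(segmented_sentences):
    """Assumed estimator: P(boundary | class of left char, class of right char).

    segmented_sentences is a list of sentences, each a list of words.
    """
    boundary = defaultdict(int)
    total = defaultdict(int)
    for words in segmented_sentences:
        chars = []
        ends_word = []  # ends_word[i] is True if chars[i] is the last char of a word
        for word in words:
            for k, ch in enumerate(word):
                chars.append(ch)
                ends_word.append(k == len(word) - 1)
        # Look at every position between two adjacent characters.
        for i in range(len(chars) - 1):
            key = (char_class(chars[i]), char_class(chars[i + 1]))
            total[key] += 1
            if ends_word[i]:
                boundary[key] += 1
    return {key: boundary[key] / total[key] for key in total}


# Toy segmented corpus (illustrative only, not from the presentation).
probs = estimate_boundary_probs([["今日", "は", "晴れ"]])
```

Pooling counts by character class rather than by individual character is what keeps the number of parameters small enough to estimate from a small segmented corpus.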
7 Word N-gram Probability
The number of words in the corpus
A character sequence in the raw corpus is treated as a word if and only if there is a word boundary before and after the sequence and there is no word boundary inside the sequence
–Unigram frequency
–Unigram probability
今 天 天 氣 真 好
0.2 0.1 0.5 0.1 0.5 0.1 0.2
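The "if and only if" condition above translates directly into an expected count: each occurrence of a string contributes the probability of a boundary before it, times the probability of no boundary at each inner position, times the probability of a boundary after it. The sketch below applies this to the slide's example. It assumes that the seven numbers are the boundary probabilities at the seven positions around the six characters (before the first, between each pair, after the last); the slide does not state this alignment, and the function name is hypothetical.

```python
def expected_frequency(text, boundary_probs, word):
    """Expected count of `word` in a stochastically segmented text.

    Assumption: boundary_probs[i] is the probability of a word boundary
    immediately before text[i], and boundary_probs[len(text)] is the
    probability of a boundary after the last character.
    """
    n = len(word)
    total = 0.0
    for start in range(len(text) - n + 1):
        if text[start:start + n] != word:
            continue
        # Boundary before the occurrence and boundary after it.
        p = boundary_probs[start] * boundary_probs[start + n]
        # No boundary at any position inside the occurrence.
        for inner in range(start + 1, start + n):
            p *= 1.0 - boundary_probs[inner]
        total += p
    return total


text = "今天天氣真好"
boundary_probs = [0.2, 0.1, 0.5, 0.1, 0.5, 0.1, 0.2]

freq = expected_frequency(text, boundary_probs, "天氣")  # 0.5 * (1 - 0.1) * 0.5
```

A unigram probability would then be this expected frequency divided by the expected number of words in the corpus, which is not computed here.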
8 Word N-gram Probability (cont.)
Word-based n-gram model
–Probability of a word sequence
Since it is impossible to define a complete vocabulary, a special token UW is used for unknown words
The spelling of an unknown word is predicted by a character-based n-gram model
Probability of an OOV word
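The equations for this slide are not preserved, so the following is only one standard way to write the model it describes, not necessarily the presenter's notation. The symbols $h_i$ (the n-gram history of $w_i$), $M_x$ (the character-based spelling model), $x_1 \cdots x_k$ (the characters of an unknown word) and $\mathrm{BT}$ (a boundary token marking the start and end of a word) are assumptions introduced here, and the character model is shown as a bigram for brevity.

```latex
% Word-based n-gram model: probability of a word sequence
P(w_1 w_2 \cdots w_m) = \prod_{i=1}^{m} P(w_i \mid h_i),
\qquad h_i = w_{i-n+1} \cdots w_{i-1}

% Out-of-vocabulary word: predict the token UW, then spell the word
P(w_i \mid h_i) = P(\mathrm{UW} \mid h_i)\, M_x(w_i)
\qquad \text{if } w_i \text{ is not in the vocabulary}

% Character-based spelling model (bigram shown), with x_0 = x_{k+1} = BT
M_x(x_1 x_2 \cdots x_k) = \prod_{j=1}^{k+1} P(x_j \mid x_{j-1})
```

Under this formulation an unknown word still receives a nonzero probability, because the word-level model only has to predict UW and the spelling is scored character by character.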
9 Probable Character Strings Added to the Lexicon
Every character string appearing in the domain-specific corpus can be treated as a word
However, many meaningless character strings are then included as well
Use traditional character-based approaches to judge whether a character string is appropriate as a word
–Accessor Variety Criteria
  Accessor varieties
  Adhesive characters
–Character-based Chunking
  POS feature for chunking
  SVM-based chunking
10 Accessor Variety Criteria
First, discard strings whose accessor variety (AV) is smaller than a certain threshold
The remaining strings are considered to be potentially meaningful words
In addition, apply rules to remove strings that consist of a word and adhesive characters
–Example: the AV of “的人們” is high, but it is not a word
Rule-based discarding
–h: head-adhesive character
–t: tail-adhesive character
–core: meaningful word
–Strings of the form h + core (的我), core + t (我的), and h + core + t (的過程是) should be discarded
Example
–門把手弄壞了
–小明修好了門把手
–這個門把手很漂亮
–這個門把手壞了
–小明弄壞門把手
Prefix of 門把手 (“door handle”) (left AV): S, 了, 個, 壞
Suffix of 門把手 (right AV): 弄, E, 很, 壞, E
Distinct characters are counted once; S (sentence start) and E (sentence end) are counted every time they occur
AV = min{left AV, right AV} = min{4, 5} = 4
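The AV computation in the example can be reproduced with a short sketch. It follows the counting rule stated on the slide (distinct neighbouring characters count once, while sentence start S and sentence end E count at every occurrence); the function name and return shape are hypothetical, and the thresholding and adhesive-character rules are not implemented.

```python
def accessor_variety(sentences, candidate):
    """Return (left AV, right AV, AV) of `candidate` over a list of sentences."""
    left_chars, right_chars = set(), set()
    starts = 0  # occurrences at a sentence start (S), each counted
    ends = 0    # occurrences at a sentence end (E), each counted
    for sentence in sentences:
        pos = sentence.find(candidate)
        while pos != -1:
            end = pos + len(candidate)
            if pos == 0:
                starts += 1
            else:
                left_chars.add(sentence[pos - 1])
            if end == len(sentence):
                ends += 1
            else:
                right_chars.add(sentence[end])
            pos = sentence.find(candidate, pos + 1)
    left_av = len(left_chars) + starts
    right_av = len(right_chars) + ends
    return left_av, right_av, min(left_av, right_av)


sentences = [
    "門把手弄壞了",
    "小明修好了門把手",
    "這個門把手很漂亮",
    "這個門把手壞了",
    "小明弄壞門把手",
]
left_av, right_av, av = accessor_variety(sentences, "門把手")
print(left_av, right_av, av)  # 4 5 4
```

The character 個 precedes 門把手 twice but contributes only once to the left AV, whereas the two sentence-final occurrences each add one to the right AV.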
11 Summary of the proposed method
In terms of time, the proposed method has an advantage: it requires only a raw corpus and needs no labor-intensive manual segmentation to adapt an LVCSR system to the target domain
–OOV words can be treated as words
–Proper n-gram probabilities are assigned to OOV words and to word sequences containing OOV words
12 Basic Material
Acoustic Model
–83 hours from a spontaneous speech corpus
–Phones are represented as context-dependent, three-state left-to-right HMMs
–States are clustered using a phonetic decision tree (2728)
–Each state is modeled using a mixture of 11 Gaussians
General LM
–Large corpus of a general domain
–A small part of the corpus was segmented by experts
–The rest was segmented automatically by the word segmenter and roughly checked by experts
–24,442,503 words
General Lexicon
–45,402 words
13 Experiments
Experiments on lectures of the University of the Air
Domain-specific words that never appear in newspaper articles are often used
Select 3 lectures for the experiments
–Mainly composed of the textbooks
  Small: about 20 pages
  Large: one entire textbook
14 Experiment (cont.)
Three methods are compared
–Ideal
–Automatic
–Proposed
  OOV words are added to the lexicon
Evaluation
–Use CER instead of WER
  The reason is that word segmentation is ambiguous in Japanese
–eWER
  WER estimated from the CER and the average number of characters per word
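The slide does not give the eWER formula, so the sketch below shows only one plausible way to derive a word-level estimate from the two quantities it names. It assumes that character errors are independent and that a word is correct only when all of its characters are correct, which gives eWER = 1 − (1 − CER)^n for an average word length of n characters. This formula and the function name are assumptions for illustration, not the presenters' definition.

```python
def estimated_wer(cer, avg_chars_per_word):
    """Hypothetical eWER: 1 - (1 - CER) ** (average characters per word).

    Assumes independent character errors; a word counts as correct only
    if every one of its characters is correct.
    """
    if not 0.0 <= cer <= 1.0:
        raise ValueError("cer must be between 0 and 1")
    if avg_chars_per_word <= 0:
        raise ValueError("avg_chars_per_word must be positive")
    return 1.0 - (1.0 - cer) ** avg_chars_per_word


# Illustrative numbers only, not results from the presentation.
ewer = estimated_wer(0.10, 2.0)
```

Under this assumption the estimate equals the CER when words are one character long and grows with the average word length.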
15 Experimental Results
16 Conclusions
The results show that the larger the corpus, the better the proposed method performs
Raw corpora are not difficult to collect, but manually segmenting a raw corpus is expensive and time-consuming
We propose a new method for adapting an LVCSR system to a specific domain, based on stochastic segmentation
The proposed method allows LVCSR to be adapted to various domains in much less time