1 Unsupervised Adaptation of a Stochastic Language Model Using a Japanese Raw Corpus Gakuto KURATA, Shinsuke MORI, Masafumi NISHIMURA IBM Research, Tokyo.


1 Unsupervised Adaptation of a Stochastic Language Model Using a Japanese Raw Corpus Gakuto KURATA, Shinsuke MORI, Masafumi NISHIMURA IBM Research, Tokyo Research Laboratory, IBM Japan Presenter: Hsuan-Sheng Chiu

2 References
–S. Mori and D. Takuma, “Word N-gram Probability Estimation From A Japanese Raw Corpus,” in Proc. of ICSLP 2004, pp. 201–207.
–H. Feng, K. Chen, X. Deng, and W. Zheng, “Accessor Variety Criteria for Chinese Word Extraction,” Computational Linguistics, vol. 30, no. 1, pp. 75–93, 2004.
–M. Asahara and Y. Matsumoto, “Japanese Unknown Word Identification by Character-based Chunking,” in Proc. of COLING 2004, pp. 459–465.

3 Introduction Domain-specific words are likely to characterize their domain, so misrecognition of these words severely degrades the quality of an LVCSR application. Because no spaces exist between words in Japanese, the target domain’s corpus must first be segmented into words. An ideal method: –Experts manually segment a corpus of the target domain –Domain-specific words are added to the lexicon –The domain-specific LM is built from this correctly segmented corpus This is not realistic because the target domain will change –A fully automatic method is necessary

4 Proposed method The method is fully automatic: 1. Segment the raw corpus stochastically 2. Build a word n-gram model from the stochastically segmented corpus 3. Add probable words to the lexicon of the LVCSR system

5 Stochastic Segmentation –Raw corpus: all of the words are concatenated and there is no word boundary information –Deterministically segmented corpus: a corpus with deterministic word boundaries –Stochastically segmented corpus: a corpus with word boundary probabilities

6 Word Boundary Probability Estimate the probability from a relatively small segmented corpus. Since the number of distinct characters in Japanese is large, seven character classes are introduced: –Kanji, symbols, Arabic digits, Hiragana, Katakana, Latin characters, and other characters (Cyrillic and Greek) Word boundary probability: the probability of a word boundary between two adjacent characters is estimated as the relative frequency of word boundaries observed between that pair of character classes in the segmented corpus.
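The estimation above can be sketched as follows. This is a minimal illustration, assuming the class-pair relative-frequency estimate described on the slide; the `char_class` helper is a deliberately simplified classifier (the paper uses seven classes for Japanese), and all names are illustrative.

```python
from collections import defaultdict

def char_class(c):
    # Simplified character classifier; only an approximation of the
    # seven classes used in the paper.
    if '\u4e00' <= c <= '\u9fff':
        return 'kanji'
    if '\u3040' <= c <= '\u309f':
        return 'hiragana'
    if '\u30a0' <= c <= '\u30ff':
        return 'katakana'
    if c.isdigit():
        return 'digit'
    if c.isalpha():
        return 'latin'
    return 'symbol'

def boundary_probs(segmented_sentences):
    """segmented_sentences: lists of words from a small hand-segmented corpus."""
    boundary = defaultdict(int)  # class pair -> word boundaries observed
    total = defaultdict(int)     # class pair -> occurrences of the pair
    for words in segmented_sentences:
        chars = ''.join(words)
        # character positions that are word boundaries
        cuts, pos = set(), 0
        for w in words[:-1]:
            pos += len(w)
            cuts.add(pos)
        for i in range(len(chars) - 1):
            pair = (char_class(chars[i]), char_class(chars[i + 1]))
            total[pair] += 1
            if i + 1 in cuts:
                boundary[pair] += 1
    # relative frequency of a boundary for each class pair
    return {p: boundary[p] / total[p] for p in total}

probs = boundary_probs([['今日', 'は', '晴れ'], ['今日', 'も', '晴れ']])
```

For example, a kanji–kanji pair here never carries a boundary, while a hiragana–kanji pair always does, so their estimated boundary probabilities are 0 and 1 respectively.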

7 Word N-gram Probability Expected number of words in the corpus: with a word boundary probability between each pair of adjacent characters, the expected word count is one plus the sum of the boundary probabilities. A character sequence in the raw corpus is treated as a word if and only if there is a word boundary before and after the sequence and no word boundary inside the sequence: –Unigram frequency: each occurrence of a string contributes P(boundary before) × ∏ (1 − P(boundary inside)) × P(boundary after) –Unigram probability: the unigram frequency divided by the expected word count Example character sequence: 今 天 天 氣 真 好
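The unigram computation above can be sketched in a few lines. This is a hedged illustration of the stochastic counts, not the paper's implementation: `p[k]` is assumed to be the probability of a word boundary between characters `k` and `k+1` of the raw corpus, with boundaries at both corpus ends taken as certain.

```python
def expected_freq(corpus, p, w):
    """Expected count of string w, summed over all of its occurrences."""
    n, m, total = len(corpus), len(w), 0.0

    def pb(k):  # boundary probability just before corpus[k]
        return 1.0 if k == 0 or k == n else p[k - 1]

    for i in range(n - m + 1):
        if corpus[i:i + m] != w:
            continue
        prob = pb(i) * pb(i + m)      # boundary before and after the string
        for k in range(i + 1, i + m):
            prob *= (1.0 - pb(k))     # no boundary inside the string
        total += prob
    return total

def expected_word_count(corpus, p):
    # Expected number of words = 1 + sum of internal boundary probabilities.
    return 1.0 + sum(p)
```

The unigram probability of `w` is then `expected_freq(corpus, p, w) / expected_word_count(corpus, p)`.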

8 Word N-gram Probability (cont.) Word-based n-gram model –Probability of a word sequence Since it is impossible to define the complete vocabulary, a special token UW is used for unknown words. The spelling of an unknown word is predicted by a character-based n-gram model, which gives the probability of an OOV word.
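The character-based unknown-word model can be sketched as below, assuming a character bigram with begin/end markers so that an OOV spelling is scored as P(c1|&lt;s&gt;) × P(c2|c1) × … × P(&lt;/s&gt;|cm). The bigram table and its probabilities here are toy values for illustration, not trained estimates.

```python
def oov_spelling_prob(word, char_bigram, bos='<s>', eos='</s>'):
    """Score an OOV spelling with a character bigram model."""
    prob, prev = 1.0, bos
    for c in list(word) + [eos]:
        # tiny floor for unseen character pairs (a stand-in for smoothing)
        prob *= char_bigram.get((prev, c), 1e-6)
        prev = c
    return prob

# Toy bigram table, for illustration only.
toy_bigram = {('<s>', 'a'): 0.5, ('a', 'b'): 0.4, ('b', '</s>'): 0.25}
p_ab = oov_spelling_prob('ab', toy_bigram)  # 0.5 * 0.4 * 0.25
```

In the full model this spelling probability would be combined with the n-gram probability of the UW token itself.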

9 Probable Character Strings Added to the Lexicon All of the character strings appearing in the domain-specific corpus could be treated as words, but many meaningless character strings would also be included. Traditional character-based approaches are used to judge whether or not a character string is appropriate as a word: –Accessor Variety criteria: accessor varieties and adhesive characters –Character-based chunking: POS features for chunking, SVM-based chunking

10 Accessor Variety Criteria First discard the strings whose accessor variety is smaller than a threshold; the remaining strings are considered potentially meaningful words. In addition, apply rules to remove strings that consist of a word plus adhesive characters. Accessor variety example, for the string 門把手 in the corpus: –門把手弄壞了 –小明修好了門把手 –這個門把手很漂亮 –這個門把手壞了 –小明弄壞門把手 Preceding characters (left AV): S, 了, 個, 壞 Following characters (right AV): 弄, E, 很, 壞 (S and E mark a sentence start and end; distinct characters are counted once, while S and E are counted repeatedly, once per occurrence) AV = min{left AV, right AV} = 4 Rule-based discarding, with h a head-adhesive character, t a tail-adhesive character, and core a meaningful word: –h + core: 的我 –core + t: 我的 –h + core + t: 的過程是 These should be discarded; for example, 的人們 has a high AV but is not a word.
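The accessor variety computation can be sketched as follows. This is a simplified illustration: it counts the sentence-start and sentence-end markers S and E as single accessors (rather than once per occurrence), which still reproduces the slide's AV = 4 for 門把手.

```python
def accessor_variety(string, sentences):
    """Left/right accessor variety of a string over a list of sentences."""
    left, right = set(), set()
    m = len(string)
    for s in sentences:
        for i in range(len(s) - m + 1):
            if s[i:i + m] != string:
                continue
            # 'S' and 'E' stand in for sentence start and end
            left.add(s[i - 1] if i > 0 else 'S')
            right.add(s[i + m] if i + m < len(s) else 'E')
    return min(len(left), len(right))

sents = ['門把手弄壞了', '小明修好了門把手', '這個門把手很漂亮',
         '這個門把手壞了', '小明弄壞門把手']
av = accessor_variety('門把手', sents)  # left {S, 了, 個, 壞}, right {弄, E, 很, 壞}
```

Strings whose AV falls below the chosen threshold are discarded before the rule-based filtering of adhesive-character patterns.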

11 Summary of the proposed method With regard to time, the proposed method has an advantage: it requires only a raw corpus and needs no labor-intensive manual segmentation to adapt an LVCSR system to the target domain –OOV words can be treated as words –Proper n-gram probabilities are assigned to OOV words and to word sequences containing OOV words

12 Basic Material Acoustic Model –83 hours from a spontaneous speech corpus –Phones are represented as context-dependent, three-state left-to-right HMMs –States are clustered using a phonetic decision tree (2,728 states) –Each state is modeled with a mixture of 11 Gaussians General LM –A large corpus of a general domain –A small part of the corpus was segmented by experts –The rest was segmented automatically by the word segmenter and roughly checked by experts –24,442,503 words General Lexicon –45,402 words

13 Experiments Experiments on lectures of the University of the Air, where domain-specific words that never appear in newspaper articles are often used. Three lectures were selected for the experiments; the adaptation corpora are mainly composed of the textbooks –Small: about 20 pages –Large: one entire textbook

14 Experiments (cont.) Three methods are compared –Ideal –Automatic –Proposed OOV words are added to the lexicon Evaluation –CER is used instead of WER, because word segmentation in Japanese is ambiguous –eWER: WER estimated from the CER and the average number of characters per word

15 Experimental Results

16 Conclusions A new method is proposed to adapt an LVCSR system to a specific domain based on stochastic segmentation. It is not difficult to collect raw corpora, but it is expensive and time-consuming to manually segment a raw corpus; the proposed method therefore allows an LVCSR system to be adapted to various domains in much less time. The results also show that with the proposed method, the larger the corpus, the better the performance.