Lexical Trigger and Latent Semantic Analysis for Cross-Lingual Language Model Adaptation
WOOSUNG KIM and SANJEEV KHUDANPUR
Presented by 邱炫盛, 2005/01/12

Outline
–Introduction
–Cross-Lingual Story-Specific Adaptation
–Training and Test Corpora
–Experimental Results
–Conclusions

Introduction
Statistical language models are indispensable components of many human language technologies, e.g. ASR, IR, MT. The best-known techniques for estimating LMs require large amounts of text in the domain and language of interest, making such text a bottleneck resource for languages like Arabic. There have been attempts to overcome this data-scarcity problem in other components of speech and language processing systems, e.g. by porting acoustic models or linguistic analysis tools from a resource-rich language to a resource-deficient language.

Introduction (cont.)
For language modeling, if sufficiently good MT were available between a resource-rich language, such as English, and a resource-deficient language, say Chinese, then one could choose English documents, translate them, and use the resulting Chinese word statistics to adapt LMs. Yet the assumption of some MT capability presupposes linguistic resources, e.g. at least a modest sentence-aligned parallel corpus, that may not be available for some languages.
Two primary means of exploiting cross-lingual information for language modeling are investigated, neither of which requires any explicit MT capability:
–Cross-Lingual Lexical Triggers
–Cross-Lingual Latent Semantic Analysis

Introduction (cont.)
Cross-Lingual Lexical Triggers: several content-bearing English words will signal the existence of a number of content-bearing Chinese counterparts in the story. If a set of matched English-Chinese stories is provided for training, one can infer which Chinese words an English word would trigger by using a statistical measure.
Cross-Lingual Latent Semantic Analysis: LSA of a collection of bilingual document-pairs provides a representation of the words of both languages in a common low-dimensional Euclidean space. This provides another means of using English word frequencies to improve a Chinese language model from English text.
It is shown through empirical evidence that while both techniques yield good statistics for adapting a Chinese language model to a particular story, the goodness of the information varies from story to story.

Cross-Lingual Story-Specific Adaptation

Our aim is to sharpen a language model in a resource-deficient language by using data from a resource-rich language. Assume for the time being that a sufficiently good Chinese-English story alignment is given. Assume further that we have a stochastic translation lexicon, i.e. a probabilistic model P_T(c|e) of Chinese words given English words.
Cross-Lingual Unigram Distribution:
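In the paper's formulation, each matched English story d_i^E induces a unigram distribution over Chinese words, with P(e | d_i^E) the relative frequency of e in d_i^E:

$$ P_{CL\text{-}unigram}(c \mid d_i^E) = \sum_{e \in d_i^E} P_T(c \mid e)\, P(e \mid d_i^E) $$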

Use the cross-lingual unigram statistics to sharpen a statistical Chinese LM used for processing the test story d_i^C, by linear interpolation with a story-independent trigram model; a variation of this combination is also considered.
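A minimal rendering of the linear interpolation, with λ the interpolation weight (tuned on held-out or first-pass data):

$$ P(c_k \mid c_{k-1}, c_{k-2}, d_i^E) = \lambda\, P(c_k \mid c_{k-1}, c_{k-2}) + (1-\lambda)\, P_{CL\text{-}unigram}(c_k \mid d_i^E) $$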

Obtaining the Matching English Document d_i^E
Assume that we have a stochastic reverse translation lexicon P_T(e|c). Compute an English bag-of-words representation of the Mandarin story d_i^C, as used in standard vector-based information retrieval. The English document d_i^E with the highest TF-IDF weighted cosine-similarity to this representation is selected. This is called the query-translation approach to CLIR.
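A natural form of the translated query, consistent with the query-translation approach above (notation assumed): each English word e receives the weight

$$ q(e \mid d_i^C) = \sum_{c \in d_i^C} P_T(e \mid c)\, P(c \mid d_i^C), $$

and d_i^E is chosen as the English document maximizing the TF-IDF weighted cosine-similarity to q(· | d_i^C).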

Obtaining Stochastic Translation Lexicons
The translation lexicons P_T(c|e) and P_T(e|c):
–May be created out of multiple translations of a word; stemming and other morphological analyses may be applied to increase vocabulary coverage.
–Alternately, they may be obtained from a parallel corpus using statistical MT techniques, such as the GIZA++ tools (see the sketch after this list).
–The translation models are applied to entire articles, one word at a time, to get a bag of translated words. However, obtaining translation probabilities from very long (document-sized) sentence-pairs has its own issues.
–For a truly resource-deficient language, one may obtain a translation lexicon via optical character recognition from a printed bilingual dictionary.
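A minimal sketch of turning word alignments into a stochastic lexicon by relative frequency, assuming (English, Chinese) aligned word pairs have already been extracted, e.g. from GIZA++ Viterbi alignments (the input format here is an assumption for illustration):

```python
from collections import Counter, defaultdict

def estimate_translation_lexicon(aligned_pairs):
    """Relative-frequency estimate of P_T(c|e) from word-aligned pairs.

    aligned_pairs: iterable of (english_word, chinese_word) tuples,
    e.g. read off GIZA++ Viterbi alignments (format assumed here).
    """
    counts = defaultdict(Counter)
    for e, c in aligned_pairs:
        counts[e][c] += 1
    return {e: {c: n / sum(cc.values()) for c, n in cc.items()}
            for e, cc in counts.items()}

# Toy usage with invented alignments:
pairs = [("bank", "银行"), ("bank", "银行"), ("bank", "河岸")]
P_T = estimate_translation_lexicon(pairs)
print(P_T["bank"])  # {'银行': 0.666..., '河岸': 0.333...}
```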

Cross-Lingual Lexical Triggers
It seems plausible that most of the information one gets from the cross-lingual unigram LM is in the form of altered statistics of topic-specific Chinese words, conveyed by the statistics of content-bearing English words in the matching story. The translation lexicon used for obtaining this information is an expensive resource. If one is only interested in the conditional distribution of Chinese words given some English words, there is no reason to require translation as an intermediate step.
In the monolingual setting, the mutual information between lexical pairs co-occurring anywhere within a long "window" of each other has been used to capture statistical dependencies not covered by N-gram LMs.

Cross-Lingual Lexical Triggers (cont.)
A pair of words (a, b) is considered a trigger-pair if, given a word-position in a sentence, the occurrence of a in any of the preceding word-positions significantly alters the probability that the following word in the sentence is b; a is said to trigger b. (The set of preceding word-positions is variably defined, e.g. as the sentence, paragraph, or document.)
In the cross-lingual setting, a pair of words (e, c) is considered a trigger-pair if, given an English-Chinese pair of aligned documents, the occurrence of e in the English document significantly alters the probability that c occurs in the Chinese document.
Translation-pairs are natural candidates for trigger-pairs; however, it is not necessary for a trigger-pair to also be a translation-pair.
–E.g. Belgrade may trigger the Chinese translations of Serbia, Kosovo, China, embassy and bomb.

Cross-Lingual Lexical Triggers (cont.)
Average mutual information, which measures how much knowing the value of one random variable reduces the uncertainty about another, has been used to identify trigger-pairs. Compute the average mutual information for every English-Chinese word-pair (e, c); see the formula below.
There are |E|×|C| possible English-Chinese word-pairs, which may be prohibitively many to search for the pairs with the highest mutual information. So first filter out infrequent words in each language (e.g. words occurring fewer than 5 times), then measure I(e;c) for all remaining pairs, sort them by I(e;c), and select the top one million pairs.
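The standard average mutual information between the two binary occurrence events, where P(e, c) is the fraction of aligned document-pairs in which e occurs on the English side and c on the Chinese side (a reconstruction of the slide's formula under this standard reading):

$$ I(e;c) = \sum_{x \in \{e,\bar e\}} \sum_{y \in \{c,\bar c\}} P(x, y) \log \frac{P(x, y)}{P(x)\, P(y)} $$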

Estimating Trigger LM Probabilities
Estimate probabilities P_Trig(c|e) and P_Trig(e|c) for use in lieu of the translation probabilities P_T(c|e) and P_T(e|c). P_Trig(c|e) is based on the unigram frequency of c among the Chinese word tokens in the subset of aligned documents d_i^C whose English sides d_i^E contain e.
An alternative is to make P_Trig(c|e) proportional to the mutual information I(e;c), setting I(e;c) = 0 whenever (e, c) is not a trigger-pair; this alternative is found to be more effective.
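A sketch of the MI-proportional estimate described above (the normalization is an assumption, chosen to make it a proper conditional distribution):

$$ P_{Trig}(c \mid e) = \frac{I(e;c)}{\sum_{c'} I(e;c')}, \qquad I(e;c) := 0 \ \text{if} \ (e,c) \ \text{is not a trigger-pair} $$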

Estimating Trigger LM Probabilities (cont.)
Interpolated model:
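Presumably analogous to the CL-unigram case: the trigger-based unigram for a story sums trigger probabilities over the English mate, and is then interpolated with the story-independent trigram,

$$ P_{Trig\text{-}unigram}(c \mid d_i^E) = \sum_{e \in d_i^E} P_{Trig}(c \mid e)\, P(e \mid d_i^E), $$
$$ P(c_k \mid c_{k-1}, c_{k-2}, d_i^E) = \lambda\, P(c_k \mid c_{k-1}, c_{k-2}) + (1-\lambda)\, P_{Trig\text{-}unigram}(c_k \mid d_i^E). $$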

Cross-Lingual Latent Semantic Analysis
CL-LSA is a standard automatic technique for extracting corpus-based relations between words or documents. Assume that a document-aligned Chinese-English bilingual corpus is provided.
The first step is to represent the corpus as a word-document co-occurrence frequency matrix W, in which each row represents a word in one of the two languages and each column a document-pair. W is an M×N matrix, where M = |C ∪ E| and N is the number of document-pairs. Each element w_ij of W contains the count of the i-th word in the j-th document-pair.
Next, each row of W is weighted by some function that deemphasizes frequent (function) words in either language, such as the inverse of the number of documents in which the word appears.

CL-LSA (cont.)
Then SVD is performed on W, truncated to some rank R << min{M, N}: W ≈ U S V^T. In the rank-R approximation, the j-th column W_{*j} of W, i.e. the document-pair (d_j^E, d_j^C), is a linear combination of the columns of U×S, the weights for the linear combination being provided by the j-th column of V^T. Similarly, each row of W, i.e. a word of either language, is a linear combination of the rows of S×V^T, with weights given by the corresponding row of U.
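A minimal numerical sketch with a toy 4×4 bilingual word-document matrix (counts invented for illustration), using SciPy's sparse truncated SVD:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy bilingual word-document matrix: rows are words from C ∪ E, columns
# are document-pairs; the counts below are invented for illustration.
W = csr_matrix(np.array([
    [2., 0., 1., 0.],   # English word "bank"
    [0., 3., 0., 1.],   # English word "election"
    [2., 0., 1., 0.],   # Chinese word 银行
    [0., 2., 0., 2.],   # Chinese word 选举
]))

R = 2                       # rank of the approximation, R << min(M, N)
U, s, Vt = svds(W, k=R)     # W ≈ U @ np.diag(s) @ Vt

# Each word (row of W) is represented by the corresponding row of U×S;
# cross-lingual semantic similarity is the cosine between these rows.
word_vecs = U @ np.diag(s)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(word_vecs[0], word_vecs[2]))  # "bank" vs. 银行 → 1.0
```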

Cross-Language IR
CL-LSA provides a way to measure the similarity between a Chinese query and an English document without using a translation lexicon P_T(e|c). Construct a word-document matrix using the English corpus; all rows corresponding to Chinese vocabulary items have zeros in this matrix. Project each d_j^E into the semantic space to obtain its R-dimensional representation. Similarly, project the Chinese query d_i^C and calculate the cosine-similarity between the query and the documents.
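The projection is the standard LSA folding-in step (a sketch; d denotes the weighted word-count vector of a query or document, with zeros in the rows of the other language):

$$ \hat d = S^{-1} U^T d \;\in\; \mathbb{R}^R $$

Similarity is then the cosine between the projected query and each projected document.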

LSA-Derived Translation Probabilities
The CL-LSA framework can also be used to construct the translation model P_T(c|e). In the matrix W, each word is represented as a row, no matter whether it is English or Chinese. Projecting the words into the R-dimensional space yields the rows of U, and semantic similarity is measured by cosine-similarity, giving a word-to-word translation model.
This exploits a large English corpus to improve Chinese LMs, and the use of a document-aligned Chinese-English corpus overcomes the need for a translation lexicon.
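One plausible form of such a word-to-word model (the sharpening exponent γ and the exact normalization are assumptions; the paper's precise definition may differ), restricted to word-pairs with nonnegative similarity:

$$ P_T(c \mid e) = \frac{\cos(u_e S,\, u_c S)^{\gamma}}{\sum_{c'} \cos(u_e S,\, u_{c'} S)^{\gamma}} $$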

Topic-Dependent Language Models
The combination of story-dependent unigram models with a story-independent trigram model using linear interpolation seems a good choice, as they are complementary. To contrast performance with CL-lexical triggers and CL-LSA, monolingual topic-dependent LMs are constructed using the well-known k-means clustering algorithm, with a bag-of-words centroid representing each topic. Each story is assigned to the topic-centroid t_i with the highest TF-IDF weighted cosine-similarity.
We believe the topic-trigram model is a better model, making for an informative, even if unfair, comparison.

Training and Test Corpora
Parallel Corpus: Hong Kong News
–Used for training GIZA++, construction of trigger-pairs, and the cross-lingual experiments.
–Contains 18,147 aligned document-pairs (actually a sentence-aligned corpus).
–Dates from July 1997 to April.
–A few articles containing nonstandard Chinese characters are removed.
–16,010 document-pairs for training, 750 for testing.
–4.2M-word Chinese training set, 177K-word Chinese test set.
–4.3M-word English training set, 182K-word English test set.

Training and Test Corpora (cont.)
Monolingual Corpora:
–XINHUA: 13 million words, used to estimate the baseline trigram LM.
–HUB-4NE: a trigram model is estimated from the 96K words in the transcriptions used for acoustic model training.
–NAB-TDT: contemporaneous English texts; articles containing about 30 million words.

Experimental Results
Cross-Lingual Mate Retrieval: CL-LSA vs. vector-based IR, where the vector-based IR system uses a well-tuned translation dictionary P_T(e|c) (trained by GIZA++). Due to memory limitations, R = 693 was the maximum SVD dimension that could be computed.

Experimental Results (cont.) Baseline ASR Performance of Cross-Lingual LMs P-value are based on the standard NIST MAPSSWE test.MAPSSWE The improvement brought by CL-interpolated LM is not statistically significant on XINHUA. On HUB-4NE, Chinese LM text is scare, the CL-interpolated LM delivers considerable benefits via the large English Corpus.

Experimental Results (cont.)
Likelihood-Based Story-Specific Selection of the Interpolation Weight and the Number of English Documents per Mandarin Story
N-best documents:
–Experimented with values of N = 1, 10, 30, 50, 80, 100 and found that N = 30 is best for LM performance, but only marginally better than N = 1.
All documents above a similarity threshold:
–The argument against always taking a predetermined number of best-matching documents is that it ignores the goodness of the match.
–A threshold of 0.12 gives the lowest perplexity, but the reduction is insignificant.
–The number of documents selected now varies from story to story; for some stories even the best-matching document falls below the threshold.
–This points to the need for a story-specific strategy for choosing the number of English documents.

Experimental Results (cont.) Likelihood-based selection of the number of English documents:

Experimental Results (cont.)
The perplexity varies with the number of English documents, and the best performance is achieved at a different point for each story. For each choice of the number of documents, the interpolation weight λ is also chosen to maximize the likelihood of the first-pass output.
Choose the 1000 best-matching English documents and divide the dynamic range of their similarity scores into 10 intervals. Take the top one-tenth (not necessarily the top 100 documents), compute P_CL-unigram(c|d_i^E), determine the λ that maximizes the likelihood of the first-pass output of only the utterances in that story, and record this likelihood. Repeat this for the top two-tenths, three-tenths, and so on, obtaining the likelihood as a function of the similarity threshold. This is called the likelihood-based story-specific adaptation scheme.
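A minimal sketch of this selection loop (all names hypothetical; loglik(top_k, lam) is assumed to score the story's first-pass ASR output under the LM adapted with the top_k documents and weight lam):

```python
import numpy as np

def story_specific_selection(sorted_scores, loglik):
    # sorted_scores: similarity scores of the 1000-best English documents,
    # in decreasing order; loglik(top_k, lam): first-pass log-likelihood
    # under the LM interpolated with weight lam using the top_k documents.
    hi, lo = sorted_scores[0], sorted_scores[-1]
    thresholds = np.linspace(hi, lo, 11)[1:]          # 10 score intervals
    best_ll, best_k, best_lam = -np.inf, None, None
    for thr in thresholds:                            # top 1/10, 2/10, ...
        top_k = int(np.sum(np.asarray(sorted_scores) >= thr))
        for lam in np.linspace(0.1, 0.9, 9):          # simple grid over λ
            ll = loglik(top_k, lam)
            if ll > best_ll:
                best_ll, best_k, best_lam = ll, top_k, lam
    return best_k, best_lam

# Dummy illustration: a made-up objective preferring ~300 docs, λ ≈ 0.7.
scores = np.linspace(1.0, 0.0, 1000)
demo_ll = lambda k, lam: -((k - 300) ** 2) / 1e4 - (lam - 0.7) ** 2
print(story_specific_selection(scores, demo_ll))
```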

Comparison of Cross-Lingual Triggers and CL-LSA with Stochastic Translation Dictionaries

Experimental Results (cont.)
Comparison of Stochastic Translation with Manually Created Dictionaries: MRD is a machine-readable dictionary with 18K English-to-Chinese entries and 24K Chinese-to-English entries from the LDC translation lexicon. The MRD is used in place of a stochastic translation lexicon P_T(e|c). The MRD leads to a reduction in perplexity, but no reduction in WER.

Conclusions
A statistically significant improvement is obtained in ASR WER and in perplexity. The methods are even more effective when LM training text is hard to come by. We have proposed methods to build cross-lingual language models that do not require MT. By using mutual-information statistics and latent semantic analysis on a document-aligned corpus, we can extract a significant amount of information for language modeling.
Future work:
–Develop maximum entropy models to more effectively combine the multiple information sources.

Separability between intra- and inter-topic pairs is much better in the LSA space than in the original space.

W = 7×4 matrix (word-command matrix), R = 2