Maximum Entropy techniques for exploiting syntactic, semantic and collocational dependencies in Language Modeling
Sanjeev Khudanpur, Jun Wu
Center for Language and Speech Processing, JHU

Abstract
Combine two sources of long-range statistical dependence:
– the syntactic structure
– the topic of a sentence
These dependencies are integrated using the maximum entropy technique.

Topic information
Use word frequencies to construct a separate N-gram model for each individual topic.
– Problem: fragmentation of the training text by topic
– Remedy: interpolate each topic-specific N-gram model with a topic-independent model, which is constructed using all the available data
Latent semantic analysis (Bellegarda, 1998)

Syntactic information
Chelba and Jelinek (1998, 1999) have used a left-to-right parser to extract syntactic heads.
Example: "Financial officials in the former British colony consider the contract essential to the revival of the Hong Kong futures exchange"
Head words: "officials consider" vs. "colony consider"

Combining topic and N-gram dependencies
Topic-dependent features: unigram frequencies collected from all documents on a specific topic t in the training corpus
Topic-independent features: overall N-gram frequencies in the training corpus
The model is P(w_i | w_{i-1}, w_{i-2}, t_i).

ME parameters
We use the long-range history w_1, …, w_{i-1} to assign a topic t_i = t(w_1, …, w_{i-1}) to a conversation.
The conditioning history is h_i = [w_{i-1}, w_{i-2}, t_i].
Constraints tie the model to the empirical distribution of the training data.
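Presumably these constraints take the standard maximum-entropy form, requiring the model's expectation of each N-gram and topic-unigram feature f to match its empirical expectation in the training data (a sketch of the generic constraint, not copied from the slide):

```latex
\sum_{h_i, w_i} \tilde{P}(h_i)\, P_{\Lambda}(w_i \mid h_i)\, f(h_i, w_i)
  \;=\; \sum_{h_i, w_i} \tilde{P}(h_i, w_i)\, f(h_i, w_i)
```

Here \tilde{P} denotes the empirical distribution over the training data and h_i = [w_{i-1}, w_{i-2}, t_i].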

ME parameters
Z is a suitable normalization constant.
The first three numerator terms correspond to standard N-gram constraints, while the fourth is a topic-unigram parameter determined by word frequencies in the particular topic.
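A sketch of the model form this description implies, with one exponential factor per constraint type; the exact parameterization is an assumption based on the four numerator terms named above:

```latex
P(w_i \mid w_{i-2}, w_{i-1}, t_i) \;=\;
  \frac{e^{\lambda_{w_i}} \cdot e^{\lambda_{w_{i-1}, w_i}} \cdot e^{\lambda_{w_{i-2}, w_{i-1}, w_i}} \cdot e^{\lambda_{t_i, w_i}}}
       {Z(w_{i-2}, w_{i-1}, t_i)}
```

The first three factors carry the unigram, bigram and trigram constraints, the last carries the topic-conditional unigram constraint, and Z sums the numerator over the vocabulary.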

Computational issues
Parameters are estimated with the GIS (Generalized Iterative Scaling) algorithm.
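A minimal sketch of Generalized Iterative Scaling for a conditional model over a small vocabulary, assuming binary indicator features; the function and feature names are illustrative only, and the authors' implementation is far more heavily optimized than this toy version:

```python
import math
from collections import defaultdict

def train_gis(events, features, num_iters=50):
    """Toy Generalized Iterative Scaling for a conditional ME model.

    events   -- list of (history, word, count) tuples from training data
    features -- function (history, word) -> list of active binary feature names
    Returns a dict mapping feature name -> weight lambda_f.
    """
    vocab = sorted({w for _, w, _ in events})
    total = float(sum(c for _, _, c in events))

    # GIS constant: max number of active features on any (history, word) pair.
    # (Strictly, GIS requires a slack feature so every pair has exactly C active
    #  features; that correction is omitted in this sketch.)
    C = max(len(features(h, w)) for h, _, _ in events for w in vocab)

    # Empirical feature expectations E_~p[f]
    emp = defaultdict(float)
    for h, w, c in events:
        for f in features(h, w):
            emp[f] += c / total

    lam = defaultdict(float)  # features never observed in training keep weight 0 here
    for _ in range(num_iters):
        # Model feature expectations E_p[f] under the current weights
        mod = defaultdict(float)
        for h, _, c in events:
            scores = [math.exp(sum(lam[f] for f in features(h, w))) for w in vocab]
            Z = sum(scores)
            for w, s in zip(vocab, scores):
                for f in features(h, w):
                    mod[f] += (c / total) * s / Z
        # GIS update: lambda_f += (1/C) * log(E_~p[f] / E_p[f])
        for f in emp:
            lam[f] += math.log(emp[f] / mod[f]) / C
    return lam
```

For the topic model of the previous slides, the feature function would return the active unigram, bigram, trigram and topic-unigram indicators for a given history and candidate word.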

Topic assignment for test utterances
Topic assignment must be based on recognizer hypotheses.
The topic of a conversation may change as the conversation progresses.
Assign a topic to:
– an entire test conversation
– each utterance
– parts of an utterance

Experimental setup
|V| = 22K words
Training corpus:
– nearly 1200 Switchboard conversations, 2.1M words
– each conversation is annotated with one of about 70 topics
Test corpus:
– 19 conversations (38 conversation sides), 18K words, over 2400 utterances
100-best hypotheses are generated by the HTK recognizer using a back-off bigram LM.
Evaluation: WER when rescoring these hypotheses, and perplexity of the transcriptions.

Baseline experiments
When only N-gram constraints are used, the ME model essentially replicates the performance of the corresponding back-off N-gram model.

Estimation of topic-conditional models
Each conversation side in the training corpus is processed to obtain a representative vector of weighted frequencies of vocabulary terms, excluding stop words (a stop word is any of a list of about 700 words).
These vectors are then clustered using a K-means procedure (K ≈ 70).
f_t: relative word frequency in a cluster t.
Words related to topic t are selected: about 16K of the 22K vocabulary words, which constitute about 8% of the 2.1M training tokens.
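A rough sketch of this clustering step, assuming a TF-IDF-style weighting for the "weighted frequencies" (the slide does not specify the weighting scheme); the function name and the use of scikit-learn are illustrative conveniences, not part of the original system:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_conversation_sides(side_texts, stop_words, k=70):
    """Cluster training conversation sides into k topic clusters.

    side_texts -- list of strings, one per conversation side
    stop_words -- list of ~700 stop words to exclude
    """
    vectorizer = TfidfVectorizer(stop_words=stop_words)   # weighting is an assumption
    X = vectorizer.fit_transform(side_texts)              # one row per conversation side
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return km.labels_, km.cluster_centers_, vectorizer
```

The cluster centroids can then serve as topic representatives, and per-cluster relative word frequencies f_t can be computed from the sides assigned to each cluster.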

Topic assignment during testing
Hard decision using a cosine similarity measure.
Four options for assignment:
– manual assignment
– reference transcriptions
– 10-best hypotheses
– assignment by an oracle to minimize perplexity (or WER)
A null topic, which defaults to the topic-independent baseline model, is available as one of the choices to the topic classifier.
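A sketch of the hard topic decision by cosine similarity with a null-topic fallback; the threshold value and function name below are illustrative, not taken from the paper:

```python
import numpy as np

def assign_topic(utterance_vector, topic_centroids, threshold=0.1):
    """Return the index of the most similar topic centroid, or None for the null topic.

    utterance_vector -- weighted word-frequency vector built from recognizer
                        hypotheses (or reference text)
    topic_centroids  -- array-like of shape (num_topics, vocab_size)
    """
    u = np.asarray(utterance_vector, dtype=float)
    centroids = np.asarray(topic_centroids, dtype=float)
    u_norm = np.linalg.norm(u) + 1e-12
    sims = centroids @ u / (np.linalg.norm(centroids, axis=1) * u_norm + 1e-12)
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else None   # None -> topic-independent model
```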

Topic assignment during testing (conversation level)
There is only a small loss in perplexity and a negligible loss in WER when the topic assignment is based on recognizer hypotheses instead of the correct transcriptions.

Topic assignment during testing (utterance level)
The best recognition performance is achieved by assigning a topic to each utterance based on the 10-best hypotheses of the current and the three preceding utterances.
Absolute WER reduction: 0.7%; relative perplexity reduction: 7%.

Topics at the two levels
For 8 out of 10 utterances, the topic-independent model (the null topic) is preferred.

Analysis of recognition performance
Divide the vocabulary into two sets:
– words which have topic-conditional unigram constraints for any of the topics
– all other words
About 7% of test set tokens have topic-dependent constraints.

Analysis of recognition performance
Divide the vocabulary simply into content-bearing words and stop words.
– 25% of the test set tokens are content-bearing words.

ME vs. interpolated topic N-gram
Interpolation weight: chosen to minimize the perplexity of the test set under each interpolated model.
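For reference, the interpolated baseline presumably has the usual linear form, with λ tuned to minimize test-set perplexity as stated above (a sketch, not copied from the slide):

```latex
P_{\text{interp}}(w_i \mid h_i) \;=\;
  \lambda \, P_{\text{topic}}(w_i \mid w_{i-1}, w_{i-2}, t_i)
  \;+\; (1 - \lambda) \, P_{\text{trigram}}(w_i \mid w_{i-1}, w_{i-2})
```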

ME vs. cache-based models
The N-best hypotheses for the preceding utterances in a conversation side are used to estimate the cache model.
– N = 100
– interpolation weight = 0.1
The cache model caches recognition errors, which leads to repeated errors: the cache-based model has about a 0.6% higher rate of repeated errors than the baseline trigram model.
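A sketch of this cache baseline, assuming the cache is a unigram distribution over words in the 100-best hypotheses of the preceding utterances and that the stated weight of 0.1 is placed on the cache component (the assignment of the weight is an assumption):

```latex
P_{\text{cache}}(w_i \mid h_i) \;=\;
  0.9 \, P_{\text{trigram}}(w_i \mid w_{i-1}, w_{i-2})
  \;+\; 0.1 \, P_{\text{unigram-cache}}(w_i)
```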

Combining syntactic and N-gram dependencies
All sentences in the training data are parsed with the left-to-right parser presented by Chelba and Jelinek (1998).
This parser generates a stack S_i of candidate parse trees T_{ij} for a sentence prefix W_1^{i-1} = w_1, w_2, …, w_{i-1} at position i.
It also assigns a probability P(w_i | W_1^{i-1}, T_{ij}) to each possible following word w_i given the j-th partial parse T_{ij}, and a likelihood ρ(W_1^{i-1}, T_{ij}) for the j-th partial parse.
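These quantities presumably combine as in the standard Chelba–Jelinek structured language model, i.e. a ρ-weighted mixture over the candidate parses in the stack (a reconstruction based on the cited paper, not copied from the slide):

```latex
P(w_i \mid W_1^{i-1}) \;=\;
  \sum_{T_{ij} \in S_i} P(w_i \mid W_1^{i-1}, T_{ij}) \,
  \frac{\rho(W_1^{i-1}, T_{ij})}{\sum_{T_{ik} \in S_i} \rho(W_1^{i-1}, T_{ik})}
```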

Head-word
Assume that the immediate history (w_{i-2}, w_{i-1}) and the last two head words h_{i-2}, h_{i-1} of the partial parse T_{ij} carry most of the useful information.
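In equation form, this assumption presumably reads:

```latex
P(w_i \mid W_1^{i-1}, T_{ij}) \;\approx\;
  P\bigl(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}(T_{ij}), h_{i-1}(T_{ij})\bigr)
```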

ME parameters
The first two kinds of constraints involve regular N-gram counts, and the last two involve head-word N-gram counts.

ME parameters
Z is a normalization constant.
Again, the first three terms in the numerator correspond to standard N-gram constraints; the last two terms represent head-word N-gram constraints.
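A sketch of the model form this implies, analogous to the topic model earlier; the exact parameterization is again an assumption based on the description of the numerator terms:

```latex
P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}) \;=\;
  \frac{e^{\lambda_{w_i}} \cdot e^{\lambda_{w_{i-1}, w_i}} \cdot e^{\lambda_{w_{i-2}, w_{i-1}, w_i}}
        \cdot e^{\lambda_{h_{i-1}, w_i}} \cdot e^{\lambda_{h_{i-2}, h_{i-1}, w_i}}}
       {Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1})}
```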

Recognition performance of the syntactic model
Compare the topic and syntactic models.

Analysis of recognition performance
Divide all histories in the test sentences into two categories:
– (h_{i-2}, h_{i-1}) = (w_{i-2}, w_{i-1})
– h_{i-2} ≠ w_{i-2} or h_{i-1} ≠ w_{i-1}
About 75% of the histories belong to the former category.

ME vs. interpolated syntactic models
The maximum entropy technique slightly but consistently outperforms interpolation.

Combining topic, syntactic and N-gram dependencies
Z is a normalization constant, and the parameters λ are computed to satisfy constraints on the marginal probabilities of N-grams, head-word N-grams and topic-conditional unigrams.
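Putting the pieces together, the combined model presumably multiplies all three kinds of factors under a single normalization (a sketch consistent with the constraint description above):

```latex
P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, t_i) \;=\;
  \frac{e^{\lambda_{w_i}} \cdot e^{\lambda_{w_{i-1}, w_i}} \cdot e^{\lambda_{w_{i-2}, w_{i-1}, w_i}}
        \cdot e^{\lambda_{h_{i-1}, w_i}} \cdot e^{\lambda_{h_{i-2}, h_{i-1}, w_i}} \cdot e^{\lambda_{t_i, w_i}}}
       {Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, t_i)}
```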

Analysis of recognition performance
The topic-dependent model improved prediction of content-bearing words.
The syntactic model improved prediction when the two immediately preceding head words were not within trigram range.

Analysis of recognition performance
These two kinds of information are independent.

Conclusion
ME is used to combine two diverse sources of long-range dependence with N-gram models.
Topic information helps with content words.
Syntactic information captures dependencies out of trigram range.
These two sources of information are independent.