Bag-of-word normalized n-gram models, ISCA 2008. Abhinav Sethy, Bhuvana Ramabhadran, IBM T. J. Watson Research Center, Yorktown Heights, NY. Presented by Patty Liu.

2 BOW-normalized n-gram model (1/3) The BOW model is expected to capture the latent topic information while the n-gram model captures the speaking style and linguistic elements from the training data that it is built on. By combining the two models, we now have a semantic language model that also effectively captures local dependencies.
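One way to write the combination this slide describes, consistent with the permute-sum normalization developed on the later slides (the notation is ours, not taken from the paper): the BOW model supplies the probability of the bag of words, and the n-gram model distributes that probability across the possible orderings,

$$P(W) \;=\; P_{\text{BOW}}\big(c(W)\big)\,\frac{P_{\text{ng}}(W)}{\displaystyle\sum_{W' \in S(c(W))} P_{\text{ng}}(W')},$$

where $c(W)$ is the bag-of-words count vector of the sequence $W$ and $S(c(W))$ is the set of its distinct permutes. By construction, the probabilities of all permutes of a bag sum to the BOW probability of that bag.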

3 BOW-normalized n-gram model (2/3) I. BOW Model: In a BOW model, the probability of a text sequence $W$ is independent of the order of its words, i.e., $P_{\text{BOW}}(W') = P_{\text{BOW}}(W)$ for every $W' \in S(c(W))$, where $S(c(W))$ denotes the set of all distinct permutes of the words with counts represented by the BOW vector $c(W)$.
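As a concrete toy illustration (an example of our own; the word list below is arbitrary): the number of distinct permutes of a $K$-word bag with counts $c_1, \ldots, c_V$ is the multinomial coefficient $K!/\prod_j c_j!$, and a BOW model assigns each of them the same probability.

```python
from collections import Counter
from itertools import permutations
from math import factorial

words = ["the", "dog", "the", "cat"]      # arbitrary toy sequence, K = 4
counts = Counter(words)                   # BOW count vector c(W): {'the': 2, 'dog': 1, 'cat': 1}

# S(c(W)): all distinct orderings of this bag.
distinct_permutes = set(permutations(words))

# Multinomial coefficient K! / (2! * 1! * 1!) = 12.
num_expected = factorial(len(words))
for c in counts.values():
    num_expected //= factorial(c)

assert len(distinct_permutes) == num_expected == 12
# A BOW model scores all 12 orderings identically, since it sees only the counts.
```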

4 BOW-normalized n-gram model (3/3) II. N-gram Model: For an n-gram model, the probability assigned to a text sequence depends on the order of the words.
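Putting the two components together, the composite probability uses the n-gram chain rule $P_{\text{ng}}(W) = \prod_i P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$ to rank the permutes of a bag. Below is a minimal brute-force sketch, assuming hypothetical `ngram_logprob(word_list)` and `bow_logprob(count_vector)` scoring functions (stand-ins for whatever LM toolkit is actually used, not the authors' code):

```python
from collections import Counter
from itertools import permutations
from math import exp, log

def composite_logprob(words, ngram_logprob, bow_logprob):
    """log P(W) = log P_BOW(c(W)) + log P_ng(W) - log(permute sum of P_ng)."""
    bag = Counter(words)
    # Brute-force permute sum over all distinct orderings; this is only
    # feasible for short word blocks (see the complexity slides below).
    permute_sum = sum(exp(ngram_logprob(list(p))) for p in set(permutations(words)))
    return bow_logprob(bag) + ngram_logprob(list(words)) - log(permute_sum)
```

The permute sum in the denominator is exactly the quantity whose efficient computation the next slides address.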

5 Computational complexity (1/5) ◎ Generating the permutes in lexicographic order For a text sequence of length $K$, each computation of $P_{\text{ng}}(W')$ requires $K$ lookups. However, for an n-gram model the probability of a word is conditioned only on the previous $n-1$ words, which allows us to share computations across subsequences which share the same history: the joint probability of the first $0..i$ words in the sequence can be computed once and reused by every permute that shares that prefix. Thus, by generating the permutes in an alphabetically sorted (lexicographic) order, we can significantly reduce the number of lookups required.
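In n-gram terms, the shared quantity is the prefix probability, which can be built up one word at a time (notation ours):

$$\pi_i \;=\; \prod_{j=1}^{i} P\big(w_j \mid w_{j-n+1}, \ldots, w_{j-1}\big) \;=\; \pi_{i-1}\,P\big(w_i \mid w_{i-n+1}, \ldots, w_{i-1}\big),$$

so two permutes that agree on their first $i$ words also agree on $\pi_i$, and only the positions after $i$ need new lookups.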

6 Computational complexity (2/5) [Table 1 on the original slide: the distinct permutes of an example five-word sequence listed in lexicographic order, with the number of n-gram lookups each permute requires.]

7 Computational complexity (3/5) Without lexicographic ordering, we would need 5 lookup operations per permute to compute the probabilities. However, this computation can be reduced by sharing subsequence probabilities. For example, the last entry in Table 1 requires only two lookups, as it shares its first 3 words with the previous sequence in the ordered list.
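A small self-contained sketch of this saving (illustrative only; a real implementation would cache the prefix probabilities themselves rather than just count lookups):

```python
from itertools import permutations

def count_lookups(words):
    """Compare the n-gram lookups needed to score every distinct permute
    naively versus when consecutive permutes (in lexicographic order)
    share the probability of their common prefix."""
    perms = sorted(set(permutations(words)))
    k = len(words)
    naive = len(perms) * k                       # K lookups per permute
    shared, prev = 0, None
    for p in perms:
        if prev is None:
            common = 0                           # first permute: score everything
        else:
            common = next((i for i in range(k) if p[i] != prev[i]), k)
        shared += k - common                     # only re-score the changed suffix
        prev = p
    return naive, shared

print(count_lookups(["a", "b", "c", "d", "e"]))  # (600, 325): roughly half here,
                                                 # and the savings grow with length
```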

8 Computational complexity (4/5) ◎ A position-based threshold To further speed up the computation of the permute sum, we can use a position-based threshold. If a change at index $i$ in the lexicographic permute sequence reduces the prefix probability below a fraction of the average, we can disregard all permutes which share that prefix and move directly to the next lexicographic sequence which has a different word at index $i$. In other words, whenever the prefix probability after the changed word is low, none of the permutes built on that prefix need to be evaluated, and the search jumps straight to the next prefix that differs at index $i$.
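A sketch of this pruning on top of the lexicographic enumeration, again assuming a hypothetical `ngram_logprob(word_list)` scorer that can score partial sequences; the threshold here compares against the best prefix seen at each position rather than the average, which is our simplification:

```python
from collections import Counter
from math import exp

def pruned_permute_sum(words, ngram_logprob, ratio=1e-4):
    """Approximate the permute sum with a position-based threshold: if the
    prefix probability after changing the word at position i falls below
    `ratio` times the best prefix probability seen at that position, every
    permute sharing that prefix is skipped."""
    best = {}                                    # best prefix probability per position
    total = 0.0

    def recurse(prefix, remaining):
        nonlocal total
        i = len(prefix)
        for w in sorted(remaining):              # lexicographic order over distinct words
            cand = prefix + [w]
            p = exp(ngram_logprob(cand))         # prefix probability; assumes the scorer
                                                 # can score partial word sequences
            if p < ratio * best.get(i, 0.0):
                continue                         # prune all permutes sharing this prefix
            best[i] = max(best.get(i, 0.0), p)
            rest = remaining - Counter([w])
            if rest:
                recurse(cand, rest)
            else:
                total += p                       # complete permute: add its probability

    recurse([], Counter(words))
    return total
```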

9 Computational complexity (5/5) ◎ Block size For large blocks of words, even with lexicographic ordering and pruning, the calculation of the permute sum becomes computationally demanding. In practice, for language modeling for ASR, we have found that chunking text into blocks of uniform size and calculating the permute sum individually for each chunk gives results comparable to the full computation. The block size controls a tradeoff between the computation cost (approximately m! for a block of size m) and the approximation error.
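To make the factorial growth concrete (our own arithmetic, not from the slides), the worst-case number of permutes per block is $m!$:

$$5! = 120, \qquad 10! \approx 3.6 \times 10^{6}, \qquad 15! \approx 1.3 \times 10^{12},$$

so blocks much larger than about 10 words quickly become impractical even with lexicographic sharing and pruning, while smaller blocks increase the approximation error relative to the full permute sum.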

10 Experiments (1/2) The LVCSR system used for our experiments is based on the 2007 IBM Speech transcription system for the GALE Distillation Go/No-go Evaluation. We used the 1996 CSR Hub4 Language Model data (LDC98T31) for building the models in our experiments. Kneser-Ney smoothing was used for building the baseline 4-gram language model. We used Gibbs sampling and LDA to build a 40-topic LDA-based BOW model.
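Purely as a hypothetical sketch of how a 40-topic LDA BOW model might be estimated (the slide states Gibbs sampling was used; gensim's default variational LdaModel, the file name, and the whitespace tokenization below are all our stand-ins, not the authors' setup):

```python
from gensim import corpora, models

# Assumed corpus file: one training document per line, whitespace-tokenized.
docs = [line.split() for line in open("hub4_lm_text.txt", encoding="utf-8")]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# 40-topic LDA model over the LM training text; the document likelihood under
# this model plays the role of the BOW term in the composite language model.
lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=40)
```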

11 Experiments (2/2) We chose a block size of 10 as a good tradeoff between the perplexity and computational complexity. At a block size of 10, the composite BOW-normalized n-gram model gives a relative perplexity improvement of 12%. Rescoring with the proposed composite BOW/n-gram model reduced WER from 17.1% to 16.8% on the RT-04 set. This corresponds to a 1.7% relative reduction in WER.