Japanese Spontaneous Spoken Document Retrieval Using NMF-Based Topic Models
Xinhui Hu, Hideki Kashioka, Ryosuke Isotani, and Satoshi Nakamura
National Institute of Information and Communications Technology, Japan
{xinhui.hu, hideki.kashioka, ryosuke.isotani,
Asia Information Retrieval Societies Conference
Presenter: KAI-WUN SHIH

Outline
- Introduction
- Term-Document Matrix Built on N-Best
- NMF-Based Document Topic Model for Spoken Document Retrieval
- Experiments
- Conclusions

Introduction (1/4)
The search and retrieval of a document is generally conducted by matching keywords in the query against those in the target documents. A fundamental problem of information retrieval is term mismatch: users and authors of documents often use different terms to refer to the same concepts, which produces an incorrect relevance ranking of documents with respect to the information need expressed in the query. Spoken document retrieval (SDR) faces an additional problem beyond term mismatch. SDR is generally carried out by applying textual approaches to speech transcripts, so misrecognized terms will fail to match between the query and the document representations. We call this problem term misrecognition.

Introduction (2/4)
Advanced users need tools that can find underlying concepts, not just search for keywords appearing in the query. Topic models are very popular for representing the content of documents. Probabilistic latent topic modeling approaches, such as probabilistic latent semantic analysis (PLSA), have been demonstrated to be effective in spoken document retrieval tasks. Chen proposed a word topic model (WTM) to explore the co-occurrence relationships between words, as well as long-span latent topical information, for language modeling in spoken document retrieval and transcription, and verified that the WTM is a feasible alternative to existing models such as PLSA.

Introduction (3/4)
Non-negative matrix factorization (NMF) is another approach that operates in a latent semantic space. Unlike similar decomposition approaches such as singular value decomposition (SVD), NMF imposes non-negativity constraints: the decomposition is purely additive, and no cancellations between components are allowed. It is therefore regarded as suitable for finding the latent semantic structure of a document corpus and for identifying document clusters in the derived latent semantic space.
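As a minimal illustration of this difference (our toy example, not from the paper), the following sketch factorizes a small term-document count matrix with both SVD and NMF; the SVD factors typically contain negative entries, while the NMF factors are non-negative by construction:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy term-document count matrix (terms x documents), purely illustrative.
A = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 1., 0., 2.]])

# SVD: factors may mix signs, so components can cancel each other.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print("SVD factors contain negative entries:", (U < 0).any() or (Vt < 0).any())

# NMF: both factors are constrained to be non-negative (purely additive parts).
model = NMF(n_components=2, init="nndsvd", max_iter=1000, random_state=0)
G = model.fit_transform(A)   # terms x topics
H = model.components_        # topics x documents
print("NMF factors are non-negative:", (G >= 0).all() and (H >= 0).all())
```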

Introduction (4/4)
In this study, we adopt the NMF-based document topic model (DTM) approach for spontaneous spoken document retrieval (SDR). Since latent semantic indexing approaches are based on semantic relations, a relevant document can be retrieved even if a query word does not appear in that document. The focus here is mainly on dealing with term misrecognition and investigating the effectiveness of the DTM for SDR. Comparisons are conducted between this model and the conventional vector space model (VSM), since we presently limit ourselves to investigating the difference between semantic matching and keyword matching.

Term-Document Matrix Built on N-Best (1/4)
In the speech-based processing phase:
The spoken documents are transcribed by an automatic speech recognizer (ASR). The transcription is in the form of an N-best list, in which the top N hypotheses from the recognition lattice are stored. The N-best is selected because it needs less computation and less memory than the original lattice when searching for a recognition hypothesis. Using N hypotheses makes it possible to exploit correct term candidates hidden in the lower-ranked hypotheses, and thus to compensate for the effects of term misrecognition.

Term-Document Matrix Built on N-Best (2/4)
In the text-based processing phase:
The term-document matrix used for NMF is built on an updated tf-idf-based vector space model (VSM). In a tf-idf-based VSM, the term frequency tf, defined as the number of times a term occurs in a document, and the inverse document frequency idf are the two fundamental parameters. For the N-best, we introduce a stochastic method to compute these two parameters, described as follows: let D be a document modeled by a segment of the N-best, and let P(w|o,D) be the posterior probability (confidence) of term w at position o in D, referring to the occurrence of w in the N-best.

Term-Document Matrix Built on N-Best (3/4)
The tf is evaluated by summing the posterior probabilities of all occurrences of the term in the N-best. Furthermore, we update it with Robertson's 2-Poisson model as follows:

$$ \mathrm{tf}(w,D) = \frac{\mathrm{tf}'(w,D)}{\mathrm{tf}'(w,D) + \mathrm{length}(D)/\Delta} $$

where tf' is the conventional term frequency, defined as:

$$ \mathrm{tf}'(w,D) = \sum_{o \in D} P(w \mid o, D) $$

Here length(D) is the length of document D, and Δ is the average length over the whole document set.
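A minimal sketch of this computation, assuming the reconstruction above and a toy N-best representation in which occurrences have already been merged into (word, position, posterior) entries (the data layout is our assumption, not the paper's):

```python
def expected_tf(posteriors, word):
    """tf'(w, D): sum of posterior probabilities P(w|o, D) over all
    positions o where `word` occurs in the N-best of document D.
    `posteriors` is a list of (word, position, posterior) entries."""
    return sum(p for w, o, p in posteriors if w == word)

def normalized_tf(tf_prime, doc_len, avg_len):
    """Robertson 2-Poisson update: tf = tf' / (tf' + length(D)/avg_len)."""
    return tf_prime / (tf_prime + doc_len / avg_len) if tf_prime > 0 else 0.0
```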

Term-Document Matrix Built on N-Best (4/4)
Similarly, the idf is calculated on the basis of the posterior probability of w, as shown in the following equation:

$$ \mathrm{idf}(w) = \log \frac{N_D}{\sum_{D \in C} P(w \mid D)} $$

Here N_D is the total number of documents in the corpus, C is the entire document set, and P(w|D) is the posterior probability that w occurs in D. The term-document matrix A for NMF is finally built using tf·idf. Through the processing of NMF, a topic model is constructed and used for computing the relevance of target documents to the input query.
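A sketch of assembling the term-document matrix from these quantities, reusing `expected_tf` and `normalized_tf` from the previous sketch; the posterior-based document frequency below is our reading of the slide and should be treated as an assumption:

```python
import numpy as np

def build_term_doc_matrix(docs, vocab, avg_len):
    """Build A[w, d] = tf(w, d) * idf(w) from N-best posteriors.
    `docs` is a list of dicts: {"posteriors": [(word, pos, p), ...], "length": L}."""
    n_terms, n_docs = len(vocab), len(docs)
    tf = np.zeros((n_terms, n_docs))
    for d, doc in enumerate(docs):
        for i, w in enumerate(vocab):
            tfp = expected_tf(doc["posteriors"], w)            # tf'(w, D)
            tf[i, d] = normalized_tf(tfp, doc["length"], avg_len)
    # Soft document frequency: per-document posterior of w, clipped at 1
    # (assumption standing in for P(w|D) in the idf equation above).
    df = np.array([
        sum(min(1.0, expected_tf(doc["posteriors"], w)) for doc in docs)
        for w in vocab
    ])
    idf = np.log(n_docs / np.maximum(df, 1e-10))
    return tf * idf[:, None]                                   # matrix A
```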

NMF-Based Document Topic Model for Spoken Document Retrieval - Document Topic Model and Information Retrieval (1/3)
In information retrieval (IR), the relevance measure between a query Q and a document D can be expressed as P(D|Q). By applying Bayes' theorem, it can be transformed into:

$$ P(D \mid Q) = \frac{P(Q \mid D)\,P(D)}{P(Q)} $$

Since P(Q) is invariant over all documents, and assuming that the document prior P(D) is uniform, ranking the documents by P(D|Q) can be realized through P(Q|D), the probability of query Q being generated by document D. If the query Q is composed of a sequence of terms Q = w_1 w_2 ... w_{N_q}, then P(Q|D) can be further decomposed into a product of the probabilities of the query words being generated by the document:

$$ P(Q \mid D) = \prod_{i=1}^{N_q} P(w_i \mid D) \quad (7) $$

NMF-Based Document Topic Model for Spoken Document Retrieval - Document Topic Model and Information Retrieval (2/3)
Each individual document D can be interpreted as a generative document topic model (DTM), denoted M_D, embodied with K latent topics, each expressed by a word distribution over the language. Two probabilities are therefore associated with this topic model: the probability of a latent topic given a document, and the probability of a word within a latent topic. The probability of a query word w_i being generated by D is thus:

$$ P(w_i \mid M_D) = \sum_{k=1}^{K} P(w_i \mid k)\,P(k \mid M_D) \quad (8) $$

where P(w_i|k) denotes the probability of query word w_i occurring in a specific latent topic k, and P(k|M_D) is the posterior probability of topic k given the document model M_D.

NMF-Based Document Topic Model for Spoken Document Retrieval - Document Topic Model and Information Retrieval (3/3)
Therefore, combining equations (7) and (8), the likelihood of a query Q being generated by D is represented by:

$$ P(Q \mid M_D) = \prod_{i=1}^{N_q} \sum_{k=1}^{K} P(w_i \mid k)\,P(k \mid M_D) \quad (9) $$

In this study, we compare the retrieval performance of the NMF with the conventional vector space model (VSM), where the similarity between query Q and document D is computed by the cosine measure:

$$ \mathrm{sim}(Q,D) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert\,\lVert \vec{d} \rVert} \quad (10) $$
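A sketch of both scoring rules, assuming `p_w_given_k` (terms x topics) and `p_k_given_d` (topics x documents) have already been derived from the NMF factors as described in the next subsection, and that queries and documents are tf-idf vectors for the VSM baseline (all array names are ours):

```python
import numpy as np

def dtm_score(query_term_ids, p_w_given_k, p_k_given_d, doc):
    """Equation (9): P(Q|M_D) = prod_i sum_k P(w_i|k) P(k|M_D).
    Computed in the log domain to avoid underflow on long queries."""
    probs = p_w_given_k[query_term_ids] @ p_k_given_d[:, doc]  # P(w_i|M_D) per word
    return np.sum(np.log(np.maximum(probs, 1e-12)))

def vsm_score(q_vec, d_vec):
    """Equation (10): cosine similarity between tf-idf vectors."""
    denom = np.linalg.norm(q_vec) * np.linalg.norm(d_vec)
    return float(q_vec @ d_vec / denom) if denom > 0 else 0.0
```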

NMF-Based Document Topic Model for Spoken Document Retrieval - Linking NMF to the Topic Model (1/3)
Let A be the m × n matrix produced in Section 2 to represent the relationships between terms and documents, and let S be the sum of all elements in A. Then Ā = A/S forms a normalized table that approximates the joint probability p(w,d) of term w and document d. NMF is a matrix factorization algorithm that finds a non-negative factorization of a given non-negative matrix. Assume that the given document corpus consists of K topics. The general form of NMF is defined as:

$$ \bar{A} \approx GH \quad (11) $$

The factorization of Ā results in an approximation by a product of two non-negative matrices, G and H, with dimensions m × K and K × n respectively.
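A sketch of this step using scikit-learn's NMF (one possible implementation; the paper does not specify the solver or update rule):

```python
import numpy as np
from sklearn.decomposition import NMF

def factorize(A, n_topics):
    """Normalize A to approximate p(w, d), then factorize: A_bar ~ G @ H."""
    A_bar = A / A.sum()                        # joint-probability table
    model = NMF(n_components=n_topics, init="nndsvda",
                max_iter=500, random_state=0)
    G = model.fit_transform(A_bar)             # m x K, non-negative
    H = model.components_                      # K x n, non-negative
    return G, H
```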

NMF-Based Document Topic Model for Spoken Document Retrieval - Linking NMF to the Topic Model (2/3)
From equation (11), the joint probability p(w,d) can be expressed as:

$$ p(w,d) \approx \sum_{k=1}^{K} G_{w,k} H_{k,d} \quad (12) $$

Normalizing G by α_k = Σ_w G_{w,k} and H by β_k = Σ_d H_{k,d}, and defining p(k) = α_k β_k, p(w,d) can be rewritten as:

$$ p(w,d) \approx \sum_{k=1}^{K} \frac{G_{w,k}}{\alpha_k} \cdot \frac{H_{k,d}}{\beta_k} \cdot \alpha_k \beta_k = \sum_{k=1}^{K} p(w \mid k)\,p(d \mid k)\,p(k) \quad (13) $$

Each entry g_{w,k} = G_{w,k}/α_k of the normalized matrix accounts for the probability that a word w_i is generated by latent topic k, i.e. P(w_i|k) in equation (9). Each entry h_{k,d} = H_{k,d}/β_k accounts for the probability of document D given latent topic k, i.e. P(M_D|k).

NMF-Based Document Topic Model for Spoken Document Retrieval - Linking NMF to the Topic Model (3/3)
The P(k|M_D) of equation (8) can then be obtained by Bayes' theorem:

$$ P(k \mid M_D) = \frac{P(M_D \mid k)\,p(k)}{\sum_{k'=1}^{K} P(M_D \mid k')\,p(k')} \quad (14) $$

Based on the above equations, the relevance in equation (9) can be computed from G and H, the factorized matrices of the NMF.
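Continuing the sketch, equations (12)-(14) translate into a few lines of numpy (array names are ours):

```python
import numpy as np

def to_topic_model(G, H):
    """Turn NMF factors into the probabilities used by equations (9) and (14)."""
    alpha = np.maximum(G.sum(axis=0), 1e-12)    # alpha_k = sum_w G[w, k]
    beta = np.maximum(H.sum(axis=1), 1e-12)     # beta_k  = sum_d H[k, d]
    p_w_given_k = G / alpha                     # P(w|k): columns sum to 1
    p_d_given_k = H / beta[:, None]             # P(M_D|k): rows sum to 1
    p_k = alpha * beta                          # p(k) = alpha_k * beta_k
    # Equation (14): P(k|M_D) via Bayes' theorem, normalized over topics.
    joint = p_d_given_k * p_k[:, None]          # P(M_D|k) p(k), shape K x n
    p_k_given_d = joint / joint.sum(axis=0)
    return p_w_given_k, p_k_given_d
```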

Experiments - Experimental Setup (1/2)
A test collection of the CSJ (Corpus of Spontaneous Japanese) is used for evaluation. It contains 658 hours of speech, consisting of approximately 7.5 million words, provided by more than 1,400 speakers of various ages. The collection consists of a set of 39 textual queries, the corresponding relevant-segment lists, and transcriptions by an automatic speech recognition (ASR) system, allowing retrieval over 2,702 spoken documents of the CSJ. The large-vocabulary continuous ASR system uses an engine whose acoustic model is trained on a corpus in the travel domain, while the language model is trained on the manually built transcripts of the CSJ corpus.

Experiments - Experimental Setup (2/2)
The word accuracy of the recognition system is 60.5%. Because the criterion for judging relevance does not depend merely on the query's keywords, the semantic content must be taken into consideration. Three types of transcripts of the same spoken documents are used for evaluation:
(1) N-best (here 10-best, denoted nbst);
(2) 1-best (denoted 1bst);
(3) manual transcript (denoted tran).
Mean average precision (MAP) is used as the performance measure in this study.
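For reference, a minimal sketch of the MAP measure as commonly defined (our implementation, not the paper's evaluation script):

```python
def average_precision(ranked_docs, relevant):
    """AP for one query: mean of precision@i over ranks i holding a relevant doc."""
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: mean AP over queries; `runs` is a list of (ranked_docs, relevant_set)."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```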

Experiments - Retrieval Performance vs. Topic Number
Figure 1 shows the MAP of retrieving nbst using the proposed NMF-based model and the conventional tf-idf-based VSM for different topic numbers. As the topic number increases, the MAP on nbst increases almost monotonically. Once the topic number exceeds a threshold (here, 700), the MAP of the NMF-based model surpasses that of the VSM. It can therefore be concluded that the NMF mainly functions in a high-dimensional semantic space.

Experiments - Effectiveness on Different Data Types
Table 1 shows the retrieval performance for the different data types using the NMF and VSM methods; in this experiment, the number of topics was fixed at a value above the threshold observed in Fig. 1. For all three data types, the NMF method proved superior to the VSM. Although the improvement is not large, the significance of the NMF is that its retrieval operates at the semantic level, so it has the potential to deal with the misrecognition problem. Meanwhile, systems that use the N-best perform better than those that use the 1-best; this matches the findings of other research on lattice-based spoken document retrieval, where the N-best can retrieve the correct speech segment by exploiting multiple recognition hypotheses even when the 1-best hypothesis is incorrect.

Conclusions
In this paper, we proposed an NMF-based document topic model for Japanese spoken document retrieval. In experiments on the CSJ spoken corpus, the retrieval performance of the NMF-based topic model improved steadily as the number of topics increased. When the topic number becomes sufficiently large, the NMF-based model outperforms the conventional tf-idf-based VSM. We also showed that, as in the case of the VSM, the N-best is effective in compensating for misrecognition under the proposed NMF-based model. Moreover, its improvement from 1-best to N-best (6.2%) is larger than that of the VSM (3.4%). This gain is attributed to the defining characteristic of the topic model: matching at the topic level.