Better Punctuation Prediction with Dynamic Conditional Random Fields
Wei Lu and Hwee Tou Ng
National University of Singapore

Talk Overview
Background
Related Work
Approaches
– Previous approach: Hidden Event Language Model
– Previous approach: Linear-Chain CRF
– This work: Factorial CRF
Evaluation
Conclusion

Punctuation Prediction
Automatically insert punctuation symbols into transcribed speech utterances
Widely studied in the speech processing community
Example:
– Original speech utterance: you are quite welcome and by the way we may get other reservations so could you please call us as soon as you fix the date
– Punctuated (and cased) version: You are quite welcome. And by the way, we may get other reservations, so could you please call us as soon as you fix the date?

Our Task
Processing prosodic features requires access to the raw speech data, which may be unavailable
This work tackles the problem from a text processing perspective
Perform punctuation prediction for conversational speech texts without relying on prosodic features

Related Work
With prosodic features
– Kim and Woodland (2001): a decision tree framework
– Christensen et al. (2001): a finite state and a multi-layer perceptron approach
– Huang and Zweig (2002): a maximum entropy-based approach
– Liu et al. (2005): linear-chain conditional random fields
Without prosodic features
– Beeferman et al. (1998): comma prediction with a trigram language model
– Gravano et al. (2009): an n-gram based approach

Related Work (continued)
One well-known approach that does not exploit prosodic features
– Stolcke et al. (1998) presented a hidden event language model
– It treats boundary detection and punctuation insertion as an inter-word hidden event detection task
– Widely used in many recent spoken language translation tasks as either a pre-processing (Wang et al., 2008) or post-processing (Kirchhoff and Yang, 2007) step

Hidden Event Language Model
HMM (Hidden Markov Model)-based approach
– A joint distribution over words and inter-word events
– Observations are the words, and word/event pairs are the hidden states
Implemented in the SRILM toolkit (Stolcke, 2002)
Variant of this approach
– Relocates/duplicates the ending punctuation symbol to be closer to the indicative words
– Works well for predicting English question marks
Example: "where is the nearest bus stop ?" becomes "? where is the nearest bus stop"
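As a rough illustration of the relocation variant described above, the sketch below (a hypothetical helper of my own, not code from the paper or from SRILM) moves a sentence-final question mark to the front of the token sequence so that it sits next to the indicative words before language model training:

# Minimal sketch, assuming tokens are already split on whitespace.
def relocate_ending_qmark(tokens):
    """Move a sentence-final '?' to the front of the token sequence."""
    if tokens and tokens[-1] == "?":
        return ["?"] + tokens[:-1]
    return tokens

print(relocate_ending_qmark("where is the nearest bus stop ?".split()))
# ['?', 'where', 'is', 'the', 'nearest', 'bus', 'stop']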

Linear-Chain CRF
Linear-chain conditional random fields (L-CRF): undirected graphical model used for sequence learning
– Avoids the strong assumptions about dependencies made in the hidden event language model
– Capable of modeling dependencies with arbitrary non-independent overlapping features
[Figure: linear-chain CRF with word-layer tags Y1 ... Yn over the utterance X1 ... Xn]
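For reference, a linear-chain CRF defines the conditional probability of a word-layer tag sequence y given the utterance x in the standard form below (textbook notation, not copied from the slides):

p(y \mid x) \;=\; \frac{1}{Z(x)} \prod_{t=1}^{n} \exp\Big( \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big)

where the f_k are feature functions, the \lambda_k their learned weights, and Z(x) normalizes over all possible tag sequences.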

An Example L-CRF
A linear-chain CRF assigns a single tag to each individual word at each time step
– Tags: NONE, COMMA, PERIOD, QMARK, EMARK
– Factorized features
Sentence: no, please do not. would you save your questions for the end of my talk, when i ask for them?

Words: no     please  do    not     would  you   ...  my    talk   when  ...  them
Tags:  COMMA  NONE    NONE  PERIOD  NONE   NONE  ...  NONE  COMMA  NONE  ...  QMARK

Features for L-CRF
Feature factorization (Sutton et al., 2007)
– Product of a binary function on the assignment of the set of cliques at each time step, and a feature function defined solely on the observation sequence
– Feature functions: n-gram (n = 1, 2, 3) occurrences within 5 words of the current word
Example: for the word "do" in the sentence above, the unigrams, bigrams, and trigrams occurring within 5 words of "do" serve as its features
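The sketch below illustrates the kind of n-gram window features just described. It is my own construction, not the authors' code: the feature naming scheme, the offset annotation, and the reading of "within 5 words" as a symmetric 5-word window on each side are all assumptions.

def ngram_window_features(words, i, window=5, max_n=3):
    """Return n-gram (n = 1..max_n) features drawn from a window around position i."""
    lo = max(0, i - window)
    hi = min(len(words), i + window + 1)
    feats = []
    for n in range(1, max_n + 1):
        for start in range(lo, hi - n + 1):
            ngram = "_".join(words[start:start + n])
            # record the n-gram together with its offset from the current word
            feats.append(f"{n}gram[{start - i}]={ngram}")
    return feats

words = "no please do not would you save your questions".split()
print(ngram_window_features(words, 2)[:5])   # features for the word "do"

Features of this form, attached to every position, would then be paired with the clique assignments at each time step during CRF training.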

Problems with L-CRF
Long-range dependencies between the punctuation symbols and the indicative words cannot be captured properly
For example: no please do not would you save your questions for the end of my talk when i ask for them
It is hard for a linear-chain CRF to capture the long-range dependency between the ending question mark (?) and the initial phrase "would you"

Problems with L-CRF
What humans might do
– Input: no please do not would you save your questions for the end of my talk when i ask for them
– Output: no, please do not. would you save your questions for the end of my talk, when i ask for them?
Sentence-level punctuation (. ? !) is associated with the complete sentence, and therefore should be assigned at the sentence level

What Do We Want?
A model that jointly performs all of the following three tasks
– Sentence boundary detection (or sentence segmentation)
– Sentence type identification
– Punctuation insertion

Factorial CRF
An instance of dynamic CRF
– Two-layer factorial CRF (F-CRF) jointly annotates an observation sequence with two label sequences
– Models the conditional probability of the label sequence pair given the observation sequence X
[Figure: two-layer factorial CRF with sentence-layer tags Z1 ... Zn and word-layer tags Y1 ... Yn over the utterance X1 ... Xn]
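As a reference point, a two-layer factorial CRF of the kind sketched above typically factorizes as follows (standard form following Sutton et al. (2007); the potential names \Psi, \Phi, \Omega are my notation, not taken from the slides):

p(y, z \mid x) \;=\; \frac{1}{Z(x)} \prod_{t=1}^{n} \Psi_t(y_{t-1}, y_t, x)\, \Phi_t(z_{t-1}, z_t, x)\, \Omega_t(y_t, z_t, x)

where y is the word-layer tag sequence, z the sentence-layer tag sequence, the first two potentials score within-layer transitions, and the third scores the cotemporal pairing of the two layers at each time step.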

Example of F-CRF
Propose two sets of tags for this joint task
– Word-layer: NONE, COMMA, PERIOD, QMARK, EMARK
– Sentence-layer: DEBEG, DEIN, QNBEG, QNIN, EXBEG, EXIN
– Analogous feature factorization and the same feature functions as in the L-CRF are used

Words:          no     please  do    not     would  you   ...  my    talk   when  ...  them
Sentence-layer: DEBEG  DEIN    DEIN  DEIN    QNBEG  QNIN  ...  QNIN  QNIN   QNIN  ...  QNIN
Word-layer:     COMMA  NONE    NONE  PERIOD  NONE   NONE  ...  NONE  COMMA  NONE  ...  QMARK
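One plausible way to derive the two tag layers from punctuated training text is sketched below. This is my own construction for illustration, not the authors' preprocessing: the handling of commas, the DE/QN/EX sentence types, and the BEG/IN convention follow the example above, but all details are assumptions.

END_PUNCT = {".": ("PERIOD", "DE"), "?": ("QMARK", "QN"), "!": ("EMARK", "EX")}

def derive_tags(tokens):
    """Return (words, word_layer_tags, sentence_layer_tags) from punctuated tokens."""
    words, word_tags, sent_tags = [], [], []
    pending = []                       # words of the current (unclosed) sentence
    for tok in tokens:
        if tok == ",":
            if word_tags:
                word_tags[-1] = "COMMA"
        elif tok in END_PUNCT:
            word_tag, sent_type = END_PUNCT[tok]
            if word_tags:
                word_tags[-1] = word_tag
            # sentence layer: BEG for the first word of the sentence, IN afterwards
            for j, _ in enumerate(pending):
                sent_tags.append(sent_type + ("BEG" if j == 0 else "IN"))
            pending = []
        else:
            words.append(tok)
            word_tags.append("NONE")
            pending.append(tok)
    return words, word_tags, sent_tags

toks = "no , please do not . would you save your questions ?".split()
print(derive_tags(toks))

On the short example this reproduces the layering shown above: "no" receives COMMA/DEBEG, "not" receives PERIOD/DEIN, and the final word of the question receives QMARK/QNIN.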

Why Does It Work?
The sentence-layer tags are used for sentence segmentation and sentence type identification
The word-layer tags are used for punctuation insertion
Knowledge learned from the sentence layer can guide the word-layer tagging process
The two layers are jointly learned, providing evidence that influences each other's tagging process
Example: [no please do not] is a declarative sentence; [would you save your questions for the end of my talk when i ask for them] is a question sentence, so its sentence-layer tags (QNBEG, QNIN, ...) support predicting the ending question mark (?) at the word layer

Evaluation Datasets
IWSLT 2009 BTEC and CT datasets
Consist of both English (EN) and Chinese (CN)
90% used for training, 10% for testing

                                        BTEC              CT
                                      CN      EN        CN      EN
Number of utterance pairs                19,972            10,061
Percentage of declarative sentences   64%     65%       77%     81%
Percentage of question sentences      36%     35%       22%     19%
Multiple sentences per utterance      14%     17%       29%     39%
Average words per utterance

Experimental Setup
Designed extensive experiments for the hidden event language model
– Duplication vs. no duplication
– Single-pass vs. cascaded
– Trigram vs. 5-gram
Conducted the following experiments
– Accuracy on correctly recognized (CRR) texts (F1 measure)
– Accuracy on automatically recognized (ASR) texts (F1 measure)
– Translation performance with punctuated ASR texts (BLEU metric)

Punctuation Prediction: Evaluation Metrics
Precision = (# correctly predicted punctuation symbols) / (# predicted punctuation symbols)
Recall = (# correctly predicted punctuation symbols) / (# expected punctuation symbols)
F1 measure = 2 / (1/Precision + 1/Recall)
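A small sketch of the metrics above. One assumption is mine: predictions and references are compared as (position, symbol) pairs, so a prediction counts as correct only if the same symbol appears at the same slot in the reference.

def prf1(predicted, reference):
    """predicted, reference: sets of (position, symbol) pairs."""
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    f1 = (2 / (1 / precision + 1 / recall)) if precision and recall else 0.0
    return precision, recall, f1

pred = {(0, ","), (3, "."), (8, "?")}
ref = {(0, ","), (3, "."), (12, "?")}
print(prf1(pred, ref))   # (0.666..., 0.666..., 0.666...)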

Punctuation Prediction Evaluation: Correctly Recognized Texts (I)
[Table: BTEC results (precision, recall, F1 for CN and EN) comparing the hidden event language model (no duplication vs. duplication, single pass vs. cascaded, varying LM order) against the L-CRF and F-CRF; numeric scores not preserved in the transcript]
The "duplication" trick for the hidden event language model is language specific
Unlike English, indicative words can appear anywhere in a Chinese sentence

Punctuation Prediction Evaluation: Correctly Recognized Texts (II)
[Table: CT results (precision, recall, F1 for CN and EN) for the same systems as on the previous slide; numeric scores not preserved in the transcript]
Significant improvement over L-CRF (p < 0.01)
Our approach is general: it requires minimal linguistic knowledge and consistently performs well across different languages

Punctuation Prediction Evaluation: Automatically Recognized Texts
[Table: BTEC ASR results (precision, recall, F1 for CN and EN) for the same systems; numeric scores not preserved in the transcript]
Test data: [count missing] Chinese utterances and 498 English utterances
Recognition accuracy: 86% and 80% respectively
Significant improvement (p < 0.01)

Punctuation Prediction Evaluation: Translation Performance
[Table: BTEC BLEU scores for CN-to-EN and EN-to-CN translation of the punctuated ASR outputs, same systems as above; numeric scores not preserved in the transcript]
This tells us how well the punctuated ASR outputs can be used for downstream NLP tasks
Use the Berkeley aligner and Moses (lexicalized reordering)
Averaged BLEU-4 scores over 10 MERT runs with random initial parameters

Conclusion
We propose a novel approach for punctuation prediction without relying on prosodic features
– Jointly performs punctuation prediction, sentence boundary detection, and sentence type identification
– Performs better than the hidden event language model and a linear-chain CRF model
– A general approach that consistently works well across different languages
– Effective when incorporated into downstream NLP tasks