11.0 Spoken Document Understanding and Organization for User-content Interaction

References:
1. "Spoken Document Understanding and Organization", IEEE Signal Processing Magazine, Sept. 2005, Special Issue on Speech Technology in Human-Machine Communication
2. "Multi-layered Summarization of Spoken Document Archives by Information Extraction and Semantic Structuring", Interspeech 2006, Pittsburgh, USA

User-Content Interaction for Spoken Content Retrieval

Problems
- Unlike text content, spoken content cannot be easily summarized on screen, so retrieved results are difficult to scan and select
- User-content interaction is always important, even for text content

Possible Approaches
- Automatic summary/title generation and key term extraction for spoken content
- Semantic structuring for spoken content
- Multi-modal dialogue with improved interaction

[Diagram: the user sends a query through the user interface and multi-modal dialogue to the retrieval engine over the spoken archives; the retrieved results are annotated with key terms/titles/summaries and semantic structuring]

Multi-media/Spoken Document Understanding and Organization

Key Term/Named Entity Extraction from Multi-media/Spoken Documents
— personal names, organization names, location names, event names
— key phrases/keywords in the documents
— very often out-of-vocabulary (OOV) words, difficult for recognition

Multi-media/Spoken Document Segmentation
— automatically segmenting a multi-media/spoken document into short paragraphs, each with a central topic

Information Extraction for Multi-media/Spoken Documents
— extraction of key information such as who, when, where, what and how for the information described by multi-media/spoken documents
— very often the relationships among the key terms/named entities

Summarization for Multi-media/Spoken Documents
— automatically generating a summary (in text or speech form) for each short paragraph

Title Generation for Multi-media/Spoken Documents
— automatically generating a title (in text or speech form) for each short paragraph
— a very concise summary indicating the topic area

Topic Analysis and Organization for Multi-media/Spoken Documents
— analyzing the subject topics of the short paragraphs
— clustering and organizing the subject topics of the short paragraphs, giving the relationships among them for easier access

Integration Relationships among the Involved Technology Areas

[Diagram: key term/named entity extraction from spoken documents feeds semantic analysis, which in turn supports information indexing, retrieval and browsing]

Key Term Extraction from Spoken Content (1/2)

Key Terms: key phrases and keywords

Key Phrase Boundary Detection
- Left/right boundaries of a key phrase detected by context statistics
- An example: "hidden Markov model"
  — "hidden" is almost always followed by the same word ("Markov")
  — "hidden Markov" is almost always followed by the same word ("model")
  — "hidden Markov model" is followed by many different words (e.g. "represent", "is", "can", "of", "in"), which signals the right boundary
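The boundary detection idea above can be made concrete: the entropy of the distribution over following words stays low inside a key phrase and jumps at its right boundary. A minimal sketch in Python (the toy corpus and the `next_word_entropy` helper are illustrative, not the cited branching-entropy system):

```python
import math
from collections import Counter

def next_word_entropy(corpus, prefix):
    """Entropy (bits) of the distribution over words that follow `prefix`
    in the corpus; a jump in entropy suggests a phrase boundary."""
    followers = Counter()
    n = len(prefix)
    for sent in corpus:
        for i in range(len(sent) - n):
            if tuple(sent[i:i + n]) == tuple(prefix):
                followers[sent[i + n]] += 1
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in followers.values())

# Toy corpus mirroring the slide's example.
corpus = [
    "the hidden Markov model is trained".split(),
    "a hidden Markov model can represent".split(),
    "the hidden Markov model of speech".split(),
    "this hidden Markov model in practice".split(),
]

# Entropy is 0 inside the phrase ("hidden" is always followed by "Markov") ...
inside = next_word_entropy(corpus, ["hidden"])
# ... and jumps after the full phrase (followers: is/can/of/in).
after = next_word_entropy(corpus, ["hidden", "Markov", "model"])
```

With four distinct followers of equal count, the entropy after the phrase is log2(4) = 2 bits, versus 0 bits inside it.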

Key Term Extraction from Spoken Content (2/2)

- Prosodic Features: key terms are probably produced with longer duration, wider pitch range and higher energy
- Semantic Features (e.g. PLSA): key terms are usually focused on a smaller number of topics, i.e. the distribution P(T_k | t_i) over the key term topics T_k is sharper for a key term t_i than for a non-key term
- Lexical Features: TF/IDF, POS tag, etc.

Extractive Summarization of Spoken Documents

- Selecting the most representative utterances in the original document while avoiding redundancy
- Scoring sentences based on prosodic, semantic and lexical features, confidence measures, etc.
- Based on a given summarization ratio

[Diagram: a document d of utterances X1, X2, X3, ..., containing both correctly and wrongly recognized words, reduced to a summary of selected utterances]

Title Generation for Spoken Documents

- Titles for retrieved documents/segments are helpful in browsing and selection of retrieved results
- Short, readable, telling what the document/segment is about

One example: Scored Viterbi Search integrated with Adaptive K-Nearest Neighbors (AKNN)
- Starting with the automatically generated summary
- Three sets of scores used: term selection scores (key terms, title terms), term ordering scores, title length scores
- Select the top K training documents nearest to the new document using AKNN
- Choose the best human-generated title of the selected training documents
- Construct the title with the Viterbi algorithm using terms selected from the automatically generated summary and the best human-generated title

[Diagram: spoken document → recognition and summarization → Viterbi search scored by the term selection, term ordering and title length models trained on a corpus → output title]

Semantic Structuring (1/2)

Example 1: retrieved results clustered by latent topics and organized in a two-dimensional tree structure (multi-layered map)
- Each cluster labeled by a set of key terms representing a group of retrieved documents/segments
- Each cluster expanded into a map in the next layer

We analyze and organize the spoken documents with techniques such as PLSA, in which terms and documents all have probabilities based on the set of latent topics. Documents can then be clustered in a two-dimensional tree structure: related documents fall in the same cluster, and the relationships among clusters correspond to distances on the map. When a cluster contains many documents, it can be further analyzed into another map on the next layer.

Semantic Structuring (2/2)

Example 2: Key-term Graph
- Each retrieved spoken document/segment labeled by a set of key terms
- Relationships between key terms represented by a graph

[Diagram: retrieved spoken documents mapped onto a key-term graph with nodes such as Acoustic Modeling, Viterbi search, HMM, Language Modeling, Perplexity]

Multi-modal Dialogue

An example: user-system interaction modeled as a Markov Decision Process (MDP)

Example goals
- Small average number of dialogue turns (average number of user actions taken) for successful tasks (success: the user's information need is satisfied)
- Less effort for the user, better retrieval quality

[Diagram: the user sends a query through the query interface and multi-modal dialogue to the retrieval engine over the spoken archives; retrieved results are annotated with key terms/titles/summaries and semantic structuring]

Spoken Document Summarization

Why summarization?
- Huge quantities of information: mails, news articles, books, meetings, social media, broadcast news, websites, lectures
- Spoken content is difficult to show on the screen and difficult to browse

Spoken Document Summarization

More difficult than text summarization
- Recognition errors, disfluency, etc.
- Extra information not in text: prosody, speaker identity, emotion, etc.

[Diagram: audio recordings pass through an ASR system into documents d1 ... dN, each a sequence of utterances x1, x2, ...; the summarization system selects utterances s1, s2, ... from each document to form the summaries S1 ... SN]

Unsupervised Approach: Maximal Marginal Relevance (MMR)

Select relevant and non-redundant sentences:
MMR(x_i) = Rel(x_i) - λ Red(x_i, S)
- Relevance: Rel(x_i) = Sim(x_i, d)
- Redundancy: Red(x_i, S) = Sim(x_i, S)
- Sim(x_i, ·): a similarity measure

The utterances x_1, x_2, ... of the spoken document d are ranked by MMR(x_i), computed against the presently selected summary S, and selected greedily.
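A minimal greedy MMR sketch in Python, with bag-of-words cosine similarity standing in for Sim(·,·); the example utterances and the λ value are illustrative only:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def mmr_summarize(utterances, n_select, lam=0.5):
    """Greedy MMR: repeatedly pick the utterance maximizing
    Rel(x_i) - lam * Red(x_i, S), where Rel is similarity to the whole
    document d and Red is similarity to the summary S selected so far."""
    doc_vec = Counter(w for u in utterances for w in u.split())
    vecs = [Counter(u.split()) for u in utterances]
    selected, summary_vec = [], Counter()
    remaining = list(range(len(utterances)))
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda i:
                   cosine(vecs[i], doc_vec) - lam * cosine(vecs[i], summary_vec))
        remaining.remove(best)
        selected.append(best)
        summary_vec += vecs[best]
    return selected

utts = [
    "speech recognition converts speech to text",
    "speech recognition converts speech into text form",  # redundant with the first
    "the summary should avoid redundancy",
]
picked = mmr_summarize(utts, n_select=2, lam=0.7)
```

With λ = 0.7, the second pick skips the near-duplicate first utterance in favor of the less relevant but non-redundant third one.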

Supervised Approach: SVM or Similar

- Trained with documents with human-labeled summaries
- Binary classification problem: x_i ∈ S or x_i ∉ S
- v(x_i): feature vector of utterance x_i

Training phase: feature extraction over the labeled documents d1 ... dN with summaries S1 ... SN trains the binary classification model
Testing phase: the model ranks the utterances of a new ASR-transcribed document

Domain Adaptation of Supervised Approach

Problem
- Hard to get high-quality training data
- In most cases we have labeled out-of-domain references, but no labeled target-domain references

Goal
- Taking advantage of out-of-domain data, e.g. out-of-domain: news; target domain: lectures

Domain Adaptation of Supervised Approach

Step 1: Model_0 is trained with the out-of-domain data (documents with human-labeled summaries), then used to obtain Summary_0 for each document in the unlabeled target domain.

Domain Adaptation of Supervised Approach

Step 2: Summary_0, together with the out-of-domain data, is jointly used to train Model_1.
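The two-step adaptation can be sketched as a simple self-training loop. Here a toy word-weight scorer stands in for the structured SVM of the actual work; `train` and `summarize` are hypothetical helpers for illustration only:

```python
from collections import Counter

def train(docs_with_summaries):
    """Toy 'model': the count with which each word appears in the
    (human- or automatically-) selected summary utterances."""
    weights = Counter()
    for doc, summary in docs_with_summaries:
        for utt in summary:
            weights.update(utt.split())
    return weights

def summarize(model, doc, ratio=0.5):
    """Score each utterance by its length-normalized summed word weights
    and keep the top `ratio` fraction (standing in for the SVM scorer)."""
    def score(utt):
        words = utt.split()
        return sum(model[w] for w in words) / len(words)
    k = max(1, int(len(doc) * ratio))
    return sorted(doc, key=score, reverse=True)[:k]

# Out-of-domain (news) data with labeled summaries.
out_domain = [
    (["the parliament passed the budget today",
      "weather was sunny in the capital"],
     ["the parliament passed the budget today"]),
]
# Target-domain (lecture) documents, unlabeled.
target_docs = [
    ["today we discuss the budget of a project",
     "please open your textbook"],
]

# Step 1: Model_0 from out-of-domain data only, used to get Summary_0.
model0 = train(out_domain)
summary0 = [(d, summarize(model0, d)) for d in target_docs]
# Step 2: Model_1 jointly trained on out-of-domain data plus Summary_0.
model1 = train(out_domain + summary0)
```

Model_1 now carries evidence from both domains (e.g. the word "budget" has been seen in a news summary and in an automatically selected lecture utterance).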

Document Summarization

- Extractive Summarization: select sentences in the document
- Abstractive Summarization: generate sentences describing the content of the document

e.g. for a news story (about prosecutors in Changhua detaining a township mayor in a corruption case):
Document: 彰化 檢方 偵辦 芳苑 鄉公所 道路 排水 改善 工程 弊案 拘提 芳苑 鄉長 陳 聰明 / 檢方 認為 陳 聰明 等 人和 包商 勾結 涉嫌 貪污 和 圖利 罪嫌 凌晨 向 法院 聲請羈押 以及 公所 秘書 楊 騰 煌 獲准
Extractive output: a sentence selected from the document itself
Abstractive output: 彰化 鄉公所 陳聰明 涉嫌 貪污 (a new sentence not in the document)


Abstractive Summarization (1/4)

An Example Approach
1) Generating candidate sentences by a graph
2) Sentence selection: ranking the candidate sentences x_i by topic models, language models of words and parts-of-speech (POS), length constraints, etc., to produce a ranked list

Abstractive Summarization (2/4)

1) Generating candidate sentences: graph construction + search on graph
- Node: a "word" in the sentences
- Edge: word ordering in the sentences

Example sentences (hotel reviews):
X1: 這個 飯店 房間 算 舒適
X2: 這個 飯店 的 房間 很 舒適 但 離 市中心 太遠 不方便
X3: 飯店 挺 漂亮 但 房間 很 舊
X4: 離 市中心 遠

Abstractive Summarization (3/4)

1) Generating candidate sentences: graph construction + search on graph, with a start node and an end node added

[Word graph built from X1–X4: the start node links to sentence-initial words (這個, 飯店, 離), the words of the four sentences form the nodes, observed word orderings form the edges, and sentence-final words link to the end node]

Abstractive Summarization (4/4)

1) Generating candidate sentences: graph construction + search on graph
- Search: find valid paths on the graph
- Valid path: a path from the start node to the end node

e.g. valid paths include new sentences not present in X1–X4:
飯店 房間 很 舒適 但 離 市中心 遠
飯店 挺 漂亮 但 房間 很 舊
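The graph construction and path search above can be sketched directly on the four example sentences. This is a simplified version (no edge weights or node-merging heuristics, which practical word-graph systems such as Opinosis use):

```python
from collections import defaultdict

START, END = "<s>", "</s>"

def build_word_graph(sentences):
    """Nodes are words; a directed edge w1 -> w2 is added whenever w2
    immediately follows w1 in some sentence (plus start/end nodes)."""
    edges = defaultdict(set)
    for sent in sentences:
        words = [START] + sent.split() + [END]
        for a, b in zip(words, words[1:]):
            edges[a].add(b)
    return edges

def candidate_sentences(edges, max_len=12):
    """Enumerate valid paths (start node to end node); each path is a
    candidate sentence. Words are not repeated on a path, which keeps
    this sketch from looping on cycles."""
    results = []
    def dfs(node, path):
        if node == END:
            results.append(" ".join(path))
            return
        if len(path) >= max_len:
            return
        for nxt in sorted(edges[node]):
            if nxt == END or nxt not in path:
                dfs(nxt, path if nxt == END else path + [nxt])
    dfs(START, [])
    return results

# The four example sentences from the slides (hotel reviews).
sentences = [
    "這個 飯店 房間 算 舒適",
    "這個 飯店 的 房間 很 舒適 但 離 市中心 太遠 不方便",
    "飯店 挺 漂亮 但 房間 很 舊",
    "離 市中心 遠",
]
cands = candidate_sentences(build_word_graph(sentences))
# The candidates include recombined sentences that appear in no input,
# such as "飯店 房間 很 舒適 但 離 市中心 遠".
```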

Sequence-to-Sequence Learning (1/3)

Both input and output are sequences, possibly with different lengths
- machine translation (machine learning → 機器學習)
- summarization, title generation
- spoken dialogues
- speech recognition

For Chinese-to-English or English-to-Chinese translation, neither side is necessarily longer or shorter than the other. The encoder compresses the input (e.g. "machine learning") into a vector containing all information about the input sequence, which conditions the decoder.

Sequence-to-Sequence Learning (2/3)

Problem: the decoder does not know when to stop. Given "machine learning", it may correctly generate 機 器 學 習 but then keep going (慣 性 ...).

Sequence-to-Sequence Learning (3/3)

Solution: add a stop symbol "===" to the output vocabulary, so the decoder can generate 機 器 學 習 === and terminate. This enables many more applications. [Ilya Sutskever, NIPS'14][Dzmitry Bahdanau, arXiv'15]
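The role of the stop symbol can be illustrated with a greedy decoding loop. The `stub_next_token` table below merely stands in for a trained decoder's predictions; a real sequence-to-sequence model would predict the next-token distribution with a network conditioned on the encoded input:

```python
STOP = "==="  # the stop symbol added to the output vocabulary

def stub_next_token(generated):
    """Stand-in for a trained decoder: maps the tokens generated so far
    to the next token. Illustrative only, not a real model."""
    table = {
        (): "機",
        ("機",): "器",
        ("機", "器"): "學",
        ("機", "器", "學"): "習",
        ("機", "器", "學", "習"): STOP,  # the model learns to emit the stop symbol
    }
    return table.get(tuple(generated), STOP)

def decode(next_token_fn, max_len=20):
    """Greedy decoding: keep generating until the stop symbol appears
    (with a hard length cap, so a model that never stops cannot loop
    forever)."""
    out = []
    while len(out) < max_len:
        tok = next_token_fn(out)
        if tok == STOP:
            break
        out.append(tok)
    return out

result = decode(stub_next_token)  # decoding of "machine learning"
```

Without the stop symbol, the loop would only terminate at the arbitrary length cap, which is exactly the "don't know when to stop" problem of the previous slide.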

Multi-modal Interactive Dialogue

Interactive dialogue: the retrieval engine interacts with the user to find out the user's information need more precisely
- The user enters Query 1: "USA President"
- The retrieval engine searches the spoken archive and obtains divergent results (document 305, document 116, document 298, ...)
- When the retrieved results are divergent, the system may ask for more information rather than offering the results; system response: "More precisely please?"

Multi-modal Interactive Dialogue

- The user enters Query 2: "International Affairs"
- The retrieved results (document 496, document 275, document 312, ...) are still divergent but seem to have a major trend, so the system may use a keyword representing the major trend to ask for confirmation; system response: "Regarding Middle East?"
- The user may reply: "Yes" or "No, Asia"

Markov Decision Process (MDP)

A mathematical framework for decision making, defined by (S, A, T, R, π)
- S: set of states, the current system status
- A: set of actions the system can take at each state
- T: transition probabilities between states when a certain action is taken
- R: reward received when taking an action
- π: policy, the choice of action given the state, π: s_i → A_j

Objective: find a policy that maximizes the expected total reward over the sequences of states s_1, s_2, s_3, ..., actions A_1, A_2, A_3, ..., and rewards R_1, R_2, R_3, ...

Multi-modal Interactive Dialogue Modeled as a Markov Decision Process (MDP)

- After a query is entered, the system starts at a certain state
- States: retrieval result quality estimated as a continuous variable (e.g. MAP), plus the present dialogue turn
- Actions: at each state, a set of actions can be taken: asking for more information; returning a keyword or document, or a list of keywords or documents and asking the user to select one; showing the results; etc.
- Each user response corresponds to a certain negative reward (extra work for the user); when the system decides to show the user the retrieved results, it earns a positive reward (e.g. MAP improvement)
- A policy maximizing the rewards (π: S_i → A_j) is learned from historical user interactions

Reinforcement Learning

Example approach: Value Iteration

Define the value function
Q^π(s, a) = E[ Σ_{k=0}^∞ γ^k r_k | s_0 = s, a_0 = a ]
the expected discounted sum of rewards obtained by following policy π after starting from (s, a).

The optimal value Q* can be estimated iteratively from a training set:
Q*(s, a) = E_{s'|s,a}[ R(s, a, s') + γ max_{b∈A} Q*(s', b) ]

The optimal policy is then obtained by choosing, for each state, the action that maximizes the value function.
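The Q* fixed-point equation above can be turned into a small tabular value-iteration sketch. The two-state dialogue MDP and its reward values below are invented for illustration:

```python
def q_value_iteration(states, actions, trans, gamma=0.9, iters=200):
    """Tabular value iteration on Q:
    Q*(s,a) = E_{s'|s,a}[ R(s,a,s') + gamma * max_b Q*(s',b) ].
    `trans[(s, a)]` is a list of (prob, next_state, reward); states not
    listed in `states` are terminal and contribute no future value."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iters):
        newQ = {}
        for (s, a) in Q:
            total = 0.0
            for p, s2, r in trans.get((s, a), []):
                future = max(Q[(s2, b)] for b in actions) if s2 in states else 0.0
                total += p * (r + gamma * future)
            newQ[(s, a)] = total
        Q = newQ
    return Q

# Toy dialogue MDP: in the "vague" state the system can "ask" for more
# information (a small negative reward: one more turn costs the user) or
# "show" results immediately (low retrieval quality, small reward).
states = ["vague", "clear"]
actions = ["ask", "show"]
trans = {
    ("vague", "ask"):  [(1.0, "clear", -1.0)],  # user clarifies the query
    ("vague", "show"): [(1.0, "done",   2.0)],  # show poor results now
    ("clear", "ask"):  [(1.0, "clear", -1.0)],  # unnecessary extra turn
    ("clear", "show"): [(1.0, "done",  10.0)],  # show good results
}
Q = q_value_iteration(states, actions, trans)
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```

The learned policy asks one clarifying question when the results are still vague and shows the results once they are good, matching the intuition of the dialogue slides.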

Question Answering (QA) in Speech

- The question, the answer, and the knowledge source can each be in text form or in speech
- Spoken question answering is becoming important: spoken questions and answers are attractive, and the availability of a large number of on-line courses and shared videos today makes spoken answers by distinguished instructors or speakers more feasible
- A text knowledge source is always important

Three Types of QA

- Factoid QA: What is the name of the largest city of Taiwan? Ans: Taipei.
- Definitional QA: What is QA?
- Complex Question: How to construct a QA system?

Factoid QA

Question Processing
- Query formulation: transform the question into a query for retrieval
- Answer type detection (city name, number, time, etc.)

Passage Retrieval
- Document retrieval, then passage retrieval

Answer Processing
- Find and rank candidate answers

Factoid QA – Question Processing

Query Formulation: choose key terms from the question
- Ex: What is the name of the largest city of Taiwan? → "Taiwan" and "largest city" are the key terms used as the query

Answer Type Detection
- "city name" in this example
- A large number of hierarchical classes, hand-crafted or automatically learned

An Example Factoid QA

Watson: a QA system developed by IBM (text-based, no speech), which won "Jeopardy!"

Definitional QA

- Definitional QA ≈ query-focused summarization
- Uses a framework similar to factoid QA: Question Processing and Passage Retrieval are kept, while Answer Processing is replaced by Summarization

References

Key Terms
- "Automatic Key Term Extraction From Spoken Course Lectures Using Branching Entropy and Prosodic/Semantic Features", IEEE Workshop on Spoken Language Technology, Berkeley, California, U.S.A., Dec 2010, pp. 253-258.
- "Unsupervised Two-Stage Keyword Extraction from Spoken Documents by Topic Coherence and Support Vector Machine", International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, Mar 2012, pp. 5041-5044.

Title Generation
- "Automatic Title Generation for Spoken Documents with a Delicate Scored Viterbi Algorithm", 2nd IEEE Workshop on Spoken Language Technology, Goa, India, Dec 2008, pp. 165-168.
- "Abstractive Headline Generation for Spoken Content by Attentive Recurrent Neural Networks with ASR Error Modeling", IEEE Workshop on Spoken Language Technology (SLT), San Diego, California, USA, Dec 2016, pp. 151-157.

References

Summarization
- "Supervised Spoken Document Summarization Jointly Considering Utterance Importance and Redundancy by Structured Support Vector Machine", Interspeech, Portland, U.S.A., Sep 2012.
- "Unsupervised Domain Adaptation for Spoken Document Summarization with Structured Support Vector Machine", International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, May 2013.
- "Supervised Spoken Document Summarization Based on Structured Support Vector Machine with Utterance Clusters as Hidden Variables", Interspeech, Lyon, France, Aug 2013, pp. 2728-2732.
- "Semantic Analysis and Organization of Spoken Documents Based on Parameters Derived from Latent Topics", IEEE Transactions on Audio, Speech and Language Processing, Vol. 19, No. 7, Sep 2011, pp. 1875-1889.
- "Spoken Lecture Summarization by Random Walk over a Graph Constructed with Automatically Extracted Key Terms", Interspeech, 2011.

References

Summarization
- "Speech-to-text and Speech-to-speech Summarization of Spontaneous Speech", IEEE Transactions on Speech and Audio Processing, Dec 2004.
- "The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries", SIGIR, 1998.
- "Using Corpus and Knowledge-based Similarity Measure in Maximum Marginal Relevance for Meeting Summarization", ICASSP, 2008.
- "Opinosis: A Graph-Based Approach to Abstractive Summarization of Highly Redundant Opinions", International Conference on Computational Linguistics, 2010.

References

Interactive Retrieval
- "Interactive Spoken Content Retrieval by Extended Query Model and Continuous State Space Markov Decision Process", International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, May 2013.
- "Interactive Spoken Content Retrieval by Deep Reinforcement Learning", Interspeech, San Francisco, USA, Sept 2016.
- Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, The MIT Press, 1999.
- "Partially Observable Markov Decision Processes for Spoken Dialog Systems", Jason D. Williams and Steve Young, Computer Speech and Language, 2007.

References

Question Answering
- Rosset, S., Galibert, O. and Lamel, L. (2011), "Spoken Question Answering", in Spoken Language Understanding: Systems for Extracting Semantic Information from Speech.
- Pere R. Comas, Jordi Turmo, and Lluís Màrquez (2012), "Sibyl, a Factoid Question-Answering System for Spoken Documents", ACM Trans. Inf. Syst. 30, 3, Article 19, September 2012.
- "Towards Machine Comprehension of Spoken Content: Initial TOEFL Listening Comprehension Test by Machine", Interspeech, San Francisco, USA, Sept 2016, pp. 2731-2735.
- "Hierarchical Attention Model for Improved Comprehension of Spoken Content", IEEE Workshop on Spoken Language Technology (SLT), San Diego, California, USA, Dec 2016, pp. 234-238.

References

Sequence-to-Sequence Learning
- "Sequence to Sequence Learning with Neural Networks", NIPS, 2014.
- "Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition", ICASSP, 2016.