National Taiwan University, Taiwan
Automatic Key Term Extraction from Spoken Course Lectures Using Branching Entropy and Prosodic/Semantic Features
Speakers: 黃宥、陳縕儂
Hello, everybody. I am Vivian Chen, from National Taiwan University. Today I am going to present my work on automatic key term extraction from spoken course lectures using branching entropy and prosodic/semantic features.
Key Term Extraction, NTU
Outline
Introduction
Proposed Approach: Branching Entropy, Feature Extraction, Learning Methods
Experiments & Evaluation
Conclusion
Introduction
Definition
Key Term: higher term frequency; core content
Two types: keyword, key phrase
Advantages: indexing and retrieval; relating key terms to segments of documents
First I will define what a key term is. A key term is a term that has higher term frequency and carries core content. There are two types of key terms. One type is a phrase, which we call a key phrase; for example, "language model" is a key phrase. The other type is a single word, which we call a keyword, like "entropy". Key term extraction has two advantages: it helps us index and retrieve documents, and it lets us construct relationships between key terms and document segments. Here is an example.
Introduction
We can show some key terms related to "acoustic model". If a key term co-occurs with "acoustic model" in the same document, they are relevant, so we can show them to users.
Introduction
acoustic model, language model, hmm, n gram, phone, hidden Markov model
Then we can construct a key term graph to represent the relationships between these key terms.
Introduction
bigram, hmm, acoustic model, language model, n gram, hidden Markov model, phone
Similarly, we can construct the relations between "language model" and other terms, and then show the whole graph to reveal how the key terms are organized.
Target: extract key terms from course lectures
Proposed Approach
Automatic Key Term Extraction
Flow chart: archive of spoken documents → ASR (speech signal → ASR transcriptions) → Phrase Identification (branching entropy) → Key Term Extraction (feature extraction + learning methods: K-means Exemplar, AdaBoost, Neural Network) → key terms (entropy, acoustic model, ...).
Here is the flow chart. We start with a large archive of spoken documents. First, branching entropy is used to identify phrases; then the system learns to extract key terms from a set of features.
Branching Entropy
How do we decide the boundary of a phrase?
Example: "hidden Markov model" is followed by "is", "of", "in", ... and preceded by "represent", "is", "can", ...
"hidden" is almost always followed by the same word; "hidden Markov" is almost always followed by the same word; but "hidden Markov model" is followed by many different words, so a boundary lies after it.
The target of this work is to decide the boundary of a phrase, but where is the boundary? We can observe some characteristics first: "hidden" is almost always followed by "Markov".
Define branching entropy to decide the possible boundary.
Branching Entropy
Definition of Right Branching Entropy
Probability of child x_i for X: P(x_i|X) = C(X x_i) / C(X), where C(·) is the corpus count.
Right branching entropy for X: H_r(X) = -Σ_i P(x_i|X) log P(x_i|X).
Branching Entropy
Decision of Right Boundary
Find the right boundary located between X and x_i where the right branching entropy rises, i.e., H_r(X x_i) > H_r(X): many different words can follow the complete phrase, so the entropy jumps at the boundary.
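To make the boundary test concrete, here is a minimal sketch (not the paper's implementation, which uses a PAT tree) that computes right branching entropy directly from a toy word sequence; left branching entropy can be obtained by running the same function on the reversed sequence with a reversed phrase.

```python
from collections import Counter
from math import log2

def right_branching_entropy(corpus, prefix):
    """H_r(X) = -sum_i P(x_i|X) * log2 P(x_i|X), where x_i ranges over the
    words observed immediately after the word sequence X in the corpus."""
    n = len(prefix)
    followers = Counter(
        corpus[i + n]
        for i in range(len(corpus) - n)
        if corpus[i:i + n] == prefix
    )
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in followers.values())

words = ("we use hidden markov model is good the hidden markov model can help "
         "a hidden markov chain here").split()

# "hidden" is always followed by "markov": entropy 0, no boundary yet.
# "hidden markov model" is followed by different words ("is", "can"):
# the entropy jumps, signalling the right boundary of the phrase.
print(right_branching_entropy(words, ["hidden"]))                     # 0.0
print(right_branching_entropy(words, ["hidden", "markov"]))           # ~0.918
print(right_branching_entropy(words, ["hidden", "markov", "model"]))  # 1.0
```

The toy corpus reproduces the slide's observation: entropy stays low inside the phrase and rises right after "model".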
Branching Entropy
Looking at the left context: "hidden Markov model" is preceded by many different words ("represent", "is", "can", ...), while inside the phrase each word is almost always preceded by the same word.
Branching Entropy
Decision of Left Boundary
Find the left boundary located between X and x_i where the left branching entropy rises, with X taken in reverse order ("model Markov hidden"), i.e., the entropy is computed over the reversed sequence.
Using a PAT tree to implement.
Branching Entropy
Implementation in the PAT tree
In the PAT tree we can compute the right branching entropy of each node. As an example of p(x_i): X is the node for the phrase "hidden Markov"; x_1 is X's child "hidden Markov model", and x_2 is X's other child "hidden Markov chain". From the counts stored in the tree we obtain P(x_i|X) and then the right branching entropy of the node. We compute H_r(X) for every X in the PAT tree, and H_l(X̄) for every X̄ in the reversed PAT tree.
Automatic Key Term Extraction
Back to the flow chart: next, we extract some features for each candidate term.
Feature Extraction
Prosodic features, for each candidate term at its first occurrence: a speaker tends to use longer duration to emphasize key terms.
Duration (I-IV): normalized duration (max, min, mean, range); the duration of phone "a" is normalized by the average duration of phone "a".
For each word we also compute prosodic features. First, we believe a lecturer uses longer duration to emphasize key terms. For the first occurrence of a candidate term, we compute the duration of each phone and normalize it by the average duration of that phone. We then represent the term with just four values: the maximum, minimum, mean, and range over all phones in the term.
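As a sketch of how the Duration (I-IV) features could be computed; the phone labels and durations below are invented for illustration:

```python
def duration_features(phone_durations, avg_duration):
    """Duration (I-IV): normalize each phone's duration in the term by the
    average duration of that phone over the corpus, then summarize with
    max, min, mean, and range over all phones of the term."""
    normed = [dur / avg_duration[phone] for phone, dur in phone_durations]
    return {
        "dur_max": max(normed),
        "dur_min": min(normed),
        "dur_mean": sum(normed) / len(normed),
        "dur_range": max(normed) - min(normed),
    }

# Hypothetical phone durations (seconds) for one candidate term.
term = [("e", 0.12), ("n", 0.05), ("t", 0.06)]
corpus_avg = {"e": 0.06, "n": 0.05, "t": 0.04}
print(duration_features(term, corpus_avg))
```

The pitch and energy features of the next slides follow the same max/min/mean/range pattern, with the frame as the segment unit.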
Feature Extraction
Prosodic features: higher pitch may signal significant information.
Pitch (I-IV): F0, summarized like duration (max, min, mean, range).
We believe higher pitch may mark important information, so we extract the pitch contour. The method is the same as for duration, except the segment unit is the single frame; again four values represent the feature.
Feature Extraction
Prosodic features: higher energy emphasizes important information.
Energy: frame energy, summarized like pitch.
Similarly, we think higher energy may mark important information. We extract the energy of each frame in a candidate term; the features are computed like the pitch features. This completes the first set of features.
Feature Extraction
Lexical features: TF (term frequency), IDF (inverse document frequency), TFIDF (tf * idf), PoS (the PoS tag).
The second set consists of well-known lexical features that may indicate the importance of a term; we simply use them to represent each candidate term.
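A minimal sketch of the TFIDF feature, using the standard tf × log(N/df) form (the paper may use a different weighting variant):

```python
from math import log

def tfidf(term, doc, docs):
    """TFIDF = term frequency in the document * inverse document frequency,
    where idf = log(N / df) and df counts documents containing the term."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    return tf * log(len(docs) / df) if df else 0.0

docs = [["entropy", "entropy", "model"], ["model", "gram"], ["model"]]
print(tfidf("entropy", docs[0], docs))  # 2 * log(3/1)
print(tfidf("model", docs[0], docs))    # 1 * log(3/3) = 0.0
```

A term that appears often in one document but rarely across the archive ("entropy" above) scores high; a term that appears everywhere ("model") scores zero.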
Feature Extraction
Semantic features: Probabilistic Latent Semantic Analysis (PLSA), latent topic probability. Key terms tend to focus on limited topics. (D_i: documents; T_k: latent topics; t_j: terms)
The third set consists of semantic features. The assumption is that key terms tend to focus on limited topics, so we use PLSA: for each candidate term we obtain the probability of each latent topic given the term.
Feature Extraction
Semantic features, computed from the PLSA topic distribution of each candidate term:
LTP (I-III): Latent Topic Probability (mean, variance, standard deviation), describing the probability distribution over topics.
LTS (I-III): Latent Topic Significance (mean, variance, standard deviation), a within-topic to out-of-topic frequency ratio.
LTE: Latent Topic Entropy, the entropy of the term's topic distribution. A key term concentrates on few topics, so it has lower LTE; a non-key term spreads over many topics, so it has higher LTE.
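The LTE and LTP features can be sketched as follows; `topic_probs` stands for the PLSA distribution P(T_k | t) of one candidate term, and the numbers are invented:

```python
from math import log

def latent_topic_entropy(topic_probs):
    """LTE(t) = -sum_k P(T_k|t) * log P(T_k|t).  A key term concentrates on
    few topics, so its LTE is low; a non-key term spreads out, so it is high."""
    return -sum(p * log(p) for p in topic_probs if p > 0)

def ltp_features(topic_probs):
    """LTP (I-III): mean, variance, and standard deviation of P(T_k|t)."""
    n = len(topic_probs)
    mean = sum(topic_probs) / n
    var = sum((p - mean) ** 2 for p in topic_probs) / n
    return {"ltp_mean": mean, "ltp_var": var, "ltp_std": var ** 0.5}

key_term = [0.85, 0.05, 0.05, 0.05]   # concentrated on one topic
non_key  = [0.25, 0.25, 0.25, 0.25]   # spread over all topics
print(latent_topic_entropy(key_term))  # low
print(latent_topic_entropy(non_key))   # high: log(4)
```

The comparison matches the slide: the concentrated distribution gets the lower entropy, so a lower LTE is evidence for a key term.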
Automatic Key Term Extraction
Back to the flow chart: finally, we use learning approaches to extract the key terms.
Learning Methods
Unsupervised learning: K-means Exemplar. Transform each term into a vector in LTS (Latent Topic Significance) space and run K-means. The terms in one cluster focus on a single topic, so we take the term at the centroid of each cluster as a key term: the other terms in the cluster are related to it, and the key term represents the topic.
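A toy sketch of the K-means Exemplar idea in pure Python; the 2-D LTS vectors are hand-made for illustration, whereas the real system works in a higher-dimensional LTS space:

```python
import random

def dist2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans_exemplars(term_vecs, k, iters=20, seed=0):
    """Cluster terms in LTS space with K-means and return, per cluster, the
    term closest to the centroid: that exemplar is taken as the key term."""
    random.seed(seed)
    terms = list(term_vecs)
    centroids = [term_vecs[t] for t in random.sample(terms, k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for t in terms:  # assign each term to its nearest centroid
            j = min(range(k), key=lambda c: dist2(term_vecs[t], centroids[c]))
            clusters[j].append(t)
        for j, members in enumerate(clusters):  # recompute centroids
            if members:
                dim = len(centroids[j])
                centroids[j] = tuple(
                    sum(term_vecs[t][d] for t in members) / len(members)
                    for d in range(dim))
    return [min(members, key=lambda t: dist2(term_vecs[t], centroids[j]))
            for j, members in enumerate(clusters) if members]

# Invented 2-D LTS vectors: two topics, one exemplar expected from each.
vecs = {"hmm": (0.9, 0.1), "viterbi": (0.8, 0.2),
        "entropy": (0.1, 0.9), "tfidf": (0.2, 0.8), "perplexity": (0.15, 0.85)}
print(kmeans_exemplars(vecs, k=2))
```

With two clearly separated topic groups, one exemplar is picked from each group, which is exactly the "centroid term represents the topic" idea on the slide.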
Learning Methods
Supervised learning: Adaptive Boosting and Neural Network. We also use these two supervised methods, which automatically adjust the weights of the features to produce a classifier.
Experiments & Evaluation
Experiments
Corpus: NTU lecture corpus; Mandarin Chinese with embedded English words; single speaker; 45.2 hours.
Example: 我們的solution是viterbi algorithm (Our solution is the Viterbi algorithm).
We evaluate our approach on the NTU lecture corpus, which mixes Mandarin Chinese with English words, as in the example above. The lectures come from a single speaker, and the corpus totals about 45 hours.
Experiments
ASR system: Chinese and English acoustic models are trained on out-of-domain corpora and adapted with some data from the target speaker, giving a bilingual acoustic model. The language model is a trigram interpolation of a background model from out-of-domain corpora with an adaptive model from the in-domain corpus.
ASR accuracy:
Language | Mandarin | English | Overall
Char Acc (%) | 78.15 | 53.44 | 76.26
Experiments
Reference Key Terms: annotations from 61 students who had taken the course. If the k-th annotator labeled N_k key terms, each of them received a score of 1/N_k, and all other terms 0. Rank the terms by the sum of the scores given by all annotators, and choose the top N terms from the list, where N is the average N_k. Here N = 154 key terms: 59 key phrases and 95 keywords.
To evaluate our results we need a reference key term list. It comes from annotations by students who had taken the course; we sort all terms and take the top N as key terms, where N (= 154) is the average number of key terms the students labeled.
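The scoring scheme can be sketched as below; I assume the score each annotator gives is 1/N_k (the exact value is not visible on the slide), so every annotator distributes one unit of weight over the terms they labeled:

```python
def reference_key_terms(annotations):
    """Each annotator labels some terms; a term gets 1/N_k from the k-th
    annotator (0 if unlabeled).  Rank terms by total score and keep the
    top N, where N is the average number of labeled terms per annotator."""
    scores = {}
    for labeled in annotations:
        for term in labeled:
            scores[term] = scores.get(term, 0.0) + 1.0 / len(labeled)
    n = round(sum(len(a) for a in annotations) / len(annotations))
    return sorted(scores, key=lambda t: -scores[t])[:n]

# Toy annotations from three hypothetical annotators.
votes = [["hmm", "viterbi"], ["hmm"], ["hmm", "viterbi", "entropy"]]
print(reference_key_terms(votes))  # ['hmm', 'viterbi']
```

The 1/N_k weighting keeps an annotator who labels many terms from dominating the ranking.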
Experiments
Evaluation: for unsupervised learning we set the number of extracted key terms to N; supervised learning is evaluated with 3-fold cross validation.
Experiments
Feature Effectiveness (neural network, keywords only, ASR transcriptions), F-measure. Pr: prosodic, Lx: lexical, Sm: semantic.
In this experiment we examine feature effectiveness. Each feature set alone (rows a-c) gives an F1 between 20.78 and 42.86. Combining prosodic and lexical features (row d) raises F1 to 48.15, so these two sets are additive; adding semantic features (row e) further improves it to 56.55. All three sets of features are useful.
Experiments
Overall Performance (AB: AdaBoost, NN: Neural Network), F-measure: 67.31, 62.39, 55.84, ..., 51.95, 23.38.
The baseline is the conventional TFIDF score without branching-entropy phrase extraction, with stop word removal and PoS filtering. The much better performance with phrase identification shows that branching entropy works well, supporting the assumption that a term whose context has high branching entropy is more likely to be a key term. Supervised approaches beat unsupervised ones, K-means Exemplar outperforms TFIDF, and the best results come from the neural network, with F1 of 67.31 on manual and 62.39 on ASR transcriptions.
Experiments
Overall Performance (AB: AdaBoost, NN: Neural Network), F-measure for manual / ASR transcriptions: 67.31 / 62.39 (NN), 62.70 / 57.68, 55.84 / 51.95, 52.60 / 43.51, 23.38 / 20.78.
The performance on ASR transcriptions is only slightly worse than on manual transcriptions and remains reasonable; supervised learning with the neural network gives the best results.
Conclusion
Conclusion
We propose a new approach to extract key terms. The performance can be improved by identifying phrases with branching entropy and by using prosodic, lexical, and semantic features together. The results are encouraging.
From the experiments we conclude that the proposed approach extracts key terms effectively, and that performance is improved by two ideas: using branching entropy to extract key phrases, and using the three sets of features together.
Thanks for your attention! Q & A
NTU Virtual Instructor: