Learning Within-Sentence Semantic Coherence
Elena Eneva, Rose Hoberman, Lucian Lita
Carnegie Mellon University
Semantic (in)Coherence
Trigram models: content words can be unrelated
Effect on speech recognition:
– Actual utterance: "THE BIRD FLU HAS AFFECTED CHICKENS FOR YEARS BUT ONLY RECENTLY BEGAN MAKING HUMANS SICK"
– Top hypothesis: "THE BIRD FLU HAS AFFECTED SECONDS FOR YEARS BUT ONLY RECENTLY BEGAN MAKING HUMAN SAID"
Our goal: model semantic coherence
A Whole Sentence Exponential Model [Rosenfeld 1997]
P(s) \stackrel{\text{def}}{=} \frac{1}{Z}\, P_0(s)\, \exp\!\Big(\sum_i \lambda_i f_i(s)\Big)
P_0(s) is an arbitrary initial model (typically an N-gram)
the f_i(s) are arbitrary computable properties of s (aka features)
Z is a universal normalizing constant
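As a concrete illustration, a minimal sketch (not the authors' code) of how such a model assigns an unnormalized score to a sentence; the feature functions and weights are hypothetical placeholders, and Z is ignored since it is constant across sentences:

    import math

    def whole_sentence_log_score(sentence, p0, features, lambdas):
        """Log of the unnormalized whole-sentence exponential model:
        P(s) = (1/Z) * P0(s) * exp(sum_i lambda_i * f_i(s))."""
        log_score = math.log(p0(sentence))            # baseline N-gram probability
        for f_i, lambda_i in zip(features, lambdas):  # feature contributions
            log_score += lambda_i * f_i(sentence)
        return log_score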
A Methodology for Feature Induction
Given a corpus T of training sentences:
1. Train the best-possible baseline model, P_0(s)
2. Use P_0(s) to generate a corpus T_0 of "pseudo sentences"
3. Pose a challenge: find (computable) differences that allow discrimination between T and T_0
4. Encode the differences as features f_i(s)
5. Train a new exponential model of the form above, using the new features f_i(s)
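Step 2 needs a way to sample "pseudo sentences" from the baseline model. A minimal sketch of what such a sampler could look like for a trigram P_0 (the data structure and length cutoff are assumptions, not the authors' implementation):

    import random

    def sample_pseudo_sentence(trigram_probs, max_len=40):
        """Draw one pseudo sentence from a trigram model.
        trigram_probs maps a (w1, w2) history to a dict {next_word: probability}."""
        sent = ["<s>", "<s>"]
        while len(sent) < max_len + 2:
            dist = trigram_probs[(sent[-2], sent[-1])]
            words = list(dist.keys())
            next_word = random.choices(words, weights=[dist[w] for w in words])[0]
            if next_word == "</s>":
                break
            sent.append(next_word)
        return sent[2:]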
Discrimination Task
– feel - - sacrifice - - sense meant trust truth
– kind - free trade agreements living - - ziplock bag university japan's daiwa bank stocks step
Are these content words generated from a trigram model or taken from a natural sentence?
Building on Prior Work
Define "content words" (all but the top 50 most frequent words)
Goal: model the distribution of content words in a sentence
Simplification: model pairwise co-occurrences ("content word pairs")
Collect contingency tables and calculate a measure of association for them (see the sketch below)
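A sketch of the contingency-table collection step (the stopword list and the per-sentence treatment of word pairs are simplifying assumptions):

    from collections import Counter
    from itertools import combinations

    def cooccurrence_counts(sentences, stopwords):
        """For every content-word pair, count the sentences containing both words;
        together with per-word counts and the corpus size, this fills the 2x2
        contingency table on the next slide."""
        pair_counts, word_counts = Counter(), Counter()
        for sent in sentences:
            content = sorted(set(w for w in sent if w not in stopwords))
            word_counts.update(content)
            pair_counts.update(combinations(content, 2))
        return pair_counts, word_counts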
Q Correlation Measure
Derived from the co-occurrence contingency table:

              W1 yes   W1 no
    W2 yes     c11      c21
    W2 no      c12      c22

Q values range from –1 to +1
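The slide does not write the statistic out; a [–1, +1] association measure computed from a 2x2 contingency table is Yule's Q, and assuming that is the measure meant here, with the cell labels above:

    Q = \frac{c_{11}\,c_{22} - c_{12}\,c_{21}}{c_{11}\,c_{22} + c_{12}\,c_{21}}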
Density Estimates
We hypothesized:
– Trigram sentences: word-pair correlation is completely determined by distance
– Natural sentences: word-pair correlation is independent of distance
Kernel density estimation:
– distribution of Q values in each corpus
– at varying distances
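A sketch of the density-estimation step, assuming a Gaussian kernel via scipy (the slides do not specify kernel or bandwidth); q_by_distance maps each word-pair distance to the Q values observed at that distance in one corpus:

    import numpy as np
    from scipy.stats import gaussian_kde

    def fit_q_densities(q_by_distance):
        """One kernel density estimate of the Q distribution per distance."""
        return {d: gaussian_kde(np.asarray(qs)) for d, qs in q_by_distance.items()}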
Q Distributions
[Figure: density of Q values for Broadcast News vs. trigram-generated sentences, shown at distance 1 and distance 3]
Likelihood Ratio Feature
Example: "she is a country singer searching for fame and fortune in nashville"
Q(country, nashville) = 0.76, distance = 8
Pr(Q=0.76 | d=8, BNews) = 0.32
Pr(Q=0.76 | d=8, Trigram) = 0.11
Likelihood ratio = 0.32 / 0.11 = 2.9
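Given one set of density estimates per corpus (as in the earlier sketch), the likelihood ratio for a single word pair could be computed roughly as follows (hypothetical variable names):

    def likelihood_ratio(q, distance, bnews_kdes, trigram_kdes):
        """Ratio of the natural-speech density to the trigram density at this
        (Q, distance) point, e.g. 0.32 / 0.11 = 2.9 in the example above."""
        p_bnews = bnews_kdes[distance](q)[0]
        p_trigram = trigram_kdes[distance](q)[0]
        return p_bnews / p_trigram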
Simpler Features
Q-value based:
– Mean, median, min, max of Q values for content word pairs in the sentence (Cai et al. 2000)
– Percentage of Q values above a threshold
– High/low correlations across large/small distances
Other:
– Word and phrase repetition
– Percentage of stop words
– Longest sequence of consecutive stop/content words
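A sketch of how the Q-value summary features could be computed for one sentence (the 0.5 threshold is illustrative, not taken from the slides):

    import statistics

    def simple_q_features(q_values, threshold=0.5):
        """Summary statistics over the Q values of all content-word pairs in a sentence."""
        return {
            "q_mean": statistics.mean(q_values),
            "q_median": statistics.median(q_values),
            "q_min": min(q_values),
            "q_max": max(q_values),
            "frac_above_threshold": sum(q > threshold for q in q_values) / len(q_values),
        }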
Datasets
LM and contingency tables (Q values) derived from 103 million words of Broadcast News (BN)
From the remainder of the BN corpus and sentences sampled from the trigram LM:
– Q value distributions estimated from ~100,000 sentences
– Decision tree trained and tested on ~60,000 sentences
Disregarded sentences with fewer than 7 words, e.g.:
– "Mike Stevens says it's not real"
– "We've been hearing about it"
Experiments
Learners:
– C5.0 decision tree
– Boosted decision stumps with AdaBoost.MH
Methodology:
– 5-fold cross-validation on ~60,000 sentences
– Boosting for 300 rounds
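A rough analogue of this setup (C5.0 and AdaBoost.MH themselves are not available in scikit-learn; a CART decision tree and AdaBoost over depth-1 stumps stand in here, so results would differ):

    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def evaluate(X, y):
        """5-fold cross-validated accuracy for a decision tree and boosted stumps."""
        learners = {
            "decision tree": DecisionTreeClassifier(),
            "boosted stumps": AdaBoostClassifier(
                DecisionTreeClassifier(max_depth=1), n_estimators=300),
        }
        return {name: cross_val_score(clf, X, y, cv=5).mean()
                for name, clf in learners.items()}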
Results

Feature set                                    Classification accuracy (%)
Q mean, median, min, max (previous work)               ± 0.36
Likelihood ratio                               77.76 ± 0.49
All but likelihood ratio                       80.37 ± 0.42
All features                                   80.37 ± 0.46
Likelihood ratio + non-Q features
Shannon-Style Experiment
50 sentences:
– half "real" and half trigram-generated
– stopwords replaced by dashes
30 participants:
– average accuracy 73.77% ± 6
– best individual accuracy 84%
Our classifier:
– accuracy 78.9% ± 0.42
Summary
Introduced a set of statistical features which capture aspects of semantic coherence
Trained a decision tree that classifies with ~80% accuracy
Next step: incorporate the features into the exponential LM
Future Work
Combat data sparsity:
– Confidence intervals
– A different correlation statistic
– Stemming or clustering the vocabulary
Evaluate derived features:
– Incorporate into an exponential language model
– Evaluate the model on a practical application
Agreement among Participants
Expected Perplexity Reduction
Semantic coherence feature fires on:
– 78% of broadcast news sentences
– 18% of trigram-generated sentences
Kullback-Leibler divergence: 0.814
Average perplexity reduction per word = 0.0419 (2^.814 / 21)
Per sentence?
– Features modify the probability of the entire sentence
– The effect of a feature on per-word probability is small
Distribution of Likelihood Ratio
[Figure: density of likelihood ratio values for Broadcast News vs. trigram-generated sentences]
Discrimination Task
Natural sentence:
– but it doesn't feel like a sacrifice in a sense that you're really saying this is you know i'm meant to do things the right way and you trust it and tell the truth
Trigram-generated:
– they just kind of free trade agreements which have been living in a ziplock bag that you say that i see university japan's daiwa bank stocks step though
Q Values at Distance 1
[Figure: density of Q values at distance 1 for Broadcast News vs. trigram-generated sentences]
Q Values at Distance 3
[Figure: density of Q values at distance 3 for Broadcast News vs. trigram-generated sentences]
Outline
The problem of semantic (in)coherence
Incorporating coherence into the whole-sentence exponential LM
Finding better features for this model using machine learning
Semantic coherence features
Experiments and results