Predicting sentence specificity, with applications to news summarization Ani Nenkova, joint work with Annie Louis University of Pennsylvania.


Motivation: A well-written text is a mix of general statements and sentences that provide details. In information retrieval, this helps find documents that are both relevant and well written. In writing support, it lets us visualize the general and specific areas of a text.

Overview: a supervised sentence-level classifier for general vs. specific sentences.
- Training data: existing discourse relation annotations from the PDTB
- Features: lexical, language model, syntax, etc.
- Testing data: additional sentences judged by annotators
- Application: analysis of summarization output. Automatic summaries are too specific, and are worse for it.

Training data: Penn Discourse Treebank

Penn Discourse Treebank (PDTB): the largest annotated corpus of explicit and implicit discourse relations, covering 1 million words of Wall Street Journal text.
- Arguments: the spans linked by a relation (Arg1, Arg2)
- Sense: the semantics of the relation (a 3-level hierarchy)
Explicit relations are signaled by discourse connectives: "I love ice-cream but I hate chocolates." Implicit relations hold between adjacent sentences in the same paragraph: "I came late. I missed the train."

Distribution of relations between adjacent sentences [figure]. (Adjacent sentences linked only by a shared entity are not considered a true discourse relation.)

Training data from PDTB: the Expansion class and its subtypes (example connectives in brackets)
- Conjunction [also, further]
- Restatement [specifically, overall]: Specification, Equivalence, Generalization
- Instantiation [for example]
- List [and]
- Alternative [or, instead]: Conjunctive, Disjunctive, Chosen alternative
- Exception [except]

Instantiation example: The 40-year-old Mr. Murakami is a publishing sensation in Japan. A more recent novel, Norwegian Wood, has sold more than forty million copies since Kodansha published it in 1987.

Examples of general/specific sentence pairs:
[Instantiation] General: Despite recent declines in yields, investors continue to pour cash into money funds. Specific: Assets of the 400 taxable funds grew by $1.5 billion during the latest week, to $352 billion.
[Specification] General: By most measures, the nation's industrial sector is now growing very slowly, if at all. Specific: Factory payrolls fell in September.

Experimental setup: two classifiers
- Instantiation-based: Arg1 general, Arg2 specific; 1403 examples
- Restatement#Specification-based: Arg1 general, Arg2 specific; 2370 examples
Implicit relations only. Because each relation contributes one general and one specific argument, the classes are balanced and the baseline accuracy is 50%. Evaluation uses 10-fold cross-validation with logistic regression.
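The setup above can be sketched as a minimal logistic regression trained by batch gradient descent and scored with k-fold cross-validation. This is an illustrative stand-in, not the authors' implementation: the function names (`train_logreg`, `predict_specific`, `cross_validate`), the learning rate, and the epoch count are assumptions, and feature vectors are taken as already extracted.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, lr=0.1, epochs=200):
    """Fit weights and bias by batch gradient descent on the log-loss.
    X: list of feature vectors; y: 0/1 labels (1 = specific, 0 = general)."""
    n_feat, m = len(X[0]), len(X)
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n_feat):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / m for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b


def predict_specific(w, b, x):
    """Classifier confidence that sentence x is specific."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def cross_validate(X, y, k=10, seed=0):
    """Accuracy over k folds; every example is tested exactly once."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for fold in range(k):
        test = set(idx[fold::k])
        train = [i for i in idx if i not in test]
        w, b = train_logreg([X[i] for i in train], [y[i] for i in train])
        for i in test:
            pred = 1 if predict_specific(w, b, X[i]) >= 0.5 else 0
            correct += pred == y[i]
    return correct / len(X)
```

With balanced labels, always predicting one class scores 50%, which is why that is the natural baseline for comparing `cross_validate` results.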

Features were developed on a small development set: 10 Specification pairs and 10 Instantiation pairs.

Features for general vs. specific
- Sentence length: number of tokens, number of nouns. We expected general sentences to be shorter.
- Polarity: number of positive, negative, and polarity words, also normalized by length (General Inquirer; MPQA subjectivity lexicon). In the development set, sentences with strong opinions are general.
- Language models: unigram, bigram, and trigram probability and perplexity, trained on one year of New York Times news. In the development set, general sentences contained unexpected, catchy phrases.
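A language model feature of this kind can be sketched with a tiny add-one smoothed unigram model that returns a sentence's log-probability and per-token perplexity. This is a simplified assumption for illustration: the slide uses unigram, bigram, and trigram models trained on New York Times text, and the class name `UnigramLM` and its smoothing scheme are hypothetical.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram model (a simplified stand-in for the
    n-gram models described on the slide)."""

    def __init__(self, corpus_tokens):
        self.counts = Counter(corpus_tokens)
        self.total = sum(self.counts.values())
        # One extra vocabulary slot reserves probability mass for unseen words
        self.vocab_size = len(self.counts) + 1

    def logprob(self, token):
        return math.log((self.counts[token] + 1) / (self.total + self.vocab_size))

    def features(self, tokens):
        """Sentence log-probability and per-token perplexity."""
        lp = sum(self.logprob(t) for t in tokens)
        return {"logprob": lp, "perplexity": math.exp(-lp / len(tokens))}


lm = UnigramLM("the fund grew the fund fell".split())
familiar = lm.features("the fund grew".split())
# Every token is unseen, so perplexity equals the smoothing denominator (6 + 5 = 11)
unusual = lm.features("catchy unexpected phrase".split())
```

Sentences built from rarer words get higher perplexity, which is the property the "unexpected, catchy phrases" observation would rely on.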

Features for general vs. specific
- Specificity: min, max, and average IDF
- WordNet: min, max, and average hypernym distance to the root for nouns and verbs
- Syntax: number of adjectives, adverbs, ADJPs, ADVPs, and verb phrases; average VP length
- Entities: numbers, proper names, $ signs, plural nouns
- Words: count of each word in the sentence
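A few of these features need no external tools and can be sketched directly: IDF statistics over a background collection, plus counts of numbers and dollar signs. This is a hypothetical helper (`idf_table`, `shallow_features`), not the authors' code; the slide does not say which tools produced the features, and the capitalized-token count below is only a crude proxy for proper names. WordNet and syntactic features are omitted because they require a lexicon and a parser.

```python
import math
import re


def idf_table(documents):
    """IDF = log(N / df) over a background collection of tokenized documents."""
    n = len(documents)
    df = {}
    for doc in documents:
        for tok in set(t.lower() for t in doc):
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log(n / c) for tok, c in df.items()}


NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def shallow_features(tokens, idf, unseen_idf):
    """IDF min/max/avg and simple entity counts for one tokenized sentence."""
    idfs = [idf.get(t.lower(), unseen_idf) for t in tokens]
    return {
        "n_tokens": len(tokens),
        "idf_min": min(idfs),
        "idf_max": max(idfs),
        "idf_avg": sum(idfs) / len(idfs),
        "n_numbers": sum(1 for t in tokens if NUMBER.fullmatch(t)),
        "n_dollar": tokens.count("$"),
        # Crude proper-name proxy: capitalized tokens after the first one
        "n_capitalized": sum(1 for t in tokens[1:] if t[:1].isupper()),
    }


docs = [["the", "fund", "grew"], ["the", "assets"], ["the", "fund"]]
idf = idf_table(docs)
feats = shallow_features(["The", "fund", "grew", "$", "1.5", "billion"], idf,
                         unseen_idf=math.log(len(docs)))
```

Unseen words receive the maximum IDF here (as if they occurred in one document), an assumption that treats them as rare.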

Accuracy of the general/specific classifier trained on Instantiations [chart]. Best: 76% accuracy.

Accuracy of the general/specific classifier trained on Specifications [chart]. Best: 60% accuracy.

The Instantiation-based classifier gave better performance. Best individual feature set: words (74.8%). Non-lexical features are almost as good (74.1%). Combining them brings little improvement (75.8%).

Feature analysis. Words with the highest weights [Instantiation-based]:
- General: number, but, also, however, officials, some, what, lot, prices, business, were…
- Specific: one, a, to, co, I, called, we, could, get…
General sentences are characterized by plural nouns, dollar signs, lower language model probability, and more polarity words, adjectives, and adverbs. Specific sentences are characterized by numbers and names.

More testing data: direct judgments of WSJ and AP sentences collected on Amazon Mechanical Turk; about 600 sentences, with 5 judgments per sentence.

Agreement (number of the 5 annotators who chose the majority label):

Agree   WSJ total   WSJ general   WSJ specific   AP total   AP general   AP specific
5       96          51            45             108        33           75
4       102         57            45             91         35           56
3       95          52            43             88         49           39
Total   294         160           133            292        117          170

In WSJ, more sentences are general (55%). In AP, more sentences are specific (60%).
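The agreement levels in this table can be derived from raw judgments as sketched below: the majority label and the number of annotators who chose it. With 5 annotators and two labels, the majority always has at least 3 votes, which is why the rows run from 3 to 5. The function names and the list-of-labels input format are hypothetical.

```python
from collections import Counter


def aggregate(judgments):
    """Majority label and agreement level for one sentence.
    judgments: labels such as "general"/"specific", one per annotator."""
    label, votes = Counter(judgments).most_common(1)[0]
    return label, votes


def agreement_table(all_judgments):
    """Count sentences per (majority label, agreement level) cell."""
    return Counter(aggregate(j) for j in all_judgments)


sentences = [
    ["general"] * 5,
    ["specific"] * 4 + ["general"],
    ["general", "specific", "general", "specific", "general"],
]
table = agreement_table(sentences)
```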

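Why the difference between Instantiation and Specification? Some of the new annotations were on sentences from our initial training data:

Instantiation (32)   General   Specific
Arg1                 29        3
Arg2                 6         26

Specification (16)   General   Specific
Arg1                 10        6
Arg2                 8         8

Instantiation has more detectable properties associated with Arg1 and Arg2: its Arg1 is almost always judged general and its Arg2 specific, while Specification arguments are split nearly evenly.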

Accuracy (%) of the classifier on the new data

Examples    All features   Non-lexical   Words   All features   Non-lexical   Words
5 agree     90.6           96.8          84.3    69.4           94.4          78.7
4+5 agree   80.8           88.8          77.7    65.8           89.9          74.8
All         73.7           76.7          71.6    59.2           81.1          67.5

- Non-lexical features work better on this data.
- Performance is almost the same as in cross-validation.
- The classifier is more accurate on examples where people agree.
- Classifier confidence correlates with annotator agreement.

Application of our classifier to full articles
- Distribution of general/specific sentences in news documents
- Can the classifier detect differences between general and specific summaries written by people?
- Do summaries have more general or more specific content than their input?
- How does specificity affect summary quality?
Compare different types of summaries:
- Human abstracts: written from scratch
- Human extracts: whole sentences selected from the input
- System summaries: all extracts

Example general and specific predictions:
[general] Seismologists said the volcano had plenty of built-up magma and even more severe eruptions could come later.
[specific] The volcano's activity -- measured by seismometers detecting slight earthquakes in its molten rock plumbing system -- is increasing in a way that suggests a large eruption is imminent, Lipman said.

Example predictions:
[specific] The novel, a story of a Scottish low-life narrated largely in Glaswegian dialect, is unlikely to prove a popular choice with booksellers who have damned all six books shortlisted for the prize as boring, elitist and – worst of all – unsaleable. …
[general] The Booker prize has, in its 26-year history, always provoked controversy.

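Computing specificity for a text: sentences in a summary vary in length, so the score is computed at the word level. Each token is assigned the classifier's confidence that its sentence belongs to the specific class, and the specificity score of the text is the average over all tokens:

$$\text{spec}(T) = \frac{\sum_{s \in T} |s| \cdot P(\text{specific} \mid s)}{\sum_{s \in T} |s|}$$

where |s| is the number of tokens in sentence s. [Figure: sentences S1, S2, S3 with words w11 … w33; example sentence confidences 0.23, 0.81, 0.68.]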
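The token-level average can be sketched in a few lines. The `(n_tokens, p_specific)` input format and the function name are assumptions for illustration; the confidences are whatever the sentence classifier outputs.

```python
def text_specificity(sentences):
    """Token-level average specificity of a text.
    sentences: (n_tokens, p_specific) pairs, where p_specific is the
    classifier's confidence that the sentence is specific. Each token
    inherits its sentence's confidence, so longer sentences weigh more."""
    total_tokens = sum(n for n, _ in sentences)
    return sum(n * p for n, p in sentences) / total_tokens


# Three 3-token sentences with the example confidences from the slide
equal = text_specificity([(3, 0.23), (3, 0.81), (3, 0.68)])
# A long specific sentence outweighs a short general one
unequal = text_specificity([(10, 0.9), (2, 0.1)])
```

With equal sentence lengths this reduces to the plain mean of sentence scores (about 0.573 here); with unequal lengths it does not (about 0.767 versus a sentence-level mean of 0.5), which is the point of scoring at the word level.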

50 specific and general human summaries: average specificity

Text        General category   Specific category
Summaries   0.55               0.63
Inputs      0.63               0.65

- No significant differences in the specificity of the inputs
- Significant differences in the specificity of the summaries in the two categories
- Our classifier is able to detect the differences

Data: DUC 2002
- Generic multi-document summarization task
- 59 input sets, each with 5 to 15 news documents
- 3 types of 200-word summaries: 1. human abstracts, 2. human extracts (2 assessors × 59 inputs each), 3. system extracts (9 systems × 59 inputs)
- Manually assigned content and linguistic quality scores

Specificity analysis of summaries
1. More general content is preferred in abstracts.
2. The process of extraction alone makes summaries more specific.
3. System summaries are overly specific.
[Chart: average specificity] Inputs 0.65; human abstracts 0.62; system extracts 0.74; human extracts 0.72.

Histogram of specificity scores [figure]: human summaries are more general. Is this aspect related to summary quality?

Analysis of system summaries: specificity and quality
1. Content quality: importance of the content included in the summary
2. Linguistic quality: how well written the summary is perceived to be
3. Quality of general/specific summaries: when a summary is intended to be general or specific

Relationship to content selection scores. The coverage score measures closeness to the human summary, based on a clause-level comparison. For system summaries, the correlation between coverage score and average specificity is -0.16* (p-value = 0.0006): less specific summaries tend to have better content.
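A correlation like this can be computed as sketched below. The slide does not say which correlation coefficient was used; Pearson's r is an assumption here, and the sketch does not compute the p-value. Python 3.10+ also provides `statistics.correlation` for the same coefficient.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

In this setting `xs` would hold per-summary coverage scores and `ys` the matching average specificity scores; a negative value means more specific summaries tend to score lower on coverage.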

But the correlation is not very high. Specificity is related to how content is realized, which is different from the importance of the content. Content quality = content importance + appropriate specificity level. Content importance is measured with ROUGE scores, the n-gram overlap between the system summary and the human summary, which is the standard evaluation of automatic summaries.

Specificity as one of the predictors: coverage score ~ ROUGE-2 (bigrams) + specificity, fitted by linear regression.
Weights for predictors in the regression model:

Predictor     Mean β    Significance (hypothesis β = 0)
(Intercept)   0.212     2.3e-11
ROUGE-2       1.299     < 2.0e-16
Specificity   -0.166    3.1e-05

Is the combination a better predictor than ROUGE alone?
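A regression of this form can be sketched as ordinary least squares solved through the normal equations. This is a minimal illustration, not the authors' statistics setup: it returns only the coefficients (no standard errors or significance values), and the data below are made up and noise-free so the fit recovers known coefficients exactly.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.
    X: rows of predictor values; an intercept column is prepended here.
    Returns [intercept, b_1, ..., b_k]."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical, noise-free data: coverage = 0.1 + 2.0 * ROUGE-2 - 0.3 * specificity
rouge2 = [0.05, 0.08, 0.10, 0.12, 0.07, 0.09]
spec = [0.80, 0.70, 0.60, 0.75, 0.65, 0.90]
coverage = [0.1 + 2.0 * r - 0.3 * s for r, s in zip(rouge2, spec)]
beta = ols(list(zip(rouge2, spec)), coverage)
```

Answering the slide's closing question would mean fitting ROUGE-2 alone as well and comparing the two models, for example by their explained variance.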

2. Specificity and linguistic quality. We used different data, TAC 2009, because DUC 2002 only reported the number of errors, and these were specified as ranges such as 1-5 errors. The TAC 2009 linguistic quality score is judged manually on a scale of 1 to 10 and combines different aspects: coherence, referential clarity, grammaticality, and redundancy.

What is the average specificity in different score categories?

Linguistic score   No. summaries   Average specificity
Poor (1, 2)        202             0.71
Mediocre (5)       400             0.72
Best (9, 10)       79              0.77

More general ~ lower score! General content is useful, but it needs proper context. For example, if a summary starts as follows: "We are quite a ways from that, actually. As ice and snow at the poles melt, …", its specificity is low and its linguistic quality is 1.

Data for analysing the generalization operation: aligned pairs of abstract and source sentences conveying the same content, the traditional data used for compression experiments. Ziff-Davis tree alignment corpus [Galley & McKeown (2007)]: 15964 sentence pairs, with any number of deletions and up to 7 substitutions. Only 25% of abstract sentences are mapped, but the data is still useful for observing trends.

Generalization operation in human abstracts (S = specific, G = general; source sentence → abstract sentence)

Transition   No. pairs   % pairs
SS           6371        39.9
SG           5679        35.6
GG           3562        22.3
GS           352         2.2

One-third of all transformations are specific to general: human abstracts involve a lot of generalization.
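The transition table can be derived as sketched below once each sentence in an aligned pair carries a general/specific label (presumably from the classifier; the slide does not say how the labels were assigned). The function name and the `(source_label, abstract_label)` format are hypothetical; the usage rebuilds the slide's counts to check the percentages.

```python
from collections import Counter


def transition_distribution(pairs):
    """pairs: (source_label, abstract_label) tuples with labels "S" or "G".
    Returns {transition: (count, percent of all pairs)}."""
    counts = Counter(src + tgt for src, tgt in pairs)
    total = sum(counts.values())
    return {t: (c, round(100 * c / total, 1)) for t, c in counts.items()}


# Rebuild the slide's counts to check the reported percentages
pairs = ([("S", "S")] * 6371 + [("S", "G")] * 5679
         + [("G", "G")] * 3562 + [("G", "S")] * 352)
dist = transition_distribution(pairs)
```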

How do specific sentences get converted to general ones?

                          SG     SS     GG     GS
Orig. length (words)      33.5   33.4   21.5   22.7
New/orig. length (%)      40.8   56.6   60.8   66.0
Avg. deletions (words)    21.4   16.3   9.3    8.4

Choose long sentences and compress them heavily! A measure of generality would be useful to guide compression; currently only importance and grammaticality are used.

Use of general sentences in human extracts: "Details of Maxwell's death were sketchy." "Folksy was an understatement." "Long live democracy!" "Instead it sank like the Bismarck."
Example use of a general sentence in a summary:
… With Tower's qualifications for the job, the nominations should have sailed through with flying colors. [Specific] Instead it sank like the Bismarck. [General] …
Future: can we learn to generate and select general sentences to include in automatic summaries?

Conclusions
- Built a classifier for general and specific sentences using existing annotations, but tested it on new data and in a task-based evaluation.
- The confidence of the classifier is highly correlated with human agreement.
- Analyzed human and machine summaries: machine summaries are too specific, but adding general sentences is difficult because the context has to be right.

Further details in:
- Annie Louis and Ani Nenkova. Automatic identification of general and specific sentences by leveraging discourse annotations. Proceedings of IJCNLP, 2011 (to appear).
- Annie Louis and Ani Nenkova. Text specificity and impact on quality of news summaries. Proceedings of the ACL-HLT Workshop on Monolingual Text to Text Generation, 2011.
- Annie Louis and Ani Nenkova. Creating Local Coherence: An Empirical Assessment. Proceedings of NAACL-HLT, 2010.

Two types of local coherence: entity and rhetorical. Local coherence means that adjacent sentences in a text flow from one to another. Entity coherence: the sentences share a topic ("John was hungry. He went to a restaurant."). But only 42% of sentence pairs are entity-linked [previous corpus studies]. Will core discourse relations connect the sentence pairs that do not share entities? This is a popular hypothesis in prior work.

Investigations into text quality
- The mix of discourse relations in a text is highly predictive of the perceived quality of the text.
- Both implicit and explicit relations are needed to predict text quality.
- Predicting the sense of implicit discourse relations is a very difficult task; most are predicted to be Expansion.
How is local coherence created?

Joint analysis combining PDTB and OntoNotes annotations: 590 articles, with noun phrase coreference from OntoNotes. In articles of different lengths, 40 to 50% of sentence pairs do not share entities.

Expansions cover most of the non-entity-sharing instances [figure].

Expansions have the lowest rate of coreference [figure].

Rate of coreference in 2nd-level elaboration relations [figure].

Example Instantiation and List relations
Instantiation: The economy is showing signs of weakness, particularly among manufacturers. Exports, which played a key role in fueling growth over the last two years, seem to have stalled.
List: Many of Nasdaq's biggest technology stocks were in the forefront of the rally.
- Microsoft added 2 1/8 to 81 3/4 and Oracle Systems rose 1 1/2 to 23 1/4.
- Intel was up 1 3/8 to 33 3/4.

Overall distribution of sentence pairs between the two coherence devices [figure]: 30% of sentence pairs have no coreference and are in a weak discourse relation (Expansion/EntRel). We must explore elaboration more closely to identify how it creates coherence.
