1
Albert Gatt Corpora and Statistical Methods Lecture 12
2
Automatic summarisation Part 2
3
The task
Given a single document or collection of documents, return an abridged version that distils the most important information (possibly for a particular task/user).
Summarisation systems perform:
1. Content selection: choosing the relevant information in the source document(s), typically in the form of sentences/clauses.
2. Information ordering.
3. Sentence realisation: cleaning up the sentences to make them fluent.
Note the similarity to NLG architectures. The main difference: summarisation input is text, whereas NLG input is non-linguistic data.
4
Types of summaries
Extractive vs. abstractive:
- Extractive: select informative sentences/clauses in the source document and reproduce them; most current systems (and our focus today).
- Abstractive: summarise the subject matter (usually using new sentences); much harder, as it involves deeper analysis & generation.
Other dimensions:
- Single-document vs. multi-document
- Context
- Query-specific vs. query-independent
5
Extracts vs. abstracts: Lincoln's Gettysburg Address (an extract and an abstract shown side by side). Source: Jurafsky & Martin (2009), p. 823.
6
A Summarization Machine (diagram): input documents and queries are mapped to extracts and abstracts of varying length (headline, very brief, brief, long; 10%, 50%, 100%) and type (indicative vs. informative, generic vs. query-oriented, background, just the news), via intermediate representations such as case frames, templates, core concepts, core events, relationships, clause fragments and index terms. Adapted from: Hovy & Marcu (1998). Automated text summarization. COLING-ACL Tutorial. http://www.isi.edu/~marcu/
7
The Modules of the Summarization Machine (diagram): extraction (documents to extracts), filtering (multi-document extracts), interpretation (extracts to intermediate representations such as case frames, templates, core concepts, core events, relationships, clause fragments, index terms) and generation (producing abstracts).
8
Unsupervised single-document summarisation I “bag of words” approaches
9
Basic architecture for single-doc summarisation. Content selection is the central task in single-document summarisation; it can be supervised or unsupervised. Information ordering is less critical: since we have only one document, we can rely on the order in which sentences occur in the source itself.
10
Unsupervised content selection I: topic signatures
Simplest unsupervised algorithm:
- Split the document into sentences.
- Select those sentences which contain the most salient/informative words.
- Salient term = a term in the topic signature (words that are crucial to identifying the topic of the document).
Topic signature detection:
- Represent sentences (documents) as word vectors.
- Compute the weight of each word.
- Weight sentences by the average weight of their (non-stop) words.
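To make the scheme concrete, here is a minimal Python sketch of the sentence-scoring step (not from the slides): the stop list and the `term_weight` mapping are placeholders, to be filled by one of the weighting schemes on the following slides.

```python
# Placeholder stop list; in practice use a proper one.
STOPWORDS = {"the", "a", "an", "and", "to", "it", "in", "of", "is"}

def score_sentences(sentences, term_weight):
    """Score each sentence by the average weight of its non-stop words.

    `sentences` is a list of token lists; `term_weight` maps a word to its
    weight (e.g. tf-idf or log-likelihood ratio, see the next slides).
    """
    scores = []
    for sent in sentences:
        content = [w.lower() for w in sent if w.lower() not in STOPWORDS]
        if not content:
            scores.append(0.0)
        else:
            scores.append(sum(term_weight.get(w, 0.0) for w in content) / len(content))
    return scores

def select_top(sentences, scores, k=3):
    """Return the k highest-scoring sentences, in their original document order."""
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```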
11
Vector space revisited: a key-terms-by-document matrix built from a document collection.
Doc 1: To make fried chicken, take the chicken, chop it up and put it in a pan until golden. Remove the fried chicken pieces and serve hot.
Doc 2: To make roast chicken, take the chicken and put in the oven until golden. Remove the chicken and serve hot.
Columns = documents; rows = term frequencies. NB: a stop list is used to remove very high frequency words.
12
Term weighting: tf-idf
A common term weighting scheme used in the information retrieval literature.
- tf (term frequency) = frequency of the term in the document.
- idf (inverse document frequency) = log(N / n_i), where N = no. of documents and n_i = no. of docs in which term i occurs.
Method:
1. Count the frequency of the term in the doc being considered.
2. Count the inverse document frequency over the whole document collection.
3. Compute the tf-idf score: tf-idf_i = tf_i * log(N / n_i).
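A minimal sketch of how these weights could be computed for one document against a collection; the function and variable names are illustrative, not from the lecture.

```python
import math
from collections import Counter

def tfidf_weights(doc_tokens, collection):
    """tf-idf weight for every term of one document.

    `doc_tokens` is a token list; `collection` is the list of all documents
    (each a token list, including this document), used for document frequencies.
    """
    n_docs = len(collection)
    tf = Counter(doc_tokens)
    df = Counter()                      # in how many documents does each term occur?
    for doc in collection:
        for term in set(doc):
            df[term] += 1
    return {term: freq * math.log(n_docs / df[term]) for term, freq in tf.items()}
```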
13
Term weighting: log-likelihood ratio (LLR)
Requirements: a background corpus.
For a term w, the LLR compares the likelihood of the observed counts of w in the input corpus and in the background corpus under two hypotheses: that w occurs with equal probability in both, and that it occurs with different probabilities.
Since (a transform of) the LLR is asymptotically chi-square distributed, if the value is significant we treat the term as a key term. Chi-square values (with one degree of freedom) tend to be significant at p = .001 if they are greater than 10.8.
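A rough sketch of the usual Dunning-style computation of this statistic (the slides do not give the formula, so the exact form here is an assumption); counts are raw token counts in the input and background corpora, and the 10.8 threshold follows the slide.

```python
import math

def _log_likelihood(k, n, p):
    """Log-likelihood of k successes in n Bernoulli trials with probability p."""
    eps = 1e-12                       # guard against log(0)
    p = min(max(p, eps), 1 - eps)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr(count_input, size_input, count_bg, size_bg):
    """Log-likelihood ratio statistic (-2 log lambda) for one term.

    Compares H1 (the term is equally probable in the input and the background
    corpus) with H2 (the probabilities differ).
    """
    p_pooled = (count_input + count_bg) / (size_input + size_bg)
    p_input = count_input / size_input
    p_bg = count_bg / size_bg
    log_h1 = (_log_likelihood(count_input, size_input, p_pooled)
              + _log_likelihood(count_bg, size_bg, p_pooled))
    log_h2 = (_log_likelihood(count_input, size_input, p_input)
              + _log_likelihood(count_bg, size_bg, p_bg))
    return 2 * (log_h2 - log_h1)

# A term is treated as part of the topic signature if llr(...) > 10.8
# (the chi-square critical value at p = .001 with one degree of freedom).
```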
14
Sentence centrality
Instead of weighting sentences by averaging individual term weights, we can compute the pairwise distance between sentences and choose those sentences which are, on average, closest to each other.
Example: represent sentences as tf-idf vectors and compute, for each sentence x, its average cosine similarity to all other sentences y:
centrality(x) = (1/K) Σ_y tf-idf-cosine(x, y), where K = total no. of sentences.
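A minimal sketch of this computation, assuming sentences are represented as sparse tf-idf dictionaries; dividing by K-1 rather than K (so that a sentence is not compared with itself) is a minor variant of the formula above.

```python
import math

def cosine(vec_x, vec_y):
    """Cosine similarity between two sparse vectors (dicts of term -> weight)."""
    shared = set(vec_x) & set(vec_y)
    dot = sum(vec_x[t] * vec_y[t] for t in shared)
    norm_x = math.sqrt(sum(v * v for v in vec_x.values()))
    norm_y = math.sqrt(sum(v * v for v in vec_y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

def centrality_scores(sentence_vectors):
    """Average cosine of each sentence to all the others (cf. the formula above)."""
    k = len(sentence_vectors)
    scores = []
    for i, x in enumerate(sentence_vectors):
        total = sum(cosine(x, y) for j, y in enumerate(sentence_vectors) if j != i)
        scores.append(total / (k - 1) if k > 1 else 0.0)
    return scores
```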
15
Unsupervised single-document summarisation II Using rhetorical structure
16
Rhetorical Structure Theory
RST (Mann and Thompson 1988) is a theory of text structure: not about what texts are about, but about how bits of the underlying content of a text are structured so as to hang together in a coherent way.
The main claims of RST:
- Parts of a text are related to each other in predetermined ways.
- There is a finite set of such relations.
- Relations hold between two spans of text: a nucleus and a satellite.
17
A small example
You should visit the new exhibition. It's excellent. It got very good reviews. It's completely free.
(Diagram: the clauses are linked by MOTIVATION, EVIDENCE and ENABLEMENT relations; "You should visit..." is the nucleus, "It's excellent" motivates it and is in turn supported by "It got very good reviews" via EVIDENCE, and "It's completely free" is linked by ENABLEMENT.)
18
An RST relation definition: MOTIVATION
- Nucleus: represents an action which the hearer is meant to do at some point in the future. (You should go to the exhibition.)
- Satellite: represents something which is meant to make the hearer want to carry out the nucleus action. (It's excellent. It got a good review.)
Note: the satellite need not be a single clause. In our example, the satellite has 2 clauses, which are themselves related to each other by the EVIDENCE relation.
- Effect: to increase the hearer's desire to perform the nucleus action.
19
RST relations more generally
An RST relation is defined in terms of:
- the nucleus, plus constraints on the nucleus (e.g. the nucleus of MOTIVATION is some action to be performed by the hearer);
- the satellite, plus constraints on the satellite;
- the desired effect.
Other examples of RST relations:
- CAUSE: the nucleus is the result; the satellite is the cause.
- ELABORATION: the satellite gives more information about the nucleus.
Some relations are multi-nuclear: they do not relate a nucleus and a satellite, but two or more nuclei (i.e. 2 pieces of information of the same status). Example, SEQUENCE: John walked into the room. He turned on the light.
20
Some more on RST
RST relations are neutral with respect to their realisation: you can express EVIDENCE in lots of different ways, e.g.:
- It's excellent. It got very good reviews.
- You can see that it's excellent from its great reviews.
- Its excellence is evidenced by the good reviews it got.
- It must be excellent, since it got good reviews.
21
RST for unsupervised content selection
1. Compute coherence relations between units (= clauses). Can use a discourse parser and/or rely on cue phrases; corpora annotated with RST relations exist.
2. Use the intuition that the nucleus of a relation is more central to the content than the satellite to identify the set of salient units Sal(n) for each node n of the discourse tree:
   - Base case: if n is a leaf node, then Sal(n) = {n}.
   - Recursive case: if n is a non-leaf node, then Sal(n) = ∪ { Sal(c) : c is a nucleus child of n }.
3. Rank the units in Sal(n): the higher the node of which a unit is a nucleus (i.e. the higher the node into whose Sal set it is promoted), the more salient it is.
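One possible way to operationalise the salience recursion and the ranking in code; the `RSTNode` class and the depth-based ranking are illustrative assumptions, not the exact algorithm from the lecture.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class RSTNode:
    """A node in an RST tree: a leaf is a clause; an internal node has children
    that are each marked as nucleus or satellite with respect to this node."""
    label: str
    is_nucleus: bool = True
    children: List["RSTNode"] = field(default_factory=list)

def sal(node: RSTNode) -> Set[str]:
    """Salient units of a node: the leaf itself, or the union of the salient
    units of the node's nucleus children (the 'promotion set')."""
    if not node.children:
        return {node.label}
    result: Set[str] = set()
    for child in node.children:
        if child.is_nucleus:
            result |= sal(child)
    return result

def rank_units(root: RSTNode):
    """Rank leaves by the depth at which they are first promoted: units that
    reach Sal() of a higher (shallower) node are more salient."""
    best_depth = {}
    def visit(node: RSTNode, depth: int):
        for unit in sal(node):
            best_depth.setdefault(unit, depth)   # shallowest promotion wins
        for child in node.children:
            visit(child, depth + 1)
    visit(root, 0)
    return sorted(best_depth, key=best_depth.get)
```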
22
Rhetorical structure: example (an RST tree over numbered clause units). Ranking of nodes: 2 > 8 > 3 > ...
23
Supervised content selection
24
Basic idea
Input: a training set consisting of documents plus human-produced (extractive) summaries, so that the sentences in each doc can be marked with a binary feature (1 = included in the summary; 0 = not included).
Train a machine learner to classify sentences as 1 (extract-worthy) or 0, based on features.
25
Features
- Position: important sentences tend to occur early in a document (but this is genre-dependent); e.g. in news articles the most important sentence is the title.
- Cue phrases: sentences with phrases like "to summarise" give important summary info (again genre-dependent: different genres have different cue phrases).
- Word informativeness: words in the sentence which belong to the doc's topic signature.
- Sentence length: we usually want to avoid very short sentences.
- Cohesion: we can use lexical chains (series of words that are indicative of the document's topic) to count how many words in a sentence are also in the document's lexical chain.
26
Algorithms
Once we have the feature set F = {f_1, ..., f_n} for a sentence, we want to compute P(extract-worthy | f_1, ..., f_n).
Many methods we've discussed will do: Naive Bayes, Maximum Entropy, ...
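A hedged sketch of what such a supervised extractor might look like, assuming scikit-learn is available; the feature functions and the cue-phrase list are placeholders based on the previous slide, not the lecture's own implementation.

```python
from sklearn.naive_bayes import GaussianNB

CUE_PHRASES = ("in summary", "to summarise", "in conclusion")   # placeholder list

def sentence_features(sentence, position, doc_length, topic_signature):
    """Turn one sentence (a token list) into a small numeric feature vector."""
    text = " ".join(sentence).lower()
    return [
        position / doc_length,                                    # relative position
        len(sentence),                                            # sentence length
        sum(1 for w in sentence if w.lower() in topic_signature), # informative words
        int(any(cue in text for cue in CUE_PHRASES)),             # cue-phrase flag
    ]

def train_extractor(docs, labels, topic_signature):
    """docs: list of documents (each a list of sentences); labels: parallel lists
    of 0/1 flags saying whether each sentence was in the human extract."""
    X, y = [], []
    for doc, doc_labels in zip(docs, labels):
        for i, sent in enumerate(doc):
            X.append(sentence_features(sent, i, len(doc), topic_signature))
            y.append(doc_labels[i])
    model = GaussianNB()
    model.fit(X, y)
    return model
```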
27
Which corpus?
There are some corpora with extractive summaries, but often we come up against the problem of not having the right data.
Many types of text themselves contain summaries; e.g. scientific articles have abstracts. But these are not purely extractive (though people tend to include sentences in abstracts that are very similar to sentences in their text).
Possible method: align sentences in an abstract with sentences in the document by computing their overlap (e.g. using n-grams).
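A minimal sketch of this alignment idea; the bigram size and the 0.5 overlap threshold are arbitrary placeholders.

```python
def ngrams(tokens, n=2):
    """Set of n-grams (as tuples) occurring in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def align_abstract(abstract_sents, doc_sents, n=2, threshold=0.5):
    """For each abstract sentence, find the document sentence with the highest
    n-gram overlap; mark it as extract-worthy (1) if the overlap is high enough."""
    if not doc_sents:
        return []
    labels = [0] * len(doc_sents)
    for a_sent in abstract_sents:
        a_grams = ngrams(a_sent, n)
        if not a_grams:
            continue
        overlaps = [len(a_grams & ngrams(d_sent, n)) / len(a_grams)
                    for d_sent in doc_sents]
        best = max(range(len(doc_sents)), key=lambda i: overlaps[i])
        if overlaps[best] >= threshold:
            labels[best] = 1
    return labels
```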
28
Realisation: sentence simplification
29
Realisation
With single-document summarisation, realisation isn't a big problem (we're reproducing sentences from the source). But we may want to simplify (or compress) the sentences.
The simplest method is to use heuristics that remove certain constituents, e.g.:
- Appositives: Rajam, 28, an artist who lives in Philadelphia, found inspiration in the back of city magazines.
- Sentential adverbs: As a matter of fact, this policy will be ruinous.
There is a lot of current research on simplification/compression, often using parsers to identify dependencies that can be omitted with little loss of information.
Realisation is much more of an issue in multi-document summarisation.
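A rough sketch of such a heuristic using a dependency parser (spaCy and its small English model are assumed here, though any parser would do); it simply drops appositive subtrees and some sentence-level adverbials, leaves stray commas behind, and is only meant to illustrate the idea.

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def compress(sentence):
    """Drop appositive modifiers and some sentence-level adverbial modifiers."""
    doc = nlp(sentence)
    drop = set()
    for token in doc:
        # Appositives: "Rajam, 28, an artist who lives in Philadelphia, ..."
        if token.dep_ == "appos":
            drop.update(t.i for t in token.subtree)
        # Adverbial modifiers attached before the main verb: "As a matter of fact, ..."
        if token.dep_ == "advmod" and token.head.dep_ == "ROOT" and token.i < token.head.i:
            drop.update(t.i for t in token.subtree)
    return "".join(t.text_with_ws for t in doc if t.i not in drop).strip()
```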
30
Multi-document summarisation
31
Why multi-document?
Very useful when queries return multiple documents from the web, when several articles talk about the same topic (e.g. a disease), ...
The steps are the same as for single-doc summarisation, but:
- we're selecting content from more than one source;
- we can't rely on the source documents alone for ordering;
- realisation is required to ensure coherence.
32
Content selection
Since we have multiple docs, we have a problem with redundancy: information repeated in several documents; overlapping words, sentences, phrases...
Methods:
- Modify the sentence scoring method to penalise redundancy, by comparing a candidate sentence to the sentences already chosen for the summary.
- Use clustering to group related sentences, and then perform selection on clusters. (More on clustering next week.)
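One common instantiation of the penalised-score idea is a greedy loop in the spirit of Maximal Marginal Relevance; this is a sketch rather than the slides' own formula, it reuses the `cosine` helper from the centrality sketch above, and the trade-off weight `lam` is a placeholder.

```python
def select_non_redundant(sent_vectors, relevance, k=5, lam=0.7):
    """Greedy selection: at each step pick the sentence with the best trade-off
    between its relevance score and its similarity to the sentences already
    selected. `sent_vectors` are sparse tf-idf dicts; `relevance` is a
    per-sentence score; `cosine` is the similarity function defined earlier."""
    selected = []
    candidates = list(range(len(sent_vectors)))
    while candidates and len(selected) < k:
        def penalised(i):
            redundancy = max((cosine(sent_vectors[i], sent_vectors[j])
                              for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=penalised)
        selected.append(best)
        candidates.remove(best)
    return selected
```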
33
Information ordering
If sentences are selected from multiple documents, we risk creating an incoherent document. Heuristics:
1. Rhetorical structure:
   *Therefore, I slept. I was tired.
   I was tired. Therefore, I slept.
2. Lexical cohesion:
   *We had chicken for dinner. Paul was late. It was roasted.
   We had chicken for dinner. It was roasted. Paul was late.
3. Referring expressions:
   *He said that... George W. Bush was speaking at a meeting.
   George W. Bush said that... He was speaking at a meeting.
These heuristics can be combined. We can also do information ordering during the content selection process itself.
34
Information ordering based on reference
Referring expressions (NPs that identify objects) include pronouns, names, definite NPs, ...
Centering Theory (Grosz et al. 1995): every discourse segment has a focus (what the segment is "about"). Entities are salient in discourse depending on their position in the sentence: SUBJECT >> OBJECT >> OTHER.
A coherent discourse is one which, as far as possible, maintains smooth transitions between sentences.
35
Information ordering based on lexical cohesion
Sentences which are "about" the same things tend to occur together in a document.
Possible method:
- use tf-idf cosine to compute pairwise similarity between the selected sentences;
- attempt to order the sentences so as to maximise the similarity between adjacent pairs.
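A minimal sketch of this ordering step; the greedy nearest-neighbour strategy is my assumption (finding the globally best order is intractable in general), and it again reuses the `cosine` helper from above.

```python
def order_by_cohesion(sent_vectors):
    """Greedy ordering: start from the first sentence, then repeatedly append
    the remaining sentence most similar to the last one placed."""
    if not sent_vectors:
        return []
    remaining = list(range(len(sent_vectors)))
    order = [remaining.pop(0)]
    while remaining:
        last = sent_vectors[order[-1]]
        nxt = max(remaining, key=lambda i: cosine(last, sent_vectors[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```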
36
Realisation: a comparison of two multi-document summaries, before and after revision (figure not reproduced). Source: Jurafsky & Martin (2009), p. 835.
37
Uses of realisation
Since sentences come from different documents, we may end up with infelicitous NP orderings (e.g. a pronoun before the corresponding definite NP). One possible solution:
1. Run a coreference resolver on the extracted summary.
2. Identify reference chains (NPs referring to the same entity).
3. Replace or reorder NPs if they violate coherence, e.g. use the full name before a pronoun.
Another interesting problem is sentence aggregation or fusion, where different phrases (from different sources) are combined into a single phrase.
38
Evaluating summarisation
39
Evaluation baselines
- Random sentences: if we're producing summaries of length N, we use as a baseline a random extractor that pulls out N sentences. Not too difficult to beat.
- Leading sentences: choose the first N sentences. Much more difficult to beat, since a lot of informative sentences are at the beginning of documents.
40
Some terminology (reminder)
- Intrinsic evaluation: evaluation of output in its own right, independent of a task (e.g. comparing output to human output).
- Extrinsic evaluation: evaluation of output in a particular task (e.g. humans answer questions after reading a summary).
We've seen the use of BLEU (intrinsic) for realisation in NLG. A similar metric in summarisation is ROUGE (Recall-Oriented Understudy for Gisting Evaluation).
41
BLEU vs. ROUGE
BLEU:
- Precision-oriented.
- Looks at n-gram overlap for different values of n up to some maximum.
- Measures the average n-gram overlap between an output text and a set of reference texts.
ROUGE:
- Recall-oriented.
- The n-gram length is fixed per variant: ROUGE-1, ROUGE-2, etc. (for different n-gram lengths).
- Measures how many n-grams of the reference summary the output summary contains.
42
ROUGE
Generalises easily to any n-gram length. Other versions:
- ROUGE-L: measures the longest common subsequence between the reference summary and the output.
- ROUGE-SU: uses skip bigrams.
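For illustration, a minimal sketch of ROUGE-n recall with clipped counts; the official ROUGE toolkit adds options (stemming, stopword removal, ROUGE-L, ROUGE-SU) not shown here.

```python
from collections import Counter

def rouge_n(candidate, references, n=1):
    """ROUGE-n recall: the proportion of reference n-grams (with counts) that
    also appear in the candidate summary. All inputs are lists of tokens."""
    def counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = counts(candidate)
    matched = total = 0
    for ref in references:
        ref_counts = counts(ref)
        total += sum(ref_counts.values())
        matched += sum(min(cnt, cand[gram]) for gram, cnt in ref_counts.items())
    return matched / total if total else 0.0
```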
43
Intrinsic vs. extrinsic again
Problem: ROUGE assumes that reference summaries are "gold standards", but people often disagree about summaries, including their wording.
The same questions arise as for NLG (and MT): to what extent does this metric actually tell us about the effectiveness of a summary?
Some recent work has shown that the correlation between ROUGE and a measure of relevance given by humans is quite low. See: Dorr et al. (2005). A Methodology for Extrinsic Evaluation of Text Summarization: Does ROUGE Correlate? Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 1-8, Ann Arbor, June 2005.
44
The Pyramid method (Nenkova et al.)
Also intrinsic, but relies on semantic content units (SCUs) instead of n-grams.
1. Human annotators label SCUs in sentences from human summaries. This is based on identifying the content of different sentences and grouping together sentences in different summaries that talk about the same thing; it goes beyond surface wording.
2. Find the SCUs in the automatic summaries.
3. Weight the SCUs.
4. Compute the ratio of the sum of weights of the SCUs in the automatic summary to the weight of an optimal summary of roughly the same length.
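A minimal sketch of the final ratio, under the assumption that SCU weights reflect how many human summaries express each SCU and that "roughly the same length" is approximated by the same number of SCUs.

```python
def pyramid_score(peer_scu_weights, all_scu_weights):
    """Pyramid-style score: total weight of the SCUs found in the automatic
    (peer) summary, divided by the weight of an optimal summary expressing
    the same number of SCUs (the top-weighted SCUs in the pyramid)."""
    observed = sum(peer_scu_weights)
    size = len(peer_scu_weights)
    optimal = sum(sorted(all_scu_weights, reverse=True)[:size])
    return observed / optimal if optimal else 0.0
```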