Summary Generation Keith Trnka. The approach ● Apply Marcu's basic summarizer (1999) to perform content selection ● Re-generate the selected content so.

Summary Generation Keith Trnka

The approach ● Apply Marcu's basic summarizer (1999) to perform content selection ● Re-generate the selected content so that it's more natural

RST Refresher ● A text is composed of elementary discourse units (EDUs) – What constitutes an EDU varies from author to author – Common consensus that they are no larger than sentences ● Text spans – An EDU is a text span – A sequence of adjacent text spans in some rhetorical relation is a text span

RST Refresher (cont'd) ● A rhetorical relation is the relationship between text spans – Some relations have the notion of nuclearity: one sub-span (the nucleus) is the one to which all other sub-spans (the satellites) relate ● These relations are called mononuclear ● Example: [When I got home,] circumstance-for [I was tired] – Other relations are called multinuclear ● There is no most-important sub-span ● Example: [Cats scratch] contrast-with [, but dogs bite.]

RST Discourse Treebank ● RST analyses of 385 WSJ articles from Penn Treebank ● Available from LDC (http://www.ldc.upenn.edu) ● Overview can be found in (Carlson et al. 2001) ● Annotation manual is (Carlson, Marcu 2001) ● Thanks to the department for buying it

RST Discourse Treebank (cont'd) ● Notes about the annotation – EDUs are clause-like – Mononuclear relations were forced to be binary – Relative clauses and appositives can be embedded relations

● Statistical analysis of 335 training documents – 98% of spans are binary (two children) – For binary mononuclear relations: ● Nucleus-satellite order can be predicted with 87% accuracy, given the relation, using predict-majority
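The predict-majority baseline mentioned above can be sketched as follows. This is a minimal illustration, not the code used for the reported figure: the `(relation, order)` pair format, the order labels "NS" and "SN", and the toy data are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_majority(examples):
    """Map each relation to its most frequent nucleus-satellite order.

    `examples` is an iterable of (relation, order) pairs, where order is
    "NS" (nucleus first) or "SN" (satellite first). Hypothetical format.
    """
    counts = defaultdict(Counter)
    for relation, order in examples:
        counts[relation][order] += 1
    return {rel: c.most_common(1)[0][0] for rel, c in counts.items()}

def accuracy(model, examples):
    """Fraction of examples whose order matches the majority prediction."""
    examples = list(examples)
    correct = sum(1 for rel, order in examples if model.get(rel) == order)
    return correct / len(examples)

# Toy data, invented for illustration only.
toy = [
    ("circumstance", "SN"), ("circumstance", "SN"), ("circumstance", "NS"),
    ("elaboration", "NS"), ("elaboration", "NS"),
]
model = train_majority(toy)
toy_accuracy = accuracy(model, toy)
```

On the toy data the model predicts "SN" for circumstance and "NS" for elaboration, so four of the five examples are predicted correctly.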

Marcu's Content Selection Algorithm ● Described in (Marcu 1999) ● Promotion sets – The promotion set of each span is the union of all promotion sets of nuclear sub-spans – The promotion set of an EDU is the EDU itself
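The promotion-set definition is recursive, so a short sketch may help. The `Span` class below is a hypothetical tree representation invented for this example (it is not the RST Discourse Treebank format): a leaf carries an EDU string, and every node records whether it is a nucleus of its parent.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class Span:
    """A node in an RST tree (hypothetical representation)."""
    nuclear: bool = True              # is this span a nucleus of its parent?
    edu: Optional[str] = None         # text of the EDU; set only on leaves
    children: List["Span"] = field(default_factory=list)

def promotion_set(span: Span) -> Set[str]:
    """An EDU promotes itself; other spans take the union over nuclear children."""
    if span.edu is not None:
        return {span.edu}
    promoted: Set[str] = set()
    for child in span.children:
        if child.nuclear:
            promoted |= promotion_set(child)
    return promoted

# Mononuclear: [When I got home,] circumstance-for [I was tired]
circumstance = Span(children=[
    Span(nuclear=False, edu="When I got home,"),
    Span(nuclear=True, edu="I was tired"),
])
# Multinuclear: [Cats scratch] contrast-with [, but dogs bite.]
contrast = Span(children=[
    Span(nuclear=True, edu="Cats scratch"),
    Span(nuclear=True, edu=", but dogs bite."),
])
```

For the mononuclear example only the nucleus is promoted; for the multinuclear example both EDUs are.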

Marcu's Content Selection Algorithm (cont'd) ● Build a partial ordering of EDUs * – For each EDU, find the topmost span in which it's in the promotion set. Let d be the tree depth of this span. – The rank of each EDU is ● If the EDU is in an embedded relation, d + 1 ● Otherwise, d – Example of the partial ordering * re-worded from Marcu's description
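The ranking step can be sketched as below, under stated assumptions: the `Span` class and its `embedded` flag are hypothetical, the root is taken to have depth 0 (the slide does not say whether depth starts at 0 or 1), and EDU strings are assumed to be unique within a document.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

@dataclass
class Span:
    """A node in an RST tree (hypothetical representation)."""
    nuclear: bool = True
    embedded: bool = False            # assumed flag: leaf is in an embedded relation
    edu: Optional[str] = None
    children: List["Span"] = field(default_factory=list)

def promotion_set(span: Span) -> Set[str]:
    if span.edu is not None:
        return {span.edu}
    promoted: Set[str] = set()
    for child in span.children:
        if child.nuclear:
            promoted |= promotion_set(child)
    return promoted

def rank_edus(root: Span) -> Dict[str, int]:
    """Rank = depth of the topmost span promoting the EDU, plus 1 if embedded."""
    depths: Dict[str, int] = {}
    embedded: Set[str] = set()

    def visit(span: Span, depth: int) -> None:
        if span.edu is not None and span.embedded:
            embedded.add(span.edu)
        for edu in promotion_set(span):
            # Keep the smallest depth seen, i.e. the topmost promoting span.
            depths[edu] = min(depths.get(edu, depth), depth)
        for child in span.children:
            visit(child, depth + 1)

    visit(root, 0)  # assumption: the root span has depth 0
    return {edu: d + 1 if edu in embedded else d for edu, d in depths.items()}

tree = Span(children=[
    Span(nuclear=True, children=[
        Span(nuclear=True, edu="e1"),
        Span(nuclear=False, embedded=True, edu="e2"),
    ]),
    Span(nuclear=False, edu="e3"),
])
ranks = rank_edus(tree)
```

Here "e1" is promoted all the way to the root, "e3" is promoted no higher than itself, and "e2" sits at depth 2 but is pushed down one more level because it is embedded. EDUs that share a rank form one equivalence class of the partial ordering.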

Marcu's Content Selection Algorithm (cont'd) ● Given a summary length requirement – Select the topmost EDU groups until it isn't possible to select more and honor the length requirement – Effect: can't always generate a summary as close to desired length as possible
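A minimal sketch of group-wise selection follows. It assumes length is counted in whitespace-separated words and that ranks are already available as a dict from EDU text to rank; both are assumptions for the example, and the function name `select_edus` is hypothetical.

```python
from itertools import groupby
from typing import Dict, List

def select_edus(ranks: Dict[str, int], max_words: int) -> List[str]:
    """Add whole rank groups, best rank first, while the word budget allows."""
    selected: List[str] = []
    used = 0
    by_rank = sorted(ranks.items(), key=lambda item: item[1])
    for _, group in groupby(by_rank, key=lambda item: item[1]):
        edus = [edu for edu, _ in group]
        cost = sum(len(edu.split()) for edu in edus)
        if used + cost > max_words:
            break  # the whole group must fit; partial groups are never taken
        selected.extend(edus)
        used += cost
    return selected

example_ranks = {
    "I was tired": 0,
    "When I got home,": 1,
    "so I went to bed early": 1,
}
short = select_edus(example_ranks, max_words=5)
empty = select_edus(example_ranks, max_words=2)
full = select_edus(example_ranks, max_words=13)
```

With a budget of 5 words only the 3-word top group is selected, because the next group costs 10 words and cannot be split. With a budget of 2 words nothing is selected at all, which illustrates why the result can fall well short of the desired length. The function returns EDUs in rank order; a real summarizer would presumably restore document order afterwards.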

Generation desiderata ● Removal of problems – Dangling references – Dangling discourse markers ● Introduction of coherence – Generate smaller referring expressions – Generate discourse markers when appropriate

Example Claude Bebear, chairman and chief executive officer, of Axa-Midi Assurances, pledged to retain employees and management of Farmers Group Inc.. Mr. Bebear made his remarks at a breakfast meeting with reporters here yesterday as part of a tour. Farmers was quick yesterday to point out the many negative aspects. For one, Axa plans to do away with certain tax credits.

The theoretical approach ● Content selection – Marcu's summarization algorithm ● Paragraph generation – Organize sentences into paragraphs ● Sentence generation – Construct complete sentences from EDUs

The theoretical approach (cont'd) ● Discourse marker generation – Remove discourse markers that refer to removed text spans – Generate discourse markers when none exists and one is appropriate ● Referring expression generation – Generate the best unambiguous referring expressions ● Shorter is better ● Faster to interpret is better

The implemented approach ● Content selection – Marcu's algorithm as stated ● Paragraph generation – Not implemented

Implementation: Sentence “generation” ● If a selected group of EDUs is an entire text span – select them all as-is, uppercase the front and make sure it ends with punctuation ● If a selected group of EDUs is an entire text span, except for some embedded relations – Remove punctuation associated with embeddings, add sentence terminators from embeddings ● If a selected group of EDUs is a sentence – Select as-is ● If a selected EDU isn't part of such a group – uppercase the front and end with punctuation
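The simplest case above, capitalizing the front and making sure the text ends with punctuation, might look like the sketch below. The function name `finish_sentence` and the choice to replace a trailing comma, semicolon, or colon with a period are assumptions; embedded relations and quotations are not handled.

```python
def finish_sentence(text: str) -> str:
    """Capitalize the first character and ensure a sentence terminator."""
    text = text.strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    if text[-1] not in ".!?":
        # Assumption: clause-final punctuation is replaced by a period.
        text = text.rstrip(",;: ") + "."
    return text

fixed_clause = finish_sentence("when I got home,")
unchanged = finish_sentence("I was tired.")
```

The first call yields "When I got home." and the second leaves an already complete sentence as it is.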

Implementation: Discourse marker generation ● Train to see which discourse markers go with which relations ● In generation, select discourse markers with a probability > 80%
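One way to read the thresholding step is sketched below. It collapses training into a single estimate of P(marker | relation) from `(relation, marker)` observations, which is a simplifying assumption rather than the statistics actually trained; the function names and the toy counts are invented for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

Probs = Dict[str, Dict[Optional[str], float]]

def train_marker_probs(observations: Iterable[Tuple[str, Optional[str]]]) -> Probs:
    """Estimate P(marker | relation); a marker of None means no marker was seen."""
    counts = defaultdict(Counter)
    for relation, marker in observations:
        counts[relation][marker] += 1
    return {
        rel: {m: n / sum(c.values()) for m, n in c.items()}
        for rel, c in counts.items()
    }

def choose_marker(probs: Probs, relation: str, threshold: float = 0.8) -> Optional[str]:
    """Return the most likely marker only if its probability exceeds the threshold."""
    candidates = probs.get(relation, {})
    real = [m for m in candidates if m is not None]
    if not real:
        return None
    best = max(real, key=lambda m: candidates[m])
    return best if candidates[best] > threshold else None

# Toy observations, invented for illustration only.
toy = [("contrast", "but")] * 9 + [("contrast", None)]
toy += [("circumstance", "when"), ("circumstance", None)]
probs = train_marker_probs(toy)
contrast_marker = choose_marker(probs, "contrast")
circumstance_marker = choose_marker(probs, "circumstance")
```

In the toy data "but" accompanies contrast 90% of the time and is generated, while "when" accompanies circumstance only 50% of the time and is not. A threshold this strict rarely fires on sparse real counts.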

Training on discourse markers ● Discourse markers identified by string matching at beginning and ending of each EDU ● List of markers taken from (Knott 1994)
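String matching at the edges of an EDU can be sketched as follows. The `MARKERS` tuple is a tiny illustrative list, not the list from (Knott 1994), and the tokenization (lowercasing, stripping punctuation from each token) is an assumption about how matching might be made robust.

```python
import string
from typing import Optional, Sequence, Tuple

# Small illustrative list; NOT the marker list taken from Knott.
MARKERS = ("for one", "because", "though", "when", "but")

def find_marker(edu: str, markers: Sequence[str] = MARKERS) -> Optional[Tuple[str, str]]:
    """Return (marker, "start" or "end") if the EDU begins or ends with a marker."""
    tokens = [w.strip(string.punctuation) for w in edu.lower().split()]
    tokens = [w for w in tokens if w]
    # Try longer markers first so multi-word markers win over their prefixes.
    for marker in sorted(markers, key=lambda m: -len(m.split())):
        words = marker.split()
        if tokens[:len(words)] == words:
            return marker, "start"
        if tokens[-len(words):] == words:
            return marker, "end"
    return None

at_start = find_marker(", but dogs bite.")
also_start = find_marker("When I got home,")
no_marker = find_marker("I was tired")
```

Matching on whole tokens avoids false hits such as "but" inside "butter", although plain string matching still cannot tell whether a word like "when" is really functioning as a discourse marker.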

Training on discourse markers (cont'd) ● Three statistics trained on binary, atomic spans with zero or one markers – Inclusion – Usage – Position

Rough evaluation ● Sentence “generation” isn't much different from not changing it at all – Except embedded relation removal ● Out of 347 summaries, a discourse marker was only generated once – Ms. Johnson is awed by the earthquake's destructive force. "It really brings you down to a human level," Though "It's hard to accept all the suffering but you have to.

Desired approach: Content selection ● Marcu's algorithm can only select groups of EDUs – Sometimes produces overly short summaries or nothing at all – If a preferential ordering could be defined within equivalence, summaries could meet the desired length better ● EDUs tied to more salient EDUs have their score boosted

Desired approach: Paragraph generation ● Paragraphs in the source document are marked – Leave paragraph boundaries intact if they form large enough paragraphs – A shallow method, but has potential ● Correlate paragraph boundaries with something – RS-tree structure – Co-reference chain beginnings/endings – Topical text segments, by an extension of Hearst's text segmentation algorithm (Hearst 1994)

Desired approach: Sentence generation ● Apply shallow parsing to understand the rough syntactic structure of an EDU ● Relative clauses can be attached and full sentences generated like (Siddharthan 2004)

Desired approach: Discourse marker generation ● The probabilities computed in DM training aren't the best – Need to attach discourse markers and recompute, repeat until stable – The attachment algorithm involves a constraint-satisfaction problem ● DM attachment needed to perform DM removal ● A DM generator should understand syntax better – When should commas be included and where?

Desired approach: Referring expression generation ● Requires good co-reference resolution – A reference resolver requires (at least) a base noun phrase chunker – EDUs might be used in conjunction with a shallow parse to approximate Hobbs' naïve approach ● Mitkov (2002) describes Hobbs' naïve approach ● Generation algorithm only adds the creation of a list of referring expressions, ordered by preference

Conclusions ● Document length is poorly defined – Quite a bit of variation between EDU length, word length, and character length ● Attaching discourse markers to the relation they realize is tough ● Representing natural language in programs can be tough ● Summarization of quotations requires special treatment

References ● Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski (2001). Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory. Proceedings of the 2nd SIGDIAL Workshop on Discourse and Dialogue, Eurospeech 2001, Denmark, September 2001. ● Lynn Carlson and Daniel Marcu (2001). Discourse Tagging Manual. ISI Tech Report ISI-TR-545. July 2001. ● Marti Hearst (1994). Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994. ● Alistair Knott and Robert Dale (1994). Using Linguistic Phenomena to Motivate a Set of Coherence Relations. Discourse Processes 18(1): 35-62. ● William Mann and Sandra Thompson (1988). Rhetorical Structure Theory: Toward a functional theory of text organization. Text 8(3): 243-281.

References (cont'd) ● Daniel Marcu (1999). Discourse trees are good indicators of importance in text. In I. Mani and M. Maybury, editors, Advances in Automatic Text Summarization, pages 123-136, The MIT Press. – I think this is a cleanup of his earlier work from 1997. ● Ruslan Mitkov (2002). Anaphora Resolution. Pearson Education. ● Advaith Siddharthan (2004). Syntactic Simplification and Text Cohesion. To appear in the Journal of Language and Computation, Kluwer Academic Publishers, the Netherlands.
