Processing of large document collections Fall 2002, Part 2.


2 Outline
1. Term selection: information gain
2. Character code issues
3. Text summarization

3 1. Term selection
a large document collection may contain millions of words -> document vectors would contain millions of dimensions
many algorithms cannot handle such a high dimensionality of the term space (= large number of terms)
usually only a subset of the terms is used
how to select the terms that are used? -> term selection (often called feature selection or dimensionality reduction) methods

4 Term selection: information gain
information gain: measures the (number of bits of) information obtained for category prediction by knowing the presence or absence of a term in a document
information gain is calculated for each term, and the n highest-scoring terms are selected

5 Term selection: IG
information gain for a term t:
G(t) = - sum_i P(c_i) log P(c_i)
       + P(t) sum_i P(c_i | t) log P(c_i | t)
       + P(~t) sum_i P(c_i | ~t) log P(c_i | ~t)

6 Estimating probabilities
Doc 1: cat cat cat (c)
Doc 2: cat cat cat dog (c)
Doc 3: cat dog mouse (~c)
Doc 4: cat cat cat dog dog dog (~c)
Doc 5: mouse (~c)
2 classes: c and ~c

7 Term selection: estimating probabilities
P(t): probability of a term t
P(cat) = 4/5: ’cat’ occurs in 4 documents of 5, or
P(cat) = 10/17: the proportion of the occurrences of ’cat’ among all 17 term occurrences

8 Term selection: estimating probabilities P(~t): probability of the absence of t P(~cat) = 1/5, or P(~cat) = 7/17

9 Term selection: estimating probabilities
P(c_i): probability of category i
P(c) = 2/5 (the proportion of documents belonging to c in the collection), or
P(c) = 7/17 (7 of the 17 term occurrences are in the documents belonging to c)

10 Term selection: estimating probabilities
P(c_i | t): probability of category i if t is in the document; i.e., which proportion of the documents where t occurs belong to category i
P(c | cat) = 2/4 (or 6/10)
P(~c | cat) = 2/4 (or 4/10)
P(c | mouse) = 0
P(~c | mouse) = 1

11 Term selection: estimating probabilities
P(c_i | ~t): probability of category i if t is not in the document; i.e., which proportion of the documents where t does not occur belongs to category i
P(c | ~cat) = 0 (or 1/7)
P(c | ~dog) = 1/2 (or 6/12)
P(c | ~mouse) = 2/3 (or 7/15)

12 Term selection: estimating probabilities
In other words... let term t occur in B documents, A of which are in category c; category c has D documents out of the N documents in the collection

13 Term selection: estimating probabilities
For instance,
P(t) = B/N
P(~t) = (N-B)/N
P(c) = D/N
P(c|t) = A/B
P(c|~t) = (D-A)/(N-B)

14 Term selection: IG information gain for a term t: G(cat) = G(dog) = G(mouse) = -0.01
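The toy corpus above is small enough to compute by hand, but a short sketch makes the estimation concrete. The snippet below (an illustration in Python, not part of the original course material) uses the document-frequency estimates and base-2 logarithms; the slide's own figures may have been computed with the occurrence-based estimates and a different log base, so the numbers need not match exactly.

```python
import math

# Toy corpus from the slides: (terms in document, class)
docs = [
    (["cat", "cat", "cat"], "c"),
    (["cat", "cat", "cat", "dog"], "c"),
    (["cat", "dog", "mouse"], "~c"),
    (["cat", "cat", "cat", "dog", "dog", "dog"], "~c"),
    (["mouse"], "~c"),
]

def info_gain(term):
    """Information gain of `term`, with probabilities estimated
    from document frequencies (the first estimate on the slides)."""
    n = len(docs)
    classes = {"c", "~c"}
    def plogp(p):
        return p * math.log2(p) if p > 0 else 0.0
    with_t = [c for terms, c in docs if term in terms]
    without_t = [c for terms, c in docs if term not in terms]
    p_t = len(with_t) / n
    # - sum_i P(c_i) log P(c_i)
    g = -sum(plogp(sum(c == ci for _, c in docs) / n) for ci in classes)
    # + P(t) sum_i P(c_i|t) log P(c_i|t) + P(~t) sum_i P(c_i|~t) log P(c_i|~t)
    for subset, p in ((with_t, p_t), (without_t, 1 - p_t)):
        if subset:
            g += p * sum(plogp(subset.count(ci) / len(subset)) for ci in classes)
    return g

ig = {t: info_gain(t) for t in ("cat", "dog", "mouse")}
```

With these estimates, 'mouse' gets the highest gain: whenever it occurs, the document is certainly in ~c.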

15 2. Character code issues
abstract character vs. its graphical representation (glyph, font)
abstract characters are grouped into alphabets
each alphabet forms the basis of the written form of a certain language or a set of languages

16 Character codes
for instance, for English:
uppercase letters A-Z
lowercase letters a-z
punctuation marks
digits 0-9
common symbols: +, =
ideographic symbols of Chinese and Japanese
phonetic letters of Western languages

17 Some terminology
character repertoire (Finnish: merkkivalikoima)
a set of distinct characters; an alphabet
no internal presentation or ordering etc. is assumed
usually defined by specifying names of characters and a sample presentation of the characters in visible form
a repertoire may contain characters which look the same (in some presentations) but are logically distinct

18 Some terminology
character code (Finnish: merkkikoodi)
a mapping which defines a one-to-one correspondence between the characters in a character repertoire and a set of nonnegative integers
each character is assigned a unique code position (also called code number, code value, code element, code point, code set value, or code)
the set of codes often has ”holes”

19 Some terminology
character encoding (Finnish: merkkikoodaus)
an algorithm for presenting characters in digital form by mapping sequences of code numbers into sequences of octets (= bytes)
in the simplest case, each character is mapped to an integer in the range 0-255 according to a character code, and these are used as octets
works only for character repertoires with at most 256 characters
for larger sets, more complicated encodings are needed

20 Character codes
in English: 26 letters in both lower- and uppercase, ten digits + some punctuation marks
in Russian: Cyrillic letters
both could use the same set of code points (if not in a bilingual document)
in Japanese: there could be over 6,000 characters

21 Character codes: standards
character codes can be arbitrary, but in practice standardization is needed for interoperability (between computers, programs, ...)
early standards were designed for English only, or for a small group of languages at a time

22 Character codes: standards
ASCII
ISO 8859 (e.g. ISO Latin 1)
Unicode
UTF-8, UTF-16

23 ASCII
American Standard Code for Information Interchange
a seven-bit code -> 128 code positions
actually only 95 printable characters: code positions 0-31 and 127 are assigned to control characters (mostly outdated)
the ISO 646 (1972) version of ASCII incorporated several national variants, in which accented letters and currency symbols replaced some characters

24 ASCII
with 7 bits, the set of code points is too small for anything else than American English
solution: 8 bits brings more code points (256)
the ASCII character repertoire is mapped to the values 0-127
additional symbols are mapped to the other values

25 Extended ASCII
problems:
different manufacturers each developed their own 8-bit extensions to ASCII
different character repertoires -> translation between them is not always possible
also, 256 code values is not enough to represent all the alphabets -> different variants for different languages

26 ISO 8859
standardization of 8-bit character sets
in the 80's, the multipart standard ISO 8859 was produced
it defines a collection of 8-bit character sets, each designed for a group of languages
the first part, ISO 8859-1 (ISO Latin 1), covers most Western European languages
0-127: identical to ASCII; 128-159: (mostly) unused; 96 code values (160-255) for accented letters and symbols

27 ”Safe” ASCII subset
due to the national variants, only the following characters can be regarded as ”safe” in data transmission:
A-Z, a-z, 0-9
! ” % & ’ ( ) * + , - . / : ; ?

28 Unicode
256 code positions is not enough
for ideographically represented languages (Chinese, Japanese…)
for simultaneous use of several languages
solution: more than one byte for each code value
a 16-bit character set has 65,536 code positions

29 Unicode
a 16-bit character set; 65,536 code positions
not sufficient to give all the characters required for the Chinese, Japanese, and Korean scripts distinct positions
CJK consolidation: characters of these scripts are given the same code value if they look the same

30 Unicode
code values for all the characters used to write contemporary ’major’ languages, and also the classical forms of some languages
Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan
Chinese, Japanese, and Korean ideograms, and the Japanese and Korean phonetic and syllabic scripts

31 Unicode
punctuation marks
technical and mathematical symbols
arrows
dingbats (pointing hands, stars, …)
both accented letters and separate diacritical marks (accents, tildes…) are included, with a mechanism for building composite characters
this can also create problems: two characters that look the same may have different code values -> normalization may be necessary

32 Unicode
code values for nearly 39,000 symbols are provided
some part is reserved for an expansion method (see later)
6,400 code points are reserved for private use
these will never be assigned to any character by the standard, so they will not conflict with the standard

33 Unicode: encodings
the ”native” Unicode encoding is UCS-2
it presents each code number as two consecutive octets m and n: code number = 256m + n (= a 2-byte integer)
this can be inefficient: for text containing ISO Latin characters only, the length of the Unicode-encoded sequence is twice the length of the ISO encoding

34 Unicode: encodings
UTF-8
ASCII code values are likely to be more common in most text than any other values
in UTF-8 encoding, ASCII characters are sent as themselves (high-order bit 0)
other characters are encoded using 2-6 octets (high-order bit is set to 1)
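This behaviour is easy to observe with a modern language's built-in codecs; a quick illustration (in Python, which is not part of the original slides):

```python
# ASCII characters are encoded as themselves: one octet with the
# high-order bit 0; all other characters use multi-octet sequences
# in which every octet has the high-order bit set to 1.
ascii_bytes = "A".encode("utf-8")
assert ascii_bytes == b"A" and ascii_bytes[0] < 0x80   # 1 octet, high bit 0

latin = "é".encode("utf-8")      # Latin-1 range: 2 octets
cjk = "漢".encode("utf-8")       # CJK ideograph: 3 octets
assert len(latin) == 2 and len(cjk) == 3
assert all(b & 0x80 for b in latin + cjk)              # high bit set
```

The payoff is the one stated on the slide: mostly-ASCII text costs one octet per character, while non-ASCII repertoires remain representable.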

35 Unicode: encodings
UTF-16: an expansion method
two 16-bit values are combined into a 32-bit value -> about a million additional characters become available
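The pairing works through reserved "surrogate" ranges: a character beyond the 16-bit range is split into a high surrogate (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF). A small Python check (an illustration, not from the slides):

```python
ch = "\U0001D11E"                  # U+1D11E, MUSICAL SYMBOL G CLEF
data = ch.encode("utf-16-be")
assert len(data) == 4              # two 16-bit code units
hi = int.from_bytes(data[:2], "big")
lo = int.from_bytes(data[2:], "big")
assert 0xD800 <= hi <= 0xDBFF      # high surrogate
assert 0xDC00 <= lo <= 0xDFFF      # low surrogate
# recombining the pair gives back the original code point:
assert 0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00) == ord(ch)
```

Each surrogate contributes 10 bits, so the scheme adds 2^20 (about a million) code points beyond the 16-bit range.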

36 Use of character codes
try to use character codes logically
don’t choose a character just because it looks right
inform applications of the encoding used
MIME headers, XML/HTML document declarations
this should be the responsibility of the authoring applications… but…

37 3. Text summarization ”Process of distilling the most important information from a source to produce an abridged version for a particular user or task”

38 Text summarization
many everyday uses:
headlines (from around the world)
outlines (notes for students)
minutes (of a meeting)
reviews (of books, movies)
...

39 Architecture of a text summarization system
input: a single document or multiple documents
text, images, audio, video
database

40 Architecture of a text summarization system
output: extract or abstract
compression rate: ratio of summary length to source length
connected text or fragmentary
generic or user-focused/domain-specific
indicative or informative

41 Architecture of a text summarization system
three phases:
analyzing the input text
transforming it into a summary representation
synthesizing an appropriate output form

42 Condensation operations
selection of more salient (= ”central”, ”essential”) or non-redundant information
aggregation of information (e.g. from different parts of the source, or of different linguistic descriptions)
generalization of specific information with more general, abstract information

43 The level of processing
surface level
discourse level

44 Surface-level approaches Tend to represent information in terms of shallow features the features are then selectively combined together to yield a salience function used to extract information

45 Surface level
shallow features:
thematic features: presence of statistically salient terms, based on term frequency statistics
location: position in text, position in paragraph, section depth, particular sections
background: presence of terms from the title or headings in the text, or from the user’s query

46 Surface level
cue words and phrases: ”in summary”, ”our investigation”
emphasizers like ”important”, ”in particular”
domain-specific bonus (+) and stigma (-) terms

47 Discourse-level approaches
model the global structure of the text and its relation to communicative goals
the structure can include:
format of the document (e.g. hypertext markup)
threads of topics as they are revealed in the text
rhetorical structure of the text, such as argumentation or narrative structure

48 Classical approaches
Luhn ’58
Edmundson ’69
general idea:
give a score to each sentence
choose the sentences with the highest score to be included in the summary

49 Luhn’s method
filter terms in the document using a stoplist
terms are normalized by combining together orthographically similar terms
differentiate, different, differently, difference -> differen
frequencies of the combined terms are calculated and non-frequent terms are removed -> ”significant” terms remain

50 Luhn’s method
sentences are weighted using the resulting set of ”significant” terms and a term density measure:
each sentence is divided into segments bracketed by significant terms not more than 4 non-significant terms apart
each segment is scored by taking the square of the number of bracketed significant terms divided by the total number of bracketed terms

51 Exercise (CNN News) Let {13, computer, servers, Internet, traffic, attack, officials, said} be significant words. ”Nine of the 13 computer servers that manage global Internet traffic were crippled by a powerful electronic attack this week, officials said.”

52 Exercise (CNN News) Let {13, computer, servers, Internet, traffic, attack, officials, said} be significant words. * * * [13 computer servers * * * Internet traffic] * * * * * * [attack * * officials said]

53 Exercise (CNN News)
[13 computer servers * * * Internet traffic] score: 5^2 / 8 = 25/8 = 3.1
[attack * * officials said] score: 3^2 / 5 = 9/5 = 1.8
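The bracketing and scoring in the exercise can be reproduced with a short sketch (an illustration in Python; the tokenization is a simplified assumption, not Luhn's original implementation):

```python
def luhn_segment_scores(sentence, significant, gap=4):
    """Split a sentence into Luhn segments: maximal runs bracketed by
    significant words with at most `gap` non-significant words between
    consecutive significant ones; score = (#significant)^2 / length."""
    tokens = [w.strip('.,"').lower() for w in sentence.split()]
    positions = [i for i, w in enumerate(tokens) if w in significant]
    segments, start, prev = [], None, None
    for pos in positions:
        if start is None:
            start = prev = pos
        elif pos - prev - 1 <= gap:      # close enough: extend segment
            prev = pos
        else:                            # too far apart: start a new one
            segments.append((start, prev))
            start = prev = pos
    if start is not None:
        segments.append((start, prev))
    return [sum(tokens[i] in significant for i in range(a, b + 1)) ** 2
            / (b - a + 1) for a, b in segments]

sig = {"13", "computer", "servers", "internet", "traffic",
       "attack", "officials", "said"}
s = ("Nine of the 13 computer servers that manage global Internet "
     "traffic were crippled by a powerful electronic attack this week, "
     "officials said.")
scores = luhn_segment_scores(s, sig)
```

Running this yields the two segment scores from the exercise, 25/8 = 3.125 and 9/5 = 1.8; the sentence score would be the higher of the two.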

54 Luhn’s method
the score of the highest-scoring segment is taken as the sentence score
the highest-scoring sentences are chosen for the summary
a cutoff value is given

55 ”Modern” application
text summarization of web pages on handheld devices (Buyukkokten, Garcia-Molina, Paepcke; 2001)
macro-level summarization
micro-level summarization

56 Web page summarization
macro-level summarization
the web page is partitioned into ‘Semantic Textual Units’ (STUs)
paragraphs, lists, alt texts (for images)
a hierarchy of STUs is identified
list - list item, table - table row
nested STUs are hidden

57 Web page summarization
micro-level summarization: 5 methods tested for displaying STUs in several states
incremental: 1) the first line, 2) the first three lines, 3) the whole STU
all: the whole STU in a single state
keywords: 1) important keywords, 2) the first three lines, 3) the whole STU

58 Web page summarization
summary: 1) the STU’s ’most significant’ sentence is displayed, 2) the whole STU
keyword/summary: 1) keywords, 2) the STU’s ’most significant’ sentence, 3) the whole STU
the combination of keywords and a summary has given the best performance for discovery tasks on web pages

59 Web page summarization
extracting summary sentences
sentences are scored using a variant of Luhn’s method:
words are TF*IDF weighted; given a weight cutoff value, the high-scoring words are selected to be significant words
weight of a segment: sum of the weights of its significant words divided by the total number of words within the segment

60 Edmundson’s method
extends earlier work to look at three features in addition to word frequencies:
cue phrases (e.g. ”significant”, ”impossible”, ”hardly”)
title and heading words
location

61 Edmundson’s method
programs to weight sentences based on each of the four features
weight of a sentence = the sum of the weights for the features
the programs were evaluated by comparison against manually created extracts
corpus-based methodology: training set and test set
in the training phase, the weights were manually readjusted
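The weighted-sum scoring can be sketched in a few lines. The feature values and weights below are made-up illustrations, not the ones from Edmundson's paper; in the real method the weights were readjusted by hand against the training extracts.

```python
def edmundson_score(cue, key, title, location, weights=(1.0, 1.0, 1.0, 2.0)):
    """Sentence weight = weighted sum of the four feature scores
    (cue phrases, keywords, title words, location)."""
    a, b, c, d = weights
    return a * cue + b * key + c * title + d * location

# Rank three hypothetical sentences and keep the best one:
sentences = {
    "s1": (1, 2, 0, 1),   # a cue phrase, two keywords, paragraph-initial
    "s2": (0, 3, 1, 0),   # three keywords, one title word
    "s3": (0, 0, 0, 0),   # nothing salient
}
best = max(sentences, key=lambda name: edmundson_score(*sentences[name]))
```

With these illustrative weights, location dominates (mirroring the finding that it was the best individual feature), so s1 wins despite s2 having more keywords.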

62 Edmundson’s method
results:
the three additional features dominated word frequency measures
the combination cue-title-location was the best, with location being the best individual feature
keywords alone was the worst

63 Fundamental issues What are the most powerful but also more general features to exploit for summarization? How do we combine these features? How can we evaluate how well we are doing?

64 Corpus-based approaches
in the classical methods, various features (thematic features, title, location, cue phrases) were used to determine the salience of information for summarization
an obvious issue: determining the relative contribution of different features to any given text summarization task

65 Corpus-based approaches
the contribution is dependent on the text genre, e.g. location:
in newspaper stories, the leading text often contains a summary
in TV news, a preview segment may contain a summary of the news to come
in scientific text: an author-written abstract

66 Corpus-based approaches The importance of different text features for any given summarization problem can be determined by counting the occurrences of such features in text corpora in particular, analysis of human-generated summaries, along with their full-text sources, can be used to learn rules for summarization

67 Corpus-based approaches
challenges:
creating a suitable text corpus, designing an annotation scheme
ensuring that a suitable set of summaries is available
may already be available: scientific papers
if not: author, professional abstractor, or judge

68 KPC method
Kupiec, Pedersen, Chen (1995): A Trainable Document Summarizer
a learning method using a corpus of abstracts written by professional human abstractors (Engineering Information Co.)
a naïve Bayesian classification method is used

69 KPC method: general idea
training phase:
select a set of features
calculate the probability of each feature value appearing in a summary sentence, using a training corpus (e.g. originals + manual summaries)

70 KPC method: general idea
when a new document is summarized:
for each sentence:
find the values of the features
calculate the probability for this feature value combination to appear in a summary sentence
choose the n best-scoring sentences

71 KPC method: features
sentence-length cut-off feature
given a threshold (e.g. 5 words), the feature is true for all sentences longer than the threshold, and false otherwise
F1(s) = 0, if sentence s has 5 or fewer words
F1(s) = 1, if sentence s has more than 5 words

72 KPC method: features
paragraph feature
sentences in the first 10 paragraphs and the last 5 paragraphs of a document get a higher value
within paragraphs, paragraph-initial, paragraph-final, and paragraph-medial positions are distinguished

73 KPC method: features
paragraph feature
F2(s) = i, if sentence s is the first sentence in a paragraph
F2(s) = f, if there are at least 2 sentences in the paragraph, and s is the last one
F2(s) = m, if there are at least 3 sentences in the paragraph, and s is neither the first nor the last sentence

74 KPC method: features
thematic word feature
a small number of thematic words (the most frequent content words) are selected
each sentence is scored as a function of the frequency of the thematic words
the highest-scoring sentences are selected
binary feature: the feature is true for a sentence if the sentence is present in the set of highest-scoring sentences

75 KPC method: features
fixed-phrase feature
this feature is true for sentences that contain any of 26 indicator phrases (e.g. ”this letter…”, ”In conclusion…”), or that follow a section heading containing specific keywords (e.g. ”results”, ”conclusion”)

76 KPC method: features
uppercase word feature
proper names and explanatory text for acronyms are usually important
the feature is computed like the thematic word feature
an uppercase thematic word must not be sentence-initial, must begin with a capital letter, and must occur several times
its first occurrence is scored twice as much as later occurrences

77 Exercise (CNN news)
sentence-length; F1: let threshold = 14
< 14 words: F1(s) = 0, else F1(s) = 1
paragraph; F2: i = first, f = last, m = medial
thematic-words; F3: score = how many thematic words a sentence has
F3(s) = 0, if score > 3, else F3(s) = 1

78 KPC method: classifier
for each sentence s, we compute the probability that s will be included in a summary S, given the k features Fj, j = 1…k
the probability can be expressed using Bayes’ rule:
P(s ∈ S | F1,…,Fk) = P(F1,…,Fk | s ∈ S) P(s ∈ S) / P(F1,…,Fk)

79 KPC method: classifier
assuming statistical independence of the features:
P(s ∈ S | F1,…,Fk) = P(s ∈ S) ∏j P(Fj | s ∈ S) / ∏j P(Fj)
P(s ∈ S) is a constant, and P(Fj | s ∈ S) and P(Fj) can be estimated directly from the training set by counting occurrences
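The independence-based score is a one-liner once the probability tables have been counted from the training corpus. The tables below are made-up illustrations (not the KPC paper's numbers), covering two of the features: sentence length (0/1) and paragraph position (i/m/f).

```python
from math import prod

def kpc_score(features, p_summary, p_f_given_s, p_f):
    """Naive-Bayes sentence score:
    P(s in S | F1..Fk) = P(s in S) * prod_j P(Fj | s in S) / prod_j P(Fj)."""
    num = p_summary * prod(p_f_given_s[f][v] for f, v in features.items())
    den = prod(p_f[f][v] for f, v in features.items())
    return num / den

p_summary = 3 / 86                    # ~3 summary sentences per 86-sentence doc
p_f_given_s = {"length": {1: 0.9, 0: 0.1},               # illustrative counts
               "paragraph": {"i": 0.6, "m": 0.3, "f": 0.1}}
p_f = {"length": {1: 0.7, 0: 0.3},
       "paragraph": {"i": 0.3, "m": 0.5, "f": 0.2}}

score_long_initial = kpc_score({"length": 1, "paragraph": "i"},
                               p_summary, p_f_given_s, p_f)
score_short_medial = kpc_score({"length": 0, "paragraph": "m"},
                               p_summary, p_f_given_s, p_f)
```

With these tables, a long paragraph-initial sentence scores well above a short paragraph-medial one; in the real method the n best-scoring sentences are chosen.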

80 KPC method: corpus
the corpus was acquired from Engineering Information Co., which provides abstracts of technical articles to online information services
the articles do not have author-written abstracts
the abstracts were created by professional abstractors

81 KPC method: corpus
188 document/summary pairs, sampled from 21 publications in the scientific/technical domain
the summaries are mainly indicative; average length is 3 sentences
the average number of sentences in the original documents is 86
author, address, and bibliography were removed

82 KPC method: sentence matching
the abstracts from the human abstractors are not extracts, but inspired by the original sentences
the automatic summarization task here: extract sentences that the human abstractor might have chosen to prepare the summary text (with minor modifications…)

83 KPC method: sentence matching For training, a correspondence between the manual summary sentences and sentences in the original document need to be obtained matching can be done in several ways

84 KPC method: sentence matching
matching can be done in several ways:
a direct sentence match: the same sentence is found in both
a direct join: 2 or more original sentences were used to form a summary sentence
a summary sentence can be ’unmatchable’
a summary sentence (single or joined) can be ’incomplete’

85 KPC method: sentence matching
matching was done in two passes:
first, the best one-to-one sentence matches were found automatically
second, these matches were used as a starting point for the manual assignment of correspondences

86 KPC method: evaluation
a cross-validation strategy was used for evaluation
documents from a given journal were selected for testing one at a time; all other document/summary pairs were used for training
unmatchable and incomplete summary sentences were excluded -> a total of 498 unique sentences

87 KPC method: evaluation
two ways of evaluation:
1. the fraction of manual summary sentences that were faithfully reproduced by the summarizer program
the summarizer produced the same number of sentences as were in the corresponding manual summary -> 35% of the summary sentences were reproduced
83% is the highest possible value, since unmatchable and incomplete sentences were excluded
2. the fraction of the matchable sentences that were correctly identified by the summarizer -> 42%

88 KPC method: evaluation
the effect of different features was also studied
best combination (44%): paragraph, fixed-phrase, sentence-length
baseline: selecting sentences from the beginning of the document (result: 24%)
if 25% of the original sentences are selected: 84%

89 Discourse-based approaches
discourse structure appears to play an important role in the strategies used by human abstractors and in the structure of their abstracts
an abstract is not just a collection of sentences, but has an internal structure
-> an abstract should be coherent, and it should represent some of the argumentation used in the source

90 Discourse models
cohesion: relations between words or referring expressions, which determine how tightly connected the text is
anaphora, ellipsis, synonymy, hypernymy (dog is a kind of animal)
coherence: the overall structure of a multi-sentence text in terms of macro-level relations between sentences (e.g. ”although” -> contrast)

91 Boguraev, Kennedy (BG)
goal: identify those phrasal units across the entire span of the document that best function as representative highlights of the document’s content
these phrasal units are called topic stamps
a set of topic stamps is called a capsule overview

92 BG
a capsule overview is
not a set/sequence of sentences
a semi-formal (normalised) representation of the document, derived through a process of data reduction over the original text
not always very readable, but it still represents the flow of the narrative
it can be combined with surrounding information to produce a more coherent presentation

93 Priest is charged with Pope attack A Spanish priest was charged here today with attempting to murder the Pope. Juan Fernandez Krohn, aged 32, was arrested after a man armed with a bayonet approached the Pope while he was saying prayers at Fatima on Wednesday night. According to the police, Fernandez told the investigators today that he trained for the past six months for the assault. He was alleged to have claimed the Pope ’looked furious’ on hearing the priest’s criticism of his handling of the church’s affairs. If found guilty, the Spaniard faces a prison sentence of years.

94 Capsule overview vs. summary
a summary could be, e.g., “A Spanish priest is charged after an unsuccessful murder attempt on the Pope”
capsule overview:
A SPANISH PRIEST was charged
Attempting to murder the POPE
HE trained for the assault
POPE furious on hearing PRIEST’S criticisms

95 BG
primary consideration: the methods should apply to any document type and source (domain independence)
also: efficient and scalable technology
shallow syntactic analysis; no comprehensive parsing engine needed

96 BG
based on the findings on technical terms:
technical terms have linguistic properties that can be used to find terms automatically, quite reliably, in different domains
technical terms seem to be topical
the task of content characterization:
identifying phrasal units that have lexico-syntactic properties similar to technical terms
and discourse properties that signify their status as most prominent

97 BG: terms as content indicators. Problems: undergeneration, overgeneration, differentiation.

98 Undergeneration. A set of phrases should contain an exhaustive description of all the entities that are discussed in the text; the set of technical terms has to be extended to include also expressions with pronouns etc.

99 Overgeneration. Already the set of technical terms can be large, and extensions make the information overload even worse. Solution: phrases that refer to one participant in the discourse are combined with referential links.

100 Differentiation. The same list of terms may be used to describe two documents, even if they, e.g., focus on different subtopics. It is necessary to differentiate term sets not only according to their membership, but also according to the relative representativeness of the terms they contain.

101 Term sets and coreference classes. Phrases are extracted using a phrasal grammar (e.g. a noun with modifiers); expressions with pronouns and incomplete expressions are also extracted, using a (Lingsoft) tagger that provides information about the part of speech, number, gender, and grammatical function of tokens in a text. This solves the undergeneration problem.

102 Term sets and coreference classes. The phrase set has to be reduced to solve the problem of overgeneration -> a smaller set of expressions that uniquely identify the objects referred to in the text. This requires anaphora resolution: e.g., to which noun does the pronoun ’he’ refer?

103 Resolving coreferences. Procedure: move through the text sentence by sentence, analysing the nominal expressions in each sentence from left to right; either an expression is identified as a new participant in the discourse, or it is taken to refer to a previously mentioned referent.

104 Resolving coreferences. Coreference is determined by a 3-step procedure: (1) a set of candidates is collected: all nominals within a local segment of discourse; (2) some candidates are eliminated due to morphological mismatches or syntactic restrictions; (3) the remaining candidates are ranked according to their relative salience in the discourse.

105 Salience factors: sent(term) = 100 iff term is in the current sentence; cntx(term) = 50 iff term is in the current discourse segment; subj(term) = 80 iff term is a subject; acc(term) = 50 iff term is a direct object; dat(term) = 40 iff term is an indirect object; ...

106 Local salience of a candidate. The local salience of a candidate is the sum of the values of the salience factors; the most salient candidate is selected as the antecedent. If the coreference link cannot be established to some other expression, the nominal is taken to introduce a new referent -> coreference classes.
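The antecedent selection described above can be sketched in a few lines of Python. The factor weights come from the slides; the candidate representation and the example data are illustrative assumptions, not the BG implementation:

```python
# Salience factor weights (from the slides; the list there is not exhaustive).
WEIGHTS = {
    "sent": 100,  # candidate is in the current sentence
    "cntx": 50,   # candidate is in the current discourse segment
    "subj": 80,   # candidate is a subject
    "acc": 50,    # candidate is a direct object
    "dat": 40,    # candidate is an indirect object
}

def local_salience(factors):
    """Sum the weights of the salience factors that hold for a candidate."""
    return sum(WEIGHTS[f] for f in factors)

def pick_antecedent(candidates):
    """Select the most salient candidate as the antecedent.

    `candidates` maps a candidate expression to the set of factors that
    hold for it. Returns None when there is no candidate left, i.e. the
    nominal introduces a new referent.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda c: local_salience(candidates[c]))

# Toy example: ranking candidate antecedents for a pronoun.
candidates = {
    "a Spanish priest": {"sent", "cntx", "subj"},  # 100 + 50 + 80 = 230
    "the Pope":         {"cntx", "acc"},           # 50 + 50 = 100
}
print(pick_antecedent(candidates))  # -> a Spanish priest
```

In a full system the candidate sets would come from steps (1) and (2) of the procedure on the previous slide; here they are given directly.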

107 Topic stamps. In order to further reduce the referent set, some additional structure has to be imposed: the term set is ranked according to the salience of its members, i.e. the relative prominence or importance in the discourse of the entities to which they refer. Objects in the centre of discussion have a high degree of salience.

108 Saliency. Measured like local salience in coreference resolution, but aims to capture the importance of unique referents in the discourse.
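As a rough sketch of this step, the discourse salience of a unique referent can be approximated by summing the local salience of every mention in its coreference class and keeping the top-ranked referents as topic stamps. The weights follow the slides; the data structures and the aggregation-by-summing are assumptions of this toy version:

```python
# Same factor weights as in the coreference sketch (from the slides).
WEIGHTS = {"sent": 100, "cntx": 50, "subj": 80, "acc": 50, "dat": 40}

def referent_salience(mentions):
    """Aggregate salience of one referent over all of its mentions."""
    return sum(sum(WEIGHTS[f] for f in factors) for factors in mentions)

def topic_stamps(coref_classes, n=2):
    """Rank coreference classes by aggregate salience; keep the top n."""
    ranked = sorted(coref_classes,
                    key=lambda r: referent_salience(coref_classes[r]),
                    reverse=True)
    return ranked[:n]

# Mentions of each referent, with the factors that hold per mention.
coref_classes = {
    "priest": [{"sent", "subj"}, {"cntx", "subj"}, {"cntx", "subj"}],  # 440
    "Pope":   [{"cntx", "acc"}, {"cntx", "acc"}],                      # 200
    "police": [{"cntx"}],                                              # 50
}
print(topic_stamps(coref_classes))  # -> ['priest', 'Pope']
```

This mirrors the priest/Pope analysis of the example article: the referent mentioned most often in prominent syntactic positions wins.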

109 Priest is charged with Pope attack. A Spanish priest was charged here today with attempting to murder the Pope. Juan Fernandez Krohn, aged 32, was arrested after a man armed with a bayonet approached the Pope while he was saying prayers at Fatima on Wednesday night. According to the police, Fernandez told the investigators today that he trained for the past six months for the assault. He was alleged to have claimed the Pope ’looked furious’ on hearing the priest’s criticism of his handling of the church’s affairs. If found guilty, the Spaniard faces a prison sentence of years.

110 Saliency. ’Priest’ is the primary element: there are eight references to the same actor in the body of the story, and these references occur in important syntactic positions: 5 are subjects of main clauses, 2 are subjects of embedded clauses, 1 is a possessive. ’Pope attack’ is also important: ’Pope’ occurs 5 times, but not in such important positions (2 are direct objects).

111 Discourse segments. If the intention is to use very concise descriptions of one or two salient phrases, i.e. topic stamps, longer texts have to be broken down into smaller segments. Topically coherent, contiguous segments can be found by using a lexical similarity measure; the assumption is that the distribution of words used changes when the topic changes.
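A minimal sketch of such similarity-based segmentation (in the spirit of TextTiling-style approaches): compare the vocabulary of adjacent stretches of text and place a boundary where lexical similarity drops. The whitespace tokenizer, the one-sentence comparison window, and the threshold are illustrative assumptions:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def boundaries(sentences, threshold=0.1):
    """Indices i where a segment boundary falls before sentence i."""
    bounds = []
    for i in range(1, len(sentences)):
        left = Counter(sentences[i - 1].lower().split())
        right = Counter(sentences[i].lower().split())
        if cosine(left, right) < threshold:
            bounds.append(i)
    return bounds

doc = [
    "the priest attacked the pope",
    "the priest was charged with attacking the pope",
    "basketball scores were reported in utah",
    "utah recorded a fourth straight basketball win",
]
print(boundaries(doc))  # -> [2]: the vocabulary shifts before sentence 2
```

Real segmenters compare multi-sentence blocks and smooth the similarity curve before picking boundaries; the single-sentence window here is only for brevity.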

112 BG: Summarization process 1. linguistic analysis 2. discourse segmentation 3. extended phrase analysis 4. anaphora resolution 5. calculation of discourse salience 6. topic stamp identification 7. capsule overview

113 Knowledge-rich approaches. Structured information can be used as the starting point for summarization; such structured information (e.g. data and knowledge bases) may itself have been produced by processing input text. The summarizer does not have to address the linguistic complexities and variability of the input, but on the other hand the structure of the input text is not available either.

114 Knowledge-rich approaches. There is a need for measures of salience and relevance that are dependent on the knowledge source; addressing coherence, cohesion, and fluency becomes the entire responsibility of the generator.

115 STREAK. McKeown, Robin, Kukich (1995): Generating concise natural language summaries. Goal: folding information from multiple facts into a single sentence using concise linguistic constructions.

116 STREAK. Produces summaries of basketball games: it first creates a draft of essential facts, then uses revision rules, constrained by the draft wording, to add in additional facts as the text allows.

117 STREAK. Input: a set of box scores for a basketball game, plus historical information (from a database). Task: summarize the highlights of the game, underscoring their significance in the light of previous games. Output: a short summary of a few sentences.

118 STREAK. The box score input is represented as a conceptual network that expresses relations between what were the columns and rows of the table. Essential facts: the game result, its location, date, and at least one final game statistic (the most remarkable statistic of a winning team player).

119 STREAK. Essential facts can be obtained directly from the box score. In addition, there are other potential facts: other notable game statistics of individual players (from the box score); game result streaks, e.g. “Utah recorded its fourth straight win” (historical); extremum performances such as maximums or minimums (historical).

120 STREAK. Essential facts are always included; potential facts are included if there is space. The decision on which potential facts to include could be based on whether they can be combined with the essential information in cohesive and stylistically successful ways.

121 STREAK. Given the facts “Karl Malone scored 39 points” and “Karl Malone’s 39 point performance is equal to his season high”, a single sentence is produced: “Karl Malone tied his season high with 39 points”.
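The draft-and-revise idea behind this example can be sketched as follows; the fact representation and the single revision rule are illustrative assumptions of this toy, not STREAK's actual rule set:

```python
def draft(fact):
    """Realize the essential fact as a simple draft sentence."""
    return f"{fact['player']} scored {fact['points']} points"

def revise(sentence, essential, extra):
    """Fold a 'season high' fact into the draft using a more concise verb."""
    if (extra.get("relation") == "equals_season_high"
            and extra.get("player") == essential["player"]):
        return (f"{essential['player']} tied his season high "
                f"with {essential['points']} points")
    return sentence  # the revision rule does not apply; keep the draft

essential = {"player": "Karl Malone", "points": 39}
historical = {"player": "Karl Malone", "relation": "equals_season_high"}

s = draft(essential)        # "Karl Malone scored 39 points"
s = revise(s, essential, historical)
print(s)  # -> Karl Malone tied his season high with 39 points
```

The key point is that revision rewrites the draft's wording rather than appending a second sentence, which is how STREAK keeps the summary concise.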

122 Text summarization: surface-level methods (“manual” features, corpus-based learning); discourse-level methods; knowledge-rich methods.