WikiSimple – Automatic Simplification of Wikipedia Articles
By Kristian Woodsend and Mirella Lapata
Presented by Kira Belkin, 05/2012
1. Introduction 2. Automatic Simplification 3. Experiment 4. Results
Text Simplification
Text simplification is used to make a text accessible to a broader audience. One example is the Simple English Wikipedia (http://simple.wikipedia.org/wiki/Main_Page). It uses simple English words and grammar, and currently contains about 80,000 articles.
Text Simplification – How is it done manually? Some of the Simple Wikipedia instructions.
Do:
- Use basic English vocabulary – start with the BE 850 word list.
- When using a difficult word, explain its meaning in parentheses.
- Write shorter sentences.
- Use simple sentence structure.
But simple doesn't mean short: the language is simple, but the ideas don't have to be.
Text Simplification – How is it done manually?
Don't:
- Use poor grammar or incorrect spelling.
- Use contractions (such as I've, can't, hasn't).
- Use idioms (such as break a leg).
Text Simplification – examples
Lexical simplification:
a. She burst into tears when she couldn't locate her keys.
b. She cried when she couldn't find her keys.
Structural simplification:
a. It was a Honda that John sold to Tom.
b. John sold a Honda to Tom.
Text Simplification – not an easy task
The simplification task requires attention, common sense and intelligence. For example, the Winston Churchill quote "I have nothing to offer but blood, toil, tears, and sweat", translated using only BE 850 words, would come out as "... blood, hard work, drops from eyes, and body water." So it looks like a pretty difficult task to perform automatically.
Who can benefit from it?
- Children.
- Non-native English speakers.
- People with learning disabilities.
- People suffering from language impairments, such as aphasia.
- As a preprocessing step for other tasks, such as parsing, machine translation, summarization and semantic role labeling.
So, is SimpleEW successful?
Both SimpleEW and MainEW started around the same time, in 2001. Let's take a look at the statistics: SimpleEW has about 80,000 articles, while MainEW has about 4 million. Perhaps this is because it is harder to write an article for SimpleEW, due to the strict restrictions.
So, is SimpleEW successful?
In fact, a lot of SimpleEW articles are "translated" from MainEW, and are not only simpler but also shorter and lacking information.
A few remarks
SimpleEW hasn't necessarily failed. We can look at it simply as a different language that contains 80,000 articles. For example, the Norwegian Wikipedia contains around 82,000 articles and of course isn't a failed project.
1. Introduction 2. Automatic Simplification 2.1 Previous Studies 3. Experiment 4. Results
Previous work – Not learning
It has been mostly rule-based. Different approaches tried to use syntactic simplification rules to split long, complicated sentences, or to apply various lexical simplifications.
Previous work – Learning
- Using Main & Simple Wikipedia articles as a training corpus for learning to automatically translate from one to the other.
- Using revision history from MainEW as training data for learning a sentence compression model.
- More recent methods explore learning semantic simplifications from the SimpleEW revision history (Yatskar 2010).
1. Introduction 2. Automatic Simplification 2.1 Previous Studies 2.2 The current study 3. Experiment 4. Results
How does it differ from previous work?
- Provides an end-to-end system that simplifies whole articles, using the SimpleEW revision history as training data.
- The model deals with both lexical and structural simplifications: it simplifies both content and structure.
- Simplification is done at the document level, instead of the sentence level.
- Potentially helpful with other NLP tasks, such as semantic relatedness, information extraction, etc.
The Simplification Model
Scheme: MainEW article → selection of salient phrases (and their sentences) → simplifying them → a simpler version of the article.
Background
The grammatical structure of a sentence is often represented by a hierarchical syntactic tree structure.
Quasi-synchronous Grammar
General idea: using a training corpus of sentences and their simplifications, a set of rules is created. Later, these rules are applied to new texts, creating a simpler version of them.
Example
John Smith walked his dog and afterwards met Mary.
John Smith walked his dog. He met Mary later.
What rules can be deduced?
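As an illustration only — the notation and the particular rules below are our guesses, not the rules the system actually extracted — the pair suggests one structural rule (splitting the coordination into two sentences) and one lexical rule:

# Hypothetical rules that could be deduced from the example pair (illustrative only).
structural_rule = (
    "(S (NP ?subj) (VP ?vp1 (CC and) (ADVP afterwards) ?vp2))",          # source pattern
    "(S (NP ?subj) (VP ?vp1)) . (S (NP He) (VP ?vp2 (ADVP later))) .",   # target: two sentences
)
lexical_rule = ("afterwards", "later")   # substitute a simpler adverb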
QG – How is it done? (1)
The model operates on documents with syntactic information, obtained using the Stanford parser. The QG is extracted from SimpleEW edit histories. The QG doesn't assume a strictly synchronous structure over the source and target sentences; it identifies some alignment between them.
QG – How is it done? (2)
We take a pair of sentences with their trees and build a list of leaf node alignments based on lexical identity. We then align direct parent nodes where more than one child node aligns. A grammar rule is created if all of the nodes in the target tree can be explained using nodes from the source; a small amount of substitution is allowed.
QG – How is it done? (3)
QG rules are created from aligned nodes above the leaf node level. Finally, the simplified text is created from the source sentence parse trees by applying suitable rules recursively.
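A minimal sketch of the alignment steps just described, using NLTK parse trees; the function name, the case-insensitive word matching and the lowest-common-ancestor heuristic for parent nodes are our assumptions rather than the authors' implementation:

from nltk import Tree

def align_nodes(source: Tree, target: Tree):
    """Align target nodes to source nodes bottom-up, as sketched on the slides."""
    # 1. Leaf alignment based on lexical identity.
    src_leaf_pos = {}
    for pos, word in zip(source.treepositions("leaves"), source.leaves()):
        src_leaf_pos.setdefault(word.lower(), pos)
    alignment = {}
    for pos, word in zip(target.treepositions("leaves"), target.leaves()):
        if word.lower() in src_leaf_pos:
            alignment[pos] = src_leaf_pos[word.lower()]
    # 2. Align a direct parent node when more than one of its children aligns.
    for pos in sorted(target.treepositions(), key=len, reverse=True):
        node = target[pos]
        if pos in alignment or isinstance(node, str):
            continue
        aligned_children = [alignment[pos + (i,)] for i in range(len(node))
                            if pos + (i,) in alignment]
        if len(aligned_children) > 1:
            # Map the parent to the lowest common ancestor of its children's source nodes.
            lca = aligned_children[0]
            for other in aligned_children[1:]:
                k = 0
                while k < min(len(lca), len(other)) and lca[k] == other[k]:
                    k += 1
                lca = lca[:k]
            alignment[pos] = lca
    return alignment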
Several simplification options
Sometimes, more than one simplification is possible. As opposed to previous QG models, this one doesn't use a probability model to decide which possibility to use. Instead, a frequency count indicates how often each rule is encountered in the training data. All alternative simplifications are incorporated into the target parse tree, and the ILP model will choose which one to use.
ILP Formulation
A binary integer linear program. The input is phrase structure trees augmented with alternative simplifications. Each phrase in the MainEW document is given a salience score, representing whether it should be included in the simple version or not. This is done using SVMs (support vector machines, which analyze data and recognize patterns).
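The slides do not show how the salience scores are computed in code; a minimal sketch of one way to produce phrase-level scores with a linear SVM in scikit-learn follows (the feature names and the use of the decision margin as the score f_i are illustrative assumptions):

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def train_salience_model(phrase_features, labels):
    """phrase_features: one feature dict per phrase; labels: 1 if the phrase
    was kept in the SimpleEW version of the training article, else 0."""
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(phrase_features)
    classifier = LinearSVC()          # a linear SVM over the phrase features
    classifier.fit(X, labels)
    return vectorizer, classifier

def salience_scores(vectorizer, classifier, phrase_features):
    # The signed distance to the separating hyperplane serves as the score f_i.
    return classifier.decision_function(vectorizer.transform(phrase_features))

# One hypothetical feature dict (these features are illustrative, not the paper's):
example_phrase = {"in_first_paragraph": 1, "tree_depth": 2, "num_words": 7, "contains_title_word": 1}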
The objective trades off:
- the number of words below the target (positive means sentences are shorter than the target),
- the number of syllables below the target,
- the salience score,
- the rewrite penalty.
Eventually, we get the best simplification option, using the "best" set of rules.
Some other parameters
- We set the maximum length of the output at L_max words.
- We ensure that the phrase dependencies are respected and that the resulting structure is a tree, thus providing grammatical correctness.
- If QG provides several alternative simplifications, we of course select only one.
- Finally, phrases are linked to sentences.
A sketch of such an ILP is given below.
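This sketch uses the PuLP modelling library; the variable names, the exact constraint forms and the omission of the readability terms from the objective are simplifications of ours, not the paper's formulation:

import pulp

def build_simplification_ilp(phrases, sentences, scores, penalties, lengths_w,
                             deps, choice_sets, sentence_of, max_words=250):
    """Sketch of the binary ILP described above; the readability terms h_w and
    h_sy are left out of the objective for brevity."""
    prob = pulp.LpProblem("wikisimple_sketch", pulp.LpMaximize)
    x = {i: pulp.LpVariable(f"x_{i}", cat="Binary") for i in phrases}    # phrase i kept?
    y = {s: pulp.LpVariable(f"y_{s}", cat="Binary") for s in sentences}  # sentence s kept?

    # Objective: salience reward minus rewrite penalty for every kept phrase.
    prob += pulp.lpSum((scores[i] - penalties[i]) * x[i] for i in phrases)

    # Maximum output length of max_words (L_max) words.
    prob += pulp.lpSum(lengths_w[i] * x[i] for i in phrases) <= max_words

    # Phrase dependencies: a dependent phrase may only appear with its head phrase.
    for head, dependents in deps.items():
        for d in dependents:
            prob += x[d] <= x[head]

    # Alternative simplifications: keep at most one child of each choice node,
    # and only if the choice node itself is kept.
    for node, alternatives in choice_sets.items():
        prob += pulp.lpSum(x[a] for a in alternatives) <= x[node]

    # Link phrases to their sentences.
    for i in phrases:
        prob += x[i] <= y[sentence_of[i]]

    prob.solve()
    kept_phrases = {i for i in phrases if x[i].value() == 1}
    kept_sentences = {s for s in sentences if y[s].value() == 1}
    return kept_phrases, kept_sentences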
1. Introduction 2. Automatic Simplification 3. Experiment 4. Results
Experimental Setup
1,654 articles on Animals, Celebrities and Cities were extracted from MainEW and SimpleEW. Each category was split into a training set and a test set. The corpus was parsed using the Stanford parser.
QG rule extraction
QG rules were learned from the revision histories of SimpleEW articles. Revisions whose comments mentioned simplification were identified, and each of these revisions was compared to the previous version. The modified sections were identified using a diff program. In total, 14,831 paired sentences were used to create QG rules.
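A minimal sketch of the diff step, assuming both revisions have already been split into sentences; here Python's difflib plays the role of the diff program, and pairing replaced sentences position by position is our simplification:

import difflib

def simplification_pairs(old_sentences, new_sentences):
    """Pair up sentences that were modified between two revisions of an article."""
    matcher = difflib.SequenceMatcher(None, old_sentences, new_sentences)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            # A modified section: align old and new sentences position by position.
            pairs.extend(zip(old_sentences[i1:i2], new_sentences[j1:j2]))
    return pairs

# Only revisions whose comments mention simplification would be used, e.g.:
# if "simpl" in comment.lower():
#     training_pairs += simplification_pairs(previous_revision, this_revision)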
QG rule extraction
Overall, 269 syntactic simplification rules and 5,779 lexical substitution rules were obtained.
SimpleEW article generation
Test articles were generated from the corresponding MainEW articles. For each document an ILP was created and solved with the following parameters:
- L_max = 250 (maximum number of words)
- target words per sentence (wps) = 8
- target syllables per word (spw) = 1.5
- the ZIB Optimization Suite software was used
Finally, the solution was converted into an article by removing the unchosen nodes from the tree representation and then joining the remaining leaf nodes in order.
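A minimal sketch of that final realization step, assuming the document is held as NLTK parse trees and the solver returns the tree positions of the dropped phrase nodes (both assumptions of ours):

from nltk import Tree

def realize_sentence(tree: Tree, removed_positions: set) -> str:
    """Drop the subtrees the solver did not select and join the remaining leaves
    in their original order; removed_positions holds the tree positions (tuples)
    of the phrase nodes that were not chosen."""
    kept_words = []
    for leaf_pos in tree.treepositions("leaves"):
        ancestors = {leaf_pos[:k] for k in range(len(leaf_pos))}
        if not ancestors & removed_positions:   # no ancestor of this leaf was dropped
            kept_words.append(tree[leaf_pos])
    return " ".join(kept_words)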
1. Introduction 2. Automatic Simplification 3. Experiment 4. Results
Compared to:
1) the SimpleEW article
2) the "preamble" (the introduction sentences before the sections)
3) SpencerK (a summary based on main sentence extraction, with lexical simplifications provided by the SimpleEW editor SpencerK)
* Note: they didn't compare with previous studies.
Measured using:
1) measures assessing the readability of written documents
2) 15 volunteers, non-native English speakers.
The volunteers were given 9 articles: all 4 simple versions of each, plus access to the MainEW article. They were asked to rank the versions in order of simplicity and informativeness.
Results – Automatic evaluation
FKGL corresponds to a grade level, e.g. 8.2 means readable by an 8th-grade student.
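For reference, FKGL is the standard Flesch-Kincaid Grade Level; this is the textbook definition, not something specific to this paper:

def fkgl(words, sentences, syllables):
    """Flesch-Kincaid Grade Level; e.g. a score of 8.2 reads like 8th-grade text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59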
Results – Human evaluation
"Judge a man by his questions rather than his answers" (Voltaire)
Definitions
- S – the set of sentences in a document.
- P – the set of phrases in a document.
- P_s ⊂ P – the set of phrases in sentence s ∈ S.
- D_i ⊂ P – for each phrase i ∈ P, the phrase dependency information: each set D_i contains the phrases that depend on the presence of i.
- C ⊂ P – the set of nodes involving choices between alternative simplifications.
- C_i ⊂ P, for i ∈ C – the sets of phrases that are direct children of such a node.
- l_i^(w) – the length of phrase i in words.
- l_i^(sy) – the length of phrase i in syllables.
- x ∈ {0,1}^|P| – a vector of binary variables indicating whether each phrase is to be part of the output.
- y ∈ {0,1}^|S| – a vector of binary variables indicating whether each sentence is to be part of the output.
Formulas
The objective function combines:
- f_i – the salience score for each phrase i,
- g_i – a rewrite penalty, where common QG rules are given a smaller penalty than rare QG rules,
- h^w and h^sy – terms associated with simpler language.
- wps – the target average number of words per sentence.
- spw – the target average number of syllables per word.
- h^w(x, y) – measures the number of words below the target level.
- h^sy(x) – measures the number of syllables below the target level.
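The objective itself appeared only as an image on the slide; a plausible reconstruction, consistent with the definitions above but our reading rather than a verbatim copy of the paper's formula, is:

\max_{x,\,y} \;\; \sum_{i \in P} (f_i - g_i)\, x_i \;+\; h^{w}(x, y) \;+\; h^{sy}(x)

with, under the same reading,

h^{w}(x, y) = \mathit{wps} \sum_{s \in S} y_s \;-\; \sum_{i \in P} l_i^{(w)} x_i, \qquad
h^{sy}(x) = \mathit{spw} \sum_{i \in P} l_i^{(w)} x_i \;-\; \sum_{i \in P} l_i^{(sy)} x_i,

so that both terms are positive exactly when the output is below the word-per-sentence and syllable-per-word targets.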