1 Readability Assessment for Text Simplification
Sandra Aluisio (1), Lucia Specia (2), Caroline Gasperin (1), Carolina Scarton (1)
(1) University of São Paulo, Brazil
(2) University of Wolverhampton, UK
The 5th Workshop on Innovative Use of NLP for Building Educational Applications

2 Motivation
Develop technology to benefit low literacy readers
INAF levels (figure value: 68 %)
– Rudimentary: studied up to 4 years; can find explicit information in short and familiar texts
– Basic: studied between 4 and 8 years; can read and understand texts of average length, and find information even when it is necessary to make some inference

3 Readability Assessment
To assess the readability level of a text
– Three levels of readability (INAF levels): Rudimentary, Basic, Advanced
To supplement our text simplification technology
– Two levels of simplification: degree of application of simplification operations
  STRONG: operations are applied to all complex syntactic phenomena present
  NATURAL: operations are applied selectively, only when the resulting text remains “natural”
Diagram labels: RUDIMENTARY, BASIC

4 Text Simplification Scenario
Authoring tool for creating simplified texts (SIMPLIFICA)
1. Author inputs text
2. Author receives suggestions of possible simplifications and may accept them or not
   – Lexical substitutions
   – Syntactic simplification
3. Author does not know whether the text is simple enough for the intended audience
   – Feedback: readability assessment

5 Readability Assessment System
Machine learning
– Classes = 3 INAF levels
– Trained on corpus of manually simplified texts
  Original text + natural and strong simplifications
– Extensive set of features
  Cognitively-motivated: Coh-Metrix [Graesser et al., 2004]
  Syntactic: occurrence of complex phenomena
  Language model: up to trigrams
– 3 paradigms: Classification, Ordinal Classification, Regression [Heilman et al., 2007]
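The language-model features mentioned above can be illustrated with a minimal unigram sketch. The slides do not say which toolkit or smoothing method was used, so the add-one smoothing and the function names `train_unigram` and `perplexity` are assumptions made only for this example.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Build an add-one smoothed unigram model (smoothing choice is an assumption).

    Returns a function mapping a word to its natural-log probability.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    # One extra vocabulary slot stands in for any unseen word.
    vocab_size = len(counts) + 1

    def log_prob(word):
        return math.log((counts.get(word, 0) + 1) / (total + vocab_size))

    return log_prob


def perplexity(log_prob, tokens):
    """Perplexity = exp of the negative average log-probability per token."""
    if not tokens:
        raise ValueError("cannot compute perplexity of an empty text")
    avg_log_prob = sum(log_prob(t) for t in tokens) / len(tokens)
    return math.exp(-avg_log_prob)


# Toy training data, invented for illustration only.
model = train_unigram(["a", "b", "a", "c"])
seen = perplexity(model, ["a"])
unseen = perplexity(model, ["zzz"])
```

A text made of words the model has rarely or never seen gets a higher perplexity, which is why perplexity can serve as a rough signal of unfamiliar vocabulary.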

6 Corpora
Training and testing corpora
– General news: Zero Hora (ZH) newspaper
– Popular science news: Caderno Ciencia (CC)
– 3 versions for each text: original, natural, strong

7 Features
1. Number of words
2. Number of sentences
3. Number of paragraphs
4. Number of verbs
5. Number of nouns
6. Number of adjectives
7. Number of adverbs
8. Number of pronouns
9. Average number of words per sentence
10. Average number of sentences per paragraph
11. Average number of syllables per word
12. Flesch index for Portuguese
13. Incidence of content words
14. Incidence of functional words
15. Raw frequency of content words
16. Minimal frequency of content words
17. Average number of verb hypernyms
18. Incidence of NPs
19. Number of NP modifiers
20. Number of words before the main verb
21. Number of high level constituents
22. Number of personal pronouns
23. Type-token ratio
24. Pronoun-NP ratio
25. Number of “e” (and)
26. Number of “ou” (or)
27. Number of “se” (if)
28. Number of negations
29. Number of logic operators
30. Number of connectives
31. Number of positive additive connectives
32. Number of negative additive connectives
33. Number of positive temporal connectives
34. Number of negative temporal connectives
35. Number of positive causal connectives
36. Number of negative causal connectives
37. Number of positive logic connectives
38. Number of negative logic connectives
39. Verb ambiguity ratio
40. Noun ambiguity ratio
41. Adverb ambiguity ratio
42. Adjective ambiguity ratio
43. Incidence of clauses
44. Incidence of adverbial phrases
45. Incidence of apposition
46. Incidence of passive voice
47. Incidence of relative clauses
48. Incidence of coordination
49. Incidence of subordination
50. Out-of-vocabulary words
51. LM probability of unigrams
52. LM perplexity of unigrams
53. LM perplexity of unigrams, no line break
54. LM probability of bigrams
55. LM perplexity of bigrams
56. LM perplexity of bigrams, no line break
57. LM probability of trigrams
58. LM perplexity of trigrams
59. LM perplexity of trigrams, no line break
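A few of the surface features in this list (numbers 1 to 3, 9, 10 and 23) can be computed with very little code. The sketch below is only an illustration: the regular-expression tokenization and sentence splitting are naive assumptions, and the function name `surface_features` is hypothetical, not taken from the authors' system.

```python
import re


def surface_features(text):
    """Compute a handful of surface readability features from raw text.

    Assumptions: paragraphs are separated by blank lines, sentences end
    with '.', '!' or '?', and words are runs of letters/digits.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    return {
        "num_words": len(words),
        "num_sentences": len(sentences),
        "num_paragraphs": len(paragraphs),
        "words_per_sentence": len(words) / len(sentences),
        "sentences_per_paragraph": len(sentences) / len(paragraphs),
        # Type-token ratio: distinct word forms divided by total words.
        "type_token_ratio": len(set(words)) / len(words),
    }


# Invented example text, for illustration only.
example = "O gato dorme. O gato come peixe.\n\nEle gosta de peixe!"
features = surface_features(example)
```

Features that need linguistic analysis (for example incidence of apposition or words before the main verb) would require a parser and are outside the scope of this sketch.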

8 Feature Analysis
Pearson correlation between features and literacy levels

Rank  Feature                                   Correlation
1     Words per sentence                        0.693
2     Incidence of apposition                   0.688
3     Incidence of clauses                      0.614
4     Flesch index                              0.580
5     Words before main verb                    0.516
6     Sentences per paragraph                   0.509
7     Incidence of relative clauses             0.417
8     Syllables per word                        0.414
9     Number of positive additive connectives   0.397
10    Number of negative causal connectives     0.388
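For readers unfamiliar with the measure, the following sketch shows how a Pearson correlation between one feature and the literacy level can be computed. Encoding the three levels as the integers 1, 2 and 3 is an assumption, and the data values are invented for illustration; they are not the corpus values behind the table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: level encoded as 1..3 (assumption), one feature value per text.
levels = [1, 1, 2, 2, 3, 3]
words_per_sentence = [9.0, 11.0, 15.0, 17.0, 22.0, 24.0]
r = pearson(words_per_sentence, levels)
```

A value close to 1 means the feature grows steadily with the level; a value close to 0 means the feature carries little linear information about the level.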

9 Predicting Readability Levels
Classification: Weka SVM
Ordinal Classification: Weka Pairwise SVM

10 Predicting Readability Levels
Regression: Weka SVM-reg, RBF kernel
– Best correlation: Regression
– Lowest MAE: Ordinal Classification
– Combination of all features consistently yields better results: more robust
– Syntactic features achieve the best correlation scores
– Language model features performed the poorest
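The MAE comparison above treats the three levels as ordered values. A minimal sketch of that evaluation follows, assuming the levels are encoded as 1 (rudimentary), 2 (basic) and 3 (advanced). The slides do not say how continuous regression outputs were handled, so the rounding step in `to_level_index` is an assumption, and all function names are hypothetical.

```python
import math

LEVELS = ("rudimentary", "basic", "advanced")  # encoded as 1, 2, 3 (assumption)


def to_level_index(score):
    """Map a continuous regression score to the nearest valid level index.

    Rounds half up and clamps to the range 1..len(LEVELS).
    """
    rounded = int(math.floor(score + 0.5))
    return min(len(LEVELS), max(1, rounded))


def mean_absolute_error(gold, predicted):
    """Average absolute distance between gold and predicted level indices."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)


def accuracy(gold, predicted):
    """Fraction of texts whose predicted level matches the gold level exactly."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Invented scores, for illustration only.
gold = [1, 2, 3]
predicted = [to_level_index(s) for s in (1.2, 2.6, 0.4)]
```

Unlike accuracy, MAE penalizes a prediction two levels away more than a prediction one level away, which is why it suits ordered classes such as these.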

11 Conclusions
It is possible to predict with satisfactory performance the readability level of texts according to our three classes of interest
Ordinal Classification seems to be the most appropriate model to use
– High correlation, lowest error rate (MAE)
Combination of all features is best

12 SIMPLIFICA Tool
Integration of classification model
– Simplest model, highest F-measure, comparable correlation scores

13 Future Work
Add deeper cognitive features, e.g. semantic, coreference, latent semantics metrics
User evaluation: authors

14 Thanks!

