Published by Marian Morris. Modified over 9 years ago.
2
What is Readability?
A characteristic of text documents.
▪ “the sum total of all those elements within a given piece of printed material that affect the success a group of readers have with it. The success is the extent to which they understand it, read it at an optimal speed, and find it interesting.” (Dale & Chall, 1949)
▪ “ease of understanding or comprehension due to the style of writing” (Klare, 1963)
3
Readability encompasses a number of areas…
Syntactic complexity of the text
▪ grammatical arrangement of words within a sentence (e.g., active vs. passive sentences have been shown to affect readability)
▪ simple, compound, and complex sentences
Organization of text
▪ discourse structure
▪ textual cohesion
Semantic complexity of the text
4
▪ Improving literacy rates
▪ Improving instruction delivery
▪ Judging technical manuals
▪ Matching text to the appropriate grade level
▪ And many more…
5
Assign a score to a text based on some textual cues (e.g., average sentence length)
Readability formula
▪ Over 200 formulas by the 1980s (DuBay 2004)
Textual cues
▪ sentence length, percentage of familiar words, word length, syllables per word, etc.
Testing validity: correlating the predicted score with reading comprehension scores
7
Dale-Chall Formula
Maintains a list of “easy words”.
Score = 0.1579 × PDW + 0.0496 × ASL + 3.6365
▪ PDW = Percentage of Difficult Words (words not on the easy-word list)
▪ ASL = Average Sentence Length (in words)
FOG index
Lexile scale
Commonalities among formulae
▪ Linear regression over some predictor variables
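The Dale-Chall score above can be sketched in a few lines of Python. This is a minimal illustration, not the official implementation: `EASY_WORDS` is a tiny hypothetical stand-in for the real easy-word list (which is far longer), and the sentence and word splitting is deliberately naive.

```python
import re

# Hypothetical stand-in for the Dale-Chall "easy words" list.
EASY_WORDS = {"the", "cat", "sat", "on", "a", "mat", "it", "was", "happy"}


def dale_chall_score(text, easy_words=EASY_WORDS):
    """Score = 0.1579 * PDW + 0.0496 * ASL + 3.6365 (formula from the slide)."""
    # Naive segmentation: split on sentence-final punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    # PDW: percentage of words that are not on the easy-word list.
    pdw = 100.0 * sum(w not in easy_words for w in words) / len(words)
    # ASL: average sentence length in words.
    asl = len(words) / len(sentences)
    return 0.1579 * pdw + 0.0496 * asl + 3.6365


easy = dale_chall_score("The cat sat on a mat. It was happy.")
hard = dale_chall_score("The cat sat on a sophisticated mat.")
print(round(easy, 4), round(hard, 4))
```

A text with more off-list words or longer sentences receives a higher score, which is the behavior the linear formula encodes.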
8
Traditional readability measures are robust on large text samples (textbooks and essays) but less so on short, concise web documents.
Web documents are generally noisy.
Resource: Predicting Reading Difficulty With Statistical Language Models, Kevyn Collins-Thompson and Jamie Callan
9
A language model (LM) can encode more complex relationships than the simple linear regression models used in traditional readability measures.
It gives a probability distribution over all grade levels.
The relative difficulty of words can be obtained statistically, rather than hardcoded as in traditional measures.
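As a rough sketch of the idea, one unigram language model can be trained per grade, and a new text scored under each to obtain a distribution over grade levels. The two-grade corpus `TRAIN` is made up for illustration, a uniform prior over grades is assumed, and add-one smoothing is swapped in here only to keep the example short.

```python
import math
from collections import Counter

# Hypothetical toy training text per grade level.
TRAIN = {
    1: "the red ball is big the dog is red",
    8: "we determine the result and evaluate the evidence",
}


def train_models(train):
    """Return per-grade unigram counts and the shared vocabulary."""
    counts = {grade: Counter(text.split()) for grade, text in train.items()}
    vocab = set().union(*counts.values())
    return counts, vocab


def log_likelihood(tokens, counter, vocab_size):
    """Log-likelihood under an add-one smoothed unigram model.

    One extra slot (vocab_size + 1) is reserved for unknown words.
    """
    total = sum(counter.values())
    return sum(math.log((counter[t] + 1) / (total + vocab_size + 1)) for t in tokens)


def grade_distribution(text, counts, vocab):
    """Posterior over grades, assuming a uniform prior."""
    tokens = text.lower().split()
    ll = {g: log_likelihood(tokens, c, len(vocab)) for g, c in counts.items()}
    top = max(ll.values())  # subtract the max for numerical stability
    z = sum(math.exp(v - top) for v in ll.values())
    return {g: math.exp(v - top) / z for g, v in ll.items()}


counts, vocab = train_models(TRAIN)
print(grade_distribution("the red dog", counts, vocab))
```

Unlike a single regression score, the output is a full distribution, so an ambiguous text can show probability spread over several grades.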
10
Earlier grade readers tend to use more concrete words (e.g. red); later grade readers use more abstract words (e.g., determine) Same observations in web documents
14
Text words
15
Token
28
Smooth each grade-based language model using Good-Turing smoothing.
This gives an estimate of the total probability mass of all unseen words.
We still need to find each unseen word’s share of this total probability mass.
Uniform probability distribution?
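The uniform option raised above can be sketched as follows. The Good-Turing estimate of the unseen mass is N1/N (words seen exactly once, divided by the total token count). As a simplifying assumption, this sketch rescales the seen-word probabilities proportionally instead of applying the full Good-Turing count adjustment; the function name and inputs are hypothetical.

```python
from collections import Counter


def good_turing_uniform(tokens, vocabulary):
    """Estimate unseen mass as N1/N and share it uniformly among unseen words.

    Simplification: seen words keep their relative frequencies, scaled by
    (1 - p0), rather than using adjusted Good-Turing counts.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    n = sum(counts.values())                         # total tokens
    n1 = sum(1 for c in counts.values() if c == 1)   # words seen exactly once
    unseen = [w for w in vocabulary if w not in counts]
    p0 = n1 / n if unseen else 0.0                   # mass reserved for unseen words
    probs = {w: (1 - p0) * c / n for w, c in counts.items()}
    for w in unseen:
        probs[w] = p0 / len(unseen)                  # uniform share
    return p0, probs


p0, probs = good_turing_uniform("a a a b b c".split(), {"a", "b", "c", "d", "e"})
print(p0, probs["d"])
```

A uniform split treats every unseen word as equally likely in a grade, which ignores what other grades know about that word.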
29
Usage of discriminative words is clustered around particular grade levels.
Borrow probability mass from neighboring grade classes.
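One way to picture borrowing from neighboring grades is a weighted average of a word's probability across nearby grade models. The triangular weighting, the `width` parameter, and the function name below are assumptions made for illustration, not the exact scheme of the cited work, and the result would still need renormalizing over the vocabulary.

```python
def smooth_across_grades(word_probs, grade, width=2):
    """Weighted average of one word's probability over nearby grades.

    word_probs: {grade: probability of the word under that grade's model}.
    Weights fall off linearly with grade distance (hypothetical triangular
    kernel) and reach zero beyond `width` grades away.
    """
    weights = {g: max(0.0, 1 - abs(g - grade) / (width + 1)) for g in word_probs}
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(weights[g] * p for g, p in word_probs.items()) / total


# A word unseen at grade 3 but observed at grades 4 to 6.
probs_by_grade = {3: 0.0, 4: 0.02, 5: 0.04, 6: 0.02}
print(smooth_across_grades(probs_by_grade, grade=3))
```

The word now gets a non-zero estimate at grade 3 that reflects how common it is in adjacent grades, instead of an equal share of the unseen mass.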
32
[Figure: documents with assigned readability scores are used for training; a new document then receives a readability score]
Resource: Revisiting Readability: A Unified Framework for Predicting Text Quality, Emily Pitler and Ani Nenkova
33
There are different predictor variables indicating the readability score.
What is the contribution of each individual predictor variable to the readability score?
Testing methodology
▪ Collect a readability corpus
▪ Extract predictor variables
▪ Measure correlation
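The "measure correlation" step can be illustrated with a plain Pearson correlation between one predictor variable and the readability ratings. The numbers below are invented for the example; only the procedure is the point.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Hypothetical data: average sentence length per article vs. human rating.
avg_sentence_length = [12.0, 18.5, 22.0, 27.5, 31.0]
readability_rating = [4.5, 4.0, 3.2, 2.9, 2.1]
print(pearson(avg_sentence_length, readability_rating))
```

A coefficient near +1 or -1 suggests the predictor carries signal about readability; a value near 0 suggests it contributes little on its own.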
36
+Ve -Ve
41
▪ Log likelihood, WSJ: article likelihood estimated from a language model from WSJ
▪ Log likelihood, NEWS: article likelihood according to a unigram language model from NEWS
▪ LL with length, WSJ: linear regression of WSJ unigram and article length
▪ LL with length, NEWS: linear regression of NEWS unigram and article length
42
Average parse tree height
Average number of noun phrases per sentence
Average number of verb phrases per sentence
Average number of subordinate clauses per sentence
▪ Counting SBAR nodes in parse tree
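These syntactic features can be read off a bracketed parse. The sketch below assumes Penn Treebank style bracketing is already available from some parser; the small reader and the height convention (a bare word has height 0) are choices made for this example.

```python
import re


def parse(bracketed):
    """Read a bracketed parse into nested (label, children) tuples; words stay strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1  # consume ")"
            return (label, children)
        word = tokens[pos]
        pos += 1
        return word

    return read()


def height(tree):
    """Tree height, counting a bare word as 0 (so a POS tag node has height 1)."""
    if isinstance(tree, str):
        return 0
    return 1 + max((height(child) for child in tree[1]), default=0)


def count_label(tree, label):
    """Number of nodes with the given label, e.g. NP, VP, or SBAR."""
    if isinstance(tree, str):
        return 0
    return (tree[0] == label) + sum(count_label(child, label) for child in tree[1])


tree = parse("(S (NP (PRP I)) (VP (VBP know) (SBAR (IN that) (S (NP (PRP it)) (VP (VBZ works))))))")
print(height(tree), count_label(tree, "NP"), count_label(tree, "VP"), count_label(tree, "SBAR"))
```

Averaging these per-sentence values over all sentences of a document gives the four features listed above.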
43
Curious case of average verb phrases
A higher number of verb phrases per sentence may increase text complexity
▪ so the average number of verb phrases should correlate negatively with readability
Let’s look at the following examples:
(1) It was late at night, but it was clear. The stars were out and the moon was bright.
(2) It was late at night. It was clear. The stars were out. The moon was bright.
44
Aspects of well-written discourse
Cohesive devices like pronouns, definite descriptions, topic continuity
▪ Number of pronouns per sentence
▪ Number of definite articles per sentence
▪ Average cosine similarity
▪ Word overlap
▪ Word overlap over nouns and pronouns
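Two of these cohesion features can be sketched with bag-of-words vectors. The sketch assumes the similarity is computed between each pair of adjacent sentences and then averaged, and it uses Jaccard overlap as one plausible definition of word overlap; both are assumptions, as are the function names.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def word_overlap(a, b):
    """Jaccard overlap of the two sentences' word sets (assumed definition)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def average_adjacent(sentences, measure):
    """Average a pairwise measure over all adjacent sentence pairs."""
    bags = [Counter(s.lower().split()) for s in sentences]
    pairs = list(zip(bags, bags[1:]))
    if not pairs:
        return 0.0
    return sum(measure(a, b) for a, b in pairs) / len(pairs)


doc = ["the cat sat", "the cat ran", "dogs bark"]
print(average_adjacent(doc, cosine), average_adjacent(doc, word_overlap))
```

Restricting the bags to nouns and pronouns would give the last feature in the list, but that needs a part-of-speech tagger and is left out here.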
45
Entity-based approach towards local coherence
Discourse coherence is achieved in view of the way discourse entities are introduced and discussed.
Some entities are more salient than others
▪ Salient entities are more likely to appear in prominent syntactic positions (such as subject or object), and to be introduced in a main clause.
▪ Centering theory models the continuity of discourse
46
Entity-Grid discourse representation
Each text is represented by an entity grid
▪ A two-dimensional array that captures the distribution of entities across text sentences.
Optional Resource: Modeling Local Coherence: An Entity-Based Approach, Regina Barzilay and Mirella Lapata
48
If a noun phrase appears more than once in a sentence, we resort to grammatical-role-based ranking [S > O > X]
▪ Sentence 1: ‘Microsoft’ appears both as subject (S) and in the rest (X) category
▪ Mark the entry for Microsoft as S
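A minimal sketch of building an entity grid with this ranking rule follows. It assumes each sentence has already been reduced to (entity, role) pairs by some upstream parser and coreference step; the input format, the "-" marker for absent entities, and the second example sentence are choices made for illustration.

```python
# Higher rank wins when an entity has several roles in one sentence: S > O > X.
ROLE_RANK = {"S": 3, "O": 2, "X": 1}


def build_entity_grid(sentences):
    """Return {entity: [role per sentence]}, using "-" where the entity is absent.

    sentences: list of lists of (entity, role) pairs, role in {"S", "O", "X"}.
    """
    entities = []
    for sentence in sentences:
        for entity, _ in sentence:
            if entity not in entities:
                entities.append(entity)  # keep first-mention order
    grid = {entity: ["-"] * len(sentences) for entity in entities}
    for i, sentence in enumerate(sentences):
        for entity, role in sentence:
            current = grid[entity][i]
            # Keep the most prominent role seen so far in this sentence.
            if current == "-" or ROLE_RANK[role] > ROLE_RANK[current]:
                grid[entity][i] = role
    return grid


sentences = [
    [("Microsoft", "S"), ("Microsoft", "X"), ("market", "X")],
    [("market", "O")],
]
for entity, row in build_entity_grid(sentences).items():
    print(entity, row)
```

Each row of the grid records how one entity moves through the text, which is what the coherence features are later computed from.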
55
An increase in the number of discourse relations in a document lowers its log-likelihood.
The number of relations in a document is used as a feature.
57
200+ readability measures and still counting
Are they really looking at deeper aspects of language comprehension?
Are they tuned towards individual reading abilities?
Is the reader in the loop?
58
How do we comprehend sentences? How do we store and access words? How do we resolve ambiguities?