
What is Readability?
• A characteristic of text documents
• “the sum total of all those elements within a given piece of printed material that affect the success a group of readers have with it. The success is the extent to which they understand it, read it at an optimal speed, and find it interesting.” (Dale & Chall, 1949)
• “ease of understanding or comprehension due to the style of writing” (Klare, 1963)

• Readability encompasses a number of areas…
• Syntactic complexity of the text
  ▪ Grammatical arrangement of words within a sentence (e.g., active vs. passive sentences have been shown to affect readability)
  ▪ Simple, compound, and complex sentences
• Organization of text
  ▪ Discourse structure
  ▪ Textual cohesion
• Semantic complexity of the text

• Improving literacy rates
• Improving instruction delivery
• Judging technical manuals
• Matching text to an appropriate grade level
• And many more…

• Assign a score to a text based on textual cues (e.g., average sentence length)
• Readability formula
• Over 200 formulas by the 1980s (DuBay, 2004)
• Textual cues
  ▪ Sentence length, percentage of familiar words, word length, syllables per word, etc.
• Testing validity: correlate the predicted score with a reading comprehension score

• Dale-Chall formula
  ▪ Maintains a list of “easy words”
  ▪ Score = 0.1579 × PDW + 0.0496 × ASL + 3.6365
  ▪ PDW = percentage of difficult words (words not on the easy-word list); ASL = average sentence length in words
• FOG index
• Lexile scale
• Commonalities among formulae
  ▪ Linear regression over some predictor variables
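The formula above can be sketched in a few lines of Python. This is only an illustration: the `EASY_WORDS` set is a tiny hypothetical stand-in for the real easy-word list, and the sentence and word splitting are deliberately naive assumptions.

```python
import re

# Hypothetical stand-in for the Dale-Chall "easy words" list (illustration only).
EASY_WORDS = {"the", "cat", "sat", "on", "a", "mat", "it", "was", "and", "happy"}

def dale_chall_score(text, easy_words=EASY_WORDS):
    """Apply Score = 0.1579 * PDW + 0.0496 * ASL + 3.6365 as written on the slide."""
    # Naive splitting: sentences end at . ! or ?; words are runs of letters/apostrophes.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    difficult = [w for w in words if w not in easy_words]
    pdw = 100.0 * len(difficult) / len(words)   # percentage of difficult words
    asl = len(words) / len(sentences)           # average sentence length in words
    return 0.1579 * pdw + 0.0496 * asl + 3.6365

easy_score = dale_chall_score("The cat sat on a mat. It was happy.")
hard_score = dale_chall_score("The cat sat on a mat. It was ecstatic.")
```

Replacing one easy word with a word that is not on the list raises PDW and therefore the score, which is the behavior the formula is meant to capture.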

• Traditional readability measures are robust on large text samples (textbooks and essays) but less reliable on short, concise web documents.
• Web documents are generally noisy.
Resource: Predicting Reading Difficulty With Statistical Language Models, Kevyn Collins-Thompson and Jamie Callan

• A language model (LM) can encode more complex relationships than the simple linear regression models used in traditional readability measures
• It yields a probability distribution over all grade levels
• The relative difficulty of words can be estimated statistically, rather than hardcoded as in traditional measures
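A minimal sketch of the idea, assuming one unigram model per grade level and a uniform prior over grades. The training texts are hypothetical toy data, and add-one smoothing is swapped in here purely for brevity; it is not the smoothing scheme discussed in the cited paper.

```python
import math
from collections import Counter

# Hypothetical toy training data: grade level -> sample text.
TRAINING = {
    1: "the red ball is big the dog is red",
    6: "we determine the answer and evaluate the result",
}

def train_unigram_models(training):
    """Return per-grade word counts plus the shared vocabulary."""
    counts = {grade: Counter(text.split()) for grade, text in training.items()}
    vocab = set().union(*counts.values())
    return counts, vocab

def log_likelihood(words, grade_counts, vocab):
    """Log-likelihood of the words under one grade model (add-one smoothing, assumed)."""
    total = sum(grade_counts.values())
    denom = total + len(vocab) + 1  # +1 reserves mass for out-of-vocabulary words
    return sum(math.log((grade_counts[w] + 1) / denom) for w in words)

def grade_distribution(text, counts, vocab):
    """Normalize per-grade likelihoods into a distribution over grades (uniform prior)."""
    words = text.split()
    scores = {g: log_likelihood(words, c, vocab) for g, c in counts.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {g: math.exp(s - top) for g, s in scores.items()}
    z = sum(weights.values())
    return {g: w / z for g, w in weights.items()}

counts, vocab = train_unigram_models(TRAINING)
dist_easy = grade_distribution("the red dog", counts, vocab)
dist_hard = grade_distribution("determine the result", counts, vocab)
predicted_easy = max(dist_easy, key=dist_easy.get)
predicted_hard = max(dist_hard, key=dist_hard.get)
```

The output is a full distribution over grades rather than a single number, so a downstream step can pick the most likely grade or inspect how confident the model is.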

• Earlier-grade readers tend to use more concrete words (e.g., red); later-grade readers use more abstract words (e.g., determine)
• The same pattern is observed in web documents


• Smooth each grade-based language model using Good-Turing smoothing
• This gives an estimate of the total probability mass of all unseen words
• We still need to find each unseen word’s share of this total probability mass
• Uniform probability distribution?
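As a rough illustration of the first two steps, the Good-Turing estimate of the total unseen mass is N1 / N, where N1 is the number of word types seen exactly once and N is the total token count. The sketch below computes that estimate and the naive uniform split the slide questions; the token list and the number of unseen types are hypothetical.

```python
from collections import Counter

def good_turing_unseen_mass(tokens):
    """Good-Turing estimate of the total probability of all unseen words: N1 / N."""
    counts = Counter(tokens)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one token")
    n1 = sum(1 for c in counts.values() if c == 1)  # types seen exactly once
    return n1 / n

def uniform_unseen_share(tokens, num_unseen_types):
    """Baseline: split the unseen mass evenly over an assumed number of unseen types."""
    return good_turing_unseen_mass(tokens) / num_unseen_types

# Hypothetical grade-level sample.
tokens = "the red ball is big the dog is red".split()
mass = good_turing_unseen_mass(tokens)      # 3 singleton types out of 9 tokens
share = uniform_unseen_share(tokens, 10)    # assumes 10 unseen word types
```

The uniform split treats every unseen word as equally likely in this grade, which ignores what neighboring grades know about the word.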

• Usage of discriminative words is clustered around particular grade levels.
• Borrow probability mass from neighboring grade classes

[Figure: documents with assigned readability scores are used for training; the trained model then predicts a readability score for a new document.]
Resource: Revisiting Readability: A Unified Framework for Predicting Text Quality, Emily Pitler and Ani Nenkova

• Different predictor variables can indicate the readability score
• What is the contribution of each individual predictor variable to the readability score?
• Testing methodology: collect a readability corpus, extract the predictor variables, measure their correlation with the readability score
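The last step of this methodology can be sketched with a plain Pearson correlation. The feature values and ratings below are hypothetical numbers made up for illustration, not data from the cited study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical corpus: one predictor value and one readability rating per article.
avg_sentence_length = [12.0, 18.0, 25.0, 31.0]
readability_rating = [4.5, 4.0, 3.0, 2.5]
r = pearson_r(avg_sentence_length, readability_rating)
```

A strongly negative `r` here would mean that, in this toy data, longer sentences go with lower readability ratings; the sign and size of `r` are what the methodology reads off for each predictor.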


• Log likelihood, WSJ
  ▪ Article likelihood estimated from a language model trained on WSJ
• Log likelihood, NEWS
  ▪ Article likelihood according to a unigram language model trained on NEWS
• LL with length, WSJ
  ▪ Linear regression over the WSJ unigram log likelihood and article length
• LL with length, NEWS
  ▪ Linear regression over the NEWS unigram log likelihood and article length

• Average parse tree height
• Average number of noun phrases per sentence
• Average number of verb phrases per sentence
• Average number of subordinate clauses per sentence
  ▪ Measured by counting SBAR nodes in the parse tree
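These features can be read off a bracketed parse tree. The sketch below assumes parses are already available as Penn-Treebank-style strings (produced by some external parser) and uses a small hand-written reader rather than any particular library; the example parse is hypothetical, and height is counted with leaf words at 0.

```python
def parse_tree(s):
    """Parse a Penn-Treebank-style bracketed string into (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return read()

def height(node):
    """Tree height, counting leaf words as 0 (an assumed convention)."""
    if isinstance(node, str):
        return 0
    return 1 + max((height(child) for child in node[1]), default=0)

def count_label(node, label):
    """Number of nodes in the tree carrying the given label (e.g. NP, VP, SBAR)."""
    if isinstance(node, str):
        return 0
    return int(node[0] == label) + sum(count_label(c, label) for c in node[1])

# Hypothetical parse of "It was late when we left".
tree = parse_tree(
    "(S (NP (PRP It)) (VP (VBD was) (ADJP (JJ late)) "
    "(SBAR (IN when) (S (NP (PRP we)) (VP (VBD left))))))"
)
tree_height = height(tree)
num_sbar = count_label(tree, "SBAR")
num_np = count_label(tree, "NP")
num_vp = count_label(tree, "VP")
```

Averaging these per-sentence values over all sentences of a document gives the document-level features listed above.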

• The curious case of average verb phrases
• More verb phrases per sentence might be expected to increase text complexity
  ▪ The average number of verb phrases would then correlate negatively with readability
• Consider the following examples
  ▪ (1) It was late at night, but it was clear. The stars were out and the moon was bright.
  ▪ (2) It was late at night. It was clear. The stars were out. The moon was bright.

• Aspects of well-written discourse
• Cohesive devices such as pronouns, definite descriptions, and topic continuity
• Number of pronouns per sentence
• Number of definite articles per sentence
• Average cosine similarity
• Word overlap
• Word overlap over nouns and pronouns
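Two of these cohesion features can be sketched as follows. The code assumes both are computed between adjacent sentences over bag-of-words vectors, and it assumes a Jaccard-style definition of word overlap; the slides do not spell out either choice, and the example sentences are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def jaccard(a, b):
    """Word overlap as |shared types| / |all types| (an assumed definition)."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0

def average_adjacent(sentences, similarity):
    """Average a similarity function over all pairs of adjacent sentences."""
    bags = [Counter(s.lower().split()) for s in sentences]
    pairs = list(zip(bags, bags[1:]))
    if not pairs:
        return 0.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

# Hypothetical three-sentence text.
sentences = ["the stars were out", "the moon was bright", "the moon was full"]
avg_cosine = average_adjacent(sentences, cosine)
avg_overlap = average_adjacent(sentences, jaccard)
```

Restricting the bags to nouns and pronouns would give the last feature in the list, but that needs a part-of-speech tagger, which is left out of this sketch.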

• Entity-based approach to local coherence
• Discourse coherence is achieved through the way discourse entities are introduced and discussed
• Some entities are more salient than others
  ▪ Salient entities are more likely to appear in prominent syntactic positions (such as subject or object) and to be introduced in a main clause.
  ▪ Centering theory models the continuity of discourse

• Entity-grid discourse representation
• Each text is represented by an entity grid
  ▪ A two-dimensional array that captures the distribution of entities across the sentences of the text.
Optional Resource: Modeling Local Coherence: An Entity-Based Approach, Regina Barzilay and Mirella Lapata

If a noun phrase appears more than once in a sentence, its grid entry is chosen by grammatical-role ranking [S > O > X]
-- Sentence 1: ‘Microsoft’ appears both as subject (S) and in the “rest” (X) category
-- Mark the entry for Microsoft as S
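The grid construction and the tie-breaking rule above can be sketched as follows. The input format (a list of `(entity, role)` pairs per sentence) and the example document are assumptions made for illustration; in practice the roles would come from a parser and a coreference step.

```python
# Ranking of grammatical roles: subject > object > other > absent.
RANK = {"S": 3, "O": 2, "X": 1, "-": 0}

def build_entity_grid(annotated_sentences):
    """Build an entity grid: one row per sentence, one column per entity.

    Each sentence is a list of (entity, role) mentions. When an entity is
    mentioned more than once in a sentence, the highest-ranked role wins.
    """
    entities = []
    for mentions in annotated_sentences:
        for entity, _role in mentions:
            if entity not in entities:
                entities.append(entity)  # keep first-mention order
    grid = []
    for mentions in annotated_sentences:
        row = {entity: "-" for entity in entities}  # "-" marks absence
        for entity, role in mentions:
            if RANK[role] > RANK[row[entity]]:
                row[entity] = role
        grid.append(row)
    return entities, grid

# Hypothetical hand-annotated document with three sentences.
doc = [
    [("Microsoft", "S"), ("Microsoft", "X"), ("trial", "O")],
    [("Microsoft", "O"), ("evidence", "S")],
    [("evidence", "X")],
]
entities, grid = build_entity_grid(doc)
microsoft_column = [row["Microsoft"] for row in grid]
```

Reading a column top to bottom shows how one entity moves through the text, which is the signal the entity-based coherence model works from.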

• An increase in the number of discourse relations in a document will lower the log-likelihood
• Number of relations in a document as a feature

• 200+ readability measures and still counting
• Are they really looking at deeper aspects of language comprehension?
• Are they tuned to individual reading abilities?
• Is the reader in the loop?

• How do we comprehend sentences?
• How do we store and access words?
• How do we resolve ambiguities?
