V. Ivanov, V. Solovyev, M. Solnyshkina Kazan Federal University

Presentation transcript:

EFFICIENCY OF TEXT READABILITY FEATURES IN RUSSIAN ACADEMIC TEXTS
V. Ivanov, V. Solovyev, M. Solnyshkina
Kazan Federal University

KAZAN (VOLGA REGION) FEDERAL UNIVERSITY
Outline
1. Introduction: Problem
2. Related Work and Application
3. Resources
4. Mathematical model
5. Results: Analysis of Features
6. Conclusion

Problem
How to measure the readability (complexity) of a text?
– How well do the existing readability formulas work for Russian academic texts?
– Which linguistic text features correlate better with the readability of Russian academic texts?

2. Related Work
Flesch–Kincaid Grade Level (1975):
FKG = 0.39 ASL + 11.8 ASW − 15.59
ASL = Average Sentence Length (words per sentence)
ASW = Average number of syllables per word
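As a minimal sketch, the formula above can be evaluated directly once ASL and ASW are known. The function name and the sample input are illustrative assumptions, not taken from the slides.

```python
def flesch_kincaid_grade(asl: float, asw: float) -> float:
    """Flesch-Kincaid Grade Level.

    asl: average sentence length in words
    asw: average number of syllables per word
    """
    return 0.39 * asl + 11.8 * asw - 15.59


# Hypothetical text statistics: 15 words per sentence, 1.5 syllables per word.
print(round(flesch_kincaid_grade(15.0, 1.5), 2))  # 7.96
```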

2. Related Work
https://readable.io/blog/the-flesch-reading-ease-and-flesch-kincaid-grade-level/

2. Application
https://www.cmu.edu/news/stories/archives/2016/march/speechifying.html

2. Related Work
FKG (English) = 0.39 ASL + 11.8 ASW − 15.59
FKG (Russian) = 0.5 ASL + 8.4 ASW − 15.59
Oborneva, 2006: based on 100 parallel English-Russian literary texts

2. Related Work
The Dale-Chall formula defines text complexity as a linear function of:
- PDW = Percentage of rare words
- ASL = Average Sentence Length in words
Raw Score = 0.1579 PDW + 0.0496 ASL
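The raw score can be sketched as follows, assuming the text is already split into tokenized sentences and that a list of familiar words is available. The tiny word list and the sample sentences are hypothetical placeholders, not the real Dale-Chall list.

```python
def dale_chall_raw_score(sentences, familiar_words):
    """Raw Dale-Chall score for a list of tokenized sentences.

    PDW is the percentage of words not found in `familiar_words`;
    ASL is the average sentence length in words.
    """
    words = [w.lower() for sentence in sentences for w in sentence]
    rare = sum(1 for w in words if w not in familiar_words)
    pdw = 100.0 * rare / len(words)
    asl = len(words) / len(sentences)
    return 0.1579 * pdw + 0.0496 * asl


# Hypothetical familiar-word list and input text.
familiar = {"the", "cat", "sat", "quietly"}
text = [["The", "cat", "sat"], ["The", "feline", "meditated", "quietly"]]
print(dale_chall_raw_score(text, familiar))
```

Here 2 of 7 words are outside the list, so PDW is 200/7 percent and ASL is 3.5.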

2. Related Work
Features used in readability studies of Russian academic texts: number of syllables, number of words, number of sentences, number of abstract words, number of homonyms, number of polysemantic words, number of technical terms, percentage of short adjectives, percentage of finite verb forms, percentage of complex sentences, etc. [Ivanov 2015, Shpakovskiy et al 2007].

3. Resources
Russian academic texts on Social Studies (grades 5–11): two sets of school textbooks, one by L. N. Bogolubov and one by A. F. Nikitin. These textbooks were chosen because (a) the texts are relatively free of non-alphabetical symbols, graphs, figures, etc., and (b) they are available on the Internet.
http://kpfu.ru/slozhnost-tekstov-304364.html

3. Corpus Pre-processing
Steps: tokenization; splitting the text into sentences; exclusion of all extremely long sentences (longer than 120 words) and short sentences (shorter than 5 words).

BOG = Bogolubov textbooks, NIK = Nikitin textbooks.

Grade     Tokens               Sentences          ASL              ASW
level     BOG       NIK        BOG      NIK       BOG     NIK      BOG    NIK
5-th      –         17 221     –        1 499     –       11.49    –      2.35
6-th      16 467    16 475     1 273    1 197     12.94   13.76    2.56   2.71
7-th      23 069    22 924     1 671    1 675     13.81   13.69    2.84   2.70
8-th      49 796    40 053     3 181    2 889     15.65   13.86    2.96   2.88
9-th      42 305    43 404     2 584    2 792     16.37   15.55    3.04   3.00
10-th     75 182    39 183     4 468    2 468     16.83   15.88    3.07   3.12
10-th*    98 034    –          5 798    –         16.91   –        3.05   –
11-th     –         38 869     –        2 270     –       17.12    –      3.11
11-th*    100 800   –          6 004    –         16.79   –        3.19   –
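The pre-processing steps can be sketched as below. The regular expressions are a deliberately naive stand-in for whatever tokenizer and sentence splitter the authors used (the slides do not name them); only the 5-word and 120-word limits come from the slide.

```python
import re

MIN_WORDS = 5    # sentences shorter than this are excluded
MAX_WORDS = 120  # sentences longer than this are excluded


def split_sentences(text):
    """Naive splitter: break after '.', '!', '?' or '…' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?…])\s+", text) if s.strip()]


def tokenize(sentence):
    """Keep alphabetic tokens (hyphenated words stay whole); drop digits and punctuation."""
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", sentence)


def preprocess(text):
    """Return tokenized sentences whose length is within the allowed range."""
    kept = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        if MIN_WORDS <= len(tokens) <= MAX_WORDS:
            kept.append(tokens)
    return kept


def average_sentence_length(sentences):
    """ASL: total number of tokens divided by the number of sentences."""
    return sum(len(s) for s in sentences) / len(sentences)


sample = "Это короткое. Общество состоит из людей и их отношений. Да!"
kept = preprocess(sample)
print(len(kept), average_sentence_length(kept))
```

Only the middle sentence of the sample (7 words) survives the filter.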

4. Mathematical models
FKG (Oborneva) = 0.5 ASL + 8.4 ASW − 15.59
FKG (our) = 0.36 ASL + 5.76 ASW − 11.97

          Oborneva's formula    Our formula
Grade     BOG       NIK         BOG       NIK
5-th      –         9.15        –         5.16
6-th      11.35     13.05       6.69      7.87
7-th      14.03     13.01       8.54      7.85
8-th      15.81     14.17       9.78      8.62
9-th      16.38     16.29       10.18     10.12
10-th     17.18     17.45       10.74     10.92
10-th*    17.10     –           10.69     –
11-th     –         17.84       –         11.21
11-th*    18.36     –           11.55     –
MSE       6.92      6.51        1.07      0.98

Oborneva's model systematically predicts higher text complexity.
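The two formulas differ only in their coefficients, so a single function can evaluate both, and the error of each model can be measured against the nominal grade. The sketch below uses the NIK predictions exactly as listed in the table; the function names are illustrative. One observation, offered tentatively: for the NIK column the listed error values (6.51 and 0.98) match the square root of the mean squared error, so the row labeled MSE may in fact report RMSE.

```python
from math import sqrt

OBORNEVA = (0.5, 8.4, -15.59)   # coefficients for ASL, ASW, intercept
OURS = (0.36, 5.76, -11.97)


def fkg_russian(asl, asw, coefficients):
    """Linear readability score a*ASL + b*ASW + c."""
    a, b, c = coefficients
    return a * asl + b * asw + c


def mse(predicted, actual):
    """Mean squared error between predictions and target grades."""
    return sum((p - y) ** 2 for p, y in zip(predicted, actual)) / len(actual)


# NIK column of the table: grades 5-11 and the listed predictions.
grades = [5, 6, 7, 8, 9, 10, 11]
nik_oborneva = [9.15, 13.05, 13.01, 14.17, 16.29, 17.45, 17.84]
nik_ours = [5.16, 7.87, 7.85, 8.62, 10.12, 10.92, 11.21]

for name, predictions in (("Oborneva", nik_oborneva), ("ours", nik_ours)):
    error = mse(predictions, grades)
    print(name, round(error, 2), round(sqrt(error), 2))
```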

4. Our model for academic texts
Textbooks on world history

Textbook     FKG (our)
Guryan11     10.43
Klimov 10    10.69
Petrov 11    9.90
Plenko       10.83
Ponom        10.75
Sobol        10.12
Unk          10.49

5. Results: Analysis of Features
An extended feature set was explored for the texts:
PART1: Features based on length and frequency
PART2: Features based on POS tags
PART3: Features based on syntactic dependencies

5. PART1: Features based on length and frequency
ASL is the average number of words per sentence
ASW is the average number of syllables per word
FREQ is the cumulative frequency of content words
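A minimal sketch of ASL and ASW for Russian text, assuming sentences are already tokenized. Syllables are counted as vowel letters, which works for Russian because each syllable contains exactly one vowel; the slides do not say how the authors counted syllables, so this is an assumption. FREQ is omitted because it needs a frequency dictionary that the slides do not identify.

```python
RUSSIAN_VOWELS = set("аеёиоуыэюя")


def count_syllables(word):
    """Number of vowel letters in a Russian word (one vowel per syllable)."""
    return sum(1 for ch in word.lower() if ch in RUSSIAN_VOWELS)


def asl_asw(sentences):
    """Return (ASL, ASW) for a list of tokenized sentences."""
    words = [w for sentence in sentences for w in sentence]
    asl = len(words) / len(sentences)
    asw = sum(count_syllables(w) for w in words) / len(words)
    return asl, asw


sample_sentences = [
    ["Общество", "состоит", "из", "людей"],
    ["Человек", "живёт", "в", "обществе"],
]
print(asl_asw(sample_sentences))  # (4.0, 2.125)
```

Note that a one-letter preposition such as "в" contributes zero syllables under this rule.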

5. PART2: Features based on POS tags
NOUNS is the number of nouns per sentence
VERBS is the number of verbs per sentence
ADJ is the number of adjectives per sentence
PRONOUNS is the number of pronouns per sentence
PERSONAL PRONOUNS is the number of personal pronouns per sentence
NEG is the number of negations per sentence
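Per-sentence POS rates can be sketched as below, assuming the text has already been tagged and each sentence is a list of (token, tag) pairs. The tag names follow the Universal Dependencies convention, which is an assumption; the slides do not state which tagger or tagset was used.

```python
from collections import Counter


def pos_rates(tagged_sentences, tags=("NOUN", "VERB", "ADJ", "PRON")):
    """Average number of tokens with each tag per sentence."""
    counts = Counter(tag for sentence in tagged_sentences for _, tag in sentence)
    n_sentences = len(tagged_sentences)
    return {tag: counts[tag] / n_sentences for tag in tags}


# Hypothetical pre-tagged input.
tagged = [
    [("Общество", "NOUN"), ("состоит", "VERB"), ("из", "ADP"), ("людей", "NOUN")],
    [("Он", "PRON"), ("изучает", "VERB"), ("сложные", "ADJ"), ("тексты", "NOUN")],
]
print(pos_rates(tagged))
```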

5. PART3: Features based on syntactic dependencies
AVERAGE_PATH is the quotient of the number of nodes and the number of leaves in a sentence
AVERAGE_SOCHIN_LENGTH is the average length of coordinating constructions
DEEPRICH_RATE is the average number of verbal adverbs (adverbial participles)
DEEPRICH_V is the average span of a verbal adverb phrase
LEAVES_NUMBER is the average number of leaves in a sentence
LONGEST_PATH is the average length of the longest branch
NOUNS_DEP is the average number of modifiers in a nominal group; coordinating and explanatory links are ignored
PODCHIN_NUMBER is the share of sentences that contain at least one subordinating conjunction or relational link
PODCHIN_RATE is the average number of subordinate links
PRICH_RATE is the average number of participial constructions; a participial construction is defined as a participle that has at least one dependent
PRICH_V is the average span of a participial construction, based on the number of nodes that depend on the participle
SENTSOCH_NUMBER is the average number of compound sentences
SOCHIN_NUMBER is the average number of coordinating chains
PATH_NUMBER is the average number of sub-trees in a sentence
VERBS_DEP is the average number of direct dependents of a finite verb, calculated as the number of nodes directly dependent on finite verbs divided by the number of finite verbs; coordinating and explanatory links are ignored
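Three of the tree-shape features can be sketched for a single sentence, assuming the dependency tree is given as a list of head indices (heads[i] is the index of the parent of token i, and -1 marks the root). This encoding and the choice to measure the longest branch in edges are assumptions; the corpus-level features in the list are averages of such per-sentence values.

```python
def tree_features(heads):
    """Per-sentence LEAVES_NUMBER, LONGEST_PATH and AVERAGE_PATH.

    heads: list where heads[i] is the parent index of token i, -1 for the root.
    """
    n = len(heads)
    has_child = [False] * n
    for head in heads:
        if head >= 0:
            has_child[head] = True
    leaves = has_child.count(False)

    def depth(node):
        """Number of edges from `node` up to the root."""
        steps = 0
        while heads[node] >= 0:
            node = heads[node]
            steps += 1
        return steps

    return {
        "LEAVES_NUMBER": leaves,
        "LONGEST_PATH": max(depth(i) for i in range(n)),
        "AVERAGE_PATH": n / leaves,  # nodes divided by leaves
    }


# Hypothetical 4-token sentence: token 1 is the root, tokens 0 and 3 depend
# on it, and token 2 depends on token 3.
print(tree_features([1, -1, 3, 1]))
```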
  

5. Example of syntactic dependencies (ETAP-3)

5. Correlation between features and grade level

No.   Feature name          Correlation coefficient
1     ASL                   0.94
2     ASW                   –
3     SOCHIN_NUMBER         0.93
4     PRICH_RATE            0.91
5     NOUNS_DEP             0.88
6     AVERAGE_SOCHIN_LEN    0.87
7     PATH_NUMBER           –
8     LONGEST_PATH          0.84
9     FREQ                  –
10    LEAVES_NUMBER         –
11    AVERAGE_PATH          –
12    ADJ                   –
13    NOUNS                 0.82
14    VERBS                 0.74
15    NEGATIONS             0.7
16    PRONOUNS              –
17    PODCHIN_RATE          0.64
18    PODCHIN_NUMBER        0.62
19    DEEPRICH_V            0.52
20    PERS_PRONOUNS         0.47
21    DEEPRICH_RATE         0.44
22    VERBS_DEP             0.43
23    PRICH_V               0.33
24    SENTSOCH_NUMBER       0.03
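A feature's correlation with grade level can be sketched with a plain Pearson coefficient. The example uses the ASL values of the Nikitin (NIK) textbooks for grades 5 to 11 from the corpus statistics in section 3. The slides do not say which texts the reported coefficients were computed over, so the result need not reproduce the 0.94 in the table.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


nik_grades = [5, 6, 7, 8, 9, 10, 11]
nik_asl = [11.49, 13.76, 13.69, 13.86, 15.55, 15.88, 17.12]
print(pearson(nik_grades, nik_asl))
```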

5. Significance of features

No.   Feature               Absolute value of coefficient in Ridge Regression
1     ASL                   0.506
2     ASW                   0.125
3     SOCHIN_NUMBER         0.119
4     PRICH_RATE            0.106
5     LONGEST_PATH          0.089
6     PATH_NUMBER           0.079
7     LEAVES_NUMBER         0.075
8     AVERAGE_SOCHIN_LEN    0.071
9     NOUNS_DEP             –
10    FREQ                  0.034
11    NEGATIONS             0.01
12    AVERAGE_PATH          0.007
13    PERS_PRONOUNS         0.003
14    VERBS                 0.001
15    ADJ                   –
16    NOUNS                 0.0
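Ranking features by the absolute value of their ridge coefficients can be sketched in pure Python with the closed-form solution w = (XᵀX + αI)⁻¹Xᵀy on standardized features. The regularization strength, the standardization step, and the use of only two features (NIK ASL and ASW from the corpus statistics) are assumptions made for illustration; the slides do not give the authors' settings, so the numbers will not match the table.

```python
def standardize(column):
    """Scale a feature column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / std for v in column]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ridge_coefficients(columns, target, alpha=1.0):
    """Ridge weights for standardized feature columns and a centered target."""
    xs = [standardize(c) for c in columns]
    mean_y = sum(target) / len(target)
    y = [t - mean_y for t in target]
    k = len(xs)
    gram = [
        [sum(a * b for a, b in zip(xs[i], xs[j])) + (alpha if i == j else 0.0)
         for j in range(k)]
        for i in range(k)
    ]
    moments = [sum(a * b for a, b in zip(xs[i], y)) for i in range(k)]
    return solve(gram, moments)


ridge_grades = [5, 6, 7, 8, 9, 10, 11]
features = {
    "ASL": [11.49, 13.76, 13.69, 13.86, 15.55, 15.88, 17.12],
    "ASW": [2.35, 2.71, 2.70, 2.88, 3.00, 3.12, 3.11],
}
weights = ridge_coefficients(list(features.values()), ridge_grades, alpha=1.0)
ranking = sorted(zip(features, weights), key=lambda item: -abs(item[1]))
for name, weight in ranking:
    print(name, abs(weight))
```

Standardizing first makes the coefficient magnitudes comparable across features measured on different scales.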

6. Conclusion
For our research we created a dataset, which has been uploaded to the KFU website and is available for verification and validation of the research outcomes.
We offer a 24-feature analysis of Russian text readability covering "classical" features, part-of-speech features, and syntactic features.
Average sentence length is the most important feature for predicting text complexity.
Several syntactic features, such as the average number of coordinating chains and the rate of participial constructions, are also highly important and can improve prediction.

Thank you!
maki.solovyev@mail.ru