Learning to Predict Readability using Diverse Linguistic Features (Kate et al., Coling 2010)

Learning to Predict Readability using Diverse Linguistic Features Rohit J. Kate 1 Xiaoqiang Luo 2 Siddharth Patwardhan 2 Martin Franz 2 Radu Florian 2 Raymond J. Mooney 1 Salim Roukos 2 Chris Welty 2 1 Department of Computer Science, The University of Texas at Austin 2 IBM Watson Research Center Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 546–554, Beijing, August 2010 Presenter: 劉憶年, 2014/4/11

Outline Introduction Related Work Readability Data Readability Model Features for Predicting Readability Experiments Conclusions

Introduction (1/3) Readability involves many aspects, including grammaticality, conciseness, clarity, and lack of ambiguity. We explore the task of learning to automatically judge the readability of natural language documents.

Introduction (2/3) For example, the results of a web search can be ordered taking into account the readability of the retrieved documents, thus improving user satisfaction. Readability judgements can also be used for automatically grading essays, selecting instructional reading materials, etc. Even when the intended consumers of text are machines, for example, information extraction or knowledge extraction systems, a readability measure can be used to filter out documents of poor readability, so that the machine readers do not extract incorrect information because of ambiguity or lack of clarity in the documents.

Introduction (3/3) As part of the DARPA Machine Reading Program (MRP), an evaluation was designed and conducted for the task of rating documents for readability. In this evaluation, 540 documents were rated for readability by both expert and novice human subjects. Our results demonstrate that a rich combination of features from syntactic parsers, language models, and lexical statistics all contribute to accurately predicting expert human readability judgements.

Related Work (1/4) There is a significant amount of published work on a related problem: predicting the reading difficulty of documents, typically as the school grade level of the reader, from grade 1 to 12. Some early methods measure simple characteristics of documents, like average sentence length and average number of syllables per word, and combine them using a linear formula to predict the grade level of a document. Some later methods use pre-determined lists of words to determine the grade level of a document. More recently, language models have been used for predicting the grade level of documents.

Related Work (2/4) Pitler and Nenkova (2008) consider a different task of predicting text quality for an educated adult audience. Kanungo and Orr (2009) consider the task of predicting readability of web summary snippets produced by search engines.

Related Work (3/4) Our work differs from this previous research in several ways. Firstly, the task we consider is different: we predict the readability of general documents, not their grade level. Secondly, we note that all of the above approaches that use language models train a language model for each difficulty level using the training data for that level. Thirdly, we use a more sophisticated combination of linguistic features derived from various syntactic parsers and language models than any previous work.

Related Work (4/4) Fourthly, given that the documents in our data are not from a particular genre but from a mix of genres, we also train genre-specific language models and show that including these as features improves readability predictions. Finally, we also show a comparison of various machine learning algorithms for predicting readability; none of the previous work compared learning algorithms.

Readability Data The readability data was collected and released by the LDC. The documents were collected from the following diverse sources or genres: newswire/newspaper text, weblogs, newsgroup posts, manual transcripts, machine translation output, closed-caption transcripts and Wikipedia articles. A total of 540 documents were collected in this way, uniformly distributed across the seven genres. Each document was then judged for its readability by eight expert human judges. Each document was also judged for its readability by six to ten naive human judges.

Readability Model We want to answer the question of whether a machine can accurately estimate readability as judged by a human. The evaluation was therefore designed to compare how well machine and naive human judges predict expert human judgements. Hence, the task is to predict an integer score from 1 to 5 that measures the readability of a document. However, since the classes are numerical and not unrelated (for example, the score 2 lies between scores 1 and 3), we decided to model the task as a regression problem and then round the predicted score to obtain the closest integer value. We take the average of the expert judges' scores for each document as its gold-standard score.
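As a concrete illustration, here is a minimal sketch of this regress-then-round setup in Python, assuming scikit-learn and pre-extracted feature vectors; the choice of SVR is illustrative, not necessarily the paper's best-performing learner:

```python
import numpy as np
from sklearn.svm import SVR

def train_and_predict(X_train, expert_scores_train, X_test):
    """Fit a regressor on averaged expert scores, then round predictions.

    X_train / X_test: one row of linguistic features per document.
    expert_scores_train: one row per document, one column per expert judge.
    """
    y_train = np.asarray(expert_scores_train).mean(axis=1)  # gold = mean expert rating
    model = SVR()  # illustrative choice; several regressors are compared later
    model.fit(X_train, y_train)
    raw = model.predict(X_test)  # real-valued readability estimates
    return np.clip(np.rint(raw), 1, 5).astype(int)  # closest integer on the 1-5 scale
```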

Features Based on Syntax Many times, a document is found to be unreadable due to unusual linguistic constructs or ungrammatical language, which tend to manifest themselves in the syntactic properties of the text. Sundance features: The Sundance system is a rule-based system that performs a shallow syntactic analysis of text. We expect that this analysis over readable text would be “well-formed”, adhering to the grammatical rules of the English language; deviations from these rules can be indications of unreadable text. ESG features: ESG uses slot grammar rules to perform a deeper linguistic analysis of sentences than the Sundance system. ESG may consider several different interpretations of a sentence before choosing one over the others.
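Sundance and ESG are not publicly available, so the sketch below substitutes a generic shallow chunker from NLTK to compute one feature in the same spirit: the fraction of tokens left outside any phrase, as a crude proxy for ill-formed text. The toy chunk grammar is an assumption, far shallower than either system:

```python
import nltk  # requires the punkt and averaged_perceptron_tagger data packages

# Toy cascaded chunk grammar (an assumption, not Sundance's or ESG's rules).
GRAMMAR = r"""
  NP: {<DT|PRP\$>?<JJ.*>*<NN.*|PRP>+}
  PP: {<IN><NP>}
  VP: {<MD>?<VB.*>+<NP|PP>*}
"""
CHUNKER = nltk.RegexpParser(GRAMMAR)

def unchunked_token_ratio(text):
    """Fraction of tokens that fall outside every chunk: higher may
    indicate less well-formed, and hence less readable, text."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = CHUNKER.parse(tagged)
    loose = sum(1 for node in tree if not isinstance(node, nltk.Tree))
    return loose / max(len(tagged), 1)
```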

Features Based on Language Models (1/2) A probabilistic language model provides a prediction of how likely it is that a given sentence was generated by the same underlying process that generated a corpus of training documents. Normalized document probability: One obvious proxy for readability is the score assigned to a document by a generic language model (LM). Because document lengths vary, we normalize the document-level LM score by the number of words and compute the normalized document probability NP(D) for a document D as NP(D) = P(D)^(1/|D|), where P(D) is the probability the LM assigns to D and |D| is the number of words in D.
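A sketch of this computation, assuming a hypothetical LM object whose `logprob(word, context)` returns log10 P(word | context); any n-gram toolkit could supply such an interface:

```python
def normalized_doc_probability(words, lm):
    """NP(D) = P(D)^(1/|D|): the geometric mean of per-word probability.

    `lm.logprob(word, context)` is a hypothetical interface returning
    log10 P(word | context).
    """
    total = 0.0
    for i, word in enumerate(words):
        context = tuple(words[max(0, i - 2):i])  # trigram history, an assumption
        total += lm.logprob(word, context)
    return 10 ** (total / max(len(words), 1))
```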

Features Based on Language Models (2/2) Perplexities from genre-specific language models: In our experiments, since documents were acquired through several different channels, such as machine translation or web logs, we also build models that try to predict the genre of a document. Posterior perplexities from genre-specific language models: While perplexities computed from genre-specific LMs reflect the absolute probability that a document was generated by a specific model, a model's relative probability compared to the other models may be a more useful feature.
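Both genre-based features can be sketched together, reusing the hypothetical `logprob` interface above; reading "relative probability" as a Bayes-rule posterior over genres with a uniform prior is our assumption about the slide's intent:

```python
def genre_lm_features(words, genre_lms):
    """Per-genre perplexity plus posterior P(genre | D) under a uniform prior.

    genre_lms: dict mapping each of the seven genre names to a genre LM.
    """
    n = max(len(words), 1)
    loglikes = {g: sum(lm.logprob(w, ()) for w in words)  # unigram context for brevity
                for g, lm in genre_lms.items()}
    perplexities = {g: 10 ** (-ll / n) for g, ll in loglikes.items()}
    m = max(loglikes.values())  # log-sum-exp shift for numerical stability
    norm = sum(10 ** (ll - m) for ll in loglikes.values())
    posteriors = {g: (10 ** (ll - m)) / norm for g, ll in loglikes.items()}
    return perplexities, posteriors
```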

Lexical Features (1/2) Out-of-vocabulary (OOV) rates: We conjecture that documents containing typographical errors (e.g., closed-caption and web log documents) may receive low readability ratings. Since modern LMs often have a very large vocabulary, to get meaningful OOV rates we truncate the vocabularies to the top (i.e., most frequent) 3000 words. Ratio of function words: A characteristic of documents generated by foreign speakers and machine translation is a failure to produce certain function words, such as “the” or “of.”

Lexical Features (2/2) Ratio of pronouns: We conjecture that the pronoun ratio may be a good indicator of whether a document was translated by machine or produced by humans. For each document, we first run a POS tagger and then compute the ratio of pronouns over the number of words in the document: pronoun ratio = (number of pronouns) / (number of words). Fraction of known words: This feature measures the fraction of words in a document that occur either in an English dictionary or in a gazetteer of names of people and locations.
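The four lexical features take only a few lines of Python to compute; the 3000-word vocabulary, function-word list, and dictionary/gazetteer passed in below are stand-ins for the paper's actual resources, and NLTK's PRP/PRP$ tags stand in for its POS tagger:

```python
import nltk  # requires the averaged_perceptron_tagger data package

FUNCTION_WORDS = {"the", "of", "a", "an", "and", "to", "in"}  # truncated stand-in list

def lexical_features(words, top3000_vocab, known_words):
    """OOV rate, function-word ratio, pronoun ratio, and known-word fraction."""
    n = max(len(words), 1)
    lowered = [w.lower() for w in words]
    oov_rate = sum(w not in top3000_vocab for w in lowered) / n
    function_ratio = sum(w in FUNCTION_WORDS for w in lowered) / n
    tags = [tag for _, tag in nltk.pos_tag(words)]
    pronoun_ratio = sum(t in ("PRP", "PRP$") for t in tags) / n
    known_fraction = sum(w in known_words for w in lowered) / n
    return oov_rate, function_ratio, pronoun_ratio, known_fraction
```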

Evaluation Metric In order to compare a machine's predicted readability scores to those assigned by the expert judges, the Pearson correlation coefficient was computed. The mean of the expert-judge scores was taken as the gold-standard score for a document. The upper critical value was set at 97.5% confidence, meaning that if the machine performs better than the upper critical value, we reject the null hypothesis that machine scores and naive scores come from the same distribution and conclude that the machine performs significantly better than naive judges in matching the expert judges.
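A sketch of the metric itself, assuming SciPy; as on the slide, the gold score for each document is the mean of its expert ratings:

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate(machine_scores, expert_score_matrix):
    """Pearson correlation between machine scores and mean expert scores."""
    gold = np.asarray(expert_score_matrix).mean(axis=1)  # one gold score per document
    r, _p_value = pearsonr(machine_scores, gold)
    return r
```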

Regression Algorithms

Ablations with Feature Sets (1/3)

Ablations with Feature Sets (2/3)

Ablations with Feature Sets (3/3)

Official Evaluation Results

Conclusions The system accurately predicts readability as judged by linguistically-trained expert human judges and exceeds the accuracy of naive human judges. Language-model-based features were found to be most useful for this task, but syntactic and lexical features were also helpful. We also found that, for a corpus consisting of documents from a diverse mix of genres, using features that are indicative of the genre significantly improves the accuracy of readability predictions.