Published by Kristopher Mosley. Modified over 9 years ago.
1
Copyright © 2013 by Educational Testing Service. All rights reserved. 14-June-2013
Detecting Missing Hyphens in Learner Text
Aoife Cahill*, Martin Chodorow**, Susanne Wolff* and Nitin Madnani*
* Educational Testing Service, 660 Rosedale Road, Princeton, NJ 08541, USA {acahill, swolff, nmadnani}@ets.org
** Hunter College and the Graduate Center, City University of New York, NY 10065, USA martin.chodorow@hunter.cuny.edu
2
Outline
– Motivation
– Baselines
– New Model
– Experiments and Results
– Conclusion
3
Motivation
– Hyphen errors are infrequent
– But they are an important consideration for students aiming to improve the overall quality of their writing
Example: "Dogs are lucky… most of them have built in fur coats! Brrrr!"
From: http://daughternumberthree.blogspot.com
4
Motivation
Missing hyphen errors are not all lexical
1. Schools may have more after school sports. (hyphen missing: "after-school sports")
2. I went to the dentist after school today. (no hyphen needed)
Language learner text introduces additional complications
3. My father like play basketball with me.
5
Baselines
Baseline 1: Collins Dictionary [5,246]
– predicts a missing hyphen between bigrams that appear hyphenated in the dictionary
Baseline 2: Wiki (counts) [1,095]
– predicts a missing hyphen between bigrams that occur hyphenated more than 1,000 times in Wikipedia
Baseline 3: Wiki (probs) [673,269]
– predicts a missing hyphen between bigrams where the probability of the hyphenated form, as estimated from Wikipedia, is greater than 0.66
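The three baselines can be sketched as simple lookups over adjacent token pairs. This is a minimal illustration, not the authors' code: the function names, the toy dictionary and counts, and the way the probability is estimated (hyphenated count divided by hyphenated plus unhyphenated count) are all assumptions; only the thresholds of 1,000 and 0.66 come from the slide.

```python
from typing import Dict, List, Set, Tuple

def hyphenated_form(left: str, right: str) -> str:
    """Build the lowercased hyphenated form of a bigram, e.g. 'after-school'."""
    return f"{left}-{right}".lower()

def dictionary_baseline(tokens: List[str], entries: Set[str]) -> List[int]:
    """Flag position i when tokens[i]-tokens[i+1] is listed hyphenated in the dictionary."""
    return [i for i in range(len(tokens) - 1)
            if hyphenated_form(tokens[i], tokens[i + 1]) in entries]

def count_baseline(tokens: List[str], hyph_counts: Dict[str, int],
                   min_count: int = 1000) -> List[int]:
    """Flag position i when the hyphenated form occurs more than min_count times."""
    return [i for i in range(len(tokens) - 1)
            if hyph_counts.get(hyphenated_form(tokens[i], tokens[i + 1]), 0) > min_count]

def prob_baseline(tokens: List[str], hyph_counts: Dict[str, int],
                  open_counts: Dict[Tuple[str, str], int],
                  threshold: float = 0.66) -> List[int]:
    """Flag position i when P(hyphenated) > threshold.

    Assumption: P is the relative frequency hyphenated / (hyphenated + open).
    """
    flagged = []
    for i in range(len(tokens) - 1):
        left, right = tokens[i].lower(), tokens[i + 1].lower()
        h = hyph_counts.get(hyphenated_form(left, right), 0)
        o = open_counts.get((left, right), 0)
        if h + o > 0 and h / (h + o) > threshold:
            flagged.append(i)
    return flagged

# Toy resources (hypothetical values, for illustration only).
entries = {"after-school"}
hyph_counts = {"after-school": 1500}
open_counts = {("after", "school"): 500}

s1 = "Schools may have more after school sports .".split()
s2 = "I went to the dentist after school today .".split()
print(dictionary_baseline(s1, entries), dictionary_baseline(s2, entries))
```

All three lookups flag "after school" in both example sentences from the Motivation slide, including the one where no hyphen is needed, which is why purely lexical baselines cannot separate the two cases.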
6
New Model
Logistic Regression Model
– assigns a probability to a hyphen occurring between w_i and w_{i+1}

Features
Tokens      w_{i-1}, w_i, w_{i+1}, w_{i+2}
Stems       s_{i-1}, s_i, s_{i+1}, s_{i+2}
Tags        t_{i-1}, t_i, t_{i+1}, t_{i+2}
Bigrams     w_i–w_{i+1}, s_i–s_{i+1}, t_i–t_{i+1}
Dict        Does the hyphenated form appear in the Collins dictionary?
Prob        What is the probability of the word bigram appearing hyphenated in Wikipedia?
Distance    Distance to following and preceding verb, noun
Verb/Noun   Is there a verb/noun preceding/following this bigram?
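A feature extractor for one candidate position i might look like the sketch below. Everything beyond the feature names on the slide is an assumption: the feature-string encoding, the `naive_stem` placeholder (not the stemmer the authors used), the use of Penn Treebank tags with "V" and "N" prefixes to identify verbs and nouns, and the hand-written `hyphen_probability` function, which only shows how a trained logistic regression model would turn weighted features into a probability.

```python
import math
from typing import Dict, List, Set

PAD = "<PAD>"  # placeholder for positions outside the sentence

def naive_stem(word: str) -> str:
    """Crude suffix stripper, a stand-in for a real stemmer (assumption)."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

def at(seq: List[str], k: int) -> str:
    """Return seq[k], or PAD when k falls outside the sequence."""
    return seq[k] if 0 <= k < len(seq) else PAD

def extract_features(tokens: List[str], tags: List[str], i: int,
                     hyphen_dict: Set[str],
                     wiki_prob: Dict[str, float]) -> Dict[str, float]:
    """Features for a possible hyphen between tokens[i] and tokens[i+1]."""
    stems = [naive_stem(t) for t in tokens]
    lowered = [t.lower() for t in tokens]
    feats: Dict[str, float] = {}
    # Tokens, stems and tags in the window i-1 .. i+2.
    for off in (-1, 0, 1, 2):
        feats[f"w[{off}]={at(lowered, i + off)}"] = 1.0
        feats[f"s[{off}]={at(stems, i + off)}"] = 1.0
        feats[f"t[{off}]={at(tags, i + off)}"] = 1.0
    # Bigrams over the candidate pair itself.
    feats[f"w_bigram={lowered[i]}_{lowered[i + 1]}"] = 1.0
    feats[f"s_bigram={stems[i]}_{stems[i + 1]}"] = 1.0
    feats[f"t_bigram={tags[i]}_{tags[i + 1]}"] = 1.0
    # Dictionary and Wikipedia probability features.
    form = f"{lowered[i]}-{lowered[i + 1]}"
    feats["in_dict"] = 1.0 if form in hyphen_dict else 0.0
    feats["wiki_prob"] = wiki_prob.get(form, 0.0)
    # Presence of, and distance to, the nearest preceding/following verb and noun.
    for label, prefix in (("verb", "V"), ("noun", "N")):
        prev = next((i - k for k in range(i - 1, -1, -1)
                     if tags[k].startswith(prefix)), None)
        nxt = next((k - (i + 1) for k in range(i + 2, len(tags))
                    if tags[k].startswith(prefix)), None)
        feats[f"has_prev_{label}"] = float(prev is not None)
        feats[f"has_next_{label}"] = float(nxt is not None)
        if prev is not None:
            feats[f"dist_prev_{label}"] = float(prev)
        if nxt is not None:
            feats[f"dist_next_{label}"] = float(nxt)
    return feats

def hyphen_probability(feats: Dict[str, float], weights: Dict[str, float],
                       bias: float = 0.0) -> float:
    """Logistic function applied to the weighted feature sum."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

tokens = "Schools may have more after school sports .".split()
tags = ["NNS", "MD", "VB", "JJR", "IN", "NN", "NNS", "."]
feats = extract_features(tokens, tags, 4, {"after-school"}, {"after-school": 0.75})
print(sorted(feats))
```

In practice the weights would be learned from the training data described on the next slide; the hand-supplied `weights` dictionary here is only a stand-in for a fitted model.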
7
Data
Training
– Well-edited text (San Jose Mercury News)
– Error-corrected data mined from Wikipedia Revisions
– Combination
Test
– Artificial errors: Brown corpus
– Learner text: CLC-FCE corpus, TOEFL/GRE essays
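The slides do not describe how corrections were mined from Wikipedia revisions. One plausible heuristic, shown here purely as an assumption, is to keep a before/after sentence pair only when the sole difference is that two adjacent tokens were joined with a hyphen. The function name `mine_hyphen_corrections` is hypothetical.

```python
from typing import List, Tuple

def mine_hyphen_corrections(before: str, after: str) -> List[Tuple[str, str]]:
    """Return the (left, right) bigrams that were hyphenated in the revision.

    The pair is discarded (empty list) if the revision contains any other edit.
    """
    old, new = before.split(), after.split()
    found: List[Tuple[str, str]] = []
    i = j = 0
    while i < len(old) and j < len(new):
        if old[i] == new[j]:
            # Token unchanged: advance both sides.
            i += 1
            j += 1
        elif i + 1 < len(old) and new[j] == f"{old[i]}-{old[i + 1]}":
            # Two old tokens became one hyphenated token.
            found.append((old[i], old[i + 1]))
            i += 2
            j += 1
        else:
            # Any other kind of change: not a pure hyphen correction.
            return []
    if i != len(old) or j != len(new):
        return []
    return found

print(mine_hyphen_corrections(
    "Schools may have more after school sports .",
    "Schools may have more after-school sports .",
))
```

Each mined pair yields a positive training example (a hyphen was added by an editor), while the untouched bigrams in the same sentence can serve as negative examples.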
8
Evaluation on Artificial Errors
Brown Corpus: 24,243 sentences, automatically remove hyphens from 2,072 words
Each system makes a prediction for all bigrams about whether a hyphen should appear between the pair of words
– precision: how many of the missing hyphen errors predicted by the system were true errors
– recall: how many of the artificially removed hyphens the system detected as errors
– f-score: the harmonic mean of precision and recall
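The evaluation setup above can be sketched in two steps: split hyphenated words to create artificial errors, then score a system's predicted positions against the positions where hyphens were removed. The function names and the rule for which tokens get split (exactly two alphabetic parts) are assumptions; the slide does not say how the 2,072 words were selected.

```python
from typing import List, Set, Tuple

def remove_hyphens(tokens: List[str]) -> Tuple[List[str], Set[int]]:
    """Split simple hyphenated tokens into two tokens.

    Returns the new token list and the gold set of indices i such that a
    hyphen was removed between out[i] and out[i + 1].
    """
    out: List[str] = []
    gold: Set[int] = set()
    for tok in tokens:
        parts = tok.split("-")
        if len(parts) == 2 and all(p.isalpha() for p in parts):
            gold.add(len(out))  # index of the left half in the output
            out.extend(parts)
        else:
            out.append(tok)
    return out, gold

def precision_recall_f(predicted: Set[int], gold: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall and their harmonic mean over predicted hyphen positions."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

tokens, gold = remove_hyphens("The well-known author wrote a long-term plan".split())
predicted = {1, 4}  # a hypothetical system output: one hit, one false alarm
print(tokens, sorted(gold), precision_recall_f(predicted, gold))
```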
9
Artificial Error Results

Baseline                 True Positives   Precision   Recall   F-Score
Collins Dictionary                  397        40.5     19.2      26.0
Wiki (counts)                       359        39.1     17.3      24.0
Wiki (probs)                        811        85.5     39.1      53.7

Classifier               True Positives   Precision   Recall   F-Score
SJM-Trained                        1097        82.0     52.9      64.3
Wiki-Revision-trained              1061        72.8     51.2      60.1
Combination                        1106        80.9     53.4      64.3
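As a worked check of these results, recall can be recomputed as true positives divided by the 2,072 removed hyphens, and the F-score as the harmonic mean of the reported precision and recall. The snippet below only re-derives figures already on the slide; the helper names are illustrative.

```python
REMOVED_HYPHENS = 2072  # artificially removed hyphens in the Brown corpus test set

# (true positives, precision %, recall %, F-score %) as reported on the slide
rows = {
    "Collins Dictionary": (397, 40.5, 19.2, 26.0),
    "Wiki (counts)": (359, 39.1, 17.3, 24.0),
    "Wiki (probs)": (811, 85.5, 39.1, 53.7),
    "SJM-Trained": (1097, 82.0, 52.9, 64.3),
    "Wiki-Revision-trained": (1061, 72.8, 51.2, 60.1),
    "Combination": (1106, 80.9, 53.4, 64.3),
}

def recall_pct(true_positives: int, total: int) -> float:
    """Recall as a percentage of all removed hyphens."""
    return 100.0 * true_positives / total

def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

for name, (tp, p, r, f) in rows.items():
    print(f"{name:24s} recall={recall_pct(tp, REMOVED_HYPHENS):5.1f} "
          f"(reported {r})  F={f_score(p, r):5.1f} (reported {f})")
```

The recomputed values agree with the reported ones up to rounding.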
10
Evaluation on Learner Text (1)
CLC-FCE corpus: 173 instances of missing hyphen errors

Baseline                 True Positives   Precision   Recall   F-Score
Collins Dictionary                  131        64.5     75.7      69.7
Wiki (counts)                       141        73.1     81.5      77.0
Wiki (probs)                         36        92.3     20.8      34.0

Classifier               True Positives   Precision   Recall   F-Score
SJM-Trained                          60        84.5     34.7      49.2
Wiki-Revision-trained                71        98.6     41.0      58.0
Combination                          66        98.5     38.2      55.0
11
Evaluation on Learner Text (1)
Some observations:
– Very low frequency error (173)
– Dominated by one lexical item: make-up
– Errors are not independent events
12
Evaluation on Learner Text (2)
Precision-only manual evaluation
– Random sample of 100 errors per system, drawn from the errors detected in 1,000 student essays
– 2 native speaker judgements (0.79)
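The slide does not say which agreement statistic the 0.79 refers to. Assuming it is a chance-corrected measure such as Cohen's kappa (an assumption, not something the slide confirms), agreement between two judges could be computed as in the sketch below. The judgement lists are invented toy data, not the study's annotations.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each annotator's label counts.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy judgements (1 = true missing-hyphen error, 0 = false alarm).
judge1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
judge2 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(round(cohens_kappa(judge1, judge2), 2))
```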
13
Evaluation on Learner Text (2)
Native Speaker Judgements (Precision)

Baseline                 Total Predictions   Judge 1   Judge 2
Collins Dictionary                     416        11         8
Wiki (counts)                         2185        20        21
Wiki (probs)                           224        54        52

Classifier               Total Predictions   Judge 1   Judge 2
SJM-Trained                            421        62        69
Wiki-Revision-trained                  577        43        41
Combination                            450        60        62
14
Conclusions
Logistic Regression Model for predicting missing hyphens in learner text
Trained on:
1. A corpus of well-edited text
2. A corpus of automatically mined corrections
In general, the classifiers outperform the baselines, especially in terms of precision
http://blog.ezinearticles.com
Thanks! Questions? Comments?
15
Brown Corpus: Precision/Recall (figure)
16
CLC-FCE: Precision/Recall (figure)