Published by Emory Cobb
1
Detecting Missing Hyphens in Learner Text
Aoife Cahill, Susanne Wolff, Nitin Madnani (Educational Testing Service)
Martin Chodorow (Hunter College and the Graduate Center)
ACL 2013
2
Outline Introduction Baselines System Description Evaluation Conclusions
3
Introduction
Missing hyphens:
(1) Schools may have more after school sports.
(2) I went to the dentist after school today.
(3) My father like play basketball with me.
5
Baselines
(1) The hyphenated form is listed in the Collins Dictionary
(2) The hyphenated form occurs more than 1,000 times in Wikipedia
(3) The probability of the hyphenated form, as estimated from Wikipedia, is greater than 0.66
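The third baseline can be read as a relative-frequency test. The sketch below is only an illustration: it assumes the probability is the count of the hyphenated form divided by the combined count of the hyphenated and open (space-separated) forms, which the slides do not spell out. The function names and the example counts are hypothetical.

```python
def hyphenation_probability(hyphenated_count, open_count):
    """Relative frequency of the hyphenated form (assumed estimator)."""
    total = hyphenated_count + open_count
    if total == 0:
        return 0.0  # no evidence either way
    return hyphenated_count / total


def baseline_predicts_hyphen(hyphenated_count, open_count, threshold=0.66):
    """Baseline (3): hyphenate when the estimated probability exceeds 0.66."""
    return hyphenation_probability(hyphenated_count, open_count) > threshold


# Hypothetical counts, not real Wikipedia statistics.
toy_counts = {
    ("well", "known"): (900, 100),    # (hyphenated, open)
    ("after", "school"): (300, 700),
}

for pair, (hyphenated, open_form) in toy_counts.items():
    print(pair, baseline_predicts_hyphen(hyphenated, open_form))
```

A baseline of this kind makes the same decision for a word pair wherever it occurs, so it cannot tell sentence (1) from sentence (2) in the introduction.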
7
System Description
Learner text: Schools may have more after school sports.
8
System Description
Model: logistic regression
Decision threshold: a missing hyphen error is predicted only when the probability of the prediction is greater than 0.99
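The decision rule on this slide can be sketched as a logistic function followed by a high-precision cutoff. This is a minimal illustration, not the authors' implementation: the feature name `next_is_noun` and the weights are hypothetical, since the slides do not list the features used.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hyphen_probability(features, weights, bias=0.0):
    """Probability that a hyphen is missing between two adjacent words."""
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return sigmoid(z)


def flag_missing_hyphen(features, weights, bias=0.0, threshold=0.99):
    """Report an error only when the model is very confident (> 0.99)."""
    return hyphen_probability(features, weights, bias) > threshold


# Hypothetical feature and weights, for illustration only.
features = {"next_is_noun": 1.0}
print(flag_missing_hyphen(features, {"next_is_noun": 6.0}))  # confident case
print(flag_missing_hyphen(features, {"next_is_noun": 3.0}))  # below the cutoff
```

Setting the cutoff this high trades recall for precision: probable but uncertain cases are left unflagged.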
9
System Description
SJM-trained:
- San Jose Mercury News corpus
- For training, hyphenated words are automatically split (e.g., well-known becomes well known)
- The training data contains 1% of the positive examples and 3% of the negative examples
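Splitting hyphenated words gives labeled training data for free: the position where a hyphen was removed is a positive example, and every other gap between adjacent words is a negative one. The sketch below illustrates that idea under simplifying assumptions (whitespace tokenization, hypothetical function name); it does not reproduce the sampling of 1% and 3% mentioned above.

```python
def make_training_examples(tokens):
    """Split hyphenated tokens and label each gap between adjacent words.

    Returns (words, labels), where labels[i] is 1 if a hyphen originally
    joined words[i] and words[i + 1], and 0 otherwise.
    """
    words, labels = [], []
    for token in tokens:
        parts = [part for part in token.split("-") if part]
        if not parts:
            continue  # skip tokens that consist only of hyphens
        if words:
            labels.append(0)  # gap between the previous token and this one
        for index, part in enumerate(parts):
            if index > 0:
                labels.append(1)  # gap where a hyphen was removed
            words.append(part)
    return words, labels


words, labels = make_training_examples("a well-known author".split())
print(words)
print(labels)
```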
10
System Description
Negative example selection: only contexts that occur more than 20 times are selected during training.
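The frequency filter can be sketched with a simple counter. The slides do not define what a "context" is, so the sketch below assumes, purely for illustration, that a context is the pair of words on either side of a gap; the function name is hypothetical.

```python
from collections import Counter


def filter_negative_examples(negative_contexts, min_count=20):
    """Keep only negative examples whose context occurs more than min_count times."""
    counts = Counter(negative_contexts)
    return [context for context in negative_contexts
            if counts[context] > min_count]


# Hypothetical contexts: one seen 21 times, one seen exactly 20 times.
contexts = [("after", "school")] * 21 + [("the", "dog")] * 20
kept = filter_negative_examples(contexts)
print(len(kept), set(kept))
```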
11
System Description
Wiki-revision-trained:
- Wikipedia articles
12
System Description
14
Combined:
- Combines both data sources
16
Evaluation
Artificial data:
- Brown corpus
- 24,243 sentences
- 2,072 hyphenated words
17
Evaluation
19
Evaluation 1
Learner text:
- CLC-FCE
- The corpus contains 1,244 exam scripts
- 173 instances of missing hyphen errors in total
20
Evaluation
22
A closer look at the 131 true positives for the learner data reveals that 87 of them are cases of a single type, the word “make-up”.
23
Evaluation
Evaluation 2
Learner text:
- A data set of 1,000 student GRE and TOEFL essays
- Drawn from 295 prompts
- Essays range in length from 1 to 50 sentences
- Average of 378 words per essay
24
Evaluation
Learner text (cont.):
- For each system, a random sample of 100 instances where it detected a missing hyphen was inspected manually
- Judged by two native English speakers
- The Chicago Manual of Style was used as a guide
- High agreement between the judges
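Agreement between two judges is commonly summarized by raw percent agreement or by Cohen's kappa, which corrects for chance. The slides do not say which measure was used, so the sketch below is only an illustration of both; the labels and judgments are hypothetical.

```python
from collections import Counter


def percent_agreement(judge_a, judge_b):
    """Fraction of items on which the two judges gave the same label."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("judgment lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)


def cohens_kappa(judge_a, judge_b):
    """Chance-corrected agreement between two judges."""
    n = len(judge_a)
    observed = percent_agreement(judge_a, judge_b)
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments on four detected instances.
judge_a = ["correct", "correct", "wrong", "wrong"]
judge_b = ["correct", "correct", "wrong", "correct"]
print(percent_agreement(judge_a, judge_b), cohens_kappa(judge_a, judge_b))
```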
25
Evaluation
27
Conclusions
(1) Automatically detecting missing hyphen errors in learner text
(2) The classifiers generally performed better than the baseline systems
(3) Taking context into account when detecting the errors is important