Grammar correction – Data collection interface
David Ling
2018-08-15

Contents
- Background
- Data collection plan
- Data collection system (progress report)

Background
- Project in charge: Holly Chung, Amy Kwok, Anora Wong (ENG)
- Target:
  - English error correction for HK students
  - Highlight good practices (not well defined yet)
- Method: Chollampatt, 2018
  - Fairseq (Facebook) + language model (KenLM)
  - Example correction: "Math lessons use English." -> "Math lessons are conducted in English."
  - Better than traditional rule-based checkers (e.g. Word, Grammarly)
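
The Method item above pairs a sequence-to-sequence corrector (Fairseq) with a language model (KenLM), but the slide does not say how the two are combined. One common arrangement, assumed here, is to rerank the corrector's candidate outputs by a weighted sum of the two scores. The sketch below illustrates only that idea: the `rerank` helper, the `lm_weight` parameter, and all scores are hypothetical, and neither Fairseq nor KenLM is actually called.

```python
def rerank(candidates, lm_weight=0.5):
    """Pick the candidate with the best combined score.

    candidates: list of (sentence, model_logprob, lm_logprob) tuples.
    The scores are made-up log-probabilities for illustration only.
    """
    return max(candidates, key=lambda c: c[1] + lm_weight * c[2])[0]


# Hypothetical n-best list for the example sentence on the slide.
candidates = [
    ("Math lessons use English.", -1.2, -9.0),
    ("Math lessons are conducted in English.", -1.5, -6.0),
]

best = rerank(candidates)
print(best)
```

With these made-up numbers the language model score tips the choice toward the more fluent candidate; setting `lm_weight=0` would fall back to the corrector's own ranking.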

Grammar correction data set
- LANG-8 (Japanese social networking website)
- NUCLE (National University of Singapore)

Grammar correction data set
- Building a data set for Hong Kong students
  - Improvement on the checker
    - Different sentence style
    - Different error types
  - Literature value
    - Statistical analysis of HK students’ English

Grammar correction data set
- Daniel Dahlmeier et al., "Building a Large Annotated Corpus of Learner English: The NUS Corpus of Learner English", ACL 2013
- Course assignments
- A wide range of topics, such as technology innovation or health care

Grammar correction data set
- NUCLE data sample
- 28 error types: verb tense, subject-verb agreement, article or determiner, noun number, …
- 10 English instructors
- 7 months
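
To make the idea of an error-tagged corpus concrete, here is a minimal sketch of one span-level annotation and how applying it yields the corrected sentence. The field names, the character-offset convention, and the "Word choice" label are assumptions for illustration; this is not NUCLE's actual file format or tag set.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    start: int        # character offset where the erroneous span begins
    end: int          # character offset just past the erroneous span
    error_type: str   # tag chosen by the annotator (hypothetical label here)
    correction: str   # replacement text for the span


def apply_annotations(text, annotations):
    """Return the text with every annotated span replaced by its correction."""
    # Work from right to left so earlier offsets stay valid after each edit.
    for ann in sorted(annotations, key=lambda a: a.start, reverse=True):
        text = text[:ann.start] + ann.correction + text[ann.end:]
    return text


sentence = "Math lessons use English."
annotations = [Annotation(13, 16, "Word choice", "are conducted in")]

print(apply_annotations(sentence, annotations))
```

Storing the span and the correction separately from the essay text keeps the original student writing intact, which matters if the data is later used for statistical analysis of error types.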

Data collection plan
- Timeline proposed by Amy
- Review and modify the tag set (adding HK-style tags)
- Hire teachers for tagging; proposed budget: HK$100 per essay x 2,000 essays = HK$200,000
- Data set: internal use / open to public / commercial?

Marking tool
- Marking tool = database (SQLite) + interface (PHP + JavaScript)
- Database: essays + annotations
- Currently implemented features: login, listing essays in the database, annotating, saving and removing annotations
- DEMO: http://10.244.0.191/annotation/main.php
- [Diagram: three teachers]
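
The database side of the marking tool (essays plus annotations, with list, save, and remove operations) could be laid out roughly as follows. This is a sketch under stated assumptions: the table and column names are hypothetical, and Python's `sqlite3` module is used only for illustration, since the actual tool is written in PHP and JavaScript.

```python
import sqlite3

# Hypothetical schema: one table for essays, one for teacher annotations.
SCHEMA = """
CREATE TABLE IF NOT EXISTS essays (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    body  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS annotations (
    id         INTEGER PRIMARY KEY,
    essay_id   INTEGER NOT NULL REFERENCES essays(id),
    teacher    TEXT NOT NULL,
    start_pos  INTEGER NOT NULL,
    end_pos    INTEGER NOT NULL,
    error_type TEXT NOT NULL,
    correction TEXT NOT NULL
);
"""


def open_db(path=":memory:"):
    """Open the database and create the tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_essay(conn, title, body):
    cur = conn.execute(
        "INSERT INTO essays (title, body) VALUES (?, ?)", (title, body)
    )
    conn.commit()
    return cur.lastrowid


def list_essays(conn):
    """Return (id, title) pairs for the essay listing page."""
    return conn.execute("SELECT id, title FROM essays ORDER BY id").fetchall()


def save_annotation(conn, essay_id, teacher, start_pos, end_pos,
                    error_type, correction):
    cur = conn.execute(
        "INSERT INTO annotations "
        "(essay_id, teacher, start_pos, end_pos, error_type, correction) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (essay_id, teacher, start_pos, end_pos, error_type, correction),
    )
    conn.commit()
    return cur.lastrowid


def remove_annotation(conn, annotation_id):
    conn.execute("DELETE FROM annotations WHERE id = ?", (annotation_id,))
    conn.commit()


def count_annotations(conn, essay_id):
    row = conn.execute(
        "SELECT COUNT(*) FROM annotations WHERE essay_id = ?", (essay_id,)
    ).fetchone()
    return row[0]


# Usage: store one essay, annotate it, then remove the annotation again.
conn = open_db()
essay_id = add_essay(conn, "Sample essay", "Math lessons use English.")
ann_id = save_annotation(conn, essay_id, "teacher1", 13, 16,
                         "Word choice", "are conducted in")
remove_annotation(conn, ann_id)
```

Recording the teacher with each annotation would allow several teachers to tag the same essay independently, which is one way to check agreement between annotators later.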