English project More detail and the data collection system


1 English project More detail and the data collection system
David Ling

2 Contents
Project background
Evaluation
Training
Data collection system

3 Background
Project in charge: Holly Chung, Amy Kwok, Anora Wong (ENG)
Target:
English error correction for HK students
Highlight good practices (not well defined yet)
More than traditional grammar checkers: Chinglish, collocation, meaning, and style
Math lessons use English. → Math lessons are conducted in English.
He can say Chinese. → He can speak Chinese.

4 Background
Old methods:
Rule based, eg. Microsoft Word, LanguageTool
An error rule extracted from LanguageTool on subject-verb agreement:
Sentence start + determiner + plural noun + is (eg. The dogs is ..., The teachers is ...)
Pattern matched → Trigger correction
About 1.7k handcrafted error patterns
Statistical methods
New method:
Deep learning, translation, data driven
Chollampatt, 2018 (National University of Singapore)
Fairseq (Facebook) + Language model (KenLM)
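
The rule-based pattern on this slide can be illustrated with a minimal Python sketch. This is not LanguageTool's actual rule format: the determiner list, the "ends in s" plural test, and the function name correct_agreement are all assumptions made for illustration.

```python
import re

# Hypothetical, simplified rule: sentence start + determiner + plural noun + "is".
# The determiner list and the "ends in s" plural heuristic are assumptions,
# not taken from LanguageTool.
DETERMINERS = ("the", "these", "those", "some", "many")
PATTERN = re.compile(
    r"^(?:" + "|".join(DETERMINERS) + r")\s+[A-Za-z]+s\s+is\b",
    re.IGNORECASE,
)


def correct_agreement(sentence: str) -> str:
    """Replace 'is' with 'are' when the pattern matches at the sentence start."""
    match = PATTERN.match(sentence)
    if match is None:
        return sentence  # pattern not matched, so no correction is triggered
    # The match ends right after "is"; swap those two characters for "are".
    return sentence[: match.end() - 2] + "are" + sentence[match.end():]


print(correct_agreement("The dogs is barking."))
print(correct_agreement("The dog is barking."))
```

A heuristic this crude also fires on singular nouns ending in "s" (for example "The bus is late."), which hints at why a rule set needs many handcrafted patterns and exceptions.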

5 Which set is by deep learning?
Input sentences:
He go to schol tomorrow.
"I go to school by bus.", said David yesterday.
… she did not want another mother would also feeling it.
It can make the audiences having the same feeling on it.
Set A (Grammarly):
He goes to school tomorrow.
"I go to school by bus.", said, David, yesterday.
… she did not want another mother would also feel it.
[No change]
Set B (Deep learning):
He will go to school tomorrow.
"I go to school by bus," said David yesterday.
… she did not want another mother to feel it.
It can make the audience feel the same way.
Deep learning:
Correction based on the context
Recall more errors
Not just correcting errors, but also improving styles

6 Evaluation – Four main steps
INPUT: He go to schol tomorrow.
1. Tokenize + Byte pair encoding
He go to l tomorrow .
2. Fairseq (Beam search 12 sentences)
He will go to school tomorrow . ||| F0=
He goes to school tomorrow . ||| F0=
He is going to school tomorrow . ||| F0=
3. Language model (KenLM 150GB)
He will go to school tomorrow . ||| LM0=
He goes to school tomorrow . ||| LM0=
He is going to school tomorrow . ||| LM0=
4. Reweighted with number of edit operations and sentence length
OUTPUT: He will go to school tomorrow.
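
Step 4 can be sketched as a weighted sum over the four features named on the slide. The sketch below is an assumption about how such a reweighting might look, not the project's actual code: the weights, the Candidate class, and the scores in the usage lines are invented for illustration, since the slide's F0 and LM0 values are not shown.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str             # tokenized candidate sentence from beam search
    seq2seq_score: float  # log-probability from the translation model (F0)
    lm_score: float       # log-probability from the language model (LM0)


def edit_distance(source: list[str], target: list[str]) -> int:
    """Token-level Levenshtein distance, used as the number of edit operations."""
    prev = list(range(len(target) + 1))
    for i, s in enumerate(source, 1):
        curr = [i]
        for j, t in enumerate(target, 1):
            cost = 0 if s == t else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Hypothetical feature weights; in practice these would be tuned on held-out data.
WEIGHTS = {"seq2seq": 1.0, "lm": 0.5, "edits": -0.1, "length": 0.05}


def rerank(source: str, candidates: list[Candidate]) -> Candidate:
    """Return the candidate with the highest weighted feature sum."""
    src_tokens = source.split()

    def score(c: Candidate) -> float:
        tokens = c.text.split()
        return (
            WEIGHTS["seq2seq"] * c.seq2seq_score
            + WEIGHTS["lm"] * c.lm_score
            + WEIGHTS["edits"] * edit_distance(src_tokens, tokens)
            + WEIGHTS["length"] * len(tokens)
        )

    return max(candidates, key=score)


# Usage with made-up scores (not the values from the slide).
beam = [
    Candidate("He will go to school tomorrow .", -1.0, -10.0),
    Candidate("He goes to school tomorrow .", -2.0, -30.0),
]
best = rerank("He go to schol tomorrow .", beam)
print(best.text)
```

Penalizing the edit count keeps the output close to the student's sentence, while the length term offsets the bias of log-probabilities toward shorter candidates.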

7 Training – data sets
LANG-8 (Japan social website), 2012: ~2000k sentences
NUCLE (National University of Singapore), 2014: ~60k sentences (1500 essays)
Coverage of topics and error types is far from enough, eg. eSports

8 Training - with additional training sentences
[Figure: corrections before and after adding the training sentences; labels: "are conducted in", "be used in", "math"]
Recalled successfully
Chinese and Science are not in training data
Five additional training sentences:
1. Math lessons used English . → Math lessons were conducted in English .
2. Physics lessons used English . → Physics lessons were conducted in English .
3. Biology classes use English . → Biology classes are conducted in English .
4. History lessos used English . → History lessons was conducted in English .
5. Philosophy lessons often used English . → Philosophy lessons are conducted in English often

9 Grammar correction data set
Building a data set for Hong Kong students
Improvement on the checker:
Different sentence style
Different error types
Literature value:
Statistical analysis on HK students' English

10 Data collection system
System = database + interface (PHP+JS)
Four tables in the database
System contains about 40 computer-corrected essays
Asked the English teachers to try it
SQLite:
Easily compatible with Python and PHP
Stored as a single file

11 Data collection system
Table -- ESSAYS
Table -- ANNOTATIONS
Stored in JSON format
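
As a rough illustration of the two tables named on this slide, here is a minimal Python sketch using the standard sqlite3 module. The column names, the helper functions, and the shape of the JSON payload are all hypothetical, and the sketch assumes that the JSON format applies to the annotations. The other two tables of the real system are not described in the slides and are left out.

```python
import json
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the two tables named on the slide (columns are assumptions)."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS essays (
            essay_id INTEGER PRIMARY KEY,
            text     TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS annotations (
            annotation_id INTEGER PRIMARY KEY,
            essay_id      INTEGER NOT NULL REFERENCES essays(essay_id),
            data          TEXT NOT NULL  -- annotation serialized as JSON
        );
        """
    )


def add_essay(conn: sqlite3.Connection, text: str) -> int:
    """Insert an essay and return its row id."""
    cur = conn.execute("INSERT INTO essays (text) VALUES (?)", (text,))
    return cur.lastrowid


def add_annotation(conn: sqlite3.Connection, essay_id: int, annotation: dict) -> int:
    """Store one annotation as a JSON string and return its row id."""
    cur = conn.execute(
        "INSERT INTO annotations (essay_id, data) VALUES (?, ?)",
        (essay_id, json.dumps(annotation)),
    )
    return cur.lastrowid


def annotations_for(conn: sqlite3.Connection, essay_id: int) -> list[dict]:
    """Load and decode all annotations for one essay, oldest first."""
    rows = conn.execute(
        "SELECT data FROM annotations WHERE essay_id = ? ORDER BY annotation_id",
        (essay_id,),
    ).fetchall()
    return [json.loads(row[0]) for row in rows]


# Usage with an in-memory database; a real deployment would pass a file path.
conn = sqlite3.connect(":memory:")
create_schema(conn)
essay_id = add_essay(conn, "He can say Chinese.")
# Hypothetical annotation shape: character span plus suggested correction.
add_annotation(conn, essay_id, {"start": 7, "end": 10, "correction": "speak"})
print(annotations_for(conn, essay_id))
```

Keeping each annotation as a JSON string lets the PHP+JS interface and Python scripts share the same record without a schema change whenever a new annotation field is added.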

12 END Thank you

