Some preliminary results 2017-12-27
Marking interface
Highlight unusual phrases / grammatical errors
Current corpora: Google ngram, Wikipedia 2007
Problem: too many phrases are not in the corpus
Reasons a phrase may be absent from the corpus: (1) uncommon words (2) small corpus (3) wrong usage
Highlighted: trigrams not in the Google ngram book corpus (from an HSMC student's report)
It seems weak in identifying simple subject-verb agreement errors.
To rule out false-positive highlights, we attempted normalized linear approximation last week:
$$p'''(w_1, w_2, w_3) = \lambda_1\, p'(w_1, w_2, w_3) + \lambda_2\, p''(w_1, w_2, w_3)$$
$$\mathrm{score} = \frac{p'''(w_1, w_2, w_3)}{p(w_1)\, p(w_2)\, p(w_3)}$$
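The two formulas above can be sketched in Python as follows. This is a minimal illustration, not the code used for the experiments: the function name `interpolated_score` is hypothetical, and the slide does not say where p' and p'' come from (one estimate per corpus is a plausible reading, but that is an assumption).

```python
import math


def interpolated_score(p1, p2, unigram_probs, weights=(0.5, 0.5)):
    """Mix two trigram estimates, then normalize by the unigram product.

    p1, p2        -- the two trigram estimates p' and p'' (their source is assumed)
    unigram_probs -- (p(w1), p(w2), p(w3))
    weights       -- (lambda_1, lambda_2); the slides use 0.5, 0.5
    """
    lam1, lam2 = weights
    p_mix = lam1 * p1 + lam2 * p2          # p''' in the slide's notation
    denom = math.prod(unigram_probs)       # p(w1) * p(w2) * p(w3)
    if denom == 0:
        raise ValueError("unigram probabilities must be non-zero")
    return p_mix / denom


# Illustrative numbers only, not taken from any corpus.
example = interpolated_score(0.2, 0.4, (0.5, 0.5, 0.5))
```

With equal weights the mixture is the plain average of the two estimates, so a trigram missing from one source keeps half of the other source's estimate instead of dropping to zero.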
TEST 1
O: true positive, X: false positive, FN: false negative
(c) Normalized linear approximation (threshold = 0.3e-24, weights = 0.5, 0.5)
System highlights: 1 X, 2 O, 3 O, 4 O, 5 O, 6 O, 7 X, 8 O, 9 O, 10 X
False negatives marked: 1, 2, 3, 5, 6, 7, 8
Precision: 7/10 = 0.7
Recall: 7/(7+8) = 0.47
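For reference, the precision and recall figures for TEST 1 follow from the counts on the slide (7 true positives, 3 false positives, 8 false negatives). The helper below is a small sketch with a hypothetical name, `precision_recall`:

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from raw counts.

    precision = TP / (TP + FP): share of system highlights that are correct
    recall    = TP / (TP + FN): share of real errors that the system found
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# TEST 1 counts as reported on the slide.
test1_precision, test1_recall = precision_recall(tp=7, fp=3, fn=8)
```

The same helper applies to any other test set once its three counts are known.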
TEST 3 (very poor)
Left column: system; right column: by teacher
(c) Normalized linear approximation (threshold = 0.3e-24, weights = 0.5, 0.5)
System: 1 X, 2 X, 3 O
Teacher: 1 X (X: 19)
Precision: 1/3 = 0.33
Recall: 1/20 = 0.05
New scores
(1) Normalized score without interpolation
$$\text{normalized raw frequency} = \frac{\mathrm{freq}(\text{so he do})}{\mathrm{freq}(\text{so})\,\mathrm{freq}(\text{he})\,\mathrm{freq}(\text{do})}$$
(2) Score by ratio of inflected form
E.g. "So he do not explain what is YouTube"
Possible inflected form: "so he does"
Calculate the ratio "so he does" / "so he do"
Score by ratio of inflected form
Step 1: use a parser to detect verbs by POS tag: 'VBZ', 'VBP', 'VB', 'VBD'
Step 2: screen using normalized raw frequency (the threshold is lower than for pure normalized raw frequency)
$$\text{normalized raw frequency} = \frac{\mathrm{freq}(\text{so he do})}{\mathrm{freq}(\text{so})\,\mathrm{freq}(\text{he})\,\mathrm{freq}(\text{do})}$$
Step 3: compute the ratio
$$\text{ratio} = \frac{\mathrm{freq}(\text{so he does})}{\mathrm{freq}(\text{so he do})} = 21.78$$
Step 4: highlight if the ratio is higher than a threshold
Counts: "so he do": 3487, "so he does": 75976; "so": 724571145, "he": 2055218371, "do": 558298911
Normalized raw frequency: 4.195374032065089e-24
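Steps 2 and 3 can be reproduced from the counts quoted above. The sketch below is illustrative only: the function names and the dictionary layout are assumptions, and Step 1 (POS tagging) is left out because the slide does not name the parser.

```python
# Counts quoted on the slide for the example trigram.
TRIGRAM_COUNTS = {"so he do": 3487, "so he does": 75976}
UNIGRAM_COUNTS = {"so": 724571145, "he": 2055218371, "do": 558298911}


def normalized_raw_frequency(trigram, trigram_counts, unigram_counts):
    """freq(w1 w2 w3) / (freq(w1) * freq(w2) * freq(w3))."""
    denom = 1
    for word in trigram.split():
        denom *= unigram_counts[word]
    return trigram_counts.get(trigram, 0) / denom


def inflection_ratio(original, alternative, trigram_counts):
    """freq(alternative) / freq(original).

    Assumption: an unseen original with a seen alternative gives infinity.
    """
    base = trigram_counts.get(original, 0)
    alt = trigram_counts.get(alternative, 0)
    if base == 0:
        return float("inf") if alt > 0 else 0.0
    return alt / base


nrf = normalized_raw_frequency("so he do", TRIGRAM_COUNTS, UNIGRAM_COUNTS)
ratio = inflection_ratio("so he do", "so he does", TRIGRAM_COUNTS)
```

Dividing the quoted counts directly gives a normalized raw frequency of about 4.194e-24 and a ratio of about 21.79. The slide's 4.195e-24 is the value obtained with a trigram count of 3488, which may reflect add-one smoothing, although the slide does not say so.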
Pink: highlighted due to normalized raw frequency (threshold = 0.5e-24)
Purple: highlighted due to ratio of inflected form (rawf_threshold, ratio_threshold = 1e-23, 5.5)
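One way to read the two colour rules is as a small decision function. This is a sketch under assumptions the slides do not spell out: a trigram is taken to be suspicious when its normalized raw frequency falls below a threshold, the pink rule is checked before the purple rule, and the name `highlight_colour` is hypothetical.

```python
def highlight_colour(nrf, ratio,
                     pink_threshold=0.5e-24,
                     rawf_threshold=1e-23,
                     ratio_threshold=5.5):
    """Return "pink", "purple", or None for one verb-containing trigram.

    nrf   -- normalized raw frequency of the trigram as written
    ratio -- freq(inflected alternative) / freq(trigram as written)
    Comparison directions and rule order are assumptions (see lead-in).
    """
    if nrf < pink_threshold:
        return "pink"        # rare enough to flag on frequency alone
    if nrf < rawf_threshold and ratio > ratio_threshold:
        return "purple"      # an inflected alternative is far more common
    return None


# "so he do": values close to those reported on the slides.
colour = highlight_colour(4.195e-24, 21.78)
```

Under this reading, "so he do" is not rare enough for a pink highlight (4.195e-24 is above 0.5e-24), but it passes the screening threshold of 1e-23 and its ratio of 21.78 exceeds 5.5, so it is marked purple.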