
1 AVENUE's last component: Interactive and Automatic Refinement of Translation Rules
Ariadna Font Llitjós, March 10, 2004

2 The Translation Correction Tool and the Rule Refinement Module
Today, the interactive step: the Translation Correction Tool and the English-Spanish user study (LREC'04 paper).
Next week, the automatic step: the Rule Refinement Module.

3 AVENUE overview

4 Motivation
In general: MT system output still requires post-editing, and current systems do not recycle post-editing efforts back into the system, so they do not improve beyond adding that specific corrected translation to the database.
Within AVENUE: communities that speak a low-density language tend not to have computational linguists who can write translation grammars, so the automatically learned transfer rules (from the RL module) need to be validated.

5 Goal
Simplify the correction task maximally: get naive bilingual speakers to accurately and minimally correct translations, and to accurately classify MT errors.

6 Ultimate goal
Learn a mapping between incorrect structures and correct structures. Example: "She saw high woman" -> "She saw the tall woman".

7 Spanish SLS: Ella vio a la mujer alta
English TLS: She saw high woman
Corrected TLS: She saw the tall woman
MT error classification: missing determiner + wrong sense.
Blame assignment: the NP rule that generated the direct object, plus selectional restrictions.
Rule refinement: the Noun Phrase (NP) rule that generated the error, NP -> Adj N, needs to be refined into 2 different cases (sketched below):
NP -> Det Adj N[sg] (the tall woman)
NP -> (Det) Adj N[pl] ((the)? tall women)
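A minimal sketch of how this refinement could be represented as data. The Rule class, its field names, and the "(Det)" optionality marker are hypothetical illustrations, not the actual AVENUE rule formalism.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    lhs: str                       # constituent being built, e.g. "NP"
    rhs: list                      # right-hand-side constituents; "(Det)" marks an optional one
    constraints: dict = field(default_factory=dict)

    def __str__(self):
        return f"{self.lhs} -> {' '.join(self.rhs)} {self.constraints or ''}".rstrip()


# Overly general rule that produced "high woman":
original = Rule("NP", ["Adj", "N"])

# Refined rules derived from the correction "She saw the tall woman":
refined = [
    Rule("NP", ["Det", "Adj", "N"], {"N number": "sg"}),    # the tall woman
    Rule("NP", ["(Det)", "Adj", "N"], {"N number": "pl"}),  # (the)? tall women
]

if __name__ == "__main__":
    print("before:", original)
    for r in refined:
        print("after: ", r)
```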

8 Research questions
What does minimal correction mean? How can we convey this to bilingual informants?
What is the easiest and most intuitive way to minimally correct a sentence, i.e. how can users indicate errors?
What is the right (intuitive and easy) MT error classification that will help the Rule Refinement task the most?
Can naive users actually tell what the source of an error is?
Should the MT error classification be made finer depending on user profiles?
Does it make a difference if there is more context than just a sentence?

9 The Translation Correction Tool (TCTool)
A first attempt to answer these questions resulted in the TCTool:
Online tool: bilingual users can access it from anywhere with a computer and an internet connection (office, home, etc.).
User friendly and easy to use (not aimed at linguists or computer experts).
Provides translation and error classification help (23-page tutorial + error-example page).
Elicits as much information about translation errors from users as possible.
Initial MT error classification, expected to change after the user studies.

10

11 MT error classification
First approach, expected to change after the user studies.
Nine linguistically motivated error types: word order, sense, agreement (number, person, gender, tense), form (case, POS), incorrect word, and no translation (enumerated in the sketch below).
Users were given the individual agreement options, but case and POS had to be classified indistinguishably as form.
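As an illustration only, the nine classes could be enumerated as below; the identifier names are my own shorthand, not labels taken from the tool.

```python
from enum import Enum, auto


class MTError(Enum):
    WORD_ORDER = auto()
    SENSE = auto()
    AGREEMENT_NUMBER = auto()
    AGREEMENT_PERSON = auto()
    AGREEMENT_GENDER = auto()
    AGREEMENT_TENSE = auto()
    FORM = auto()            # covers both case and POS, which users could not distinguish
    INCORRECT_WORD = auto()
    NO_TRANSLATION = auto()


assert len(MTError) == 9     # the nine linguistically motivated error types
```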

12 English-Spanish user study
32 sentences from the elicitation corpus: 4 correct / 28 incorrect.
Examples:
sl: mary and anna are falling -> tl: maría y ana están cayendo
sl: you saw the woman -> tl: viste la mujer / tl: vió la mujer
sl: i used my elbow to push the button -> tl: usé mi codo que apretar el botón
sl: we are building new bridges in the city -> tl: nosotros estamos construyó nuevo puentes dentro la ciudad

13 English-Spanish MTS
12 manually written translation rules (2 for S, 7 for NP and 3 for VP).
442 lexical entries (designed to translate the first 400 sentences of the elicitation corpus).
A hypothetical example of each is sketched below.
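For concreteness, a minimal sketch of what one transfer rule and one lexical entry could look like if stored as plain Python data; the structure and field names are hypothetical, since the slides do not show the actual grammar format.

```python
# One of the 7 NP rules: English "the tall woman" -> Spanish "la mujer alta".
np_rule = {
    "lhs": "NP",
    "english_rhs": ["Det", "Adj", "N"],
    "spanish_rhs": ["Det", "N", "Adj"],       # the adjective follows the noun in Spanish
    "alignments": [(0, 0), (1, 2), (2, 1)],   # source constituent i maps to target constituent j
}

# One of the 442 lexical entries.
lexical_entry = {"english": "woman", "spanish": "mujer", "pos": "N"}
```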

14 Correction Example

15 Editing a word

16 User stats
29 users completed the evaluation (Spain 24, Colombia 4, Mexico 1).
66% of the users did not have any background in linguistics; 75% had a graduate degree and 25% had a Bachelor's degree.
Users fixed 26.6 translations on average (out of 32, of which 28 needed fixing).
Duration: ~1 hour 30 min [28 min to 5 hours], i.e. about 3 minutes per sentence.

17 Session log stats
Time stamp: Mar (22:51:13).
Number of sessions: 83, but number of distinct IP addresses: only 55.
Queries by native speakers of US English: 2; by non-natives: 69; by users who did not specify either way: 12.
Stats for all users:
Users who finished the user study: 29 (all 29 also filled out the questionnaire).
Users who did not start the evaluation: 15.
Users who started (68 in total) but did not finish all sentences: 39.
Total number of finished sessions: 29.

18

19 Gold standard
To measure user accuracy in detecting and classifying errors, we need to establish exactly what the minimum number of errors and corrections needed per translation is.
We created a gold standard which determines the smallest number of errors that must be corrected and what their error types are (a possible representation is sketched below).
It does not include corrections that might make the translation more fluent but do not change it from incorrect to correct (such as removing the subject pronoun in Spanish).
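A minimal sketch of how one gold-standard entry might be stored, assuming each minimal error is recorded as a (token position, error type) pair; the representation is hypothetical.

```python
# Hypothetical gold-standard entry for the earlier example "She saw high woman":
# only the corrections needed to make the translation correct are listed.
gold_standard = {
    "she saw high woman": {
        (2, "missing determiner"),   # "the" must be inserted before "high woman"
        (2, "sense"),                # "high" should be "tall"
    },
}
```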

20 Accuracy measures (wrt. gold standard)
Computed for 10 of the 29 users: all from Spain; 2 had a linguistics background; 2 had a Bachelor's degree, 5 a Masters and 3 a PhD.
To measure accuracy, i.e. how close users are to the gold standard, we looked at precision, recall and the F1 measure (sketched below). In this context, precision is the proportion of errors that the user fixed correctly (# errors detected correctly / # errors detected). Since we are also interested in how accurately users tell us the type of an error, we also estimated the precision with which users checked the right error type. Recall is the proportion of the errors in the translations that the user detected (# errors detected correctly / # errors in the gold standard). There is usually a trade-off between precision and recall, and the F1 measure is an even combination of the two, defined as F1 = 2*p*r / (p+r). All three measures range from 0 to 1, with 1 being the best score.
We are interested in high precision, even at the expense of lower recall: ideally there are no false positives (users correcting something that is not strictly necessary), while false negatives (errors that were not corrected) matter less.
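A sketch of the three measures, assuming the gold standard and a user's corrections are both sets of (position, error type) pairs as in the previous sketch; the function name is mine.

```python
def precision_recall_f1(user_errors: set, gold_errors: set):
    """Return (precision, recall, F1) of one user against the gold standard."""
    correct = len(user_errors & gold_errors)              # errors detected correctly
    precision = correct / len(user_errors) if user_errors else 0.0
    recall = correct / len(gold_errors) if gold_errors else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Example: the user finds the missing determiner but labels "high" as a
# word-order error, giving one false positive and one false negative.
gold = {(2, "missing determiner"), (2, "sense")}
user = {(2, "missing determiner"), (2, "word order")}
print(precision_recall_f1(user, gold))   # (0.5, 0.5, 0.5)
```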

21 Analyzing results
Users did not always fix a translation in the same way.
Most of the time, when the final translation did not match the gold standard, it was still correct or even better.
Users produced only 2.5 translations on average (out of 26.6) that were worse than the gold standard.
There does not seem to be a time-accuracy correlation.

22 Usability questionnaire
All users said it was easy to determine whether a sentence was correct (in reality: 89% accuracy).
The share of users who thought it was easy to determine the source of an error drops to 88% (in reality: 73% accuracy).

23 Users thought the TCTool was user friendly (82%)

24 But the alignment representation could be improved (67%)
Pie charts for all questions are at the end.

25 Conclusions
The MT error classification needs to depart from linguistically motivated classes and instead be motivated by Rule Refinement operations.
The TCTool is usable, but some improvements are needed:
Make the tutorial dynamic, with movies (Ken).
Make the alignment representation less confusing (done).
Add login capability (so that users can take breaks and not lose their work).
Improve the edit_word pop-up window interface.

26 TCTool questionnaire stats
Ariadna Font Llitjós, March 2, 2004; 24 users.

27

28

29

30

31

32

33

34

35

36

37

38

39 Total users = 12

40 Total users = 14

