Deriving Paraphrases for Highly Inflected Languages from Comparable Documents Kfir Bar, Nachum Dershowitz Tel Aviv University, Israel.


1 Deriving Paraphrases for Highly Inflected Languages from Comparable Documents Kfir Bar, Nachum Dershowitz Tel Aviv University, Israel

2 Paraphrases – Phrase level: “I exposed my secret about my personal life” / “I spilled the beans and told Jacky I loved her” – Sentence level: “China did not change its policy toward Taiwan” / “Beijing’s policy toward Taiwan remains unchanged”

3 Motivation? MT coverage problem [Plot, Arabic: covered n-grams vs. parallel corpus size]

4 Related work on paraphrasing – Continuing our previous work on Arabic synonyms (Bar and Dershowitz, AMTA, 2010) – Using a parallel corpus (Callison-Burch et al., 2006) – Using a monolingual corpus (Marton et al., 2009) – Using comparable documents (Wang and Callison-Burch, 2011)

5 Why Arabic? Being a Semitic language, Arabic is highly inflected. Example: وتدرسها = “and she learns it”, a single word that combines a conjunction, a root and pattern, and a direct object.

6 Extracting paraphrases Inspired by: Extracting Paraphrases from a Parallel Corpus, Regina Barzilay and Kathleen R. McKeown (2001) Working on Arabic comparable documents

7 Preparing the corpus Using Arabic Gigaword. We automatically paired documents that are: – published on the same day – maximizing the cosine similarity over the lemma-frequency vectors [Diagram: AFP and XIN documents dated 24.12.2002, 25.12.2002 and 27.12.2002, paired by maximum cosine similarity]
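A minimal sketch of the pairing step described on this slide, assuming each document has already been reduced to a list of lemmas. The tuple layout, the function names and the toy data are illustrative assumptions; the slides do not say whether a similarity threshold or one-to-one matching was applied.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse lemma-frequency vectors."""
    dot = sum(count * b[lemma] for lemma, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_documents(source_a, source_b):
    """For each document in source_a, pick the same-day document in
    source_b with the highest cosine similarity.

    Documents are (doc_id, date, lemmas) tuples (hypothetical layout).
    """
    pairs = []
    for id_a, date_a, lemmas_a in source_a:
        vec_a = Counter(lemmas_a)
        best = None
        for id_b, date_b, lemmas_b in source_b:
            if date_b != date_a:
                continue  # only documents published on the same day
            score = cosine(vec_a, Counter(lemmas_b))
            if best is None or score > best[1]:
                best = (id_b, score)
        if best is not None:
            pairs.append((id_a, best[0], best[1]))
    return pairs


# Toy data, invented for illustration only.
afp = [("afp-1", "2002-12-24", ["earthquake", "taiwan", "hit", "earthquake"]),
       ("afp-2", "2002-12-27", ["election", "vote"])]
xin = [("xin-1", "2002-12-24", ["earthquake", "taiwan", "magnitude"]),
       ("xin-2", "2002-12-24", ["summit", "trade"]),
       ("xin-3", "2002-12-25", ["earthquake", "taiwan"])]
pairs = pair_documents(afp, xin)
```

In this toy run only afp-1 finds a partner, because afp-2 has no same-day counterpart in the second source.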

8 Preparing the corpus 690 document pairs. Manual evaluation by two Arabic speakers: – randomly selected 120 document pairs – question: “Do both documents discuss the same event?”

9 Preprocessing AMIRAN [Diab et al. – to appear] is a tool for finding context-sensitive morpho-syntactic information – Segmentation – Diacritized lemma – Stem – Full part-of-speech tag – Base-phrase tag – Named-entity-recognition (NER) tag

10 Extracting paraphrases: co-training technique [Diagram: extracting pairs of phrases → co-training (context, phrase) iterations → paraphrases. Also shown: alignment, paraphrases, and the marks ✗ ✗ ✔]

11 Extracting pairs of phrases Phrases: – contain at least one non-functional word – do not break a base-phrase in the middle. Example sentences: – A magnitude 6.0 earthquake on the Richter scale occurred at 11:24 a.m. – A strong undersea earthquake hit eastern Taiwan Wednesday
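The two phrase constraints can be sketched as a span enumerator. This is one possible reading, not the authors' implementation: it assumes every token carries a base-phrase id and a function-word flag (both hypothetical fields), treats "do not break a base-phrase" as "span boundaries must fall on base-phrase boundaries", and caps phrases at five words as an assumed limit.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    word: str
    chunk: int        # hypothetical base-phrase id from a chunker
    functional: bool  # hypothetical flag: True for function words


def candidate_phrases(tokens: List[Token], max_len: int = 5) -> List[Tuple[str, ...]]:
    """Enumerate spans that respect base-phrase boundaries and contain
    at least one non-functional word."""
    n = len(tokens)
    phrases = []
    for i in range(n):
        if i > 0 and tokens[i - 1].chunk == tokens[i].chunk:
            continue  # span would start in the middle of a base phrase
        for j in range(i + 1, min(n, i + max_len) + 1):
            if j < n and tokens[j - 1].chunk == tokens[j].chunk:
                continue  # span would end in the middle of a base phrase
            span = tokens[i:j]
            if any(not t.functional for t in span):
                phrases.append(tuple(t.word for t in span))
    return phrases


# Chunk ids and function-word flags below are hand-assigned for illustration.
sentence = [
    Token("A", 0, True), Token("strong", 0, False), Token("undersea", 0, False),
    Token("earthquake", 0, False), Token("hit", 1, False),
    Token("eastern", 2, False), Token("Taiwan", 2, False), Token("Wednesday", 3, False),
]
phrases = candidate_phrases(sentence)
```

With this hand-made chunking, "A strong undersea earthquake" is a candidate, while "strong undersea" is not, because it cuts through a base phrase.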

12 Co-training – dEA xAfyyr swlAnA Almnsq Al>Ely llsyAsp AlxArjyp wAl>mnyp – dEA xAfyyr swlAnA Almmvl Al>ElY llsyAsp AlxArjyp fy – Labels: Inner (Phrase), Outer (Context)

13 Extracting paraphrases We maintain two sets of instances: – unlabeled – labeled: positive = paraphrases, negative = NOT paraphrases

14 Single iteration [Diagram over the Unlabeled and Labeled sets] Steps: 1. Training Outer 2. Using Outer 3. Training Inner 4. Using Inner. Also shown: deterministic labeling, next iteration, paraphrases
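The four numbered steps can be sketched as a generic co-training loop over two views of each instance. Everything here is a simplification made for illustration: an instance is reduced to a (context, phrase) pair of strings, and `train_lookup` is a toy stand-in for a trained classifier, not the SVM classifiers used in the work. The seed labels are assumed to come from the deterministic labeling step.

```python
from typing import Callable, Dict, List, Optional, Tuple

Instance = Tuple[str, str]  # hypothetical: (outer/context view, inner/phrase view)
Classifier = Callable[[str], Optional[bool]]


def train_lookup(examples: List[Tuple[str, bool]]) -> Classifier:
    """Toy classifier: remembers views that were seen with a single label."""
    seen: Dict[str, set] = {}
    for view, label in examples:
        seen.setdefault(view, set()).add(label)

    def predict(view: str) -> Optional[bool]:
        labels = seen.get(view)
        if labels is None or len(labels) != 1:
            return None  # abstain when the view is unseen or ambiguous
        return next(iter(labels))

    return predict


def co_train(seed: Dict[Instance, bool], unlabeled: List[Instance], iterations: int = 5):
    labeled = dict(seed)
    unlabeled = list(unlabeled)
    for _ in range(iterations):
        # Steps 1-2 use view 0 (outer/context); steps 3-4 use view 1 (inner/phrase).
        for view in (0, 1):
            clf = train_lookup([(inst[view], lab) for inst, lab in labeled.items()])
            remaining = []
            for inst in unlabeled:
                guess = clf(inst[view])
                if guess is None:
                    remaining.append(inst)
                else:
                    labeled[inst] = guess  # move the instance to the labeled set
            unlabeled = remaining
    return labeled, unlabeled


# Toy instances, invented for illustration only.
seed = {("ctxA", "p1"): True, ("ctxB", "p2"): False}
pool = [("ctxA", "p3"), ("ctxC", "p3"), ("ctxD", "p4")]
labeled, still_unlabeled = co_train(seed, pool)
```

In the toy run, the context view labels ("ctxA", "p3") as positive, and the phrase view then carries that label over to ("ctxC", "p3"), which is the kind of cross-view propagation co-training relies on.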

15 Deterministic labeling of potential paraphrases Labeling similar phrases as positive. Example passages:
– A strong undersea earthquake hit eastern Taiwan Wednesday, and there are no immediate reports of damage or casualties, according to reports from Taipei. The earthquake registering 6.0 on the Richter scale struck at 11:24 a.m. local time (0324 GMT), was about 76 km southeast of Hualien on the eastern coast, at a depth of 4 km, Taiwan's Central Weather Bureau said in a statement.
– A magnitude 6.0 earthquake on the Richter scale occurred at 11:24 a.m. Wednesday in the waters off Hualian, eastern Taiwan, with no immediate reports of casualties or property damage, the Central Weather Bureau (CWB) said. The quake's epicenter was 76 kilometers southeast of Hualien, according to the CWB.

16 Deterministic labeling of potential paraphrases Negative examples are also labeled: – in the first iteration (single words): words that don’t have similar gloss values – not used in subsequent iterations

17 The outer (context) classifier
Feature                Description
lemma, POS, NER, BP    of each context word
gloss-match rate       left and right
lemma-match rate       left and right
Using SVM, quadratic kernel
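For reference, a quadratic kernel is usually understood as a degree-2 polynomial kernel. The slide does not give its exact form or constants, so the scale \gamma and offset r below are generic symbols rather than reported settings.

```latex
K(\mathbf{x}, \mathbf{x}') = \left( \gamma \, \mathbf{x}^{\top} \mathbf{x}' + r \right)^{2}
```

Expanding the square produces products of feature pairs, so a kernel of this form lets the classifier weigh conjunctions of two features (for example, a POS tag on the left together with an NER tag on the right) without listing them explicitly.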

18 The inner (phrase) classifier
Feature                                                                                   Description
POS, NER, BP                                                                              of each phrase word
morphological features (Boolean): conjunction, possessive, determiner, prepositions      of each phrase word
length                                                                                    number of words
n-gram score                                                                              2-4 grams

19 Experiments & results Arabic – 240 document pairs (165K words) – 5 iterations

20 Experiments & results
                    negative pairs   positive pairs     unique paraphrases   unlabeled pairs
Initialization      22,885,104       66,317             -                    19,480
After iteration 1   23,799,787       68,043 (+1,726)    -                    3,166,935
After iteration 2   24,759,791       71,800 (+3,757)    954                  2,790,574
After iteration 3   25,349,489       74,423 (+2,623)    416                  2,198,253
After iteration 4   26,221,889       74,874 (+451)      331                  1,557,931
After iteration 5   26,900,833       74,975 (+101)      72                   878,987
Total                                                   1,773

21 Evaluation – 2 native speakers – Pairs are provided with their context – 4 labels: paraphrases; entailment (e.g. a magnitude 6.0 earthquake → the quiver); related (e.g. San Diego ~ Los Angeles); wrong (e.g. a poor and little-developed province ≠ its resource-rich northwestern province)

22 Manual evaluation
Length   Evaluated   Paraphrases   Entailment   Related   Wrong   Precision
2        120         49            12           25        34      71%
3        95          45            10           11        31      69%
4        70          26            4            5         35      50%
5        50          24            2            7         20      66%
Total    335         144           28           48        120     66%
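The slide does not define how precision is computed. The reported percentages are reproduced to within one point if every label other than "wrong" (paraphrases, entailment, related) counts as correct. That reading is an inference from the numbers, not something the slide states, and the check below is only a sketch of it.

```python
# Rows copied from the manual-evaluation slide:
# row -> (evaluated, paraphrases, entailment, related, wrong, reported precision %)
rows = {
    "2": (120, 49, 12, 25, 34, 71),
    "3": (95, 45, 10, 11, 31, 69),
    "4": (70, 26, 4, 5, 35, 50),
    "5": (50, 24, 2, 7, 20, 66),
    "Total": (335, 144, 28, 48, 120, 66),
}


def precision(evaluated: int, paraphrases: int, entailment: int, related: int) -> float:
    """Assumed definition: share of evaluated pairs not labeled 'wrong'."""
    return 100.0 * (paraphrases + entailment + related) / evaluated


computed = {name: precision(*values[:4]) for name, values in rows.items()}
for name, values in rows.items():
    print(f"row {name}: computed {computed[name]:.1f}%, reported {values[5]}%")
```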

23 Inner classifier, morphological features
Experiment                        Extracted pairs   Precision
Outer+Inner                       653               68%
Outer                             7405              23%
Outer+Inner+no-morph-features     211               62%
Tested on 40 document pairs. Evaluation of 200 pairs.

24 Conclusions – We will try to better understand the effect of the morphological features on Arabic – Utilize the paraphrases for improving an Arabic-English translation system
          corpus size    extracted document pairs   pairs used in paraphrasing   words used in inference   unique paraphrases   Precision
Arabic    ~20,000,000    690                        240                          165,369                   1,773                66%
English   ~1,000,000     294                        40                           11,600                    525                  63%

25 Thank you kfirbar@post.tau.ac.il

26 Manual evaluation (English)
Length   Evaluated   Paraphrases   Entailment   Related   Wrong   Precision
2        120         23            11           37        49      59%
3        60          28            6            9         17      71%
4        50          15            8            8         21      62%
5        25          8             5            2         10      60%
Total    255         74            30           56        97      63%

27 Experiments & results (English)
                    negative pairs   positive pairs     unique paraphrases   unlabeled pairs
Initialization      876,947          32,972             -                    3,597
After iteration 1   960,840          33,840 (+868)      -                    86,648
After iteration 2   1,058,970        35,473 (+1,633)    230                  58,648
After iteration 3   1,109,746        36,667 (+1,194)    177                  21,332
After iteration 4   1,127,643        37,006 (+339)      94                   6,677
After iteration 5   1,128,475        37,058 (+52)       24                   1,490
Total                                                   525

28 Co-training – “was 76 kilometers southeast of Hualien according to the” – “about 76 km southeast of Hualien on the eastern” – Labels: Inner (Phrase), Outer (Context)


