Deriving Paraphrases for Highly Inflected Languages from Comparable Documents Kfir Bar, Nachum Dershowitz Tel Aviv University, Israel.


Paraphrases
Phrase level:
– I exposed my secret about my personal life
– I spilled the beans and told Jacky I loved her
Sentence level:
– China did not change its policy toward Taiwan
– Beijing’s policy toward Taiwan remains unchanged

Motivation: the MT coverage problem
[Figure: Arabic n-grams covered, plotted against parallel corpus size]

Related work on paraphrasing
– Continuing our previous work on Arabic synonyms (Bar and Dershowitz, AMTA, 2010)
– Using a parallel corpus (Callison-Burch et al., 2006)
– Using a monolingual corpus (Marton et al., 2009)
– Using comparable documents (Wang and Callison-Burch, 2011)

Why Arabic?
Being a Semitic language, Arabic is highly inflected.
Example: وتدرسها = "and she learns it"
– one word combining a conjunction, a root and pattern, and a direct object

Extracting paraphrases
– Inspired by "Extracting Paraphrases from a Parallel Corpus", Regina Barzilay and Kathleen R. McKeown (2001)
– Working on Arabic comparable documents

Preparing the corpus
– Using Arabic Gigaword
– We automatically paired documents that:
  – were published on the same day
  – maximize the cosine similarity over the lemma-frequency vector
[Figure: AFP and XIN documents dated 24.12.2002, 25.12.2002 and 27.12.2002, linked by maximum cosine similarity]
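The pairing rule on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the document tuples, the function names, and the choice to search from one source into the other without a similarity threshold are all assumptions, and English lemmas stand in for Arabic ones.

```python
import math
from collections import Counter

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse lemma-frequency vectors."""
    dot = sum(count * vec_b[lemma] for lemma, count in vec_a.items() if lemma in vec_b)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def pair_documents(source_a, source_b):
    """For each document in source_a, pick the same-day document in source_b
    with the highest cosine similarity.

    Each document is a hypothetical (doc_id, date, lemmas) tuple.
    """
    pairs = []
    for id_a, date_a, lemmas_a in source_a:
        vec_a = Counter(lemmas_a)
        best = None
        for id_b, date_b, lemmas_b in source_b:
            if date_b != date_a:
                continue  # only documents published on the same day are candidates
            score = cosine(vec_a, Counter(lemmas_b))
            if best is None or score > best[1]:
                best = (id_b, score)
        if best is not None:
            pairs.append((id_a, best[0], best[1]))
    return pairs

# Toy data (assumed, not from the corpus).
afp = [("afp1", "2002-12-24", ["earthquake", "hit", "taiwan"])]
xin = [
    ("xin1", "2002-12-24", ["earthquake", "taiwan", "occur"]),
    ("xin2", "2002-12-24", ["policy", "china"]),
    ("xin3", "2002-12-25", ["earthquake", "hit", "taiwan"]),
]
pairs = pair_documents(afp, xin)
print(pairs)
```

In the toy data, xin3 has the same lemmas as afp1 but is skipped because it was published on a different day, so the same-day constraint takes priority over similarity.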

Preparing the corpus
– 690 document pairs
– Manual evaluation by two Arabic speakers:
  – randomly selected 120 document pairs
  – question: "Do both documents discuss the same event?"

Preprocessing
AMIRAN [Diab et al., to appear] is a tool for finding context-sensitive morpho-syntactic information:
– Segmentation
– Diacritized lemma
– Stem
– Full part-of-speech tag
– Base-phrase tag
– Named-entity-recognition (NER) tag

Extracting paraphrases: co-training technique
[Diagram: extracting pairs of phrases, then co-training iterations (context, phrase), yielding paraphrases; the diagram also shows alignment and paraphrases boxes marked ✗ ✗ ✔]

Extracting pairs of phrases
Phrases:
– contain at least one non-functional word
– do not break a base-phrase in the middle
Example sentences:
– A magnitude 6.0 earthquake on the Richter scale occurred at 11:24 a.m.
– A strong undersea earthquake hit eastern Taiwan Wednesday
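The two phrase constraints can be illustrated with a short sketch. Everything beyond the two constraints is an assumption: the token format, the BIO-style base-phrase tags, the function-word flag, and the maximum length of five words are chosen only for illustration, and "do not break a base-phrase" is read here as "span boundaries must coincide with base-phrase boundaries".

```python
def candidate_phrases(tokens, max_len=5):
    """Enumerate word spans that satisfy both slide constraints.

    tokens: list of (word, chunk_tag, is_function_word), where chunk_tag is a
    BIO-style base-phrase tag such as "B-NP" or "I-NP" (assumed format).
    """
    phrases = []
    n = len(tokens)
    for start in range(n):
        # A span may not begin inside a base-phrase.
        if tokens[start][1].startswith("I-"):
            continue
        for end in range(start + 1, min(start + max_len, n) + 1):
            # A span may not end while the base-phrase continues.
            if end < n and tokens[end][1].startswith("I-"):
                continue
            span = tokens[start:end]
            # Require at least one non-functional word.
            if all(is_function for _, _, is_function in span):
                continue
            phrases.append(" ".join(word for word, _, _ in span))
    return phrases

# Toy sentence with hand-assigned tags (assumed, not tool output).
sentence = [
    ("a", "B-NP", True),
    ("strong", "I-NP", False),
    ("earthquake", "I-NP", False),
    ("hit", "B-VP", False),
    ("Taiwan", "B-NP", False),
]
phrases = candidate_phrases(sentence)
print(phrases)
```

Under this reading, "a strong" is rejected because it cuts the noun phrase short, while "a strong earthquake hit" is allowed because it covers whole base-phrases.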

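Co-training
Each candidate is split into an inner part (the phrase) and an outer part (its context). Example, in Buckwalter transliteration:
– dEA xAfyyr swlAnA Almnsq Al>Ely llsyAsp AlxArjyp wAl>mnyp ("Javier Solana, the High Coordinator for Foreign and Security Policy, called ...")
– dEA xAfyyr swlAnA Almmvl Al>ElY llsyAsp AlxArjyp fy ("Javier Solana, the High Representative for Foreign Policy in ..., called ...")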

Extracting paraphrases
We maintain two sets of instances:
– unlabeled
– labeled: positive = paraphrases, negative = NOT paraphrases

Single iteration (over the unlabeled and labeled sets)
1. Training Outer
2. Using Outer
3. Training Inner
4. Using Inner
– Deterministic labeling
– Next iteration
– Output: paraphrases
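The four numbered steps of a single iteration can be sketched as a generic co-training loop body. This is an illustration under stated assumptions, not the authors' implementation: the instance format, the function names, and the toy "memorizer" that stands in for the trained classifiers are all hypothetical, and the deterministic-labeling step is left out because the slide does not show exactly where it feeds in.

```python
def apply_classifier(classify, labeled, unlabeled):
    """Move every instance the classifier decides on from `unlabeled` to `labeled`."""
    newly_labeled = {}
    for instance in unlabeled:
        label = classify(instance)  # True, False, or None when undecided
        if label is not None:
            newly_labeled[instance] = label
    labeled.update(newly_labeled)
    unlabeled.difference_update(newly_labeled)
    return newly_labeled

def cotrain_iteration(labeled, unlabeled, train_outer, train_inner):
    outer = train_outer(labeled)                             # 1. Training Outer
    by_outer = apply_classifier(outer, labeled, unlabeled)   # 2. Using Outer
    inner = train_inner(labeled)                             # 3. Training Inner
    by_inner = apply_classifier(inner, labeled, unlabeled)   # 4. Using Inner
    return by_outer, by_inner

def train_memorizer(labeled, view):
    """Toy stand-in for a real classifier: remembers which views were seen
    with which label and abstains on anything else."""
    positive = {view(inst) for inst, label in labeled.items() if label}
    negative = {view(inst) for inst, label in labeled.items() if not label}
    def classify(instance):
        key = view(instance)
        if key in positive and key not in negative:
            return True
        if key in negative and key not in positive:
            return False
        return None
    return classify

# An instance is (phrase_pair, context_pair); the values are placeholders.
inner_view = lambda instance: instance[0]
outer_view = lambda instance: instance[1]

labeled = {
    (("p1", "q1"), ("c1", "d1")): True,
    (("p2", "q2"), ("c2", "d2")): False,
}
unlabeled = {
    (("p3", "q3"), ("c1", "d1")),
    (("p3", "q3"), ("c9", "d9")),
    (("p5", "q5"), ("c7", "d7")),
}
by_outer, by_inner = cotrain_iteration(
    labeled, unlabeled,
    lambda data: train_memorizer(data, outer_view),
    lambda data: train_memorizer(data, inner_view),
)
print(by_outer, by_inner, unlabeled)
```

In the toy run, the outer classifier labels the pair (p3, q3) because it appears in a known positive context, and the inner classifier, retrained on the enlarged labeled set, then labels the same pair in a new context. That hand-off between the two views is the point of co-training.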

Deterministic labeling of potential paraphrases
Labeling similar phrases as positive. Example document pair:
Document 1: A strong undersea earthquake hit eastern Taiwan Wednesday, and there are no immediate reports of damage or casualties, according to reports from Taipei. The earthquake registering 6.0 on the Richter scale struck at 11:24 a.m. local time (0324 GMT), was about 76 km southeast of Hualien on the eastern coast, at a depth of 4 km, Taiwan's Central Weather Bureau said in a statement.
Document 2: A magnitude 6.0 earthquake on the Richter scale occurred at 11:24 a.m. Wednesday in the waters off Hualian, eastern Taiwan, with no immediate reports of casualties or property damage, the Central Weather Bureau (CWB) said. The quake's epicenter was 76 kilometers southeast of Hualien, according to the CWB.

Deterministic labeling of potential paraphrases
Negative examples are also labeled:
– in the first iteration (single words): words that don’t have similar gloss values
– not used in subsequent iterations

The outer (context) classifier
Features (feature: description):
– lemma, POS, NER, BP: of each context word
– gloss-match rate: left and right
– lemma-match rate: left and right
Using SVM, quadratic kernel

The inner (phrase) classifier
Features (feature: description):
– POS, NER, BP: of each phrase word
– morphological features (Boolean: conjunction, possessive, determiner, prepositions): of each phrase word
– length: number of words
– n-gram score: 2-4 grams

Experiments & results
Arabic:
– 240 document pairs (165K words)
– 5 iterations

Experiments & results (Arabic)
                    negative pairs   positive pairs     unique paraphrases   unlabeled pairs
Initialization      22,885,104       66,317
After iteration 1   23,799,787       68,043 (+1,726)    -                    3,166,935
After iteration 2   24,759,791       71,800 (+3,757)    954                  2,790,574
After iteration 3   25,349,489       74,423 (+2,623)    416                  2,198,253
After iteration 4   26,221,889       74,874 (+451)      331                  1,557,931
After iteration 5   26,900,833       74,975 (+101)      72                   878,987
Total                                                   1,773
The Initialization row also lists 19,480; the transcript does not show which of the last two columns it belongs to.

Evaluation
– 2 native speakers
– Pairs are provided with their context
– 4 labels:
  – paraphrases
  – entailment (e.g. a magnitude 6.0 earthquake → the quiver)
  – related (e.g. San Diego ~ Los Angeles)
  – wrong (e.g. a poor and little-developed province ≠ its resource-rich northwestern province)

Manual evaluation (Arabic)
Length   Evaluated   Paraphrases   Entailment   Related   Wrong   Precision
2        120         49            12           25        34      71%
3        95          45            10           11        31      69%
4        70          26            4            5         35      50%
5        50          24            2            7         20      66%
Total    335         144           28           48        120     66%
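The slides do not define how the Precision column is computed. As a worked check, and only as an observation about the numbers rather than the authors' stated definition, each reported figure lies within one percentage point of the share of evaluated pairs labeled paraphrases, entailment, or related. The sketch below recomputes that share from the Arabic counts; the function name is hypothetical.

```python
# (evaluated, paraphrases, entailment, related, wrong, reported precision in %)
arabic_rows = {
    "2": (120, 49, 12, 25, 34, 71),
    "3": (95, 45, 10, 11, 31, 69),
    "4": (70, 26, 4, 5, 35, 50),
    "5": (50, 24, 2, 7, 20, 66),
    "Total": (335, 144, 28, 48, 120, 66),
}

def non_wrong_share(evaluated, paraphrases, entailment, related):
    """Percentage of evaluated pairs that received one of the three non-wrong labels."""
    return 100.0 * (paraphrases + entailment + related) / evaluated

shares = {}
for length, (evaluated, para, entail, related, wrong, reported) in arabic_rows.items():
    shares[length] = non_wrong_share(evaluated, para, entail, related)
    print(f"length {length}: computed {shares[length]:.1f}%, reported {reported}%")
```

The rounding on the slide is not uniform (some values look truncated, others rounded), so the comparison is only approximate.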

Inner classifier, morphological features
Experiment                        Extracted pairs   Precision
Outer+Inner                       653               68%
Outer                             7405              23%
Outer+Inner, no morph features    211               62%
– Tested on 40 document pairs
– Evaluation of 200 pairs

Conclusions
– We will try to better understand the effect of the morphological features on Arabic
– Use the paraphrases to improve an Arabic-English translation system
                               Arabic        English
corpus size                    ~20,000,000   ~1,000,000
extracted document pairs       690           294
pairs used in paraphrasing     240           40
words used in inference        165,369       11,600
unique paraphrases             1,773         525
Precision                      66%           63%

Thank you kfirbar@post.tau.ac.il

Manual evaluation (English)
Length   Evaluated   Paraphrases   Entailment   Related   Wrong   Precision
2        120         23            11           37        49      59%
3        60          28            6            9         17      71%
4        50          15            8            8         21      62%
5        25          8             5            2         10      60%
Total    255         74            30           56        97      63%

Experiments & results (English)
                    negative pairs   positive pairs     unique paraphrases   unlabeled pairs
Initialization      876,947          32,972
After iteration 1   960,840          33,840 (+868)      -                    86,648
After iteration 2   1,058,970        35,473 (+1,633)    230                  58,648
After iteration 3   1,109,746        36,667 (+1,194)    177                  21,332
After iteration 4   1,127,643        37,006 (+339)      94                   6,677
After iteration 5   1,128,475        37,058 (+52)       24                   1,490
Total                                                   525
The Initialization row also lists 3,597; the transcript does not show which of the last two columns it belongs to.

Co-training (English example)
Each candidate is split into an inner part (the phrase) and an outer part (its context):
– was 76 kilometers southeast of Hualien according to the
– about 76 km southeast of Hualien on the eastern

