Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein.

1 Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein

2 Outline: Multilingual text; Problem definition; Multilingual-text alignment; Compression of multilingual texts using alignment (algorithm, results); Future work

3 Multilingual text: The same content in two or more (natural) languages, e.g. legislative texts of the European Union in all EU languages. Example: "Subject: Supplies of military equipment to Iraq" / "Objet: Livraisons de matériel militaire à l’Irak"

4 Problem definition: How can multilingual texts be compressed more efficiently than by compressing each language separately? Can semantic equivalence be exploited to reduce aggregate corpus size?

5 Multilingual-text alignment (1): Mapping of equivalent text fragments to each other, at the paragraph/sentence and word/phrase levels. Algorithms exist for both levels (tokenization, lemmatization, shallow parsing). The alignment may be partial.

6 Multilingual-text alignment (2)

7 Linear alignment: Given two parallel fragments S and T, the linear alignment of a token t_j in T is the token s_i in S such that:
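The defining formula for this slide did not survive in the transcript. A minimal sketch of one natural reading is given here, assuming that "linear" means proportional position: token j of T maps to the token of S at the same relative position, i = round(j · |S| / |T|). The 0-based indexing, the rounding rule, the clamping, and the function name `linear_alignment` are all assumptions, not the authors' stated definition.

```python
def linear_alignment(j: int, len_s: int, len_t: int) -> int:
    """Index in S at the same relative position as token j in T.

    Assumption: 0-based positions and a proportional mapping,
    i = round(j * len_s / len_t), clamped to the valid range of S.
    """
    if len_s <= 0 or len_t <= 0:
        raise ValueError("both fragments must be non-empty")
    # Integer round-half-up avoids floating-point surprises.
    i = (2 * j * len_s + len_t) // (2 * len_t)
    return min(max(i, 0), len_s - 1)
```

With fragments of equal length the mapping is the identity; when S is twice as long as T, token j of T lands near position 2j of S.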

8 Correct vs. linear alignment

9 Offset from linear alignment: The signed distance between the correct and the linear alignment. Values are usually very small (mostly within [-10, 10]).
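The offset can be sketched as a plain subtraction. The snippet below assumes the sign convention offset = correct index minus linear index, and assumes the proportional, 0-based linear alignment i = round(j · |S| / |T|); neither is spelled out in the transcript, and `alignment_offsets` is a hypothetical name.

```python
def alignment_offsets(correct: list[int], len_s: int) -> list[int]:
    """Signed offsets of the correct alignment from the linear one.

    correct[j] is the index in S of the token aligned with token j of T.
    Assumptions: 0-based positions, proportional linear alignment,
    and offset = correct - linear.
    """
    len_t = len(correct)
    offsets = []
    for j, i in enumerate(correct):
        linear = min((2 * j * len_s + len_t) // (2 * len_t), len_s - 1)
        offsets.append(i - linear)
    return offsets


# "lower bound" vs. "borne inférieure": borne aligns with bound (index 1),
# inférieure aligns with lower (index 0), so the word order is swapped.
print(alignment_offsets([1, 0], len_s=2))
```

A monotone word-for-word translation yields all zeros, which is why the offsets cluster around small values.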

10 Compression of multilingual texts using alignment: Basic idea (1): Compress by replacing words/phrases with pointers to their translations within the other text; the original text is restored using a bilingual dictionary. Store offsets relative to the linear alignment: small values → small number of values → efficient encoding.

11 Compression of multilingual texts using alignment: Basic idea (2): Store the number of words in the pointed fragment, since it might be a multi-word phrase (bilan → balance sheet). A single pointer may also replace a multi-word phrase (matériel militaire → pointer to military equipment; chemin de fer → railway).

12 Basic scheme: Example (option 1). Prefixes: 0 = word, 1 = pointer, written 1(offset, length)

13 Basic scheme: Example (option 2): matériel militaire → pointer to military equipment. Offset relative to first words.
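The basic scheme can be illustrated with a small decoder. This is a sketch under stated assumptions rather than the authors' implementation: items are tuples `(0, word)` or `(1, offset, length)`, the target length `len_t` is assumed to be known to the decoder, the linear alignment is assumed to be proportional and 0-based, and the dictionary is a toy mapping from source phrases to a single translation.

```python
def linear(j: int, len_s: int, len_t: int) -> int:
    # Assumed proportional linear alignment, rounded half up and clamped.
    return min((2 * j * len_s + len_t) // (2 * len_t), len_s - 1)


def decode(encoded, source, len_t, dictionary):
    """Restore the target text from words and pointers into the source."""
    target = []
    for item in encoded:
        if item[0] == 0:            # prefix 0: literal word
            target.append(item[1])
        else:                       # prefix 1: pointer (offset, length)
            _, offset, length = item
            # Offset is taken relative to the first word of the phrase.
            start = linear(len(target), len(source), len_t) + offset
            if start < 0 or start + length > len(source):
                raise ValueError("pointer falls outside the source fragment")
            phrase = " ".join(source[start:start + length])
            target.extend(dictionary[phrase].split())
    return target


source = ["Supplies", "of", "military", "equipment", "to", "Iraq"]
encoded = [(0, "Livraisons"), (0, "de"), (1, 0, 2), (0, "à"), (0, "l’Irak")]
dictionary = {"military equipment": "matériel militaire"}  # toy dictionary
print(" ".join(decode(encoded, source, len_t=6, dictionary=dictionary)))
```

Here the pointer `(1, 0, 2)` stands for two source words starting exactly at the linear alignment, so a single pointer restores the two-word phrase "matériel militaire".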

14 Complication: Words with multiple possible translations: Sometimes there is more than one possible translation per word, e.g. equipment: 1. équipement, 2. matériel. The correct translation must be encoded within the pointer, so the index of the translation is stored.

15 Complication: Morphological variants (1): The bilingual dictionary must use one morphological form (lemma): go → aller stands for {go, went, gone, going} → {aller, vais, vas, va, etc.}

16 Complication: Morphological variants (2): Texts include inflected forms. More than one possible lemma (bound → {bind, bound}) → must indicate the correct lemmas for S to enable dictionary lookup. Several variants per lemma → must indicate the correct inflections of the translation words to enable restoration of T.

17 Complication: Morphological variants (3): Pointer format: 1(offset, length, lemma(s), translation, variant(s)), with multiple values for multiple words. Example: English "lower bound", French "borne inférieure"; the French words are encoded as 1(1,1,0,2,0) 1(-1,1,0,4,1).

18 Optimizations: No encoding for a single option (relevant for all 3 dictionaries). Sort options by descending order of frequency (large number of small values → better encoding). Encode length as (length - 1), since length is never 0.
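The three optimizations can be sketched as small helpers. The function names and the use of `None` to mean "nothing is written" are illustrative assumptions; ties in frequency are broken alphabetically only to keep the ordering deterministic.

```python
from collections import Counter


def rank_options(observed: list[str]) -> list[str]:
    """Order options by descending frequency so frequent ones get small indices."""
    counts = Counter(observed)
    return [opt for opt, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def option_index(options: list[str], chosen: str):
    """Index to store, or None when there is a single option and nothing is encoded."""
    if len(options) == 1:
        return None
    return options.index(chosen)


def stored_length(length: int) -> int:
    """Lengths are never 0, so length - 1 is stored instead."""
    if length < 1:
        raise ValueError("length must be at least 1")
    return length - 1


# Toy usage counts for the translations of "equipment".
options = rank_options(["matériel", "équipement", "matériel"])
print(options, option_index(options, "équipement"), stored_length(2))
```

Sorting by frequency concentrates the stored indices on small values such as 0 and 1, which a Huffman code can then represent with short codewords.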

19 Binary encoding (1): Use 3 Huffman codes: H_1 for words + pointer prefix; H_2 for absolute values of offsets (a sign bit follows, except for 0); H_3 for lengths + indices.
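As an illustration of the H_2 code, the sketch below builds a textbook Huffman code over absolute offsets and appends a sign bit for non-zero values. The bit polarity (1 for negative) and the sample offsets are assumptions for the example; the slides only state that a sign bit follows.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(symbols) -> dict:
    """Build a Huffman code (symbol -> bit string) from observed symbols."""
    freq = Counter(symbols)
    if not freq:
        raise ValueError("no symbols to encode")
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(f, next(tie), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def encode_offset(offset: int, h2: dict) -> str:
    """H_2(|offset|) followed by a sign bit, which is omitted for 0."""
    bits = h2[abs(offset)]
    if offset != 0:
        bits += "1" if offset < 0 else "0"  # assumed polarity: 1 = negative
    return bits


offsets = [0, 0, 0, 0, 1, -1, 2, 0]  # made-up sample, mostly zeros
h2 = huffman_code(abs(o) for o in offsets)
print([encode_offset(o, h2) for o in offsets])
```

Coding the absolute value and the sign separately halves the alphabet of H_2, and offset 0, typically the most frequent value, costs no sign bit at all.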

20 Binary encoding (2): Words: H_1(lemma) [H_3(variant)]. Pointers, with l = length and m = number of words in translation: H_1(ptr_prefix) H_2(offset) [sign_bit] H_3(l - 1) [H_3(lemma_0)] … [H_3(lemma_{l-1})] [H_3(translation)] [H_3(variant_0)] … [H_3(variant_{m-1})]
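The pointer layout can be made concrete by listing the fields in the order they would be written. The sketch below stays symbolic: each field is a `(code, value)` pair instead of actual bits, `None` marks an optional field that is skipped because only one option exists, and the sign value 1 for negative offsets is an assumption.

```python
def pointer_fields(offset, lemma_idxs, translation_idx, variant_idxs):
    """Ordered (code, value) fields of one pointer; optional ones may be absent.

    lemma_idxs has one entry per source word (l entries);
    variant_idxs has one entry per word of the translation (m entries).
    """
    length = len(lemma_idxs)
    fields = [("H1", "ptr_prefix"), ("H2", abs(offset))]
    if offset != 0:
        fields.append(("sign", int(offset < 0)))  # assumed: 1 = negative
    fields.append(("H3", length - 1))             # length stored as l - 1
    fields += [("H3", i) for i in lemma_idxs if i is not None]
    if translation_idx is not None:
        fields.append(("H3", translation_idx))
    fields += [("H3", v) for v in variant_idxs if v is not None]
    return fields


# The pointer 1(1,1,0,2,0) from the "borne inférieure" example.
print(pointer_fields(1, [0], 2, [0]))
```

A pointer whose lemma, translation, and variant are all unambiguous shrinks to the prefix, the offset, and the length.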

21 Empirical results: English-French responsa collection of the European Parliament (ARCADE project). Sizes do not include the codes for HWORD and TRANS, nor the dictionaries for TRANS. Dictionaries exist anyway in large IR systems. Heaps' law: dictionary size is αN^β, where 0.4 ≤ β ≤ 0.6, so for large corpora the size is negligible.
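The Heaps' law argument can be checked numerically. In the sketch below, α = 10 and β = 0.5 are illustrative values only (β is chosen inside the quoted range); they are not taken from the presentation.

```python
def heaps_dictionary_size(n_tokens: int, alpha: float, beta: float) -> float:
    """Estimated number of distinct dictionary entries: alpha * N**beta."""
    return alpha * n_tokens ** beta


def dictionary_share(n_tokens: int, alpha: float = 10.0, beta: float = 0.5) -> float:
    """Dictionary entries per corpus token (alpha and beta are hypothetical)."""
    return heaps_dictionary_size(n_tokens, alpha, beta) / n_tokens


for n in (10**6, 10**8):
    print(n, dictionary_share(n))
```

Because β < 1, the ratio falls as N^(β - 1), so the dictionary's share of the total shrinks as the corpus grows.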

22 Empirical results (2)

23 Future work: Other test corpora and other languages; Compress target using lemmatized source; Improve encoding; Bidirectional scheme; Pattern matching within compressed text; Improved model for k languages

