Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein.

Outline Multilingual text Problem definition Multilingual-text alignment Compression of multilingual texts using alignment –Algorithm –Results Future work

Multilingual text Same contents in two or more (natural) languages –Legislative texts of the European Union in all EU languages Subject: Supplies of military equipment to Iraq Objet: Livraisons de matériel militaire à l’Irak

Problem definition How can multilingual texts be compressed more efficiently relative to compression of each language separately? –Can semantic equivalence be exploited to reduce aggregate corpus size?

Multilingual-text alignment (1) Mapping of equivalent text fragments to each other –Paragraph/sentence and word/phrase levels –Algorithms for both levels Tokenization, lemmatization, shallow parsing –Alignment possibly partial

Multilingual-text alignment (2)

Linear alignment Given two parallel fragments S and T, the linear alignment of a token t_j in T is the token s_i in S such that:
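The defining formula for this slide did not survive in the transcript. As a rough illustration only, the sketch below assumes the usual proportional definition: position j in T maps to the position in S at the same relative distance from the start, rounded to the nearest index. The function name `linear_alignment`, the 0-based indexing, and the rounding rule are assumptions, not taken from the slides.

```python
def linear_alignment(j: int, len_s: int, len_t: int) -> int:
    """Return the 0-based index i in S that position j of T maps to proportionally.

    Assumption: i = round(j * |S| / |T|), clamped to the last index of S.
    """
    if len_s <= 0 or len_t <= 0:
        raise ValueError("both fragments must be non-empty")
    if not 0 <= j < len_t:
        raise ValueError("j must be a valid index into T")
    # Integer arithmetic for round-half-up of j * len_s / len_t.
    i = (2 * j * len_s + len_t) // (2 * len_t)
    return min(i, len_s - 1)
```

When both fragments have the same length, every token maps to the token at the same position; when the lengths differ, positions are stretched or squeezed proportionally.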

Correct vs. linear alignment

Offset from linear alignment Signed distance between correct and linear alignments –Usually very small values (mostly [-10, 10])
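A small sketch of how such offsets might be computed, using the sentence pair from the "Multilingual text" slide. The proportional rounding rule, the sign convention (correct minus linear), the tokenization, and the hand-made alignment are all assumptions made for illustration.

```python
def linear_index(j: int, len_s: int, len_t: int) -> int:
    """Assumed linear alignment: round(j * |S| / |T|), clamped to S."""
    return min((2 * j * len_s + len_t) // (2 * len_t), len_s - 1)


def offsets_from_linear(correct: dict, len_s: int, len_t: int) -> dict:
    """Map each aligned T index j to (correct S index) - (linear S index)."""
    return {j: i - linear_index(j, len_s, len_t) for j, i in correct.items()}


# Hypothetical tokenization and manual alignment of the example pair.
S = ["Subject", ":", "Supplies", "of", "military", "equipment", "to", "Iraq"]
T = ["Objet", ":", "Livraisons", "de", "matériel", "militaire", "à", "l'", "Irak"]
correct = {0: 0, 1: 1, 2: 2, 3: 3, 4: 5, 5: 4, 6: 6, 8: 7}  # T index -> S index

offsets = offsets_from_linear(correct, len(S), len(T))
```

Even with the adjective-noun swap between "military equipment" and "matériel militaire", every offset in this toy example stays within one position of the linear guess.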

Compression of multilingual texts using alignment: Basic idea (1) Compress by replacing words/phrases with pointers to their translations within the other text –Original text restored using bilingual dictionary Store offsets relative to linear alignment –Small values → small number of values → efficient encoding

Compression of multilingual texts using alignment: Basic idea (2) Store number of words in pointed fragment –Might be a multi-word phrase –bilan → balance sheet Single pointer may replace multi-word phrase –matériel militaire → pointer to military equipment –chemin de fer → railway

Basic scheme: Example (option 1) Prefixes: 0 - word, 1 - pointer 1(offset, length)

Basic scheme: Example (option 2) matériel militaire → pointer to military equipment Offset relative to first words
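The basic scheme can be sketched roughly as follows, in the spirit of option 2 (one pointer per phrase, offset measured from the first word). The item shapes `(0, word)` and `(1, offset, length)` follow the prefixes on the slides; the function names, the alignment format, the rounding rule for linear alignment, and the toy dictionary are assumptions made for illustration.

```python
def linear_index(j: int, len_s: int, len_t: int) -> int:
    """Assumed linear alignment: round(j * |S| / |T|), clamped to S."""
    return min((2 * j * len_s + len_t) // (2 * len_t), len_s - 1)


def encode(S, T, alignment):
    """alignment maps a T start index to (t_len, s_start, s_len)."""
    out, j = [], 0
    while j < len(T):
        if j in alignment:
            t_len, s_start, s_len = alignment[j]
            offset = s_start - linear_index(j, len(S), len(T))
            out.append((1, offset, s_len))   # pointer into S
            j += t_len
        else:
            out.append((0, T[j]))            # literal word
            j += 1
    return out


def decode(S, encoded, len_t, dictionary):
    """Rebuild T from S, the encoded stream, and a bilingual dictionary."""
    T = []
    for item in encoded:
        if item[0] == 0:
            T.append(item[1])
        else:
            _, offset, length = item
            start = linear_index(len(T), len(S), len_t) + offset
            T.extend(dictionary[tuple(S[start:start + length])])
    return T


S = ["Subject", ":", "Supplies", "of", "military", "equipment", "to", "Iraq"]
T = ["Objet", ":", "Livraisons", "de", "matériel", "militaire", "à", "l'", "Irak"]
# Hypothetical alignment and one-translation-per-phrase dictionary.
alignment = {2: (1, 2, 1), 4: (2, 4, 2), 8: (1, 7, 1)}
dictionary = {("Supplies",): ["Livraisons"],
              ("military", "equipment"): ["matériel", "militaire"],
              ("Iraq",): ["Irak"]}

encoded = encode(S, T, alignment)
restored = decode(S, encoded, len(T), dictionary)
```

The decoder needs the length of T to recompute the linear alignment, so this sketch assumes that length is stored alongside the encoded stream.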

Complication: Words with multiple possible translations Sometimes more than one possible translation per word –equipment 1. équipement 2. matériel Must encode correct translation within pointer –Store index of translation

Complication: Morphological variants (1) Bilingual dictionary must use one morphological form (lemma) –go → aller stands for: {go, went, gone, going} → {aller, vais, vas, va, etc.}

Complication: Morphological variants (2) Texts include inflected forms –More than one possible lemma (bound → {bind, bound}) → must indicate correct lemmas for S to enable dictionary lookup –Several variants per lemma → must indicate correct inflections of translation words to enable restoration of T

Complication: Morphological variants (3) Example: lower bound ↔ borne inférieure –borne inférieure encoded as 1(1,1,0,2,0) 1(-1,1,0,4,1) Pointer format: 1(offset, length, lemma(s), translation, variant(s)) –Multiple values for multiple words
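To make the five pointer fields concrete, here is a hedged decoding sketch for the two pointers on this slide. The three dictionaries are invented toy data, chosen only so that the indices 0, 2, 4 and 1 from the slide resolve to the right words; the real dictionaries, their ordering, and the rounding rule for linear alignment are not given in the transcript.

```python
# Hypothetical toy dictionaries (contents and ordering are assumptions).
LEMMAS = {"bound": ["bound", "bind"], "lower": ["low", "lower"]}
TRANSLATIONS = {"bound": ["limite", "bond", "borne"],
                "low": ["bas", "faible", "grave", "petit", "inférieur"]}
VARIANTS = {"borne": ["borne", "bornes"],
            "inférieur": ["inférieur", "inférieure"]}


def decode_pointer(S, j, len_t, pointer):
    """Resolve one single-word pointer (offset, length, lemma, translation, variant)."""
    offset, length, lemma_idx, trans_idx, variant_idx = pointer
    if length != 1:
        raise ValueError("this sketch only handles single-word pointers")
    # Assumed linear alignment: round(j * |S| / |T|), clamped to S.
    linear = min((2 * j * len(S) + len_t) // (2 * len_t), len(S) - 1)
    source_word = S[linear + offset]                 # follow the offset into S
    lemma = LEMMAS[source_word][lemma_idx]           # pick the correct lemma
    translation = TRANSLATIONS[lemma][trans_idx]     # pick the translation lemma
    return VARIANTS[translation][variant_idx]        # pick the inflected form


S = ["lower", "bound"]
pointers = [(1, 1, 0, 2, 0), (-1, 1, 0, 4, 1)]
T = [decode_pointer(S, j, len(pointers), p) for j, p in enumerate(pointers)]
```

The offsets +1 and -1 reflect the word-order swap: the first French word points one position to the right of its linear alignment, the second one position to the left.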

Optimizations No encoding for single option –Relevant for all 3 dictionaries Sort options by descending order of frequencies –Large number of small values  better encoding Encode length as (length – 1) –length never 0
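The three optimizations can be sketched as small helpers. The frequency counts below are made up for illustration, and the helper names are hypothetical; only the "equipment" entry with its two translations comes from the slides.

```python
from collections import Counter


def rank_options(options, counts):
    """Sort options so the most frequent one gets index 0."""
    return sorted(options, key=lambda o: -counts[o])


def option_index(ranked, chosen):
    """Return the index to encode, or None when there is only one option."""
    if len(ranked) == 1:
        return None          # nothing to encode: the decoder has no choice to make
    return ranked.index(chosen)


def encoded_length(length):
    """Lengths are at least 1, so store length - 1."""
    return length - 1


# Hypothetical corpus counts for the translations of "equipment".
counts = Counter({"matériel": 7, "équipement": 3})
ranked = rank_options(["équipement", "matériel"], counts)
```

With this ordering the more frequent translation is encoded as 0, which skews the index distribution toward small values and suits a Huffman code.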

Binary encoding (1) Use 3 Huffman codes –H_1: words + pointer prefix –H_2: absolute values of offsets (sign bit follows, except for 0) –H_3: lengths + indices
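As an illustration of why a skewed offset distribution pays off, the sketch below builds a standard Huffman code over offset magnitudes (the role H_2 plays above) and counts the bits used, including one sign bit per non-zero offset. The offset sample is invented, and the helper name is hypothetical.

```python
import heapq
from collections import Counter


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries: (weight, tie-breaker, symbols in this subtree).
    heap = [(w, n, [sym]) for n, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1          # every merge adds one bit to these symbols
        heapq.heappush(heap, (w1 + w2, counter, syms1 + syms2))
        counter += 1
    return lengths


offsets = [0, 0, 0, 0, 1, 1, -1, 2]                  # hypothetical sample
magnitudes = Counter(abs(o) for o in offsets)
lengths = huffman_code_lengths(magnitudes)
magnitude_bits = sum(lengths[abs(o)] for o in offsets)
sign_bits = sum(1 for o in offsets if o != 0)
total_bits = magnitude_bits + sign_bits
```

Encoding the magnitude and the sign separately keeps the H_2 alphabet roughly half the size it would be for signed values, and offset 0, typically the most common value, needs no sign bit at all.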

Binary encoding (2) Words: H_1(lemma) [H_3(variant)] Pointers: l = length, m = (# of words in translation) H_1(ptr_prefix) H_2(offset) [sign_bit] H_3(l – 1) [H_3(lemma_0)] … [H_3(lemma_{l–1})] [H_3(translation)] [H_3(variant_0)] … [H_3(variant_{m–1})]
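The pointer layout above can be traced with a sketch that lists, in order, which code each field would be written with. It does not produce actual bits. Optional fields (the bracketed ones) are passed as `None` when the dictionary offers a single option; the function name and the sign-bit convention (0 for positive, 1 for negative) are assumptions.

```python
def pointer_fields(offset, length, lemma_idxs, translation_idx, variant_idxs):
    """Return the ordered (code, value) fields for one pointer."""
    fields = [("H1", "ptr_prefix"), ("H2", abs(offset))]
    if offset != 0:
        # Assumed convention: 0 = positive, 1 = negative.
        fields.append(("sign", 0 if offset > 0 else 1))
    fields.append(("H3", length - 1))
    fields += [("H3", i) for i in lemma_idxs if i is not None]
    if translation_idx is not None:
        fields.append(("H3", translation_idx))
    fields += [("H3", v) for v in variant_idxs if v is not None]
    return fields


# Second pointer of the "borne inférieure" example: 1(-1,1,0,4,1).
full = pointer_fields(-1, 1, [0], 4, [1])
# A two-word pointer where every dictionary lookup is unambiguous.
minimal = pointer_fields(0, 2, [None, None], None, [None, None])
```

In the unambiguous case the pointer shrinks to three codewords: the prefix, the zero offset, and the length.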

Empirical results English-French responsa collection of European parliament (ARCADE project) Sizes do not include the codes for HWORD and TRANS, nor the dictionaries for TRANS –Dictionaries exist anyway in large IR systems –Heaps' law: dictionary size is αN^β, where 0.4 ≤ β ≤ 0.6 For large corpora, size negligible
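The "size negligible" claim follows from the exponent being below 1: the dictionary's share of the corpus, αN^β / N = αN^(β–1), shrinks as N grows. A quick numeric sketch, with α = 1 and β = 0.5 chosen arbitrarily within the stated range, and with both sizes counted in words rather than bytes:

```python
def heaps_dictionary_size(n_tokens: float, alpha: float, beta: float) -> float:
    """Heaps' law estimate of the number of distinct dictionary entries."""
    return alpha * n_tokens ** beta


def dictionary_share(n_tokens: float, alpha: float, beta: float) -> float:
    """Dictionary size as a fraction of corpus size (both counted in words)."""
    return heaps_dictionary_size(n_tokens, alpha, beta) / n_tokens


# Hypothetical parameters: alpha = 1.0, beta = 0.5.
share_small = dictionary_share(1e6, 1.0, 0.5)
share_large = dictionary_share(1e8, 1.0, 0.5)
```

Under these assumed parameters, a hundredfold larger corpus has a dictionary share ten times smaller.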

Empirical results (2)

Future work Other test corpora –Other languages Compress target using lemmatized source Improve encoding Bidirectional scheme Pattern matching within compressed text Improved model for k languages
