Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein.

Outline
- Multilingual text
- Problem definition
- Multilingual-text alignment
- Compression of multilingual texts using alignment
  – Algorithm
  – Results
- Future work

Multilingual text
- Same content in two or more (natural) languages
  – Legislative texts of the European Union in all EU languages
- Example:
  Subject: Supplies of military equipment to Iraq
  Objet: Livraisons de matériel militaire à l'Irak

Problem definition
- How can multilingual texts be compressed more efficiently than by compressing each language separately?
  – Can semantic equivalence be exploited to reduce the aggregate corpus size?

Multilingual-text alignment (1)
- Mapping of equivalent text fragments to each other
  – Paragraph/sentence and word/phrase levels
  – Algorithms for both levels
- Tokenization, lemmatization, shallow parsing
  – Alignment possibly partial

Multilingual-text alignment (2)

Linear alignment
Given two parallel fragments S and T, the linear alignment of a token t_j in T is the token s_i in S such that:

  i = round(j · |S| / |T|)

i.e., the source position obtained by scaling the target position by the length ratio of the two fragments.

Correct vs. linear alignment

Offset from linear alignment
- Signed distance between the correct and linear alignments
  – Usually very small values (mostly in [-10, 10])
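The two notions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the alignment table in the example is invented.

```python
# Linear alignment of target position j, and the signed offsets of a
# given (correct) word alignment from that linear prediction.

def linear_alignment(j, len_s, len_t):
    """Source index i predicted for target token j by pure length scaling."""
    return round(j * len_s / len_t)

def offsets(correct, len_s, len_t):
    """Offsets of the correct alignment from the linear one.
    `correct` maps each target index j to its aligned source index."""
    return [i - linear_alignment(j, len_s, len_t)
            for j, i in enumerate(correct)]

# Example: a 6-token target fragment aligned to a 5-token source fragment.
corr = [0, 1, 1, 3, 4, 4]          # invented word alignment
offs = offsets(corr, 5, 6)          # small signed values, as the slide notes
```

Even when word order diverges locally, the offsets stay close to zero, which is what makes them cheap to encode.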

Compression of multilingual texts using alignment: Basic idea (1)
- Compress by replacing words/phrases with pointers to their translations within the other text
  – Original text restored using a bilingual dictionary
- Store offsets relative to the linear alignment
  – Small values → small number of distinct values → efficient encoding

Compression of multilingual texts using alignment: Basic idea (2)
- Store the number of words in the pointed fragment
  – Might be a multi-word phrase
  – bilan → balance sheet
- A single pointer may replace a multi-word phrase
  – matériel militaire → pointer to military equipment
  – chemin de fer → railway

Basic scheme: Example (option 1)
- Prefixes: 0 = word, 1 = pointer
- Pointer format: 1(offset, length)

Basic scheme: Example (option 2)
- matériel militaire → pointer to military equipment
- Offset relative to the first words
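The basic substitution step can be sketched as follows. This is a simplified, hypothetical version (single-word target keys only, exhaustive search for the translation in the source); the dictionary and sentences are invented for the example.

```python
# Replace each target word that has a known source translation by a
# pointer (offset-from-linear-alignment, length of the source fragment).

def encode_target(source, target, bi_dict):
    """bi_dict maps a target word to its source-language translation
    (a list of one or more source words)."""
    out = []
    for j, word in enumerate(target):
        trans = bi_dict.get(word)
        if trans is not None:
            length = len(trans)
            # locate the translation in the source to get its true position
            for i in range(len(source) - length + 1):
                if source[i:i + length] == trans:
                    lin = round(j * len(source) / len(target))
                    out.append(('PTR', i - lin, length))
                    break
            else:
                out.append(('WORD', word))   # translation not found: keep word
        else:
            out.append(('WORD', word))       # no dictionary entry: keep word
    return out

src = ['supplies', 'of', 'military', 'equipment', 'to', 'Iraq']
tgt = ['livraisons', 'de', 'matériel', 'militaire', 'à', "l'Irak"]
bi = {'livraisons': ['supplies'], 'de': ['of']}   # toy bilingual dictionary
encoded = encode_target(src, tgt, bi)
```

The decoder reverses the process: it recomputes the linear position for each pointer, adds the stored offset, and copies (the translation of) the referenced source fragment.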

Complication: Words with multiple possible translations
- Sometimes more than one possible translation per word
  – equipment: 1. équipement 2. matériel
- Must encode the correct translation within the pointer
  – Store the index of the translation

Complication: Morphological variants (1)
- The bilingual dictionary must use one morphological form (lemma)
  – go → aller stands for: {go, went, gone, going} → {aller, vais, vas, va, etc.}

Complication: Morphological variants (2)
- Texts include inflected forms
  – More than one possible lemma (bound → {bind, bound}) → must indicate the correct lemmas for S to enable dictionary lookup
  – Several variants per lemma → must indicate the correct inflections of the translation words to enable restoration of T

Complication: Morphological variants (3)
- Example: lower bound ↔ borne inférieure
  – borne inférieure encoded as 1(1,1,0,2,0) 1(-1,1,0,4,1)
- Extended pointer format: 1(offset, length, lemma(s), translation, variant(s))
  – Multiple values for multiple words

Optimizations
- No encoding when there is only a single option
  – Relevant for all 3 dictionaries
- Sort options in descending order of frequency
  – Large number of small values → better encoding
- Encode length as (length – 1)
  – length is never 0

Binary encoding (1)
Use 3 Huffman codes:
- H1: words + pointer prefix
- H2: absolute values of offsets
  – sign bit follows, except for 0
- H3: lengths + indices

Binary encoding (2)
- Words: H1(lemma) [H3(variant)]
- Pointers, with l = length and m = # of words in the translation:
  H1(ptr_prefix) H2(offset) [sign_bit] H3(l – 1) [H3(lemma_0)] … [H3(lemma_{l–1})] [H3(translation)] [H3(variant_0)] … [H3(variant_{m–1})]
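A standard Huffman construction suffices for each of the three codes; the sketch below builds one code table from a frequency table using a heap. This is generic textbook Huffman coding, not the authors' implementation, and the offset frequencies are made up to mimic the skew toward 0 noted earlier.

```python
# Build a Huffman code table (symbol -> bit string) from frequencies.
import heapq
from itertools import count

def huffman(freqs):
    tick = count()  # tie-breaker so the heap never compares code tables
    heap = [(f, next(tick), {s: ''}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)     # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + b for s, b in c1.items()}
        merged.update({s: '1' + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tick), merged))
    return heap[0][2]

# H2-style table: absolute offset values, heavily skewed toward 0,
# so the most frequent offset gets the shortest codeword.
H2 = huffman({0: 60, 1: 25, 2: 10, 3: 5})
```

In the real scheme, H1 is built over word lemmas plus the pointer-prefix symbol, H2 over absolute offsets (with a trailing sign bit for nonzero values), and H3 over lengths and indices, each from its own frequency counts.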

Empirical results
- English-French corpus of European parliament proceedings (ARCADE project)
- Sizes do not include the codes for HWORD and TRANS, nor the dictionaries for TRANS
  – Dictionaries exist anyway in large IR systems
  – Heaps' law: dictionary size is αN^β, where 0.4 ≤ β ≤ 0.6
  – For large corpora, dictionary size is negligible
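The Heaps' law argument can be checked numerically: with dictionary size αN^β, the dictionary's share of an N-token corpus is αN^(β−1), which vanishes as N grows. The values α = 10 and β = 0.5 below are illustrative assumptions, not measurements from the paper.

```python
# Fraction of the corpus occupied by the dictionary under Heaps' law,
# dictionary_size = alpha * N**beta (alpha and beta assumed here).
def dict_fraction(n_tokens, alpha=10.0, beta=0.5):
    return alpha * n_tokens ** beta / n_tokens

small_corpus = dict_fraction(10**6)   # 1M tokens
large_corpus = dict_fraction(10**9)   # 1G tokens: far smaller share
```

So even though the compressed sizes exclude the dictionaries, the comparison remains meaningful for large corpora.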

Empirical results (2)

Future work
- Other test corpora
  – Other languages
- Compress target using lemmatized source
- Improve encoding
- Bidirectional scheme
- Pattern matching within the compressed text
- Improved model for k languages