Language-Model Based Text-Compression

Presentation transcript:

Language-Model Based Text-Compression
James Connor, Antoine El Daher

Compressing with Structure
Classical compression: Huffman coding, arithmetic coding, Lempel-Ziv (LZ77, LZ78). The most popular compression tools are based on LZ77.
Exploiting structure: our goal is to incorporate prior knowledge about the structure of the input sequence.

Perplexity and Entropy
The achievable compression ratio is bounded by the entropy of the sequence to be compressed, and a low-perplexity language model is also a low-entropy distribution, so a better language model translates directly into better compression.
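Written out, the bound and the perplexity identity referred to here are the standard information-theoretic relations below; the notation (L for expected code length in bits per symbol, PP for the perplexity of a model q) is chosen for illustration, not taken from the slides.

```latex
% Any uniquely decodable code has expected length at least the entropy,
% and a model's perplexity is its exponentiated cross-entropy, so lower
% perplexity means fewer bits per word under that model.
\[
  L \;\ge\; H(X) \;=\; -\sum_{x} p(x)\,\log_2 p(x),
  \qquad
  \mathrm{PP}(q) \;=\; 2^{H(q)},\quad
  H(q) \;=\; -\frac{1}{N}\sum_{i=1}^{N} \log_2 q(w_i \mid w_1,\dots,w_{i-1}).
\]
```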

Character N-grams
Represent the text as an nth-order Markov chain of characters.
Maintain counts of character n-grams.
Build a library of Huffman tables based on these counts.
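A minimal Python sketch of this step, assuming one Huffman table is built per (n-1)-character context; the function names (ngram_counts, huffman_code, build_context_tables) are illustrative, not from the slides.

```python
import heapq
from collections import defaultdict, Counter

def ngram_counts(text, n=3):
    """Count character n-grams: counts[context][next_char]."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        counts[context][nxt] += 1
    return counts

def huffman_code(freqs):
    """Build a Huffman code {symbol: bitstring} from a symbol -> count map."""
    if len(freqs) == 1:                      # degenerate one-symbol alphabet
        return {next(iter(freqs)): "0"}
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (c1 + c2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def build_context_tables(text, n=3):
    """One Huffman table per (n-1)-character context."""
    return {ctx: huffman_code(freqs) for ctx, freqs in ngram_counts(text, n).items()}
```

Calling build_context_tables on the training text gives the library of tables; the same construction carries over to the word-level models on the next slides.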

Compressing the File: Training
For each bigram in the training set, we keep a map of all the words that can follow it, along with their probabilities, e.g. “to have” -> (“seen”, 0.1), (“been”, 0.1), (UNK, 0.1), etc.
Then, for each bigram, we build a Huffman tree over its continuations.
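A sketch of this training step, reusing the huffman_code helper from the character n-gram sketch above; the name train_trigram_tables and the UNK pseudo-count of 1 are illustrative choices, not values stated on the slides.

```python
from collections import defaultdict, Counter

UNK = "<UNK>"

def train_trigram_tables(tokens, unk_count=1):
    """For each bigram context, count the words that can follow it, reserve an
    UNK pseudo-count for unseen continuations, and build one Huffman tree per
    context (via huffman_code from the character n-gram sketch above)."""
    continuations = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        continuations[(w1, w2)][w3] += 1
    tables = {}
    for context, counts in continuations.items():
        freqs = dict(counts)
        freqs[UNK] = unk_count            # escape symbol for backing off
        tables[context] = huffman_code(freqs)
    return tables
```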

Compressing the File: Compression
We go through the input file, using the Huffman trees from the training set to code each word based on the two preceding words.
If the trigram is unknown, we code the UNK token and then revert to a unigram model (also coded with Huffman).
If the unigram is unknown, we use a character-level Huffman code (trained on the training set) to code the word.
Decompression works similarly: the decompressor mimics the same behavior using the same trees.
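A sketch of this back-off scheme, assuming the hypothetical encode_word signature below together with the UNK symbol and tables from the training sketch; the "\0" end-of-word marker is an illustrative choice, not something specified on the slides. Because the decompressor holds the same tables and reads the bits in the same order, it can mirror each back-off decision exactly.

```python
def encode_word(word, context, trigram_tables, unigram_table, char_table):
    """Back-off encoding of a single word given its two preceding words (a sketch).
    Assumes the tables come from training: unigram_table contains UNK, and
    char_table contains every character plus an end-of-word marker '\\0'."""
    bits = ""
    table = trigram_tables.get(context)
    if table is not None and word in table:
        return table[word]                    # trigram hit: shortest code
    if table is not None:
        bits += table[UNK]                    # escape: trigram -> unigram model
    if word in unigram_table:
        return bits + unigram_table[word]     # unigram hit
    bits += unigram_table[UNK]                # escape: unigram -> character model
    for ch in word + "\0":                    # spell the word out, then terminate
        bits += char_table[ch]
    return bits
```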

Extensions
We add a sliding context window: while compressing a file, words have their counts incremented when they enter the window and decremented when they leave it. This lets us make better use of the local context for trigrams and bigrams and gives more representative weights.
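A minimal sketch of the count bookkeeping, assuming a hypothetical SlidingWindowCounts helper and a window of 1000 trigrams (both illustrative); the slides do not say how the updated counts are turned back into codes, so only the increment/decrement logic is shown.

```python
from collections import defaultdict, deque, Counter

class SlidingWindowCounts:
    """Trigram counts over a sliding window of recent text (illustrative sketch).
    Compressor and decompressor must apply identical updates to stay in sync."""

    def __init__(self, window=1000):
        self.window = window
        self.recent = deque()                   # trigrams currently in the window
        self.counts = defaultdict(Counter)      # counts[(w1, w2)][w3]

    def push(self, trigram):
        w1, w2, w3 = trigram
        self.counts[(w1, w2)][w3] += 1          # increment on entering the window
        self.recent.append(trigram)
        if len(self.recent) > self.window:
            o1, o2, o3 = self.recent.popleft()  # decrement on leaving the window
            self.counts[(o1, o2)][o3] -= 1
            if self.counts[(o1, o2)][o3] == 0:
                del self.counts[(o1, o2)][o3]
```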

Results
Competitive with gzip.