Language-Model Based Text-Compression

Presentation transcript:

Language-Model Based Text-Compression
James Connor, Antoine El Daher

Compressing with Structure
Classical compression: Huffman coding, arithmetic coding, Lempel-Ziv (LZ77, LZ78). The most popular compression tools are based on LZ77.
Exploiting structure: our goal is to incorporate prior knowledge about the structure of the input sequence.

Perplexity and Entropy
The achievable compression ratio is bounded by the entropy of the sequence to be compressed, and a low-perplexity language model is also a low-entropy distribution, so a better language model translates directly into better compression.
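Written out, the bound and the perplexity identity referred to here are the standard information-theoretic relations below; the notation (L for expected code length in bits per symbol, PP for the perplexity of a model q) is chosen for illustration, not taken from the slides.

```latex
% Any uniquely decodable code has expected length at least the entropy,
% and a model's perplexity is its exponentiated cross-entropy, so lower
% perplexity means fewer bits per word under that model.
\[
  L \;\ge\; H(X) \;=\; -\sum_{x} p(x)\,\log_2 p(x),
  \qquad
  \mathrm{PP}(q) \;=\; 2^{H(q)},\quad
  H(q) \;=\; -\frac{1}{N}\sum_{i=1}^{N} \log_2 q(w_i \mid w_1,\dots,w_{i-1}).
\]
```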

Character N-grams
Represent the text as an nth-order Markov chain of characters.
Maintain counts of character n-grams.
Build a library of Huffman tables based on these counts.
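A minimal Python sketch of this step, assuming one Huffman table is built per (n-1)-character context; the function names (ngram_counts, huffman_code, build_context_tables) are illustrative, not from the slides.

```python
import heapq
from collections import defaultdict, Counter

def ngram_counts(text, n=3):
    """Count character n-grams: counts[context][next_char]."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        counts[context][nxt] += 1
    return counts

def huffman_code(freqs):
    """Build a Huffman code {symbol: bitstring} from a symbol -> count map."""
    if len(freqs) == 1:                      # degenerate one-symbol alphabet
        return {next(iter(freqs)): "0"}
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (c1 + c2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def build_context_tables(text, n=3):
    """One Huffman table per (n-1)-character context."""
    return {ctx: huffman_code(freqs) for ctx, freqs in ngram_counts(text, n).items()}
```

Calling build_context_tables on the training text gives the library of tables; the same construction carries over to the word-level models on the next slides.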

Compressing the File: Training
For each bigram in the training set, we keep a map of all the words that can follow it, along with their probabilities, e.g. “to have” -> (“seen”, 0.1), (“been”, 0.1), (UNK, 0.1), etc.
Then, for each bigram, we build a Huffman tree over its continuations.
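A sketch of this training step, reusing the huffman_code helper from the character n-gram sketch above; the name train_trigram_tables and the UNK pseudo-count of 1 are illustrative choices, not values stated on the slides.

```python
from collections import defaultdict, Counter

UNK = "<UNK>"

def train_trigram_tables(tokens, unk_count=1):
    """For each bigram context, count the words that can follow it, reserve an
    UNK pseudo-count for unseen continuations, and build one Huffman tree per
    context (via huffman_code from the character n-gram sketch above)."""
    continuations = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        continuations[(w1, w2)][w3] += 1
    tables = {}
    for context, counts in continuations.items():
        freqs = dict(counts)
        freqs[UNK] = unk_count            # escape symbol for backing off
        tables[context] = huffman_code(freqs)
    return tables
```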

Compressing the File: Compression
We go through the input file, using the Huffman trees from the training set to code each word based on the two preceding words.
If the trigram is unknown, we code the UNK token and then revert to a unigram model (also coded with Huffman).
If the unigram is unknown, we use a character-level Huffman code (trained on the training set) to code the word.
Decompression works similarly: the decompressor mimics the same behavior using the same trees.
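A sketch of this back-off scheme, assuming the hypothetical encode_word signature below together with the UNK symbol and tables from the training sketch; the "\0" end-of-word marker is an illustrative choice, not something specified on the slides. Because the decompressor holds the same tables and reads the bits in the same order, it can mirror each back-off decision exactly.

```python
def encode_word(word, context, trigram_tables, unigram_table, char_table):
    """Back-off encoding of a single word given its two preceding words (a sketch).
    Assumes the tables come from training: unigram_table contains UNK, and
    char_table contains every character plus an end-of-word marker '\\0'."""
    bits = ""
    table = trigram_tables.get(context)
    if table is not None and word in table:
        return table[word]                    # trigram hit: shortest code
    if table is not None:
        bits += table[UNK]                    # escape: trigram -> unigram model
    if word in unigram_table:
        return bits + unigram_table[word]     # unigram hit
    bits += unigram_table[UNK]                # escape: unigram -> character model
    for ch in word + "\0":                    # spell the word out, then terminate
        bits += char_table[ch]
    return bits
```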

Extensions
We add a sliding context window: while compressing a file, words have their counts incremented when they enter the window and decremented when they leave it. This lets us make better use of the local context for trigrams and bigrams and gives more representative weights.
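A minimal sketch of the count bookkeeping, assuming a hypothetical SlidingWindowCounts helper and a window of 1000 trigrams (both illustrative); the slides do not say how the updated counts are turned back into codes, so only the increment/decrement logic is shown.

```python
from collections import defaultdict, deque, Counter

class SlidingWindowCounts:
    """Trigram counts over a sliding window of recent text (illustrative sketch).
    Compressor and decompressor must apply identical updates to stay in sync."""

    def __init__(self, window=1000):
        self.window = window
        self.recent = deque()                   # trigrams currently in the window
        self.counts = defaultdict(Counter)      # counts[(w1, w2)][w3]

    def push(self, trigram):
        w1, w2, w3 = trigram
        self.counts[(w1, w2)][w3] += 1          # increment on entering the window
        self.recent.append(trigram)
        if len(self.recent) > self.window:
            o1, o2, o3 = self.recent.popleft()  # decrement on leaving the window
            self.counts[(o1, o2)][o3] -= 1
            if self.counts[(o1, o2)][o3] == 0:
                del self.counts[(o1, o2)][o3]
```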

Results
Competitive with gzip.