1
The PAQ4 Data Compressor
Matt Mahoney, Florida Tech.
2
Outline
Data compression background
The PAQ4 compressor
Modeling NASA valve data
History of PAQ4 development
3
Data Compression Background
Lossy vs. lossless
Theoretical limits on lossless compression
Difficulty of modeling data
Current compression algorithms
4
Lossy vs. Lossless
Lossy compression discards unimportant information
NTSC (color TV), JPEG, and MPEG discard imperceptible image details
MP3 discards inaudible details
Losslessly compressed data can be restored exactly
5
Theoretical Limits on Lossless Compression
Cannot compress random data
Cannot compress recursively
Cannot compress every possible message
Every compression algorithm must expand some messages by at least 1 bit
Cannot compress x to fewer than log2 1/P(x) bits on average (Shannon, 1949)
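The last bullet is Shannon's source coding bound; as a LaTeX sketch (with C(x) the code assigned to message x and H(P) the entropy of the source P):

```latex
% Expected code length is bounded below by the source entropy:
% for any uniquely decodable code C and source distribution P,
\[
  \mathbb{E}\bigl[\,|C(x)|\,\bigr] \;=\; \sum_{x} P(x)\,|C(x)|
  \;\ge\; \sum_{x} P(x)\,\log_2 \frac{1}{P(x)} \;=\; H(P).
\]
```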
6
Difficulty of Modeling
In general, the probability distribution P of a source is unknown
Estimating P is called modeling
Modeling is hard:
Text: as hard as AI
Encrypted data: as hard as cryptanalysis
7
Text compression is as hard as passing the Turing test for AI
Q: Are you human? A: Yes (a dialogue with a computer)
P(x) = probability of a human dialogue x (known implicitly by humans)
A machine knowing P(A|Q) = P(QA)/P(Q) would be indistinguishable from a human
Entropy of English ≈ 1 bit per character (Shannon, 1950)
Best compression: 1.2 to 2 bpc (depending on input size)
8
Compressing encrypted data is equivalent to breaking the encryption
Example: x = 1,000,000 zero bytes encrypted with AES in CBC mode and key “foobar”
The encrypted data passes all tests for statistical randomness (not compressible)
C(x) = 65 bytes using English (the description above)
Finding C(x) requires guessing the key
9
Nevertheless, some common data is compressible
10
Redundancy in English text
Letter frequency: P(e) > P(q), so “e” is assigned a shorter code
Word frequency: P(the) > P(eth)
Semantic constraints: P(drink tea) > P(drink air)
Syntactic constraints: P(of the) > P(the of)
11
Redundancy in images (pic from Calgary corpus)
Adjacent pixels are often the same color, P(000111) > P(011010)
12
Redundancy in the Calgary corpus (plot): distance back to the last match of length 1, 2, 4, or 8
13
Redundancy in DNA
tcgggtcaataaaattattaaagccgcgttttaacaccaccgggcgtttctgccagtgacgttcaagaaaatcgggccattaagagtgagttggtattccatgttaagcatccacaggctggtatctgcaaccgattataacggatgcttaacgtaatcgtgaagtatgggcatatttattcatctttcggcgcagaatgctggcgaccaaaaatcacctccatccgcgcaccgcccgcatgctctctccggcgacgattttaccctcatattgctcggtgatttcgcgggctacc
A uniform model P(a)=P(t)=P(c)=P(g)=1/4 gives 2 bpc; e.coli compresses to about 1.92 bpc
14
Some data compression methods
LZ77 (gzip): repeated strings are replaced with pointers back to a previous occurrence (sketched below)
LZW (compress, gif): repeated strings are replaced with an index into a dictionary
LZ decompression is very fast
PPM (prediction by partial match): characters are arithmetic coded based on statistics of the longest matching context
Slower, but better compression
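To make the LZ77 idea concrete, here is a minimal, illustrative C++ sketch of greedy LZ77 parsing (naive search, no hash chains, and no entropy coding of the tokens, so it is not gzip itself):

```cpp
#include <cstdio>
#include <string>
#include <vector>

// One LZ77 token: either a literal byte (len == 0) or a back-reference
// (distance back to the match start, length of the match).
struct Token { int dist, len; char literal; };

// Naive greedy LZ77 parse: at each position, search the already-seen text
// for the longest match and emit a pointer to it, otherwise emit a literal.
std::vector<Token> lz77(const std::string& s, int minMatch = 3) {
    std::vector<Token> out;
    for (size_t i = 0; i < s.size(); ) {
        int bestLen = 0, bestDist = 0;
        for (size_t j = 0; j < i; ++j) {                 // candidate match start
            size_t k = 0;
            while (j + k < i && i + k < s.size() && s[j + k] == s[i + k]) ++k;
            if ((int)k > bestLen) { bestLen = (int)k; bestDist = (int)(i - j); }
        }
        if (bestLen >= minMatch) { out.push_back({bestDist, bestLen, 0}); i += bestLen; }
        else                     { out.push_back({0, 0, s[i]});           i += 1; }
    }
    return out;
}

int main() {
    // Prints literals for "the cat in ", a back-reference for the second
    // "the ", then literals 'h', 'a', 't'.
    for (const Token& t : lz77("the cat in the hat"))
        if (t.len) std::printf("(back %d, len %d) ", t.dist, t.len);
        else       std::printf("'%c' ", t.literal);
    std::printf("\n");
}
```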
15
LZ77 Example
the cat in the hat — the second “the ” is replaced by a pointer back to the first occurrence
Sub-optimal compression due to redundancy in LZ77 coding: in ...a...a...a... the repeated string can be coded by a pointer to either earlier occurrence, so the same data has more than one encoding
16
LZW Example
the cat in the hat — repeated strings such as “the ” and “at” become dictionary entries, and later occurrences are replaced by their dictionary index
Sub-optimal compression due to parsing ambiguity: after ...ab...bc... the dictionary contains both “ab” and “bc”, so a later “abc” can be parsed as ab+c or a+bc
17
Predictive Arithmetic Compression (optimal)
Compressor: input → predict next symbol (p) → arithmetic coder → compressed data
Decompressor: compressed data → arithmetic decoder → output, using an identical next-symbol predictor (p)
18
Arithmetic Coding
Maps string x into C(x) ∈ [0,1), represented as a high precision binary fraction
P(y < x) ≤ C(x) < P(y ≤ x), where < is a lexicographical ordering on strings y
There exists a C(x) with at most a log2 1/P(x) + 1 bit representation
Optimal: within 1 bit of the Shannon limit
Can be computed incrementally (sketched below):
As characters of x are read, the bounds tighten
As the bounds tighten, the high order bits of C(x) can be output
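The incremental computation can be sketched in C++ as a carry-free binary arithmetic encoder in the PAQ style (an illustration, not the actual PAQ4 Encoder class; the probability is scaled to 16 bits and matching leading bytes of the bounds are emitted as soon as they become fixed):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Minimal binary arithmetic encoder: the code C(x) is a point in [0,1) kept
// as a pair of 32-bit bounds; as bits are coded the bounds tighten, and the
// high-order bytes on which both bounds agree are output immediately.
class ArithEncoder {
    uint32_t x1 = 0, x2 = 0xFFFFFFFF;       // current interval [x1, x2]
    std::vector<uint8_t> out;                // compressed output
public:
    // Code one bit y given p1 = P(y = 1) scaled to 0..65535.
    void encode(int y, uint32_t p1) {
        // Split [x1, x2] in proportion to P(1); the 1-branch is the low part.
        uint32_t xmid = x1 + (uint32_t)(((uint64_t)(x2 - x1) * p1) >> 16);
        if (y) x2 = xmid; else x1 = xmid + 1;
        // Output leading bytes that are now fixed (the bounds agree on them).
        while ((x1 ^ x2) < (1u << 24)) {
            out.push_back((uint8_t)(x2 >> 24));
            x1 <<= 8;
            x2 = (x2 << 8) | 255;
        }
    }
    const std::vector<uint8_t>& finish() {   // flush remaining bytes of the lower bound
        for (int i = 0; i < 4; ++i) { out.push_back((uint8_t)(x1 >> 24)); x1 <<= 8; }
        return out;
    }
};

int main() {
    // Code "aab" with P(a) = 2/3 as on the next slide, using bit 1 for 'b'.
    ArithEncoder enc;
    uint32_t p1 = 65536 / 3;                 // P(1) = P(b) = 1/3
    enc.encode(0, p1); enc.encode(0, p1); enc.encode(1, p1);
    std::printf("%zu byte(s) output\n", enc.finish().size());
}
```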
19
Arithmetic coding example
P(a) = 2/3, P(b) = 1/3
Code tree over 3-symbol strings: aaa = “” (empty code), aba = “1”, baa = “11”, bbb = “11111”
We can output “1” after the first “b” is read
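As a worked step (assuming the lexicographic order a < b), the code “1” for aba follows from the interval bounds:

```latex
% Interval for x = aba with P(a) = 2/3, P(b) = 1/3:
\[
  P(y < aba) = P(aaa) + P(aab) = \tfrac{8}{27} + \tfrac{4}{27}
             = \tfrac{12}{27} \approx 0.444,
  \qquad
  P(y \le aba) = \tfrac{12}{27} + P(aba) = \tfrac{16}{27} \approx 0.593 .
\]
% The shortest binary fraction in [0.444, 0.593) is 0.1_2 = 0.5, so C(aba) = "1".
```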
20
Prediction by Partial Match (PPM) Guess next letter by matching longest context
the cat in th? — the longest context match is “th”; the next letter in context “th” is “e”
the cat in the ha? — the longest context match is “a”; the next letter in context “a” is “t”
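A toy C++ sketch of this longest-context prediction (it only counts next letters per context and returns the most frequent one; real PPM also assigns escape probabilities and feeds the statistics to an arithmetic coder):

```cpp
#include <algorithm>
#include <cstdio>
#include <map>
#include <string>

// Predict the next letter from the longest context that has been seen
// before: count next-letter frequencies for every context of length
// 1..maxOrder, then predict from the longest matching context.
char predict(const std::string& text, int maxOrder = 8) {
    std::map<std::string, std::map<char, int>> counts;
    for (size_t i = 0; i + 1 < text.size(); ++i)
        for (int n = 1; n <= maxOrder && n <= (int)i + 1; ++n)
            counts[text.substr(i + 1 - n, n)][text[i + 1]]++;
    for (int n = std::min(maxOrder, (int)text.size()); n >= 1; --n) {
        auto it = counts.find(text.substr(text.size() - n));
        if (it == counts.end()) continue;            // no match at this order
        char best = 0; int bestCount = 0;            // most frequent next letter
        for (auto& [c, cnt] : it->second)
            if (cnt > bestCount) { best = c; bestCount = cnt; }
        return best;
    }
    return '?';
}

int main() {
    std::printf("%c\n", predict("the cat in th"));   // longest match "th" -> 'e'
}
```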
21
How do you mix old and new evidence?
..abx...abx...abx...aby...ab? P(x) = ? P(y) = ?
22
How do you mix evidence from contexts of different lengths?
..abcx...bcy...cy...abc? P(x) = ? P(y) = ? P(z) = ? (unseen but not impossible)
23
PAQ4 Overview
Predictive arithmetic coder
Predicts 1 bit at a time
19 models make independent predictions
Most models favor newer data
Weighted average of model predictions
Weights adapted by gradient descent
SSE adjusts the final probability (Osnach)
Mixer and SSE are context sensitive
24
PAQ4 block diagram: the input data feeds the models; each model outputs a prediction p; the context-selected mixer combines them into one p; SSE refines p; the arithmetic coder produces the compressed data
25
19 Models
Fixed (P(1) = ½)
n-gram, n = 1 to 8 bytes
Match model for n > 8
1-word context (white space boundary)
Sparse 2-byte contexts (skip a byte) (Osnach)
Table models (2 above, or 1 above and left)
8 predictions per byte
Context normally begins on a byte boundary
26
n-gram and sparse contexts
n-gram contexts: x?  ......xx?  .....xxx?  ....xxxx?  ...xxxxx?  ..xxxxxx?  .xxxxxxx?  xxxxxxxx?  xxxxxxxxxx? (variable length > 8)  word? (begins after space)
Sparse contexts: x.x?  x..x?  x.x.?  x...x...?  xx.?  xx..?
27
Record (or Table) Model
Find a byte repeated 4 times with the same interval, e.g. ..x..x..x..x
If the interval is at least 3, assume a table
2 models: the bytes 1 and 2 records above; the byte above and the byte to the left
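A small C++ sketch of the detection rule above (illustrative only; the function name detectRecordLength and the full scan over the buffer are simplifications, since PAQ4 tracks byte positions incrementally as it models the stream):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Table detection: if the current byte has now occurred 4 times at the same
// spacing, and that spacing is at least 3, treat the data as a table with
// that record length.
int detectRecordLength(const std::vector<uint8_t>& buf) {
    if (buf.empty()) return 0;
    uint8_t x = buf.back();
    std::vector<size_t> pos;                       // positions of byte x
    for (size_t i = 0; i < buf.size(); ++i)
        if (buf[i] == x) pos.push_back(i);
    size_t n = pos.size();
    if (n < 4) return 0;
    size_t d = pos[n-1] - pos[n-2];                // most recent spacing
    if (d >= 3 && pos[n-2] - pos[n-3] == d && pos[n-3] - pos[n-4] == d)
        return (int)d;                             // 4 occurrences, equal interval
    return 0;
}

int main() {
    std::vector<uint8_t> data = {'x','1','2','x','3','4','x','5','6','x'};
    std::printf("record length = %d\n", detectRecordLength(data));  // prints 3
}
```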
28
Nonstationary counter model
Count the 0 and 1 bits observed in each context
When a bit is observed, discard from the opposite count: if it is more than 2, discard ½ of the excess (see the sketch below)
Favors newer data and highly predictive contexts
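A minimal C++ sketch of this counter, following the discard rule exactly as stated on this slide (PAQ4 additionally packs the counts into small state bytes):

```cpp
#include <cstdio>

// Nonstationary bit counter: increment the count of the observed bit, and if
// the opposite count is more than 2, discard half of its excess over 2.
struct Counter {
    int n0 = 0, n1 = 0;
    void update(int y) {
        int& same = y ? n1 : n0;
        int& opp  = y ? n0 : n1;
        ++same;
        if (opp > 2) opp = 2 + (opp - 2) / 2;    // halve the excess over 2
    }
    double p1() const {                           // P(next bit = 1)
        return n0 + n1 ? (double)n1 / (n0 + n1) : 0.5;
    }
};

int main() {
    Counter c;
    for (int y : {0,0,0,0,0,0,1,1}) {             // mostly 0s, then two 1s
        c.update(y);
        std::printf("n0=%d n1=%d p1=%.2f\n", c.n0, c.n1, c.p1());
    }
    // The two trailing 1s quickly shrink n0, so p1 rises fast: newer data wins.
}
```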
29
Nonstationary counter example
Worked example (table): a sequence of input bits observed in one context, with the resulting counts n0, n1 and the estimate p(1) after each bit
30
Mixer
p(1) = Σi wi n1i / Σi wi ni
wi = weight of the i’th model
n0i, n1i = 0 and 1 counts for the i’th model; ni = n0i + n1i
Cost to code bit y = -log p(y)
Gradient of the cost (for a coded 1 bit): ∂cost/∂wi = ni / Σj wj nj - n1i / Σj wj n1j
Adjust wi by a small amount in the direction of the negative gradient after coding each bit (to reduce the cost of coding that bit)
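The mixing and weight update can be sketched in C++ as follows (the learning rate is an illustrative value, not the exact PAQ4 constant):

```cpp
#include <cstdio>
#include <vector>

// Weighted mixer: each model i contributes bit counts (n0i, n1i); the mixed
// estimate is p(1) = sum_i w_i*n1_i / sum_i w_i*n_i, and after each coded
// bit y the weights move a small step against the gradient of -log p(y).
struct Mixer {
    std::vector<double> w;                        // one weight per model
    explicit Mixer(int nModels) : w(nModels, 1.0) {}

    double p1(const std::vector<double>& n0, const std::vector<double>& n1) const {
        double s1 = 0, s = 0;
        for (size_t i = 0; i < w.size(); ++i) {
            s1 += w[i] * n1[i];
            s  += w[i] * (n0[i] + n1[i]);
        }
        return s > 0 ? s1 / s : 0.5;
    }

    // After coding bit y: d(cost)/dw_i = n_i/S - n_{y,i}/S_y,
    // so each weight moves by rate * (n_{y,i}/S_y - n_i/S).
    void update(int y, const std::vector<double>& n0,
                const std::vector<double>& n1, double rate = 0.001) {
        double s = 0, sy = 0;
        for (size_t i = 0; i < w.size(); ++i) {
            s  += w[i] * (n0[i] + n1[i]);
            sy += w[i] * (y ? n1[i] : n0[i]);
        }
        if (s <= 0 || sy <= 0) return;
        for (size_t i = 0; i < w.size(); ++i) {
            double ni  = n0[i] + n1[i];
            double nyi = y ? n1[i] : n0[i];
            w[i] += rate * (nyi / sy - ni / s);
            if (w[i] < 0) w[i] = 0;               // keep weights non-negative
        }
    }
};

int main() {
    Mixer m(2);
    std::vector<double> n0 = {5, 1}, n1 = {1, 7};  // two toy models' counts
    std::printf("p(1) = %.3f\n", m.p1(n0, n1));
    m.update(1, n0, n1);                           // an actual 1 was coded
    std::printf("weights: %.4f %.4f\n", m.w[0], m.w[1]);  // model 2 gains weight
}
```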
31
Secondary Symbol Estimation (SSE)
Maps the input probability p to a refined probability
Refines the final probability by adapting to the observed bits
Piecewise linear approximation
32 segments (shorter near 0 or 1)
Counts n0, n1 at segment intersections (stationary: no discounting of the opposite count)
8-bit counts are halved if they exceed 255
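A simplified C++ sketch of SSE (assumptions: 32 equal-width segments rather than segments shortened near 0 and 1, and a single table rather than the 1024 context-selected ones described two slides later):

```cpp
#include <cstdio>

// Secondary Symbol Estimation: the mixer's probability is refined by a
// learned piecewise-linear function. Knot i holds stationary counts
// (n0, n1); the output interpolates between the two knots bracketing the
// input p, and the nearest knot is trained on each coded bit.
class SSE {
    static const int SEG = 32;                    // number of segments
    int n0[SEG + 1], n1[SEG + 1];                 // counts at segment boundaries
public:
    SSE() {
        for (int i = 0; i <= SEG; ++i) {          // initialize to the identity map
            n1[i] = i; n0[i] = SEG - i;
        }
    }
    double refine(double p) {                     // map input p to refined p
        double t = p * SEG;
        int i = (int)t; if (i >= SEG) i = SEG - 1;
        double f = t - i;                         // position within segment i
        double lo = (double)n1[i]     / (n0[i]     + n1[i]);
        double hi = (double)n1[i + 1] / (n0[i + 1] + n1[i + 1]);
        return lo + f * (hi - lo);                // linear interpolation
    }
    void update(double p, int y) {                // train on the actual bit y
        int i = (int)(p * SEG + 0.5);             // nearest knot
        if (i > SEG) i = SEG;
        if (y) ++n1[i]; else ++n0[i];
        if (n0[i] > 255 || n1[i] > 255) {         // 8-bit counts: halve on overflow
            n0[i] >>= 1; n1[i] >>= 1;
        }
    }
};

int main() {
    SSE sse;
    // Suppose the mixer keeps saying 0.7 but the observed bits are mostly 1s:
    for (int i = 0; i < 300; ++i) sse.update(0.7, 1);
    std::printf("refined p for input 0.7: %.3f\n", sse.refine(0.7));
}
```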
32
SSE example (plot): output p versus input p, showing the initial function and the trained function
33
Mixer and SSE are context sensitive
8 mixers, selected by the 3 high order bits of the last whole byte
1024 SSE functions, selected by the current partial byte and the 2 high order bits of the last whole byte
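A hedged sketch of how such small contexts can select among several mixers and SSE tables (the exact bit packing here is assumed for illustration, not taken from the PAQ4 source):

```cpp
#include <cstdio>

// 'partial' is the byte currently being coded, with a leading 1 bit acting
// as a length marker (so it ranges over 1..255 across the 8 bit positions);
// 'lastByte' is the last completed byte.
int mixerIndex(unsigned lastByte) { return lastByte >> 5; }                 // 0..7
int sseIndex(unsigned partial, unsigned lastByte) {
    return (((lastByte >> 6) << 8) | partial) & 1023;                       // 0..1023
}

int main() {
    std::printf("%d %d\n", mixerIndex(0xE1), sseIndex(0x1A, 0xE1));
}
```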
34
Experimental Results on Popular Compressors, Calgary Corpus
Compressor      Size (bytes)   Time (sec, 750 MHz)
Original data
compress                       1.5
pkzip 2.04e                    1.5
gzip -9                        2
winrar 3.20     754270         7
paq4            672134         166
35
Results on Top Compressors
Compressor        Size     Time (sec)
ppmn              716297   23
rk 1.02           707160   44
ppmonstr I        696647   35
paq4              672134   166
epm r9            668115   54
rkc               661602   91
slim 18           659358   153
compressia 1.0b   650398   66
durilca 0.3a      647028
36
Compression for Anomaly Detection
Anomaly detection: finding unlikely events
It depends on the ability to estimate probability
So does compression
37
Prior work Compression detects anomalies in NASA TEK valve data
C(normal) ≈ C(abnormal) when traces are compressed alone
C(normal + normal) < C(normal + abnormal)
Verified with gzip, rk, and paq4
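The test can be sketched with any general-purpose compressor; in this illustration zlib's deflate stands in for gzip, and the helper names compressedSize and extraBytes are ours:

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <zlib.h>              // link with -lz; deflate stands in for gzip

// Score a candidate trace by how much extra compressed size it adds when
// appended to a known-normal trace: a normal candidate shares structure with
// the reference, so C(normal + candidate) stays small; an abnormal one does not.
size_t compressedSize(const std::vector<unsigned char>& data, int level = 9) {
    uLongf outLen = compressBound(data.size());
    std::vector<unsigned char> out(outLen);
    compress2(out.data(), &outLen, data.data(), data.size(), level);
    return outLen;
}

size_t extraBytes(const std::vector<unsigned char>& normal,
                  const std::vector<unsigned char>& candidate) {
    std::vector<unsigned char> both = normal;
    both.insert(both.end(), candidate.begin(), candidate.end());
    return compressedSize(both) - compressedSize(normal);
}

int main() {
    std::vector<unsigned char> normal(20000, 100), similar(20000, 100),
                               anomalous(20000);
    for (size_t i = 0; i < anomalous.size(); ++i)
        anomalous[i] = (unsigned char)(rand() & 0xFF);    // toy "abnormal" trace
    std::printf("extra for normal:   %zu bytes\n", extraBytes(normal, similar));
    std::printf("extra for abnormal: %zu bytes\n", extraBytes(normal, anomalous));
}
```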
38
NASA Valve Solenoid Traces
Data set 3: solenoid current (Hall effect sensor)
218 normal traces
20,000 samples per trace
Measurements quantized to 208 values
Data converted to a 4,360,000 byte file with 1 sample per byte
39
Graph of 218 overlapped traces (data in green)
40
Compression Results

Compressor      Size
Original        4360000
gzip -9         1836587
slim 18
epm r9
durilca 0.3a
rkc
rk4 -mx
ppmonstr Ipre
paq4
41
PAQ4 Analysis Removing SSE had little effect
Removing all models except n = 1 to 5 had little effect
Delta coding made compression worse for all compressors
The model is still too large to code in SCL, but the incompressible part of the data is probably noise, which can be modeled statistically
42
Future Work Compress with noise filtered out
Verify anomaly detection by temperature, voltage, and plunger impediment (voltage test 1)
Investigate analog and other models
Convert models to rules
43
History of PAQ4

Date        Compressor                                   Calgary size
Nov. 1999   P12 (Neural net, FLAIRS paper in 5/2000)     831341
Jan. 2002   PAQ1 (Nonstationary counters)                716704
May 2003    PAQ2 (Serge Osnach adds SSE)                 702382
Sept. 2003  PAQ3 (Improved SSE)                          696616
Oct. 2003   PAQ3N (Osnach adds sparse models)            684580
Nov. 2003   PAQ4 (Adaptive mixing)                       672135
44
Acknowledgments
Serge Osnach (author of EPM) for adding SSE and sparse models to PAQ2, PAQ3N
Yoockin Vadim (YBS), Werner Bergmans, Berto Destasio for benchmarking PAQ4
Jason Schmidt, Eugene Shelwien (ASH, PPMY) for compiling faster/smaller executables
Eugene Shelwien, Dmitry Shkarin (DURILCA, PPMONSTR, BMF) for improvements to SSE contexts
Alexander Ratushnyak (ERI) for finding a bug in an earlier version of PAQ4