Text Compression: Syllables Jan Lánský, Michal Žemlička Dept. of Software Engineering Faculty of Mathematics and Physics Charles University
Synopsis
- Introduction
- Letters, syllables, words
- Algorithms of decomposition into syllables
- Syllable-based compression methods
- Results
- Conclusion

Introduction
Why do we use syllable-based compression methods?

Length of Phrase
What should be the proper symbols for coding?
- Letters (Shannon, 1948)
- Words (HuffWord, Moffat 1987)
- Syllables: logical units between words and letters

Types of Languages
With rich morphology
- Words have many grammatical forms.
- Word meaning can be modified using prefixes.
- Examples: Czech, German
With simple morphology
- Words have only a few grammatical forms.
- Word meaning can be modified using prepositions.
- Example: English

Expectations
- Syllable-based compression will be suitable for languages with rich morphology and for middle-sized files.
- The sets of syllables of two documents are more similar than the sets of words of those two documents.
- The number of unique syllables in a given language will be smaller than the number of unique words.

Problems of syllable-based compression
Decomposition of words into syllables:
- It is not unique (Os-tra-va, Ost-ra-va).
- Sometimes it is necessary to know the origin of the word (neu-ron, ne-u-ro-nit).
- For text compression, it is not necessary to always decompose words into syllables correctly.

Letters, syllables, words
Classification of Symbols
Symbols
- Letters
  - Capital
  - Small
- Non-letters
  - Special characters
  - Digits
Letters are further divided into vowels and consonants.

Vowels x Consonants
- In Czech, the letters r and l can act as a vowel or as a consonant, depending on their context.
- Between two consonants, r and l are vowels; otherwise they are consonants.
- Examples: vrtat x vrátit, vlk x vlákat

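
The context rule for r and l can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the vowel set `CZECH_VOWELS`, the helper names, and the choice to treat r or l at a word boundary as a consonant are all assumptions made here.

```python
# Assumption: a simplified Czech vowel inventory, chosen for illustration.
CZECH_VOWELS = set("aeiouyáéíóúůýě")


def is_vowel(word: str, i: int) -> bool:
    """Classify the letter at position i, using its neighbours for r and l."""
    ch = word[i].lower()
    if ch in CZECH_VOWELS:
        return True
    # r and l count as vowels only when both neighbours exist and are consonants
    # (assumption: at a word boundary they stay consonants).
    if ch in "rl" and 0 < i < len(word) - 1:
        left = word[i - 1].lower()
        right = word[i + 1].lower()
        return left not in CZECH_VOWELS and right not in CZECH_VOWELS
    return False


def vowel_mask(word: str) -> str:
    """Return a string of V/C marks, one per letter of the word."""
    return "".join("V" if is_vowel(word, i) else "C" for i in range(len(word)))


for w in ("vrtat", "vrátit", "vlk", "vlákat"):
    print(w, vowel_mask(w))
```

In vrtat and vlk the r and l sit between two consonants and are marked V; in vrátit and vlákat they are followed by a vowel and stay C.
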
Vowels x Consonants
The role of the letter y in English:
- Vowel: happy
- Consonant: buying
- Vowel followed by consonant: trying

Classification of Words
Words
- Letter
  - Small: hallo
  - Capital: HALLO
  - Mixed: Hallo
- Non-letter
  - Numeric: 1982
  - Special: $?+

Syllable
A syllable is a sequence of sounds that contains exactly one maximal subsequence of vowels.
Types of syllables (analogous to the types of words):
- Letter (small, capital, mixed)
- Non-letter (numerical, special)

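
The five types can be told apart with standard string predicates. The sketch below is only an illustration; the function name `classify` is hypothetical, and it assumes that any letter unit that is neither all lower case nor all upper case counts as mixed.

```python
def classify(unit: str) -> str:
    """Return the type of a word or syllable: small, capital, mixed, numeric or special."""
    if unit.isalpha():
        if unit.islower():
            return "small"
        if unit.isupper():
            return "capital"
        # Assumption: every other combination of cases is treated as mixed.
        return "mixed"
    if unit.isdigit():
        return "numeric"
    # Everything else (punctuation, spaces, symbols) is a special unit.
    return "special"


for unit in ("hallo", "HALLO", "Hallo", "1982", "$?+"):
    print(unit, classify(unit))
```
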
Algorithms of decomposing words into syllables
How can we decompose words into syllables quickly and with minimal information about the language?

Algorithms of decomposing words into syllables
- A word is decomposed into maximal sequences (blocks) of vowels and of consonants.
- Example: odstrčenou (blocks: o, dst, r, č, e, n, ou)
- Blocks of vowels form the bases of syllables.
- We have described 4 algorithms.
- They differ in how blocks of consonants are added to blocks of vowels.

Algorithms of decomposing words into syllables
- Universal left (P_UL): adds consonants to the left block of vowels.
- Universal right (P_UR): adds consonants to the right block of vowels.
- Universal middle-left (P_UML): adds the bigger half of the consonants to the left block of vowels. Blocks of consonants of size one are added to the right.
- Universal middle-right (P_UMR): adds the bigger half of the consonants to the right block of vowels.

Example of decomposing
We decompose the word odstrčenou, which contains the blocks of vowels o, r, e, ou.
- P_UL: odst-rč-en-ou
- P_UR: o-dstr-če-nou
- P_UML: ods-tr-če-nou
- P_UMR: od-str-če-nou (this is the correct decomposition)

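
The four strategies differ only in where an inner block of consonants is cut, so they can be sketched with one function and a per-method split point. This is a minimal illustration under stated assumptions, not the authors' code: the vowel set, the function names, the handling of consonants before the first and after the last vowel block (attached to the nearest syllable), and the even split of even-sized blocks are all choices made here.

```python
from itertools import groupby

# Assumption: a simplified Czech vowel inventory, chosen for illustration.
CZECH_VOWELS = set("aeiouyáéíóúůýě")


def is_vowel(word, i):
    """r and l count as vowels only between two consonants."""
    ch = word[i].lower()
    if ch in CZECH_VOWELS:
        return True
    if ch in "rl" and 0 < i < len(word) - 1:
        return (word[i - 1].lower() not in CZECH_VOWELS
                and word[i + 1].lower() not in CZECH_VOWELS)
    return False


def split_blocks(word):
    """Return [(is_vowel_block, text), ...] of maximal vowel/consonant runs."""
    flags = [is_vowel(word, i) for i in range(len(word))]
    blocks = []
    for flag, run in groupby(range(len(word)), key=flags.__getitem__):
        idx = list(run)
        blocks.append((flag, word[idx[0]:idx[-1] + 1]))
    return blocks


def left_share(n, method):
    """How many of n inner consonants go to the syllable on the left."""
    if method == "UL":
        return n
    if method == "UR":
        return 0
    if method == "UML":
        return 0 if n == 1 else (n + 1) // 2  # bigger half left, singles right
    if method == "UMR":
        return n // 2                         # bigger half right
    raise ValueError(f"unknown method: {method}")


def decompose(word, method):
    blocks = split_blocks(word)
    vowel_pos = [k for k, (is_v, _) in enumerate(blocks) if is_v]
    if not vowel_pos:
        return [word]  # no vowel block: keep the word as one unit
    syllables = [blocks[k][1] for k in vowel_pos]
    # Assumption: outer consonant blocks join the first and last syllable.
    if not blocks[0][0]:
        syllables[0] = blocks[0][1] + syllables[0]
    if not blocks[-1][0]:
        syllables[-1] = syllables[-1] + blocks[-1][1]
    # Blocks alternate, so exactly one consonant block separates two vowel blocks.
    for n in range(len(vowel_pos) - 1):
        cons = blocks[vowel_pos[n] + 1][1]
        cut = left_share(len(cons), method)
        syllables[n] += cons[:cut]
        syllables[n + 1] = cons[cut:] + syllables[n + 1]
    return syllables


for m in ("UL", "UR", "UML", "UMR"):
    print(m, "-".join(decompose("odstrčenou", m)))
```

Run on odstrčenou, the sketch reproduces the four decompositions listed on this slide.
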
Syllable-based compression methods
The syllable as the basic compression unit.

Syllable-based compression methods
LZWL
- Dictionary-based method
- Syllable-based version of LZW
HuffSyllable
- Statistical method
- Adaptive Huffman coding
- Inspired by HuffWord

Algorithm LZWL
- The dictionary of phrases is initialized with frequent syllables of the given language.
- During compression we may encounter an unknown syllable, which must be added to the dictionary of phrases.
- Phrases in the dictionary are extended only if no unknown syllable was detected in either the current or the previous step.

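
The three rules above can be sketched as a toy encoder and decoder over a list of syllables. This is an illustration under assumptions, not the authors' implementation: the token format, the function names, and the choice to form a new phrase from the previous phrase plus the first syllable of the current one are assumptions, and an unknown syllable is emitted here as a literal token rather than coded character by character.

```python
def lzwl_encode(syllables, initial):
    """Encode a syllable list as ("phrase", code) and ("new", syllable) tokens."""
    dictionary = {(s,): code for code, s in enumerate(initial)}
    tokens = []
    prev = None  # phrase of the previous step; None after an unknown syllable
    pos = 0
    while pos < len(syllables):
        # Find the longest dictionary phrase that starts at pos.
        phrase, end = (), pos
        while end < len(syllables) and phrase + (syllables[end],) in dictionary:
            phrase += (syllables[end],)
            end += 1
        if not phrase:
            # Unknown syllable: send it literally and add it to the dictionary.
            new = syllables[pos]
            tokens.append(("new", new))
            dictionary[(new,)] = len(dictionary)
            prev = None
            pos += 1
            continue
        tokens.append(("phrase", dictionary[phrase]))
        # Extend only if neither this step nor the previous one hit an unknown syllable.
        if prev is not None:
            extended = prev + (phrase[0],)
            if extended not in dictionary:
                dictionary[extended] = len(dictionary)
        prev = phrase
        pos = end
    return tokens


def lzwl_decode(tokens, initial):
    """Rebuild the syllable list by mirroring the encoder's dictionary updates."""
    phrases = [(s,) for s in initial]
    known = set(phrases)
    out = []
    prev = None
    for kind, value in tokens:
        if kind == "new":
            phrases.append((value,))
            known.add((value,))
            out.append(value)
            prev = None
            continue
        phrase = phrases[value]
        out.extend(phrase)
        if prev is not None:
            extended = prev + (phrase[0],)
            if extended not in known:
                phrases.append(extended)
                known.add(extended)
        prev = phrase
    return out


text = ["ba", "na", "na", "ba", "na", "na", "xy", "ba", "na"]
tokens = lzwl_encode(text, initial=["ba", "na"])
print(tokens)
print(lzwl_decode(tokens, initial=["ba", "na"]) == text)
```

Because each new phrase is built only from syllables the decoder has already seen, the decoder can repeat every dictionary update without extra information.
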
Algorithm HuffSyllable
- Each syllable type has its own adaptive Huffman tree.
- In each step of the algorithm, the expected type of the following syllable is predicted. This determines which tree will be used to encode it.
- If the prediction is wrong (the following syllable has a type other than the expected one), an escape symbol is used to switch to the correct tree.
- The prediction of the type of the following syllable is based on the type of the previous syllable and other criteria.

Prediction of syllable type

Previous syllable                                               Expected type
small, mixed                                                    small
capital                                                         capital
numerical                                                       special
special with dot, last letter syllable was not capital          mixed
special without dot, last letter syllable was not capital       small
special, last letter syllable was capital                       capital

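
The prediction rules can be sketched as a small function, together with a walk over a syllable sequence that marks where an escape symbol would be needed. This is an illustration only and rests on several assumptions: the dot rows are read as "after a dot expect a mixed syllable (a capitalized sentence start), otherwise a small one", a capital syllable is taken to predict another capital one, "with dot" is taken to mean that the special syllable contains a dot, and the names `predict_type`, `plan_encoding` and `first_expected` are hypothetical.

```python
LETTER_TYPES = {"small", "capital", "mixed"}


def predict_type(prev_type, prev_text, last_letter_type):
    """Expected type of the next syllable, given the previous one."""
    if prev_type in ("small", "mixed"):
        return "small"
    if prev_type == "capital":
        return "capital"
    if prev_type == "numeric":
        return "special"
    # prev_type == "special"
    if last_letter_type == "capital":
        return "capital"
    # Assumption: a dot ends a sentence, so a capitalized syllable is expected.
    return "mixed" if "." in prev_text else "small"


def plan_encoding(syllables, first_expected="mixed"):
    """For each (type, text) syllable return (expected, actual, needs_escape)."""
    plan = []
    expected = first_expected  # assumption: hypothetical initial guess
    last_letter_type = None
    for syl_type, text in syllables:
        # A wrong prediction means an escape symbol switches to the right tree.
        plan.append((expected, syl_type, expected != syl_type))
        if syl_type in LETTER_TYPES:
            last_letter_type = syl_type
        expected = predict_type(syl_type, text, last_letter_type)
    return plan


sample = [("mixed", "Vlk"), ("special", " "), ("small", "vr"),
          ("small", "tá"), ("special", ". "), ("mixed", "Vlk")]
for row in plan_encoding(sample):
    print(row)
```
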
Results
Comparison of letter-based, syllable-based, and word-based compression methods.

Results - Czech
All values in bpc (bits per character), by file size.

Method               5-50 kB   50-100 kB   100-500 kB   500 kB-2 MB   2-5 MB
compress             4.35      4.08        3.90         3.81          ---
LZWL syllables       4.07      3.77        3.56         3.31          ---
LZWL words           4.56      4.19        3.99         3.69          ---
FGK                  4.97      4.95        5.00         4.99          ---
HuffSyll syllables   3.86      3.79        3.80         3.74          ---
HuffSyll words       3.71      3.51        3.43         3.21          ---

Results - Czech
Czech: a language with rich morphology
- Syllable-based versions of LZWL and HuffSyllable are better than letter-based.
- Syllable-based version of LZWL is better than word-based.
- Syllable-based version of HuffSyllable is a bit worse than word-based.

Results - English
All values in bpc (bits per character), by file size.

Method               5-50 kB   50-100 kB   100-500 kB   500 kB-2 MB   2-5 MB
compress             3.79      3.57        3.34         3.27          3.08
LZWL syllables       3.31      3.09        2.87         2.64          2.37
LZWL words           3.22      3.03        2.86         2.62          2.36
FGK                  4.59      4.60                     4.58          4.54
HuffSyll syllables   3.23      3.18        3.15         3.10          2.97
HuffSyll words       2.65      2.58        2.52         2.38          2.31

Results - English
English: a language with simple morphology
- Syllable-based versions of LZWL and HuffSyllable are better than letter-based.
- Syllable-based version of LZWL is a bit worse than word-based.
- Syllable-based version of HuffSyllable is much worse than word-based.

Conclusion
What do we plan?

Conclusion
We plan:
- A syllable-based version of bzip2
- To try other languages with rich morphology, for example German and Hungarian
- Compression of specific formats (XML)