1
Text Compression: Syllables
Jan Lánský, Michal Žemlička
Dept. of Software Engineering, Faculty of Mathematics and Physics, Charles University
2
Synopsis
  Introduction
  Letters, syllables, words
  Algorithms of decomposition into syllables
  Syllable-based compression methods
  Results
  Conclusion
3
Introduction
Why use syllable-based compression methods?
4
Length of Phrase
What are the proper symbols for coding?
  Letters (Shannon, 1948)
  Words (HuffWord, Moffat 1987)
  Syllables: logical units between words and letters
5
Types of Languages
With rich morphology
  Words have many grammatical forms.
  Word meaning can be changed using prefixes.
  Examples: Czech, German
With simple morphology
  Words have only a few grammatical forms.
  Word meaning can be changed using prepositions.
  Example: English
6
Expectations
Syllable-based compression will be suitable for:
  Languages with rich morphology
  Middle-sized files
The sets of syllables of two documents are more similar than the sets of words of those two documents.
The number of unique syllables in a given language is smaller than the number of unique words.
7
Problems of syllable-based compression
Decomposition of words into syllables
  It is not unique (Os-tra-va, Ost-ra-va).
  Sometimes it is necessary to know the origin of the word (neu-ron, ne-u-ro-nit).
For text compression, it is not necessary to always decompose words into syllables correctly.
8
Letters, syllables, words
9
Classification of Symbols
  Letters: small or capital; vowels or consonants
  Non-letters: digits or special characters
10
Vowels x Consonants
In Czech, the letters r and l can act as a vowel or as a consonant, depending on their context.
Between two consonants, r and l are vowels; otherwise they are consonants.
Examples: vrtat x vrátit, vlk x vlákat
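The context rule for r and l can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vowel set `CZECH_VOWELS` and the function name `cv_pattern` are not from the slides, and r or l at the start or end of a word is treated as a consonant because it does not stand between two consonants.

```python
# Assumed set of Czech vowel letters (not listed on the slides).
CZECH_VOWELS = set("aáeéěiíoóuúůyý")

def cv_pattern(word):
    """Return a string of 'V'/'C' marks, one per letter of the word."""
    letters = word.lower()
    marks = []
    for i, ch in enumerate(letters):
        if ch in CZECH_VOWELS:
            marks.append("V")
        elif ch in "rl":
            # r and l act as vowels only between two consonants.
            between_consonants = (
                0 < i < len(letters) - 1
                and letters[i - 1] not in CZECH_VOWELS
                and letters[i + 1] not in CZECH_VOWELS
            )
            marks.append("V" if between_consonants else "C")
        else:
            marks.append("C")
    return "".join(marks)

for w in ("vrtat", "vrátit", "vlk", "vlákat"):
    print(w, cv_pattern(w))
```

In vrtat and vlk the letters r and l come out as vowels, while in vrátit and vlákat they come out as consonants, matching the example pairs on the slide.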
11
Vowels x Consonants
The role of the letter y in English:
  Vowel: happy
  Consonant: buying
  Vowel followed by consonant: trying
12
Classification of words
  Letter words
    Small: hallo
    Capital: HALLO
    Mixed: Hallo
  Non-letter words
    Numeric: 1982
    Special: $?+
13
A syllable is a sequence of sounds that contains exactly one maximal subsequence of vowels.
Types of syllables (analogous to the types of words):
  Letter (small, capital, mixed)
  Non-letter (numerical, special)
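The five types can be told apart with standard string predicates. The sketch below is an assumption-laden illustration: the function name `unit_type` and the type labels are hypothetical, "mixed" is taken to mean any letter string containing both cases, and a string that is neither all letters nor all digits is classed as special.

```python
def unit_type(unit):
    """Classify a word or syllable as small, capital, mixed, numeric or special."""
    if unit.isalpha():
        if unit.islower():
            return "small"
        if unit.isupper():
            return "capital"
        return "mixed"          # letters of both cases, e.g. "Hallo"
    if unit.isdigit():
        return "numeric"
    return "special"            # anything else, e.g. punctuation

for example in ("hallo", "HALLO", "Hallo", "1982", "$?+"):
    print(example, unit_type(example))
```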
14
Algorithms of decomposing words into syllables
How can words be decomposed into syllables quickly and with minimal information about the language?
15
Algorithms of decomposing words into syllables
A word is decomposed into maximal sequences (blocks) of vowels and of consonants.
Example: odstrčenou
The bases of syllables are formed by the blocks of vowels.
We have described 4 algorithms. They differ in the way blocks of consonants are added to blocks of vowels.
16
Algorithms of decomposing words into syllables
  universal left (PUL): adds consonants to the left block of vowels.
  universal right (PUR): adds consonants to the right block of vowels.
  universal middle-left (PUML): adds the bigger half of the consonants to the left block of vowels. Blocks of consonants of size one are added to the right.
  universal middle-right (PUMR): adds the bigger half of the consonants to the right block of vowels.
For the BUML and BUMR algorithms, the bigger half means half of the block size rounded up.
17
Example of decomposing
We decompose the word odstrčenou, which contains the blocks of vowels o, r, e, ou.
  PUL: odst-rč-en-ou
  PUR: o-dstr-če-nou
  PUML: ods-tr-če-nou
  PUMR: od-str-če-nou (this is the correct decomposition)
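The four strategies can be reproduced with a short Python sketch. It is an illustration, not the authors' implementation, and it rests on several assumptions the slides do not state: the Czech vowel set, the function names, consonants before the first vowel block joining the first syllable, consonants after the last vowel block joining the last syllable, and a word without any vowel block forming a single syllable.

```python
from itertools import groupby

CZECH_VOWELS = set("aáeéěiíoóuúůyý")  # assumed vowel letters

def is_vowel(word, i):
    """Vowel test including the context rule for r and l."""
    ch = word[i]
    if ch in CZECH_VOWELS:
        return True
    if ch in "rl":
        return (0 < i < len(word) - 1
                and word[i - 1] not in CZECH_VOWELS
                and word[i + 1] not in CZECH_VOWELS)
    return False

def blocks(word):
    """Split a word into maximal (is_vowel_block, text) blocks."""
    word = word.lower()
    flags = [is_vowel(word, i) for i in range(len(word))]
    result = []
    for flag, group in groupby(range(len(word)), key=lambda k: flags[k]):
        idx = list(group)
        result.append((flag, word[idx[0]:idx[-1] + 1]))
    return result

def left_share(size, method):
    """Number of consonants of an inner block that go to the left syllable."""
    if method == "PUL":
        return size
    if method == "PUR":
        return 0
    if method == "PUML":
        return 0 if size == 1 else (size + 1) // 2  # bigger half left, singles right
    if method == "PUMR":
        return size // 2                            # bigger half goes right
    raise ValueError(method)

def decompose(word, method):
    parts = blocks(word)
    syllables = [text for is_v, text in parts if is_v]  # one base per vowel block
    if not syllables:
        return [word.lower()]          # assumption: no vowels, one syllable
    n = -1                             # index of the last vowel block seen
    for is_v, text in parts:
        if is_v:
            n += 1
        elif n == -1:                  # leading consonants
            syllables[0] = text + syllables[0]
        elif n == len(syllables) - 1:  # trailing consonants
            syllables[n] += text
        else:
            k = left_share(len(text), method)
            syllables[n] += text[:k]
            syllables[n + 1] = text[k:] + syllables[n + 1]
    return syllables

for m in ("PUL", "PUR", "PUML", "PUMR"):
    print(m, "-".join(decompose("odstrčenou", m)))
```

Run on odstrčenou, the sketch yields the same four decompositions as listed on the slide.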
18
Syllable-based compression methods
The syllable as the basic compression unit.
19
Syllable-based compression methods
LZWL
  Dictionary-based method
  Syllable-based version of LZW
HuffSyllable
  Statistical method
  Adaptive Huffman coding
  Inspired by HuffWord
20
Algorithm LZWL
The dictionary of phrases is initialized with the frequent syllables of the given language.
During compression we can encounter an unknown syllable, which must be added to the dictionary of phrases.
We extend phrases in the dictionary only if no unknown syllable was detected in either the current or the previous step.
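The three rules above can be illustrated with a simplified LZW-style coder over syllables. This is a sketch under assumptions, not the authors' LZWL: the `ESCAPE` code, the function names, sending an unknown syllable as a raw string after the escape code, and forming each new phrase as "previous phrase plus first syllable of the current phrase" are all choices made here for illustration. The list of frequent syllables is assumed to contain no duplicates.

```python
ESCAPE = 0  # hypothetical code meaning "an unknown syllable follows"

def lzwl_encode(syllables, frequent_syllables):
    """Encode a list of syllables as phrase codes plus raw unknown syllables."""
    codes = {(s,): i for i, s in enumerate(frequent_syllables, start=1)}
    output = []
    previous = ()   # phrase of the previous step; () means unknown syllable or start
    pos = 0
    while pos < len(syllables):
        # Greedily find the longest dictionary phrase starting at pos.
        current = ()
        while pos + len(current) < len(syllables):
            candidate = current + (syllables[pos + len(current)],)
            if candidate not in codes:
                break
            current = candidate
        if current:
            output.append(codes[current])
            pos += len(current)
            # Extend only if neither this nor the previous step hit an unknown syllable.
            if previous:
                extended = previous + (current[0],)
                if extended not in codes:
                    codes[extended] = len(codes) + 1
        else:
            unknown = syllables[pos]
            output.extend([ESCAPE, unknown])
            codes[(unknown,)] = len(codes) + 1   # unknown syllable joins the dictionary
            pos += 1
        previous = current
    return output

def lzwl_decode(stream, frequent_syllables):
    """Invert lzwl_encode by rebuilding the same dictionary step by step."""
    phrases = [None] + [(s,) for s in frequent_syllables]  # list index = code
    known = set(phrases[1:])
    syllables = []
    previous = ()
    items = iter(stream)
    for code in items:
        if code == ESCAPE:
            unknown = next(items)
            syllables.append(unknown)
            phrases.append((unknown,))
            known.add((unknown,))
            current = ()
        else:
            current = phrases[code]
            syllables.extend(current)
            if previous:
                extended = previous + (current[0],)
                if extended not in known:
                    phrases.append(extended)
                    known.add(extended)
        previous = current
    return syllables

text = ["ba", "na", "na", "ba", "na", "na", "xy", "ba"]
encoded = lzwl_encode(text, ["ba", "na"])
print(encoded)
print(lzwl_decode(encoded, ["ba", "na"]) == text)
```

A real coder would write the codes with a fixed or growing bit width and would need its own scheme for transmitting unknown syllables; both are left out here.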
21
Algorithm HuffSyllable
For each syllable type we have an adaptive Huffman tree.
In each step of the algorithm, the expected type of the following syllable is predicted. This determines which tree will be used to encode it.
In case of a bad prediction (the following syllable has a type other than expected), an escape symbol is used to switch to the correct tree.
The prediction of the type of the following syllable is based on the type of the previous syllable and other criteria.
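The tree-switching logic can be sketched without the Huffman coding itself. The code below only records which tree each symbol would be sent to and keeps the per-tree frequency counts that an adaptive coder would update; it emits no bits. Everything named here is an assumption: the type labels, one escape symbol per target type, and the placeholder predictor `same_as_previous`, which is not the prediction rule used by HuffSyllable.

```python
from collections import Counter

TYPES = ("small", "capital", "mixed", "numeric", "special")

def same_as_previous(prev_type):
    """Hypothetical placeholder predictor: expect the previous type again."""
    return prev_type if prev_type is not None else "small"

def model_events(typed_syllables, predict=same_as_previous):
    """Return (tree, symbol) events and per-tree symbol counts.

    typed_syllables is a list of (syllable, type) pairs.
    """
    counts = {t: Counter() for t in TYPES}
    events = []
    prev_type = None
    for syllable, actual in typed_syllables:
        expected = predict(prev_type)
        if actual != expected:
            # Bad prediction: an escape in the expected tree switches trees.
            escape = ("ESC", actual)
            events.append((expected, escape))
            counts[expected][escape] += 1
        events.append((actual, syllable))
        counts[actual][syllable] += 1
        prev_type = actual
    return events, counts

sample = [("Ahoj", "mixed"), ("svě", "small"), ("te", "small"),
          (" ", "special"), ("1982", "numeric")]
events, counts = model_events(sample)
for tree, symbol in events:
    print(tree, symbol)
```

A better predictor produces fewer escape events, which is why the slides base the prediction on more than the previous type alone.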
22
Prediction of syllable type
Previous syllable Expected type small, mixed small capital numerical special special with dot, last letter syllable was small special without dot, last letter syllable wasn’t capital mixed special, last letter syllable was capital Capital
23
Results
Comparison of letter-based, syllable-based, and word-based compression methods.
24
Results - Czech
File size / method    5 kB – 50 kB  50 kB – 100 kB  100 kB – 500 kB  500 kB – 2 MB  2 MB – 5 MB
compress              4.35 bpc      4.08 bpc        3.90 bpc         3.81 bpc       ---
LZWL (syllables)      4.07 bpc      3.77 bpc        3.56 bpc         3.31 bpc       ---
LZWL (words)          4.56 bpc      4.19 bpc        3.99 bpc         3.69 bpc       ---
FGK                   4.97 bpc      4.95 bpc        5.00 bpc         4.99 bpc       ---
HuffSyll (syllables)  3.86 bpc      3.79 bpc        3.80 bpc         3.74 bpc       ---
HuffSyll (words)      3.71 bpc      3.51 bpc        3.43 bpc         3.21 bpc       ---
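The results are reported in bits per character (bpc): the number of bits in the compressed file divided by the number of characters in the original text. A minimal helper, with a hypothetical function name and made-up example sizes, shows the arithmetic.

```python
def bits_per_character(compressed_size_bytes, original_char_count):
    """Compressed size in bits divided by the number of original characters."""
    return 8 * compressed_size_bytes / original_char_count

# Made-up example: 1000 characters compressed to 500 bytes.
print(bits_per_character(500, 1000))
```

Lower values mean better compression; uncompressed text in an 8-bit encoding has 8 bpc.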
25
Results - Czech
Czech is a language with rich morphology.
  The syllable-based versions of LZWL and HuffSyllable are better than the letter-based methods.
  The syllable-based version of LZWL is better than the word-based one.
  The syllable-based version of HuffSyllable is a bit worse than the word-based one.
26
Results - English
File size / method    5 kB – 50 kB  50 kB – 100 kB  100 kB – 500 kB  500 kB – 2 MB  2 MB – 5 MB
compress              3.79 bpc      3.57 bpc        3.34 bpc         3.27 bpc       3.08 bpc
LZWL (syllables)      3.31 bpc      3.09 bpc        2.87 bpc         2.64 bpc       2.37 bpc
LZWL (words)          3.22 bpc      3.03 bpc        2.86 bpc         2.62 bpc       2.36 bpc
FGK                   4.59 bpc      4.60 bpc        4.58 bpc         4.54 bpc
HuffSyll (syllables)  3.23 bpc      3.18 bpc        3.15 bpc         3.10 bpc       2.97 bpc
HuffSyll (words)      2.65 bpc      2.58 bpc        2.52 bpc         2.38 bpc       2.31 bpc
27
Results - English
English is a language with simple morphology.
  The syllable-based versions of LZWL and HuffSyllable are better than the letter-based methods.
  The syllable-based version of LZWL is a bit worse than the word-based one.
  The syllable-based version of HuffSyllable is much worse than the word-based one.
28
Conclusion
What do we plan?
29
Conclusion
We plan:
  A syllable-based version of bzip2
  Trying other languages with rich morphology, for example German and Hungarian
  Compression of specific formats: XML