Text Compression: Syllables


1 Text Compression: Syllables
Jan Lánský, Michal Žemlička
Dept. of Software Engineering, Faculty of Mathematics and Physics, Charles University

2 Synopsis
Introduction
Letters, syllables, words
Algorithms of decomposition into syllables
Syllable-based compression methods
Results
Conclusion

3 Introduction
Why use syllable-based compression methods?

4 Length of Phrase
What should be the proper symbols for coding?
  Letters (Shannon, 1948)
  Words (HuffWord, Moffat 1987)
  Syllables: logical units between words and letters

5 Types of Languages
With rich morphology
  Words have many grammatical forms.
  Word meaning can be modified using prefixes.
  Example: Czech, German
With simple morphology
  Words have only a few grammatical forms.
  Word meaning can be modified using prepositions.
  Example: English

6 Expectations
Syllable-based compression will be suitable for languages with rich morphology.
Syllable-based compression will be suitable for middle-sized files.
The sets of syllables of two documents are more similar to each other than the sets of words of these two documents.
The number of unique syllables in a given language will be smaller than the number of unique words.

7 Problems of syllable-based compression
Decomposition of words into syllables:
  It is not unique (Os-tra-va, Ost-ra-va).
  Sometimes it is necessary to know the origin of the word (neu-ron, ne-u-ro-nit).
For text compression it is not necessary to always decompose words into syllables correctly.

8 Letters, syllables, words

9 Classification of Symbols
Letters
  Small, Capital
  Vowels, Consonants
Non-letters
  Digits
  Special characters

10 Vowels x Consonants
In Czech, the letters r and l can act as a vowel or as a consonant depending on their context. Between two consonants, r and l are vowels; otherwise they are consonants.
Examples: vrtat x vrátit, vlk x vlákat
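The context rule for r and l can be sketched in a few lines of Python. This is only an illustration: the vowel set `CZECH_VOWELS`, the helper names, and the treatment of word boundaries (a boundary is not counted as a consonant) are assumptions, not something stated on the slide.

```python
# Assumed set of plain Czech vowels; the slides do not list them.
CZECH_VOWELS = set("aáeéěiíoóuúůyý")


def is_plain_consonant(ch: str) -> bool:
    """A letter that is not in the assumed vowel set (r and l included)."""
    return ch.isalpha() and ch.lower() not in CZECH_VOWELS


def acts_as_vowel(word: str, i: int) -> bool:
    """Return True if word[i] plays the role of a vowel."""
    ch = word[i].lower()
    if ch in CZECH_VOWELS:
        return True
    if ch in ("r", "l"):
        # Syllabic r/l: both neighbours must exist and be consonants.
        # Assumption: a word boundary does not count as a consonant.
        left = i > 0 and is_plain_consonant(word[i - 1])
        right = i + 1 < len(word) and is_plain_consonant(word[i + 1])
        return left and right
    return False
```

With these assumptions, the r in vrtat and the l in vlk are classified as vowels, while the same letters in vrátit and vlákat are classified as consonants.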

11 Vowels x Consonants
The role of the letter y in English:
  Vowel: happy
  Consonant: buying
  Vowel followed by consonant: trying

12 Classification of words
Letter
  Small: hallo
  Capital: HALLO
  Mixed: Hallo
Non-letter
  Numeric: 1982
  Special: $?+
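The word classes above map naturally onto a small classifier. The sketch below is illustrative only: the function name `classify_word` is hypothetical, and it assumes the input token has already been split so that it is homogeneous (all letters, all digits, or all other characters). Any letter word that is neither all lowercase nor all uppercase is treated as mixed, which is a simplification.

```python
def classify_word(token: str) -> str:
    """Classify a homogeneous token into one of the five word classes."""
    if not token:
        raise ValueError("empty token")
    if token.isalpha():
        if token.islower():
            return "small"
        if token.isupper():
            return "capital"
        return "mixed"  # assumption: everything else counts as mixed
    if token.isdigit():
        return "numeric"
    return "special"
```

Applied to the slide's examples, this returns small for hallo, capital for HALLO, mixed for Hallo, numeric for 1982, and special for $?+.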

13 Syllable
A syllable is a sequence of sounds that contains exactly one maximal subsequence of vowels.
Types of syllables (analogous to words):
  Letter (small, capital, mixed)
  Non-letter (numerical, special)

14 Algorithms of decomposing words into syllables
How can words be decomposed into syllables quickly and with minimal information about the language?

15 Algorithms of decomposing words into syllables
A word is decomposed into maximal sequences (blocks) of vowels and consonants. Example: odstrčenou
The blocks of vowels form the bases of the syllables.
We describe four algorithms. They differ in how blocks of consonants are added to blocks of vowels.

16 Algorithms of decomposing words into syllables
Universal left (PUL): adds consonants to the left block of vowels.
Universal right (PUR): adds consonants to the right block of vowels.
Universal middle-left (PUML): adds the bigger half of the consonants to the left block of vowels. Blocks of consonants of size one are added to the right.
Universal middle-right (PUMR): adds the bigger half of the consonants to the right block of vowels.
For PUML and PUMR, the bigger half means half of the block size rounded up.

17 Example of decomposing
We decompose the word odstrčenou, which contains the blocks of vowels o, r, e, ou.
  PUL: odst-rč-en-ou
  PUR: o-dstr-če-nou
  PUML: ods-tr-če-nou
  PUMR: od-str-če-nou (this is the correct decomposition)
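The four decomposition rules can be sketched as follows. This is a minimal illustration, not the authors' implementation: the vowel set, the function names, and the handling of edge cases (consonants before the first or after the last vowel block are attached to that block; a word without vowels stays one syllable) are assumptions.

```python
CZECH_VOWELS = set("aáeéěiíoóuúůyý")  # assumed vowel set


def is_vowel(word, i):
    """Plain vowels, plus r/l standing between two consonants."""
    if word[i] in CZECH_VOWELS:
        return True
    if word[i] in "rl":
        def consonant(j):
            return 0 <= j < len(word) and word[j] not in CZECH_VOWELS
        return consonant(i - 1) and consonant(i + 1)
    return False


def blocks(word):
    """Split a word into maximal runs as (is_vowel_block, text) pairs."""
    out = []
    for i, ch in enumerate(word):
        v = is_vowel(word, i)
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + ch)
        else:
            out.append((v, ch))
    return out


def left_share(n, method):
    """How many of n consonants between two vowel blocks go to the left."""
    if method == "PUL":
        return n
    if method == "PUR":
        return 0
    if method == "PUML":
        return 0 if n == 1 else (n + 1) // 2  # size-one blocks go right
    if method == "PUMR":
        return n // 2  # the bigger half goes right
    raise ValueError(method)


def syllabify(word, method):
    word = word.lower()
    bl = blocks(word)
    vowel_idx = [k for k, (v, _) in enumerate(bl) if v]
    if not vowel_idx:
        return [word]  # assumption: no vowel block, keep the word whole
    syllables = [bl[k][1] for k in vowel_idx]
    for k, (v, text) in enumerate(bl):
        if v:
            continue
        if k < vowel_idx[0]:
            syllables[0] = text + syllables[0]   # leading consonants
        elif k > vowel_idx[-1]:
            syllables[-1] += text                # trailing consonants
        else:
            pos = vowel_idx.index(k - 1)         # vowel block on the left
            n = left_share(len(text), method)
            syllables[pos] += text[:n]
            syllables[pos + 1] = text[n:] + syllables[pos + 1]
    return syllables
```

Under these assumptions, `"-".join(syllabify("odstrčenou", m))` for each of the four methods reproduces the decompositions of odstrčenou given on this slide.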

18 Syllable-based compression methods
The syllable as the basic compression unit.

19 Syllable-based compression methods
LZWL
  Dictionary-based method
  Syllable-based version of LZW
HuffSyllable
  Statistical method
  Adaptive Huffman coding
  Inspired by HuffWord

20 Algorithm LZWL
The dictionary of phrases is initialized with the frequent syllables of the given language.
During compression an unknown syllable may occur; it must be added to the dictionary of phrases.
Phrases in the dictionary are extended only if no unknown syllable was detected in either the current or the previous step.
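The following sketch illustrates the three points above on a list of syllables. It is a simplified model rather than the authors' LZWL: the token format (`("phrase", code)` and `("new", syllable)`), the function names, and the rule that a new phrase is the previous phrase plus the first syllable of the current one are all assumptions, and the initial syllable list is assumed to contain no duplicates.

```python
def lzwl_encode(syllables, initial):
    """Encode a syllable list; `initial` plays the role of the frequent syllables."""
    phrases = {(s,): i for i, s in enumerate(initial)}  # phrase -> code
    out = []
    prev_phrase, prev_unknown = None, True  # no extension in the first step
    i = 0
    while i < len(syllables):
        s = syllables[i]
        if (s,) not in phrases:
            # Unknown syllable: send it explicitly and add it to the dictionary.
            out.append(("new", s))
            phrases[(s,)] = len(phrases)
            cur, unknown = (s,), True
            i += 1
        else:
            # Longest phrase in the dictionary that matches at position i.
            j = i + 1
            while j < len(syllables) and tuple(syllables[i:j + 1]) in phrases:
                j += 1
            cur, unknown = tuple(syllables[i:j]), False
            out.append(("phrase", phrases[cur]))
            i = j
        if not unknown and not prev_unknown:
            # Extend only when neither this nor the previous step hit an unknown.
            phrases.setdefault(prev_phrase + cur[:1], len(phrases))
        prev_phrase, prev_unknown = cur, unknown
    return out


def lzwl_decode(tokens, initial):
    """Mirror of lzwl_encode: rebuilds the same dictionary from the tokens."""
    phrases = [(s,) for s in initial]  # code -> phrase
    known = set(phrases)
    out = []
    prev_phrase, prev_unknown = None, True
    for kind, value in tokens:
        if kind == "new":
            cur, unknown = (value,), True
            phrases.append(cur)
            known.add(cur)
        else:
            cur, unknown = phrases[value], False
        out.extend(cur)
        if not unknown and not prev_unknown:
            new = prev_phrase + cur[:1]
            if new not in known:
                phrases.append(new)
                known.add(new)
        prev_phrase, prev_unknown = cur, unknown
    return out
```

Because the encoder and decoder apply the same update rule after every token, their dictionaries stay in step and `lzwl_decode(lzwl_encode(x, init), init)` returns `x`. A real coder would additionally map the tokens to bits.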

21 Algorithm HuffSyllable
For each syllable type there is a separate adaptive Huffman tree.
In each step, the algorithm predicts the expected type of the following syllable. This determines which tree will be used to encode it.
If the prediction is wrong (the following syllable has a different type than expected), an escape symbol is used to switch to the correct tree.
The prediction of the type of the following syllable is based on the type of the previous syllable and other criteria.

22 Prediction of syllable type
Previous syllable                                          Expected type
small, mixed                                               small
capital                                                    capital
numerical                                                  special
special with dot, last letter syllable was small           mixed
special without dot, last letter syllable wasn't capital   small
special, last letter syllable was capital                  capital
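The prediction and escape mechanism can be illustrated without any actual Huffman trees. The sketch below is hedged in several ways: the mapping in `predict_type` is one plausible reading of the slide's table, "with dot" is assumed to mean that the special syllable contains a dot, the start state is hypothetical, and the events stand in for the codes a real coder would emit.

```python
LETTER_TYPES = ("small", "capital", "mixed")


def predict_type(prev_type, prev_text, last_letter_type):
    """Expected type of the next syllable (assumed reading of the slide table)."""
    if prev_type in ("small", "mixed"):
        return "small"
    if prev_type == "capital":
        return "capital"
    if prev_type == "numeric":
        return "special"
    # prev_type == "special"
    if last_letter_type == "capital":
        return "capital"
    # Assumption: "with dot" means the special syllable contains a dot.
    return "mixed" if "." in prev_text else "small"


def encode_events(typed_syllables):
    """Turn (text, type) pairs into events; ESC marks a wrong prediction."""
    events = []
    # Hypothetical start state: behave as if a sentence has just ended.
    prev_type, prev_text, last_letter_type = "special", ".", "small"
    for text, typ in typed_syllables:
        if typ != predict_type(prev_type, prev_text, last_letter_type):
            events.append(("ESC", typ))  # switch to the correct tree
        events.append((typ, text))       # a real coder emits a Huffman code here
        if typ in LETTER_TYPES:
            last_letter_type = typ
        prev_type, prev_text = typ, text
    return events
```

Each `(type, text)` event names the tree that would encode the syllable, and an `("ESC", type)` event appears only where the predicted tree was wrong.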

23 Results
Comparison of letter-based, syllable-based, and word-based compression methods.

24 Results - Czech
Values in bpc, by file size.
Method               5-50 kB   50-100 kB   100-500 kB   500 kB-2 MB   2-5 MB
compress             4.35      4.08        3.90         3.81          ---
LZWL syllables       4.07      3.77        3.56         3.31
LZWL words           4.56      4.19        3.99         3.69
FGK                  4.97      4.95        5.00         4.99
HuffSyll syllables   3.86      3.79        3.80         3.74
HuffSyll words       3.71      3.51        3.43         3.21

25 Results - Czech
Czech is a language with rich morphology.
The syllable-based versions of LZWL and HuffSyllable are better than the letter-based methods.
The syllable-based version of LZWL is better than the word-based one.
The syllable-based version of HuffSyllable is a bit worse than the word-based one.

26 Results - English
Values in bpc, by file size.
Method               5-50 kB   50-100 kB   100-500 kB   500 kB-2 MB   2-5 MB
compress             3.79      3.57        3.34         3.27          3.08
LZWL syllables       3.31      3.09        2.87         2.64          2.37
LZWL words           3.22      3.03        2.86         2.62          2.36
FGK                  4.59      4.60        4.58         4.54
HuffSyll syllables   3.23      3.18        3.15         3.10          2.97
HuffSyll words       2.65      2.58        2.52         2.38          2.31

27 Results - English
English is a language with simple morphology.
The syllable-based versions of LZWL and HuffSyllable are better than the letter-based methods.
The syllable-based version of LZWL is a bit worse than the word-based one.
The syllable-based version of HuffSyllable is much worse than the word-based one.

28 Conclusion
What do we plan?

29 Conclusion
We plan:
  A syllable-based version of bzip2
  Trying other languages with rich morphology, for example German and Hungarian
  Compression of specific formats such as XML

