Text Compression: Syllables Jan Lánský, Michal Žemlička Dept. of Software Engineering Faculty of Mathematics.


Text Compression: Syllables Jan Lánský, Michal Žemlička Dept. of Software Engineering Faculty of Mathematics and Physics Charles University

Synopsis
Introduction
Letters, syllables, words
Algorithms of decomposition into syllables
Syllable-based compression methods
Results
Conclusion

Introduction
Why use syllable-based compression methods?

Length of Phrase
What should be the proper symbols for coding?
Letters (Shannon, 1948).
Words (HuffWord, Moffat 1987).
Syllables: logical units between letters and words.

Types of Languages
With rich morphology: words have many grammatical forms, and word meaning can be modified using prefixes. Examples: Czech, German.
With simple morphology: words have only a few grammatical forms, and word meaning can be modified using prepositions. Example: English.

Expectations
Syllable-based compression will be suitable for languages with rich morphology and for middle-sized files.
The syllable sets of two documents are more similar than the word sets of the same two documents.
The number of unique syllables in a given language will be smaller than the number of unique words.

Problems of syllable-based compression
The decomposition of words into syllables is not unique (Os-tra-va, Ost-ra-va).
Sometimes it is necessary to know the origin of the word (neu-ron, ne-u-ro-nit).
For text compression, it is not necessary to always decompose words into syllables correctly.

Letters, syllables, words

Classification of Symbols
Symbols are divided into letters and non-letters.
Letters are capital or small, and vowels or consonants.
Non-letters are special characters or digits.

Vowels vs. Consonants
In Czech, the letters r and l can act as a vowel or as a consonant, depending on their context.
Between two consonants, r and l are vowels; otherwise they are consonants.
Examples: vrtat vs. vrátit, vlk vs. vlákat.

Vowels vs. Consonants
The role of the letter y in English:
Vowel: happy.
Consonant: buying.
Vowel followed by consonant: trying.

Classification of words
Words are divided into letter words and non-letter words.
Letter words: small (hallo), capital (HALLO), mixed (Hallo).
Non-letter words: numeric (1982), special ($?+).

Syllable
A syllable is a sequence of sounds that contains exactly one maximal subsequence of vowels.
Types of syllables (analogous to the types of words):
Letter (small, capital, mixed).
Non-letter (numeric, special).
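As a rough illustration of the letter and non-letter types listed above, the following Python sketch assigns a type to a word or syllable. The function name `token_type` is hypothetical, and the slides do not define "mixed" precisely; here it is assumed to cover any letter-only token that is neither all small nor all capital.

```python
def token_type(token: str) -> str:
    """Classify a word or syllable by the types named on the slides.

    Assumption: "mixed" means a letter-only token that is neither
    entirely small nor entirely capital (for example "Hallo").
    """
    if token.isalpha():
        if token.islower():
            return "small"
        if token.isupper():
            return "capital"
        return "mixed"
    if token.isdigit():
        return "numeric"
    # Everything else (punctuation, symbols, whitespace) is "special".
    return "special"


for example in ["hallo", "HALLO", "Hallo", "1982", "$?+"]:
    print(example, token_type(example))
```

The examples are the ones used on the word classification slide.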

Algorithms of decomposing words into syllables
How can words be decomposed into syllables quickly and with minimal information about the language?

Algorithms of decomposing words into syllables
A word is decomposed into maximal sequences (blocks) of vowels and consonants. Example: odstrčenou.
The blocks of vowels form the bases of the syllables.
We have described four algorithms. They differ in how blocks of consonants are added to blocks of vowels.

Algorithms of decomposing words into syllables
Universal left (P_UL): adds consonants to the left block of vowels.
Universal right (P_UR): adds consonants to the right block of vowels.
Universal middle-left (P_UML): adds the bigger half of the consonants to the left block of vowels. Blocks of consonants of size one are added to the right.
Universal middle-right (P_UMR): adds the bigger half of the consonants to the right block of vowels.

Example of decomposing
We decompose the word odstrčenou, which contains the blocks of vowels o, r, e, ou.
P_UL: odst-rč-en-ou
P_UR: o-dstr-če-nou
P_UML: ods-tr-če-nou
P_UMR: od-str-če-nou (this is the correct decomposition)
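The four strategies can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The vowel set, the function names, and the handling of consonants at the start or end of a word (attached to the only neighbouring vowel block) are assumptions. The Czech rule that r and l count as vowels between two consonants is taken from the slides.

```python
VOWELS = set("aeiouyáéěíóúůý")   # assumed Czech vowel letters
SYLLABIC = set("rl")             # vowels only between two consonants


def is_vowel(low: str, i: int) -> bool:
    """Decide whether position i of the lower-cased word acts as a vowel."""
    ch = low[i]
    if ch in VOWELS:
        return True
    if ch in SYLLABIC and 0 < i < len(low) - 1:
        return low[i - 1] not in VOWELS and low[i + 1] not in VOWELS
    return False


def blocks(word: str):
    """Split a word into maximal blocks as (is_vowel_block, text) pairs."""
    low = word.lower()
    out = []
    for i, ch in enumerate(word):
        v = is_vowel(low, i)
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + ch)
        else:
            out.append((v, ch))
    return out


def decompose(word: str, method: str):
    """Decompose a word with method "UL", "UR", "UML" or "UMR"."""
    bl = blocks(word)
    vowel_idx = [k for k, (v, _) in enumerate(bl) if v]
    if not vowel_idx:
        return [word]            # assumption: no vowel block, one syllable
    syllables = [bl[k][1] for k in vowel_idx]
    # Assumption: leading and trailing consonants join the nearest vowel block.
    if vowel_idx[0] > 0:
        syllables[0] = bl[0][1] + syllables[0]
    if vowel_idx[-1] < len(bl) - 1:
        syllables[-1] += bl[-1][1]
    # Blocks alternate, so exactly one consonant block separates two vowel blocks.
    for j in range(len(vowel_idx) - 1):
        cons = bl[vowel_idx[j] + 1][1]
        n = len(cons)
        if method == "UL":
            left = n
        elif method == "UR":
            left = 0
        elif method == "UML":
            left = 0 if n == 1 else (n + 1) // 2   # single consonant goes right
        elif method == "UMR":
            left = n // 2
        else:
            raise ValueError(f"unknown method: {method}")
        syllables[j] += cons[:left]
        syllables[j + 1] = cons[left:] + syllables[j + 1]
    return syllables


for m in ("UL", "UR", "UML", "UMR"):
    print(m, "-".join(decompose("odstrčenou", m)))
```

For odstrčenou the sketch reproduces the four decompositions given on the example slide.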

Syllable-based compression methods
The syllable as the basic compression unit.

Syllable-based compression methods
LZWL: dictionary-based method; syllable-based version of LZW.
HuffSyllable: statistical method; adaptive Huffman coding; inspired by HuffWord.

Algorithm LZWL
The dictionary of phrases is initialized with the frequent syllables of the given language.
During compression we can encounter an unknown syllable, which must be added to the dictionary of phrases.
Phrases in the dictionary are extended only if no unknown syllable was detected in either the current or the previous step.
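The following Python sketch shows one way these rules could fit together. It is an illustration under assumptions, not the authors' encoder. It assumes that each step takes the longest dictionary phrase matching the remaining input and that the phrase added to the dictionary is the previous phrase extended by the first syllable of the current one. The slides do not say how an unknown syllable is actually encoded, so the sketch only emits a placeholder `("new", syllable)` event. The function name and output format are hypothetical.

```python
def lzwl_encode(syllables, initial_syllables):
    """Encode a list of syllables into ("code", n) and ("new", s) events."""
    # Each dictionary phrase is a tuple of syllables mapped to its code.
    codes = {(s,): i for i, s in enumerate(initial_syllables)}
    out = []
    prev = ()   # phrase from the previous step; () marks an unknown syllable
    i = 0
    while i < len(syllables):
        # Find the longest dictionary phrase that matches the input at i.
        cur = ()
        j = i
        while j < len(syllables) and cur + (syllables[j],) in codes:
            cur += (syllables[j],)
            j += 1
        if cur:
            out.append(("code", codes[cur]))
            # Extend the dictionary only if neither this step nor the
            # previous one hit an unknown syllable.
            if prev:
                new_phrase = prev + (cur[0],)
                if new_phrase not in codes:
                    codes[new_phrase] = len(codes)
            i = j
        else:
            # Unknown syllable: emit a placeholder and add it to the dictionary.
            s = syllables[i]
            out.append(("new", s))
            codes[(s,)] = len(codes)
            i += 1
        prev = cur
    return out


events = lzwl_encode(["a", "b", "a", "b", "c", "a", "b"], ["a", "b"])
print(events)
```

In this toy run the phrase ("a", "b") is added after the second step and reused twice, while the unknown syllable "c" blocks dictionary growth in its own step and in the step after it.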

Algorithm HuffSyllable
There is one adaptive Huffman tree for each syllable type.
In each step of the algorithm, the expected type of the following syllable is predicted. The prediction determines which tree is used to encode that syllable.
If the prediction is wrong (the following syllable has a type other than the expected one), an escape symbol is used to switch to the correct tree.
The prediction of the type of the following syllable is based on the type of the previous syllable and on other criteria.

Prediction of syllable type
Previous syllable | Expected type
small, mixed | small
capital |
numerical | special
special with dot, last letter syllable wasn't capital | small
special without dot, last letter syllable wasn't capital | mixed
special, last letter syllable was capital | capital

Results
Comparison of letter-based, syllable-based, and word-based compression methods.

Results - Czech
All values are in bpc.
Method / file size | 5 kB – 50 kB | 50 kB – 100 kB | 100 kB – 500 kB | 500 kB – 2 MB | 2 MB – 5 MB
compress           | 4.35 | 4.08 | 3.90 | 3.81 | ---
LZWL syllables     | 4.07 | 3.77 | 3.56 | 3.31 | ---
LZWL words         | 4.56 | 4.19 | 3.99 | 3.69 | ---
FGK                | 4.97 | 4.95 | 5.00 | 4.99 | ---
HuffSyll syllables | 3.86 | 3.79 | 3.80 | 3.74 | ---
HuffSyll words     | 3.71 | 3.51 | 3.43 | 3.21 | ---

Results - Czech
Czech is a language with rich morphology.
The syllable-based versions of LZWL and HuffSyllable are better than the letter-based methods.
The syllable-based version of LZWL is better than the word-based version.
The syllable-based version of HuffSyllable is a bit worse than the word-based version.

Results - English
All values are in bpc.
Method / file size | 5 kB – 50 kB | 50 kB – 100 kB | 100 kB – 500 kB | 500 kB – 2 MB | 2 MB – 5 MB
compress           | 3.79 | 3.57 | 3.34 | 3.27 | 3.08
LZWL syllables     | 3.31 | 3.09 | 2.87 | 2.64 | 2.37
LZWL words         | 3.22 | 3.03 | 2.86 | 2.62 | 2.36
FGK                | 4.59 | 4.60 |      | 4.58 | 4.54
HuffSyll syllables | 3.23 | 3.18 | 3.15 | 3.10 | 2.97
HuffSyll words     | 2.65 | 2.58 | 2.52 | 2.38 | 2.31

Results - English
English is a language with simple morphology.
The syllable-based versions of LZWL and HuffSyllable are better than the letter-based methods.
The syllable-based version of LZWL is a bit worse than the word-based version.
The syllable-based version of HuffSyllable is much worse than the word-based version.

Conclusion
What do we plan?

Conclusion
We plan:
A syllable-based version of bzip2.
To try other languages with rich morphology, for example German and Hungarian.
Compression of specific formats, for example XML.