Strings. Copyright D. Bockus.

Strings
Def: A string is a sequence (possibly empty) of symbols from some alphabet.
What do we use strings for?
1) Text processing, word processing.
2) Grammatical structure of languages.
3) Searching, string sequences.

String Example 1
E.g. Java's "for" statement (simplistic view): for (initialization; condition; increment)
Labelling the three parts u, v and w, a "for" statement breaks down into 'for (u; v; w)'. We can then define each part:
u → identifier = constant
v → identifier relational_operator value
w → identifier++
In this context we can also define a while loop as: while (v)
Deterministic context-free languages (programming languages) are defined by breaking rules down into sub-rules, etc.
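The breakdown of the "for" statement into u, v and w can be tried out with regular expressions. This is only a rough sketch: the class name `SimpleForRecognizer`, the token shapes (what counts as an identifier, constant or value) and the set of relational operators are assumptions made for illustration, and a real compiler would use a proper parser rather than a regex.

```java
import java.util.regex.Pattern;

class SimpleForRecognizer {
    // Token shapes are assumptions, not taken from the slides.
    static final String ID = "[A-Za-z_][A-Za-z0-9_]*";
    static final String VALUE = "(?:" + ID + "|\\d+)";
    static final String REL_OP = "(?:<=|>=|==|!=|<|>)";

    // Sub-rules named after the parts of 'for (u; v; w)'.
    static final String U = ID + "\\s*=\\s*\\d+";                  // identifier = constant
    static final String V = ID + "\\s*" + REL_OP + "\\s*" + VALUE; // identifier relational_operator value
    static final String W = ID + "\\+\\+";                         // identifier++

    // The larger rules are built out of the sub-rules.
    static final Pattern FOR = Pattern.compile(
            "for\\s*\\(\\s*" + U + "\\s*;\\s*" + V + "\\s*;\\s*" + W + "\\s*\\)");
    static final Pattern WHILE = Pattern.compile("while\\s*\\(\\s*" + V + "\\s*\\)");

    static boolean isFor(String s) {
        return FOR.matcher(s).matches();
    }

    static boolean isWhile(String s) {
        return WHILE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isFor("for (i = 0; i < 10; i++)")); // all three parts present
        System.out.println(isWhile("while (i < 10)"));         // reuses the rule for v
        System.out.println(isFor("for (i = 0; i++)"));         // condition missing
    }
}
```

Note how the rule for v is written once and reused by both the "for" and the "while" pattern, which is the point of defining rules through sub-rules.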

Strings Example 2
Genetic Coding: aab cd aab d
Searching for, and matching, codes leads into graph theory.
(Figure with labels s1, s2, s3, s4 not reproduced.)

String Example 3
Compression: converting a large volume of symbols into a smaller format.
Huffman coding
LZW compression

Basics
Given a string v, the length of v can be expressed as: 1) |v| = magnitude of v, 2) length(v).
Empty string: v = '' or v = ε.
There are 5 common operations that may be performed on strings: Insertion, Deletion, Substitution, Concatenation, Comparison.

Insertion & Deletion
Insertion: k = ac where a = (a1, a2 .. am) and c = (c1, c2 .. cn). Insert b = (b1, b2 ... bp) between a and c, giving
k = abc = a1, a2 … am, b1, b2 … bp, c1, c2 … cn
|k| = m + p + n
Deletion: k = abc. Delete c, giving k = ab.
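Insertion and deletion as described here map directly onto `StringBuilder.insert` and `StringBuilder.delete` in Java. A minimal sketch; the class and method names are made up for illustration.

```java
class InsertDelete {
    // Insert b at `position` in k. With k = ac and position = |a| this yields k = abc.
    static String insert(String k, int position, String b) {
        return new StringBuilder(k).insert(position, b).toString();
    }

    // Delete `length` symbols of k starting at `position`.
    static String delete(String k, int position, int length) {
        return new StringBuilder(k).delete(position, position + length).toString();
    }

    public static void main(String[] args) {
        String a = "str", b = "ing", c = "s";
        String k = insert(a + c, a.length(), b); // k = abc
        System.out.println(k + " has length " + k.length()); // |k| = m + p + n
        System.out.println(delete(k, a.length() + b.length(), c.length())); // k = ab
    }
}
```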

Substitution
k = α u β, where α and β may be null, i.e. |α| = 0 or |β| = 0.
Search for u and replace it with v, so k = α v β.
Notice this same operation can be accomplished with a deletion and an insertion:
k = α u β
Delete u: k = α β
Insert v: k = α v β
Note: |u| does not have to equal |v|, so |k| before does not have to equal |k| after.
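The remark that a substitution is a deletion followed by an insertion can be written out literally. A small sketch with hypothetical names; it replaces only the first occurrence of u, which is an assumption since the slide does not say what happens when u occurs more than once.

```java
class Substitute {
    // Replace the first occurrence of u in k with v, as a deletion followed by an insertion.
    static String substitute(String k, String u, String v) {
        int at = k.indexOf(u);
        if (at < 0) {
            return k; // u does not occur in k, nothing to replace
        }
        StringBuilder sb = new StringBuilder(k);
        sb.delete(at, at + u.length()); // delete u; the text before and after it is untouched
        sb.insert(at, v);               // insert v where u used to be
        return sb.toString();
    }

    public static void main(String[] args) {
        String before = "the cat sat";
        String after = substitute(before, "cat", "horse");
        // |u| != |v|, so the length of k changes.
        System.out.println(after + " (" + before.length() + " -> " + after.length() + ")");
    }
}
```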

Concatenation
This is the joining of 2 strings a & b: c = a + b
So if a = (a1, a2 .. am) and b = (b1, b2 ... bn), then c = (a1, a2 .. am, b1, b2 ... bn) and |c| = m + n.
Note: concatenation may be performed with insertion, i.e. insert b at the end of a, or with substitution: write a as aα where α is null, then substitute b for α.

Comparison
Compare a & b to see if one of the following is true: 1) a < b 2) a = b 3) a > b
1) a is less than b if, comparing a and b element by element in lexicographical order, the first pair of elements that differs has ak < bk.
Example with |a| = 4 and |b| = 5:
a1 = b1
a2 = b2
a3 < b3
a4 = b4
(a5 = ε) < b5
a3 < b3, so a < b.

Comparison Cont...
Note: a3 < b3 is the first instance where an element in a differs from b, so a < b. If a3 = b3 then a is still less than b because |a| < |b|. We can think of ε as having a value of -∞ for comparison purposes.
2) For a = b the following must be true: |a| = |b| and ak = bk for all k.
3) a > b is the opposite of (1), i.e. b < a.
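The three cases can be folded into one comparison routine. This is a hand-written sketch with hypothetical names; Java's own `String.compareTo` follows the same rules, and the sketch only spells them out.

```java
class Compare {
    // Returns a negative number if a < b, zero if a = b, a positive number if a > b.
    static int compare(String a, String b) {
        int shared = Math.min(a.length(), b.length());
        for (int k = 0; k < shared; k++) {
            if (a.charAt(k) != b.charAt(k)) {
                // The first position where the strings differ decides the result.
                return a.charAt(k) < b.charAt(k) ? -1 : 1;
            }
        }
        // One string is a prefix of the other (or they are equal): the missing
        // element of the shorter string counts as smaller than any symbol.
        return Integer.compare(a.length(), b.length());
    }

    public static void main(String[] args) {
        System.out.println(compare("abcd", "abdef")); // negative: third elements differ
        System.out.println(compare("abc", "abcd"));   // negative: a is a proper prefix of b
        System.out.println(compare("abc", "abc"));    // zero: same length, same elements
    }
}
```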

String Representations
Consider the string "L1 CMPR BANANAS WATERMELLONS 12"
There are 6 ways to represent strings in storage, noting that 3 criteria must be kept in mind:
1) Storage efficiency (1:1 packing ratio)
2) Ease of lookup (searching)
3) Ease of modification (insertion, deletion)

Fixed Length Strings
Adv: Ease of modification.
Dis: Poor storage efficiency, due to wasted space at the end of short strings.

Var Strings
Adv: Easier to look up strings, since we already have the length.
Dis: Still wastes space.

Count Delimited
Adv: Very efficient in space usage; lookup is not bad.
Dis: Modification is hard. A replacement string must be the same length, or the array must be readjusted.
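The slide's figure is not in the transcript, so the following sketch assumes that "count delimited" means each string is stored back to back in one array, preceded by its length. The class name, the method names and the choice to hold the count in a single `char` are all assumptions.

```java
class CountDelimited {
    // Pack the strings back to back, each preceded by one char holding its length.
    static char[] pack(String... strings) {
        StringBuilder sb = new StringBuilder();
        for (String s : strings) {
            sb.append((char) s.length()).append(s);
        }
        return sb.toString().toCharArray();
    }

    // Find the string at position `index` by hopping from count to count.
    static String lookup(char[] store, int index) {
        int pos = 0;
        for (int i = 0; i < index; i++) {
            pos += 1 + store[pos]; // skip the count and the characters it covers
        }
        return new String(store, pos + 1, store[pos]);
    }

    public static void main(String[] args) {
        char[] store = pack("L1", "CMPR", "BANANAS", "WATERMELLONS", "12");
        System.out.println(store.length);     // one extra slot per string, no padding
        System.out.println(lookup(store, 2)); // third string in the store
    }
}
```

The only storage overhead is one count per string, which fits the "very efficient in space usage" remark, while replacing a string with one of a different length would force everything after it to shift.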

Indexed List
Adv: Good storage and search capabilities.
Dis: Modification is poor. Strategies include always adding new strings and never reclaiming space except during a repack.

Linked List
Adv: Modification is simple pointer manipulation.
Dis: Storage overhead: one address per character.
Note: Lookup can be improved by adding an additional length field to the table or by employing a hash function.
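A rough sketch of the linked representation, with one node (and therefore one reference) per character. The names `CharList`, `Node`, `insertAfter` and `asString` are hypothetical; the point is only that insertion relinks nodes instead of shifting characters.

```java
class CharList {
    static final class Node {
        char symbol;
        Node next; // the "one address per character" overhead

        Node(char symbol, Node next) {
            this.symbol = symbol;
            this.next = next;
        }
    }

    // Build a list with one node per character; returns null for the empty string.
    static Node fromString(String s) {
        Node head = null;
        for (int i = s.length() - 1; i >= 0; i--) {
            head = new Node(s.charAt(i), head);
        }
        return head;
    }

    // Splice the characters of b in after `node`; no existing character is moved.
    static void insertAfter(Node node, String b) {
        for (int i = b.length() - 1; i >= 0; i--) {
            node.next = new Node(b.charAt(i), node.next);
        }
    }

    static String asString(Node head) {
        StringBuilder sb = new StringBuilder();
        for (Node n = head; n != null; n = n.next) {
            sb.append(n.symbol);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Node head = fromString("BANAS");
        insertAfter(head.next.next, "AN"); // insert after the first N
        System.out.println(asString(head));
    }
}
```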

Blocked Linked List
Adv: Better storage than a linked list: more characters per node.
Note: There is a trade-off between dealing with single characters and blocks of characters during modification.
Note: If modification is not required then methods such as indexed lists are quite useful. Applications include symbol tables in compilers.

Implementation
In most cases a variable length string structure is desirable, i.e. the most versatile.
Consider a string type as: String { int size; char data[]; }
Java declares string objects with methods to determine length and other attributes.
Declaring variables: String S1, S2;

Basic Functions
s.length(); -- Returns the length of s
Other useful functions:
String s.concat(String t);
String new String(s);
String s.substring(int i);
int indexOf(String t, int index);
See the Java API.
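A short usage sketch of the listed `java.lang.String` methods. The wrapper class `StringBasics` and its `demo` method are only scaffolding for the example.

```java
class StringBasics {
    // Calls each of the listed methods once and collects the results as text.
    static String[] demo(String s, String t) {
        return new String[] {
            String.valueOf(s.length()),      // number of characters in s
            s.concat(t),                     // s followed by t
            new String(s),                   // a copy of s
            s.substring(2),                  // s from index 2 to the end
            String.valueOf(s.indexOf(t, 2))  // first index >= 2 where t occurs, or -1
        };
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", demo("BANANAS", "AN")));
    }
}
```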

Variable Length Coding

Huffman Coding Algorithm
1) Collect a history of the frequencies of the characters, i.e. determine the probabilities.
2) Arrange the characters in an ordered list (priority queue) based on increasing probabilities (frequencies).
3) While (more than 1 node in list) do:
   i) Remove the first 2 nodes.
   ii) Combine them into a tree and have the tree root represent the sum of the frequencies of the children.
   iii) Insert the tree into the list, maintaining proper list order.
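The three steps can be sketched in Java with `java.util.PriorityQueue` as the ordered list. The names are hypothetical, and reading codes off the tree with 0 for the left branch and 1 for the right is a common convention assumed here, not something stated on the slide. Ties between equal frequencies are broken arbitrarily, so the exact bit patterns may vary while the code lengths stay optimal.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class Huffman {
    static final class Node {
        final char symbol;      // meaningful only for leaves
        final int freq;
        final Node left, right; // both null for leaves

        Node(char symbol, int freq, Node left, Node right) {
            this.symbol = symbol;
            this.freq = freq;
            this.left = left;
            this.right = right;
        }
    }

    static Map<Character, String> buildCodes(String text) {
        // 1) Collect the frequencies of the characters.
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        // 2) Ordered list (priority queue) by increasing frequency.
        PriorityQueue<Node> queue =
                new PriorityQueue<Node>(Comparator.comparingInt((Node n) -> n.freq));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        // 3) Repeatedly combine the two least frequent nodes under a new root.
        while (queue.size() > 1) {
            Node first = queue.poll();
            Node second = queue.poll();
            queue.add(new Node('\0', first.freq + second.freq, first, second));
        }
        Map<Character, String> codes = new HashMap<>();
        if (!queue.isEmpty()) {
            assign(queue.poll(), "", codes);
        }
        return codes;
    }

    // Walk the tree: left adds a 0, right adds a 1 (assumed convention).
    static void assign(Node node, String prefix, Map<Character, String> codes) {
        if (node.left == null) {
            // A text with a single distinct symbol still needs one bit per symbol.
            codes.put(node.symbol, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        assign(node.left, prefix + "0", codes);
        assign(node.right, prefix + "1", codes);
    }

    static String encode(String text, Map<Character, String> codes) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            bits.append(codes.get(c));
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> codes = buildCodes("aaaabbc");
        System.out.println(codes);
        System.out.println(encode("aaaabbc", codes));
    }
}
```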

Variable Length Coding New Tree

Huffman Coding Applications
Biggest contribution is the concept of prefix codes, resulting in dynamic code resolution during decompression.
Deflate (PKzip), JPEG and MP3 (https://en.wikipedia.org/wiki/Huffman_coding)
Use quantizing vectorisation to reduce the robustness of the input space. Encode these using prefix coding for efficient decompression.

LZW Compression
Lempel-Ziv-Welch (LZW) uses a method of finding the largest known prefix in a character string.
Lossless: the compressed file can be reconstructed without data loss.
Typical uses: GIF, TIFF, zip & unzip.

LZW Compression
The idea is to build a code table, where codes are added as they are discovered. Look at the prefix for a given character.

Compressor Pseudo Code
http://marknelson
STRING = get input character
WHILE there are still input characters DO
    CHARACTER = get input character
    IF STRING+CHARACTER is in the string table THEN
        STRING = STRING+CHARACTER
    ELSE
        output the code for STRING
        add STRING+CHARACTER to the string table
        STRING = CHARACTER
    END of IF
END of WHILE
output the code for STRING
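A possible Java rendering of the compressor pseudo code, as a sketch rather than a production codec. The names are hypothetical, the code book is an unbounded `HashMap` (real formats cap the code width), and the input is assumed to contain only characters from the given alphabet. Note the final output after the loop, which flushes the last pending STRING.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LzwCompressor {
    static List<Integer> compress(String input, String alphabet) {
        // Initialize the code book with every character of the alphabet: 0, 1, 2, ...
        Map<String, Integer> table = new HashMap<>();
        for (char c : alphabet.toCharArray()) {
            table.put(String.valueOf(c), table.size());
        }
        List<Integer> output = new ArrayList<>();
        if (input.isEmpty()) {
            return output;
        }
        String string = String.valueOf(input.charAt(0)); // STRING = get input character
        for (int i = 1; i < input.length(); i++) {
            char character = input.charAt(i);            // CHARACTER = get input character
            if (table.containsKey(string + character)) {
                string = string + character;             // keep extending the known prefix
            } else {
                output.add(table.get(string));           // output the code for STRING
                table.put(string + character, table.size()); // next free code
                string = String.valueOf(character);
            }
        }
        output.add(table.get(string)); // flush the code for the last STRING
        return output;
    }

    public static void main(String[] args) {
        System.out.println(compress("aaabbbbbbaabaaba", "ab"));
    }
}
```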

DeCompressor Pseudo Code
http://marknelson
Read OLD_CODE
output OLD_CODE
CHARACTER = OLD_CODE
WHILE there are still input characters DO
    Read NEW_CODE
    IF NEW_CODE is not in the translation table THEN
        STRING = get translation of OLD_CODE
        STRING = STRING+CHARACTER
    ELSE
        STRING = get translation of NEW_CODE
    END of IF
    output STRING
    CHARACTER = first character in STRING
    add OLD_CODE + CHARACTER to the translation table
    OLD_CODE = NEW_CODE
END of WHILE
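The decompressor pseudo code might look as follows in Java. Again a sketch with hypothetical names: the translation table is a growing `ArrayList` indexed by code, the alphabet is assumed to be known in advance, and the input is assumed to be a valid code sequence.

```java
import java.util.ArrayList;
import java.util.List;

class LzwDecompressor {
    static String decompress(List<Integer> codes, String alphabet) {
        // Rebuild the initial code book from the alphabet: code i is the i-th character.
        List<String> table = new ArrayList<>();
        for (char c : alphabet.toCharArray()) {
            table.add(String.valueOf(c));
        }
        StringBuilder output = new StringBuilder();
        if (codes.isEmpty()) {
            return "";
        }
        String previous = table.get(codes.get(0)); // translation of OLD_CODE
        output.append(previous);                   // no table entry for the first code
        for (int i = 1; i < codes.size(); i++) {
            int newCode = codes.get(i);
            String string;
            if (newCode >= table.size()) {
                // Not in the table yet: it must be previous + first character of previous.
                string = previous + previous.charAt(0);
            } else {
                string = table.get(newCode);
            }
            output.append(string);
            // New entry: translation of OLD_CODE + first character of STRING.
            table.add(previous + string.charAt(0));
            previous = string;
        }
        return output.toString();
    }

    public static void main(String[] args) {
        System.out.println(decompress(List.of(0, 2, 1, 4, 5, 3, 7), "ab"));
    }
}
```

In the "not found" branch the new table entry is the same string that is output, which is why the decompressor can safely guess a code it has not seen yet.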

Compressor Example
Assume we have an alphabet of a and b. We start by building a code book initialized to all characters in the alphabet, in this case a and b.
We can now compress the string: a a a b b b b b b a a b a a b a
Code  String
0     a
1     b
2-9   (empty)

Compressor Example …
Input: a a a b b b b b b a a b a a b a
At each step: find the largest prefix in the code book, output its code, then add code + next char to the code book.
Largest prefix   Output code   Added to code book
a                0             2 = a a
a a              2             3 = a a b
b                1             4 = b b
b b              4             5 = b b b
b b b            5             6 = b b b a
a a b            3             7 = a a b a
a a b a          7             (no more input to compress, so stop)
Output codes: 0 2 1 4 5 3 7

De-compressor Example
We have an encoded string: 0 2 1 4 5 3 7
To decode we need two things: knowledge of the alphabet, and an initialized code book based on the alphabet.
Headers on, say, GIF files contain the alphabet information. The code book is rebuilt during de-compression.
Code  String
0     a
1     b
2-9   (empty)

De-compressor Example..
During de-compression a code is read and an attempt is made to find this code in the code book. There are two cases: the code is found in the code book, or the code is not found in the code book.
Code found: output the string for the found code, then make an entry based on: previous string + firstChar of current string.
Not found: make an entry in the code book based on: previous string + firstChar of previous string, then output the string of the new entry.

De-compressor Example...
Notice that a code which is not found is a special case. E.g. during compression of a a a b b b …, a is coded to 0, but the compressor now enters aa into the code book, and aa is the next code to be used. During de-compression, we can guess at this code: Text(previous) + FC(previous).

De-compressor Example….
More formally: we encounter a string P[…]P[…]PQ. If P[…] is in the code book and P[…]P is not, then the compressor outputs P[…] and adds P[…]P to the code book. When the de-compressor sees the code for P[…]P it will not have added this code yet. We know from the pattern that P[…] is already in the code book and was the last code encountered, and that P[…]P would normally be added next (during compression). So we can accurately guess and enter P[…]P into the code book.
Taken from: http://www.danbbs.dk/~dino/whirlgif/lzw.html

De-compressor Example….
Encoded string: 0 2 1 4 5 3 7
Code  Case       Code book entry                               Output
0     Found      none (no entry is made for the first code)    a
2     Not found  2 = a a, Text(previous) + FC(previous)        a a
1     Found      3 = a a b, Text(previous) + FC(current)       b
4     Not found  4 = b b, Text(previous) + FC(previous)        b b
5     Not found  5 = b b b, Text(previous) + FC(previous)      b b b
3     Found      6 = b b b a, Text(previous) + FC(current)     a a b
7     Not found  7 = a a b a, Text(previous) + FC(previous)    a a b a
No more codes to de-compress: STOP.
Output text: a a a b b b b b b a a b a a b a

Links
Squeeze Page - Applets dealing with compression algorithms: http://www.cs.sfu.ca/cs/CC/365/li/squeeze/
http://www.geocities.com/yccheok/lzw/lzw.html

Finite State Machine for KMP
Pattern: 1010110 (state diagram with states labelled 1 to 7 not reproduced)