Arrays and Strings CSCI 2720 University of Georgia Spring 2007.


The Array ADT
- Stores a sequence of consecutively numbered objects
- Each object can be accessed (selected) using its index

More formally …
- Given integers l and u with u >= l - 1, the interval l..u is defined to be the set of integers i such that l <= i <= u (when u = l - 1 the interval is empty)
- An array is a function from any interval (the index set of the array) to a set of objects or elements (the value set of the array)

Formally, continued …
- If X is an array and i is a member of its index set, we write X[i] to denote the value of X at i
- The members of the range of X are known as the elements of X

The Array ADT
- Access(X,i)
- Length(X)
- Assign(X,i,v)
- Initialize(X,v)
- Iterate(X,F)

Access(X,i)
- Return X[i]

Length(X)
- Return u - l + 1, the number of elements in the interval l..u (the index set of X)

Assign(X,i,v)
- Replace array X with a function whose value on i is v (and whose value on all other arguments is unchanged)
- We also write this as: X[i] <- v

Initialize(X,v)
- Assign v to every element of array X

Iterate(X,F)
- Apply F to each element of array X in order, from smallest index to largest index
- F is an action on a single array element
- for i = l to u do F(X[i])
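The five operations above can be sketched in Python. This is only an illustration, not the course's implementation: the class name `IntervalArray` and its method names are made up here, and a Python list stands in for the underlying storage.

```python
class IntervalArray:
    """Array ADT over the index interval l..u (hypothetical sketch)."""

    def __init__(self, l, u, v=None):
        if u < l - 1:
            raise ValueError("need u >= l - 1")
        self.l, self.u = l, u
        self._cells = [v] * (u - l + 1)  # one cell per index in l..u

    def length(self):
        # Length(X): number of elements in the interval l..u
        return self.u - self.l + 1

    def access(self, i):
        # Access(X, i): return X[i]
        self._check(i)
        return self._cells[i - self.l]

    def assign(self, i, v):
        # Assign(X, i, v): X[i] <- v, all other values unchanged
        self._check(i)
        self._cells[i - self.l] = v

    def initialize(self, v):
        # Initialize(X, v): assign v to every element
        for k in range(len(self._cells)):
            self._cells[k] = v

    def iterate(self, f):
        # Iterate(X, F): apply F from smallest index to largest
        for i in range(self.l, self.u + 1):
            f(self.access(i))

    def _check(self, i):
        if not self.l <= i <= self.u:
            raise IndexError(f"index {i} outside {self.l}..{self.u}")
```

For example, `IntervalArray(3, 7, 0)` has length 5 and valid indices 3 through 7, while `IntervalArray(1, 0)` is the empty array.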

String
- A special type of array
- If Σ is any finite set, then a string over Σ is an array whose value set is Σ and whose index set is 0..n-1 for some non-negative integer n
- The set Σ is called an alphabet
- Each element of Σ is called a character
- Σ often consists of the Roman alphabet, plus digits, the space, and common punctuation marks

Strings
- If w is a string with index set 0..n-1, then Length(w) = n, also written |w|
- If w = TREE, then w is a string of length 4, with w[0] = T, w[1] = R
- The null string is the string whose domain is the empty interval; it has no elements and is written ε

String-specific operations
- Substring(w,i,m)
- Concat(w1,w2)

Substring(w,i,m)
- w is a string; i and m are integers
- Returns the string of length m containing the portion of w that starts at i
- Formally: returns a string w' with indices 0..m-1 such that w'[k] = w[i+k] for each k satisfying 0 <= k < m
- Only applies if 0 <= i <= |w| and 0 <= m <= |w| - i; otherwise, returns ε

Substring …
- Example: w = SNICKERING
  - Substring(w,2,3) returns ICK
  - Substring(w,3,0) returns ε
  - Substring(w,10,3) returns ε
- Prefix: each Substring(w,0,j) for 0 <= j <= |w| is a prefix of w
- Suffix: each Substring(w,j,|w|-j) for 0 <= j <= |w| is a suffix of w

Concat(w1,w2)
- Returns a string of length |w1| + |w2| whose characters are the characters of w1 followed by those of w2
- Concat(w,ε) = Concat(ε,w) = w
- Example: w1 = BIRD, w2 = DOG; Concat(w1,w2) = BIRDDOG, Concat(w2,w1) = DOGBIRD
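A small Python sketch of the string operations, using Python's `str` as the string type and `""` as the null string. It assumes the bounds 0 <= i <= |w| and 0 <= m <= |w| - i, which is the reading consistent with the SNICKERING examples; the function names are illustrative only.

```python
def substring(w, i, m):
    """Substring(w, i, m): m characters of w starting at index i, else the null string."""
    if 0 <= i <= len(w) and 0 <= m <= len(w) - i:
        return w[i:i + m]
    return ""  # out-of-range request yields the null string


def concat(w1, w2):
    """Concat(w1, w2): characters of w1 followed by those of w2."""
    return w1 + w2


def prefixes(w):
    # every Substring(w, 0, j) for 0 <= j <= |w|
    return [substring(w, 0, j) for j in range(len(w) + 1)]


def suffixes(w):
    # every Substring(w, j, |w| - j) for 0 <= j <= |w|
    return [substring(w, j, len(w) - j) for j in range(len(w) + 1)]
```

Note that the null string is both a prefix and a suffix of every string, since j may equal 0 or |w|.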

Tables vs. Arrays
- Table = physical organization of memory into sequential cells
- Array = an abstract data type, with specific operations
- Arrays are frequently implemented using tables, but may be implemented in other ways

Multi-dimensional arrays
- A function whose range is any set V and whose domain is the Cartesian product of any number of intervals
- The Cartesian product of intervals I_1, I_2, …, I_d, written I_1 x I_2 x … x I_d, is the set of all d-tuples <i_1, i_2, …, i_d> such that i_k ∈ I_k for each k

Multi-D arrays
- If C is a multidimensional array and i = <i_1, i_2, …, i_d> is a member of its index set, then C[i_1, i_2, …, i_d] is the value of C at i
- The dimension of a multi-D array is the number of intervals whose Cartesian product makes up the index set
- The size of the k-th dimension of such an array is the number of elements in I_k

Contiguous Representation of Arrays: Why Computer Scientists start counting at 0
- Store elements in a table:
  address:  x     x+4   x+8   x+12  x+16  x+20
  element:  x[0]  x[1]  x[2]  x[3]  x[4]  x[5]
- Element x[i] begins at x + 4*i (if indices started at 1, it would be x + 4*(i-1), which costs an extra subtraction)
  - x = starting address of the array
  - 4 = sizeof(element)
  - i = index of element of interest

More generally
- If X is the address of the first cell in memory of an array with indices l..u, and if each element has size L, then the element with index i is stored at address X + L * (i - l)
- The element can be retrieved in constant time

When iterating through the array
- Can save a few operations by doing "pointer arithmetic"
  - just add L to the current address to get the next element
  - don't have to subtract, multiply, add
- Still linear in the number of elements, but with a smaller constant factor
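The address calculation and the pointer-arithmetic shortcut can be compared in a short Python sketch. For indices l..u the offset is measured from the lower bound, so the sketch uses X + L * (i - l); with l = 0 this gives X + L * i and with l = 1 it gives X + L * (i - 1). The function names and the base address 1000 are illustrative assumptions.

```python
def element_address(base, elem_size, lower, i):
    """Address of element i of an array with lowest index `lower`."""
    return base + elem_size * (i - lower)  # subtract, multiply, add


def addresses_by_increment(base, elem_size, n):
    """Addresses of n consecutive elements, found by repeated addition only."""
    addr = base
    out = []
    for _ in range(n):
        out.append(addr)
        addr += elem_size  # "pointer arithmetic": one add per step
    return out
```

Both approaches visit the same addresses; the incremental version replaces a subtract, a multiply, and an add per element with a single add.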

Where's the needed info stored?
- Could store L, l, and u at the starting address of X, but would then need to adjust the formula used to calculate the location of individual cells
- If the language is strongly typed, some or all of L, l, and u may be part of the definition of X and stored elsewhere
  - C/C++: L is part of the typing info, l is assumed to be 0, u is not stored (the programmer needs to keep track)

Where's the needed info stored?
- Can use a sentinel value after the last element of the array
  - C/C++: we do this with strings, storing a '\0' at the end
- Means that you need to iterate through the array to find Length, so Length is no longer O(1)

What if the elements have different lengths?
- Allot Max (the size of the largest element) to all elements
  - wasted space
  - can still access in O(1) time
- Store pointers to elements
  - pointers require memory
  - need 2 accesses (calculate location of pointer, then follow it), but still O(1)
  - pointer to element i is at X + P * (i - l), where P is the size of a pointer
  - easy to swap even large or complex elements: just swap their pointers

2D arrays
- Can also represent in contiguous memory, but do we keep rows together or do we keep columns together?
- Example: array with logical ordering
  A B C D
  E F G H
  I J K L

Row-major vs. column-major
- Row-major: A B C D E F G H I J K L
- Column-major: A E I B F J C G K D H L

Where are 2D elements stored?
- Row-major: R[i,j] is stored at R + L * (NPR * (i-1) + (j-1)), where
  - R is the starting address of the array
  - L is the size of each element
  - NPR is the number of elements per row
  - i is the row number and j is the column number, both counted from 1

Where are 2D elements stored?
- Column-major: C[i,j] is stored at C + L * (NPC * (j-1) + (i-1)), where
  - C is the starting address of the array
  - L is the size of each element
  - NPC is the number of elements per column
  - i is the row number and j is the column number, both counted from 1
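The two layout formulas can be checked against the A..L example with a short Python sketch. Rows and columns are counted from 1, as in the formulas; the function names are illustrative, and the flattened strings stand in for memory with base address 0 and element size 1.

```python
def row_major_address(base, elem_size, n_per_row, i, j):
    """Address of element [i, j] (1-based) when rows are kept together."""
    return base + elem_size * (n_per_row * (i - 1) + (j - 1))


def column_major_address(base, elem_size, n_per_col, i, j):
    """Address of element [i, j] (1-based) when columns are kept together."""
    return base + elem_size * (n_per_col * (j - 1) + (i - 1))


# The 3 x 4 example array, flattened both ways.
ROW_MAJOR = "ABCDEFGHIJKL"
COLUMN_MAJOR = "AEIBFJCGKDHL"

# G sits in row 2, column 3 of the logical array.
g_row = ROW_MAJOR[row_major_address(0, 1, 4, 2, 3)]
g_col = COLUMN_MAJOR[column_major_address(0, 1, 3, 2, 3)]
```

Both lookups land on G, even though G sits at offset 6 in the row-major layout and offset 7 in the column-major one.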

Multi-dimensional arrays

Constant-time initialization
  procedure Initialize(ptr M, value v)
    // Initialize each element of M to v
    Count(M) <- 0
    Default(M) <- v

  function Valid(int i, ptr M): boolean
    // Return true if M[i] has been modified since the last Initialize
    return (0 <= When(M)[i] < Count(M)) and (Which(M)[When(M)[i]] == i)

Constant-time initialization
  function Access(int i, ptr M): value
    // Return M[i]
    if Valid(i, M) then return Data(M)[i]
    else return Default(M)

  procedure Assign(ptr M, int i, value v)
    // Set M[i] <- v
    if not Valid(i, M) then
      When(M)[i] <- Count(M)
      Which(M)[Count(M)] <- i
      Count(M) <- Count(M) + 1
    Data(M)[i] <- v

But requires 3x memory: Which(M), When(M), Data(M)
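A Python sketch of the constant-time initialization scheme. Python cannot hand out truly uninitialized memory, so the constructor fills the three tables with arbitrary values to simulate it (an assumption of this sketch, and itself an O(n) step); the point is that `initialize` afterwards touches only two fields. The class name `LazyArray` is made up for illustration.

```python
import random


class LazyArray:
    """Array of n cells whose Initialize runs in O(1) (hypothetical sketch)."""

    def __init__(self, n, default):
        # Simulate uninitialized memory: contents are arbitrary garbage.
        self.data = [random.randrange(n) for _ in range(n)]
        self.when = [random.randrange(n) for _ in range(n)]
        self.which = [random.randrange(n) for _ in range(n)]
        self.initialize(default)

    def initialize(self, v):
        # O(1): forget all assignments without touching the tables.
        self.count = 0
        self.default = v

    def _valid(self, i):
        # True if cell i was assigned since the last initialize.
        w = self.when[i]
        return 0 <= w < self.count and self.which[w] == i

    def access(self, i):
        return self.data[i] if self._valid(i) else self.default

    def assign(self, i, v):
        if not self._valid(i):
            self.when[i] = self.count      # cell i is the count-th cell assigned
            self.which[self.count] = i     # and slot count points back at i
            self.count += 1
        self.data[i] = v
```

The cross-check between `when` and `which` is what makes garbage harmless: a stale `when[i]` can only pass the test if the matching `which` slot was written since the last initialize and names cell i.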

Sparse Arrays
- Definitions
- List Representations
- Hierarchical Tables
- Arrays with Special Shapes

Sparse Arrays
- Some arrays contain only a few elements; wouldn't it be more efficient to store only the non-null values?
  - same idea when only a few values differ from the majority
- Some arrays have a special shape: upper-triangular matrix, symmetric matrix
- Sparse array: an array in which only a small fraction of the elements are significant in some way
- Null element: doesn't need to be stored; is either actually null, or well-known, or easily calculated

List representations

Hierarchical tables

Upper-triangular matrix

Representation of Strings
- Background
- Huffman Encoding
- Lempel-Ziv Encoding

Representing Strings
- How much space do we need?
- Assume we represent every character
- How many bits to represent each character?
- Depends on |Σ|

Bits to encode a character
- Two-character alphabet {A,B}: one bit per character
  - 0 = A, 1 = B
- Four-character alphabet {A,B,C,D}: two bits per character
  - 00 = A, 01 = B, 10 = C, 11 = D
- Six-character alphabet {A,B,C,D,E,F}: three bits per character
  - 000 = A, 001 = B, 010 = C, 011 = D, 100 = E, 101 = F, 110 = unused, 111 = unused

More generally
- The bit sequence representing a character is called the encoding of the character
- There are 2^n different bit sequences of length n, so ceil(lg |Σ|) bits are required to represent each character in Σ
- If we use the same number of bits for each character, then the length of the encoding of a word w is |w| * ceil(lg |Σ|)
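The fixed-length bit count can be computed directly. This is a minimal Python sketch with illustrative function names; it uses integer arithmetic for ceil(lg n) to avoid floating-point rounding.

```python
def bits_per_char(alphabet_size):
    """ceil(lg(alphabet_size)) for alphabet_size >= 1, computed exactly."""
    return (alphabet_size - 1).bit_length()


def fixed_length_bits(word, alphabet_size):
    """Length in bits of a fixed-length encoding of word: |w| * ceil(lg |alphabet|)."""
    return len(word) * bits_per_char(alphabet_size)
```

For the alphabets above this gives 1, 2, and 3 bits per character; a 26-letter alphabet needs 5 bits, so TREE would take 20 bits.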

Can we do better??
- If Σ is very small, might use run-length encoding

What if … the string we encode doesn't use all the letters in the alphabet?
- Then ceil(log2(|set_of_characters_used|)) bits per character are enough
- But then also need to store / transmit the mapping from encodings to characters
- … and the set of characters used is typically close to the size of the alphabet

Huffman Encoding
- Still assumes encoding on a per-character basis
- Observation: assigning shorter codes to frequently used characters can result in overall shorter encodings of strings
  - requires assigning longer codes to rarely used characters
- Problem: when decoding, need to know how many bits to read off for each character
- Solution: choose an encoding that ensures that no character encoding is the prefix of any other character encoding. An encoding tree has this property.

A Huffman Encoding Tree
[Figure: encoding tree with leaves A, T, R, N, and E]

[Figure: the same tree with its code table]
  A = 000, T = 001, R = 010, N = 011, E = 1

Weighted path length
- Codes: A = 000, T = 001, R = 010, N = 011, E = 1
- Weighted path = f(A) * Len(code(A)) + f(T) * Len(code(T)) + f(R) * Len(code(R)) + f(N) * Len(code(N)) + f(E) * Len(code(E))
  = (3 * 3) + (2 * 3) + (3 * 3) + (4 * 3) + (9 * 1)
  = 9 + 6 + 9 + 12 + 9
  = 45
- Claim (proof in text): no other encoding can result in a shorter weighted path length
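The weighted path length and the prefix property can both be checked mechanically. In this Python sketch the code table is the one on the slide, and the frequencies (A = 3, T = 2, R = 3, N = 4, E = 9) are read off the slide's arithmetic; the function names are illustrative.

```python
CODES = {"A": "000", "T": "001", "R": "010", "N": "011", "E": "1"}
FREQS = {"A": 3, "T": 2, "R": 3, "N": 4, "E": 9}  # inferred from the slide's sum


def weighted_path_length(codes, freqs):
    """Sum over characters of frequency * code length (total encoded bits)."""
    return sum(freqs[ch] * len(code) for ch, code in codes.items())


def is_prefix_free(codes):
    """True if no codeword is a prefix of a different codeword."""
    words = list(codes.values())
    return not any(
        a != b and b.startswith(a) for a in words for b in words
    )
```

With these inputs the weighted path length is 45 bits, against 21 characters * 3 bits = 63 bits for a fixed-length code over the same five characters.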

Building the Huffman Tree
[Figure sequence: start with four leaves A:3, T:4, R:4, E:5. A and T are merged under a new node of weight 7; R and E are merged under a new node of weight 9; finally the nodes of weight 7 and 9 are merged under a root of weight 16.]

Taking a step back …
- Why do we need compression?
  - rate of creation of image and video data
  - image data from digital camera
    - today 1k by 1.5k is common = 1.5 Mbytes
    - need 2k by 3k to equal a 35mm slide = 6 Mbytes
  - video at even a low resolution of 512 by 512, with 3 bytes per pixel and 30 frames/second

Compression basics
- Video data rate
  - 23.6 Mbytes/second
  - 2 hours of video = 169 gigabytes
- MPEG-1 compresses
  - 23.6 Mbytes/second down to 187 kbytes/second
  - 169 gigabytes down to 1.3 gigabytes
- Compression is essential for both storage and transmission of data
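The figures above follow from simple arithmetic, reproduced here as a Python sketch. It assumes decimal units (1 Mbyte = 10^6 bytes, 1 gigabyte = 10^9 bytes) and takes the 187 kbytes/second MPEG-1 rate from the slide; the variable names are illustrative.

```python
WIDTH, HEIGHT = 512, 512
BYTES_PER_PIXEL = 3
FRAMES_PER_SECOND = 30

# Uncompressed video data rate in bytes per second.
bytes_per_second = WIDTH * HEIGHT * BYTES_PER_PIXEL * FRAMES_PER_SECOND

# Two hours of uncompressed video, in bytes.
two_hours_bytes = bytes_per_second * 2 * 60 * 60

# MPEG-1 rate quoted on the slide, and the resulting two-hour size.
mpeg1_bytes_per_second = 187_000
mpeg1_two_hours_bytes = mpeg1_bytes_per_second * 2 * 60 * 60

compression_ratio = bytes_per_second / mpeg1_bytes_per_second

print(f"{bytes_per_second / 1e6:.1f} Mbytes/second")
print(f"{two_hours_bytes / 1e9:.1f} gigabytes uncompressed")
print(f"{mpeg1_two_hours_bytes / 1e9:.2f} gigabytes with MPEG-1")
print(f"ratio about {compression_ratio:.0f} to 1")
```

The ratio works out to roughly 126 to 1, which is why the 169 gigabytes shrink to a little over 1.3 gigabytes.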

Compression basics
- Compression is very widely used
  - jpeg, gif for single images
  - mpeg1, 2, 3, 4 for video sequences
  - zip for computer data
  - mp3 for sound
- Based on two fundamental principles: spatial coherence and temporal coherence
  - similarity with spatial neighbor
  - similarity with temporal neighbor

Basics of compression
- character = basic data unit in the input stream; represents a byte, bit, etc.
- strings = sequences of characters
- encoding = compression
- decoding = decompression
- codeword = data elements used to represent input characters or character strings
- codetable = list of codewords

Codeword
- Encoder/compressor takes characters/strings as input and uses the codetable to decide which codewords to produce
- Decoder/decompressor takes codewords as input and uses the same codetable to decide which characters/strings to produce

Codetable
- Clearly the encoder must pass the encoded data to the decoder as a series of codewords
- The decoder also needs the codetable
- The codetable can be passed explicitly or implicitly; that is, we either
  - pass it across
  - agree on it beforehand (hard wired)
  - recreate it from the codewords (clever!)

Basic definitions
- Compression ratio = size of original data / size of compressed data
  - basically, the higher the compression ratio the better
- Lossless compression
  - output data is exactly the same as input data
  - essential for encoding computer-processed data
- Lossy compression
  - output data is not the same as input data
  - acceptable for data that is only viewed or heard

Lossless versus lossy
- Human visual system is less sensitive to high-frequency losses and to losses in color
- Lossy compression is acceptable for visual data
- Degree of loss is usually a parameter of the compression algorithm
- Tradeoff: loss versus compression
  - higher compression => more loss
  - lower compression => less loss

Symmetric versus asymmetric
- Symmetric
  - encoding time == decoding time
  - essential for real-time applications (e.g. video or audio on demand)
- Asymmetric
  - encoding time >> decoding time
  - ok for write-once, read-many situations

Entropy encoding
- Compression that does not take into account what is being compressed
- Normally is also lossless encoding
- Most common types of entropy encoding
  - run length encoding
  - Huffman encoding
  - modified Huffman (fax…)
  - Lempel-Ziv

Source encoding
- Takes into account the type of data (e.g. visual)
- Normally is lossy but can also be lossless
- Most common types in use:
  - JPEG, GIF = single images
  - MPEG = sequence of images (video)
  - MP3 = sound sequence
- Often uses entropy encoding as a sub-routine

Run length encoding
- One of the simplest and earliest types of compression
- Takes account of repeating data (called runs)
- Runs are represented by a count along with the original data
  - e.g. AAAABB => 4A2B
- Do you run length encode a single character?
  - no, use a special prefix character to represent the start of runs

Run length encoding
- Runs are represented as <prefix char><repeat count><run char>
- The prefix char itself becomes <prefix char>1<prefix char>
- Want a prefix char that is not too common
- An example of early use is the MacPaint file format
- Run length encoding is lossless and has fixed length codewords
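A Python sketch of run length encoding with a prefix character. The byte layout (prefix byte, one-byte repeat count, run byte), the prefix value 0x1B, and the minimum run length of 4 are all assumptions made for this illustration; they are not taken from MacPaint or any other real format.

```python
PREFIX = 0x1B  # hypothetical prefix byte, chosen to be uncommon in text


def rle_encode(data: bytes, prefix: int = PREFIX, min_run: int = 4) -> bytes:
    """Encode runs as (prefix, count, byte); short runs are copied as-is."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        run = 1
        while i + run < len(data) and data[i + run] == byte and run < 255:
            run += 1
        if run >= min_run or byte == prefix:
            # A literal prefix byte is always escaped, even for a run of 1.
            out += bytes([prefix, run, byte])
        else:
            out += bytes([byte]) * run
        i += run
    return bytes(out)


def rle_decode(data: bytes, prefix: int = PREFIX) -> bytes:
    """Invert rle_encode."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == prefix:
            count, byte = data[i + 1], data[i + 2]
            out += bytes([byte]) * count
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

With these settings `AAAABB` becomes the three-byte run for `A` followed by a literal `BB`, since a run of two would grow if it were encoded as a triple.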

MacPaint File Format

Run length encoding
- Works best for images with a solid background
- A good example of such an image is a cartoon
- Does not work as well for natural images
- Does not work well for English text
- However, is almost always a part of a larger compression system

Huffman encoding
- Assume we know the frequency of each character in the input stream
- Then encode each character as a variable length bit string, with more frequent characters getting shorter bit strings
- Variable length codewords are used; an early example is Morse code
- Huffman produced an algorithm for assigning codewords optimally

Huffman encoding
- Input = probabilities of occurrence of each input character (frequencies of occurrence)
- Output is a binary tree
  - each leaf node is an input character
  - each branch is a zero or one bit
  - the codeword for a leaf is the concatenation of bits for the path from the root to the leaf
  - the codeword is a variable length bit string
- A very good compression ratio (optimal)?

Huffman encoding
- Basic algorithm
  - Mark all characters as free tree nodes
  - While there is more than one free node
    - Take the two nodes with the lowest freq. of occurrence
    - Create a new tree node with these nodes as children and with freq. equal to the sum of their freqs.
    - Remove the two children from the free node list
    - Add the new parent to the free node list
- The last remaining free node is the root of the binary tree used for encoding/decoding
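The basic algorithm can be sketched in Python with a priority queue holding the free nodes. This is one common way to implement it, not necessarily the course's; the function name, the use of `heapq`, and the insertion-order tie-breaking are choices made for this sketch. Characters are assumed to be strings, and internal nodes are represented as (left, right) tuples.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Map each character in freqs to its Huffman codeword (a string of 0s and 1s)."""
    if not freqs:
        return {}
    order = count()  # tie-breaker so the heap never compares tree nodes
    free = [(f, next(order), ch) for ch, f in freqs.items()]
    heapq.heapify(free)
    if len(free) == 1:
        return {free[0][2]: "0"}  # a lone character still needs one bit

    while len(free) > 1:
        f1, _, left = heapq.heappop(free)    # two lowest-frequency free nodes
        f2, _, right = heapq.heappop(free)
        heapq.heappush(free, (f1 + f2, next(order), (left, right)))

    codes = {}

    def walk(node, path):
        if isinstance(node, tuple):          # internal node: (left, right)
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:                                # leaf: an input character
            codes[node] = path

    walk(free[0][2], "")
    return codes
```

With frequencies A = 3, T = 4, R = 4, E = 5, the merges are A + T = 7, then R + E = 9, then 7 + 9 = 16, and every character ends up with a two-bit codeword. Different tie-breaking rules can produce different trees, but the total encoded length is the same.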

Huffman example
- A series of colors in an 8 by 8 screen
- Colors are red, green, cyan, blue, magenta, yellow, and black (written r, g, c, b, m, y, k)
- Sequence is
  rkkkkkkk
  gggmcbrr
  kkkrrkkk
  bbbmybbr
  kkrrrrgg
  gggggggr
  kkbcccrr
  grrrrgrr
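For this example, the character frequencies and the size of a Huffman encoding can be computed with a short Python sketch. It uses the fact that the total number of encoded bits equals the sum of the weights of all merged nodes, so no tree needs to be built; the names are illustrative.

```python
import heapq
from collections import Counter

SCREEN = "rkkkkkkk gggmcbrr kkkrrkkk bbbmybbr kkrrrrgg gggggggr kkbcccrr grrrrgrr"


def huffman_total_bits(freqs):
    """Total encoded length in bits: the sum of all merged-node weights."""
    heap = list(freqs.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


freqs = Counter(SCREEN.replace(" ", ""))      # pixels per color
huffman_bits = huffman_total_bits(freqs)      # variable-length encoding
fixed_bits = sum(freqs.values()) * 3          # 7 colors need 3 bits each
```

The counts are r = 19, k = 17, g = 14, b = 7, c = 4, m = 2, y = 1, giving 152 bits with Huffman codewords against 192 bits with three bits per pixel.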

Huffman example

Fixed versus variable length codewords
- Run length codewords are fixed length
- Huffman codewords are variable length
  - more frequent characters get shorter codewords
- Variable length codes need the prefix property to be decoded unambiguously
  - one code can not be the prefix of another
  - the binary tree structure guarantees that this is the case (a leaf node is a leaf node!)

Huffman encoding
- Advantages
  - maximum compression ratio, assuming correct probabilities of occurrence
  - easy to implement and fast
- Disadvantages
  - need two passes for both encoder and decoder
    - one to create the frequency distribution
    - one to encode/decode the data
  - can avoid this by sending the tree (takes time) or by having unchanging frequencies

Modified Huffman encoding
- If we know the frequency of occurrences, then Huffman works very well
- Consider the case of a fax: mostly long white spaces with short bursts of black
- Do the following
  - run length encode each string of bits on a line
  - Huffman encode these run length codewords
  - use a predefined frequency distribution
- Combination: run length, then Huffman

Beyond Huffman Coding …
- 1977: Lempel & Ziv, Israeli information theorists, develop a dictionary-based compression method (LZ77)
- 1978: they develop another dictionary-based compression method (LZ78)

The LZ family
- LZ77
  - LZR
  - LZSS
  - LZB
  - LZH (used by zip and unzip)
- LZ78
  - LZW (Unix compress)
  - LZC (Unix compress)
  - LZT
  - LZMW
  - LZJ
  - LZFG

Overview of LZ family
- To demonstrate: take a simple alphabet containing only two letters, a and b, and create a sample stream of text

LZ family overview
- Rule: separate this stream of characters into pieces of text so that each piece is the shortest string of characters that we have not seen so far

Sender: The Compressor
- Before compression, the pieces of text from the breaking-down process are indexed from 1 to n

- Indices are used to number the pieces of data
- The empty string (start of text) has index 0
- The piece indexed by 1 is a. Thus a, together with the initial (empty) string, must be numbered 0a
- String 2, aa, will be numbered 1a, because it contains a, whose index is 1, and the new character a

- The process of renaming pieces of text starts to pay off: small integers replace what were once long strings of characters
- We can now throw away our old stream of text and send the encoded information to the receiver
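The breaking-down and renumbering steps can be sketched in Python. Each piece is emitted as a pair (index of the longest previously seen piece, new character), matching the 0a, 1a notation above. The sample stream `aaaaabbab` is a hypothetical one chosen for this sketch, since the deck's own stream did not survive in the transcript; the function names are also illustrative.

```python
def lz78_parse(text):
    """Split text into shortest unseen pieces; return (prefix index, char) pairs."""
    index = {"": 0}          # the empty string (start of text) has index 0
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in index:
            phrase += ch     # still a piece we have seen; keep extending
        else:
            pairs.append((index[phrase], ch))
            index[phrase + ch] = len(index)   # next free index: 1, 2, 3, ...
            phrase = ""
    if phrase:
        # Input ended inside an already-seen piece: emit it with no new char.
        pairs.append((index[phrase], ""))
    return pairs


def lz78_rebuild(pairs):
    """Receiver side: rebuild the text from the (index, char) pairs."""
    phrases = [""]
    out = []
    for idx, ch in pairs:
        piece = phrases[idx] + ch
        phrases.append(piece)
        out.append(piece)
    return "".join(out)
```

On `aaaaabbab` the pieces are a, aa, aab, b, ab, encoded as 0a, 1a, 2b, 0b, 1b; the third pair is the 2b-for-aab case, where one small integer plus one character stands for three characters.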

Bit Representation of Coded Information
- Now we want to calculate the number of bits needed
  - each chunk is an int and a letter
  - the number of bits for the int depends on the size of table permitted in the dictionary
  - every character will occupy 8 bits because it will be represented in US ASCII format

Compression good?
- In a long string of text, the number of bits needed to transmit the coded information is small compared to the actual length of the text
- Example: 12 bits to transmit the code 2b instead of the 24 bits (3 characters at 8 bits each) needed for the actual text aab

Receiver: The Decompressor (Implementation)
- Receiver knows exactly where the boundaries are, so there is no problem in reconstructing the stream of text
- Preferable to decompress the file in one pass; otherwise, we will encounter a problem with temporary storage

Lempel-Ziv applet
- See OldCourses/1997/topic23/#JavaApplet

Lempel-Ziv-Welch (LZW)
- Previous methods worked only on characters
- LZW works by encoding strings
  - some strings are replaced by a single codeword
- For now assume the codeword is fixed length (12 bits)
- For 8 bit characters, the first 256 (or fewer) entries in the table are reserved for the characters
- The rest of the table (entries 256 to 4095) represents strings

LZW compression
- Trick is that the string-to-codeword mapping is created dynamically by the encoder
- Also recreated dynamically by the decoder
- Need not pass the code table between the two
- Is a lossless compression algorithm
- Degree of compression is hard to predict
  - depends on the data, but gets better as the codeword table contains more strings

LZW encoder
  Initialize table with single character strings
  STRING = first input character
  WHILE not end of input stream
    CHARACTER = next input character
    IF STRING + CHARACTER is in the string table
      STRING = STRING + CHARACTER
    ELSE
      Output the code for STRING
      Add STRING + CHARACTER to the string table
      STRING = CHARACTER
  END WHILE
  Output the code for STRING
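The encoder pseudocode translates almost line for line into Python. This sketch assumes 8-bit characters (code points below 256) and a 12-bit table of 4096 entries that simply stops growing when full, which is one of several possible policies; the function name and the `max_codes` parameter are illustrative.

```python
def lzw_encode(text, max_codes=4096):
    """Return the list of LZW codewords (integers) for text."""
    # Initialize table with single character strings: codes 0..255.
    table = {chr(c): c for c in range(256)}
    if not text:
        return []
    codes = []
    string = text[0]
    for character in text[1:]:
        if string + character in table:
            string = string + character
        else:
            codes.append(table[string])               # output the code for STRING
            if len(table) < max_codes:
                table[string + character] = len(table)  # next free code: 256, 257, ...
            string = character
    codes.append(table[string])                       # output code for the final STRING
    return codes
```

For the input `BABAABAAA` this produces the codewords 66, 65, 256, 257, 65, 260: nine characters are sent as six codewords, with 256 = BA, 257 = AB, and 260 = AA having been added to the table along the way.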

Demonstrations
- Another animated LZ algorithm …

LZW encoder example
- Compress the string BABAABAAA

LZW decoder
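The decoder's pseudocode is not reproduced in this transcript, so the following Python sketch uses the standard textbook formulation of LZW decoding rather than the slide's own wording. It assumes the same conventions as an encoder with 8-bit characters, a 4096-entry table, and no new entries once the table is full; the function name and `max_codes` parameter are illustrative.

```python
def lzw_decode(codes, max_codes=4096):
    """Rebuild the text from a list of LZW codewords, recreating the table on the fly."""
    table = {c: chr(c) for c in range(256)}   # codes 0..255 are single characters
    if not codes:
        return ""
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The encoder used a string it had only just added:
            # it must be the previous string plus its own first character.
            entry = previous + previous[0]
        else:
            raise ValueError(f"invalid LZW code {code}")
        out.append(entry)
        if len(table) < max_codes:
            table[len(table)] = previous + entry[0]   # mirror the encoder's new entry
        previous = entry
    return "".join(out)
```

Decoding 66, 65, 256, 257, 65, 260 yields `BABAABAAA`. The last codeword, 260, arrives before the decoder has defined it, which is exactly the special case handled above.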

Lempel-Ziv compression
- A lossless compression algorithm
- All encodings have the same length
  - but may represent more than one character
- Uses a "dictionary" approach: keeps track of characters and character strings already encountered

LZW decoder example
- Decompress the string

LZW Issues
- Compression gets better as the code table grows
- What happens when all 4096 locations in the string table are used?
- A number of options, but encoder and decoder must agree to do the same thing
  - do not add any more entries to the table (use it as is)
  - clear the codeword table and start again
  - clear the codeword table and start again with a larger table/longer codewords (GIF format)

LZW advantages/disadvantages
- Advantages
  - simple, fast, and good compression
  - can do compression in one pass
  - dynamic codeword table built for each file
  - decompression recreates the codeword table so it does not need to be passed
- Disadvantages
  - not the optimum compression ratio
  - actual compression hard to predict

Entropy methods
- All previous methods are lossless and entropy based
- Lossless methods are essential for computer data (zip, gzip, etc.)
- Combination of run length encoding and Huffman is a standard tool
- Are often used as a subroutine by lossy methods (JPEG, MPEG)

String Searching
- Background
- Knuth-Morris-Pratt algorithm
- Boyer-Moore algorithm
- Fingerprinting and the Karp-Rabin algorithm