Presentation is loading. Please wait.

Presentation is loading. Please wait.

CompSci 201 Data Representation & Huffman Coding

Similar presentations


Presentation on theme: "CompSci 201 Data Representation & Huffman Coding"— Presentation transcript:

1 CompSci 201 Data Representation & Huffman Coding
Owen Astrachan Jeff Forbes November 29, 2017 11/29/17 CompSci 201, Fall 2017, Data Rep

2 X is for … XOR (A || B) && !(A && B) XML Extensible Markup Language
Xerox PARC GUIs, laser printers, OOP, ubicomp Xavier Winning 11/29/17 Compsci 201, Fall 2017, Data Rep

3 Getting Help on Piazza Help us help you! Where does your code hurt?
What have you tested on your own? How do you know your code is correct? Issues Referring to the writeup & previous questions Common problems Not a remote diagnostic service Public vs. private questions 11/29/17 Compsci 201, Fall 2017, Data Rep

4 What’s left? Huffman Coding Data Representation Graphs Upcoming
Compression Decompression Graphs Upcoming APT Quiz 3

5 Data and its Implications
11/29/17 Compsci 201, Fall 2017, Data Rep

6 Data Compression Why? Save space & Time How? Exploit redundancy
Leverage patterns Types Lossless: Can recover exact data Lossy: Can recover approximate data Compression Ratio (Encoded bits) / (original bits) Weissman Ratio Account for runtime

7 Compression! From Silicon Valley From Fall 2007: Huff that Code!
Jasdeep Garcha, Vyshak Chandra, Jonathan Wall, and Matt McKenna

8 Solving the problem Steps to developing a usable algorithm.
Identify and model the problem. Find an algorithm to solve it. Fast enough? Fits in memory? If not, figure out why. Find a way to address the problem. Iterate until satisfied. The scientific method. Mathematical analysis.

9 Data Representation Ultimately everything is stored as either a 0 or 1
Bit is binary digit a byte is a binary term (8 bits) We should be thankful we can deal with Strings rather than sequences of 0's and 1's. We should be thankful we can deal with an int rather than the 32 bits that comprise an int Program using abstractions and high level concepts Do we need to understand 32-bit twos-complement storage to understand x =x+1? Do we need to understand how arrays map to contiguous memory to use ArrayLists? Top-down vs. bottom-up?

10 Primitive types Some applications require attention to memory-use
Differences: one-million bytes, chars, and int First requires a megabyte, last requires four megabytes When do we care about these differences? int values are stored as two's complement numbers with 32 bits, for 64 bits use the type long, a char is 16 bits Java byte, int, long are signed values, char unsigned Java signed byte: , # bits? What if we only want 0-255? (Huff, pixels, …) Convert negative values or use char, trade-offs? Java char unsigned: ,536 # bits? Why is char unsigned?

11 More details about bits
How is 13 represented? … _0_ _0_ _1_ _1_ _0_ _1_ Total is = 13 What is bit representation of 32? Of 15? Of 1023? What is bit-representation of 2n - 1? What is bit-representation of 0? Of -1? Study later, but -1 is all 1’s, left-most bit determines < 0 Determining what bits are on? How many on? Understanding, problem-solving

12 Text Compression: Examples
“abcde” in the different formats ASCII: … Fixed: Variable: Encodings ASCII: 8 bits/character Unicode: 16 or 32 bits/character 1 a b c d e Symbol ASCII Fixed length Var. a 000 b 001 11 c 010 01 d 011 e 100 10 a d b c e 1

13 Huffman Coding D.A Huffman in early 1950’s
Before compressing data, analyze the input stream Represent data using variable length codes Variable length codes though Prefix codes Each letter is assigned an encoding Encoding is for a given letter is produced by traversing the Huffman tree Property: No encoding produced is the prefix of another Letters appearing frequently have short encoding, while those that appear rarely have longer ones Huffman coding is optimal per-character coding method

14 Building a Huffman tree
Begin with a forest of single-node trees (leaves) Each node/tree/leaf is weighted with character count Node stores two values: character and count There are n nodes in forest, n is size of alphabet? Repeat until there is only one node left: root of tree Remove two minimally weighted trees from forest Greedy! Create new tree with minimal trees as children, New tree root's weight: sum of children (character ignored) Does this process terminate? How do we get minimal trees? Remove minimal trees, need to order based on what?

15 Creating a Huffman Tree/Trie?
Insert weighted values into priority queue What are initial weights? Why? Remove minimal nodes, weight by sums, re-insert Total number of nodes? PriorityQueue<HuffNode> forest = new PriorityQueue<>();  for (int i = 0; i < 256; i++) if (frequencies[i] > 0) // computed elsewhere   forest.add(new HuffNode(i, frequencies[i]));  while (forest.size() > 1) {   HuffNode left = forest.remove();   HuffNode right = forest.remove();   forest.add(new HuffNode(-1, left.weight()+right.weight(), left, right));  }  HuffNode root = forest.remove();

16 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 11 6 I 5 E 5 N 4 S 4 M 3 A 3 B 3 O 3 T 2 G 2 D 2 L 2 R 2 U 1 P 1 F 1 C

17 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 2 1 F 1 C 11 6 I 5 E 5 N 4 S 4 M 3 A 3 B 3 O 3 T 2 2 G 2 D 2 L 2 R 2 U 1 P

18 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 2 3 1 C 1 F 1 P 2 U 11 6 I 5 E 5 N 4 S 4 M 3 3 A 3 B 3 O 3 T 2 2 G 2 D 2 L 2 R

19 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 4 2 3 2 L 2 R 1 C 1 F 1 P 2 U 11 6 I 5 E 5 N 4 4 S 4 M 3 3 A 3 B 3 O 3 T 2 2 G 2 D

20 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 4 4 2 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U 11 6 I 5 E 5 N 4 4 4 S 4 M 3 3 A 3 B 3 O 3 T 2

21 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 5 4 4 3 O 2 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U 11 6 I 5 5 E 5 N 4 4 4 S 4 M 3 3 A 3 B 3 T

22 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 5 6 4 4 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U 11 6 6 I 5 5 E 5 N 4 4 4 S 4 M 3 A 3 B

23 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 6 5 6 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U 11 6 6 6 I 5 5 E 5 N 4 4 4 S 4 M

24 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 6 8 5 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U 11 8 6 6 6 I 5 5 E 5 N 4 4

25 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 6 8 8 5 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U 11 8 8 6 6 6 I 5 5 E 5 N

26 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 10 6 8 8 5 E 5 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U 11 10 8 8 6 6 6 I 5 5 N

27 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 10 11 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U 11 11 10 8 8 6 6 I

28 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 12 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U 12 11 11 10 8 8

29 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 16 12 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U 16 12 11 11 10

30 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 21 16 12 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U 21 16 12 11

31 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 23 21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U 23 21 16

32 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 37 23 21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U 37 23

33 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 11 6 I 5 N E 1 F C P 2 U R L D G 3 O T B A 4 M S 8 16 10 21 12 23 37 60 60

34 Creating a Huffman Tree/Trie?
Insert weighted values into priority queue What are initial weights? Why? Remove minimal nodes, weight by sums, re-insert Total number of nodes? PriorityQueue<HuffNode> forest = new PriorityQueue<>();  for (int i = 0; i < 256; i++) if (frequencies[i] > 0) // computed elsewhere   forest.add(new HuffNode(i, frequencies[i]));  while (forest.size() > 1) {   HuffNode left = forest.remove();   HuffNode right = forest.remove();   forest.add(new HuffNode(-1, left.weight()+right.weight(), left, right));  }  HuffNode root = forest.remove();

35 Compsci 201, Fall 2017, Tries+Greedy
What’s a HuffNode Similar to: Encodes String via links like a Trie Two children like a TreeNode public class HuffNode implements Comparable<HuffNode> { // Data private int myValue, myWeight; private HuffNode myLeft, myRight; // Constructors public HuffNode(int value, int weight) public HuffNode(int value, int weight, HuffNode left, HuffNode right) } 11/17/17 Compsci 201, Fall 2017, Tries+Greedy

36 Encoding http://bit.ly/201-f17-1129-1 Given a file with: n characters
an alphabet of k distinct characters, r is the compression rate (bits/character) Count occurrence of all occurring characters O( n ) Build priority queue O( k lg k ) Build Huffman tree O( k lg k ) Create Table of encodings from tree O( k ) Write Huffman tree and coded data to file O( n )

37 Properties of Huffman coding
Want to minimize weighted path length L(T)of tree T wi is the weight or count of each encoding i di is the leaf corresponding to encoding i How do we calculate character (encoding) frequencies? Huffman coding creates pretty full bushy trees? When would it produce a “bad” tree? How do we produce coded compressed data from input efficiently?

38 Writing code out to file
How do we go from characters to encodings? Build Huffman tree Root-to-leaf path generates encoding Recall LeafTrails APT Need way of writing bits out to file Platform dependent? Complicated to write bits and read in same ordering See BitInputStream and BitOutputStream classes How do we know bits come from compressed file? Store a magic number

39 Creating compressed file
Once we have new encodings, read every character Write encoding, not the character, to compressed file Why does this save bits? What other information needed in compressed file? How do we uncompress? How do we know foo.hf represents compressed file? Is suffix sufficient? Alternatives? Why is Huffman coding a two-pass method? Alternatives?

40 Uncompression with Huffman
We need the trie to uncompress As we read a bit, what do we do? Go left on 0, go right on 1 When do we stop? What to do? How do we get the trie? How did we get it originally? Store 256 int/counts How do we read counts? How do we store a trie? Traversal order? Reading a trie? Leaf indicator? Node values?

41 Decoding a message 00000100001001101 I E N S M A B O T G D L R C F P U
60 37 23 21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U

42 Decoding a message G 0000100001001101 I E N S M A B O T G D L R C F P
60 37 23 21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U G

43 Decoding a message G 000100001001101 I E N S M A B O T G D L R C F P U
60 37 23 21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U G

44 Decoding a message G 00100001001101 I E N S M A B O T G D L R C F P U
60 37 23 21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U G

45 Decoding a message G 0100001001101 I E N S M A B O T G D L R C F P U
60 37 23 21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U G

46 Decoding a message G 100001001101 I E N S M A B O T G D L R C F P U 60
37 23 21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U G

47 Decoding a message GO 00001001101 I E N S M A B O T G D L R C F P U 60
37 23 21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U GO

48 Decoding a message GO 0001001101 I E N S M A B O T G D L R C F P U 60
37 23 21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U GO

49 Decoding a message GO 001001101 I E N S M A B O T G D L R C F P U 60
37 23 21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U GO

50 Decoding a message GO 01001101 I E N S M A B O T G D L R C F P U 60 37
23 21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U GO

51 Decoding a message GO 1001101 I E N S M A B O T G D L R C F P U 60 37
23 21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U GO

52 Decoding a message GOO 001101 I E N S M A B O T G D L R C F P U 60 37
23 21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U GOO

53 Decoding a message GOO 01101 I E N S M A B O T G D L R C F P U 60 37
23 21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U GOO

54 Decoding a message GOO 1101 I E N S M A B O T G D L R C F P U 60 37 23
21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U GOO

55 Decoding a message GOO 101 I E N S M A B O T G D L R C F P U 60 37 23
21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U GOO

56 Decoding a message GOO 01 I E N S M A B O T G D L R C F P U 60 37 23
21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U GOO

57 Decoding a message GOOD 1 I E N S M A B O T G D L R C F P U 60 37 23
21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U GOOD

58 Decoding a message GOOD 01100000100001001101
60 37 23 21 16 12 11 10 11 6 I 6 8 8 5 E 5 5 N 6 4 S 4 M 4 4 3 A 3 B 3 O 2 3 T 3 2 G 2 D 2 L 2 R 1 C 1 F 1 P 2 U GOOD

59 Decoding Read in tree data O( k ) Decode bit string with tree O( nr )
Given a file with: n characters an alphabet of k distinct characters, r is the compression rate (bits/character) Read in tree data O( k ) Decode bit string with tree O( nr )

60 Data Compression Why is data compression important?
Year Scheme Bit/Char 1967 ASCII 7.00 1950 Huffman 4.70 1977 Lempel-Ziv (LZ77) 3.94 1984 Lempel-Ziv-Welch (LZW) – Unix compress 3.32 1987 (LZH) used by zip and unzip 3.30 Move-to-front 3.24 gzip 2.71 1995 Burrows-Wheeler 2.29 1997 BOA (statistical data compression) 1.99 Why is data compression important? How well can you compress files losslessly? Is there a limit? How to compare? How do you measure how much information?


Download ppt "CompSci 201 Data Representation & Huffman Coding"

Similar presentations


Ads by Google