CPS 100, Fall 2011 12.1 PF-AT-EOS l Data compression  Seriously important, why? l Priority Queue  Seriously important, why? l Assignment overview  Huff.

Slides:



Advertisements
Similar presentations
Introduction to Computer Science 2 Lecture 7: Extended binary trees
Advertisements

Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Binary Trees CSC 220. Your Observations (so far data structures) Array –Unordered Add, delete, search –Ordered Linked List –??
File Processing - Organizing file for Performance MVNC1 Organizing Files for Performance Chapter 6 Jim Skon.
Data Compressor---Huffman Encoding and Decoding. Huffman Encoding Compression Typically, in files and messages, Each character requires 1 byte or 8 bits.
Trees Chapter 8.
Compression & Huffman Codes
CPSC 231 Organizing Files for Performance (D.H.) 1 LEARNING OBJECTIVES Data compression. Reclaiming space in files. Compaction. Searching. Sorting, Keysorting.
Fall 2007CS 2251 Trees Chapter 8. Fall 2007CS 2252 Chapter Objectives To learn how to use a tree to represent a hierarchical organization of information.
Trees Chapter 8. Chapter 8: Trees2 Chapter Objectives To learn how to use a tree to represent a hierarchical organization of information To learn how.
Version TCSS 342, Winter 2006 Lecture Notes Priority Queues Heaps.
Binary Trees A binary tree is made up of a finite set of nodes that is either empty or consists of a node called the root together with two binary trees,
A Data Compression Algorithm: Huffman Compression
Compression & Huffman Codes Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
CS 206 Introduction to Computer Science II 04 / 29 / 2009 Instructor: Michael Eckmann.
CS 206 Introduction to Computer Science II 12 / 10 / 2008 Instructor: Michael Eckmann.
Fundamentals of Python: From First Programs Through Data Structures
CSE 373 Data Structures and Algorithms Lecture 13: Priority Queues (Heaps)
1 CSC 427: Data Structures and Algorithm Analysis Fall 2010 transform & conquer  transform-and-conquer approach  balanced search trees o AVL, 2-3 trees,
Data Structures and Algorithms Huffman compression: An Application of Binary Trees and Priority Queues.
1 HEAPS & PRIORITY QUEUES Array and Tree implementations.
Data Structures Arrays both single and multiple dimensions Stacks Queues Trees Linked Lists.
CompSci 100 Prog Design and Analysis II
Huffman Encoding Veronica Morales.
1 Analysis of Algorithms Chapter - 08 Data Compression.
CPS 100, Fall YAQ, YAQ, haha! (Yet Another Queue) l What is the dequeue policy for a Queue?  Why do we implement Queue with LinkedList Interface.
Trees Chapter 8. Chapter 8: Trees2 Chapter Objectives To learn how to use a tree to represent a hierarchical organization of information To learn how.
Spring 2010CS 2251 Trees Chapter 6. Spring 2010CS 2252 Chapter Objectives Learn to use a tree to represent a hierarchical organization of information.
CPS 100e, Spring Day, Week, Month Overview l Backtracking p/review  Boggle  APT l Assignment review  Boggle  APT l PQ Overview.
Priority Queues and Binary Heaps Chapter Trees Some animals are more equal than others A queue is a FIFO data structure the first element.
Chapter 11 Heap. Overview ● The heap is a special type of binary tree. ● It may be used either as a priority queue or as a tool for sorting.
Heapsort. Heapsort is a comparison-based sorting algorithm, and is part of the selection sort family. Although somewhat slower in practice on most machines.
Chapter 21 Priority Queue: Binary Heap Saurav Karmakar.
Lossless Compression CIS 465 Multimedia. Compression Compression: the process of coding that will effectively reduce the total number of bits needed to.
Prof. Amr Goneid, AUC1 Analysis & Design of Algorithms (CSCE 321) Prof. Amr Goneid Department of Computer Science, AUC Part 8. Greedy Algorithms.
CompSci 100E 24.1 Data Compression  Compression is a high-profile application .zip,.mp3,.jpg,.gif,.gz, …  What property of MP3 was a significant factor.
Priority Queues, Trees, and Huffman Encoding CS 244 This presentation requires Audio Enabled Brent M. Dingle, Ph.D. Game Design and Development Program.
Compsci 100, Fall Review: Why Compression l What gets compressed?  Save on storage, why is this a good idea?  Save on data transmission, how.
1 Joe Meehean.  We wanted a data structure that gave us... the smallest item then the next smallest then the next and so on…  This ADT is called a priority.
CPS Heaps, Priority Queues, Compression l Compression is a high-profile application .zip,.mp3,.jpg,.gif,.gz, …  What property of MP3 was a significant.
CompSci 100e Program Design and Analysis II April 7, 2011 Prof. Rodger CompSci 100e, Spring p 1 h 1 2 e 1 r 1 4 s 1 * 2 7 g 3 o
CPS 100, Spring Huffman Coding l D.A Huffman in early 1950’s l Before compressing data, analyze the input stream l Represent data using variable.
Heapsort. What is a “heap”? Definitions of heap: 1.A large area of memory from which the programmer can allocate blocks as needed, and deallocate them.
AVL Trees and Heaps. AVL Trees So far balancing the tree was done globally Basically every node was involved in the balance operation Tree balancing can.
CompSci 100e 8.1 Scoreboard l What else might we want to do with a data structure? AlgorithmInsertionDeletionSearch Unsorted Vector/array Sorted vector/array.
CompSci 100e 8.1 Plan for the Course! l Understand Huffman Coding  Data compression  Priority Queues  Bits and Bytes  Greedy Algorithms l Algorithms.
1. What is it? It is a queue that access elements according to their importance value. Eg. A person with broken back should be treated before a person.
CPS Heaps, Priority Queues, Compression l Compression is a high-profile application .zip,.mp3,.jpg,.gif,.gz, …  Why is compression important?
Priority Queues CS 110: Data Structures and Algorithms First Semester,
1 CSC 421: Algorithm Design Analysis Spring 2013 Transform & conquer  transform-and-conquer approach  presorting  balanced search trees, heaps  Horner's.
CompSci 100e 10.1 Binary Digits (Bits) l Yes or No l On or Off l One or Zero l
CompSci From bits to bytes to ints  At some level everything is stored as either a zero or a one  A bit is a binary digit a byte is a binary.
CPS 100, Spring Burrows Wheeler Transform l Michael Burrows and David Wheeler in 1994, BWT l By itself it is NOT a compression scheme  It’s.
Lecture on Data Structures(Trees). Prepared by, Jesmin Akhter, Lecturer, IIT,JU 2 Properties of Heaps ◈ Heaps are binary trees that are ordered.
From bits to bytes to ints
HUFFMAN CODES.
B/B+ Trees 4.7.
Compression & Huffman Codes
Top 50 Data Structures Interview Questions
Data Compression Compression is a high-profile application
October 30th – Priority QUeues
Hashing Exercises.
Compsci 201 Priority Queues & Autocomplete
CompSci 201 Data Representation & Huffman Coding
Heaps, Priority Queues, Compression
Map interface Empty() - return true if the map is empty; else return false Size() - return the number of elements in the map Find(key) - if there is an.
Chapter 8 – Binary Search Tree
Priority Queues & Heaps
Compsci 201 Priority Queues + Heaps Autocomplete
Scoreboard What else might we want to do with a data structure?
Presentation transcript:

CPS 100, Fall PF-AT-EOS l Data compression  Seriously important, why? l Priority Queue  Seriously important, why? l Assignment overview  Huff _______ l Bits, Bytes, Atoms  What's an int? ASCII? double?

CPS 100, Fall PQ Application: Data Compression l Compression is a high-profile application .zip,.mp3,.jpg,.gif,.gz, …  What property of MP3 was a significant factor in what made Napster work (why did Napster ultimately fail?)  Who invented Napster, how old, when? l Why do we care?  Secondary storage capacity doubles every year  Disk space fills up quickly on every computer system  More data to compress than ever before  Will we ever need to stop worrying about storage?

CPS 100, Fall More on Compression l Different compression techniques .mp3 files and.zip files? .gif and.jpg?  Lossless and lossy l Impossible to compress/lossless everything: Why? l Lossy methods  Good for pictures, video, and audio (JPEG, MPEG, etc.) l Lossless methods  Run-length encoding, Huffman, LZW, … 

CPS 100, Fall Huffman Coding l Understand Huffman Coding  Data compression  Priority Queue  Bits and Bytes  Greedy Algorithm l Many compression algorithms  Huffman is optimal, per-character compression  Still used, e.g., basis of Burrows-Wheeler  Other compression 'better', sometimes slower?  LZW, GZIP, BW, …

CPS 100, Fall Compression and Coding l What gets compressed?  Save on storage, why is this a good idea?  Save on data transmission, how and why? l What is information, how is it compressible?  Exploit redundancy, without that, hard to compress l Represent information: code (Morse cf. Huffman)  Dots and dashes or 0's and 1's  How to construct code?

CPS 100, Fall Coding/Compression/Concepts l For ASCII we use 8 bits, for Unicode 16 bits  Minimum number of bits to represent N values?  Representation of genomic data (a, c,g, t)?  What about noisy genomic data? l We can use a variable-length encoding, e.g., Huffman  How do we decide on lengths? How do we decode?  Values for Morse code encodings, why?Morse code encodings  … …

CPS 100, Fall Huffman Coding l D.A Huffman in early 1950’s: story of invention  Analyze and process data before compression  Not developed to compress data “on-the-fly” l Represent data using variable length codes  Each letter/chunk assigned a codeword/bitstring  Codeword for letter/chunk is produced by traversing the Huffman tree  Property: No codeword produced is the prefix of another  Frequent letters/chunk have short encoding, while those that appear rarely have longer ones l Huffman coding is optimal per-character coding method

CPS 100, Fall Mary Shaw l Software engineering and software architecture  Tools for constructing large software systems  Development is a small piece of total cost, maintenance is larger, depends on well-designed and developed techniques l Interested in computer science, programming, curricula, and canoeing, health-care costs l ACM Fellow, Alan Perlis Professor of Compsci at CMU

CPS 100, Fall Huffman coding: go go gophers l choose two smallest weights  combine nodes + weights  Repeat  Priority queue? l Encoding uses tree:  0 left/1 right  How many bits? ASCII 3 bits g ?? o ?? p h e r s sp goers* 33 h p 1 h 1 2 e 1 r 1 3 s 1 * 2 2 p 1 h 1 2 e 1 r 1 4 g 3 o p

CPS 100, Fall Huffman coding: go go gophers l Encoding uses tree/trie:  0 left/1 right  How many bits? 37!!  Savings? Worth it? ASCII 3 bits g o p h e r s sp s 1 * 2 2 p 1 h 1 2 e 1 r 1 4 g 3 o p 1 h 1 2 e 1 r 1 4 s 1 * 2 7 g 3 o

CPS 100, Fall Building a Huffman tree l Begin with a forest of single-node trees/tries (leaves)  Each node/tree/leaf is weighted with character count  Node stores two values: character and count l Repeat until there is only one node left: root of tree  Remove two minimally weighted trees from forest  Create new tree/internal node with minimal trees as children, Weight is sum of children’s weight (no char) l How does process terminate? Finding minimum?  Remove minimal trees, hummm… …

CPS 100, Fall How do we create Huffman Tree/Trie? l Insert weighted values into priority queue  What are initial weights? Why?  l Remove minimal nodes, weight by sums, re-insert  Total number of nodes? PriorityQueue pq = new PriorityQueue (); for(int k=0; k < freq.length; k++){ pq.add(new TreeNode(k,freq[k],null,null)); } while (pq.size() > 1){ TreeNode left = pq.remove(); TreeNode right = pq.remove(); pq.add(new TreeNode(0,left.weight+right.weight, left,right)); } TreeNode root = pq.remove();

CPS 100, Fall Creating compressed file l Once we have new encodings, read every character  Write encoding, not the character, to compressed file  Why does this save bits?  What other information needed in compressed file? l How do we uncompress?  How do we know foo.hf represents compressed file?  Is suffix sufficient? Alternatives? l Why is Huffman coding a two-pass method?  Alternatives?

CPS 100, Fall Uncompression with Huffman l We need the trie to uncompress  l As we read a bit, what do we do?  Go left on 0, go right on 1  When do we stop? What to do? l How do we get the trie?  How did we get it originally? Store 256 int/counts How do we read counts?  How do we store a trie? 20 Questions relevance? Reading a trie? Leaf indicator? Node values?

CPS 100, Fall Other Huffman Issues l What do we need to decode?  How did we encode? How will we decode?  What information needed for decoding? l Reading and writing bits: chunks and stopping  Can you write 3 bits? Why not? Why?  PSEUDO_EOF  BitInputStream and BitOutputStream: API l What should happen when the file won’t compress?  Silently compress bigger? Warn user? Alternatives?

CPS 100, Fall Huffman Complexities l How do we measure? Size of input file, size of alphabet  Which is typically bigger? l Accumulating character counts: ______  How can we do this in O(1) time, though not really l Building the heap/priority queue from counts ____  Initializing heap guaranteed l Building Huffman tree ____  Why? l Create table of encodings from tree ____  Why? l Write tree and compressed file _____

CPS 100, Fall Good Compsci 100 Assignment? l Array of character/chunk counts, or is this a map?  Map character/chunk to count, why array? l Priority Queue for generating tree/trie  Do we need a heap implementation? Why? l Tree traversals for code generation, uncompression  One recursive, one not, why and which? l Deal with bits and chunks rather than ints and chars  The good, the bad, the ugly l Create a working compression program  How would we deploy it? Make it better? l Benchmark for analysis  What’s a corpus ?

CPS 100, Fall Anita Borg “Dr. Anita Borg tenaciously envisioned and set about to change the world for women and for technology. … she fought tirelessly for the development technology with positive social and human impact.” l “Anita Borg sought to revolutionize the world and the way we think about technology and its impact on our lives.” l ch?v=1yPxd5jqz_Q ch?v=1yPxd5jqz_Q

CPS 100, Fall YAQ, YAQ, haha! (Yet Another Queue) l What is the dequeue policy for a Queue?  Why do we implement Queue with LinkedList Interface and class in java.util  Can we remove an element other than first? l How does queue help word-ladder/shortest path?  First item enqueued/added is the one we want  What if different element is “best”? l PriorityQueue has a different dequeue policy  Best item is dequeued, queue manages itself to ensure operations are efficient

CPS 100, Fall PriorityQueue raison d’être l Algorithms Using PQ for efficiency  Shortest Path: Google Maps/Garmin to Internet Routing How is this like word-ladder? How different?  Event based simulation Coping with explosion in number of particles or things  Optimal A* search, game-playing, AI, Can't explore entire search space, can estimate good move l Data compression facilitated by priority queue  Alltime best assignment in a Compsci 100 course? Subject to debate, of course  From A-Z, soup-to-nuts, bits to abstractions

CPS 100, Fall Priority Queue l Compression motivates ADT priority queue  Supports two basic operations add/insert -– an element into the priority queue remove/delete – the minimal element from the priority queue  Implementations allow getmin/peek as well as delete Analogous to top/pop, peek/dequeue in stacks, queues l Think about implementing the ADT, choices?  Add compared to min/remove  Balanced search tree is ok, but can we do better?

CPS 100, Fall Priority Queue sorting l See PQDemo.java,  code below sorts, complexity? String[] array = {...}; // array filled with data PriorityQueue pq = new PriorityQueue (); for(String s : array) pq.add(s); for(int k=0; k < array.length; k++){ array[k] = pq.remove(); } l Bottlenecks, operations in code above  Add words one-at-a-time to PQ v. all-at-once  What if PQ is an array, add or remove fast/slow?  We’d like PQ to have tree characteristics, why?

CPS 100, Fall Priority Queue top-M sorting l What if we have lots and lots and lots of data  code below sorts top-M elements, complexity? Scanner s = … // initialize; PriorityQueue pq = new PriorityQueue (); while (s.hasNext()) { pq.add(s.next()); if (pq.size() > M) pq.remove(); } l What’s advantageous about this code?  Store everything and sort everything?  Store everything, sort first M?  What is complexity of sort : O(n log n)

CPS 100, Fall Priority Queue implementations l Priority queues: average and worst case Insert average Getmin (delete) Insert worst Getmin (delete) Unsorted list Sorted list Search tree Balanced tree Heap log n O(n) O(1)log n O(n)O(1)O(n)O(1) O(n)O(1)O(n)O(1) l Heap has O(n) build heap from n elements

CPS 100, Fall PriorityQueue.java [java.util] l What about objects inserted into pq?  Comparable, e.g., essentially sortable  How can we change what minimal means?  Implementation uses heap, tree stored in an array l Use a Comparator for comparing entries we can make a min-heap act like a max-heap, see PQDemo  Where is class Comparator declaration? How used?  What if we didn't know about Collections.reverseOrder? How do we make this ourselves?

CPS 100, Fall Big-Oh and a tighter look at inserts l log(1) + log(2) + log(3) + … + log(n)  Property of logs, log(a) + log(b) = log(a*b)  log(1*2*3*…*n) = log(n!) l We can show using Sterling’s formula: l log(n!) = c 1 *log(n) + nlog(n) – c 2 *n l We can get O(n log n) easily, this is a tight bound, lower,  n log n) as well

CPS 100, Fall Priority Queue implementation l Heap data structure is fast and reasonably simple  Why not use inheritance hierarchy as was used with Map?  Trade-offs when using HashMap and TreeMap: Time, space, ordering properties, TreeMap support? l Changing comparison when calculating priority?  Create object to replace, or in lieu of compareTo Comparable interface compares this to passed object Comparator interface compares two passed objects  Both comparison methods: compareTo() and compare() Compare two objects (parameters or self and parameter) Returns –1, 0, +1 depending on

CPS 100, Fall Creating Heaps l Heap: array-based implementation of binary tree used for implementing priority queues:  add/insert, peek/getmin, remove/deletemin, O(???) l Array minimizes storage (no explicit pointers), faster too, contiguous (cache) and indexing l Heap has shape property and heap/value property  shape: tree filled at all levels (except perhaps last) and filled left-to-right (complete binary tree)  each node has value smaller than both children

CPS 100, Fall Array-based heap l store “node values” in array beginning at index 1 l for node with index k  left child: index 2*k  right child: index 2*k+1 l why is this conducive for maintaining heap shape? l what about heap property? l is the heap a search tree? l where is minimal node? l where are nodes added? deleted?

CPS 100, Fall Thinking about heaps Where is minimal element?  Root, why? l Where is maximal element?  Leaves, why? l How many leaves are there in an N-node heap (big-Oh)?  O(n), but exact? l What is complexity of find max in a minheap? Why?  O(n), but ½ N? l Where is second smallest element? Why?  Near root?

CPS 100, Fall Adding values to heap l to maintain heap shape, must add new value in left-to-right order of last level  could violate heap property  move value “up” if too small l change places with parent if heap property violated  stop when parent is smaller  stop when root is reached l pull parent down, swapping isn’t necessary (optimization) insert 8 bubble 8 up

CPS 100, Fall Adding values, details (pseudocode) void add(Object elt) { // add elt to heap in myList myList.add(elt); int loc = myList.size()-1; while (1 < loc && elt < myList.get(loc/2)){ myList.set(loc,myList.get(loc/2)); loc = loc/2; // go to parent } // what’s true here? myList.set(loc,elt); } array myList

CPS 100, Fall Removing minimal element l Where is minimal element?  If we remove it, what changes, shape/property? l How can we maintain shape?  “last” element moves to root  What property is violated? l After moving last element, subtrees of root are heaps, why?  Move root down (pull child up) does it matter where? l When can we stop “re-heaping”?  Less than both children  Reach a leaf

CPS 100, Fall Views of programming l Writing code from the method/function view is pretty similar across languages  Organizing methods is different, organizing code is different, not all languages have classes,  Loops, arrays, arithmetic, … l Program using abstractions and high level concepts  Do we need to understand 32-bit twos-complement storage to understand x =x+1?  Do we need to understand how arrays map to contiguous memory to use ArrayLists?

CPS 100, Fall Bottom up meets Top down l We teach programming by teaching abstractions  What's an array? What’s an ArrayList?  What's an array in C?  What's a map, Hashmap? Treemap? l How are programs stored and executed in the JVM?  Differences on different architectures?  Similarities between Java and C++ and Python and PHP? l What about how the CPU works? How memory works? Shouldn't we study this too?

CPS 100, Fall From bit to byte to char to int to long l Ultimately everything is stored as either a 0 or 1  Bit is binary digit a byte is a binary term (8 bits)  We should be grateful we can deal with Strings rather than sequences of 0's and 1's.  We should be grateful we can deal with an int rather than the 32 bits that comprise an int l If we have 255 values for R, G, B, how can we pack this into an int?  Why should we care, can’t we use one int per color?  How do we do the packing and unpacking?

CPS 100, Fall More information on bit, int, long l int values are stored as two's complement numbers with 32 bits, for 64 bits use the type long, a char is 16 bits  Standard in Java, different in C/C++  Facilitates addition/subtraction for int values  We don't need to worry about this, except to note: Infinity + 1 = - Infinity (see Integer.MAX_VALUE ) Math.abs(-Infinity) > Infinity l Java byte, int, long are signed values, char unsigned  What are values for 16-bit char? 8-bit byte?  Why will this matter in Burrows Wheeler?

CPS 100, Fall More details about bits l How is 13 represented?  … _0_ _0_ _1_ _1_ _0_ _1_  Total is = 13 l What is bit representation of 32? Of 15? Of 1023?  What is bit-representation of 2 n - 1 ?  What is bit-representation of 0? Of -1? Study later, but -1 is all 1’s, left-most bit determines < 0 Java signed byte : , # bits?  What if we only want 0-255? (Huff, pixels, …)  Convert negative values or use char, trade-offs? Java char unsigned: 0..65,536 # bits?  Why is char unsigned? Why not as in C++/C?

CPS 100, Fall How are data stored? l To facilitate Huffman coding we need to read/write one bit  Why do we need to read one bit?  Why do we need to write one bit?  When do we read 8 bits at a time? 32 bits? l We can't actually write one bit-at-a-time. We can't really write one char at a time either.  Output and input are buffered,minimize memory accesses and disk accesses  Why do we care about this when we talk about data structures and algorithms? Where does data come from?

CPS 100, Fall How do we buffer char output? l Done for us as part of InputStream and Reader classes  InputStreams are for reading bytes  Readers are for reading char values  Why do we have both and how do they interact? Reader r = new InputStreamReader(System.in);  Do we need to flush our buffers? l In the past Java IO has been notoriously slow  Do we care about I? About O?  This is changing, and the java.nio classes help Map a file to a region in memory in one operation