Compsci 100, Spring 2010


What’s left to talk about?
• Transforms
    - Making Huffman compress more
    - Understanding what transforms do
        Conceptual understanding, details left to …
        All information here, we won’t discuss details
• Expressing concepts in different languages
    - How hard is it to learn C++, Python, …
    - Are there “different” languages? Ruby, Scheme, …

What is a transform?
• Multiply two near-zero numbers, what happens?
    - Add their logarithms: log(a)+log(b) = log(ab), invertible
    - What is log of ? Benefits of transform?
• What is FFT: Fast Fourier Transform?
    - O(n log n) method for computing a Fourier Transform
    - Better than O(n^2), huge difference for lots of data points
    - Shazam? How Shazam might work
• Feature extraction from images: faces, edges, lines, …
    - Hough transform
• Wavelet transforms do something too, but …
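
The logarithm transform above can be tried directly. This is a minimal sketch with made-up values (1e-200 is an illustrative choice, not a number from the lecture): the direct product underflows to zero in double precision, while the sum of logarithms stays representable. For moderate values, exp undoes the transform.

```python
import math

# Hypothetical near-zero values chosen only for illustration.
a = 1e-200
b = 1e-200

product = a * b                      # true value 1e-400 is too small for a double
log_sum = math.log(a) + math.log(b)  # equals log(ab) without ever forming ab

# For values whose product is representable, exp inverts the transform.
c, d = 0.5, 0.25
recovered = math.exp(math.log(c) + math.log(d))

print(product, log_sum, recovered)
```

The benefit is that many tiny factors can be combined by addition without losing everything to underflow.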

Burrows Wheeler Transform
• Michael Burrows and David Wheeler in 1994, BWT
• By itself it is NOT a compression scheme
    - It’s used to preprocess data, or transform data, to make it more amenable to compression like Huffman Coding
    - Huff depends on redundancy/repetition, as do many compression schemes
• Main idea in BWT: transform the data into something more compressible and make the transform fast, though it will be slower than no transform
    - TANSTAAFL (what does this mean?)

David Wheeler ( )
• Invented subroutine
• “Wheeler was an inspiring teacher who helped to develop computer science teaching at Cambridge from its inception in 1953, when the Diploma in Computer Science was launched as the world's first taught course in computing.”

Mike Burrows
He's one of the pioneers of the information age. His invention of Alta Vista helped open up an entire new route for the information highway that is still far from fully explored. His work history, intertwined with the development of the high-tech industry over the past two decades, is distinctly a tale of scientific genius.

BWT efficiency
• BWT is a block transform: it requires storing n copies of the file, with time O(n log n) to sort the copies (file has length n)
    - We can’t really do this in practice in terms of storage
    - Instead of storing n copies of the file, store one copy and an integer index (break file into blocks of size n)
• But sorting is still O(n log n) comparisons and it’s actually worse
    - Each comparison in the sort looks at the entire file
    - In normal sort analysis the comparison is O(1), strings are small
    - Now we have key comparison of O(n), so sort is actually…
    - O(n^2 log n), why?

BWT at 10,000 ft: big picture
• Remember, goal is to exploit/create repetition (redundancy)
    - Create repetition as follows
    - Consider original text: duke blue devils.
    - Create n copies by shifting/rotating by one character

     0: duke blue devils.
     1: uke blue devils.d
     2: ke blue devils.du
     3: e blue devils.duk
     4:  blue devils.duke
     5: blue devils.duke 
     6: lue devils.duke b
     7: ue devils.duke bl
     8: e devils.duke blu
     9:  devils.duke blue
    10: devils.duke blue 
    11: evils.duke blue d
    12: vils.duke blue de
    13: ils.duke blue dev
    14: ls.duke blue devi
    15: s.duke blue devil
    16: .duke blue devils

BWT at 10,000 ft: big picture
• Once we have n copies (but not really n copies!)
    - Sort the copies
    - Remember the comparison will be O(n)
    - We’ll look at the last column, see next slide
        What’s true about first column?

     4:  blue devils.duke
     9:  devils.duke blue
    16: .duke blue devils
     5: blue devils.duke 
    10: devils.duke blue 
     0: duke blue devils.
     3: e blue devils.duk
     8: e devils.duke blu
    11: evils.duke blue d
    13: ils.duke blue dev
     2: ke blue devils.du
    14: ls.duke blue devi
     6: lue devils.duke b
    15: s.duke blue devil
     7: ue devils.duke bl
     1: uke blue devils.d
    12: vils.duke blue de

Last column:  |ees  .kudvuibllde|
First column: |  .bddeeeikllsuuv|

     4:  blue devils.duke
     9:  devils.duke blue
    16: .duke blue devils
     5: blue devils.duke 
    10: devils.duke blue 
     0: duke blue devils.
     3: e blue devils.duk
     8: e devils.duke blu
    11: evils.duke blue d
    13: ils.duke blue dev
     2: ke blue devils.du
    14: ls.duke blue devi
     6: lue devils.duke b
    15: s.duke blue devil
     7: ue devils.duke bl
     1: uke blue devils.d
    12: vils.duke blue de

• Properties of first column
    - Lexicographical order
    - Maximally ‘clumped’, why?
    - From it, can we create last?
• Properties of last column
    - Some clumps (real files)
    - Can we create first? Why?
• See row labeled 8:
    - Last char precedes first in original! True for all rows!
• Can recreate everything:
    - Simple (code) but hard (idea)
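
The rotate, sort, and read-the-last-column steps can be written out naively. This is only a sketch: the function name `bwt_naive` is made up for illustration, and it really does build all n copies, which the lecture says is impractical for large files.

```python
def bwt_naive(text):
    """Return (last column, row of the original text) for one block.

    Builds every rotation explicitly, so it uses O(n^2) space; fine for
    a one-line example, not for real files.
    """
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    last = "".join(row[-1] for row in rotations)
    return last, rotations.index(text)

last, row = bwt_naive("duke blue devils.")
first = "".join(sorted(last))   # sorting the last column yields the first
print(repr(last), row, repr(first))
```

On the example text this gives the last column `ees  .kudvuibllde` (with two spaces after `s`), and the original string sits in row 5 of the sorted list, counting from 0.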

What do we know about last column?
• Contains every character of original file
    - Why is there repetition in the last column?
    - Is there repetition in the first column?
• Keep the last column because we can recreate the first
    - What’s in every column of the sorted list?
    - If we have the last column we can create the first
        Sorting the last column yields first
    - We can create every column, which means if we know what row the original text is in we’re done!
        Look back at sorted rows, what row has index 0?

BWT from a 5,000 ft view
• How do we avoid storing n copies of the input file?
    - Store once with index of what the first character is
    - 0 and “duke blue devils.” is the original string
    - 3 and “duke blue devils.” is “e blue devils.duk”
    - What is 7 and “duke blue devils.”?
• You’ll be given a class Rotatable that can be sorted
    - Construct object from original text and index
    - When compared, use the index as a place to start
    - Rotatable can report the last char of any “row”
    - Rotatable can report its index (stored on construction)
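
The store-once-plus-index idea can be sketched by sorting the start indices themselves and comparing rotations character by character. This is not the course's Rotatable class; `rotation_order` and `last_column` are hypothetical names, and the comparison is deliberately the O(n) one described on the efficiency slide.

```python
from functools import cmp_to_key

def rotation_order(text):
    """Sort rotation start indices without building any rotated copies."""
    n = len(text)

    def compare(i, j):
        # Walk both rotations in step, wrapping around the single stored copy.
        for k in range(n):
            a = text[(i + k) % n]
            b = text[(j + k) % n]
            if a != b:
                return -1 if a < b else 1
        return 0

    return sorted(range(n), key=cmp_to_key(compare))

def last_column(text, order):
    """The last char of the rotation starting at i is the char just before i."""
    n = len(text)
    return "".join(text[(i - 1) % n] for i in order)

text = "duke blue devils."
order = rotation_order(text)
print(order)
print(repr(last_column(text, order)))
```

Only one copy of the text and n integers are stored, but each comparison may still scan the whole block.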

BWT 2,000 feet
• To transform, all we need is the last column and the row at which the original string is in the list of sorted strings
    - We take these two pieces of information and either compress them or transform them further
    - After the transform we run Huff on the result
• We can’t store/sort a huge file, what do we do?
    - Process big files in chunks/blocks
        Read block, transform block, Huff block
        Read block, transform block, Huff block…
        Block size may impact performance

Toward BWT from zero feet
• First look at code for HuffProcessor.compress
    - Tree already made, preprocessCompress
    - How do writeHeader, writeCompressedData work?

    public int compress(InputStream in, OutputStream out) {
        BitOutputStream bout = new BitOutputStream(out);
        BitInputStream bin = new BitInputStream(in);
        int bitCount = 0;
        myRoot = makeTree();
        makeMapEncodings(myRoot, "");
        bitCount += writeHeader(bout);
        bitCount += writeCompressedData(bin, bout);
        bout.flush();
        return bitCount;
    }

BWT from zero feet, part I
• Read a block of data, transform it, then huff it
    - To huff we write a magic number, write header/tree, and write compressed bits based on Huffman encodings
    - We already have huff code, need to use it on a transformed bunch of characters rather than on the input file
        So process input stream by passing it to BW transform, which reads a chunk and returns char[], the last column
    - A char is a 16-bit, unsigned value; we only need an 8-bit value, but use char because we can’t use byte
        In Java byte is signed, -128 to 127. What does all that mean?
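
What the signed byte means can be seen with a small sketch. Python is used here only to show the bit pattern, and the value 200 is an arbitrary example: the same 8 bits read as a signed byte give a negative number, and masking with 0xFF recovers the 0 to 255 value.

```python
import struct

value = 200                                        # an 8-bit chunk, 0..255
as_signed = struct.unpack("b", bytes([value]))[0]  # same bits read as a signed byte
restored = as_signed & 0xFF                        # mask to get back 0..255

print(value, as_signed, restored)
```

This is why code that stores 8-bit chunks in a Java byte must mask before using the value as an index or count.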

Use what we have, need new stream
• We want to use existing compression code we wrote before
    - Read a block of 8-bit chunks, store in char[] array
    - Repeat until no more blocks, last block not full?
    - Block as char[], treat as stream and feed it to Huff
        Count characters, make tree, compress
• We need an Adapter, something that takes char[] array and turns it into an InputStream which we feed to Huff compressor
    - ByteArrayInputStream, turns byte[] to stream
    - We can store 8-bit chunks as bytes for stream purposes
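
The block loop and the adapter can be sketched in a few lines. This uses Python's `io.BytesIO` as a stand-in for Java's ByteArrayInputStream; `read_blocks` and `as_stream` are hypothetical helper names, and the block size of 8 is chosen only to keep the example short.

```python
import io

def read_blocks(stream, block_size):
    """Yield successive blocks of at most block_size bytes; the last may be short."""
    while True:
        block = stream.read(block_size)
        if not block:          # empty read means no more blocks
            break
        yield block

def as_stream(block):
    """Adapter: wrap a block of bytes so stream-based code can read it."""
    return io.BytesIO(bytes(block))

source = io.BytesIO(b"duke blue devils.")
chunks = list(read_blocks(source, 8))
print(chunks)
print(as_stream(chunks[0]).read())
```

Each block would be transformed before being wrapped, so the existing stream-based compressor never needs to know it is reading a transformed block.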

ByteArrayInputStream and blocks

    public int compress(InputStream in, OutputStream out) {
        BitOutputStream bout = new BitOutputStream(out);
        BitInputStream bin = new BitInputStream(in);
        int bitCount = 0;
        BurrowsWheeler bwt = new BurrowsWheeler();
        while (true) {
            char[] chunk = bwt.transform(bin);   // last column of one block
            if (chunk.length < 1) break;         // no more blocks
            chunk = bwt.mtf(chunk);
            // store each 8-bit value in a byte so it can be fed to a stream
            byte[] array = new byte[chunk.length];
            for (int k = 0; k < array.length; k++) {
                array[k] = (byte) chunk[k];
            }
            // first pass over the block: count characters
            ByteArrayInputStream bas = new ByteArrayInputStream(array);
            preprocessInitialize(bas);
            myRoot = makeTree();
            makeMapEncodings(myRoot, "");
            // second pass over the same block: write compressed bits
            BitInputStream blockBis =
                new BitInputStream(new ByteArrayInputStream(array));
            bitCount += writeHeader(bout);
            bitCount += writeCompressedData(blockBis, bout);
        }
        bout.flush();
        return bitCount;
    }

How do we untransform?
• Untransforming is very slick
    - Basically sort the last column in O(n) time
    - Run an O(n) algorithm to get back original block
• We sort the last column in O(n) time using a counting sort, which is sometimes one phase of radix sort
    - Call sort: easier to code and a good first step
    - The counting sort leverages that we’re sorting “characters”, whatever we read when doing compression, which is an 8-bit chunk
    - How many different 8-bit chunks are there?

Counting sort
• If we have an array of integers all of whose values are between 0 and 255, how can we sort by counting the number of occurrences of each integer?
    - Suppose we have 4 occurrences of one, 1 occurrence of two, 3 occurrences of five and 2 occurrences of seven, what’s the sorted array? (we don’t know the original, just the counts)
    - What’s the answer? How do we write code to do this?
• More than one way; as long as it’s O(n), which one doesn’t really matter
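
One way to answer the question above is to tally each value and then emit each value as many times as it was seen. The input order below is invented for the example; only the counts come from the slide.

```python
def counting_sort(values):
    """Sort integers in the range 0..255 by counting occurrences."""
    counts = [0] * 256
    for v in values:
        counts[v] += 1
    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)   # emit each value count times
    return result

# Four 1s, one 2, three 5s and two 7s, in an arbitrary made-up order.
data = [5, 1, 7, 1, 2, 5, 1, 7, 5, 1]
print(counting_sort(data))
```

The work is one pass over the n inputs plus one pass over the 256 counters, so it is O(n) for a fixed range of values.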

Another transform: Move To Front
• In practice we can introduce more repetition and redundancy using a Move-to-front transform (MTF)
    - We’re going to compress a sequence of numbers (the 8-bit chunks we read, might be the last column from BWT)
    - Instead of just writing the numbers, use MTF to write
• Introduce more redundancy/repetition if there are runs of characters. For example: consider “aaadddfff”
    - As numbers this is 97,97,97,100,100,…,102
    - Using MTF, start with index[k] = k
        0,1,2,3,4,…,96,97,98,99,…,255
    - Search for 97, initially it’s at index[97], then MTF
        97,0,1,2,3,4,5,…,96,98,99,…,255

More on why MTF works
• As numbers this is 97,97,97,100,100,…,102
    - Using MTF, start with index[k] = k
    - Search for 97, initially it’s at index[97], then MTF
        97,0,1,2,3,4,5,…,96,98,99,100,101,…
    - Next time we search for 97 where is it? At 0!
• So, to write out 97,97,97 we actually write 97,0,0; then we write out 100, where is it? Still at 100, why? Then MTF:
    - 100,97,0,1,2,3,…,96,98,99,101,102,…
• So, to write out the whole sequence we write:
    - 97, 0, 0, 100, 0, 0, 102, 0, 0
    - Lots of zeros, ones, etc. Thus more Huffable, why?
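
The walk-through above corresponds to a short loop. This is a minimal sketch, not the MTF code provided in the course, and `mtf_encode` is a hypothetical name: it keeps the 256 values in a list, writes the current position of each value, then moves that value to the front.

```python
def mtf_encode(values):
    """Move-to-front: write each value's current position, then move it to the front."""
    index = list(range(256))       # index[k] = k initially
    out = []
    for v in values:
        pos = index.index(v)       # linear search, at most 256 steps
        out.append(pos)
        index.pop(pos)
        index.insert(0, v)
    return out

codes = [97, 97, 97, 100, 100, 100, 102, 102, 102]
print(mtf_encode(codes))
```

Runs in the input turn into runs of zeros in the output, which gives Huffman coding a heavily skewed set of frequencies to exploit.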

Complexity of MTF and UMTF
• Given n characters, we have to look through 256 indexes (worst case)
    - So, 256*n, this is … O(n)
    - Average case is much better, the whole point of MTF is to find repeats near the beginning (what about MTF complexity?)
• How to untransform, undo MTF, e.g., given
    - 97, 0, 0, 100, 0, 0, 102, 0, 0
• How do we recover aaadddfff (97,97,97,100,100,…,102)?
    - Initially index[k] = k, so where is 97? O(1) look up, then MTF
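
Undoing MTF replays the same list updates, except that each written number is now a position to look up rather than a value to search for. A minimal sketch, with the hypothetical name `mtf_decode`:

```python
def mtf_decode(positions):
    """Undo move-to-front: look up the value at each position, then move it to the front."""
    index = list(range(256))       # same starting list as the encoder
    out = []
    for pos in positions:
        v = index.pop(pos)         # direct look up by position
        out.append(v)
        index.insert(0, v)
    return out

written = [97, 0, 0, 100, 0, 0, 102, 0, 0]
print(mtf_decode(written))
```

Because encoder and decoder start from the same list and apply the same moves, their lists stay identical at every step.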

Burrows Wheeler Summary
• Transform data: make it more “compressible”
    - Introduce redundancy
    - First do BWT, then do MTF (latter provided)
    - Do this in chunks
    - For each chunk array (after BWT and MTF) huff it
• To uncompress data
    - Read block of huffed data, uncompress it, untransform
    - Undo MTF, undo BWT: this code is given to you
    - Don’t forget magic numbers

John Tukey:
• Cooley-Tukey FFT
• Bit: Binary Digit
• Box-plot
• “software” used in print

Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.

A Rose by any other name…C or Java?
• Why do we use Java in our courses (royal we?)
    - Object oriented
    - Large collection of libraries
    - Safe for advanced programming and beginners
    - Harder to shoot ourselves in the foot
• Why don't we use C++ (or C)?
    - Standard libraries weak or non-existent (comparatively)
    - Easy to make mistakes when beginning
    - No GUIs, complicated compilation model
    - What about other languages?

Why do we learn other languages?
• Perl, Python, PHP, Ruby, C, C++, Java, Scheme, ML,
    - Can we do something different in one language?
        Depends on what different means. In theory: no; in practice: yes
    - What languages do you know? All of them.
    - In what languages are you fluent? None of them
• In later courses why do we use C or C++?
    - Closer to the machine, understand abstractions at many levels
    - Some problems are better suited to one language
        Writing an operating system? Linux?

Unique words in Java

    import java.util.*;
    import java.io.*;

    public class Unique {
        public static void main(String[] args) throws IOException {
            Scanner scan = new Scanner(new File("/data/melville.txt"));
            // TreeSet keeps the words unique and in sorted order
            TreeSet<String> set = new TreeSet<String>();
            while (scan.hasNext()) {
                String str = scan.next();
                set.add(str);
            }
            for (String s : set) {
                System.out.println(s);
            }
        }
    }

Bjarne Stroustrup, Designer of C++
• Numerous awards, engineering and science
    - ACM Grace Hopper
• Formerly at Bell Labs
    - Now Texas A&M
• “There's an old story about the person who wished his computer was as easy to use as his telephone. That wish has come true, since I no longer know how to use my telephone.” Bjarne Stroustrup

Unique words in C++

    #include <iostream>
    #include <fstream>
    #include <set>
    #include <string>
    using namespace std;

    int main(){
        ifstream input("/data/melville.txt");
        set<string> unique;      // std::set keeps words unique and sorted
        string word;
        while (input >> word){
            unique.insert(word);
        }
        set<string>::iterator it = unique.begin();
        for(; it != unique.end(); it++){
            cout << *it << endl;
        }
        return 0;
    }

PHP, Rasmus Lerdorf and Others
• Rasmus Lerdorf
    - Qeqertarsuaq, Greenland
    - 1995 started PHP, now part of it
• Personal Home Page
    - No longer an acronym
• “When the world becomes standard, I will start caring about standards.” Rasmus Lerdorf

Unique words in PHP

    <?php
    $wholething = file_get_contents("file:///data/melville.txt");
    $wholething = trim($wholething);
    $array = preg_split("/\s+/", $wholething);
    $uni = array_unique($array);
    sort($uni);
    foreach ($uni as $word){
        echo $word." ";
    }
    ?>

Guido van Rossum
• BDFL for Python development
    - Benevolent Dictator For Life
    - Late 80’s began development
• Python is multi-paradigm
    - OO, Functional, Structured, …
• We're looking forward to a future where every computer user will be able to "open the hood" of their computer and make improvements to the applications inside. We believe that this will eventually change the nature of software and software development tools fundamentally. Guido van Rossum, 1999!

Unique Words in Python

    #! /usr/bin/env python
    import sys
    import re

    def main():
        f = open('/data/melville.txt', 'r')
        # split on runs of whitespace; raw string avoids escape warnings
        words = re.split(r'\s+', f.read().strip())
        f.close()
        allWords = set()
        for w in words:
            allWords.add(w)
        for word in sorted(allWords):
            print("%s" % word)

    if __name__ == "__main__":
        main()

Kernighan and Ritchie
• First C book, 1978
• First ‘hello world’
• Ritchie: Unix too!
    - Turing award 1983
• Kernighan: tools
    - Strunk and White
• Everyone knows that debugging is twice as hard as writing a program in the first place. So if you are as clever as you can be when you write it, how will you ever debug it? Brian Kernighan

How do we read a file in C?

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* comparison function for qsort: a and b point to char * elements */
    int strcompare(const void * a, const void * b){
        char ** stra = (char **) a;
        char ** strb = (char **) b;
        return strcmp(*stra, *strb);
    }

    int main(){
        FILE * file = fopen("/data/melville.txt","r");
        char buf[1024];
        /* room for at most 5000 distinct words; each element is a char * */
        char ** words = (char **) malloc(5000*sizeof(char *));
        int count = 0;
        int k;

Storing words read when reading in C

        while (fscanf(file,"%s",buf) != EOF){
            int found = 0;
            // look for word just read
            for(k=0; k < count; k++){
                if (strcmp(buf,words[k]) == 0){
                    found = 1;
                    break;
                }
            }
            if (!found){ // not found, add to list
                words[count] = (char *) malloc(strlen(buf)+1);
                strcpy(words[count],buf);
                count++;
            }
        }

• Complexity of reading/storing? Allocation of memory?

Sorting, Printing, Freeing in C

        qsort(words,count,sizeof(char *), strcompare);
        for(k=0; k < count; k++) {
            printf("%s\n",words[k]);
        }
        for(k=0; k < count; k++){
            free(words[k]);
        }
        free(words);
    }

• Sorting, printing, and freeing
    - How to sort? What’s analogous to comparator?
    - Why do we call free? Necessary in this program? Why?