
Combinatorics (important to algorithm analysis)
Problem I: How many N-bit strings contain at least 1 zero?
Problem II: How many N-bit strings contain more than 1 zero?
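A quick way to sanity-check whatever counting argument you come up with is brute force. The sketch below is not part of the original slides; it enumerates all N-bit strings and compares the counts with the complement-counting answers 2^N - 1 (exclude the all-ones string) and 2^N - 1 - N (also exclude the N strings with exactly one zero).

from itertools import product

def count_with_zeros(n):
    """Brute-force count of n-bit strings with >= 1 zero and with > 1 zero."""
    at_least_one = sum(1 for s in product("01", repeat=n) if s.count("0") >= 1)
    more_than_one = sum(1 for s in product("01", repeat=n) if s.count("0") > 1)
    return at_least_one, more_than_one

for n in range(1, 8):
    a, b = count_with_zeros(n)
    assert a == 2**n - 1          # only the all-ones string has no zero
    assert b == 2**n - 1 - n      # also exclude the n strings with exactly one zero
print("complement-counting formulas check out")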

Which areas of CS need all this?
Information Theory
Data Storage, Retrieval
Data Transmission
Encoding

Data Compression
Lossless compression: 25.888888888 => 25.[9]8 (the nine repeated 8s are stored as a count plus the digit; nothing is lost).
Lossy compression: 25.888888888 => 26 (information is lost).
Lossless compression exploits statistical redundancy. For example, in English the letter "e" is common but "z" is not, and you never have a "q" followed by a "z". Drawback: it is not universal. If there is no pattern, there is no compression.
Lossy compression (e.g. JPEG images) discards detail that the consumer is unlikely to notice.
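A minimal run-length encoder in Python illustrates the "25.[9]8" idea. This sketch is an illustration added here, not code from the lecture, and it only compresses runs of three or more identical characters.

from itertools import groupby

def run_length_encode(text):
    """Replace each run of 3+ identical characters with [count]char."""
    out = []
    for ch, group in groupby(text):
        run = len(list(group))
        out.append(f"[{run}]{ch}" if run >= 3 else ch * run)
    return "".join(out)

print(run_length_encode("25.888888888"))   # -> 25.[9]8
print(run_length_encode("abcabc"))         # no runs, so no compression: abcabc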

Information (Shannon) Entropy quantifies, in the sense of an expected value, the information contained in a message.
Example 1: A fair coin has an entropy of 1 bit. If the coin is not fair, the uncertainty is lower (if asked to bet on the next outcome, we would bet preferentially on the most frequent result), so the Shannon entropy is lower than 1.
Example 2: A long string of repeating characters: S = 0.
Example 3: English text: S ~ 0.6 to 1.3 bits per character.
The source coding theorem: as the length of a stream of independent and identically distributed random data tends to infinity, it is impossible to compress the data such that the code rate (average number of bits per symbol) is less than the Shannon entropy of the source without information loss.
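The coin examples can be reproduced with a few lines of Python. The helper below is a sketch added for illustration (not from the slides) of the standard formula S = -sum(p_i * log2(p_i)).

from math import log2

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(shannon_entropy([0.9, 0.1]))   # biased coin: ~0.47 bits, less uncertainty
print(shannon_entropy([1.0]))        # a long string of one repeating character: 0 bits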

Information (Shannon) Entropy Cont'd
The source coding theorem: it is impossible to compress data such that the code rate (average number of bits per symbol) is less than the Shannon entropy of the source without information loss. (This holds in the limit of a long stream of independent and identically distributed data.)

Combinatorics cont'd
Problem: "Random Play" on your I-Touch works like this. When pressed once, it plays a random song from your library of N songs. The song just played is excluded from the library. Next time "Random Play" is pressed, it draws another song at random from the remaining N-1 songs. Suppose you have pressed "Random Play" k times. What is the probability that you will have heard your one most favorite song?

But first, let's be ready to check the solution once we find it, using the "extreme case" test.

Combinatorics cont'd
The "Random Play" problem. Tactics: get your hands dirty. P(1) = 1/N. P(2) = ? Need to be careful: what if the song has not played 1st? What if it has? Becomes complicated for large N.
Let's compute the complement: the probability that the favorite song has NOT played.

Table. Press # (k) vs. probability the song is not played on that press (given it has not played earlier):
  1    1 - 1/N
  2    1 - 1/(N-1)
  ...
  k    1 - 1/(N-k+1)

Key tactic: find the complementary probability P(not) = (1 - 1/N)(1 - 1/(N-1))*…*(1 - 1/(N-k+1)); then P(k) = 1 - P(not).
Re-arrange: P(not) = (N-1)/N * (N-2)/(N-1) * (N-3)/(N-2)*…*(N-k)/(N-k+1) = (N-k)/N (the product telescopes).
Thus, P = 1 - P(not) = k/N.
If you try to guess the solution, make sure your guess works for simple cases when the answer is obvious, e.g. k=1, k=N. Also, P <= 1. The very simple answer suggests that a simpler solution may be possible. Can you find it?
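A quick Monte Carlo check of P = k/N, added here as a sketch (not from the slides), assuming the player draws songs without replacement exactly as described and song 0 is the favorite:

import random

def prob_heard_favorite(n_songs, k_presses, trials=100_000):
    """Estimate the probability the favorite song appears among the first
    k draws of a without-replacement shuffle of n_songs songs."""
    hits = 0
    for _ in range(trials):
        played = random.sample(range(n_songs), k_presses)  # the first k songs played
        if 0 in played:                                     # song 0 is the favorite
            hits += 1
    return hits / trials

print(prob_heard_favorite(20, 5))   # ~0.25, matching k/N = 5/20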

The "clever" solution: by symmetry, the k songs played are just the first k entries of a random ordering of all N songs, and the favorite song is equally likely to land in any of the N positions, so P = k/N. [Figure: the k played songs out of N, with the favorite song marked.]

What is DNA? All organisms on this planet are built from the same type of genetic blueprint. Within the cells of any organism is a substance called DNA, which is a double-stranded helix of nucleotides. DNA carries the genetic information of a cell. This information is the code used within cells to form proteins and is the building block upon which life is formed. Strands of DNA are long polymers of millions of linked nucleotides.

Graphical Representation of inherent bonding properties of DNA

DNA information content
How much data (MB?) is in the human genome?

DNA in bits and bytes: 1 bit = single “0” or “1”. That is 1 bit = 2 possibilities. 2 bits = ?

DNA in bits and bytes: 1 bit = single “0” or “1”. That is 1 bit = 2 possibilities. 2 bits = 2 x 2 = 4 possibilities: 00, 01, 10, 11. 3 bits = ?

DNA in bits and bytes: 1 bit = single “0” or “1”. That is 1 bit = 2 possibilities. 2 bits = 2 x 2 = 4 possibilities: 00, 01, 10, 11. 3 bits = 2x2x2 = 8. In general n bits = 2^n possibilities.

DNA bits: 4 possibilities for a single position (e.g. 1st letter) = 2 bits. That is 2 bits per letter.

DNA bits: 4 possibilities for a single position (e.g. 1st letter) = 2 bits. That is 2 bits per letter (bp). A single byte = 8 bits (by definition). That is 4 bp = 1 byte. Human genome ~ 3*10^9 letters. Human genome = 3*10^9 bp / (4 bp per byte) ~ 750 MB. Is this a lot or a little?
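The back-of-the-envelope arithmetic, as a tiny Python check using the round numbers from the slide (a sketch added for illustration):

genome_letters = 3e9           # ~3 billion base pairs in the human genome
bits_per_letter = 2            # 4 possible letters (A, T, G, C) -> 2 bits each
bytes_total = genome_letters * bits_per_letter / 8
print(bytes_total / 1e6, "MB")  # -> 750.0 MB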

750 MB of DNA code = a 15 min long movie clip (HD). Does that make sense? Difference between John and Jane?

750 MB of the DNA code = a 15 min long movie clip (HD). Does that make sense? Difference between John and Jane? About 1% difference in DNA letters, that is about 7.5 MB of data = about 2 songs (or a single one by Adele).

750 MB of the DNA code = a 15 min long movie clip (HD). Difference between John and Jane? About 1% difference in DNA letters, that is about 7.5 MB of data = about 2 songs (or a single one by Adele). Does not make sense. There must be more info stored someplace else.

Combinatorics Cont'd
Problem: A DNA sequence contains only 4 letters (A, T, G and C). Short "words" made of K consecutive letters form the genetic code. Each word (called a "codon") codes for a specific amino acid in proteins. For example, ATTTC is a 5-letter word. There are a total of 20 amino acids. Prove that a working genetic code based on a fixed K is degenerate, that is, there are amino acids which are coded for by more than one "word". It is assumed that every word codes for an amino acid.

Solution. Start with the heuristic that almost always helps: simplify. Consider K=1, K=2, …
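A sketch of where that simplification leads (added here for illustration): with 4 letters there are 4^K possible words of length K, so the smallest K that can cover 20 amino acids already overshoots, and the pigeonhole principle forces degeneracy.

AMINO_ACIDS = 20
for k in range(1, 4):
    words = 4 ** k                       # number of distinct K-letter DNA words
    enough = words >= AMINO_ACIDS
    print(f"K={k}: {words} words, {'enough' if enough else 'too few'}")
# K=1 gives 4 and K=2 gives 16 words: too few for 20 amino acids.
# K=3 gives 64 >= 20, so 64 words must map onto 20 amino acids,
# and some amino acids necessarily get more than one codon.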

How can you use the solution in an argument:
for intelligent design?
against intelligent design?

Permutations
How many three-digit integers (decimal representation) are there if you cannot use a digit more than once?

P(n,r): the number of ways an ordered selection of r elements can be made from a set of n elements is given by P(n,r) = n!/(n-r)!.
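Both this formula and the three-digit puzzle above can be checked with Python's standard library; this is a small added sketch, where math.perm(n, r) computes n!/(n-r)!.

from math import perm

print(perm(10, 3))   # 720 ordered selections of 3 distinct digits (leading 0 allowed)

# Three-digit integers with no repeated digit: the leading digit cannot be 0.
count = sum(1 for n in range(100, 1000) if len(set(str(n))) == 3)
print(count)         # 648 = 9 * 9 * 8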

Theorem. Suppose we have n objects of k different types, with n_i identical objects of the i-th type (i = 1, …, k). Then the number of distinct arrangements of those n objects is equal to n!/(n_1! n_2! … n_k!).
Visualize: n_i identical balls of color i; total # of balls = n_1 + … + n_k = n.

Combinatorics Cont'd
The Mississippi formula. Example: What is the number of letter permutations of the word BOOBOO? 6!/(2! 4!)

Permutations and Combinations with Repetitions
How many distinct arrangements are there of the letters in the word MISSISSIPPI? Imagine that each of the 11 letters in this word has a unique color (the figure shows only the 4 letters "I" in different colors). Each arrangement of colored copies must still read the same word MISSISSIPPI.
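By the theorem above the count is 11!/(4! 4! 2! 1!), since MISSISSIPPI has 4 I's, 4 S's, 2 P's and 1 M. A short Python check, added here as a sketch:

from math import factorial
from collections import Counter

word = "MISSISSIPPI"
denominator = 1
for count in Counter(word).values():        # {'M': 1, 'I': 4, 'S': 4, 'P': 2}
    denominator *= factorial(count)         # divide out indistinguishable swaps
print(factorial(len(word)) // denominator)  # 34650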

The "shoot the hoops" wager (a Google interview question). You are offered one of two wagers: (1) you make one hoop in one throw, or (2) you make at least 2 hoops in 3 throws. You get $1000 if you succeed. Which wager would you choose?

Warm-up: Use our "test the solution" heuristic to immediately see that P(2/3), as calculated, is wrong.

The hoops wager. Find P(2/3). If p is the probability of making a single hoop, then P(2/3) = P(exactly 2 of 3) + P(3 of 3) = 3p^2(1-p) + p^3 = 3p^2 - 2p^3.

The hoops wager. Compare P(1/1) = p and P(2/3) = 3p^2 - 2p^3.
Notice: if p << 1, p is always greater than 3p^2 - 2p^3, therefore if you are a poor shot like myself, then you want to choose the 1st wager, 1 out of 1. At p = 1/2, P(1/1) = P(2/3). But at p > 1/2, 3p^2 - 2p^3 > p, and so, if you are LeBron James, you are more likely to get $1000 if you go with the 2nd wager, 2 out of 3.
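A few lines of Python, added as a sketch, tabulate both wagers and confirm the crossover at p = 1/2:

def p_one_of_one(p):
    return p

def p_two_of_three(p):
    # at least 2 hits in 3 independent throws
    return 3 * p**2 * (1 - p) + p**3      # = 3p^2 - 2p^3

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    better = "1 of 1" if p_one_of_one(p) > p_two_of_three(p) else "2 of 3"
    if abs(p_one_of_one(p) - p_two_of_three(p)) < 1e-12:
        better = "either (tie)"
    print(f"p={p}: P(1/1)={p_one_of_one(p):.3f}  P(2/3)={p_two_of_three(p):.3f}  -> {better}")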