Presentation transcript:

Combinatorics (important to algorithm analysis). Problem I: How many N-bit strings contain at least 1 zero? Problem II: How many N-bit strings contain more than 1 zero?
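
A quick brute-force check of both counts (an illustrative sketch, not from the slides; the helper name is made up). It confirms the closed forms 2^N - 1 for Problem I (every string except the all-ones string) and 2^N - 1 - N for Problem II (also excluding the N strings with exactly one zero).

```python
from itertools import product

def count_strings_with_zeros(n, min_zeros):
    """Brute force: count n-bit strings containing at least `min_zeros` zeros."""
    return sum(1 for bits in product("01", repeat=n)
               if bits.count("0") >= min_zeros)

for n in range(1, 10):
    assert count_strings_with_zeros(n, 1) == 2**n - 1        # Problem I: at least 1 zero
    assert count_strings_with_zeros(n, 2) == 2**n - 1 - n    # Problem II: more than 1 zero
print("closed forms verified for N = 1..9")
```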

Which areas of CS need all this? Information theory; data storage and retrieval; data transmission; encoding.

Information (Shannon) Entropy quantifies, in the sense of an expected value, the information contained in a message. Example 1: a fair coin has an entropy of 1 bit. If the coin is not fair, the uncertainty is lower (if asked to bet on the next outcome, we would bet preferentially on the most frequent result), so the Shannon entropy is lower than 1. Example 2: a long string of repeating characters: S = 0. Example 3: English text: S ~ 0.6 to 1.3 bits per character. The source coding theorem: as the length of a stream of independent and identically distributed random data tends to infinity, it is impossible to compress the data such that the code rate (average number of bits per symbol) is less than the Shannon entropy of the source, without information loss.
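
A small sketch (not from the slides; the function names are illustrative) that computes Shannon entropy H = -sum p_i log2 p_i and reproduces the three examples above:

```python
import math
from collections import Counter

def shannon_entropy(probs):
    """Entropy in bits: H = -sum(p * log2(p)), skipping zero-probability outcomes."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def text_entropy(text):
    """Per-character entropy of a string, using observed character frequencies."""
    counts = Counter(text)
    return shannon_entropy(c / len(text) for c in counts.values())

print(shannon_entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(shannon_entropy([0.9, 0.1]))   # unfair coin: ~0.47 bits, i.e. less than 1
print(text_entropy("aaaaaaaaaaaa"))  # long string of repeating characters: 0.0
print(text_entropy("the quick brown fox jumps over the lazy dog"))  # small English sample
```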

Information (Shannon) Entropy cont'd. The source coding theorem: it is impossible to compress the data such that the code rate (average number of bits per symbol) is less than the Shannon entropy of the source, without information loss. (This holds in the limit as the length of a stream of independent and identically distributed random data tends to infinity.)

Data Compression. Lossless compression: 25.888888888 => 25.[8x9] (run-length: the digit 8 repeated 9 times). Lossy compression: 25.888888888 => 26. Lossless compression exploits statistical redundancy. For example, in English the letter "e" is common, but "z" is not, and you never have a "q" followed by a "z". Drawback: not universal; if there is no pattern, there is no compression. Lossy compression discards some information (e.g. JPEG images).
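
As an illustration of exploiting redundancy (the slides do not name a specific scheme), a toy run-length encoder collapses the repeated 8s and decodes back without any loss:

```python
from itertools import groupby

def rle_encode(s):
    """Run-length encoding: each run of a repeated character becomes (char, run length)."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]

def rle_decode(pairs):
    return "".join(ch * n for ch, n in pairs)

original = "25.888888888"
encoded = rle_encode(original)
print(encoded)                           # [('2', 1), ('5', 1), ('.', 1), ('8', 9)]
assert rle_decode(encoded) == original   # lossless: the input is recovered exactly
```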

Combinatorics cont'd. Problem: "Random Play" on your iPod Touch works like this. When pressed once, it plays a random song from your library of N songs. The song just played is then excluded from the library. The next time "Random Play" is pressed, it draws another song at random from the remaining N-1 songs. Suppose you have pressed "Random Play" k times. What is the probability that you will have heard your one favorite song?

Combinatorics cont'd. The "Random Play" problem. Tactics: get your hands dirty. P(1) = 1/N. P(2) = ? You need to be careful: what if the song has not played 1st? What if it has? This becomes complicated for large N.
Key tactic: find the complementary probability, the probability that the favorite song has NOT played.
Press # (k):  probability it has not played on that press
1:            1 - 1/N
2:            1 - 1/(N-1)
k:            1 - 1/(N-k+1)
P(not) = (1 - 1/N)(1 - 1/(N-1))...(1 - 1/(N-k+1)); then P(k) = 1 - P(not). Re-arrange:
P(not) = (N-1)/N * (N-2)/(N-1) * (N-3)/(N-2) * ... * (N-k)/(N-k+1) = (N-k)/N (the product telescopes).
Thus P = 1 - P(not) = k/N.
If you try to guess the solution, make sure your guess works for simple cases where the answer is obvious, e.g. k = 1 and k = N. Also check that P <= 1. The very simple answer suggests that a simpler solution may be possible. Can you find it?
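
A quick Monte Carlo check of the result P = k/N (a sketch with made-up names; it assumes one specific favorite song and uniformly random draws without replacement, as in the problem statement):

```python
import random

def prob_favorite_heard(n_songs, k_presses, trials=100_000):
    """Estimate the probability that song 0 (the favorite) is among the first k picks."""
    hits = 0
    for _ in range(trials):
        played = random.sample(range(n_songs), k_presses)  # k distinct songs, no repeats
        if 0 in played:
            hits += 1
    return hits / trials

n, k = 50, 10
print(prob_favorite_heard(n, k), "vs exact", k / n)  # estimate should be close to 0.2
```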

The "clever" solution. [Slide diagram: the N songs in their random play order, with the first K positions marked as played and the favorite song somewhere among the N positions. The favorite is equally likely to occupy any of the N positions, so the probability that it falls among the first K played is K/N.]

What is DNA? All organisms on this planet are built from the same type of genetic blueprint. Within the cells of any organism is a substance called DNA, which is a double-stranded helix of nucleotides. DNA carries the genetic information of a cell. This information is the code used within cells to form proteins and is the building block upon which life is formed. Strands of DNA are long polymers of millions of linked nucleotides.

[Figure: graphical representation of the inherent bonding properties of DNA.]

DNA information content. How much data (in MB?) is in the human genome?

DNA in bits and bytes: 1 bit = single “0” or “1”. That is 1 bit = 2 possibilities. 2 bits = ?

DNA in bits and bytes: 1 bit = single “0” or “1”. That is 1 bit = 2 possibilities. 2 bits = 2 x 2 = 4 possibilities: 00, 01, 10, 11. 3 bits = ?

DNA in bits and bytes: 1 bit = a single "0" or "1". That is, 1 bit = 2 possibilities. 2 bits = 2 x 2 = 4 possibilities: 00, 01, 10, 11. 3 bits = 2 x 2 x 2 = 8. In general, n bits = 2^n possibilities.

DNA bits: 4 possibilities for a single position (e.g. 1st letter) = 2 bits. That is 2 bits per letter.

DNA bits: 4 possibilities for a single position (e.g. the 1st letter) = 2 bits. That is 2 bits per letter (bp). A single byte = 8 bits (by definition). That is, 4 bp = 1 byte. Human genome ~ 3 x 10^9 letters. Human genome = 3 x 10^9 bp / 4 bp per byte ~ 750 MB. Is this a lot or a little?
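
The same arithmetic as a worked sketch (using the approximate figures from the slide):

```python
letters = 3e9            # approximate number of letters (base pairs) in the human genome
bits_per_letter = 2      # 4 possible letters (A, T, G, C) -> log2(4) = 2 bits
total_bits = letters * bits_per_letter
total_bytes = total_bits / 8        # 8 bits per byte, i.e. 4 letters per byte
print(total_bytes / 1e6, "MB")      # ~750 MB
```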

750 MB of DNA code = a 15 min long movie clip (HD). Does that make sense? Difference between John and Jane?

750 MB of the DNA code = a 15 min long movie clip (HD). Does that make sense? Difference between John and Jane? About a 1% difference in DNA letters, that is about 7.5 MB of data = about 2 songs (or a single one by Adele).

750 MB of the DNA code = a 15 min long movie clip (HD). Difference between John and Jane? About a 1% difference in DNA letters, that is about 7.5 MB of data = about 2 songs (or a single one by Adele). That does not make sense. There must be more information stored someplace else.

Combinatorics cont'd. Problem: a DNA sequence contains only 4 letters (A, T, G and C). Short "words" made of K consecutive letters form the genetic code. Each word (called a "codon") codes for a specific amino acid in proteins. For example, ATTTC is a 5-letter word. There are a total of 20 amino acids. Prove that a genetic code based on a fixed K is degenerate, that is, there are amino acids which are coded for by more than one "word". It is assumed that every word codes for an amino acid.
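
One way to spell out the counting behind the proof (a sketch; it uses the slide's assumption that every word codes for an amino acid):

```python
AMINO_ACIDS = 20
for k in range(1, 5):
    print(k, 4**k)  # number of distinct K-letter words over the 4-letter alphabet
# K = 1 gives 4 words and K = 2 gives 16 words: too few to cover 20 amino acids,
# so K must be at least 3. But then there are 4**K >= 64 > 20 words, and since every
# word codes for some amino acid, the pigeonhole principle forces at least one amino
# acid to be coded by more than one word -- the code is degenerate.
```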

How can you use the solution in an argument for intelligent design? In an argument against intelligent design?

Permutations. How many three-digit integers (decimal representation) are there if you cannot use a digit more than once?
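
A brute-force check of this count (an illustrative sketch), compared against the multiplication rule 9 * 9 * 8:

```python
# Three-digit integers run from 100 to 999; keep those whose decimal digits are all different.
count = sum(1 for n in range(100, 1000) if len(set(str(n))) == 3)
print(count)       # 648
print(9 * 9 * 8)   # 9 choices for the first digit (1-9), 9 for the second, 8 for the third
```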

P(n,r). The number of ways an ordered arrangement of r elements can be chosen from a set of n distinct elements is given by P(n,r) = n!/(n-r)!.

Theorem. Suppose we have n objects of k different types, with n_i identical objects of the i-th type (so n_1 + n_2 + ... + n_k = n). Then the number of distinct arrangements of those n objects is equal to n!/(n_1! n_2! ... n_k!). Visualize: n_i identical balls of color i; total # of balls = n.

Combinatorics cont'd. The Mississippi formula. Example: what is the number of letter permutations of the word BOOBOO? 6!/(2! 4!) = 15.

Permutations and Combinations with Repetitions. How many distinct arrangements are there of the letters in the word MISSISSIPPI? Follow-up: suppose each of the 11 letters in this word has a unique color (e.g. the 1st "I" is blue, the 2nd is red, etc.). How many arrangements still read MISSISSIPPI?
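
A sketch of the Mississippi formula in code (the helper name is made up), covering BOOBOO, MISSISSIPPI, and the colored-letters follow-up:

```python
from math import factorial, prod
from collections import Counter

def distinct_arrangements(word):
    """Multinomial count: n! divided by the factorial of each letter's multiplicity."""
    counts = Counter(word)
    return factorial(len(word)) // prod(factorial(c) for c in counts.values())

print(distinct_arrangements("BOOBOO"))       # 6!/(2! 4!) = 15
print(distinct_arrangements("MISSISSIPPI"))  # 11!/(1! 4! 4! 2!) = 34650
# Colored-letters follow-up: with all 11 letters distinguishable, an arrangement reads
# MISSISSIPPI exactly when identical letters are permuted among themselves:
print(factorial(1) * factorial(4) * factorial(4) * factorial(2))  # 1! * 4! * 4! * 2! = 1152
```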

The "shoot the hoops" wager (a Google interview question). You are offered one of two wagers: (1) you make one hoop in one throw, or (2) you make at least 2 hoops in 3 throws. You get $1000 if you succeed. Which wager would you choose?

Warm-up: use our "test the solution" heuristic to immediately see that P(2/3), as calculated so far, is wrong.

The hoops wager. Find P(2/3): the probability of at least 2 hoops in 3 throws is P(2/3) = 3p^2(1-p) + p^3 = 3p^2 - 2p^3, where p is the probability of making a single throw.

The hoops wager. Compare P(1/1) = p and P(2/3). Notice: if p << 1, p is always greater than 3p^2 - 2p^3, so if you are a poor shot like myself, then you want to choose the 1st wager, 1 out of 1. At p = 1/2, P(1/1) = P(2/3). But at p > 1/2, 3p^2 - 2p^3 > p, and so, if you are LeBron James, you are more likely to get the $1000 if you go with the 2nd wager, 2 out of 3.
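
A few values comparing the two wagers numerically (a sketch; p is the single-throw success probability):

```python
def p_two_of_three(p):
    """P(at least 2 makes in 3 independent throws) = 3*p^2*(1-p) + p^3 = 3p^2 - 2p^3."""
    return 3 * p**2 - 2 * p**3

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p = {p:.1f}   1-of-1: {p:.3f}   2-of-3: {p_two_of_three(p):.3f}")
# Below p = 1/2 the single throw is the better wager; above 1/2, two out of three wins.
```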