Information Theory Basics

What is information theory? A way to quantify information. Much of the theory comes from two worlds, channel coding and compression, but it is useful for lots of other things. It dates to Claude Shannon's work in the mid-to-late 1940s.

Requirements: “This data will compress to at most N bits.” “This channel will allow us to transmit N bits per second.” “This plaintext will require at least N bans of ciphertext.” In each case N is a number for the amount of information/uncertainty/entropy of a random variable X; that is, H(X) = N.

What are the requirements for such a measure? For example, continuity: changing the probabilities by a small amount should change the measure by only a small amount.

Maximum: Which distribution should have the maximum entropy? And for equiprobable outcomes, what should happen to the measure as we increase the number of outcomes?
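With the entropy measure defined a few slides later, n equiprobable outcomes give the value below, which grows with n, matching the intuition that more equally likely outcomes mean more uncertainty:

```latex
H\!\left(\tfrac{1}{n},\ldots,\tfrac{1}{n}\right)
  \;=\; -\sum_{i=1}^{n} \tfrac{1}{n}\,\lg\tfrac{1}{n}
  \;=\; \lg n .
```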

Symmetry: The measure should be unchanged if the outcomes are re-ordered.

Additivity: The amount of entropy should be independent of how we divide the process into parts.

Entropy of Discrete RVs: the expected value of the amount of information for an event, H(X) = E[-lg p(X)] = -Σ p(x) lg p(x), summed over all outcomes x.

Flip a fair coin (-0.5 lg 0.5) + (-0.5 lg 0.5) = 1.0 Flip three fair coins?

Flip three fair coins: coin by coin, (-0.5 lg 0.5) + (-0.5 lg 0.5) + (-0.5 lg 0.5) + (-0.5 lg 0.5) + (-0.5 lg 0.5) + (-0.5 lg 0.5) = 3.0; over the eight equally likely joint outcomes, 8 × (-0.125 lg 0.125) = 3.0.

Flip biased coin A 60% heads

Biased coin A: (-0.6 lg 0.6) + (-0.4 lg 0.4) ≈ 0.971 bits.

Biased coin B: 95% heads. (-0.95 lg 0.95) + (-0.05 lg 0.05) ≈ 0.286 bits. Why is there less information in biased coins?
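A minimal Python sketch (the helper name entropy is mine) that reproduces the coin numbers above:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: -sum of p * lg p, with 0 * lg 0 taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))        # fair coin: 1.0
print(entropy([0.125] * 8))       # three fair coins, eight joint outcomes: 3.0
print(entropy([0.6, 0.4]))        # biased coin A: ~0.971
print(entropy([0.95, 0.05]))      # biased coin B: ~0.286
```

The flatter the distribution, the higher the entropy; a heavily biased coin is more predictable, so each flip carries less information.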

Information=uncertainty=entropy

Flip A, then flip B. A: (-0.6 lg 0.6) + (-0.4 lg 0.4) ≈ 0.971. B: (-0.95 lg 0.95) + (-0.05 lg 0.05) ≈ 0.286. Summing the two: ((-0.6 lg 0.6) + (-0.4 lg 0.4)) + ((-0.95 lg 0.95) + (-0.05 lg 0.05)) ≈ 1.257. Computing over the four joint outcomes directly: (-(0.6*0.95) lg(0.6*0.95)) + (-(0.6*0.05) lg(0.6*0.05)) + (-(0.4*0.95) lg(0.4*0.95)) + (-(0.4*0.05) lg(0.4*0.05)) ≈ 1.257, the same, because the two flips are independent.
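Why the two computations agree: for independent flips the joint probabilities factor, so the logarithm splits the joint entropy into the sum of the individual entropies. A sketch of the algebra, with p and q the distributions of coins A and B:

```latex
\begin{align*}
H(A,B) &= -\sum_{a,b} p(a)\,q(b)\,\lg\bigl(p(a)\,q(b)\bigr) \\
       &= -\sum_{a,b} p(a)\,q(b)\,\bigl(\lg p(a) + \lg q(b)\bigr) \\
       &= -\sum_{a} p(a)\,\lg p(a) \;-\; \sum_{b} q(b)\,\lg q(b) \;=\; H(A) + H(B).
\end{align*}
```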

Entropy (summary) Continuity, maximum, symmetry, additivity

Example: Maximum Entropy. Wikipedia: “Maximum-likelihood estimators can lack asymptotic normality and can be inconsistent if there is a failure of one (or more) of the below regularity conditions... Estimate on boundary, Data boundary parameter-dependent, Nuisance parameters, Increasing information...” And, on the principle of maximum entropy: “Subject to known constraints, the probability distribution which best represents the current state of knowledge is the one with largest entropy.” What distribution maximizes entropy?
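One way to answer the closing question, assuming the only constraint is that there are n possible outcomes: applying Jensen's inequality to the concave lg function,

```latex
H(p) \;=\; \sum_{i=1}^{n} p_i \,\lg\frac{1}{p_i}
     \;\le\; \lg\!\left(\sum_{i=1}^{n} p_i \cdot \frac{1}{p_i}\right)
     \;=\; \lg n ,
```

with equality exactly when every p_i = 1/n. So, absent other constraints, the uniform distribution maximizes entropy.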

Beyond Entropy. Flip a fair coin for X; if heads, flip coin A for Y; if tails, flip coin B for Y. H(X) = 1.0. H(Y) = (-(0.5*0.6 + 0.5*0.95) lg(0.5*0.6 + 0.5*0.95)) + (-(0.5*0.4 + 0.5*0.05) lg(0.5*0.4 + 0.5*0.05)) ≈ 0.769. Joint entropy H(X,Y) = ((-(0.5 * 0.6)) * lg(0.5 * 0.6)) + ((-(0.5 * 0.95)) * lg(0.5 * 0.95)) + ((-(0.5 * 0.4)) * lg(0.5 * 0.4)) + ((-(0.5 * 0.05)) * lg(0.5 * 0.05)) ≈ 1.629. Where are the other 1.0 + 0.769 – 1.629 ≈ 0.14 bits of information?
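A small Python sketch (names are mine) that reproduces these numbers from the joint distribution of the two flips:

```python
import math

def H(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# X = the fair flip; Y = the second flip (coin A if X is heads, coin B if tails).
joint = {('H', 'H'): 0.5 * 0.60, ('H', 'T'): 0.5 * 0.40,
         ('T', 'H'): 0.5 * 0.95, ('T', 'T'): 0.5 * 0.05}
px = {'H': 0.5, 'T': 0.5}
py = {'H': 0.5 * 0.60 + 0.5 * 0.95, 'T': 0.5 * 0.40 + 0.5 * 0.05}

Hx, Hy, Hxy = H(px.values()), H(py.values()), H(joint.values())
print(Hx, Hy, Hxy)      # ~1.0, ~0.769, ~1.629
print(Hx + Hy - Hxy)    # ~0.14 bits shared between X and Y
```

The "missing" ≈0.14 bits are exactly the information X and Y share, which the next slide names: the mutual information.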

Mutual Information: I(X;Y) = Σ p(x,y) lg [p(x,y) / (p(x) p(y))], summed over all pairs (x,y), which equals H(X) + H(Y) – H(X,Y). What are H(X|Y) and H(Y|X)?
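The chain rule H(X,Y) = H(Y) + H(X|Y) = H(X) + H(Y|X) answers the question, here with the numbers from the running example:

```latex
\begin{align*}
H(X \mid Y) &= H(X,Y) - H(Y) = H(X) - I(X;Y) \approx 1.629 - 0.769 \approx 0.86 \text{ bits},\\
H(Y \mid X) &= H(X,Y) - H(X) = H(Y) - I(X;Y) \approx 1.629 - 1.000 \approx 0.63 \text{ bits}.
\end{align*}
```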

Example: sufficient statistics. Students are asked to flip a coin 100 times and record the results. How do we detect the cheaters?

Example: sufficient statistics. {f_θ(x)} is a family of probability mass functions indexed by θ, and X is a sample from a distribution in this family. T(X) is a statistic: a function of the sample, like the sample mean or sample variance. Then I(θ;T(X)) ≤ I(θ;X), with equality only if no information about θ is lost, i.e., only if T is a sufficient statistic.
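A runnable sketch of this inequality under assumed numbers (θ uniform on {0, 1}, two flips that each match θ with probability 0.9; the 0.9 and all names are mine, not from the slide). Keeping the count of ones, a sufficient statistic here, preserves I(θ;X); keeping only the first flip does not:

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(U;V) in bits from a dict {(u, v): probability}."""
    pu, pv = defaultdict(float), defaultdict(float)
    for (u, v), p in joint.items():
        pu[u] += p
        pv[v] += p
    return sum(p * math.log2(p / (pu[u] * pv[v]))
               for (u, v), p in joint.items() if p > 0)

def p_x_given_theta(x, theta):
    """Probability of the pair of flips x given theta: each flip equals theta with prob 0.9."""
    return math.prod(0.9 if xi == theta else 0.1 for xi in x)

samples = [(x1, x2) for x1 in (0, 1) for x2 in (0, 1)]
joint_full = {(t, x): 0.5 * p_x_given_theta(x, t) for t in (0, 1) for x in samples}
joint_count = defaultdict(float)   # T(X) = count of ones in the sample (sufficient)
joint_first = defaultdict(float)   # T(X) = first flip only (not sufficient)
for (t, x), p in joint_full.items():
    joint_count[(t, sum(x))] += p
    joint_first[(t, x[0])] += p

print(mutual_information(joint_full))    # I(theta; X)
print(mutual_information(joint_count))   # equal to I(theta; X): no information lost
print(mutual_information(joint_first))   # strictly smaller: information lost
```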

Kullback-Leibler divergence (a.k.a. relative entropy). Process 1: flip an unbiased coin; if heads, flip biased coin A (60% heads); if tails, flip biased coin B (95% heads). Process 2: roll a fair die; 1, 2, or 3 = (tails, heads), 4 = (heads, heads), 5 = (heads, tails), 6 = (tails, tails). Process 3: flip two fair coins and just record the results. Which, out of 2 and 3, is a better approximate model of 1?

Kullback-Leibler divergence (a.k.a. relative entropy). P is the true distribution, Q is the model: Dkl(P||Q) = Σ P(x) lg [P(x)/Q(x)]. Dkl(P1||P2) ≈ 0.203, Dkl(P1||P3) ≈ 0.371, so process 2 is the closer model of process 1. Note that Dkl is not symmetric: Dkl(P2||P1) ≈ 0.308, Dkl(P3||P1) ≈ 0.614.
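A sketch that reproduces these divergences from the three processes' joint distributions over (first result, second result); the dictionary layout and names are mine:

```python
import math

def kl(p, q):
    """D_KL(p || q) in bits; p and q are dicts over the same outcomes."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

outcomes = [('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')]
P1 = {('H', 'H'): 0.5 * 0.60, ('H', 'T'): 0.5 * 0.40,    # process 1: fair coin, then A or B
      ('T', 'H'): 0.5 * 0.95, ('T', 'T'): 0.5 * 0.05}
P2 = {('H', 'H'): 1/6, ('H', 'T'): 1/6,                   # process 2: die-based approximation
      ('T', 'H'): 3/6, ('T', 'T'): 1/6}
P3 = {x: 0.25 for x in outcomes}                           # process 3: two fair coins

print(kl(P1, P2), kl(P1, P3))   # ~0.203 vs ~0.371: P2 is the closer model of P1
print(kl(P2, P1), kl(P3, P1))   # ~0.308, ~0.614: D_KL is not symmetric
```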

Conditional mutual information: I(X;Y|Z) is the expected value, over Z, of the mutual information between X and Y given Z = z, i.e., I(X;Y|Z) = Σ p(z) I(X;Y | Z = z).

Interaction information: I(X;Y;Z) is the information bound up in a set of variables beyond that which is present in any subset. I(X;Y;Z) = I(X;Y|Z) – I(X;Y) = I(X;Z|Y) – I(X;Z) = I(Y;Z|X) – I(Y;Z). Negative interaction information (redundancy): X is rain, Y is dark, Z is clouds. Positive interaction information (synergy): X is fuel pump blocked, Y is battery dead, Z is car starts.
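A sketch that computes interaction information from entropies, using I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z); the XOR example is mine, a minimal case of the positive (synergy) kind, like the car example: X and Y are independent fair bits and Z = X xor Y, so X and Y share nothing on their own but everything once Z is known.

```python
import math
from itertools import product

def H(dist):
    """Entropy in bits of a distribution given as a dict {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, keep):
    """Marginalize a dict keyed by tuples down to the coordinate indices in `keep`."""
    out = {}
    for xs, p in joint.items():
        key = tuple(xs[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def interaction_information(pxyz):
    """I(X;Y;Z) = I(X;Y|Z) - I(X;Y), from a dict {(x, y, z): probability}."""
    Hxyz = H(pxyz)
    Hxy, Hxz, Hyz = (H(marginal(pxyz, k)) for k in ((0, 1), (0, 2), (1, 2)))
    Hx, Hy, Hz = (H(marginal(pxyz, (i,))) for i in (0, 1, 2))
    i_xy_given_z = Hxz + Hyz - Hz - Hxyz
    i_xy = Hx + Hy - Hxy
    return i_xy_given_z - i_xy

# X, Y independent fair bits, Z = X xor Y: I(X;Y) = 0 but I(X;Y|Z) = 1.
xor_dist = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}
print(interaction_information(xor_dist))   # +1.0 bit of synergy
```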

Other fun things you should look into if you're interested: writing on dirty paper, wire-tap channels, algorithmic complexity, Chaitin's constant, Goldbach's conjecture and the Riemann hypothesis, portfolio theory.