Today: Entropy and Information Theory. Claude Shannon, Ph.D., 1916-2001.

Presentation transcript:

Today: Entropy and Information Theory

Claude Shannon, Ph.D. (1916-2001)

Entropy

A measure of the disorder in a system

Entropy The (average) number of yes/no questions needed to completely specify the state of a system

What if there were two coins?

2 states: 1 question. 4 states: 2 questions. 8 states: 3 questions. 16 states: 4 questions. In general: number of states = 2^(number of yes-no questions).

number of states = 2^(number of yes-no questions), so log2(number of states) = number of yes-no questions.

H = log2(n), where H is entropy, the number of yes-no questions required to specify the state of the system, and n is the number of states of the system, assumed (for now) to be equally likely.

Consider Dice

The Six-Sided Die: H = log2(6) ≈ 2.58 bits

The Four-Sided Die: H = log2(4) = 2 bits

The Twenty-Sided Die: H = log2(20) ≈ 4.32 bits

What about all three dice? H = log2(4 × 6 × 20)

What about all three dice? H = log2(4) + log2(6) + log2(20)

What about all three dice? H = 2 + 2.58 + 4.32 ≈ 8.91 bits

What about all three dice? Entropy, from independent elements of a system, adds
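
As a quick sanity check on these numbers, here is a minimal sketch in Python (standard library only; the snippet is illustrative and not part of the original slides):

```python
from math import log2

# Entropy of a single fair die with n equally likely faces: H = log2(n) bits.
for faces in (4, 6, 20):
    print(f"{faces}-sided die: H = {log2(faces):.3f} bits")

# For the three independent dice together, the joint entropy is log2 of the
# number of joint outcomes, which equals the sum of the individual entropies.
joint = log2(4 * 6 * 20)
summed = log2(4) + log2(6) + log2(20)
print(f"joint H = {joint:.3f} bits, sum = {summed:.3f} bits")  # both ~= 8.907
```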

Let's rewrite this a bit... Trivial Fact 1: log2(x) = -log2(1/x)

Trivial Fact 1: log2(x) = -log2(1/x). Trivial Fact 2: if there are n equally likely possibilities, p = 1/n.

Trivial Fact 2: if there are n equally likely possibilities, p = 1/n. Combining the two facts: H = log2(n) = -log2(1/n) = -log2(p).

What if the n states are not equally probable? Maybe we should use the expected value of the entropies, a weighted average by probability: H = -Σ_i p_i log2(p_i).
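
In code, that weighted average is just the familiar Shannon entropy. A minimal sketch (Python; the function name entropy is my own choice, not from the slides):

```python
from math import log2

def entropy(p):
    """Shannon entropy H = -sum_i p_i * log2(p_i), in bits.
    Terms with p_i = 0 are skipped, since p * log2(p) -> 0 as p -> 0."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits, matches log2(4)
print(entropy([1.0, 0.0]))                # 0.0 bits: a definite state
```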

Let's do a simple example: n = 2. How does H change as we vary p1 and p2?

n = 2, p1 + p2 = 1

How about n = 3? Now p1 + p2 + p3 = 1.
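
A small sweep (again a self-contained Python sketch) shows the shape: H is zero when one outcome is certain and is maximal at the uniform distribution.

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

# n = 2: vary p1, with p2 = 1 - p1.
for p1 in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p1 = {p1:.2f}  H = {entropy([p1, 1 - p1]):.3f} bits")
# H peaks at 1 bit when p1 = p2 = 1/2 and falls to 0 at p1 = 0 or 1.
# For n = 3 the maximum is log2(3) ~ 1.585 bits, at p1 = p2 = p3 = 1/3.
```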

The bottom-line intuitions for entropy: Entropy is a statistic for describing a probability distribution. Probability distributions which are flat, broad, sparse, etc. have HIGH entropy. Probability distributions which are peaked, sharp, narrow, compact, etc. have LOW entropy. Entropy adds for independent elements of a system, so entropy grows with the dimensionality of the probability distribution. Entropy is zero if and only if the system is in a definite state, i.e. p = 1 somewhere and 0 everywhere else.

Pop Quiz:

Entropy The (average) number of yes/no questions needed to completely specify the state of a system

At 11:16 am (Pacific) on June 29th of the year 2001, there were approximately 816,119 words in the English language. H(English) = log2(816,119) ≈ 19.6 bits. Twenty Questions: 2^20 = 1,048,576. What's a winning 20 Questions strategy?
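
One answer (not spelled out on the slide, so take this as my own illustration): since each yes/no answer is worth at most one bit, a winning strategy asks questions that split the remaining candidates roughly in half, e.g. binary search over an alphabetized word list. A sketch, using a made-up miniature dictionary:

```python
def play(words, answers_yes):
    """Guess the secret word by asking 'Does the word come at or before
    words[mid] alphabetically?' Each answer halves the candidate range,
    extracting roughly one bit per question."""
    lo, hi = 0, len(words) - 1
    asked = 0
    while lo < hi:
        mid = (lo + hi) // 2
        asked += 1
        if answers_yes(words[mid]):
            hi = mid
        else:
            lo = mid + 1
    return words[lo], asked

words = sorted(["bit", "channel", "code", "entropy", "information", "noise"])
secret = "entropy"
guess, n = play(words, lambda w: secret <= w)
print(guess, n)  # 'entropy', found in ceil(log2(6)) = 3 questions
```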

So, what is information? It’s a change in what you don’t know. It’s a change in the entropy.

Information as a measure of correlation: two coin flips, x and y.


Two independent coin flips: P(Y) puts probability 1/2 on heads and 1/2 on tails, and P(Y | x = heads) is the same, so H(Y) = 1 and H(Y | x = heads) = 1. I(X;Y) = H(Y) - H(Y|X) = 0 bits.

Information as a measure of correlation: now x and y are correlated.


Two correlated coin flips: P(Y) is still 1/2 heads, 1/2 tails, so H(Y) = 1, but once we know x = heads, Y is (almost) determined: H(Y | x = heads) ~ 0. I(X;Y) = H(Y) - H(Y|X) ~ 1 bit.
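
Both coin examples can be checked numerically. Here is a sketch (Python; the joint distributions below are idealized versions of what the slides describe, with the correlated case taken as perfectly correlated):

```python
from math import log2

def H(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), which equals H(Y) - H(Y|X),
    for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    pxy = [p for row in joint for p in row]
    return H(px) + H(py) - H(pxy)

# Independent coins: knowing x tells you nothing about y.
independent = [[0.25, 0.25],
               [0.25, 0.25]]
# Perfectly correlated coins: y always matches x.
correlated = [[0.5, 0.0],
              [0.0, 0.5]]
print(mutual_information(independent))  # 0.0 bits
print(mutual_information(correlated))   # 1.0 bit
```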

Information Theory in Neuroscience: a stimulus x and a neural response y.

The Critical Observation: Information is mutual. I(X;Y) = I(Y;X), i.e. H(Y) - H(Y|X) = H(X) - H(X|Y).

The Critical Observation: What a spike tells the brain about the stimulus is the same as what our stimulus choice tells us about the likelihood of a spike. I(Stimulus; Spike) = I(Spike; Stimulus)

The Critical Observation: What our stimulus choice tells us about the likelihood of a spike (stimulus → response) is something we can measure...

How to use Information Theory: Show your system stimuli. Measure neural responses. Estimate P(neural response | stimulus presented). From that, estimate P(neural response). Compute H(neural response) and H(neural response | stimulus presented). Calculate I(response; stimulus).
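
A minimal plug-in version of that recipe (Python sketch; the count table is invented for illustration, stimuli are assumed to be presented equally often, and real analyses need bias corrections for limited data, as the next slide warns):

```python
from math import log2
from collections import Counter

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def information(counts_per_stimulus):
    """Plug-in estimate of I(response; stimulus) = H(response) - H(response | stimulus)
    from a list of Counters: counts_per_stimulus[s][r] = number of trials on which
    stimulus s evoked response r. Assumes each stimulus was shown equally often."""
    n_stim = len(counts_per_stimulus)
    h_cond = 0.0        # average over stimuli of H(response | stimulus)
    pooled = Counter()  # pooled response counts give P(response)
    for stim_counts in counts_per_stimulus:
        total = sum(stim_counts.values())
        h_cond += H([c / total for c in stim_counts.values()]) / n_stim
        pooled.update(stim_counts)
    grand = sum(pooled.values())
    h_resp = H([c / grand for c in pooled.values()])
    return h_resp - h_cond

# Hypothetical experiment: two stimuli, responses are spike counts 0, 1, or 2.
counts = [Counter({0: 80, 1: 15, 2: 5}),   # stimulus A, 100 trials
          Counter({0: 10, 1: 30, 2: 60})]  # stimulus B, 100 trials
print(f"I(response; stimulus) ~ {information(counts):.2f} bits")  # ~0.44 bits
```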

How to screw it up: Choose stimuli which are not representative. Measure the “wrong” aspect of the response. Don't take enough data to estimate P( ) well. Use a crappy method of computing H( ). Calculate I( ) and report it without comparing it to anything...

Here's an example of Information Theory applied appropriately: "Temporal Coding of Visual Information in the Thalamus," Pamela Reinagel and R. Clay Reid, J. Neurosci. 20(14), 2000.

LGN responses are very reliable. Is there information in the temporal pattern of spikes?

Patterns of Spikes in the LGN: bin the spike train into single time bins and read off 1-bit words (0 = no spike, 1 = spike).

Patterns of Spikes in the LGN: 2-bit words (00, 10, 01, 11).

Patterns of Spikes in the LGN: 3-bit words (e.g. 000, 101, 011, 100).

Patterns of Spikes in the LGN: longer words spanning many time bins.

P( spike pattern)

P( spike pattern | stimulus )
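
A sketch of the word-counting idea behind these estimates (Python; the spike train below is a toy example, and a real analysis such as Reinagel and Reid's must correct for the bias of plug-in entropy estimates on finite data):

```python
from math import log2
from collections import Counter

def word_entropy(spike_train, word_len):
    """Chop a binary spike train (one 0/1 per time bin) into non-overlapping
    words of word_len bins and return the plug-in entropy of the word
    distribution, in bits. Applying the same function to stimulus-conditioned
    segments would give H(spike pattern | stimulus)."""
    words = [tuple(spike_train[i:i + word_len])
             for i in range(0, len(spike_train) - word_len + 1, word_len)]
    counts = Counter(words)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Toy spike train: 1 = spike in that time bin, 0 = no spike.
train = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0] * 10

for L in (1, 2, 3):
    print(f"{L}-bin words: H = {word_entropy(train, L):.3f} bits")
```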

There is some extra Information in Temporal Patterns of spikes.

Claude Shannon, Ph.D. (1916-2001)

Prof. Tom Cover EE376A & B