Hiroki Sayama sayama@binghamton.edu
NECSI Summer School 2008
Week 3: Methods for the Study of Complex Systems
Information Theory
[Figure: plot of self-information I(p) versus probability p]



Four approaches to complexity
- Nonlinear Dynamics: Complexity = no closed-form solution, chaos
- Information: Complexity = length of description, entropy
- Computation: Complexity = computational time/space, algorithmic complexity
- Collective Behavior: Complexity = multi-scale patterns, emergence

Information?
- Matter: known since ancient times
- Energy: known since the 19th century (industrial revolution)
- Information: known since the 20th century (world wars, rise of computers)

An informal definition of information
Aspects of some physical phenomenon that can be used to select a smaller set of options out of the original set of options (things that reduce the number of possibilities)
- An observer or interpreter involved
- A default set of options needed

Quantitative Definition of Information

Quantitative definition of information
- If something is expected to occur almost certainly, its occurrence should have nearly zero information
- If something is expected to occur very rarely, its occurrence should have very large information
- If an event is expected to occur with probability p, the information produced by its occurrence (self-information) is given by I(p) = -log p

Quantitative definition of information
I(p) = -log p
- 2 is often used as the base of the log
- The unit of information is the bit (binary digit)
[Figure: plot of I(p) = -log2 p versus p]
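To make the definition concrete, here is a minimal Python sketch (not from the original slides) that evaluates I(p) = -log2 p for a few probabilities and numerically checks the additivity property discussed on the next slide; the function name self_information is just an illustrative choice.

import math

def self_information(p, base=2):
    """Self-information I(p) = -log(p); in bits when base is 2."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p, base)

# Near-certain events carry almost no information; rare events carry a lot
for p in (0.999, 0.5, 0.01):
    print(f"I({p}) = {self_information(p):.3f} bits")

# Additivity for independent events A and B: I(pA * pB) = I(pA) + I(pB)
pA, pB = 0.5, 0.25
print(self_information(pA * pB), self_information(pA) + self_information(pB))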

Why log? To fulfill the additivity of information
For independent events A and B:
- Self-information of “A happened”: I(pA)
- Self-information of “B happened”: I(pB)
- Self-information of “A and B happened”: I(pA pB) = I(pA) + I(pB)
I(p) = -log p satisfies this additivity

Exercise
You picked up a card from a well-shuffled deck of cards (without jokers):
- How much self-information does the event “the card is a spade” have?
- How much self-information does the event “the card is a king” have?
- How much self-information does the event “the card is the king of spades” have?

Information Entropy

Some terminology
- Event: an individual outcome (or a set of outcomes) to which a probability of occurrence can be assigned
- Sample space: the set of all possible individual events
- Probability space: the combination of a sample space and a probability distribution (i.e., probabilities assigned to individual events)

Probability distribution and expected self-information
Probability distribution in probability space A: pi (i = 1…n, Σi pi = 1)
Expected self-information H(A) when one of the individual events happened:
H(A) = Σi pi I(pi) = -Σi pi log pi
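As a quick illustration (a sketch, not part of the slides), H(A) can be computed directly from a list of probabilities; the deterministic and uniform examples below preview the properties listed on the following slides.

import math

def entropy(probs, base=2):
    """H = -sum(p * log p) over the distribution, in bits when base is 2."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([1.0, 0.0, 0.0]))     # 0.0: state uniquely determined
print(entropy([0.5, 0.25, 0.25]))   # 1.5 bits
print(entropy([1/3, 1/3, 1/3]))     # log2(3), maximal for three events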

What does H(A) mean?
- Average amount of self-information the observer could obtain by one observation
- Average “newsworthiness” the observer should expect for one event
- Ambiguity of knowledge the observer had about the system before observation
- Amount of “ignorance” the observer had about the system before observation

Information Entropy
What does H(A) mean?
- Amount of “ignorance” the observer had about the system before observation
- It quantitatively shows the lack of information (not the presence of information) before observation
This quantity is called the information entropy

Information entropy
Similar to thermodynamic entropy both conceptually and mathematically
- Entropy is zero if the system state is uniquely determined with no fluctuation
- Entropy increases as the randomness increases within the system
- Entropy is maximal if the system is completely random (i.e., if every event is equally likely to occur)

Exercise
Prove the following: entropy is maximal if the system is completely random (i.e., if every event is equally likely to occur)
- Show that f(p1, p2, …, pn) = -Σi=1..n pi log pi (with Σi=1..n pi = 1) takes its maximum when pi = 1/n
- Remove one variable using the constraint, or use the method of Lagrange multipliers

Entropy and complex systems
Entropy shows how much information would be needed to fully specify the system’s state in every single detail
- Ordered -> low information entropy
- Disordered -> high information entropy
- May not be consistent with the usual notion of “complexity”
- Multiscale views are needed to address this issue

Information Entropy and Multiple Probability Spaces

Probability of composite events
Probability of composite event (x, y):
p(x, y) = p(y, x) = p(x | y) p(y) = p(y | x) p(x)
- p(x | y): conditional probability for x to occur given that y has already occurred
- p(x | y) = p(x) if X and Y are independent of each other

Exercise: Bayes’ theorem
Define p(x | y) using p(y | x) and p(x)
Use the following formulas as needed:
- p(x) = Σy p(x, y)
- p(y) = Σx p(x, y)
- p(x, y) = p(y | x) p(x) = p(x | y) p(y)

Product probability space
- Probability space X: {x1, x2}, {p(x1), p(x2)}
- Probability space Y: {y1, y2}, {p(y1), p(y2)}
- Product probability space XY: {(x1, y1), (x1, y2), (x2, y1), (x2, y2)} (composite events), {p(x1, y1), p(x1, y2), p(x2, y1), p(x2, y2)}

Joint entropy
Entropy of the product probability space XY:
H(XY) = -Σx Σy p(x, y) log p(x, y)
- H(XY) = H(YX)
- If X and Y are independent: H(XY) = H(X) + H(Y)
- If Y completely depends on X: H(XY) = H(X) ( >= H(Y) )
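As a quick numeric illustration (a sketch with a made-up joint distribution, not from the slides), the code below builds p(x, y) for two independent variables and confirms that H(XY) = H(X) + H(Y) in that case.

import math

def H(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical independent X and Y: p(x, y) = p(x) * p(y)
pX = [0.5, 0.5]
pY = [0.25, 0.75]
pXY = {(x, y): pX[x] * pY[y] for x in range(2) for y in range(2)}

print(H(list(pXY.values())))   # H(XY)
print(H(pX) + H(pY))           # same value, since X and Y are independent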

Conditional entropy
Expected entropy of Y when a specific event occurred in X:
H(Y | X) = Σx p(x) H(Y | X=x)
         = -Σx p(x) Σy p(y | x) log p(y | x)
         = -Σx Σy p(y, x) log p(y | x)
- If X and Y are independent: H(Y | X) = H(Y)
- If Y completely depends on X: H(Y | X) = 0
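The following sketch (hypothetical joint distribution, not from the slides) computes H(Y | X) from a joint table and numerically checks the identity H(Y | X) = H(YX) - H(X) that the next exercise asks you to prove.

import math

def H(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) in which Y partially depends on X
pXY = {('x1', 'y1'): 0.4, ('x1', 'y2'): 0.1,
       ('x2', 'y1'): 0.1, ('x2', 'y2'): 0.4}

# Marginal p(x) = sum over y of p(x, y)
pX = {}
for (x, y), p in pXY.items():
    pX[x] = pX.get(x, 0.0) + p

# H(Y|X) = -sum_x sum_y p(x, y) log p(y|x), with p(y|x) = p(x, y) / p(x)
H_Y_given_X = -sum(p * math.log2(p / pX[x]) for (x, y), p in pXY.items() if p > 0)

print(H_Y_given_X)
print(H(list(pXY.values())) - H(list(pX.values())))   # H(YX) - H(X), same value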

Exercise Prove the following: H(Y | X) = H(YX) - H(X) Hint: Use Bayes’ theorem

Mutual Information

Mutual information
Conditional entropy measures how much ambiguity still remains about Y after observing an event on X
The reduction of ambiguity about Y by one observation on X can be written as:
I(Y; X) = H(Y) - H(Y | X)   (mutual information)

Symmetry of mutual information
I(Y; X) = H(Y) - H(Y | X)
        = H(Y) + H(X) - H(YX)
        = H(X) + H(Y) - H(XY)
        = I(X; Y)
Mutual information is symmetric with respect to X and Y
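A small sketch (again with a hypothetical joint distribution, not from the slides) that computes I(X; Y) = H(X) + H(Y) - H(XY) and checks the symmetry by swapping the roles of X and Y:

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal(pXY, axis):
    """Marginal distribution over the given axis (0 for the first variable, 1 for the second)."""
    m = {}
    for key, p in pXY.items():
        m[key[axis]] = m.get(key[axis], 0.0) + p
    return m

def mutual_information(pXY):
    pX = marginal(pXY, 0)
    pY = marginal(pXY, 1)
    return H(list(pX.values())) + H(list(pY.values())) - H(list(pXY.values()))

# Hypothetical joint distribution: X and Y are partially coupled
pXY = {('x1', 'y1'): 0.4, ('x1', 'y2'): 0.1,
       ('x2', 'y1'): 0.1, ('x2', 'y2'): 0.4}
pYX = {(y, x): p for (x, y), p in pXY.items()}

print(mutual_information(pXY), mutual_information(pYX))   # identical: I(X;Y) = I(Y;X)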

Exercise
Prove the following:
- If X and Y are independent: I(X; Y) = 0
- If Y completely depends on X: I(X; Y) = H(Y)

Exercise
Measure the mutual information between the two systems shown below:
[Figure: joint observations of System A, with states 1, 2, 3, and System B, with states a, b, c]

Use of mutual information
- Mutual information can be used to measure how much interaction exists between two subsystems in a complex system
- Correlation only works for quantitative measures and detects only linear relationships
- Mutual information works for qualitative (discrete, symbolic) measures and for nonlinear relationships as well (see the sketch below)
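To illustrate the contrast with correlation (a sketch with made-up data, not from the slides), take X uniform on {-2, -1, 0, 1, 2} and Y = X², a deterministic but nonlinear relationship: the linear correlation vanishes while the mutual information equals H(Y) > 0.

import math
from collections import Counter

def H(counts):
    """Entropy in bits of an empirical distribution given as counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

# X uniform on {-2, -1, 0, 1, 2}, Y = X^2: nonlinear but deterministic coupling
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]

# Covariance (and hence Pearson correlation) is zero by symmetry
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
print("covariance:", cov)   # 0.0 -> correlation coefficient is 0

# Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y) from empirical counts
Hx = H(list(Counter(xs).values()))
Hy = H(list(Counter(ys).values()))
Hxy = H(list(Counter(zip(xs, ys)).values()))
print("I(X;Y) =", Hx + Hy - Hxy, "bits")   # > 0: the coupling is detected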

Information Source

Information source
A sequence of values of a random variable that obeys some probabilistic rules
- The sequence may be over time or space
- Values (events) may or may not be independent of each other
Examples: repeated coin tosses, sound, visual images

Memoryless and Markov information sources
Memoryless information source, p(0) = p(1) = 1/2:
01010010001011011001101000110
Markov information source, p(1|0) = p(0|1) = 1/4:
01000000111111001110001111111
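A minimal sketch (not from the slides) that generates sample sequences of both kinds, assuming the probabilities shown above; the helper name markov_sequence is just illustrative.

import random

random.seed(42)  # arbitrary seed, for reproducibility

# Memoryless source: each symbol is drawn independently with p(0) = p(1) = 1/2
memoryless = "".join(random.choice("01") for _ in range(30))

# Markov source: the next symbol depends on the current one, p(1|0) = p(0|1) = 1/4
def markov_sequence(length, p_flip=0.25):
    seq = [random.choice("01")]
    for _ in range(length - 1):
        prev = seq[-1]
        flip = random.random() < p_flip
        seq.append(("1" if prev == "0" else "0") if flip else prev)
    return "".join(seq)

print(memoryless)
print(markov_sequence(30))   # long runs of identical symbols are typical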

Markov information source
An information source whose probability distribution at time t depends only on its immediate past value Xt-1 (or past n values Xt-1, Xt-2, ..., Xt-n)
- Cases with n > 1 can be converted into the n = 1 form by defining composite events
- Probabilistic rules are given as a set of conditional probabilities, which can be written in the form of a transition probability matrix (TPM)

State-transition diagram
Markov information source, p(1|0) = p(0|1) = 1/4:
01000000111111001110001111111
[Figure: state-transition diagram with states 0 and 1; each state returns to itself with probability 3/4 and switches to the other state with probability 1/4]

Matrix representation
Markov information source, p(1|0) = p(0|1) = 1/4:
01000000111111001110001111111
[ p0 ]           [ 3/4  1/4 ]   [ p0 ]
[ p1 ]         = [ 1/4  3/4 ] x [ p1 ]
(at time t)      (TPM)          (at time t-1)
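The same update rule can be written directly in code; this sketch (not from the slides) stores the TPM in column-stochastic form, where element [y][x] is p(y | x), and iterates the probability vector for a few steps.

# Column-stochastic TPM for p(1|0) = p(0|1) = 1/4
TPM = [[0.75, 0.25],
       [0.25, 0.75]]

def step(tpm, p):
    """One update: p(t) = TPM * p(t-1)."""
    n = len(p)
    return [sum(tpm[y][x] * p[x] for x in range(n)) for y in range(n)]

p = [1.0, 0.0]   # start with certainty of being in state 0
for t in range(5):
    print(t, p)
    p = step(TPM, p)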

Exercise
abcaccaabccccaaabc aaccacaccaaaaabcc
Consider the above sequence as a Markov information source and create its state-transition diagram and matrix representation

Review: convenient properties of a transition probability matrix
- The product of two TPMs is also a TPM
- All TPMs have eigenvalue 1
- |λ| ≤ 1 for all eigenvalues of any TPM
- If the transition network is strongly connected, the TPM has one and only one eigenvalue 1 (no degeneracy)

Review: TPM and asymptotic probability distribution
- |λ| ≤ 1 for all eigenvalues of any TPM
- If the transition network is strongly connected, the TPM has one and only one eigenvalue 1 (no degeneracy)
→ This eigenvalue is the unique dominant eigenvalue, and the probability vector will eventually converge to its corresponding eigenvector
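As a sketch of this convergence (not from the slides, and using a hypothetical TPM rather than the one in the exercise below), repeatedly applying the TPM, i.e., power iteration, drives any initial probability vector toward the eigenvector associated with eigenvalue 1.

def step(tpm, p):
    """One update: p(t) = TPM * p(t-1)."""
    n = len(p)
    return [sum(tpm[y][x] * p[x] for x in range(n)) for y in range(n)]

# Hypothetical column-stochastic TPM (not the one from the slides)
TPM = [[0.9, 0.3],
       [0.1, 0.7]]

p = [1.0, 0.0]
for _ in range(200):       # power iteration toward the dominant eigenvector
    p = step(TPM, p)

print(p)   # approx. (0.75, 0.25): the eigenvalue-1 eigenvector, normalized to sum to 1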

Exercise
Calculate the asymptotic probability distribution of the following Markov information source, p(1|0) = p(0|1) = 1/4:
01000000111111001110001111111
[ p0 ]           [ 3/4  1/4 ]   [ p0 ]
[ p1 ]         = [ 1/4  3/4 ] x [ p1 ]
(at time t)      (TPM)          (at time t-1)

Calculating Entropy of Markov Information Source

Review: information entropy
Expected information H(A) when one of the individual events happened:
H(A) = Σi pi I(pi) = -Σi pi log pi
This applies only to memoryless information sources, in which events are independent of each other

Generalizing information entropy
For other types of information sources, where events are not independent, information entropy is defined as:
H{X} = limk→∞ H(Xk+1 | X1 X2 … Xk)
Xk: the k-th value of random variable X

Calculating information entropy of a Markov information source (1)
H{X} = limk→∞ H(Xk+1 | X1 X2 … Xk)
This is the expected entropy of the (k+1)-th value given a specific history of the past k values
For a Markov source, all that matters is the last value of the history, so let's focus on Xk

Calculating information entropy of a Markov information source (2)
p(Xk=x): probability for the last (k-th) value to be x
H(Xk+1 | X1 X2 … Xk) = Σx p(Xk=x) H(Xk+1 | Xk=x)
                     = -Σx p(Xk=x) Σy ayx log ayx
                     = Σx p(Xk=x) h(ax)
ayx: element in the y-th row and x-th column of the TPM
h(ax): entropy of the x-th column vector of the TPM

Calculating information entropy of a Markov information source (3)
H(Xk+1 | X1 X2 … Xk) = Σx p(Xk=x) h(ax)
If the information source has only one asymptotic probability distribution q:
limk→∞ p(Xk=x) = qx (the x-th element of q)
H{X} = limk→∞ H(Xk+1 | X1 X2 … Xk) = h·q
h: a row vector whose x-th element is h(ax)

Calculating information entropy of a Markov information source (4)
H{X} = limk→∞ H(Xk+1 | X1 X2 … Xk) = h·q
If the information source has only one asymptotic probability distribution, its information entropy is given by the average of the entropies of its TPM's column vectors, weighted by the asymptotic probability distribution
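Putting the pieces together, here is a sketch (not from the slides, with a hypothetical column-stochastic TPM so as not to give away the exercise that follows) that computes H{X} = h·q: the entropy of each TPM column, the asymptotic distribution by power iteration, and their weighted average.

import math

def H(probs):
    """Entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_rate(tpm, iterations=1000):
    """H{X} = h . q for a column-stochastic TPM with a unique asymptotic distribution."""
    n = len(tpm)
    # h[x]: entropy of the x-th column of the TPM
    h = [H([tpm[y][x] for y in range(n)]) for x in range(n)]
    # q: asymptotic distribution, obtained by power iteration
    q = [1.0 / n] * n
    for _ in range(iterations):
        q = [sum(tpm[y][x] * q[x] for x in range(n)) for y in range(n)]
    return sum(hx * qx for hx, qx in zip(h, q))

# Hypothetical column-stochastic TPM (element [y][x] = p(y | x))
TPM = [[0.9, 0.3],
       [0.1, 0.7]]

print(entropy_rate(TPM))   # weighted average of the column entropies, in bits per symbol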

Exercise
Calculate the information entropy of the following Markov information sources we discussed earlier:
01000000111111001110001111111
abcaccaabccccaaabc aaccacaccaaaaabcc

Summary
- Complexity of a system may be characterized using information: length of description, entropy (ambiguity of knowledge)
- Mutual information quantifies the coupling between two components within a system
- Entropy may be measured for Markov information sources as well