Hiroki Sayama (sayama@binghamton.edu)
NECSI Summer School 2008
Week 3: Methods for the Study of Complex Systems
Information Theory
[Figure: plot of self-information I(p) against probability p]
Four approaches to complexity
- Nonlinear Dynamics: Complexity = no closed-form solution; chaos
- Information: Complexity = length of description; entropy
- Computation: Complexity = computational time/space; algorithmic complexity
- Collective Behavior: Complexity = multi-scale patterns; emergence
Information?
- Matter: known since ancient times
- Energy: known since the 19th century (industrial revolution)
- Information: known since the 20th century (world wars, rise of computers)
An informal definition of information
Aspects of some physical phenomenon that can be used to select a smaller set of options out of the original set of options (things that reduce the number of possibilities)
- An observer or interpreter is involved
- A default set of options is needed
Quantitative Definition of Information
Quantitative definition of information
- If something is expected to occur almost certainly, its occurrence should carry nearly zero information
- If something is expected to occur very rarely, its occurrence should carry very large information
- If an event is expected to occur with probability p, the information produced by its occurrence (self-information) is given by I(p) = -log p
Quantitative definition of information
I(p) = -log p
- 2 is often used as the base of the log
- The unit of information is then the bit (binary digit)
[Figure: I(p) plotted against p]
Why log? To fulfill the additivity of information. For independent events A and B:
- Self-information of "A happened": I(pA)
- Self-information of "B happened": I(pB)
- Self-information of "A and B happened": I(pA pB) = I(pA) + I(pB)
I(p) = -log p satisfies this additivity (see the sketch below).
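As a quick numerical check, here is a minimal Python sketch of self-information and its additivity; the probability values are illustrative only.

```python
# A minimal sketch of self-information, assuming base-2 logs (bits).
from math import log2

def self_information(p: float) -> float:
    """Self-information I(p) = -log2(p) of an event with probability p."""
    return -log2(p)

# Additivity check for independent events A and B (illustrative values):
pA, pB = 0.5, 0.25
assert abs(self_information(pA * pB)
           - (self_information(pA) + self_information(pB))) < 1e-12

print(self_information(0.5))   # 1.0 bit
print(self_information(0.25))  # 2.0 bits
```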
Exercise: You picked a card from a well-shuffled deck of cards (without jokers).
- How much self-information does the event "the card is a spade" have?
- How much self-information does the event "the card is a king" have?
- How much self-information does the event "the card is the king of spades" have?
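For reference, one way to check the answers numerically, assuming a standard 52-card deck and the base-2 definition above:

```python
# A quick numerical check of the card exercise (standard 52-card deck).
from math import log2

I = lambda p: -log2(p)

print(I(13/52))  # "spade":          2.0 bits
print(I(4/52))   # "king":           ~3.70 bits
print(I(1/52))   # "king of spades": ~5.70 bits = 2.0 + ~3.70 (suit and rank are independent)
```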
Information Entropy
Some terminology
- Event: an individual outcome (or a set of outcomes) to which a probability of occurrence can be assigned
- Sample space: the set of all possible individual events
- Probability space: a sample space together with a probability distribution (i.e., probabilities assigned to individual events)
Probability distribution and expected self-information
Probability distribution in probability space A: p_i (i = 1…n, Σ_i p_i = 1)
Expected self-information H(A) when one of the individual events happens:
H(A) = Σ_i p_i I(p_i) = -Σ_i p_i log p_i
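A minimal Python sketch of this formula; the example distributions are illustrative:

```python
# A minimal sketch of information entropy H(A) = -Σ_i p_i log2 p_i
# for a discrete probability distribution given as a list of p_i.
from math import log2

def entropy(probs):
    """Expected self-information of one observation, in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit (fair coin)
print(entropy([1.0]))        # 0.0 (no ambiguity)
print(entropy([0.25] * 4))   # 2.0 bits (uniform over 4 events)
```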
What does H(A) mean?
- Average amount of self-information the observer could obtain from one observation
- Average “newsworthiness” the observer should expect of one event
- Ambiguity of the knowledge the observer had about the system before observation
- Amount of “ignorance” the observer had about the system before observation
Information entropy
What does H(A) mean? The amount of “ignorance” the observer had about the system before observation. It quantitatively shows the lack of information (not the presence of information) before observation. This quantity is called information entropy.
Information entropy
Similar to thermodynamic entropy, both conceptually and mathematically:
- Entropy is zero if the system state is uniquely determined, with no fluctuation
- Entropy increases as randomness increases within the system
- Entropy is maximal if the system is completely random (i.e., if every event is equally likely to occur)
Exercise: Prove that entropy is maximal if the system is completely random (i.e., if every event is equally likely to occur). Show that f(p_1, p_2, …, p_n) = -Σ_{i=1}^{n} p_i log p_i (with Σ_{i=1}^{n} p_i = 1) takes its maximum when p_i = 1/n.
- Remove one variable using the constraint, or
- use the method of Lagrange multipliers (sketched below)
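One way to carry out the Lagrange-multiplier route, sketched in LaTeX (concavity of the objective guarantees this stationary point is the maximum):

```latex
% Maximize f subject to the normalization constraint:
\mathcal{L} = -\sum_{i=1}^{n} p_i \log p_i
              + \lambda \left( \sum_{i=1}^{n} p_i - 1 \right)
% Setting each partial derivative to zero:
\frac{\partial \mathcal{L}}{\partial p_i} = -\log p_i - 1 + \lambda = 0
\;\Rightarrow\; p_i = e^{\lambda - 1} \text{ (the same value for every } i\text{)}
% The constraint then forces:
\sum_{i=1}^{n} p_i = n\, e^{\lambda - 1} = 1
\;\Rightarrow\; p_i = \frac{1}{n}
```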
Entropy and complex systems
Entropy shows how much information would be needed to fully specify the system’s state in every single detail:
- Ordered → low information entropy
- Disordered → high information entropy
This may not be consistent with the usual notion of “complexity”; multiscale views are needed to address this issue.
Information Entropy and Multiple Probability Spaces
Probability of composite events
Probability of the composite event (x, y):
p(x, y) = p(y, x) = p(x | y) p(y) = p(y | x) p(x)
p(x | y): conditional probability of x occurring when y has already occurred
p(x | y) = p(x) if X and Y are independent of each other
Exercise: Bayes’ theorem
Express p(x | y) using p(y | x) and p(x). Use the following formulas as needed:
p(x) = Σ_y p(x, y)
p(y) = Σ_x p(x, y)
p(x, y) = p(y | x) p(x) = p(x | y) p(y)
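For reference, the factorizations above combine into Bayes’ theorem as follows (a sketch of the result the exercise asks for):

```latex
% Both factorizations of p(x, y) equal each other, so:
p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}
            = \frac{p(y \mid x)\, p(x)}{\sum_{x'} p(y \mid x')\, p(x')}
```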
Product probability space
Probability space X: {x1, x2}, {p(x1), p(x2)}
Probability space Y: {y1, y2}, {p(y1), p(y2)}
Product probability space XY: {(x1, y1), (x1, y2), (x2, y1), (x2, y2)}, {p(x1, y1), p(x1, y2), p(x2, y1), p(x2, y2)}; the pairs are the composite events
Joint entropy
Entropy of the product probability space XY:
H(XY) = -Σ_x Σ_y p(x, y) log p(x, y)
H(XY) = H(YX)
If X and Y are independent: H(XY) = H(X) + H(Y)
If Y completely depends on X: H(XY) = H(X) (≥ H(Y))
Conditional entropy
Expected entropy of Y when a specific event has occurred in X:
H(Y | X) = Σ_x p(x) H(Y | X=x)
         = -Σ_x p(x) Σ_y p(y | x) log p(y | x)
         = -Σ_x Σ_y p(y, x) log p(y | x)
If X and Y are independent: H(Y | X) = H(Y)
If Y completely depends on X: H(Y | X) = 0
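A minimal Python sketch of joint and conditional entropy computed from a joint distribution; the dict-based representation and the example table are my own illustrative choices:

```python
# Joint and conditional entropy from a joint distribution p(x, y),
# given as a dict {(x, y): probability}.
from math import log2
from collections import defaultdict

def joint_entropy(pxy):
    """H(XY) = -Σ_x Σ_y p(x, y) log2 p(x, y)."""
    return -sum(p * log2(p) for p in pxy.values() if p > 0)

def conditional_entropy(pxy):
    """H(Y | X) = -Σ_x Σ_y p(x, y) log2 p(y | x)."""
    px = defaultdict(float)
    for (x, _), p in pxy.items():
        px[x] += p                       # marginalize out y
    return -sum(p * log2(p / px[x]) for (x, _), p in pxy.items() if p > 0)

# Illustrative joint distribution (Y copies X): H(Y | X) should be 0.
pxy = {(0, 0): 0.5, (1, 1): 0.5}
print(joint_entropy(pxy), conditional_entropy(pxy))  # 1.0 0.0
```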
Exercise: Prove that H(Y | X) = H(YX) - H(X). Hint: use Bayes’ theorem (a sketch follows below).
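One possible proof sketch, expanding p(y | x) = p(y, x) / p(x) inside the sum:

```latex
H(Y \mid X) = -\sum_{x}\sum_{y} p(y, x) \log \frac{p(y, x)}{p(x)}
            = -\sum_{x}\sum_{y} p(y, x) \log p(y, x)
              + \sum_{x} p(x) \log p(x)
            = H(YX) - H(X)
```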
Mutual Information
Mutual information
Conditional entropy measures how much ambiguity still remains about Y after observing an event on X. The reduction of ambiguity about Y by one observation on X can be written as:
I(Y; X) = H(Y) - H(Y | X)
This quantity is called mutual information.
Symmetry of mutual information
I(Y; X) = H(Y) - H(Y | X)
        = H(Y) + H(X) - H(YX)
        = H(X) + H(Y) - H(XY)
        = I(X; Y)
Mutual information is symmetric in X and Y.
Exercise: Prove the following:
- If X and Y are independent: I(X; Y) = 0
- If Y completely depends on X: I(X; Y) = H(Y)
Exercise: Measure the mutual information between the two systems shown in the figure.
[Figure: System A with states 1, 2, 3 coupled to System B with states a, b, c]
Use of mutual information
Mutual information can be used to measure how much interaction exists between two subsystems of a complex system:
- Correlation works only for quantitative measures and detects only linear relationships
- Mutual information also works for qualitative (discrete, symbolic) measures and nonlinear relationships (see the sketch below)
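A minimal Python sketch of mutual information estimated from paired symbolic observations; the data and variable names are hypothetical:

```python
# Mutual information I(X; Y) = H(X) + H(Y) - H(XY), estimated from
# paired observations by plugging in empirical frequencies.
from math import log2
from collections import Counter

def mutual_information(xs, ys):
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    H = lambda counts: -sum(c/n * log2(c/n) for c in counts.values())
    return H(px) + H(py) - H(pxy)

# Works on qualitative data where correlation is undefined:
xs = ['a', 'a', 'b', 'b', 'a', 'b']
ys = ['hot', 'hot', 'cold', 'cold', 'hot', 'cold']
print(mutual_information(xs, ys))  # 1.0 bit: Y is fully determined by X
```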
Information Source
Information source
A sequence of values of a random variable that obeys some probabilistic rules:
- The sequence may be over time or space
- The values (events) may or may not be independent of each other
Examples: repeated coin tosses, sound, visual images
Memoryless and Markov information sources
Memoryless information source, p(0) = p(1) = 1/2:
01010010001011011001101000110
Markov information source, p(1|0) = p(0|1) = 1/4:
01000000111111001110001111111
Markov information source
An information source whose probability distribution at time t depends only on its immediate past value X_{t-1} (or its past n values X_{t-1}, X_{t-2}, ..., X_{t-n}):
- Cases with n > 1 can be converted into the n = 1 form by defining composite events
- The probabilistic rules are given as a set of conditional probabilities, which can be written in the form of a transition probability matrix (TPM)
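To make the contrast with the previous slide concrete, here is a sketch that samples from a memoryless and a Markov binary source, using the slide’s p(1|0) = p(0|1) = 1/4:

```python
# Sampling from a memoryless vs. a Markov binary information source.
import random

def memoryless(n, p1=0.5):
    """Each symbol is drawn independently with p(1) = p1."""
    return ''.join('1' if random.random() < p1 else '0' for _ in range(n))

def markov(n, p_flip=0.25):
    """Each symbol depends on the previous one: flip with probability p_flip."""
    seq, state = [], random.choice('01')
    for _ in range(n):
        seq.append(state)
        if random.random() < p_flip:
            state = '1' if state == '0' else '0'
    return ''.join(seq)

print(memoryless(30))  # e.g. 01010010001011011001101000110
print(markov(30))      # e.g. 01000000111111001110001111111 (longer runs)
```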
State-transition diagram
Markov information source, p(1|0) = p(0|1) = 1/4:
01000000111111001110001111111
[Figure: state-transition diagram with states 0 and 1; each state switches to the other with probability 1/4 and stays the same with probability 3/4]
Matrix representation
Markov information source, p(1|0) = p(0|1) = 1/4:
01000000111111001110001111111
| p0 |     | 3/4  1/4 | | p0 |
| p1 |  =  | 1/4  3/4 | | p1 |
at time t      TPM      at time t-1
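A sketch of iterating this matrix equation with numpy; starting the vector at state 0 is an arbitrary choice:

```python
# One step of the matrix representation: p_t = TPM · p_{t-1}.
import numpy as np

T = np.array([[3/4, 1/4],    # column x holds p(next state | current state x)
              [1/4, 3/4]])
p = np.array([1.0, 0.0])     # start in state 0 with certainty

for _ in range(5):
    p = T @ p                # probability vector at time t
    print(p)
# p converges toward the asymptotic distribution (0.5, 0.5)
```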
Exercise: Consider the following sequence as a Markov information source and create its state-transition diagram and matrix representation:
abcaccaabccccaaabc
aaccacaccaaaaabcc
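One way to start the exercise is to estimate the conditional probabilities by counting transitions; treating the two lines as a single concatenated sequence is an assumption on my part:

```python
# Estimating a TPM from a symbol sequence by counting transitions.
from collections import Counter

seq = 'abcaccaabccccaaabc' + 'aaccacaccaaaaabcc'  # assumption: one sequence
pairs = Counter(zip(seq, seq[1:]))   # (current, next) transition counts
totals = Counter(seq[:-1])           # how often each state is transitioned out of

for x in 'abc':
    for y in 'abc':
        print(f'p({y}|{x}) = {pairs[(x, y)] / totals[x]:.2f}', end='  ')
    print()
```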
Review: Convenient properties of transition probability matrices
- The product of two TPMs is also a TPM
- Every TPM has eigenvalue 1
- |λ| ≤ 1 for all eigenvalues of any TPM
- If the transition network is strongly connected, the TPM has one and only one eigenvalue 1 (no degeneracy)
Review: TPM and asymptotic probability distribution
|λ| ≤ 1 for all eigenvalues of any TPM. If the transition network is strongly connected, the TPM has one and only one eigenvalue 1 (no degeneracy) → this eigenvalue is the unique dominant eigenvalue, and the probability vector will eventually converge to its corresponding eigenvector.
Exercise: Calculate the asymptotic probability distribution of the following Markov information source, p(1|0) = p(0|1) = 1/4:
01000000111111001110001111111
| p0 |     | 3/4  1/4 | | p0 |
| p1 |  =  | 1/4  3/4 | | p1 |
at time t      TPM      at time t-1
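A numpy sketch of the eigenvector approach from the review slides; for this symmetric TPM the result is the uniform distribution:

```python
# The asymptotic distribution is the eigenvector of the TPM for
# eigenvalue 1, normalized so its elements sum to 1.
import numpy as np

T = np.array([[3/4, 1/4],
              [1/4, 3/4]])
vals, vecs = np.linalg.eig(T)
v = vecs[:, np.argmin(np.abs(vals - 1))].real   # eigenvector for eigenvalue 1
print(v / v.sum())                              # [0.5 0.5]
```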
Calculating Entropy of Markov Information Source
Review: Information entropy
Expected information H(A) when one of the individual events happens:
H(A) = Σ_i p_i I(p_i) = -Σ_i p_i log p_i
This applies only to memoryless information sources, in which events are independent of each other.
Generalizing information entropy
For other types of information sources, where events are not independent, information entropy is defined as:
H{X} = lim_{k→∞} H(X_{k+1} | X_1 X_2 … X_k)
X_k: the k-th value of the random variable X
Calculating the information entropy of a Markov information source (1)
H{X} = lim_{k→∞} H(X_{k+1} | X_1 X_2 … X_k)
This is the expected entropy of the (k+1)-th value given a specific history of the past k values. All that matters is the last value of the history, so let’s focus on X_k.
Calculating the information entropy of a Markov information source (2)
p(X_k = x): probability that the last (k-th) value is x
H(X_{k+1} | X_1 X_2 … X_k) = Σ_x p(X_k = x) H(X_{k+1} | X_k = x)
                           = -Σ_x p(X_k = x) Σ_y a_yx log a_yx
                           = Σ_x p(X_k = x) h(a_x)
a_yx: the element in the y-th row and x-th column of the TPM
h(a_x): the entropy of the x-th column vector of the TPM
Calculating the information entropy of a Markov information source (3)
H(X_{k+1} | X_1 X_2 … X_k) = Σ_x p(X_k = x) h(a_x)
If the information source has only one asymptotic probability distribution q:
lim_{k→∞} p(X_k = x) = q_x (the x-th element of q)
H{X} = lim_{k→∞} H(X_{k+1} | X_1 X_2 … X_k) = h · q
h: a row vector whose x-th element is h(a_x)
Calculating the information entropy of a Markov information source (4)
H{X} = lim_{k→∞} H(X_{k+1} | X_1 X_2 … X_k) = h · q
If the information source has only one asymptotic probability distribution, its information entropy is given by the average of the entropies of its TPM’s column vectors, weighted by the asymptotic probability distribution.
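Putting the pieces together, a sketch of H{X} = h · q for the binary source above, assuming numpy and a unique asymptotic distribution:

```python
# Entropy rate of a Markov source: column entropies of the TPM
# weighted by the asymptotic probability distribution.
import numpy as np

def entropy_rate(T):
    vals, vecs = np.linalg.eig(T)
    q = vecs[:, np.argmin(np.abs(vals - 1))].real
    q = q / q.sum()                                   # asymptotic distribution
    logs = np.log2(T, where=T > 0, out=np.zeros_like(T))
    h = -(T * logs).sum(axis=0)                       # entropy of each column
    return h @ q

T = np.array([[3/4, 1/4],
              [1/4, 3/4]])
print(entropy_rate(T))   # ~0.811 bits, vs. 1.0 for a fair memoryless source
```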
Exercise: Calculate the information entropy of the Markov information sources we discussed earlier:
01000000111111001110001111111
abcaccaabccccaaabc
aaccacaccaaaaabcc
Summary
- The complexity of a system may be characterized using information: length of description, entropy (ambiguity of knowledge)
- Mutual information quantifies the coupling between two components within a system
- Entropy may be measured for Markov information sources as well