1
Recognition
Stimulus input → Observer (information transmission channel) → response. Response: which category does the stimulus belong to? In many situations in the processing of sensory stimuli we are interested in deciding what the stimulus represents, i.e. to which category of expected stimuli the incoming signal can be assigned. Thus, when listening to a speech sound, we may want to decide which of the phonemes the sound represents (about 50 categories), or perhaps whether the speaker is male or female (2 categories). Intuitively, we feel that the information gained when deciding among the different categories of phonemes is different from the information gained when deciding between the two genders of the speaker. What is the “information value” of recognizing the category?
2
Information: area reduced to 63/64 vs. area reduced to 1/64
One can think about a game of “battleships” where a ship is hidden somewhere in a field of 64 squares. We know that the ship is somewhere in this field. When missing it on the first trial, some information is still gained: we now know that the ship is somewhere on the remaining 63 squares. However, when hitting the ship on the first trial, the information gained should be larger, since now we know more, namely that the ship is (was) on this particular square. When we learn by some means (perhaps a spy tells us) that the ship is not on the right part of the field, some information (knowledge) is also gained. The question is how to quantify these different amounts of information.
3
Prior information (possible space of signals)
The amount of information gained by receiving the signal is proportional to the ratio of two areas: the prior (the whole space of possible signals before the signal arrives) and the posterior (the possible space after the signal is received). We may agree that the information gained by receiving the signal should be somehow related to the size of the whole area where the ship could be located (prior probability of the occurrence of the ship) and the size of the area where the ship may be located after we receive the signal (posterior probability of the occurrence of the ship after the action). Let’s decide that the information gained is proportional to the ratio of these two areas: the larger this ratio, the more information is gained. Imagine your parents, one being generous (more likely your mother) who almost always gives you money when you ask, and one more strict (father?) who almost always refuses. When you ask your mother and she gives you the money, not much information is gained – you expected that. When your father gives you the money, much more information is gained (you did not expect it). When you are lucky and have a parent who always gives you money when you ask, getting the money brings no new information – you fully expected it. Makes sense, no? The less likely the outcome, the more information is gained! The information in a symbol should therefore decrease with the probability p of the symbol.
4
Basics of Information Theory
Claude Elwood Shannon (1916–2001). Observe the output message and try to reconstruct the input message (gain new information). Measuring information was put on a firm basis by Shannon shortly after the Second World War. Shannon also built a juggling machine, rocket-powered Frisbees, motorized pogo sticks, a device that could solve the Rubik's Cube puzzle, …
5
Measuring the information
Information in an event. Let’s start building up the notion. First, we agree that the information should be positive – by receiving a symbol which occurs with probability p, some information should be gained. Another common-sense requirement is that if we receive two independent symbols, each with its own probability of occurrence, the information in these two events should add. Defining the amount of information in a symbol which occurs with probability p as proportional to log(1/p) satisfies both requirements: multiplication of probabilities turns into addition of information, and the quantity is always positive (since p < 1).
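In symbols (the standard way of writing the definition described above; the base-2 logarithm, which gives the answer in bits, is introduced on a later slide):

```latex
I(p) = \log_2 \frac{1}{p} = -\log_2 p \;\ge\; 0 \quad (0 < p \le 1),
\qquad
I(p_1 p_2) = \log_2 \frac{1}{p_1 p_2} = I(p_1) + I(p_2).
```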
6
Suppose we receive a message consisting of M symbols (which come from some alphabet with n characters). When the message is sufficiently long, each symbol i occurs in the message approximately M·pi times. The probability P of the whole message occurring is given by the product of the probabilities of each symbol occurring. Then, according to our definition of information, the information obtained by receiving the message is log(1/P). However, we are not interested in the amount of information in a particular message (this depends on the length of the message); we are actually interested in the average amount of information per character of the message. This is obtained by dividing the information in the whole message by the length of the message M. This quantity is called the entropy of the alphabet (entropy of the source). To know the entropy, we need to know how large the alphabet is (the number of characters n) and the probability of occurrence pi of each character.
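The argument in the paragraph above can be written out as a short derivation (a sketch of the standard calculation, using the approximation that symbol i occurs M·pi times):

```latex
P \approx \prod_{i=1}^{n} p_i^{M p_i}
\quad\Rightarrow\quad
\log_2 \frac{1}{P} \approx -M \sum_{i=1}^{n} p_i \log_2 p_i
\quad\Rightarrow\quad
H = \frac{1}{M}\,\log_2 \frac{1}{P} = -\sum_{i=1}^{n} p_i \log_2 p_i .
```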
7
when deciding among N (equally likely) alternatives
When the logarithm of base 2 is used, the entropy is in units of bits. The information gained when deciding among N equally likely alternatives is log2 N:

  number of stimulus alternatives N    number of bits (log2 N)
  2^1 = 2                              1
  2^2 = 4                              2
  2^3 = 8                              3
  2^4 = 16                             4
  2^5 = 32                             5
  2^6 = 64                             6
  2^7 = 128                            7
  2^8 = 256                            8

One bit of information is the information which reduces the space of possible messages to one half of the original space.
8
experiments with two possible outcomes
With probabilities p1 and p2, the total probability must be 1, so p2 = 1 – p1, and
H = –p1 log2 p1 – (1 – p1) log2 (1 – p1),
i.e. H = 0 for p1 = 0 (the second outcome is certain) or p1 = 1 (the first outcome is certain). For p1 = 0.5, p2 = 0.5:
H = –0.5 log2 0.5 – 0.5 log2 0.5 = –log2 0.5 = 1 bit.
The probabilities of occurrence of the characters need not all be the same (e.g. in English text, i occurs much more often than e.g. z). However, a short argument (shown here for an alphabet with only two characters) may convince us that the highest entropy (the highest potential for transmitting information) belongs to the alphabet in which all symbols occur with equal probability. Entropy H (information) is maximum when the outcome is the least predictable!
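A few lines of Python (a minimal sketch, not part of the slides; the helper name binary_entropy is just illustrative) reproduce the numbers above and show that the two-outcome entropy peaks at p1 = 0.5:

```python
import math

def binary_entropy(p1: float) -> float:
    """Entropy H = -p1*log2(p1) - (1-p1)*log2(1-p1) of a two-outcome source."""
    if p1 in (0.0, 1.0):          # one outcome is certain -> no uncertainty
        return 0.0
    p2 = 1.0 - p1
    return -p1 * math.log2(p1) - p2 * math.log2(p2)

for p1 in (0.0, 0.1, 0.25, 0.5, 0.75, 1.0):
    print(f"p1 = {p1:4.2f}  H = {binary_entropy(p1):.3f} bits")
# H is 0 when p1 = 0 or 1 and reaches its maximum of 1 bit at p1 = 0.5
```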
9
Equal prior probability of each category.
We need 3 binary numbers (3 bits) to describe 2^3 = 8 equally likely categories (each bit answers a question such as “1st or 2nd half?”). We need more bits when dealing with symbols that are not all equally likely. When the different categories (symbols) have different probabilities of occurrence, more bits are required to express the six categories of this example (recall that when all categories are equally likely, 3 bits are enough). Here, the first bit separates the first category (which has a probability of occurrence p1 = 0.5) from the remaining 5 categories (which together also have a probability of occurrence p2+p3+p4+p5+p6 = 0.5). The second bit separates the second category (with its probability of occurrence p2 = 0.25) from the remaining 4 categories, etc., so the least likely categories need codewords of up to 5 bits. The bottom line is that non-equal probabilities of occurrence of the different categories result in a less efficient code. A sketch of this coding scheme follows below.
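A small Python sketch of this successive-splitting code (only p1 = 0.5 and p2 = 0.25 are given on the slide; the remaining probabilities 0.125, 0.0625, 0.03125, 0.03125 are assumed here purely for illustration, chosen so that each split again halves the remaining probability):

```python
import math

# assumed example distribution over six categories (only the first two values
# appear on the slide; the rest are illustrative)
p = [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.03125]

# codeword lengths of the "peel off one category per bit" code described above
lengths = [1, 2, 3, 4, 5, 5]

entropy = -sum(pi * math.log2(pi) for pi in p)
avg_len = sum(pi * li for pi, li in zip(p, lengths))

print(f"entropy             = {entropy:.3f} bits/symbol")
print(f"average code length = {avg_len:.3f} bits/symbol")
print(f"longest codeword    = {max(lengths)} bits")  # the '5 bits' on the slide
# with six equally likely categories, 3-bit codewords would cover all of them;
# here the rarest categories need codewords of up to 5 bits
```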
10
The Bar Code. Here is a nice example of information transfer. Using one hand to signal to a barman how many drinks you want, you can generate 6 different numbers (including zero). This represents an information capacity of log2 6 ≈ 2.58 bits. When you agree on a new code in which the particular combination of fingers you raise has a meaning, you can generate 32 different numbers, which represents the much higher information capacity of 5 bits. You have created a more efficient code; however, this comes at the cost of lower reliability – the barman has a harder time decoding your message, and the code breaks e.g. when the hand is seen left-right reversed in a mirror.
11
Information transfer through a communication channel
Transmitter (source) → channel (with noise) → receiver: p(X), p(Y|X), p(Y).
Two-element (binary) channel. With no noise in the channel, p(yj|xj) = 1 and the cross terms p(xj,yk) = 0 for k ≠ j. With noise, p(yj|xj) < 1 and p(xj,yk) > 0 for k ≠ j.
Example with noise: p(x1) = 0.8, p(x2) = 0.2, p(y1|x1) = 5/8, p(y2|x1) = 3/8, p(y1|x2) = 1/4, p(y2|x2) = 3/4, so
p(y1) = (5/8 × 0.8) + (1/4 × 0.2) = 0.55
p(y2) = (3/8 × 0.8) + (3/4 × 0.2) = 0.45
A general model of information transmission consists of a transmitter, the channel (which may be affected by noise), and a receiver of the information. The most basic model is the binary channel, where the transmitted information is binary (think of a yes/no situation). One state (e.g. “yes”) occurs with probability p(x1) and the second state (e.g. “no”) occurs with probability p(x2). When the channel is perfect (no noise), the information at the receiver is identical to the information at the transmitter (there is no loss of information): the probability that one state (e.g. x1) is transmitted and a different state y2 is received, given by the conditional probability p(y2|x1), is 0. However, in a typical situation there is some noise in the channel, p(yk|xj) ≠ 0 for k ≠ j, the output states can differ from the input states, and some information is lost in the channel.
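A short Python sketch of the example above (the channel matrix and the input probabilities are the ones given on the slide):

```python
# noisy binary channel: rows = inputs x1, x2; columns = outputs y1, y2
p_y_given_x = [[5/8, 3/8],    # p(y1|x1), p(y2|x1)
               [1/4, 3/4]]    # p(y1|x2), p(y2|x2)
p_x = [0.8, 0.2]              # p(x1), p(x2)

# output probabilities: p(yk) = sum over j of p(yk|xj) * p(xj)
p_y = [sum(p_y_given_x[j][k] * p_x[j] for j in range(2)) for k in range(2)]
print(p_y)   # [0.55, 0.45], matching the slide
```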
12
Binary channel confusion matrix:

              response 1   response 2   number of stimuli
  stimulus 1     N11          N12          Nstim 1
  stimulus 2     N21          N22          Nstim 2
                 Nres 1       Nres 2       N (total number of stimuli = number of responses)

p( xj ) = Nstim j / N
p( yk ) = Nres k / N
p( xj|yk ) = Njk / Nres k
joint probability that both xj and yk happen: p( xj,yk ) = Njk / N

It is possible to estimate the properties of such a binary channel in an experiment on categorical judgments, where we know the category of the input stimulus and the experimental subject is asked to respond with an estimate of the category. We record the responses in a so-called confusion matrix, which contains the numbers of correct responses on its diagonal and the numbers of errors in its off-diagonal terms. The probabilities of occurrence of the individual input categories, p(xj), are obtained by dividing the number of times the given category was presented by the total number of stimuli. The probabilities of the output categories, p(yk), are computed in a similar manner, by dividing the number of times the given category was given as a response by the total number of stimuli. The conditional probabilities p(xj|yk) – i.e. the probability that category xj was presented when category yk was the response – are computed by dividing the matrix entries Njk by the total number of times the given response was observed. The joint probability of observing both xj and yk at the same time, denoted p(xj,yk), is computed by dividing each individual entry of the confusion matrix by the total number of presented (or received) stimuli N.
13
Stimulus-Response Confusion Matrix
Rows: presented stimulus categories x1 … xn; columns: received responses y1 … yn.

            y1       y2      ...   yn       total
  x1        N11      N12     ...   N1n      Nstim 1
  x2        N21      N22     ...   N2n      Nstim 2
  ...
  xn        Nn1      Nn2     ...   Nnn      Nstim n
  total     Nres 1   Nres 2  ...   Nres n   N

probability of the j-th stimulus: p( xj ) = Nstim j / N
probability of the k-th response: p( yk ) = Nres k / N
conditional probability that xj was sent when yk was received: p( xj|yk ) = Njk / Nres k
joint probability that both xj and yk happen: p( xj,yk ) = Njk / N
number of j-th stimuli: Σk Njk = Nstim j
number of k-th responses: Σj Njk = Nres k
number of presented stimuli = number of responses: Σj Nstim j = Σk Nres k = N

The confusion matrix for an arbitrary number of categories is built following the same principles as described for the binary confusion matrix.
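A Python sketch of these estimates (a generic helper, not from the slides, and the name confusion_probabilities is just illustrative; it takes any confusion matrix of counts Njk with stimuli in the rows and responses in the columns):

```python
def confusion_probabilities(N):
    """Estimate p(x), p(y), p(x,y) and p(x|y) from a confusion matrix of counts."""
    total = sum(sum(row) for row in N)                    # N
    n_stim = [sum(row) for row in N]                      # Nstim j (row sums)
    n_resp = [sum(col) for col in zip(*N)]                # Nres k  (column sums)

    p_x = [nj / total for nj in n_stim]                   # p(xj) = Nstim j / N
    p_y = [nk / total for nk in n_resp]                   # p(yk) = Nres k / N
    p_xy = [[njk / total for njk in row] for row in N]    # p(xj,yk) = Njk / N
    p_x_given_y = [[N[j][k] / n_resp[k] for k in range(len(n_resp))]
                   for j in range(len(n_stim))]           # p(xj|yk) = Njk / Nres k
    return p_x, p_y, p_xy, p_x_given_y

# e.g. the "always right" binary experiment from a later slide (10 + 10 correct):
print(confusion_probabilities([[10, 0], [0, 10]]))
```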
14
information transferred by a system
The information transferred by a system is
I(X|Y) = H(X) – H(X|Y) = Hmax(X,Y) – H(X,Y),
where Hmax(X,Y) is the joint entropy the system would have if the input and the output were independent (joint probabilities given by the products of the individual probabilities). When the joint entropy H(X,Y) actually reaches this maximum, the output bears no relation to the input, i.e. no information is transferred.
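Written out with the entropy definitions used earlier (this is the standard identity behind the slide’s formula; the deck writes the transferred information as I(X|Y), elsewhere it is often denoted I(X;Y)):

```latex
H(X,Y) = -\sum_{j,k} p(x_j,y_k)\,\log_2 p(x_j,y_k),
\qquad
H_{\max}(X,Y) = -\sum_{j,k} p(x_j)\,p(y_k)\,\log_2\!\big(p(x_j)\,p(y_k)\big) = H(X) + H(Y),
```
```latex
I(X|Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y) = H_{\max}(X,Y) - H(X,Y).
```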
15
transferred information I(X|Y)=H(X)-H(X|Y) = H(X) = 1
Run the experiment 20 times and get it always RIGHT.

Confusion matrix (counts):
           stim 1   stim 2
  resp 1     10        0
  resp 2      0       10
  total of 20 trials

Conditional probabilities p(yk|xj): 1.0 on the diagonal, 0 off the diagonal.
Input probabilities p(x1) = 0.5, p(x2) = 0.5; output probabilities p(y1) = 0.5, p(y2) = 0.5.

Transferred information I(X|Y) = H(X) – H(X|Y) = H(X) = 1 bit (H(X|Y) = 0: once the response is known, the stimulus is certain).
16
transferred information I(X|Y)=Hmax(X,Y)-H(X,Y) =2-1=1 bit
Run the experiment 20 times and get it always RIGHT.

Confusion matrix (counts):
           stim 1   stim 2
  resp 1     10        0
  resp 2      0       10
  total of 20 trials

Joint probabilities p(xj,yk): 0.5 on the diagonal, 0 off the diagonal.
Probabilities of independent events p(xj)p(yk): 0.25 in every cell.
Input probabilities p(x1) = 0.5, p(x2) = 0.5; output probabilities p(y1) = 0.5, p(y2) = 0.5.

Transferred information I(X|Y) = Hmax(X,Y) – H(X,Y) = 2 – 1 = 1 bit.
17
transferred information I(X|Y)=Hmax(X,Y)-H(X,Y) =2-1=1 bit
Run the experiment 20 times and get it always WRONG.

Confusion matrix (counts):
           stim 1   stim 2
  resp 1      0       10
  resp 2     10        0
  total of 20 trials

Joint probabilities p(xj,yk): 0.5 off the diagonal, 0 on the diagonal.
Probabilities of independent events p(xj)p(yk): 0.25 in every cell.
Input probabilities p(x1) = 0.5, p(x2) = 0.5; output probabilities p(y1) = 0.5, p(y2) = 0.5.

Transferred information I(X|Y) = Hmax(X,Y) – H(X,Y) = 2 – 1 = 1 bit (consistently wrong responses still carry information – one only has to relabel them).
18
transferred information I(X|Y)=Hmax(X,Y)-H(X,Y) =2-2=0 bit
Run the experiment 20 times and get it 10 times right and 10 times wrong.

Confusion matrix (counts):
           stim 1   stim 2
  resp 1      5        5
  resp 2      5        5
  total of 20 trials

Joint probabilities p(xj,yk): 0.25 in every cell.
Probabilities of independent events p(xj)p(yk): 0.25 in every cell.
Input probabilities p(x1) = 0.5, p(x2) = 0.5; output probabilities p(y1) = 0.5, p(y2) = 0.5.

Transferred information I(X|Y) = Hmax(X,Y) – H(X,Y) = 2 – 2 = 0 bits: the responses are at chance level, the joint probabilities equal the products of the individual probabilities, and nothing is transferred.
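The three examples above can be checked with a short Python sketch (a generic helper, not from the slides; the name transferred_information is illustrative) that computes I(X|Y) = Hmax(X,Y) – H(X,Y) directly from a confusion matrix of counts; since the result is symmetric in X and Y, the orientation of the matrix does not matter here:

```python
import math

def transferred_information(N):
    """I(X|Y) = Hmax(X,Y) - H(X,Y), computed from a confusion matrix of counts."""
    total = sum(sum(row) for row in N)
    p_xy = [[njk / total for njk in row] for row in N]        # joint probabilities
    p_x = [sum(row) for row in p_xy]                          # one marginal
    p_y = [sum(col) for col in zip(*p_xy)]                    # the other marginal

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    h_xy = entropy([p for row in p_xy for p in row])          # H(X,Y)
    h_max = entropy([px * py for px in p_x for py in p_y])    # Hmax(X,Y), independence
    return h_max - h_xy

print(transferred_information([[10, 0], [0, 10]]))   # always right -> 1.0 bit
print(transferred_information([[0, 10], [10, 0]]))   # always wrong -> 1.0 bit
print(transferred_information([[5, 5], [5, 5]]))     # chance level -> 0.0 bit
```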
19
Example stimulus-response confusion matrix
Here is one example of a confusion matrix: stimuli x1–x5 in the rows, response categories y1–y5 in the columns, each stimulus presented 25 times, N = 125 stimuli in total. Some of its entries: 20 and 5 in row x1, 15 in row x2, 6, 17 and 2 in row x3, 12 and 8 in row x4, 19 in row x5; the column totals (numbers of responses per category) include 26 and 27.
20
Matrix of Joint Probabilities
(the stimulus-response matrix divided by the total number of stimuli)
Number of presented stimuli = number of responses = N; p(xj,yk) = Njk / N. The matrix of counts Njk (rows x1 … xn, columns y1 … yn) is turned into the matrix of joint probabilities with entries p(x1,y1), p(x1,y2), …, p(xn,yn) in the same layout. From the confusion matrix, the matrix of joint probabilities is computed by dividing each entry of the confusion matrix by the total number of stimuli N.
21
probability of stimulus
The stimulus/response confusion matrix is extended with the marginal totals and probabilities: each row total (the number of presentations of that stimulus) is 25, so the probability of every stimulus category is 25/125 = 0.2; the column totals (numbers of responses) include 25, 26, 27 and 20, giving response probabilities 25/125 = 0.2, 26/125 = 0.208, 27/125 = 0.216 and 20/125 = 0.16; the grand total is N = 125. Here it is done for our concrete example. Note that the probabilities of the presented categories are all the same in this case (in general they do not have to be), while the probabilities of the responses are not all equal (but they have to add up to 1).
22
total number of stimuli (responses) N = 125
Joint probability p(xj,yk) = Njk / N. The matrix of joint probabilities p(xj,yk) contains entries such as 20/125 = 0.16 and 5/125 = 0.04 in row x1, 5/125 = 0.04 and 15/125 = 0.12 in row x2, 6/125 = 0.048, 17/125 = 0.136 and 2/125 = 0.016 in row x3, 12/125 = 0.096 and 8/125 = 0.064 in row x4, and 19/125 = 0.152 in row x5. This is the matrix of joint probabilities. Having all the joint probabilities, we can compute the joint entropy H(X,Y) using the formula for entropy. In this case, the entropy of the joint probability matrix is 3.43 bits.
23
When xj and yk are independent events (i.e. the output does not depend on the input), the joint probabilities would be given by the products of the probabilities of these independent events, p(xj,yk) = p(xj) p(yk), and the joint entropy of the system would be the maximum Hmax (the system would be entirely useless for transmission of information, since its output would not depend on its input). What would the joint probabilities be if the output were entirely independent of the input? In that case there would be no relation between the input and output symbols, i.e. no information would be transmitted. The input and output symbols would be independent and their joint probabilities would be given by the products of the probabilities of the individual symbols (this is the definition of stochastic independence). The entropy of such a matrix of joint probabilities is the maximum possible, Hmax(X,Y); in this case it is 4.63 bits.
24
I(X|Y) = Hmax(X,Y) – H(X,Y) = 4.63 – 3.43 = 1.2 bits
The information that is transmitted by the system is given by the difference between the maximum joint entropy of the matrix of independent events, Hmax(X,Y), and the joint entropy of the real system derived from the confusion matrix, H(X,Y): I(X|Y) = Hmax(X,Y) – H(X,Y) = 4.63 – 3.43 = 1.2 bits. In other words, for a system with a given confusion matrix the transmitted information is the difference between the joint entropy of the stochastically independent case and the entropy of the true matrix of joint probabilities; in our case it is 1.2 bits. FYI, this is not the only way to compute the transmitted information from the confusion matrix, but it is probably the most intuitive one. Other methods involve matrices of conditional probabilities.
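A sketch of this final computation in Python (generic helper, not from the slides, and the name transmitted_information is illustrative; it takes any matrix of joint probabilities p(xj,yk) and returns the three quantities quoted above):

```python
import math

def entropy(probs):
    """Entropy in bits of a collection of probabilities (zeros are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def transmitted_information(p_xy):
    """Return Hmax(X,Y), H(X,Y) and I(X|Y) = Hmax(X,Y) - H(X,Y)."""
    p_x = [sum(row) for row in p_xy]              # marginal p(xj)
    p_y = [sum(col) for col in zip(*p_xy)]        # marginal p(yk)
    h_xy = entropy([p for row in p_xy for p in row])          # joint entropy H(X,Y)
    h_max = entropy([px * py for px in p_x for py in p_y])    # independent case
    return h_max, h_xy, h_max - h_xy

# e.g. for the binary "always right" example: Hmax = 2, H = 1, I = 1 bit
print(transmitted_information([[0.5, 0.0], [0.0, 0.5]]))
```

Applied to the complete 5×5 joint-probability matrix of the example, the same computation should reproduce the quoted values of about 4.63 bits, 3.43 bits and 1.2 bits.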
25
Capacity of human channel for one-dimensional stimuli
When there is only one category into which the sensory stimuli can be classified, the transmitted information is obviously 0 – no information is transmitted since the response is entirely predictable. For two categories, when no external noise is present, there are almost no errors, i.e. the transmitted information is almost equal to the information at the input, i.e. 1 bit. In an ideal system this trend would continue (marked as “perfect information transmission”). However, for 4 categories (2 bits of information) there is typically a larger loss of information, and the transmitted information is less than 2 bits. When human subjects try to classify into more than 5 categories (about 2.5 bits of information), the transmitted information tends to flatten at about this value.
26
Magic number 7±2 (George Miller 1956)
This limit on information transmission by human subjects has been observed across many perceptual modalities. Miller’s paper on the magic number 7 (plus or minus two) is a delight to read!
27
Magic number 7±2 (George Miller 1956)
Human perception seems to distinguish only among 7 (plus or minus 2) different entities along one perceptual dimension. To recognize more items: long training (musicians), use of more than one perceptual dimension (e.g. pitch and loudness), or chunking the items into larger units (phonemes into words, words into phrases, …). It is thus possible to process larger amounts of information either by longer training of the subjects, by using more than one-dimensional stimuli, or by chunking the items into larger chunks.