Information and Coding Theory Transmission over noisy channels. Channel capacity, Shannon’s theorem. Juris Viksna, 2015
Information transmission
Noisy channel
In practice channels are always noisy (although sometimes the noise can be ignored). There are several types of noisy channels one can consider. We will restrict attention to binary symmetric channels (BSC).
Noisy channel - the problem
Assume a BSC with probability of transmission error p. We assume that we have already decided on the optimal string of bits for transmission, i.e. each bit has value 1 or 0 with equal probability ½.
We want to maximize our chances of receiving a message without errors. To do this we are allowed to modify the message that we have to transmit.
Usually we will assume that the message is composed of blocks of m bits each, and that we are allowed to replace a given m-bit block with an n-bit block of our choice (most likely with n > m). Such a replacement procedure is called a block code. We would also like to maximize the rate R = m/n.
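Entropy - reminder of some definitions
Conditional entropy: $H(X|Y) = \sum_y P(y) H(X|y) = -\sum_{x,y} P(x,y) \log_2 P(x|y)$
Mutual information: $I(X;Y) = H(X) - H(X|Y)$
Binary entropy function: $H_2(p) = -p \log_2 p - (1-p) \log_2 (1-p)$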
Entropy (summarized) Relations between entropies, conditional entropies, joint entropy and mutual information. [Adapted from D.MacKay]
How could we characterize noisy channels?
[Figure: input x_i → noisy channel → output y_i]
Assume x_i was transmitted and we have received some y_i. What can we infer about x_i from y_i?
If the error probability is 0, we will have H(X|y=b_k) = 0.
If (for a BSC) the error probability is ½, we will have H(X|y=b_k) = H(X).
I(X;Y) seems a good parameter to characterize the channel.
[Adapted from D.MacKay]
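Channel capacity
[Figure: input x_i → noisy channel → output y_i]
The capacity of a channel Q is the mutual information maximized over the input distributions: $C(Q) = \max_{P_X} I(X;Y)$.
For simplicity we will restrict our attention to the BSC (so we will assume a uniform distribution). For "real channels", however, the distribution usually cannot be ignored, e.g. because of higher probabilities of burst errors.
[Adapted from D.MacKay]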
Channel capacity for BSC
Assuming error probability p and probabilities of transmission of both 0 and 1 to be ½, we will have:
$P(x=1|y=0) = p$, $P(x=0|y=0) = 1-p$
$H(X|y=0) = -p \log p - (1-p) \log (1-p) = H_2(p)$
$H(X|y=1) = -p \log p - (1-p) \log (1-p) = H_2(p)$
$H(X|Y) = \frac{1}{2} H(X|y=0) + \frac{1}{2} H(X|y=1) = H_2(p)$
Thus $C(Q) = I(X;Y) = H_2(1/2) - H_2(p) = 1 - H_2(p)$.
This value will be in the interval [0,1].
[Adapted from D.MacKay]
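As a quick numerical check of the capacity formula for the BSC, the following Python sketch evaluates C = 1 − H_2(p) for a few error probabilities. The function names `binary_entropy` and `bsc_capacity` are illustrative choices and do not come from the slides.

```python
import math

def binary_entropy(p):
    """Binary entropy H_2(p) in bits; H_2(0) = H_2(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity 1 - H_2(p) of a binary symmetric channel with error probability p."""
    return 1.0 - binary_entropy(p)

# Print the capacity for a noiseless, a mildly noisy and a useless channel.
for p in (0.0, 0.01, 0.11, 0.5):
    print(f"p = {p:<5} C = {bsc_capacity(p):.4f}")
```

For p = 0.11 the capacity comes out very close to ½, i.e. roughly two channel bits are needed per information bit.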
Block and bit error probabilities
The computation of the probability of block error p_B is relatively simple. If we have a code that guarantees to correct any t errors, we should have:
$p_B \le \sum_{i=t+1}^{n} \binom{n}{i} p^i (1-p)^{n-i}$
[Adapted from D.MacKay]
Block and bit error probabilities
The computation of the probability of bit error p_b is in general complicated. It is not enough to know the code parameters (n,k,t); we also need the weight distribution of the code vectors. However, assuming the code is completely "random", we can argue that any non-corrected transmission of a block of length n leads to the selection of a randomly chosen message of length k. It remains to derive p_b from the equation above.
[Adapted from D.MacKay]
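A small Python sketch of the block error estimate, assuming the usual reading that a block can only be decoded wrongly when more than t of its n bits are flipped, so that p_B is at most the binomial tail beyond t. The function name `block_error_bound` is hypothetical.

```python
import math

def block_error_bound(n, t, p):
    """Probability of more than t bit errors in a block of n bits on a BSC(p).

    For a code that corrects any t errors this is an upper bound on p_B.
    """
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(t + 1, n + 1))

# Example: a code of length 7 correcting any single error, on a BSC with p = 0.1.
print(block_error_bound(7, 1, 0.1))
```

For a perfect code decoded to the nearest codeword the bound is attained with equality, since every pattern of more than t errors is then decoded wrongly.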
Shannon's theorem
Theorem (Shannon, 1948):
1. For every discrete memoryless channel, the channel capacity C has the following property. For any ε > 0 and R < C, for large enough N, there exists a code of length N and rate R' ≥ R and a decoding algorithm, such that the maximal probability of block error is ≤ ε.
2. If a probability of bit error p_b is acceptable, rates up to R(p_b) are achievable, where $R(p_b) = \frac{C}{1 - H_2(p_b)}$ and H_2(p_b) is the binary entropy function.
3. For any p_b, rates greater than R(p_b) are not achievable.
[Adapted from
Shannon Channel Coding Theorem [Adapted from D.MacKay]
Shannon's theorem - a closer look
If a probability of bit error p_b is acceptable, rates up to R(p_b) are achievable, where $R(p_b) = \frac{C}{1 - H_2(p_b)}$ and H_2(p_b) is the binary entropy function.
For any p_b, rates greater than R(p_b) are not achievable.
Thus, if we want p_b to get as close to 0 as possible, we still need only codes with rates that are just below C.
[Adapted from
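To see how little the tolerated bit error probability changes the achievable rate, the sketch below evaluates R(p_b) = C / (1 − H_2(p_b)) for a BSC. It assumes this standard form of the rate bound from MacKay's statement of the theorem; `max_rate` is an illustrative name.

```python
import math

def binary_entropy(p):
    """Binary entropy H_2(p) in bits, with H_2(0) = H_2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def max_rate(p, p_b):
    """R(p_b) = C / (1 - H_2(p_b)) for a BSC with error probability p."""
    capacity = 1.0 - binary_entropy(p)
    denominator = 1.0 - binary_entropy(p_b)
    # p_b = 1/2 means the received bits carry no information at all,
    # so any rate is "achievable"; report this as infinity.
    return math.inf if denominator == 0.0 else capacity / denominator

for p_b in (0.0, 0.001, 0.01, 0.1):
    print(f"p_b = {p_b:<6} R(p_b) = {max_rate(0.1, p_b):.4f}")
```

As p_b approaches 0 the value R(p_b) approaches C, which is the case treated in the rest of the lecture.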
Shannon’s theorem
Actually we will prove only the first (and the correspondingly restricted third) part of this result: that with rates below C we can achieve as low a transmission error as we wish (if p_b → 0 then 1 − H_2(p_b) → 1). Still, this is probably the most interesting/illustrative case.
[Adapted from D.MacKay]
Shannon's theorem – a simpler special case
Theorem: For every BSC with bit error probability p, the channel capacity C = 1 − H_2(p) has the following properties:
1) For any ε > 0 and R < C, for large enough n there exists a code of length n and rate R' ≥ R and a decoding algorithm, such that the maximal probability of block error is ≤ ε.
2) For any R > C there exists ε > 0 such that for any code with rate R' ≥ R the probability of block error is ≥ ε.
So, how could we prove this?
#1 - we could try to construct specific codes that will just do the job.
#2 - we need to prove that whatever code with rate above C we use, it will not work.
Entropy argument?
[Figure: x → noisy channel (error probability p) → y]
To transmit x ∈ X without errors we need to receive H(X) bits.
The length of vectors from X (transmitted) is n bits and H(X) = k.
The length of vectors from Y (received) is n bits and H(Y) ≤ n.
Rate R = k/n.
Capacity (per transmitted bit) C = I(X;Y) = H(X) − H(X|Y) = 1 − H_2(p).
If n bits are transferred we receive nC information bits about X. To receive enough bits for error-less decoding we should have nC ≥ k, i.e. C ≥ k/n = R.
Well, almost OK, provided we "know" that entropy measures information...
Achievability - how to prove?
We could assume that we have a perfect code with rate R = k/n. From the Hamming inequality (the sphere-packing bound) we could compute the number t of errors that the code can correct. Then we could check whether the probability of more than t errors in a block of length n does not exceed p_B. In principle this probably could be done.
[Figure labels: "k messages", "2^n codewords"]
Achievability - how to prove?
Still, we are not that sure that perfect codes should exist... Perhaps we can try to use a set of randomly chosen vertices (vectors from {0,1}^n) for encoding and hope that this will work well enough?
[Adapted from D.MacKay]
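The Hamming inequality mentioned here can be evaluated directly. The sketch below assumes its usual form 2^k · Vol(B(t)) ≤ 2^n, where Vol(B(t)) is the number of vectors within distance t of a codeword, and returns the largest t compatible with it. The names are illustrative, and the result is only an upper limit on t: a code attaining it (a perfect code) need not exist.

```python
import math

def hamming_ball_volume(n, r):
    """Number of binary vectors of length n within Hamming distance r of a fixed vector."""
    return sum(math.comb(n, i) for i in range(r + 1))

def max_correctable_errors(n, k):
    """Largest t such that 2^k disjoint balls of radius t fit into {0,1}^n."""
    t = 0
    while t + 1 <= n and hamming_ball_volume(n, t + 1) * 2**k <= 2**n:
        t += 1
    return t

# Parameters of two well known perfect codes: Hamming (7,4) and Golay (23,12).
print(max_correctable_errors(7, 4))
print(max_correctable_errors(23, 12))
```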
Probabilities of errors
[Figure: x → noisy channel (error probability p) → y]
p – probability that a single bit will be changed.
pn – average number of changed bits in a block of length n.
$(1-p)^n$ – probability that there were no transmission errors.
$\binom{n}{i} p^i (1-p)^{n-i}$ – probability of exactly i transmission errors.
Binomial distribution n=20, p = 0.1, 0.5, 0.8
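The three distributions named in this figure can be recomputed with a few lines of Python. This is only a sketch that tabulates the probability of exactly i errors for n = 20; the helper names are illustrative.

```python
import math

def binomial_pmf(n, i, p):
    """Probability of exactly i errors among n bits, each flipped with probability p."""
    return math.comb(n, i) * p**i * (1 - p)**(n - i)

def most_likely_error_count(n, p):
    """Value of i with the largest probability (the mode of the distribution)."""
    return max(range(n + 1), key=lambda i: binomial_pmf(n, i, p))

n = 20
for p in (0.1, 0.5, 0.8):
    mode = most_likely_error_count(n, p)
    print(f"p = {p}: mean = {p * n:.1f}, most likely count = {mode}, "
          f"P(exactly {mode}) = {binomial_pmf(n, mode, p):.4f}")
```

In each case the probability mass is concentrated around the mean pn, while the probability of any single value (including pn itself) stays well below 1.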
Shannon’s theorem - achievability
Let n > k be decided upon later. Pick the encoding function E: {0,1}^k → {0,1}^n at random, i.e. for every m ∈ {0,1}^k, E(m) is chosen uniformly at random from {0,1}^n, independently of all other choices.
Note that we don't fix the encoding; retransmission of m will likely use a different E(m).
The decoding function D: {0,1}^n → {0,1}^k works as follows: given y ∈ {0,1}^n, we find (non-constructively) the m ∈ {0,1}^k such that the distance d(y,E(m)) is minimized. This m is the value of D(y).
B(y,r) – ball with center at y and radius r, i.e. the set of all vectors z from {0,1}^n with d(y,z) ≤ r.
Let m be given. For D(y) ≠ m we must have at least one of the following:
1) y ∉ B(E(m),r) (more than r errors in transmission)
2) There exists m' ≠ m with E(m') ∈ B(y,r) (possibility of wrong decoding)
[Adapted from M.Sudan]
Shannon’s theorem - achievability
[Figure: balls of radius r around E(m) and around y]
If we want to achieve an arbitrarily low block error probability, we need to achieve arbitrarily low probabilities for both of these events (more than r errors in transmission, and another codeword E(m') falling within B(y,r)). So, what value of r should we choose?
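The random-coding construction can be tried out on a small scale. The following Python sketch is a Monte Carlo illustration under stated simplifications: it draws one random codebook per run and keeps it fixed (rather than re-drawing E(m) for every transmission), represents vectors as Python integers, and decodes by exhaustive nearest-codeword search. All names and parameter values are hypothetical.

```python
import random

def hamming_distance(a, b):
    """Hamming distance between two bit vectors stored as integers."""
    return bin(a ^ b).count("1")

def simulate_random_code(n, k, p, trials, seed=0):
    """Estimate the block error rate of a random code of length n with 2^k codewords."""
    rng = random.Random(seed)
    # E(m) is chosen uniformly at random from {0,1}^n for every message m.
    codebook = [rng.getrandbits(n) for _ in range(2 ** k)]
    errors = 0
    for _ in range(trials):
        m = rng.randrange(2 ** k)
        # BSC: flip each of the n bits independently with probability p.
        noise = 0
        for bit in range(n):
            if rng.random() < p:
                noise |= 1 << bit
        y = codebook[m] ^ noise
        # D(y): the message whose codeword is closest to y.
        decoded = min(range(2 ** k), key=lambda j: hamming_distance(y, codebook[j]))
        if decoded != m:
            errors += 1
    return errors / trials

# Rate k/n = 0.1 is far below the capacity of a BSC with p = 0.05.
print(simulate_random_code(n=40, k=4, p=0.05, trials=200))
```

Increasing k at fixed n (a higher rate) or increasing p should make the estimated block error rate grow, which is the trade-off the theorem quantifies.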
Binomial distribution
p – probability that a single bit will be changed.
pn – average number of changed bits in a block of length n.
The probability of having exactly pn errors approaches 0 as n increases; however, it turns out that the same happens with the probability of having a number of errors that differs "too much" from pn. (This already suggests that we should choose r close to pn.)
Chernoff bound
A somewhat simplified version (but still sufficient for our purposes):
Assume we flip a coin n times, where the coin falls "heads" (denoted by 1) with probability p and "tails" (denoted by 0) with probability 1 − p. After the n flips the average number of 1-s will be pn. Although the probability that the number N of 1-s is exactly pn is small, so is the probability that N significantly deviates from pn. This is the result stated by the Chernoff bound:
For an arbitrarily small ε > 0 (with ε ≤ p), the probability P that the number of heads N exceeds (p+ε)n is at most $e^{-\delta^2 pn/3}$, where δ = ε/p. This means that P → 0 as n → ∞.
Although not particularly difficult, we will omit the proof of this result here.
Chernoff bound
X_1, X_2, ..., X_n – independent random variables taking values in {0,1}.
X = X_1 + X_2 + ... + X_n.
E[X_i] – expectation of X_i.
μ = E[X] = E[X_1] + E[X_2] + ... + E[X_n] – expectation of X.
Theorem (Chernoff bound, 1952): For any 0 < δ ≤ 1: $P(X > (1+\delta)\mu) \le e^{-\delta^2 \mu/3}$.
We will just be interested in the special case where p(X_i=1) = p, p(X_i=0) = 1 − p, E[X_i] = p and E[X] = np. This gives us the inequality $P(X > (1+\delta)pn) \le e^{-\delta^2 pn/3}$, i.e. for any chosen δ > 0 the probability of having more than (1+δ)pn errors approaches 0 with increasing n.
Essentially this allows us to choose r = (p+ε)n with arbitrarily small ε = δp and be sure that the probability of y ∉ B(E(m),r) can be made arbitrarily small by choosing a sufficiently large n.
Note. The result above is one of several alternative forms in which the Chernoff bound is stated; the coefficients in the inequality can also be improved.
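The behaviour described by the Chernoff bound can be checked numerically. The sketch below compares the exact binomial tail P(X > (1+δ)pn) with the bound exp(−δ²pn/3); it assumes this particular multiplicative form of the bound (valid for 0 < δ ≤ 1), which is only one of several versions in use. Function names are illustrative.

```python
import math

def binomial_upper_tail(n, p, threshold):
    """Exact P(X > threshold) for X ~ Binomial(n, p)."""
    first = math.floor(threshold) + 1
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(first, n + 1))

def chernoff_bound(n, p, delta):
    """Upper bound exp(-delta^2 * p * n / 3) on P(X > (1 + delta) * p * n)."""
    return math.exp(-delta**2 * p * n / 3)

p, delta = 0.1, 0.5
for n in (50, 200, 1000):
    tail = binomial_upper_tail(n, p, (1 + delta) * p * n)
    print(f"n = {n:5d}: exact tail = {tail:.3e}, bound = {chernoff_bound(n, p, delta):.3e}")
```

Both columns shrink as n grows; the bound is loose but decays exponentially in n, which is all the achievability proof needs.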
Chernoff bound
Theorem (Chernoff bound): For any 0 < δ ≤ 1: $P(X > (1+\delta)\mu) \le e^{-\delta^2 \mu/3}$.
Proof? Somewhat technical, but it can be obtained by considering $Y_i = e^{tX_i}$ for some t > 0 and applying Markov's inequality $P(X \ge a) \le E[X]/a$ (a familiar introductory result from a probability theory course :)
$P(X > (1+\delta)\mu) = P(e^{tX} > e^{t(1+\delta)\mu})$
$P(X > (1+\delta)\mu) \le E[e^{tX}] / e^{t(1+\delta)\mu}$
and using the fact that $1 + x \le e^x$. (This leads to a somewhat more complicated expression for the Chernoff bound, which can then be bounded from above by $e^{-\delta^2 \mu/3}$.)
Shannon’s theorem - achievability
We fix the "number of errors" as r = (p+ε)n and for each y attempt to decode it to m such that y ∈ B(E(m),r).
To show that we can get an arbitrarily small error probability we need to show that the following two probabilities can be made arbitrarily small by choosing a sufficiently large length n of the code:
1) The probability that the number of errors has exceeded r (so decoding fails by definition);
2) The probability that y falls within the intersection of B(E(m),r) with another B(E(m'),r) (thus a possibility to decode y wrongly as m').
[Adapted from M.Sudan]
Shannon’s theorem - achievability
Why should we choose r = (p+ε)n?
1) The average number of errors is pn; however, for any ε > 0 the probability of getting more than (p+ε)n errors approaches 0 with increasing n. Thus this is a good (and probably the only) way to bring the probability of event #1 (more than r errors in transmission) to 0.
Note that this part of the proof works regardless of the code rate R. There is no relation between ε and the desired error probability: we just need a value ε > 0 to prove part #1, and can then choose it as small as we wish so that it can be ignored in the proof of part #2.
[Adapted from M.Sudan]
Shannon’s theorem - achievability
Why should we choose r = (p+ε)n?
2) The probability of event #2 (some other codeword E(m') falls within B(y,r)) depends on n, r and k. So this is where we will get the estimate for the error rate from. It will turn out that the probability of event #2 approaches 0 with increasing n if we choose R = k/n < 1 − H_2(p) and a sufficiently small ε (which we are free to do). This essentially means that good codes exist provided that R < C.
[Adapted from M.Sudan]
Shannon’s theorem - achievability
So, the Chernoff bound guarantees that the probability of event #1 (more than r errors in transmission) can be made arbitrarily small for any ε > 0 and r = (p+ε)n, if we take n to be sufficiently large.
What about the probability of event #2 (another codeword E(m') within B(y,r))?
Shannon’s theorem - achievability
How to estimate the probability of event #2?
Denote the number of vectors within distance r from y by vol(B(y,r)).
The probability to decode y wrongly to one particular m' is vol(B(y,r))/2^n.
The probability to decode y wrongly to any vector is at most vol(B(y,r))/2^{n−k}.
By definition $vol(B(y,r)) = \sum_{i=0}^{r} \binom{n}{i}$. We will show that vol(B(y,pn)) ≤ 2^{H(p)n}.
This will give error probability P ≤ 2^{H(p)n − n + k}. Thus for R = k/n < 1 − H(p) we have P → 0 as n → ∞.
[Figure: balls of radius r around E(m) and around y]
Volume of Hamming balls
B(y,r) – ball with center at y and radius r, i.e. the set of all vectors z from {0,1}^n with d(y,z) ≤ r.
vol(B(y,r)) = |{z ∈ {0,1}^n | d(y,z) ≤ r}|.
We obviously have: $vol(B(y,r)) = \sum_{i=0}^{r} \binom{n}{i}$
But, a bit surprisingly, it turns out that vol(B(y,pn)) ≈ 2^{H(p)n}.
The value obviously does not depend on the center y, thus let us consider vol(B(0,pn)) = vol(B(pn)).
Volume of Hamming balls
Theorem (volume of Hamming balls):
B(r) – ball in the n-dimensional binary vector space with center at 0 and radius r, i.e. B(r) = {z ∈ {0,1}^n | w(z) ≤ r}.
Let p < 1/2 and H_2(p) = −p log p − (1−p) log (1−p). Then for large enough n:
1. Vol(B(pn)) ≤ 2^{H_2(p)n}
2. Vol(B(pn)) ≥ 2^{H_2(p)n − o(n)}
Here f(n) ∈ o(n) ⟺ for any c > 0 we have f(n) < cn for all large n ⟺ lim_{n→∞} f(n)/n = 0.
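Both parts of the theorem can be observed numerically. The sketch below computes the exact volume of a Hamming ball of radius pn and compares log_2 Vol(B(pn)) / n with H_2(p); it assumes pn is an integer, and the names are illustrative.

```python
import math

def binary_entropy(p):
    """Binary entropy H_2(p) in bits for 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def hamming_ball_volume(n, r):
    """Number of vectors in {0,1}^n of weight at most r."""
    return sum(math.comb(n, i) for i in range(r + 1))

def normalized_log_volume(n, r):
    """log_2 Vol(B(r)) / n, to be compared with H_2(r/n)."""
    return math.log2(hamming_ball_volume(n, r)) / n

p = 0.2
for n in (10, 100, 1000):
    r = round(p * n)
    print(f"n = {n:5d}: log2(Vol)/n = {normalized_log_volume(n, r):.4f}, "
          f"H_2(p) = {binary_entropy(p):.4f}")
```

The normalized logarithm stays below H_2(p) (part 1) and the gap shrinks as n grows (part 2).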
Volume of Hamming balls
Vol(B(pn)) ≤ 2^{H_2(p)n}?
Consider (an obvious) equality 1 = (p + (1−p))^n.
Thus: 1 ≥ Vol(B(pn))/2^{H_2(p)n}, i.e. Vol(B(pn)) ≤ 2^{H_2(p)n}.
[Adapted from A.Rudra]
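The intermediate steps between the binomial identity and the final inequality are not recoverable from the slide text. One standard way to fill them in, as a sketch assuming that pn is an integer and p < 1/2 (so that p/(1−p) < 1), is:

```latex
\begin{align*}
1 = (p + (1-p))^n
  &= \sum_{i=0}^{n} \binom{n}{i} p^{i} (1-p)^{n-i} \\
  &\ge \sum_{i=0}^{pn} \binom{n}{i} (1-p)^{n} \left(\frac{p}{1-p}\right)^{i} \\
  &\ge \sum_{i=0}^{pn} \binom{n}{i} (1-p)^{n} \left(\frac{p}{1-p}\right)^{pn} \\
  &= \mathrm{Vol}(B(pn)) \, p^{pn} (1-p)^{(1-p)n}
   = \mathrm{Vol}(B(pn)) \, 2^{-H_2(p)\,n}
\end{align*}
```

The first inequality drops the terms with i > pn, the second uses i ≤ pn together with p/(1−p) < 1, and the last equality is just the definition of H_2(p).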
Shannon’s theorem - achievability
The probability to decode y wrongly to one particular m' is vol(B(y,r))/2^n.
The probability to decode y wrongly to any vector is at most vol(B(y,r))/2^{n−k}.
We have chosen r = (p+ε)n and need to have vol(B(y,r))/2^{n−k} → 0 as n → ∞.
We also assume that the rate is R = k/n = 1 − H_2(p) − γ < C = 1 − H_2(p) for some γ > 0.
$P(\text{error \#2}) \le vol(B(y,r))/2^{n-k} \le 2^{H_2(p+\varepsilon)n}/2^{n-k} = 2^{H_2(p+\varepsilon)n}/2^{n-n+nH_2(p)+\gamma n} = 2^{n(H_2(p+\varepsilon)-H_2(p)-\gamma)} \to 0$ as n → ∞,
provided ε is chosen small enough that H_2(p+ε) − H_2(p) < γ.
So, for k/n < 1 − H_2(p), by choosing a sufficiently large n we can get the probability P of incorrect decoding as small as we wish.
[Figure: balls of radius r around E(m) and around y]
Shannon’s theorem - achievability
Are we done? Almost. We have proved that a given m will likely be decoded correctly. What about other messages?
[Adapted from M.Sudan]
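The decay of the wrong-decoding probability can be made concrete with a short computation. The sketch below evaluates the bound 2^{n(H_2(p+ε) − H_2(p) − γ)}, where γ is used here as a name for the gap C − R between capacity and rate (the symbol and the function name are assumptions made for illustration).

```python
import math

def binary_entropy(p):
    """Binary entropy H_2(p) in bits for 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def wrong_decoding_bound(n, p, eps, gamma):
    """Bound 2^(n (H_2(p+eps) - H_2(p) - gamma)) on the probability of event #2."""
    exponent = n * (binary_entropy(p + eps) - binary_entropy(p) - gamma)
    return 2.0 ** exponent

# BSC with p = 0.1, radius r = (p + 0.01) n, rate 0.1 below capacity.
for n in (100, 500, 1000):
    print(f"n = {n:5d}: bound = {wrong_decoding_bound(n, 0.1, 0.01, 0.1):.3e}")
```

If ε is chosen too large, so that H_2(p+ε) − H_2(p) ≥ γ, the exponent is no longer negative and the bound becomes useless; this is why ε has to be small compared with the gap to capacity.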
Volume of Hamming balls
Theorem (volume of Hamming balls):
B(r) – ball in the n-dimensional binary vector space with center at 0 and radius r, i.e. B(r) = {z ∈ {0,1}^n | w(z) ≤ r}.
Let p < 1/2 and H_2(p) = −p log p − (1−p) log (1−p). Then for large enough n:
1. Vol(B(pn)) ≤ 2^{H_2(p)n} (OK)
2. Vol(B(pn)) ≥ 2^{H_2(p)n − o(n)} (still need to prove this)
Here f(n) ∈ o(n) ⟺ for any c > 0 we have f(n) < cn for all large n ⟺ lim_{n→∞} f(n)/n = 0.
Stirling’s formula: $n! \approx \sqrt{2\pi n}\,(n/e)^n$
More precisely: $\sqrt{2\pi n}\,(n/e)^n e^{1/(12n+1)} < n! < \sqrt{2\pi n}\,(n/e)^n e^{1/(12n)}$
[Adapted from A.Rudra]
Volume of Hamming balls [Adapted from A.Rudra] since Thus:
Volume of Hamming balls (a simplified proof) A simpler estimate for n! (can be easily proven by induction): Then:
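The slide's own estimate for n! is not recoverable from the text. As an illustration only, assume the elementary bounds $e (n/e)^n \le n! \le e\,n\,(n/e)^n$ (which can be proven by induction) and that pn is an integer with 1 ≤ pn ≤ n − 1. Then the lower bound on the volume can be sketched as follows:

```latex
\begin{align*}
\mathrm{Vol}(B(pn)) \ge \binom{n}{pn}
  &= \frac{n!}{(pn)!\,((1-p)n)!} \\
  &\ge \frac{e\,(n/e)^{n}}
           {e\,pn\,(pn/e)^{pn} \cdot e\,(1-p)n\,((1-p)n/e)^{(1-p)n}} \\
  &= \frac{1}{e\,p(1-p)\,n^{2}} \cdot p^{-pn} (1-p)^{-(1-p)n} \\
  &= 2^{H_2(p)\,n - \log_2\left(e\,p(1-p)\,n^{2}\right)}
\end{align*}
```

Since $\log_2(e\,p(1-p)\,n^2) = O(\log n)$ is in o(n), this gives Vol(B(pn)) ≥ 2^{H_2(p)n − o(n)}.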
Shannon's theorem – a simpler special case
Theorem: For every BSC with bit error probability p, the channel capacity C = 1 − H_2(p) has the following properties:
1) For any ε > 0 and R < C, for large enough n there exists a code of length n and rate R' ≥ R and a decoding algorithm, such that the maximal probability of block error is ≤ ε.
2) For any R > C there exists ε > 0 such that for any code with rate R' ≥ R the probability of block error is ≥ ε.
#1 - we could construct specific codes that will just do the job. OK.
#2 - we need to prove that whatever code with rate above C we use, it will not work. Still need to prove this.
Shannon’s theorem - unachievability
A draft of a simple proof by M.Sudan. Does it work, and can we fill in the details?
[Adapted from M.Sudan]
Shannon’s theorem - unachievability
Assume we have a code with rate R = k/n > C = 1 − H_2(p). Then R = k/n = 1 − H_2(p) + ε for some ε > 0.
"The usual" approach:
1) Show that with some probability c > 0 (independent of n) the number of errors in a single block transmission will exceed the expected value pn.
2) The probability of having j > pn errors in specific places is p^j (1−p)^{n−j}. Since p < ½, for any i ≤ pn < j we have p^i (1−p)^{n−i} > p^j (1−p)^{n−j}.
3) To achieve arbitrarily small block errors we therefore need to correct almost all Vol(B(pn)) ≥ 2^{H_2(p)n − o(n)} error patterns.
Shannon’s theorem - unachievability
R = k/n = 1 − H_2(p) + ε for some ε > 0.
3) To achieve arbitrarily small block errors we therefore need to correct almost all Vol(B(pn)) ≥ 2^{H_2(p)n − o(n)} error patterns.
4) This applies to 2^k vectors, thus we need to have almost 2^k · 2^{H_2(p)n − o(n)} distinct vectors of length n.
5) This leads to: 2^k · 2^{H_2(p)n − o(n)} ≤ 2^n, i.e. k/n ≤ 1 − H_2(p) + o(n)/n, which for large n contradicts R = 1 − H_2(p) + ε.
Shannon’s theorem - unachievability
[Figure: ball of radius r]
The vector space with 2^n elements should be partitioned into 2^k disjoint parts containing (almost) Vol(B(pn)) ≥ 2^{H_2(p)n − o(n)} elements each. Thus: 2^k · 2^{H_2(p)n − o(n)} ≤ 2^n.
Shannon’s theorem - unachievability
R = k/n = 1 − H_2(p) + ε for some ε > 0.
1) P(number of errors > pn) > c for some c > 0.
2) P(i < pn errors) > P(i > pn errors) (assuming known error bits).
3) For p_B → 0 we must correct almost all Vol(B(pn)) error patterns.
4) Thus we need almost 2^k · Vol(B(pn)) distinct vectors of length n.
5) This gives a contradiction: 2^k · 2^{H_2(p)n − o(n)} ≤ 2^n.
Are all these steps well justified?
There are some difficulties with #1: whilst it is a very well known fact, it is difficult to give a simple proof for it.
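The counting step behind the contradiction can be illustrated numerically. The sketch below assumes that each of the 2^k codewords needs its own disjoint region of about Vol(B(pn)) vectors, and reports by how many bits the total demand k + log_2 Vol(B(pn)) exceeds the n bits available; a positive value means such a partition of {0,1}^n is impossible. The function names and the exact form of the requirement are assumptions made for illustration.

```python
import math

def hamming_ball_volume(n, r):
    """Number of vectors in {0,1}^n of weight at most r."""
    return sum(math.comb(n, i) for i in range(r + 1))

def packing_excess_bits(n, k, p):
    """k + log2 Vol(B(pn)) - n; positive means 2^k such balls cannot be disjoint."""
    r = round(p * n)
    return k + math.log2(hamming_ball_volume(n, r)) - n

# BSC with p = 0.1 has capacity about 0.531.
print(packing_excess_bits(100, 40, 0.1))    # rate 0.40, below capacity
print(packing_excess_bits(100, 60, 0.1))    # rate 0.60, above capacity
print(packing_excess_bits(1000, 550, 0.1))  # rate 0.55, above capacity
```

For rates only slightly above C the excess becomes positive only for larger n, which reflects the o(n) term in the volume bound.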
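Binomial distribution
Let X_i be independent random variables with p(X_i=0) = p and p(X_i=1) = 1 − p. What can we say about the sum X_1 + X_2 + ... + X_n?
Normal distribution
Probability density function: $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
Or, in more general form: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}$
In this case μ is the expected value and σ² the variance.
Central limit theorem (CLT)
X_1, X_2, ..., X_n – independent, identically distributed random variables with expected value μ and variance σ².
S_n = X_1 + X_2 + ... + X_n.
Theorem (Central Limit Theorem): $\lim_{n\to\infty} P\left(\frac{S_n - n\mu}{\sigma\sqrt{n}} \le x\right) = \Phi(x)$, where Φ is the distribution function of the standard normal distribution.
A binomial distribution is just a special case with p(X_i=1) = p, p(X_i=0) = 1 − p, E[S_n] = pn and Var[S_n] = np(1−p).
This gives us P(number of errors > pn) → 1/2 as n → ∞.
This should be familiar from a probability theory course. Short proofs are known, but require mastering some techniques first.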
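The limit P(number of errors > pn) → 1/2 can be checked by computing the exact binomial tail for growing n. This is a sketch with illustrative names; the probabilities are evaluated in log-space via `math.lgamma` so that large n does not cause overflow or underflow.

```python
import math

def binomial_pmf(n, i, p):
    """P(exactly i errors) for 0 < p < 1, computed in log-space for numerical stability."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
               + i * math.log(p) + (n - i) * math.log(1 - p))
    return math.exp(log_pmf)

def prob_more_errors_than(n, p, threshold):
    """P(number of errors > threshold) for an integer threshold."""
    return sum(binomial_pmf(n, i, p) for i in range(threshold + 1, n + 1))

p = 0.1
for n in (10, 100, 1000, 10000):
    # n // 10 equals the expected number of errors pn for p = 0.1.
    print(f"n = {n:6d}: P(errors > pn) = {prob_more_errors_than(n, p, n // 10):.4f}")
```

The values approach 1/2 from below, so for large n a constant c slightly below 1/2 works in step #1 of the unachievability argument.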
Direct computation of median
The median c_{n,p} is defined as follows: c_{n,p} = min{c = 0…n | B_{n,c}(p) > ½}, where B_{n,c}(p) denotes the probability of at most c successes in n trials.
It turns out that ⌊np⌋ ≤ c_{n,p} ≤ ⌈np⌉.
[Adapted from R.Gob]
Direct computation of median
We also need just a few inequalities:
[Adapted from R.Gob]
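The definition of the median can be evaluated directly for small n. The sketch below assumes that B_{n,c}(p) is the binomial distribution function (probability of at most c successes) and that the bounds in question are ⌊np⌋ and ⌈np⌉; it uses exact rational arithmetic so that the comparison with ½ is not affected by rounding. Names are illustrative.

```python
from fractions import Fraction
from math import ceil, comb, floor

def binomial_cdf(n, c, p):
    """Exact probability of at most c successes in n trials (p given as a Fraction)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(c + 1))

def binomial_median(n, p):
    """Smallest c in 0..n with binomial_cdf(n, c, p) > 1/2."""
    for c in range(n + 1):
        if binomial_cdf(n, c, p) > Fraction(1, 2):
            return c

# Check the floor/ceiling bounds on a small grid of parameters.
for n in (5, 10, 20):
    for p in (Fraction(1, 10), Fraction(1, 3), Fraction(1, 2)):
        c = binomial_median(n, p)
        print(n, p, floor(n * p), c, ceil(n * p))
```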
Unachievability – the lengths of codes
To prove that for codes with R = k/n > C = 1 − H_2(p) an arbitrarily small error probability cannot be achieved, we used the assumption that we can take the length n of the code to be as large as we wish.
This is apparently OK if we have to prove the existence of codes; however, have we shown that there are no good codes with R > C if n is small?
Formally not yet. However, provided a "good" code with R > C and length n exists, just by using this code repeatedly we can obtain a code with the same rate for any length sn. Thus, if there are no "good" codes of large length, there cannot be "good" codes of any length n.