A Bit of Information Theory Unsupervised Learning Working Group Assaf Oron, Oct Based mostly upon: Cover & Thomas, “Elements of Inf. Theory”, 1991
Contents
Coding and Transmitting Information
Entropy etc.
Information Theory and Statistics
Information Theory and “Machine Learning”
What is Coding? (1)
We keep coding all the time.
Crucial requirement for coding: “source” and “receiver” agree on the key.
Modern coding: telegraph -> radio -> …
–Practical problem: how efficient can we make it? Tackled from the 1920s on.
–1940s: Claude Shannon
What is Coding? (2)
Shannon’s greatness: solving the “specific” problem by working on the “general” problem.
Namely: how does one quantify information, its coding, and its transmission?
–ANY type of information
Some Day-to-Day Codes
Code              | “Channel”                               | Unique? Instant?
Spoken language   | Sounds via air                          | Well…
Written language  | Signs on paper/screen                   | Well…
Numbers and math  | Signs on paper/screen, electronic, etc. | Usually (decimal point, operation signs, etc.)
DNA protein code  | Nucleotide pairs                        | Yes (start, end, 3-somes)
Information Complexity of Some Coded Messages
Let’s think written numbers:
–k digits → 10^k possible messages
How about written English?
–k letters → 26^k possible messages
–k words → D^k possible messages, where D is English dictionary size
∴ Length ~ log(complexity)
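A minimal sketch of the "length ~ log(complexity)" point. The helper name `symbols_needed` and the choice of one million messages are illustrative assumptions, not part of the slides. It counts how many symbols a fixed-length code needs to tell N messages apart, and compares that with log N in the matching base.

```python
import math

def symbols_needed(num_messages: int, alphabet_size: int) -> int:
    """Smallest k such that alphabet_size**k >= num_messages."""
    k = 0
    capacity = 1  # number of distinct messages expressible with k symbols
    while capacity < num_messages:
        capacity *= alphabet_size
        k += 1
    return k

n = 10**6  # illustrative number of distinct messages
for name, size in [("decimal digits", 10), ("letters", 26), ("bits", 2)]:
    k = symbols_needed(n, size)
    # log of the message count, in the base of the alphabet
    print(f"{name}: {k} symbols, log_{size}(N) = {math.log(n, size):.2f}")
```

The symbol count is the logarithm rounded up, so changing the alphabet only changes the base of the log.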
Information Entropy
The minimum expected length (in bits) of a binary message conveying x-type information
–other common descriptions: “code complexity”, “uncertainty”, “missing/required information”, “expected surprise”, “information content” (BAD), etc.
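A small sketch of the quantity described above, assuming the standard definition from Cover & Thomas, H(X) = -sum_x p(x) log2 p(x). The function name `entropy_bits` and the example distributions are illustrative choices.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

examples = {
    "fair coin": [0.5, 0.5],
    "biased coin": [0.9, 0.1],
    "fair die": [1 / 6] * 6,
    "certain outcome": [1.0],
}
for name, dist in examples.items():
    print(f"{name}: H = {entropy_bits(dist):.3f} bits")
```

The fair coin needs a full bit per toss, the biased coin needs less on average, and a certain outcome needs nothing: this is the "expected surprise" reading.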
Why “Entropy”?
Thermodynamics (mid-19th century): “amount of un-usable heat in system”
Statistical Physics (end of 19th century): “log (complexity of current system state)”
–⇉ amount of “mess” in the system
–The two were proven to be equivalent
–Statistical entropy is proportional to information entropy if p(x) is uniform
2nd Law of Thermodynamics…
–Entropy never decreases (more later)
Entropy Properties, Examples.
Kullback-Leibler Divergence (“Relative Entropy”) In words: “the excess message length needed to use p(x)-optimized code for messages based on q(x)” Properties, Relation to H:
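To make the "excess message length" reading concrete, here is a hedged sketch. To sidestep the p/q naming, the arguments are called `source` (the distribution the messages actually follow) and `code_dist` (the distribution the code was optimized for). Both names and the example numbers are assumptions for illustration. The standard identities are assumed: expected length with the mismatched code is the cross-entropy, and D = cross-entropy - entropy.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def cross_entropy_bits(source, code_dist):
    """Expected code length when messages follow `source` but the
    code lengths are -log2(code_dist), i.e. optimized for `code_dist`."""
    return -sum(s * math.log2(c) for s, c in zip(source, code_dist) if s > 0)

def kl_divergence_bits(source, code_dist):
    """Relative entropy D(source || code_dist) in bits."""
    return sum(s * math.log2(s / c) for s, c in zip(source, code_dist) if s > 0)

source = [0.5, 0.25, 0.125, 0.125]   # how messages are really distributed
code_dist = [0.25, 0.25, 0.25, 0.25] # what the code was designed for

print("optimal length :", entropy_bits(source))
print("actual length  :", cross_entropy_bits(source, code_dist))
print("excess (K-L D) :", kl_divergence_bits(source, code_dist))
```

Here the mismatched code costs 2 bits per message instead of 1.75, and the 0.25-bit gap is exactly D. D is never negative and is zero when the two distributions coincide.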
Mutual Information Relationship to D,H (hint: cond. Prob.) : Properties, Examples:
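A sketch of the relationships hinted at above, assuming the standard definitions: I(X;Y) is the K-L divergence between the joint p(x,y) and the product p(x)p(y), and it also equals H(X) + H(Y) - H(X,Y). The joint tables are made-up examples.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginals(joint):
    """Row and column sums of a joint table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py

def mutual_information_bits(joint):
    """I(X;Y) as D( p(x,y) || p(x)p(y) ), in bits."""
    px, py = marginals(joint)
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

def mutual_information_via_entropies(joint):
    """Same quantity as H(X) + H(Y) - H(X,Y)."""
    px, py = marginals(joint)
    flat = [pxy for row in joint for pxy in row]
    return entropy_bits(px) + entropy_bits(py) - entropy_bits(flat)

independent = [[0.25, 0.25], [0.25, 0.25]]
identical = [[0.5, 0.0], [0.0, 0.5]]
noisy = [[0.4, 0.1], [0.1, 0.4]]
for name, joint in [("independent", independent), ("identical", identical), ("noisy", noisy)]:
    print(name, round(mutual_information_bits(joint), 4))
```

Independent variables share no information, identical fair bits share one full bit, and the noisy case falls in between.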
Entropy for Continuous RV’s
“Little” h, defined in the “natural” way (the sum over p(x) becomes an integral over the density)
However, it is not the same measure:
–h of a discrete RV is −∞ (its support has zero volume), and H of a continuous RV is infinite (measure theory…)
For many continuous distributions, h is (1/2) log(variance) plus some constant
–Why?
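One way to look at the "Why?" is through scaling: stretching X by a factor a adds log|a| to h and multiplies the variance by a^2. The sketch below uses the standard closed forms (in nats) for three families and checks that h - (1/2) log(variance) does not depend on the scale. The function names and scale values are illustrative assumptions.

```python
import math

def h_normal(sigma):
    """Differential entropy of N(mu, sigma^2), in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)

def h_uniform(width):
    """Differential entropy of a uniform distribution on an interval of given width."""
    return math.log(width)

def h_exponential(rate):
    """Differential entropy of an exponential distribution with the given rate."""
    return 1.0 - math.log(rate)

def offset(h, variance):
    """The family-specific constant: h minus half the log-variance."""
    return h - 0.5 * math.log(variance)

for scale in (0.5, 2.0, 10.0):
    print(
        f"scale={scale}: "
        f"normal {offset(h_normal(scale), scale**2):.4f}, "
        f"uniform {offset(h_uniform(scale), scale**2 / 12):.4f}, "
        f"exponential {offset(h_exponential(1 / scale), scale**2):.4f}"
    )
```

Each column stays constant as the scale changes, and the normal has the largest constant of the three.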
The Statistical Connection (1)
K-L D ⇔ expected log-likelihood ratio
Law of large numbers can be rephrased as a limit on D
For distributions with the same variance, the normal is the one with maximum h.
–(2nd law of thermodynamics revisited)
–h is an average quantity. Is the CLT, then, a “law of nature”?… (I think: “YES”!)
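A rough simulation of the K-L / likelihood-ratio link, under assumptions chosen for illustration (Bernoulli data, p = 0.7 vs q = 0.5, a fixed seed). The average log-likelihood ratio of samples drawn from p settles, by the law of large numbers, at D(p || q).

```python
import math
import random

def kl_bernoulli_nats(p, q):
    """D( Bernoulli(p) || Bernoulli(q) ) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def mean_log_likelihood_ratio(p, q, n, seed=0):
    """Average of log[ p(x) / q(x) ] over n samples drawn from Bernoulli(p)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(n):
        x = 1 if rng.random() < p else 0
        total += math.log(p / q) if x else math.log((1 - p) / (1 - q))
    return total / n

p, q = 0.7, 0.5
print("K-L divergence      :", kl_bernoulli_nats(p, q))
for n in (100, 10_000, 100_000):
    print(f"mean log-LR, n={n:>6}:", mean_log_likelihood_ratio(p, q, n))
```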
The Statistical Connection (2)
Mutual information is very useful
–Certainly for discrete RV’s
–Also for continuous (no dist. assumptions!)
A lot of implications for stochastic processes, as well
–I just don’t quite understand them
–English?
Machine Learning? (1)
So far, we haven’t mentioned noise
–In inf. theory, noise exists in the channel
–Channel capacity: max(mutual information) between “source” and “receiver”, maximized over source distributions
–Noise directly decreases the capacity
Shannon’s “biggest” result: transmission at rates up to the capacity can be (almost) achieved with (almost) zero error
–Known as the “Channel Coding Theorem”
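As a hedged illustration of "noise decreases capacity", take the binary symmetric channel, a textbook example the slides do not name: each bit is flipped with probability `flip_prob`. Its capacity is the standard 1 - H(flip_prob) bits per use. The grid search over input distributions is only there to show capacity as a maximum of mutual information.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin with bias p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mutual_information(input_p1, flip_prob):
    """I(X;Y) = H(Y) - H(Y|X) for a binary symmetric channel,
    where input_p1 is the probability of sending a 1."""
    output_p1 = input_p1 * (1 - flip_prob) + (1 - input_p1) * flip_prob
    return binary_entropy(output_p1) - binary_entropy(flip_prob)

def bsc_capacity(flip_prob):
    """Closed-form capacity of the binary symmetric channel, in bits per use."""
    return 1.0 - binary_entropy(flip_prob)

for flip_prob in (0.0, 0.01, 0.1, 0.25, 0.5):
    # crude maximization over the source distribution on a grid of 101 points
    searched = max(bsc_mutual_information(a / 100, flip_prob) for a in range(101))
    print(f"flip={flip_prob}: capacity {bsc_capacity(flip_prob):.4f}, grid max {searched:.4f}")
```

A noiseless channel carries one bit per use; at a flip probability of 0.5 the output says nothing about the input and the capacity drops to zero.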
Machine Learning? (2) The CCT inspired practical developments –Now it all depends on code and channel! –Smarter, “error-correcting” codes –Tech developments focus on channel capacity
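For a feel of what an "error-correcting" code buys, here is a sketch of the simplest one: a repetition code with majority-vote decoding. It is not taken from the slides. It assumes a channel that flips each bit independently with probability `flip_prob`, and an odd number of repeats so the vote cannot tie.

```python
import math

def majority_error_prob(repeats, flip_prob):
    """Probability that majority-vote decoding fails when one bit is sent
    `repeats` times (odd) and each copy is flipped independently."""
    return sum(
        math.comb(repeats, k) * flip_prob**k * (1 - flip_prob) ** (repeats - k)
        for k in range(repeats // 2 + 1, repeats + 1)  # more than half flipped
    )

flip_prob = 0.1
for repeats in (1, 3, 5, 11):
    rate = 1 / repeats  # information bits per transmitted bit
    print(f"repeats={repeats:>2}: rate {rate:.3f}, error {majority_error_prob(repeats, flip_prob):.6f}")
```

The error shrinks, but only because the rate heads toward zero. The Channel Coding Theorem says smarter codes can push the error down while keeping the rate close to the capacity.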
Machine Learning? (3)
Can you find an analogy between coding and classification/clustering? (Can it be useful??)
Coding               | M. Learning
Source entropy       | Variability of interest
Choice of channel    | Parameterization
Choice of code       | Classification rules
Channel noise        | “Noise”, random errors
Channel capacity     | Maximum accuracy
I(source, receiver)  | Actual accuracy
Machine Learning? (4)
Inf. Theory tells us that:
–We CAN find a nearly optimal classification or clustering rule (“coding”)
–We CAN find a nearly optimal parameterization+classification combo
–Perhaps the newer wave of successful, but statistically “intractable” methods (boosting etc.) works by increasing channel capacity (i.e., high-dim parameterization)?